Abrupt computer restart - Tasks stuck - Kernel not found

Message boards : Number crunching : Abrupt computer restart - Tasks stuck - Kernel not found
Jacob Klein

Message 33274 - Posted: 30 Sep 2013, 4:09:25 UTC
Last modified: 30 Sep 2013, 4:11:43 UTC

I recently had a power outage here; the computer lost power while it was working on BOINC.

When I turned the computer on again, the 2 Long-run GPUGrid tasks got stuck in a continual "driver reset" loop: I was getting continuous Windows balloons saying the driver had successfully recovered, over and over. I looked at the stderr.txt file in the slot directory and remember seeing:
Kernel not found# SWAN swan_assert 0
... over and over, with each retry to start the task.

The only way I could see to get out of the loop was to abort the work units, so I did.
The tasks are below.
Curiously, there was also a beta task I had worked on (which errored out and was reported well before the power outage) that also said:
Kernel not found# SWAN swan_assert 0

1) Why did the full stderr.txt not get included in my aborted task logs?
2) Why did the app continually try to restart this unresumable situation?
3) Was the error in the beta task intentionally set (to test the retry logic)?


Thanks,
Jacob

Name	I66R8-NATHAN_KIDKIXc22_6-9-50-RND7714_1
Workunit	4795185
Created	29 Sep 2013 | 9:39:42 UTC
Sent	29 Sep 2013 | 9:56:59 UTC
Received	30 Sep 2013 | 4:01:08 UTC
Server state	Over
Outcome	Computation error
Client state	Aborted by user
Exit status	203 (0xcb) EXIT_ABORTED_VIA_GUI
Computer ID	153764
Report deadline	4 Oct 2013 | 9:56:59 UTC
Run time	48,589.21
CPU time	48,108.94
Validate state	Invalid
Credit	0.00
Application version	Long runs (8-12 hours on fastest card) v8.14 (cuda55)
Stderr output

<core_client_version>7.2.16</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
]]>


Name	17x6-SANTI_RAP74wtCUBIC-13-34-RND0681_0
Workunit	4807187
Created	29 Sep 2013 | 13:06:23 UTC
Sent	29 Sep 2013 | 17:32:54 UTC
Received	30 Sep 2013 | 4:01:08 UTC
Server state	Over
Outcome	Computation error
Client state	Aborted by user
Exit status	203 (0xcb) EXIT_ABORTED_VIA_GUI
Computer ID	153764
Report deadline	4 Oct 2013 | 17:32:54 UTC
Run time	17,822.88
CPU time	3,669.02
Validate state	Invalid
Credit	0.00
Application version	Long runs (8-12 hours on fastest card) v8.14 (cuda55)
Stderr output

<core_client_version>7.2.16</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
]]>


Name	112-MJHARVEY_CRASH3-14-25-RND0090_2
Workunit	4807215
Created	29 Sep 2013 | 17:32:12 UTC
Sent	29 Sep 2013 | 17:32:54 UTC
Received	29 Sep 2013 | 19:04:42 UTC
Server state	Over
Outcome	Computation error
Client state	Compute error
Exit status	-226 (0xffffffffffffff1e) ERR_TOO_MANY_EXITS
Computer ID	153764
Report deadline	4 Oct 2013 | 17:32:54 UTC
Run time	4,020.13
CPU time	1,062.94
Validate state	Invalid
Credit	0.00
Application version	ACEMD beta version v8.14 (cuda55)

Stderr output

<core_client_version>7.2.16</core_client_version>
<![CDATA[
<message>
too many exit(0)s
</message>
<stderr_txt>
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1	:
#	Name		: GeForce GTX 460
#	ECC		: Disabled
#	Global mem	: 1024MB
#	Capability	: 2.1
#	PCI ID		: 0000:08:00.0
#	Device clock	: 1526MHz
#	Memory clock	: 1900MHz
#	Memory width	: 256bit
#	Driver version	: r325_00 : 32723
# GPU 0 : 68C
# GPU 1 : 61C
# GPU 2 : 83C
# GPU 1 : 63C
# GPU 1 : 64C
# GPU 1 : 65C
# GPU 1 : 66C
# GPU 1 : 67C
# GPU 1 : 68C
# GPU 0 : 69C
# GPU 1 : 69C
# GPU 1 : 70C
# GPU 0 : 70C
# GPU 1 : 71C
# GPU 0 : 71C
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1	:
#	Name		: GeForce GTX 460
#	ECC		: Disabled
#	Global mem	: 1024MB
#	Capability	: 2.1
#	PCI ID		: 0000:08:00.0
#	Device clock	: 1526MHz
#	Memory clock	: 1900MHz
#	Memory width	: 256bit
#	Driver version	: r325_00 : 32723
Kernel not found# SWAN swan_assert 0
14:56:38 (1696): Can't acquire lockfile (32) - waiting 35s
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1	:
#	Name		: GeForce GTX 460
#	ECC		: Disabled
#	Global mem	: 1024MB
#	Capability	: 2.1
#	PCI ID		: 0000:08:00.0
#	Device clock	: 1526MHz
#	Memory clock	: 1900MHz
#	Memory width	: 256bit
#	Driver version	: r325_00 : 32723
Kernel not found# SWAN swan_assert 0
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1	:
#	Name		: GeForce GTX 460
#	ECC		: Disabled
#	Global mem	: 1024MB
#	Capability	: 2.1
#	PCI ID		: 0000:08:00.0
#	Device clock	: 1526MHz
#	Memory clock	: 1900MHz
#	Memory width	: 256bit
#	Driver version	: r325_00 : 32723
Kernel not found# SWAN swan_assert 0
...
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1	:
#	Name		: GeForce GTX 460
#	ECC		: Disabled
#	Global mem	: 1024MB
#	Capability	: 2.1
#	PCI ID		: 0000:08:00.0
#	Device clock	: 1526MHz
#	Memory clock	: 1900MHz
#	Memory width	: 256bit
#	Driver version	: r325_00 : 32723
Kernel not found# SWAN swan_assert 0

</stderr_txt>
]]>
Richard Haselgrove

Message 33277 - Posted: 30 Sep 2013, 10:31:53 UTC - in response to Message 33274.  

Interesting that you and Matt - no, not Matt Harvey, the guy in GPUGrid Start Up/Recovery Issues - should both post about similar issues on the same day.

I've also had the problem of the continual "driver reset" loop after an abnormal shutdown, also mostly with NATHAN_KIDKIXc22 tasks. The problem appears to be a failure to restart the tasks from a (possibly damaged or corrupt) checkpoint file - maybe the project team could look into that?

My workaround has been to restart Windows in safe mode (which prevents BOINC from loading) and edit client_state.xml to add the line

    <suspended_via_gui/>

to the <result> block for the suspect task.

As the name suggests, that's the same as clicking 'Suspend' for the task while BOINC is running, and it gets control of the machine back so you can investigate on the next normal restart. By convention, the line goes just under <plan_class> in client_state.xml, but I think anywhere at the first indent level will do.
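
For reference, here is a cut-down sketch of what the edited <result> block might look like. Apart from <suspended_via_gui/> and <plan_class>, the surrounding field names - and the guess that the plan class is "cuda55", taken from the app version strings above - are illustrative assumptions, not copied from a real client_state.xml:

    <result>
        <name>I66R8-NATHAN_KIDKIXc22_6-9-50-RND7714_1</name>
        <!-- ...other result fields left untouched... -->
        <plan_class>cuda55</plan_class>  <!-- assumed plan class, per the "(cuda55)" app version above -->
        <suspended_via_gui/>             <!-- the added line: the task comes up suspended on the next normal boot -->
        <!-- ...remaining result fields... -->
    </result>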

Interesting point about stderr.txt - I hadn't looked that far into it.

The process for stderr is:

1. It gets written as a file (stderr.txt) in the slot directory.
2. On task completion, the contents of that file are copied into the same <result> block in client_state.xml.
3. The <result> data is copied into a sched_request file for the project's server.
4. The scheduler's result handler copies it into the database for display on the web.

So, which of those gets skipped if a task gets aborted? Next time it happens, I'll follow the process through and see where it goes missing. Either way, it's probably a BOINC problem, and I agree it would be better if partial information were available for aborted tasks. You and I both know where and how to get that changed once we've narrowed down the problem ;)
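
For illustration, a minimal sketch of what step 2 would leave behind in client_state.xml, assuming the usual <stderr_out> field inside the <result> block; the exact layout here is an assumption, not taken from this host's file:

    <result>
        <name>112-MJHARVEY_CRASH3-14-25-RND0090_2</name>
        <!-- ...other result fields... -->
        <stderr_out>
<![CDATA[
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
Kernel not found# SWAN swan_assert 0
]]>
        </stderr_out>
    </result>

If the abort happens before that copy is made, steps 3 and 4 would have nothing to forward, which would match the empty stderr on the aborted tasks above.
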
Betting Slip

Message 33278 - Posted: 30 Sep 2013, 10:42:22 UTC - in response to Message 33277.  
Last modified: 30 Sep 2013, 10:56:32 UTC

I have the same problem with Nathan units on a GTX 460, but I didn't have any power outages.


ADDED

I wish they would beta test full-size WUs before releasing them on an unsuspecting public. It's little wonder there's only a small band of hardcore GPUGrid crunchers: it's just too much hassle for the ordinary user and causes too many problems. People join and quickly leave... a shame, because a little more full-scale beta testing would catch these problems.
Radio Caroline, the world's most famous offshore pirate radio station.
Great music since April 1964. Support Radio Caroline Team -
Radio Caroline
Betting Slip

Message 33279 - Posted: 30 Sep 2013, 11:03:35 UTC

Also had my GTX 660 Ti throw a wobbly on a Noelia WU here:

http://www.gpugrid.net/result.php?resultid=7310174
Operator

Message 33284 - Posted: 30 Sep 2013, 13:18:27 UTC - in response to Message 33274.  

I recently had a power outage here; the computer lost power while it was working on BOINC. When I turned the computer on again, the 2 Long-run GPUGrid tasks got stuck in a continual "driver reset" loop ... The only way I could see to get out of the loop was to abort the work units. So I did.



Jacob, this has been my life with my GTX 590 box for the last month.

I usually just end up resetting the whole project because the apps will not continue. It may run for a day or two, or it may run for only two hours before a BSOD.

I'm fighting the nvlddmkm.sys thing right now and will probably end up reinstalling as a last-ditch effort. This system does not normally crash unless BOINC is running GPUGrid WUs. It is not overclocked and is water-cooled. All timings and specs are as delivered from the Dell factory for this T7500.

But yeah... I completely understand what you're going through.

Operator
Jacob Klein

Message 33327 - Posted: 2 Oct 2013, 10:32:48 UTC - in response to Message 33274.  

MJH:
Can you try to reproduce the problem I describe in the first post, and fix it?
Operator

Message 33337 - Posted: 3 Oct 2013, 0:43:30 UTC

I did reinstall the OS on my GTX 590 box and have not installed any updates.

I am using driver 314.22 right now, and it has been running for two days without any errors at all.

I am now convinced that my problems were caused by a third-party software issue, or possibly by the Microsoft WDDM 1.1/1.2/1.3 update package.

I'm using Win7 x64, so I really don't think I need the update to the Windows display driver model for Win 8 / 8.1 compatibility if I'm not using those OSes.

Regardless, it's working now!

Operator
wiyosaya

Message 33367 - Posted: 5 Oct 2013, 23:23:06 UTC

I had the same problem with this WU on my GTX 580 machine - http://www.gpugrid.net/workunit.php?wuid=4819239
Only I did not have a power outage.

My symptoms were finding my computer frozen, with no choice other than to hit the reset switch. When the computer came back up, I kept getting Windows balloons saying that there were driver problems and that the driver had failed to start, followed by blue screens.

I booted into safe mode, then downloaded and installed the latest WHQL NVidia driver. I rebooted and got exactly the same thing again. I figured it was the GPUGrid WU, so I booted into safe mode again, brought up BOINC Manager, and aborted the task. Now my computer comes back up and runs; however, I got a computation error on this WU - http://www.gpugrid.net/workunit.php?wuid=4820870 - which also caused a blue screen.

I've set my GPUGrid project to not get new tasks for the time being.

Interestingly enough, my GTX 460 machine seems to be having no problems at the moment.
Jacob Klein

Message 33368 - Posted: 5 Oct 2013, 23:36:07 UTC - in response to Message 33367.  

MJH: Any response?
I tried to provide as much detail as possible.
MJH

Message 33369 - Posted: 6 Oct 2013, 8:49:51 UTC - in response to Message 33368.  

Jacob,

Next time this happens, please email me the contents of the slot directory and the task files.

Mjh
Jacob Klein

Message 33373 - Posted: 6 Oct 2013, 12:52:45 UTC - in response to Message 33369.  

Ha! Considering it seems like it should be easy to reproduce (turn the PC off at the switch, rather than via a normal shutdown, in the middle of a GPUGrid task)... Challenge accepted.
Jacob Klein

Message 33374 - Posted: 6 Oct 2013, 13:09:33 UTC - in response to Message 33373.  

MJH:

If I'm able to reproduce the issue, where should I email the requested files? Can you please PM me your email address?

Also... For my first test, the issue did not occur on my Long-run SANTI-baxbim tasks. I wonder if it is task-type-specific? I'll try to test an "abrupt computer restart" against other task types.
Jacob Klein

Message 33375 - Posted: 6 Oct 2013, 13:26:37 UTC - in response to Message 33374.  
Last modified: 6 Oct 2013, 13:28:13 UTC

MJH:

I have been able to reproduce the problem with a SANTI_MAR422dim task.
Can you please PM me your email address?

Thanks,
Jacob
Jacob Klein

Message 33376 - Posted: 6 Oct 2013, 16:37:58 UTC - in response to Message 33375.  

Matt,
I have received your PM, and have sent you the files.
Please let me know if you need anything or find anything!

Thanks,
Jacob
wiyosaya

Message 33377 - Posted: 6 Oct 2013, 16:47:48 UTC - in response to Message 33374.  
Last modified: 6 Oct 2013, 17:00:03 UTC

Also... For my first test, the issue did not occur on my Long-run SANTI-baxbim tasks. I wonder if it is task-type-specific? I'll try to test an "abrupt computer restart" against other task types.

FWIW - my GTX 460 machine finished the task that I posted about. Although it took longer than 24 hours, it was a SANTI-baxbim task - http://www.gpugrid.net/workunit.php?wuid=4818983

Also, I have to say that I somewhat agree with the above post about people who run this project really needing to know what they are doing. I'm a software developer / computer scientist by trade, and I build my own PCs when I need them.

One reason I think many people might leave this project is that a WU takes so long to run and must be returned within 5 days. Some people might turn their machines off, and thus would not be able to return the WU in 5 days. Personally, I only run this project on weekends.

In general, I have found this project to be relatively stable, with this perhaps being the only serious fault I have encountered so far. However, when faults like this do arise, it almost certainly takes a skilled user to get out of the resulting situation.

Unfortunately, though, this and other similar projects are, at least as I see it, on the bleeding edge. As in my job, where the software I work with (a custom FEA program) is also on the bleeding edge, it is sometimes extraordinarily difficult to catch a bug like this, since it only occurs under limited circumstances that may not be exercised in testing unless, in this case, the PC is shut down abnormally.

Just my $0.02.
Betting Slip

Message 33391 - Posted: 7 Oct 2013, 12:49:50 UTC - in response to Message 33377.  
Last modified: 7 Oct 2013, 12:57:19 UTC

One reason I think many people might leave this project is that a WU takes so long to run and must be returned within 5 days. Some people might turn their machines off, and thus would not be able to return the WU in 5 days. Personally, I only run this project on weekends.

That's another thing that is a "trap" and confusing. While the deadline is 5 days, if you don't return the WU within 2 days it is resent to another host, and if that host returns first (which is likely), your computing time has been wasted even if you do return a result, because the first valid result returned is the canonical one and yours is binned.
Retvari Zoltan

Message 33392 - Posted: 7 Oct 2013, 12:57:18 UTC - in response to Message 33391.  

That's another thing that is a "trap" and confusing. While the deadline is 5 days, if you don't return the WU within 2 days it is resent to another host, and if that host returns first (which is likely), your computing time has been wasted even if you do return a result, because the first valid result returned is the canonical one and yours is binned.

The 2-day resend was discontinued long ago.
Betting Slip

Message 33393 - Posted: 7 Oct 2013, 12:59:48 UTC - in response to Message 33392.  

The 2-day resend was discontinued long ago.


I type corrected :-)


Jacob Klein

Message 33394 - Posted: 7 Oct 2013, 13:05:11 UTC

Alright, back on topic here...
I'm waiting for MJH to analyze the files that I sent him.
Operator

Message 33397 - Posted: 7 Oct 2013, 17:22:31 UTC - in response to Message 33394.  
Last modified: 7 Oct 2013, 17:23:33 UTC

Alright, back on topic here...
I'm waiting for MJH to analyze the files that I sent him.



Jacob;

Are you talking about how, when one of the GPUs TDRs, it screws up all the tasks running on the other GPUs as well?

That happens to me on my GTX 590 box all the time (mostly after power outages). If one WU messes up and ends up causing a TDR, or a complete dump and reboot, then when I start BOINC again all the remaining WUs in progress on the other GPUs cause more TDRs unless I abort them.

Sometimes even that doesn't help and I have to completely reset the project.

Example: I had a TDR the other day. Three WUs were uploading at the time; only one was actually processing. Fine. So I reboot, catch BOINC before it starts processing the problem WU, and suspend processing so the three that did complete can upload for credit.

Now I abort the problem WU and let the system download 4 new WUs.

As soon as processing starts - wham! - another TDR.

So I do a reset of the project, and 4 more WUs download and start processing without any problem at all.

So the point is, unless I reset the project when I get a TDR, I'm just wasting my time downloading new WUs, because they will all keep crashing until I do a complete reset.

So I'm not sure which leftover file in the BOINC or GPUGrid project folder(s) is causing the TDRs after the original event.

Is that the same issue you are talking about here, or am I way off?

Operator