Abrupt computer restart - Tasks stuck - Kernel not found
Joined: 11 Oct 08 | Posts: 1127 | Credit: 1,901,927,545 | RAC: 0
I recently had a power outage here; the computer lost power while it was working on BOINC. When I turned the computer on again, the two Long-run GPUGrid tasks got stuck in a continual "driver reset" loop, with continuous Windows balloons saying the driver had successfully recovered, over and over. I looked at the stderr.txt file in the slots directory, and remember seeing:

Kernel not found# SWAN swan_assert 0

over and over, once for each retry of the task. The only way I could see to get out of the loop was to abort the work units, so I did. The tasks are below. Curiously, there was also a beta task that I had worked on (which errored out and was reported well before the power outage) where it also said:

Kernel not found# SWAN swan_assert 0

1) Why was the full stderr.txt not included in my aborted task logs?
2) Why did the app continually try to restart this unresumable situation?
3) Was the error in the beta task intentionally set (to test the retry logic)?

Thanks,
Jacob

Name: I66R8-NATHAN_KIDKIXc22_6-9-50-RND7714_1
Workunit: 4795185
Created: 29 Sep 2013 | 9:39:42 UTC
Sent: 29 Sep 2013 | 9:56:59 UTC
Received: 30 Sep 2013 | 4:01:08 UTC
Server state: Over
Outcome: Computation error
Client state: Aborted by user
Exit status: 203 (0xcb) EXIT_ABORTED_VIA_GUI
Computer ID: 153764
Report deadline: 4 Oct 2013 | 9:56:59 UTC
Run time: 48,589.21
CPU time: 48,108.94
Validate state: Invalid
Credit: 0.00
Application version: Long runs (8-12 hours on fastest card) v8.14 (cuda55)
Stderr output:
<core_client_version>7.2.16</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
]]>

Name: 17x6-SANTI_RAP74wtCUBIC-13-34-RND0681_0
Workunit: 4807187
Created: 29 Sep 2013 | 13:06:23 UTC
Sent: 29 Sep 2013 | 17:32:54 UTC
Received: 30 Sep 2013 | 4:01:08 UTC
Server state: Over
Outcome: Computation error
Client state: Aborted by user
Exit status: 203 (0xcb) EXIT_ABORTED_VIA_GUI
Computer ID: 153764
Report deadline: 4 Oct 2013 | 17:32:54 UTC
Run time: 17,822.88
CPU time: 3,669.02
Validate state: Invalid
Credit: 0.00
Application version: Long runs (8-12 hours on fastest card) v8.14 (cuda55)
Stderr output:
<core_client_version>7.2.16</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
]]>

Name: 112-MJHARVEY_CRASH3-14-25-RND0090_2
Workunit: 4807215
Created: 29 Sep 2013 | 17:32:12 UTC
Sent: 29 Sep 2013 | 17:32:54 UTC
Received: 29 Sep 2013 | 19:04:42 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: -226 (0xffffffffffffff1e) ERR_TOO_MANY_EXITS
Computer ID: 153764
Report deadline: 4 Oct 2013 | 17:32:54 UTC
Run time: 4,020.13
CPU time: 1,062.94
Validate state: Invalid
Credit: 0.00
Application version: ACEMD beta version v8.14 (cuda55)
Stderr output:
<core_client_version>7.2.16</core_client_version>
<![CDATA[
<message>
too many exit(0)s
</message>
<stderr_txt>
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
# GPU 0 : 68C
# GPU 1 : 61C
# GPU 2 : 83C
# GPU 1 : 63C
# GPU 1 : 64C
# GPU 1 : 65C
# GPU 1 : 66C
# GPU 1 : 67C
# GPU 1 : 68C
# GPU 0 : 69C
# GPU 1 : 69C
# GPU 1 : 70C
# GPU 0 : 70C
# GPU 1 : 71C
# GPU 0 : 71C
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
Kernel not found# SWAN swan_assert 0
14:56:38 (1696): Can't acquire lockfile (32) - waiting 35s
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
Kernel not found# SWAN swan_assert 0
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
Kernel not found# SWAN swan_assert 0
...
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
Kernel not found# SWAN swan_assert 0
</stderr_txt>
]]>
Joined: 11 Jul 09 | Posts: 1639 | Credit: 10,159,968,649 | RAC: 351
Interesting that you and Matt - no, not Matt Harvey, the guy in GPUGrid Start Up/Recovery Issues - should both post about similar issues on the same day. I've also had the problem of the continual "driver reset" loop after an abnormal shutdown, also mostly with NATHAN_KIDKIXc22 tasks. The problem would appear to be a failure to restart the tasks from a (possibly damaged or corrupt) checkpoint file - maybe the project team could look into that?

My workaround has been to restart Windows in safe mode (which prevents BOINC from loading) and edit client_state.xml to add the line <suspended_via_gui/> to the <result> block for the suspect task (see the sketch below). As the name suggests, that's the same as clicking 'Suspend' for the task while BOINC is running, and it gets control of the machine back so you can investigate on the next normal restart. By convention, the line goes just under <plan_class> in client_state, but I think anywhere at the first indent level will do.

Interesting point about stderr.txt - I hadn't looked that far into it. The process for stderr is:

1. It gets written as a file in the slot directory.
2. On task completion, the contents of the file get copied into that same <result> block in client_state.xml.
3. The <result> data is copied into a sched_request file for the project's server.
4. The scheduler result handler copies it into the database for display on the web.

So, which of those gets skipped if a task gets aborted? Next time it happens, I'll follow the process through and see where it goes missing. Any which way, it's probably a BOINC problem, and I agree it would be better if partial information were available for aborted tasks. You and I both know where and how to get that changed once we've narrowed down the problem ;)
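For anyone wanting to try that workaround, here is a minimal sketch of what the edited <result> block in client_state.xml might look like. The surrounding element names and values are illustrative only (a typical result block, using one of the task names from this thread as an example); the single line being added is <suspended_via_gui/>:

```xml
<!-- Illustrative fragment of client_state.xml; values are examples, not from a real file. -->
<!-- The only line being added is <suspended_via_gui/>. -->
<result>
    <name>I66R8-NATHAN_KIDKIXc22_6-9-50-RND7714_1</name>
    <final_cpu_time>0.000000</final_cpu_time>
    <final_elapsed_time>0.000000</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>2</state>
    <plan_class>cuda55</plan_class>
    <suspended_via_gui/>
    <version_num>814</version_num>
    <!-- remaining elements of the block left unchanged -->
</result>
```

On the next normal boot the task should then show as suspended in BOINC Manager, and you can resume or abort it from there once you've had a look at the slot directory.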
Joined: 5 Jan 09 | Posts: 670 | Credit: 2,498,095,550 | RAC: 0
I have the same problem with Nathan units on a GTX 460, but I didn't have any power outages.

ADDED: I wish they would beta test full-size WUs before releasing them on an unsuspecting public. It's little wonder there's only a very small bunch of hardcore GPUGrid crunchers; it's just too much hassle for the ordinary user and causes too many problems. They join and quickly leave... a shame, because a little more full beta testing would catch these problems.

Radio Caroline, the world's most famous offshore pirate radio station. Great music since April 1964. Support Radio Caroline Team - Radio Caroline
Joined: 5 Jan 09 | Posts: 670 | Credit: 2,498,095,550 | RAC: 0
My GTX 660 Ti also threw a wobbly on a Noelia WU, here: http://www.gpugrid.net/result.php?resultid=7310174

Radio Caroline, the world's most famous offshore pirate radio station. Great music since April 1964. Support Radio Caroline Team - Radio Caroline
Joined: 15 May 11 | Posts: 108 | Credit: 297,176,099 | RAC: 0
> I recently had a power outage here, where the computer lost power while it had been working on BOINC.

Jacob, this has been my life with my GTX 590 box for the last month. I usually just end up resetting the whole project because the apps will not continue. It may run for a day or two, or it may run for only two hours before a BSOD. I'm fighting the nvlddmkm.sys thing right now and will probably end up reinstalling as a last-ditch effort.

This system does not normally crash unless BOINC is running GPUGrid WUs. It is not overclocked and is water cooled. All timings and specs are as they came from the Dell factory for this T7500.

But yeah... I completely understand what you're going through.

Operator
Joined: 11 Oct 08 | Posts: 1127 | Credit: 1,901,927,545 | RAC: 0
MJH: Can you try to reproduce this problem (described in my report in the first post) and fix it?
Joined: 15 May 11 | Posts: 108 | Credit: 297,176,099 | RAC: 0
I did reinstall the OS on my GTX 590 box and have not installed any updates. I am using driver 314.22 right now, and it's been running for two days without any errors at all.

I am now convinced that my problems were caused by a "third-party software" issue, or possibly by the Microsoft WDDM 1.1/1.2/1.3 update package. I'm using Win7 x64, so I really don't think I need the update to the Windows display driver model to work with Win 8 or 8.1 if I'm not using that OS.

Regardless, it's working now!

Operator
Joined: 22 Nov 09 | Posts: 114 | Credit: 589,114,683 | RAC: 0
I had the same problem with this WU on my GTX 580 machine - http://www.gpugrid.net/workunit.php?wuid=4819239 - only I did not have a power outage. My symptom was finding my computer frozen, with no choice other than to hit the reset switch. When the computer came back up, I kept getting Windows balloons saying that there were driver problems and that the driver had failed to start, and then blue screens.

I booted into safe mode, then downloaded and installed the latest WHQL NVIDIA driver. I then rebooted and got exactly the same thing again. I figured it was the GPUGrid WU, so I again booted into safe mode, brought up BOINC Manager and aborted the task. Now my computer comes back up and is running; however, I got a computation error on this WU - http://www.gpugrid.net/workunit.php?wuid=4820870 - which also caused a blue screen. I've set my GPUGrid project to not get new tasks for the time being.

Interestingly enough, my GTX 460 machine seems to be having no problems at the moment.
Joined: 11 Oct 08 | Posts: 1127 | Credit: 1,901,927,545 | RAC: 0
MJH: Any response? I tried to provide as much detail as possible.
MJH | Joined: 12 Nov 07 | Posts: 696 | Credit: 27,266,655 | RAC: 0
Jacob,

Next time this happens, please email me the contents of the slot directory and the task files.

MJH
Joined: 11 Oct 08 | Posts: 1127 | Credit: 1,901,927,545 | RAC: 0
Ha! It seems like it should be easy to reproduce (turn off the PC via the switch, not via a normal shutdown, in the middle of a GPUGrid task)... Challenge accepted.
Joined: 11 Oct 08 | Posts: 1127 | Credit: 1,901,927,545 | RAC: 0
MJH: If I'm able to reproduce the issue, where should I email the requested files? Can you please PM me your email address?

Also... for my first test, the issue did not occur on my Long-run SANTI-baxbim tasks. I wonder if it is task-type-specific? I'll try to test an "abrupt computer restart" against other task types.
Joined: 11 Oct 08 | Posts: 1127 | Credit: 1,901,927,545 | RAC: 0
MJH: I have been able to reproduce the problem with a SANTI_MAR422dim task. Can you please PM me your email address?

Thanks,
Jacob
Joined: 11 Oct 08 | Posts: 1127 | Credit: 1,901,927,545 | RAC: 0
Matt, I have received your PM and have sent you the files. Please let me know if you need anything or find anything!

Thanks,
Jacob
Joined: 22 Nov 09 | Posts: 114 | Credit: 589,114,683 | RAC: 0
MJH: FWIW - my GTX 460 machine finished the task that I posted about, although it took longer than 24 hours. It was a SANTI-baxbim task - http://www.gpugrid.net/workunit.php?wuid=4818983

Also, I have to say that I somewhat agree with the post above about people who run this project really needing to know what they are doing. I'm a software developer / computer scientist by trade, and I build my own PCs when I need them. One reason I think many people might leave this project is that it takes so long to run a WU, and that they must be returned in 5 days. Some people might turn their machines off, and thus would not be able to return a WU in 5 days. Personally, I only run this project on weekends.

In general, I have found this project to be relatively stable, with this perhaps the only serious fault I have encountered so far. However, when faults like this arise, it almost certainly takes skilled people to get out of the situation created. Unfortunately, this and other similar projects, at least as I see it, are on the bleeding edge. As in my job, where the software I work with is also on the bleeding edge (a custom FEA program), it can be extraordinarily difficult to catch a bug like this, since it sounds like it occurs only under limited circumstances that may not be caught in testing unless, as in this case, the PC is shut down abnormally.

Just my $0.02.
Joined: 5 Jan 09 | Posts: 670 | Credit: 2,498,095,550 | RAC: 0
> One reason I think many people might leave this project is that it takes so long to run a WU, and that they must be returned in 5 days. Some people might turn their machines off, and thus would not be able to return a WU in 5 days. Personally, I only run this project on weekends.

That's another thing that is a "trap" and confusing. While the deadline is 5 days, if you don't return the WU within 2 days it is resent to another host, and if that host returns first (likely), your computing time has been wasted (even if you do eventually return a result), because the first valid result returned is the canonical one and yours is binned.

Radio Caroline, the world's most famous offshore pirate radio station. Great music since April 1964. Support Radio Caroline Team - Radio Caroline
Retvari Zoltan | Joined: 20 Jan 09 | Posts: 2380 | Credit: 16,897,957,044 | RAC: 0
> One reason I think many people might leave this project is that it takes so long to run a WU, and that they must be returned in 5 days. Some people might turn their machines off, and thus would not be able to return a WU in 5 days. Personally, I only run this project on weekends.

The 2-day resend was discontinued long ago.
Joined: 5 Jan 09 | Posts: 670 | Credit: 2,498,095,550 | RAC: 0
> One reason I think many people might leave this project is that it takes so long to run a WU, and that they must be returned in 5 days. Some people might turn their machines off, and thus would not be able to return a WU in 5 days. Personally, I only run this project on weekends.

I type corrected :-)

Radio Caroline, the world's most famous offshore pirate radio station. Great music since April 1964. Support Radio Caroline Team - Radio Caroline
Joined: 11 Oct 08 | Posts: 1127 | Credit: 1,901,927,545 | RAC: 0
Alright, back on topic here... I'm waiting for MJH to analyze the files that I sent him.
Joined: 15 May 11 | Posts: 108 | Credit: 297,176,099 | RAC: 0
> Alright, back on topic here...

Jacob,

Are you talking about how, when one of the GPUs hits a TDR, it screws up all the tasks running on the other GPUs as well? That happens to me on my GTX 590 box all the time (mostly power outages). If one task messes up and ends up causing a TDR, or a complete dump and reboot, then when I start BOINC again all the remaining WUs in progress on the other GPUs also cause more TDRs unless I abort them. Sometimes even that doesn't help and I have to completely reset the project.

Example: I had a TDR the other day. Three WUs were uploading at the time; only one was actually processing. Fine. So I reboot, catch BOINC before it starts processing the problem WU, and suspend processing so the three that did complete can upload for credit. Then I abort the problem WU and let the system download 4 new WUs. As soon as processing starts, wham! Another TDR. So I do a reset of the project, and 4 more WUs download and start processing without any problem at all.

So the point is, unless I reset the project when I get a TDR, I'm just wasting my time downloading new WUs, because they are all going to continue to crash until I do a complete reset. I'm not sure what leftover file in the BOINC or GPUGrid project folder(s) is causing the TDRs after the original event.

Is that the same issue you are talking about here, or am I way off?

Operator