Abrupt computer restart - Tasks stuck - Kernel not found
| Author | Message |
|---|---|
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
I believe your issue is a separate one. Mine occurs as outlined in the first post of this thread: if a GPUGrid task is in the middle of being processed and BOINC is shut down abnormally (like a power outage, or the computer freezing without the user issuing the shutdown command), then when the computer/BOINC/task restarts, it can get into a loop where it crashes the driver, tries to start again (I see the "elapsed" time back off a few seconds, indicating it is retrying), crashes the driver again, and so on. It keeps crashing the driver until I abort the task. It does not affect other tasks. I've captured a copy of the data directory while this was happening and submitted some files to MJH, to hopefully figure out what is going on. If you have a different issue, please consider opening a separate thread. Thanks, Jacob |
|
Joined: 22 Nov 09 Posts: 114 Credit: 589,114,683 RAC: 0
|
Was there a resolution to this? I ran several WUs this past weekend on my 580 machine, which is the one that had the problem, and I did not see this issue again. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
There has been no recent contact from MJH, and so no resolution. I believe the issue only happens when the computer (running a GPUGrid.net task) is interrupted (or freezes completely) without being able to shut down cleanly. I haven't seen it happen recently, because I usually shut down/restart normally instead of abruptly cutting power. Regards, Jacob |
|
Joined: 22 Nov 09 Posts: 114 Credit: 589,114,683 RAC: 0
|
Thanks. I had no problems this past weekend. However, I did not experience any abnormal shutdowns or freezes. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 2
|
Are we still collecting these? I had a sticking task - multiple driver restarts after a forced reboot - with 23x6-SANTI_RAP74wtCUBIC-18-34-RND6543_0 The std_err txt follows: I'll preserve the rest of the slot contents before aborting the task, in case anyone wants them. # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 # GPU 0 : 74C # GPU 1 : 55C # GPU 0 : 75C # GPU 1 : 56C # GPU 0 : 76C # GPU 0 : 77C # GPU 0 : 78C # GPU 0 : 79C # GPU 1 : 57C # GPU 0 : 80C # GPU 0 : 81C # GPU 1 : 58C # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. 
# SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
I sent MJH some files, but haven't heard from him :/ |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 2
|
And it's just happened again, this time with potx108-NOELIA_INS1P-0-14-RND5839_0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 1 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:08:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 # GPU 0 : 76C # GPU 1 : 56C # GPU 1 : 57C # GPU 1 : 58C # GPU 1 : 59C # GPU 1 : 60C # GPU 1 : 61C # GPU 0 : 77C # GPU 1 : 62C # GPU 1 : 63C # GPU 0 : 78C # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963. 
# SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 22:21:16 (5824): Can't acquire lockfile (32) - waiting 35s # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. 
# SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 I seem to see similarities in SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963. in both reports. And in both cases, the first error occurs after the first restart. Interestingly, this was running in the same slot directory as the previous one (slot 4), and part of my bug report to BOINC (apart from the non-report of stderr_txt) was that the slot directory wasn't cleaned after an abort. I'll make sure that's done properly before I risk another one. |
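For anyone comparing stderr dumps like the two above, here is a rough Python sketch that counts application starts and tallies the driver error codes. The function name `summarize_stderr` and the regular expressions are my own assumptions, based only on the log lines quoted in this thread. The two symbolic names come from the CUDA driver API's `CUresult` enum (702 is a launch timeout, 999 is an unknown error); the script does not know anything else about SWAN internals.

```python
import re
from collections import Counter

# Symbolic names from the CUDA driver API (CUresult) for the two codes seen here.
CUDA_DRIVER_ERRORS = {
    702: "CUDA_ERROR_LAUNCH_TIMEOUT",
    999: "CUDA_ERROR_UNKNOWN",
}

# Assumed patterns, derived from the log lines quoted in this thread.
FATAL_RE = re.compile(r"SWAN : FATAL : Cuda driver error (\d+)")
# The application banner has a bracket after "# GPU"; temperature lines
# such as "# GPU 0 : 74C" do not, so they are not counted as starts.
START_RE = re.compile(r"# GPU \[[^\]]+\] Platform \[")


def summarize_stderr(text):
    """Return (number of application starts, Counter of driver error codes)."""
    starts = len(START_RE.findall(text))
    errors = Counter(int(code) for code in FATAL_RE.findall(text))
    return starts, errors


if __name__ == "__main__":
    import sys

    # Usage: python summarize_stderr.py stderr.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        starts, errors = summarize_stderr(handle.read())
    print(f"application starts: {starts}")
    for code, count in sorted(errors.items()):
        name = CUDA_DRIVER_ERRORS.get(code, "unrecognized code")
        print(f"driver error {code} ({name}): {count}")
```

Because it works on the whole text rather than line by line, it also copes with a stderr that has been flattened onto a few lines, as in the posts above.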
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
Sorry guys, I've been (and still am) very busy. Jacob, thanks for the files; they were useful and I know how to fix the problem. Unfortunately, I won't have an opportunity to do any more work on the application for a while. Will keep you posted. MJH |
|
Joined: 16 Jul 07 Posts: 209 Credit: 5,616,860,456 RAC: 313,890
|
I have also been experiencing this problem, for at least the past several weeks. Glad to see the cause has been identified by the project; now just waiting for a fix. FWIW, this is the trick I use to get to the BOINC GUI controls before the crashing starts. I add this to cc_config.xml: <cc_config> <options> <start_delay>60</start_delay> </options> </cc_config> The documentation describes it as: "Specify a number of seconds to delay running applications after client startup. New in 6.1.6" No fiddling with safe mode or any of that. http://boinc.berkeley.edu/wiki/Client_configuration Reno, NV Team: SETI.USA |
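If you would rather not hand-edit the file, the `<start_delay>` trick can be sketched in Python roughly as follows. This is only an illustration: `set_start_delay` is a made-up helper name, the path you pass in is up to you (cc_config.xml lives in the BOINC data directory), and the script assumes any existing file is well-formed XML with a `<cc_config>` root. Comments in an existing file will probably be lost when it is rewritten, so keep a backup.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def set_start_delay(config_path, seconds=60):
    """Create or update <options><start_delay> in a cc_config.xml file."""
    path = Path(config_path)
    if path.exists():
        # Keep whatever other options are already in the file.
        tree = ET.parse(str(path))
        root = tree.getroot()
    else:
        root = ET.Element("cc_config")
        tree = ET.ElementTree(root)

    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")

    delay = options.find("start_delay")
    if delay is None:
        delay = ET.SubElement(options, "start_delay")
    delay.text = str(seconds)

    tree.write(str(path), encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    # Usage: python set_start_delay.py <path to cc_config.xml> [seconds]
    target = sys.argv[1]
    delay_seconds = int(sys.argv[2]) if len(sys.argv) > 2 else 60
    set_start_delay(target, delay_seconds)
    print(f"start_delay set to {delay_seconds}s in {target}")
```

The delay only applies from the next client startup, so it has to be in place before the abnormal shutdown happens.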
JStateson Joined: 31 Oct 08 Posts: 186 Credit: 3,578,903,157 RAC: 0
|
Suspect I had the same problem: the driver resetting in a loop, eventually ending in a blue screen and memory dump. I managed to stop the GPU and spotted an MD5 checksum error message associated with some gpugrid logo png file. There is probably more to it than a bad logo file download, so I reset the project and stopped future work. The problem disappeared on this GTX 570 system. Other systems are running gpugrid OK. I upgraded from the 327 to the 331 drivers before deciding to reset the project. EDIT - JUST REALIZED I HAD A POWER OUTAGE RIGHT BEFORE THE PROBLEM. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
The cause: It happens when Windows is shut down unexpectedly, i.e. from freezing up, from the user pulling the plug, or from a power outage. The problem: The driver resets continuously, GPUGrid tasks do not progress normally, and sometimes Windows will BSOD because of the driver resets. The solution: Find a way to abort any GPUGrid tasks that are causing the problem. If Windows gives you enough time to stop BOINC when you log in, then do that: stop/suspend BOINC, abort the GPUGrid tasks, restart/resume BOINC. If Windows doesn't give you enough time, then using the <start_delay> option in cc_config.xml is a good choice, but you would have to start in safe mode (to prevent BOINC from starting) in order to create/edit that file, then start in regular mode, and while BOINC is in the startup delay, stop/suspend BOINC, abort the GPUGrid tasks, and restart/resume BOINC. This is a GPUGrid problem, and I hope MJH fixes it! He says he knows how to; it's just a matter of his limited time. Regards, Jacob |
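The suspend-then-abort sequence can also be done from the command line with `boinccmd` instead of the GUI. The Python sketch below only builds the command lines; it is a sketch under assumptions, not a tested recovery tool. `abort_commands` is a made-up helper name, the project URL is assumed to be `http://www.gpugrid.net/` (it must match the URL the client is attached to), and `boinccmd` is assumed to be on the PATH and able to reach the running client.

```python
import subprocess

# Assumption: the client is attached to GPUGrid under this URL.
GPUGRID_URL = "http://www.gpugrid.net/"


def abort_commands(task_names, project_url=GPUGRID_URL, boinccmd="boinccmd",
                   pause_seconds=600):
    """Build boinccmd invocations: pause GPU work, then abort each named task."""
    # "--set_gpu_mode never <duration>" stops GPU computing temporarily;
    # the previous mode comes back once the duration has elapsed.
    commands = [[boinccmd, "--set_gpu_mode", "never", str(pause_seconds)]]
    for name in task_names:
        commands.append([boinccmd, "--task", project_url, name, "abort"])
    return commands


if __name__ == "__main__":
    # Task name taken from an earlier post in this thread, as an example only.
    stuck = ["23x6-SANTI_RAP74wtCUBIC-18-34-RND6543_0"]
    for command in abort_commands(stuck):
        print(" ".join(command))
        # Uncomment to actually run the command against the local client:
        # subprocess.run(command, check=True)
```

The exact task names to pass in can be read from the Tasks tab or from `boinccmd --get_tasks`. Pausing the GPU first is what stops the driver reset loop long enough for the abort to go through.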
|
Joined: 15 May 11 Posts: 108 Credit: 297,176,099 RAC: 0
|
I have edited the cc_config file to include the startup delay, and now that delay (60 seconds in my case) is applied every time I start BOINC, whether or not there was a problem before it was shut down. So now I don't have to try to 'catch' BOINC to abort tasks, or go into safe mode or anything else. I can just abort tasks that I know will fail due to the power interruption issues I occasionally have to deal with here (mostly on my GTX590 box). Operator |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
My computer abruptly restarted a couple of times today, and I had to deal with this problem again. A "I505-SANTI_baxbim2-18-32" task got stuck in an infinite driver reset loop, and I had to suspend the GPU to get to that task and abort it. A "23x5-SANTI_RAP74wtCUPIC-20-34" task did not get stuck in the loop, so I didn't have to abort that one. So... this is still an ongoing problem for me. MJH? |
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
Jacob, I'll probably get a fix for this problem out next week. Matt |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Jacob, Thanks. I'm looking forward to the fix. And testing the fix should be fun too muaahahahahaha (don't get to yank power cord out of this machine very often!) |
|
Joined: 22 Jan 09 Posts: 8 Credit: 988,332,833 RAC: 0
|
Please hurry MJH.
|
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
"And testing the fix should be fun too muaahahahahaha (don't get to yank power cord out of this machine very often!)" LOL! I recommend using a switch instead (the power switch or the one at the PSU), as these are "debounced" (not sure this is the correct electrical engineering term... it sounds wrong). It could also work to just kill BOINC via Task Manager; maybe try this before the fix is out :) MrS Scanning for our furry friends since Jan 2002 |
Chilean Joined: 8 Oct 12 Posts: 98 Credit: 385,652,461 RAC: 0
|
"And testing the fix should be fun too muaahahahahaha (don't get to yank power cord out of this machine very often!)" You got it right.
|
[PUGLIA] Riccardo Joined: 27 Feb 12 Posts: 2 Credit: 3,410,838 RAC: 0
|
Exactly the same for me. 3 SANTI WUs were corrupted after a power outage (and a PSU that is about to be retired!!!). They are 7443155, 7456552 and 7457465 among my current WUs: http://www.gpugrid.net/results.php?hostid=155107 The drivers kept crashing and Win7 kept rebooting until I was fast enough to suspend work and abort the GPUGRID WUs. My God, it's full of stars! |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
I didn't have a power outage, but the computer did restart (the WUs caused the system to restart). On reboot the driver kept crashing while trying to run the same tasks. 43x1-SANTI_RAP74wtCUBIC-22-34-RND5480_0 SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963. FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |