Message boards : Number crunching : Abrupt computer restart - Tasks stuck - Kernel not found
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 2

> (the WU's caused the system to restart)

That's a bold statement. Have you opted IN to the current beta test of the v8.15 application, designed to prevent the endless driver crash loop on restart, however the original problem came about?
Joined: 26 Jun 09 · Posts: 815 · Credit: 1,470,385,294 · RAC: 0

I also had an error with a Santi RAP after 3 "stop and starts". I also had an error earlier this week with a Noelia task with a fatal CUDA driver error, but that was the first time the GPU clock was not down-clocked. I also ran a Santi LR last week; although it finished without error, it had down-clocked the GPU. I have opted in to the beta but only got two of them, and they were quite fast.

Greetings from TJ
skgiven · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0

> Have you opted IN to the current beta test of the v8.15 application...

I didn't bother yesterday, as I saw that only ~10 test WU's were released and none were available. Since selecting betas today, none have come my way so far:

16/11/2013 14:32:50 | GPUGRID | No tasks are available for ACEMD beta version
16/11/2013 14:48:03 | GPUGRID | No tasks are available for ACEMD beta version
16/11/2013 15:01:34 | GPUGRID | No tasks are available for ACEMD beta version
16/11/2013 15:10:52 | GPUGRID | No tasks are available for ACEMD beta version

GPUGRID ACEMD beta version 1140.53 (5.24%)
GPUGRID ACEMD beta version 47.20 (1.12%)
GPUGRID ACEMD beta version 544.78 (5.51%)

FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0

Right. I too have not yet been able to get a Beta task to do testing with. But I think the point was:

The 8.15 beta application supposedly fixes the problem that you (and I and others) had, which is caused by [an abrupt computer restart, or power outage, or BOINC being killed in Task Manager without closing gracefully], results in a loop of driver resets, and can only be resolved by aborting the GPUGrid task(s) causing the loop.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 2

Right. And looking at the stderr for the individual task in question, I could see no sign that GPUGrid had crashed or otherwise caused the initial problem, only that it had entered the 'looping driver' state on the first restart.

There seem to be more Beta tasks available for testing this afternoon - I have some flagged 'KLAUDE' which look to be heading towards 6-7 hours on my GTX 670s. That should be long enough to trigger a crash for testing purposes :)
skgiven · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0

The Betas might fix the driver restarts, but that doesn't address the cause of the system crash/restart - if it is related to the task/app. This seemed to be happening in the past with certain types of WU: if you ran those WU's, the system crashed and the drivers restarted on reboot; if you didn't run those tasks, there weren't any restarts or driver failures. There probably wouldn't be anything in the BOINC logs if the app/WU triggered an immediate system stop.

FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0

MJH: I just got my first Beta unit that lasted more than a couple of seconds. But when I tested it (killing process trees using Process Explorer), it restarted fine 2 times; on the 3rd time, the task itself errored out with:

Exit status: 80 (0x50) Unknown error number
Message: The file exists. (0x50) - exit code 80 (0x50)

See: http://www.gpugrid.net/result.php?resultid=7474144

Was this expected? Or is this a new bug?

Also, if you would like us to test the Beta units by doing abnormal actions, please give us a set of steps to perform. I had just been letting it run for a bit, then killing the tree in Process Explorer, but that was just a guess as to what testing steps might be necessary. So, let me know what you think about this one?
MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0

Jacob - That's expected behaviour now, but not entirely desired. The app misinterpreted the rapid restart as an indication that it was stuck in a restart loop and so aborted. I expect I'll need a more sensitive test.

Matt
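For illustration only, here is a minimal sketch of the kind of restart-loop heuristic Matt describes: treat rapid restarts as a crash loop only when no progress has been made in between. The file name, thresholds and progress source below are assumptions made for the sketch, not the actual ACEMD implementation.

```python
# Hypothetical restart-loop heuristic: abort only if several restarts happen
# in a short window with no progress past the last checkpoint.
import json
import time

RESTART_LOG = "restart_history.json"   # hypothetical bookkeeping file kept in the slot directory
MAX_RESTARTS = 3                       # restarts tolerated...
WINDOW_SECONDS = 600                   # ...within this many seconds
MIN_PROGRESS = 1e-6                    # progress delta that counts as "the task moved on"


def should_abort(current_progress: float) -> bool:
    """Return True only if the task looks genuinely stuck in a restart loop."""
    try:
        with open(RESTART_LOG) as f:
            history = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        history = {"restarts": [], "progress": 0.0}

    now = time.time()
    # Keep only the restarts that happened inside the time window, then add this one.
    recent = [t for t in history["restarts"] if now - t < WINDOW_SECONDS]
    recent.append(now)

    made_progress = current_progress - history["progress"] > MIN_PROGRESS
    if made_progress:
        recent = [now]  # any real progress resets the loop counter

    with open(RESTART_LOG, "w") as f:
        json.dump({"restarts": recent, "progress": current_progress}, f)

    # Abort only when several rapid restarts occurred with no progress in between.
    return len(recent) > MAX_RESTARTS and not made_progress
```

Under a rule like this, back-to-back tree kills would still register as a loop unless a checkpoint was written between them, which matches the behaviour Jacob saw.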
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0

So, for testing the current app, should I have waited several checkpoints between tree kills?
JStateson · Joined: 31 Oct 08 · Posts: 186 · Credit: 3,578,903,157 · RAC: 0

I hope this gets fixed, because cold weather here is causing more frequent power outages and I have a farm of GPUGrid systems.
Joined: 7 Jun 12 · Posts: 112 · Credit: 1,140,895,172 · RAC: 0

The GPUGRID project is no longer under control; errors and various problems with tasks keep increasing for the users crunching them... I am completely done with this project.
Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549 · RAC: 0

Actually the subjective error rate has decreased a lot since the trouble was resolved a few months ago, when Matt developed the app to 8.14. What's left are occasional glitches (like sending WUs to the wrong queue) and, from what I'm seeing, more isolated and/or special errors.

MrS

Scanning for our furry friends since Jan 2002
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 2

> Actually the subjective error rate has decreased a lot since the trouble was resolved a few months ago, when Matt developed the app to 8.14. What's left are occasional glitches (like sending WUs to the wrong queue) and from what I'm seeing more isolated and/or special errors.

And I suspect it will be even better when they have enough confidence to promote the restart-fix v8.15 from Beta to stock application.
Joined: 1 Dec 12 · Posts: 24 · Credit: 60,122,950 · RAC: 0

> Actually the subjective error rate has decreased a lot since the trouble was resolved a few months ago, when Matt developed the app to 8.14. What's left are occasional glitches (like sending WUs to the wrong queue) and from what I'm seeing more isolated and/or special errors.

Yeah, maybe, but my computer also had a few BSODs yesterday, with multiple long-run WU's. Nothing worked after that; I had to delete the whole BOINC folder and the driver, and clean-install everything, and after that it finally worked again. A lot of work for a few long runs, in my opinion. I'm glad I only have 1 PC, haha :P

The BSOD had something to do with kernel issues and corrupted the installed NVIDIA driver. So when the computer booted, the screens froze, the driver crashed within a few seconds, and after that I got a BSOD again, again and again. I don't know whether it's a coincidence, or whether more people have had the same kind of problems.
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0

FoldingNator: I don't think it was corrupting the drivers. I'm betting what you were experiencing was exactly what I reported in post 1 of this thread.

Specifically, here are the steps that create the problem:
- v8.14 GPUGrid tasks are abruptly interrupted (by power outage, or BSOD, or improper Windows shutdown, or the BOINC client being killed via Task Manager)
- Windows or the user starts BOINC
- BOINC tries to run one or more of the v8.14 abruptly-interrupted GPUGrid tasks
- Running those abruptly-interrupted tasks resets the NVIDIA drivers continuously (with either continual "Display driver stopped working" notifications or BSODs)

The workaround (as previously recommended) is to:
- Abort those v8.14 abruptly-interrupted GPUGrid tasks (try to suspend BOINC at the earliest opportunity, to stop the crashing, so that you can abort these tasks)
- Restart the computer

The solution (which prevents these tasks from getting stuck in a crashing loop) is:
- GPUGrid needs to release the v8.15 application (which fixes the issue, but is currently still only on the Beta queue)

Regards,
Jacob Klein

MJH: We can haz 8.15?
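For anyone who wants to script the workaround above, a minimal sketch using the stock boinccmd tool could look like the following: suspend computation first so the driver stops resetting, abort the interrupted tasks, then hand control back. The boinccmd path is the default Windows install location and the task names are placeholders you would read from BOINC Manager; both are assumptions, not part of Jacob's post.

```python
# Hedged sketch of the workaround: suspend crunching, abort the named
# GPUGrid tasks, then return BOINC to its normal run mode.
# BOINCCMD path and the task names passed in are assumptions/placeholders.
import subprocess
import sys

BOINCCMD = r"C:\Program Files\BOINC\boinccmd.exe"   # default Windows path (assumption)
PROJECT_URL = "http://www.gpugrid.net/"


def run(*args):
    subprocess.run([BOINCCMD, *args], check=True)


def abort_stuck_tasks(task_names):
    run("--set_run_mode", "never")      # stop crunching so the crash loop pauses
    for name in task_names:
        run("--task", PROJECT_URL, name, "abort")
    run("--set_run_mode", "auto")       # hand control back to your normal preferences


if __name__ == "__main__":
    # Usage: python abort_stuck.py TASKNAME1 TASKNAME2 ...
    abort_stuck_tasks(sys.argv[1:])
```

After aborting, restart the computer as the workaround suggests.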
skgiven · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0

I again had this problem today and yesterday. It impacts Windows systems only. No power outage, no improper shutdowns. The app/tasks cause the drivers to fail, and on reboot one or more GPUGrid WU's fail. Some GPUGrid WU's can continue, however.

FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help
Joined: 1 Dec 12 · Posts: 24 · Credit: 60,122,950 · RAC: 0

Hi Jacob/skgiven, thanks for your messages. It sounds the same, but after the driver crash the Windows logfiles said that a part of the driver was corrupted. Though I also doubt it. I restarted my computer multiple times, but it didn't work. Actually, these were my steps:

1.) restart Windows in safe mode
2.) set BOINC so the manager doesn't start automatically after system start-up
3.) abort the SANTI and NATHAN runs
4.) restart again and test a new run -> again a driver crash and BSOD
5.) delete the driver with Driver Sweeper + CCleaner for the registry entries, restart
6.) install a new driver, restart, start up BOINC and the runs from point 4 -> again a BSOD
7.) set everything in MSI Afterburner to stock -> new tasks -> BSOD
8.) restart again, start up BOINC and continue with the tasks from point 7 - finally it gets going... it runs again
9.) after 30 minutes I paused it, brought the OC back, and the WU's are running fine... very strange IMO

> MJH: We can haz 8.15?

Hmmm, sounds like a great idea. ;-)
skgiven · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0

This morning I again found that my computer had restarted (3 days in a row), and when I logged in the NVidia driver repeatedly restarted. One GPUGrid task had completed and wanted to upload (which I also saw yesterday), so it's likely that the new task is causing this. When you have 3 GPU's in the one system and have to abort 3 tasks, 3 days in a row, things aren't working out!

FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help
Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0

I have never had the reboot problem on my dedicated BOINC PC with two GTX 660s running the longs (331.65 drivers, Win7 64-bit). But that PC has an uninterruptible power supply (UPS) and never suffers from power outages. Also, the cards are now stable, after some effort as explained elsewhere, to the point where they no longer have the "The simulation has become unstable. Terminating to avoid lock-up" problem.

However, my main PC is a different story. That one also has a UPS, but I put the weakest of my GTX 660s there, and found that the stability of a card depends on the motherboard; what works in an Ivy Bridge board does not work in an older P45 Core2Duo board. Before I got it stable, that card would crash on its own, sometimes producing a BSOD, which then initiated the reboot problem. Curiously, the BSODs often don't even produce a minidump file; unless you are there to see it, you might miss that it happened at all. So I had to reduce the GPU clock still more, and even reduce the memory clock, to get it stable.

So the bottom line is that until they can come out with the 8.15 fix, the best thing to do is to make your cards as stable as possible, so that you never get the "The simulation has become unstable" messages in stderr.txt, which I now consider a canary in the coal mine for the reboot problem. As discussed elsewhere, what worked for me is increasing the power limit (to 110%), reducing the GPU clocks and increasing the GPU core voltage as necessary; Nvidia Inspector worked for me. Of course, the cooling should be sufficient also; it is worth doing what it takes.
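If you want to catch that canary automatically, a small monitoring sketch along these lines could scan each slot's stderr.txt for the message quoted above. The BOINC data directory path is the Windows default and is an assumption; adjust it for your install.

```python
# Sketch: scan every BOINC slot's stderr.txt for the instability message
# the post above treats as a warning sign for the reboot problem.
# SLOTS_DIR is the default Windows data directory (assumption).
from pathlib import Path

SLOTS_DIR = Path(r"C:\ProgramData\BOINC\slots")
CANARY = "The simulation has become unstable"


def find_unstable_slots():
    hits = []
    for stderr in SLOTS_DIR.glob("*/stderr.txt"):
        try:
            text = stderr.read_text(errors="ignore")
        except OSError:
            continue  # slot may be busy or unreadable; skip it
        if CANARY in text:
            hits.append(stderr.parent.name)
    return hits


if __name__ == "__main__":
    unstable = find_unstable_slots()
    if unstable:
        print("Instability messages found in slots:", ", ".join(unstable))
    else:
        print("No instability messages found.")
```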
(retired account) · Joined: 22 Dec 11 · Posts: 38 · Credit: 28,606,255 · RAC: 0

I also have a massive problem with reboots now, which might be related. The GTX Titan (Win 7 SP1 64-bit, driver 331.40) received the following two long runs and then a short run (which I loaded on purpose for testing) today:

I72R1-NATHAN_KIDKIXc22_6-34-50-RND4048
I35R3-NATHAN_KIDKIXc22_6-41-50-RND0098
I259-SANTI_baxbimSPW2-8-62-RND4721

With all three workunits the PC suddenly crashed and rebooted (I did not *see* a BSOD, the screen only went black, then an immediate reboot). I don't have BOINC in autostart; when I started it manually it took a few seconds, then the nvidia driver crashed and was restarted by Windows (no reboot or BSOD here, BOINC still ran). The GPUGRID workunits crashed, and in the first two cases the long runs also took the WUProp workunits down with them.

http://wuprop.boinc-af.org/result.php?resultid=36443848
http://wuprop.boinc-af.org/result.php?resultid=36448588

All other projects running concurrently remained unharmed afaics (including Einstein FGRP2, S6CasA, Test4Theory and POEM++ OpenCL, the latter running on a HD 7950).

<core_client_version>7.2.31</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -52 (0xffffffcc)
</message>
<stderr_txt>
# GPU [GeForce GTX TITAN] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX TITAN
# ECC : Disabled
# Global mem : 4095MB
# Capability : 3.5
# PCI ID : 0000:01:00.0
# Device clock : 875MHz
# Memory clock : 3004MHz
# Memory width : 384bit
# Driver version : r331_00 : 33140
# GPU 0 : 49C
(...)
# GPU 0 : 81C
# GPU [GeForce GTX TITAN] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX TITAN (see above)
# Driver version : r331_00 : 33140
SWAN : FATAL Unable to load module .mshake_kernel.cu. (999)
</stderr_txt>
]]>

The card ran fine the last week with Milkyway in DP mode. I will try to run some other projects in SP mode on the Titan now, to see whether the card and the nvidia driver installation are still fine. I will also test whether the same problem occurs with the GT 650M card.

Mark my words and remember me. - 11th Hour, Lamb of God