Problem - Tasks error when exiting/resuming using 334.67 drivers

Jacob Klein
Message 35927 - Posted: 27 Mar 2014, 11:47:41 UTC
Last modified: 27 Mar 2014, 11:49:41 UTC

MJH: Please please please help.

I just threw away another several hours of GPUGrid work, because I had to restart BOINC, and the 2 GPUGrid tasks died. :( This time, I didn't suspend the tasks, I just exited BOINC normally. Then, upon restart, both tasks died. Surely this is fixable?!?!?

Name	1211-GIANNI_ntl-1-4-RND3734_0
Workunit	5485267
Created	26 Mar 2014 | 21:32:15 UTC
Sent	27 Mar 2014 | 0:06:01 UTC
Received	27 Mar 2014 | 11:46:13 UTC
Server state	Over
Outcome	Computation error
Client state	Compute error
Exit status	80 (0x50) Unknown error number
Computer ID	153764
Report deadline	1 Apr 2014 | 0:06:01 UTC
Run time	0.00
CPU time	0.00
Validate state	Invalid
Credit	0.00
Application version	Long runs (8-12 hours on fastest card) v8.15 (cuda42)
Stderr output

<core_client_version>7.3.11</core_client_version>
<![CDATA[
<message>
The file exists.
 (0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203M] VERSION [42]
# SWAN Device 1	:
#	Name		: GeForce GTX 460
#	ECC		: Disabled
#	Global mem	: 1024MB
#	Capability	: 2.1
#	PCI ID		: 0000:08:00.0
#	Device clock	: 1526MHz
#	Memory clock	: 1900MHz
#	Memory width	: 256bit
#	Driver version	: r334_89 : 33523
# GPU 0 : 58C
# GPU 1 : 47C
# GPU 2 : 67C
# GPU 0 : 60C
# GPU 1 : 50C
# GPU 2 : 69C
# GPU 0 : 61C
# GPU 1 : 52C
# GPU 0 : 62C
# GPU 1 : 55C
# GPU 2 : 70C
# GPU 0 : 63C
# GPU 1 : 56C
# GPU 2 : 71C
# GPU 1 : 57C
# GPU 0 : 64C
# GPU 1 : 59C
# GPU 2 : 72C
# GPU 0 : 65C
# GPU 1 : 61C
# GPU 1 : 62C
# GPU 1 : 63C
# GPU 2 : 73C
# GPU 0 : 66C
# GPU 1 : 64C
# GPU 2 : 74C
# GPU 1 : 65C
# GPU 0 : 67C
# GPU 1 : 66C
# GPU 2 : 75C
# GPU 0 : 68C
# GPU 2 : 76C
# GPU 1 : 67C
# BOINC suspending at user request (exit)

</stderr_txt>
]]>


Name	1733-GIANNI_ntl-3-4-RND9094_0
Workunit	5485140
Created	26 Mar 2014 | 21:06:01 UTC
Sent	27 Mar 2014 | 6:35:37 UTC
Received	27 Mar 2014 | 11:46:13 UTC
Server state	Over
Outcome	Computation error
Client state	Compute error
Exit status	80 (0x50) Unknown error number
Computer ID	153764
Report deadline	1 Apr 2014 | 6:35:37 UTC
Run time	0.00
CPU time	0.00
Validate state	Invalid
Credit	0.00
Application version	Long runs (8-12 hours on fastest card) v8.15 (cuda55)
Stderr output

<core_client_version>7.3.11</core_client_version>
<![CDATA[
<message>
The file exists.
 (0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3203M] VERSION [55]
# SWAN Device 0	:
#	Name		: GeForce GTX 660 Ti
#	ECC		: Disabled
#	Global mem	: 3072MB
#	Capability	: 3.0
#	PCI ID		: 0000:09:00.0
#	Device clock	: 1124MHz
#	Memory clock	: 3004MHz
#	Memory width	: 192bit
#	Driver version	: r334_89 : 33523
# GPU 0 : 64C
# GPU 1 : 65C
# GPU 2 : 74C
# GPU 2 : 75C
# GPU 0 : 65C
# GPU 1 : 66C
# GPU 0 : 66C
# GPU 2 : 76C
# GPU 1 : 67C
# BOINC suspending at user request (exit)

</stderr_txt>
]]>
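For what it's worth, exit code 80 (0x50) is the standard Win32 error ERROR_FILE_EXISTS; the "The file exists." line in the <message> block is just Windows' own text for that code. A standalone snippet (nothing to do with the ACEMD app itself) confirms the mapping:

#include <windows.h>
#include <cstdio>

// Print Windows' message text for error code 80 (0x50), i.e. ERROR_FILE_EXISTS.
int main() {
    char buf[256] = {0};
    FormatMessageA(FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
                   nullptr, 80 /* ERROR_FILE_EXISTS */, 0, buf, sizeof(buf), nullptr);
    printf("Win32 error 80: %s", buf);  // prints "The file exists."
    return 0;
}

So whatever the app does on resume is apparently tripping over a file that is already present.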

MJH
Message 35928 - Posted: 27 Mar 2014, 13:09:29 UTC - in response to Message 35927.  

Jacob

Try the acemdshort app, 8.20. It should fix the problem.

Matt

Jacob Klein
Message 35934 - Posted: 27 Mar 2014, 18:12:43 UTC - in response to Message 35928.  
Last modified: 27 Mar 2014, 18:13:05 UTC

What was the problem, and what was the fix? When do you think it will land on the Long queue?

I will try to monitor application version numbers more closely, as I usually get a variety of Long/Short tasks.

MJH
Message 35941 - Posted: 27 Mar 2014, 19:56:23 UTC - in response to Message 35934.  

The problem, I think, is a false positive from the test that checks whether the WU has got stuck in a crash loop, introduced in 8.15.

I fixed that a while ago, but only rolled the fix out with 8.20.

Let's see...
Matt
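
To illustrate the failure mode - the ACEMD source is not public, so this is only a guess at the shape of that test, and every name in it is hypothetical. A marker-file crash-loop guard like this would fail in exactly this way when the marker is not cleaned up on a normal "suspend at user request" exit:

#include <windows.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical sketch of a crash-loop guard. Pattern: create a marker file
// with CREATE_NEW at startup, delete it on a clean finish. If a previous run
// left the marker behind, assume we are stuck in a crash loop and abort.
static const char* MARKER = "restart_guard.dat";  // hypothetical file name

void check_crash_loop() {
    HANDLE h = CreateFileA(MARKER, GENERIC_WRITE, 0, nullptr,
                           CREATE_NEW, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE && GetLastError() == ERROR_FILE_EXISTS) {
        // False positive: a normal exit-without-finishing also leaves the
        // marker, so a clean resume is misread as a crash loop.
        fprintf(stderr, "marker present - assuming crash loop\n");
        exit(ERROR_FILE_EXISTS);  // surfaces as "exit code 80 (0x50)"
    }
    CloseHandle(h);
}

void on_clean_finish() {
    DeleteFileA(MARKER);  // a fix would also need to delete it on suspend/exit
}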

Jacob Klein
Message 36010 - Posted: 30 Mar 2014, 22:42:14 UTC - in response to Message 35941.  

When do you plan on deploying the 8.20 app to the Long queue? People are still getting the "The file exists" error and losing tons of work, daily. If you were still testing it, why wasn't it contained to the Beta queue? Since it's already on Short, I think it should already be on Long too.

Sick of losing work because of this...

Stefan
Project administrator
Project developer
Project tester
Project scientist
Message 36016 - Posted: 31 Mar 2014, 9:08:49 UTC - in response to Message 36010.  

Bugs get through beta-queue testing from time to time, so it's obviously better if we lose the work only on the short queue rather than on both queues. But at this point 8.20 looks stable enough, so I will suggest to Matt that he push it to long.

MJH
Message 36017 - Posted: 31 Mar 2014, 10:37:37 UTC - in response to Message 36010.  

Jacob,

8.20 for cuda6 is on long now.

Matt

Richard Haselgrove
Message 36018 - Posted: 31 Mar 2014, 11:05:34 UTC - in response to Message 36017.  

Jacob,

8.20 for cuda6 is on long now.

Matt

Have you been able to find a way of preventing the server from allocating cuda55 or cuda42 to Maxwell (CC 5.0) cards yet?

It doesn't waste any actual computing time, but the downloads are a bit of a pain - and having several hours of expected crunching suddenly disappear rather confuses BOINC's scheduler. :-D

MJH
Message 36020 - Posted: 31 Mar 2014, 11:20:15 UTC - in response to Message 36018.  


Have you been able to find a way of preventing the server from allocating cuda55 or cuda42 to Maxwell (CC 5.0) cards yet?


No idea, although I haven't looked deeply into it yet.

Matt

Richard Haselgrove
Message 36021 - Posted: 31 Mar 2014, 11:31:40 UTC - in response to Message 36020.  


Have you been able to find a way of preventing the server from allocating cuda55 or cuda42 to Maxwell (CC 5.0) cards yet?


No idea, although I haven't looked deeply into it yet.

Matt

It should be possible, by setting a maximum compute_capability for the two unwanted plan_classes.
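
Roughly this shape of check, somewhere in the scheduler's plan-class logic (a self-contained sketch of the idea only, not the actual BOINC sched_customize.cpp code; the function and type names here are invented):

#include <cstring>

// Stand-in for the GPU's CUDA compute capability as the scheduler sees it.
struct GpuProps { int major; int minor; };

// Return false to refuse sending this plan class to this host's GPU.
bool plan_class_ok_for_gpu(const char* plan_class, const GpuProps& prop) {
    // Maxwell (compute capability 5.0+) needs the cuda60 build; the cuda42
    // and cuda55 apps fail there immediately, wasting the download.
    if (!strcmp(plan_class, "cuda42") || !strcmp(plan_class, "cuda55")) {
        if (prop.major >= 5) return false;  // i.e. a max compute capability of 4.x
    }
    return true;
}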

Jacob Klein
Message 36022 - Posted: 31 Mar 2014, 12:09:55 UTC - in response to Message 36017.  
Last modified: 31 Mar 2014, 12:11:22 UTC

Jacob,

8.20 for cuda6 is on long now.

Matt



Finally!!

I noticed that it was only deployed for the cuda6 plan classes; are there any plans to update the app for the other plan classes?

Also, please continue to make stability a priority. It is so very frustrating to lose progress. Some of the tasks that fail say they only had a couple of seconds of run-time, when I believe they actually had several hours invested. Perhaps that masked the severity of the issue for you; I'm not sure. But I hope bug-fixing becomes a high(er) priority.

Regards,
Jacob

Jacob Klein
Message 36083 - Posted: 4 Apr 2014, 2:33:01 UTC

Had to chime in again to say THANK YOU for fixing this. BOINC Task Stability is obviously very important to me, and this bug had been plaguing me for weeks. The new 8.20 app seems to be suspending/exiting/resuming much better for me thus far.

Thank you!

Wdethomas
Message 36143 - Posted: 7 Apr 2014, 19:03:44 UTC

This has not been fixed. All my WUs are cuda55, and if the power goes out, the work units get lost.

Variable
Message 36145 - Posted: 7 Apr 2014, 19:43:02 UTC

It looks like I've started getting some errors on my machine as well over the last few days. It's not running overly hot; I'm not sure what's going on. This is the output from the last one:

Stderr output
<core_client_version>7.2.33</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 760] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 760
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3404MHz
# Memory width : 256bit
# Driver version : r334_00 : 33489
# GPU 0 : 44C
# GPU 0 : 45C
# GPU 0 : 47C
# GPU 0 : 48C
# GPU 0 : 49C
# GPU 0 : 50C
# GPU 0 : 51C
# GPU 0 : 52C
# GPU 0 : 53C
# GPU 0 : 54C
# GPU 0 : 55C
# GPU 0 : 56C
# GPU 0 : 57C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 76000)
# GPU [GeForce GTX 760] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 760
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3404MHz
# Memory width : 256bit
# Driver version : r334_00 : 33489
# GPU 0 : 56C
# GPU 0 : 57C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 174000)
# GPU [GeForce GTX 760] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 760
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3404MHz
# Memory width : 256bit
# Driver version : r334_00 : 33489
# GPU 0 : 56C
# GPU 0 : 57C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 175000)
# GPU [GeForce GTX 760] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 760
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3404MHz
# Memory width : 256bit
# Driver version : r334_00 : 33489
# The simulation has become unstable. Terminating to avoid lock-up (1)

</stderr_txt>
]]>

Jim1348
Message 36146 - Posted: 7 Apr 2014, 21:46:59 UTC - in response to Message 36145.  
Last modified: 7 Apr 2014, 21:50:15 UTC

It looks like I've started getting some errors on my machine as well over the last few days. It's not running overly hot, not sure what's going on.

I have been seeing that too recently on one of my previously stable GTX 660s, but the other one, which I had underclocked from 993 MHz to 967 MHz, has been stable. So it appears the work units have just gotten a little more demanding, and now I am underclocking both cards. I would suggest reducing your GPU clock to 1000 MHz or so. (It is not a heat issue; mine were around 66 C.)

petnek
Message 36219 - Posted: 11 Apr 2014, 4:54:24 UTC

I have the same issue on two different GPUs with different drivers.

On GTX 275:
<core_client_version>7.2.39</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -59 (0xffffffc5)


On Quadro FX 3800:
<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
The file exists. (0x50) - exit code 80 (0x50)


On both I am running short tasks.

Please fix this failure!

TJ
Message 36220 - Posted: 11 Apr 2014, 8:23:12 UTC
Last modified: 11 Apr 2014, 8:24:05 UTC

Perhaps this helps a little. Yesterday I needed to reboot all my systems for the necessary Windows updates, after they had been running for 26 days.
The first thing I do is set them to accept no new work so the queue can empty. Eventually I needed to go to bed with WUs still running, so I suspended all work in BOINC Manager and then did a cold boot (installed the updates and powered the systems off). After starting the PCs I went back to BOINC Manager and resumed work. Everything ran fine without errors.

I know this is not the option Jacob, the original poster, wants, but at least in my case it did not result in lost work.

Edit: I should mention I am still using the 331.82 graphics driver.
Greetings from TJ

MJH
Message 36223 - Posted: 11 Apr 2014, 10:11:57 UTC - in response to Message 36083.  


Thank you!


Thank you too, for your help in diagnosing it.
On to the next problem!

Matt

Jacob Klein
Message 36252 - Posted: 12 Apr 2014, 14:38:14 UTC - in response to Message 36223.  


Thank you!


Thank you too, for your help in diagnosing it.
On to the next problem!

Matt


I thought this problem was fixed -- why are we still receiving 8.15 tasks? I just had 2 more fail, losing several hours of work, presumably because they were 8.15 instead of 8.20. Upsetting.