Problem - Tasks error when exiting/resuming using 334.67 drivers

Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0

MJH: Please please please help. I just threw away another several hours of GPUGrid work, because I had to restart BOINC, and the 2 GPUGrid tasks died. :( This time, I didn't suspend the tasks, I just exited BOINC normally. Then, upon restart, both tasks died. Surely this is fixable?!?

| | Task 1 | Task 2 |
|---|---|---|
| Name | 1211-GIANNI_ntl-1-4-RND3734_0 | 1733-GIANNI_ntl-3-4-RND9094_0 |
| Workunit | 5485267 | 5485140 |
| Created | 26 Mar 2014, 21:32:15 UTC | 26 Mar 2014, 21:06:01 UTC |
| Sent | 27 Mar 2014, 0:06:01 UTC | 27 Mar 2014, 6:35:37 UTC |
| Received | 27 Mar 2014, 11:46:13 UTC | 27 Mar 2014, 11:46:13 UTC |
| Server state | Over | Over |
| Outcome | Computation error | Computation error |
| Client state | Compute error | Compute error |
| Exit status | 80 (0x50) Unknown error number | 80 (0x50) Unknown error number |
| Computer ID | 153764 | 153764 |
| Report deadline | 1 Apr 2014, 0:06:01 UTC | 1 Apr 2014, 6:35:37 UTC |
| Run time | 0.00 | 0.00 |
| CPU time | 0.00 | 0.00 |
| Validate state | Invalid | Invalid |
| Credit | 0.00 | 0.00 |
| Application version | Long runs (8-12 hours on fastest card) v8.15 (cuda42) | Long runs (8-12 hours on fastest card) v8.15 (cuda55) |

Stderr output, task 1:

```
<core_client_version>7.3.11</core_client_version>
<![CDATA[
<message>
The file exists. (0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203M] VERSION [42]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r334_89 : 33523
# GPU 0 : 58C
# GPU 1 : 47C
# GPU 2 : 67C
# GPU 0 : 60C
# GPU 1 : 50C
# GPU 2 : 69C
# GPU 0 : 61C
# GPU 1 : 52C
# GPU 0 : 62C
# GPU 1 : 55C
# GPU 2 : 70C
# GPU 0 : 63C
# GPU 1 : 56C
# GPU 2 : 71C
# GPU 1 : 57C
# GPU 0 : 64C
# GPU 1 : 59C
# GPU 2 : 72C
# GPU 0 : 65C
# GPU 1 : 61C
# GPU 1 : 62C
# GPU 1 : 63C
# GPU 2 : 73C
# GPU 0 : 66C
# GPU 1 : 64C
# GPU 2 : 74C
# GPU 1 : 65C
# GPU 0 : 67C
# GPU 1 : 66C
# GPU 2 : 75C
# GPU 0 : 68C
# GPU 2 : 76C
# GPU 1 : 67C
# BOINC suspending at user request (exit)
</stderr_txt>
]]>
```

Stderr output, task 2:

```
<core_client_version>7.3.11</core_client_version>
<![CDATA[
<message>
The file exists. (0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3203M] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:09:00.0
# Device clock : 1124MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r334_89 : 33523
# GPU 0 : 64C
# GPU 1 : 65C
# GPU 2 : 74C
# GPU 2 : 75C
# GPU 0 : 65C
# GPU 1 : 66C
# GPU 0 : 66C
# GPU 2 : 76C
# GPU 1 : 67C
# BOINC suspending at user request (exit)
</stderr_txt>
]]>
```

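For reference, exit status 80 (0x50) is the standard Windows error code ERROR_FILE_EXISTS, which matches the "The file exists." text in the stderr above. A minimal sketch (not GPUGrid code, just the Win32 lookup) that turns the raw exit code into that message:

```cpp
// Sketch: map the reported exit status (80, i.e. 0x50) to its Windows message text.
// ERROR_FILE_EXISTS is defined as 80 in <winerror.h>; FormatMessageA returns the same
// string shown in the task details: "The file exists."
#include <windows.h>
#include <cstdio>

int main() {
    DWORD code = 80;                    // exit status reported for the failed tasks
    char buf[256] = {0};
    FormatMessageA(FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
                   NULL, code, 0, buf, sizeof(buf), NULL);
    std::printf("exit code %lu (0x%lX): %s", code, code, buf);
    return 0;
}
```
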
MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0

Jacob, try the acemdshort 820 app. It should fix the problem.

Matt

Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0

What was the problem, and what was the fix? When do you think it will land on the Long queue? I will try to monitor application version numbers more closely, as I usually get a variety of Long/Short tasks.

MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0

The problem, I think, is a false positive from the test that checks whether the WU has got stuck in a crash loop, which was introduced in 815. I fixed that a while ago but only rolled the fix out with 820. Let's see...

Matt

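To make that failure mode concrete: the "The file exists" exit suggests the crash-loop test is file-based, so here is a heavily simplified, hypothetical illustration (the file name and logic are assumptions, not taken from the ACEMD source) of how such a check can misfire when a clean exit does not remove its marker:

```cpp
// Hypothetical illustration of a marker-file crash-loop check -- not ACEMD code.
// Idea: drop a marker at startup and remove it on clean shutdown; if it is still
// there at the next start, assume the previous run crashed. If the clean-exit path
// (e.g. "BOINC suspending at user request (exit)") ever skips the removal, a normal
// restart is misread as a crash loop and the task bails out with error 80.
#include <cstdio>

static const char* MARKER = "restart.marker";   // hypothetical file name

static bool previous_run_crashed() {
    if (FILE* f = std::fopen(MARKER, "r")) {    // marker left behind?
        std::fclose(f);
        return true;                            // treated as evidence of a crash
    }
    return false;
}

int main() {
    if (previous_run_crashed()) {
        std::fprintf(stderr, "The file exists. (0x50) - exit code 80 (0x50)\n");
        return 80;                              // the false positive seen in this thread
    }
    if (FILE* m = std::fopen(MARKER, "w")) std::fclose(m);  // mark this run as active
    // ... run the simulation, checkpointing periodically ...
    std::remove(MARKER);                        // the step a clean exit must not skip
    return 0;
}
```
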
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0

When do you plan on deploying the 8.20 app to the Long queue? People are still getting the "File already exists" error and losing tons of work, daily. If you were still testing it, why was it not confined to the Beta queue? Since it's already on Short, I think it should already be on Long too. Sick of losing work because of this...

Joined: 5 Mar 13 · Posts: 348 · Credit: 0 · RAC: 0

Bugs get through beta-queue testing from time to time, so it's obviously better if we only lose work on the short queue rather than on both queues. But I guess at this point 820 looks stable enough, so I will suggest to Matt that he push it to long.

MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0

Jacob, 820 for cuda6 is on long now.

Matt

Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 318

Jacob,

Have you been able to find a way of preventing the server from allocating cuda55 or cuda42 work to Maxwell (CC 5.0) cards yet? It doesn't waste any actual computing time, but the downloads are a bit of a pain - and having several hours of expected crunching suddenly disappear rather confuses BOINC's scheduler. :-D

MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0

No idea, although I haven't looked deeply into it yet.

Matt

Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 318

It should be possible, by setting a maximum compute_capability for the two unwanted plan_classes.

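A minimal sketch of that idea, assuming the scheduler can see the card's compute capability when it evaluates a plan class; the struct and helper names here are illustrative stand-ins, not BOINC's actual app_plan() interface:

```cpp
// Illustrative plan-class filter: cap cuda42/cuda55 below compute capability 5.0
// so Maxwell (CC 5.0) hosts are only offered the cuda60 application.
// GpuInfo and plan_class_allowed() are hypothetical names used for this sketch.
#include <cstring>

struct GpuInfo { int cc_major; int cc_minor; };

bool plan_class_allowed(const char* plan_class, const GpuInfo& gpu) {
    int cc = gpu.cc_major * 100 + gpu.cc_minor;        // e.g. CC 5.0 -> 500
    if (!std::strcmp(plan_class, "cuda42") ||
        !std::strcmp(plan_class, "cuda55")) {
        return cc < 500;                               // maximum: anything below Maxwell
    }
    return true;                                       // other plan classes: no cap here
}
```

With a rule like this in place, a CC 5.0 card such as a GTX 750 Ti would only ever be offered the cuda60 build, avoiding the wasted downloads described above.
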
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0

Jacob,

Finally!! I noticed that it was only deployed for the cuda6 plan classes; are there any plans to update the app for the other plan classes?

Also, please continue to make stability a priority. It is so very frustrating to lose progress. Some of the tasks that fail say they had only a couple of seconds of run time, when I believe they may actually have had several hours invested. Perhaps that masked the severity of the issue to you guys; I'm not sure. But I hope bug fixing becomes a high(er) priority.

Regards,
Jacob

Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0

Had to chime in again to say THANK YOU for fixing this. BOINC task stability is obviously very important to me, and this bug had been plaguing me for weeks. The new 8.20 app seems to be suspending/exiting/resuming much better for me thus far. Thank you!

Joined: 6 Feb 10 · Posts: 38 · Credit: 274,204,838 · RAC: 0

This has not been fixed. All of my work units are cuda55, and if the power goes out, the work units get lost.

Joined: 20 Nov 13 · Posts: 21 · Credit: 480,846,415 · RAC: 0

It looks like I've started getting some errors on my machine as well over the last few days. It's not running overly hot, so I'm not sure what's going on. This is the output from the last one:

Stderr output

Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0

> It looks like I've started getting some errors on my machine as well over the last few days. It's not running overly hot, not sure what's going on.

I have been seeing that too recently on one of my previously stable GTX 660s. But the other one, which I had previously underclocked from 993 MHz to 967 MHz, has been stable. So it appears that the work units have just gotten a little harder, and now I am underclocking both of them. I would suggest reducing your GPU clock to 1000 MHz or so. (It is not a heat issue; mine were around 66 C.)

petnek · Joined: 30 May 09 · Posts: 3 · Credit: 35,191,012 · RAC: 0

I have the same issue on two different GPUs with different drivers.

On a GTX 275:

```
<core_client_version>7.2.39</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -59 (0xffffffc5)
```

On a Quadro FX 3800:

```
<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
The file exists. (0x50) - exit code 80 (0x50)
```

On both I'm running short tasks. Please fix this!

Joined: 26 Jun 09 · Posts: 815 · Credit: 1,470,385,294 · RAC: 0

Perhaps a little help. Yesterday I needed to reboot all my systems for the necessary Windows updates, after they had been running for 26 days. The first thing I do is set "no new tasks" so the queue can empty. Eventually I needed to go to bed but WUs were still running, so I suspended all work in BOINC Manager and then did a cold boot (installed the updates and powered off the systems). After starting the PCs I went to BOINC Manager again and resumed work. Everything worked fine, without errors.

I know this is not the option Jacob, the original poster, wants, but at least in my case it did not result in lost work.

Edit: I should mention that I am still using the 331.82 graphics driver.

Greetings from TJ

MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0

Thank you too, for your help in diagnosing it. On to the next problem!

Matt

Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0

I thought this problem was fixed -- why are we still receiving 8.15 tasks? I just had 2 more fail, losing several hours of work, presumably because they were 8.15 instead of 8.20. Upsetting.
