Problem - Tasks error when exiting/resuming using 334.67 drivers

Jacob Klein
Message 35927 - Posted: 27 Mar 2014, 11:47:41 UTC
Last modified: 27 Mar 2014, 11:49:41 UTC

MJH: Please please please help.

I just threw away another several hours of GPUGrid work, because I had to restart BOINC, and the 2 GPUGrid tasks died. :( This time, I didn't suspend the tasks, I just exited BOINC normally. Then, upon restart, both tasks died. Surely this is fixable?!?!?

Name	1211-GIANNI_ntl-1-4-RND3734_0
Workunit	5485267
Created	26 Mar 2014 | 21:32:15 UTC
Sent	27 Mar 2014 | 0:06:01 UTC
Received	27 Mar 2014 | 11:46:13 UTC
Server state	Over
Outcome	Computation error
Client state	Compute error
Exit status	80 (0x50) Unknown error number
Computer ID	153764
Report deadline	1 Apr 2014 | 0:06:01 UTC
Run time	0.00
CPU time	0.00
Validate state	Invalid
Credit	0.00
Application version	Long runs (8-12 hours on fastest card) v8.15 (cuda42)
Stderr output

<core_client_version>7.3.11</core_client_version>
<![CDATA[
<message>
The file exists.
 (0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203M] VERSION [42]
# SWAN Device 1	:
#	Name		: GeForce GTX 460
#	ECC		: Disabled
#	Global mem	: 1024MB
#	Capability	: 2.1
#	PCI ID		: 0000:08:00.0
#	Device clock	: 1526MHz
#	Memory clock	: 1900MHz
#	Memory width	: 256bit
#	Driver version	: r334_89 : 33523
# GPU 0 : 58C
# GPU 1 : 47C
# GPU 2 : 67C
# GPU 0 : 60C
# GPU 1 : 50C
# GPU 2 : 69C
# GPU 0 : 61C
# GPU 1 : 52C
# GPU 0 : 62C
# GPU 1 : 55C
# GPU 2 : 70C
# GPU 0 : 63C
# GPU 1 : 56C
# GPU 2 : 71C
# GPU 1 : 57C
# GPU 0 : 64C
# GPU 1 : 59C
# GPU 2 : 72C
# GPU 0 : 65C
# GPU 1 : 61C
# GPU 1 : 62C
# GPU 1 : 63C
# GPU 2 : 73C
# GPU 0 : 66C
# GPU 1 : 64C
# GPU 2 : 74C
# GPU 1 : 65C
# GPU 0 : 67C
# GPU 1 : 66C
# GPU 2 : 75C
# GPU 0 : 68C
# GPU 2 : 76C
# GPU 1 : 67C
# BOINC suspending at user request (exit)

</stderr_txt>
]]>


Name	1733-GIANNI_ntl-3-4-RND9094_0
Workunit	5485140
Created	26 Mar 2014 | 21:06:01 UTC
Sent	27 Mar 2014 | 6:35:37 UTC
Received	27 Mar 2014 | 11:46:13 UTC
Server state	Over
Outcome	Computation error
Client state	Compute error
Exit status	80 (0x50) Unknown error number
Computer ID	153764
Report deadline	1 Apr 2014 | 6:35:37 UTC
Run time	0.00
CPU time	0.00
Validate state	Invalid
Credit	0.00
Application version	Long runs (8-12 hours on fastest card) v8.15 (cuda55)
Stderr output

<core_client_version>7.3.11</core_client_version>
<![CDATA[
<message>
The file exists.
 (0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3203M] VERSION [55]
# SWAN Device 0	:
#	Name		: GeForce GTX 660 Ti
#	ECC		: Disabled
#	Global mem	: 3072MB
#	Capability	: 3.0
#	PCI ID		: 0000:09:00.0
#	Device clock	: 1124MHz
#	Memory clock	: 3004MHz
#	Memory width	: 192bit
#	Driver version	: r334_89 : 33523
# GPU 0 : 64C
# GPU 1 : 65C
# GPU 2 : 74C
# GPU 2 : 75C
# GPU 0 : 65C
# GPU 1 : 66C
# GPU 0 : 66C
# GPU 2 : 76C
# GPU 1 : 67C
# BOINC suspending at user request (exit)

</stderr_txt>
]]>
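For what it's worth, exit code 80 (0x50) is the standard Win32 error ERROR_FILE_EXISTS; the "The file exists." line in the <message> block is just Windows' own text for that code. A standalone snippet (nothing to do with the ACEMD app itself) confirms the mapping:

#include <windows.h>
#include <cstdio>

// Print Windows' message text for error code 80 (0x50), i.e. ERROR_FILE_EXISTS.
int main() {
    char buf[256] = {0};
    FormatMessageA(FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
                   nullptr, 80 /* ERROR_FILE_EXISTS */, 0, buf, sizeof(buf), nullptr);
    printf("Win32 error 80: %s", buf);  // prints "The file exists."
    return 0;
}

So whatever the app does on resume is apparently tripping over a file that is already present.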

MJH
Message 35928 - Posted: 27 Mar 2014, 13:09:29 UTC - in response to Message 35927.  

Jacob

Try the acemdshort app, 8.20. It should fix the problem.

Matt

Jacob Klein
Message 35934 - Posted: 27 Mar 2014, 18:12:43 UTC - in response to Message 35928.  
Last modified: 27 Mar 2014, 18:13:05 UTC

What was the problem, and what was the fix? When do you think it will land on the Long queue?

I will try to monitor application version numbers more closely, as I usually get a variety of Long/Short tasks.

MJH
Message 35941 - Posted: 27 Mar 2014, 19:56:23 UTC - in response to Message 35934.  

The problem, I think, is a false positive from the test that checks whether the WU has got stuck in a crash loop, introduced in 8.15.

I fixed that a while ago, but only rolled the fix out with 8.20.

Let's see...
Matt
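
To illustrate the failure mode - the ACEMD source is not public, so this is only a guess at the shape of that test, and every name in it is hypothetical. A marker-file crash-loop guard like this would fail in exactly this way when the marker is not cleaned up on a normal "suspend at user request" exit:

#include <windows.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical sketch of a crash-loop guard. Pattern: create a marker file
// with CREATE_NEW at startup, delete it on a clean finish. If a previous run
// left the marker behind, assume we are stuck in a crash loop and abort.
static const char* MARKER = "restart_guard.dat";  // hypothetical file name

void check_crash_loop() {
    HANDLE h = CreateFileA(MARKER, GENERIC_WRITE, 0, nullptr,
                           CREATE_NEW, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE && GetLastError() == ERROR_FILE_EXISTS) {
        // False positive: a normal exit-without-finishing also leaves the
        // marker, so a clean resume is misread as a crash loop.
        fprintf(stderr, "marker present - assuming crash loop\n");
        exit(ERROR_FILE_EXISTS);  // surfaces as "exit code 80 (0x50)"
    }
    CloseHandle(h);
}

void on_clean_finish() {
    DeleteFileA(MARKER);  // a fix would also need to delete it on suspend/exit
}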

Jacob Klein
Message 36010 - Posted: 30 Mar 2014, 22:42:14 UTC - in response to Message 35941.  

When do you plan on deploying the 8.20 app to the Long queue? People are still getting the "The file exists" error and losing tons of work, daily. If you were still testing it, why wasn't it contained to the Beta queue? Since it's already on Short, I think it should already be on Long too.

Sick of losing work because of this...

Stefan
Project administrator
Project developer
Project tester
Project scientist
Message 36016 - Posted: 31 Mar 2014, 9:08:49 UTC - in response to Message 36010.  

Bugs get through beta-queue testing from time to time, so it's obviously better if we lose the work only on the short queue rather than on both queues. But at this point 8.20 looks stable enough, so I will suggest to Matt that he push it to long.

MJH
Message 36017 - Posted: 31 Mar 2014, 10:37:37 UTC - in response to Message 36010.  

Jacob,

8.20 for cuda6 is on long now.

Matt

Richard Haselgrove
Message 36018 - Posted: 31 Mar 2014, 11:05:34 UTC - in response to Message 36017.  

Jacob,

8.20 for cuda6 is on long now.

Matt

Have you been able to find a way of preventing the server from allocating cuda55 or cuda42 to Maxwell (CC 5.0) cards yet?

It doesn't waste any actual computing time, but the downloads are a bit of a pain - and having several hours of expected crunching suddenly disappear rather confuses BOINC's scheduler. :-D

MJH
Message 36020 - Posted: 31 Mar 2014, 11:20:15 UTC - in response to Message 36018.  


Have you been able to find a way of preventing the server from allocating cuda55 or cuda42 to Maxwell (CC 5.0) cards yet?


No idea, although I haven't looked deeply into it yet.

Matt

Richard Haselgrove
Message 36021 - Posted: 31 Mar 2014, 11:31:40 UTC - in response to Message 36020.  


Have you been able to find a way of preventing the server from allocating cuda55 or cuda42 to Maxwell (CC 5.0) cards yet?


No idea, although I haven't looked deeply into it yet.

Matt

It should be possible, by setting a maximum compute_capability for the two unwanted plan_classes.
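
Roughly this shape of check, somewhere in the scheduler's plan-class logic (a self-contained sketch of the idea only, not the actual BOINC sched_customize.cpp code; the function and type names here are invented):

#include <cstring>

// Stand-in for the GPU's CUDA compute capability as the scheduler sees it.
struct GpuProps { int major; int minor; };

// Return false to refuse sending this plan class to this host's GPU.
bool plan_class_ok_for_gpu(const char* plan_class, const GpuProps& prop) {
    // Maxwell (compute capability 5.0+) needs the cuda60 build; the cuda42
    // and cuda55 apps fail there immediately, wasting the download.
    if (!strcmp(plan_class, "cuda42") || !strcmp(plan_class, "cuda55")) {
        if (prop.major >= 5) return false;  // i.e. a max compute capability of 4.x
    }
    return true;
}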

Jacob Klein
Message 36022 - Posted: 31 Mar 2014, 12:09:55 UTC - in response to Message 36017.  
Last modified: 31 Mar 2014, 12:11:22 UTC

Jacob,

8.20 for cuda6 is on long now.

Matt



Finally!!

I noticed that it was only deployed for the cuda6 plan classes; are there any plans to update the app for the other plan classes?

Also, please continue to make stability a priority. It is so very frustrating to lose progress. Some of the tasks that fail say they only had a couple of seconds of run-time, when I believe they actually had several hours invested. Perhaps that masked the severity of the issue for you; I'm not sure. But I hope bug-fixing becomes a high(er) priority.

Regards,
Jacob

Jacob Klein
Message 36083 - Posted: 4 Apr 2014, 2:33:01 UTC

Had to chime in again to say THANK YOU for fixing this. BOINC Task Stability is obviously very important to me, and this bug had been plaguing me for weeks. The new 8.20 app seems to be suspending/exiting/resuming much better for me thus far.

Thank you!

Wdethomas
Message 36143 - Posted: 7 Apr 2014, 19:03:44 UTC

This has not been fixed. All my WUs are cuda55, and if the power goes out, the work units get lost.

Variable
Message 36145 - Posted: 7 Apr 2014, 19:43:02 UTC

It looks like I've started getting some errors on my machine as well over the last few days. It's not running overly hot; I'm not sure what's going on. This is the output from the last one:

Stderr output
<core_client_version>7.2.33</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 760] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 760
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3404MHz
# Memory width : 256bit
# Driver version : r334_00 : 33489
# GPU 0 : 44C
# GPU 0 : 45C
# GPU 0 : 47C
# GPU 0 : 48C
# GPU 0 : 49C
# GPU 0 : 50C
# GPU 0 : 51C
# GPU 0 : 52C
# GPU 0 : 53C
# GPU 0 : 54C
# GPU 0 : 55C
# GPU 0 : 56C
# GPU 0 : 57C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 76000)
# GPU [GeForce GTX 760] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 760
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3404MHz
# Memory width : 256bit
# Driver version : r334_00 : 33489
# GPU 0 : 56C
# GPU 0 : 57C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 174000)
# GPU [GeForce GTX 760] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 760
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3404MHz
# Memory width : 256bit
# Driver version : r334_00 : 33489
# GPU 0 : 56C
# GPU 0 : 57C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 175000)
# GPU [GeForce GTX 760] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 760
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3404MHz
# Memory width : 256bit
# Driver version : r334_00 : 33489
# The simulation has become unstable. Terminating to avoid lock-up (1)

</stderr_txt>
]]>

Jim1348
Message 36146 - Posted: 7 Apr 2014, 21:46:59 UTC - in response to Message 36145.  
Last modified: 7 Apr 2014, 21:50:15 UTC

It looks like I've started getting some errors on my machine as well over the last few days. It's not running overly hot, not sure what's going on.

I have been seeing that too recently on one of my previously stable GTX 660s, but the other one, which I had underclocked from 993 MHz to 967 MHz, has been stable. So it appears the work units have just gotten a little more demanding, and now I am underclocking both cards. I would suggest reducing your GPU clock to 1000 MHz or so. (It is not a heat issue; mine were around 66 C.)

petnek
Message 36219 - Posted: 11 Apr 2014, 4:54:24 UTC

I have the same issue on two different GPUs with different drivers.

On GTX 275:
<core_client_version>7.2.39</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -59 (0xffffffc5)


On Quadro FX 3800:
<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
The file exists. (0x50) - exit code 80 (0x50)


On both I am running short tasks.

Please fix this failure!

TJ
Message 36220 - Posted: 11 Apr 2014, 8:23:12 UTC
Last modified: 11 Apr 2014, 8:24:05 UTC

Perhaps this helps a little. Yesterday I needed to reboot all my systems for the necessary Windows updates, after they had been running for 26 days.
The first thing I do is set them to accept no new work so the queue can empty. Eventually I needed to go to bed with WUs still running, so I suspended all work in BOINC Manager and then did a cold boot (installed the updates and powered the systems off). After starting the PCs I went back to BOINC Manager and resumed work. Everything ran fine without errors.

I know this is not the option Jacob, the original poster, wants, but at least in my case it did not result in lost work.

Edit: I should mention I am still using the 331.82 graphics driver.
Greetings from TJ

MJH
Message 36223 - Posted: 11 Apr 2014, 10:11:57 UTC - in response to Message 36083.  


Thank you!


Thank you too, for your help in diagnosing it.
On to the next problem!

Matt

Jacob Klein
Message 36252 - Posted: 12 Apr 2014, 14:38:14 UTC - in response to Message 36223.  


Thank you!


Thank you too, for your help in diagnosing it.
On to the next problem!

Matt


I thought this problem was fixed -- why are we still receiving 8.15 tasks? I just had 2 more fail, losing several hours of work, presumably because they were 8.15 instead of 8.20. Upsetting.