Problem - Tasks error when exiting/resuming using 334.67 drivers

Jacob Klein

Message 34967 - Posted: 10 Feb 2014, 13:41:49 UTC
Last modified: 10 Feb 2014, 13:43:06 UTC

MJH / Admins:

I'm getting several task errors (Windows 8.1 x64, 334.67 drivers) that say:

<core_client_version>7.2.39</core_client_version>
<![CDATA[
<message>
The file exists.
(0x50) - exit code 80 (0x50)
</message>


and the last line in the stderr.txt file is:
# BOINC suspending at user request (exit)


I think that suspending/resuming tasks isn't working very well. Tasks are erroring out when they are resumed.
http://www.gpugrid.net/result.php?resultid=7747671
http://www.gpugrid.net/result.php?resultid=7749480
http://www.gpugrid.net/result.php?resultid=7750550
http://www.gpugrid.net/result.php?resultid=7751319

Can you please look into this? I'm not sure if it's the application, or if it's the new BETA drivers, or if it's an issue that has always been there. But I would like it fixed!

Hoping you agree, and available to help,
Jacob

PS: I originally posted this in the 8.15 app thread, but decided to create a new thread here. Also, I'm not the only one having this problem.

Jacob Klein

Message 35019 - Posted: 13 Feb 2014, 16:21:53 UTC - in response to Message 34967.  
Last modified: 13 Feb 2014, 16:22:31 UTC

MJH:

Have you noticed an increase in instability, with 334.67 drivers, when suspending/resuming tasks, or shutting down and restarting BOINC quickly? If so, is there any way to determine if the application is the problem, or if the driver is the problem?

Killersocke

Message 35105 - Posted: 17 Feb 2014, 21:17:56 UTC
Last modified: 17 Feb 2014, 21:18:20 UTC

I can confirm the same problems here with the 332.21 driver.

Name 589x-SANTI_MAR422cap310-12-32-RND9315_0
Workunit 5177762

Name 369x-SANTI_MAR422cap310-8-32-RND5608_0
Workunit 5175511

Simulation unstable. Flag 9 value 375
# The simulation has become unstable. Terminating to avoid lock-up

Jacob Klein

Message 35242 - Posted: 23 Feb 2014, 2:13:36 UTC

This happened again: suspending the task, then closing BOINC, resulted in the task erroring out:
http://www.gpugrid.net/result.php?resultid=7810949

Can an admin please help to resolve this issue, or will it go unanswered?
I'm willing to offer whatever it takes to help test to get it resolved.

MJH?

Mumak

Message 35314 - Posted: 24 Feb 2014, 9:26:34 UTC

Same issue here too...

lukeu

Message 35315 - Posted: 24 Feb 2014, 9:34:09 UTC

Snap! GTX660, Win7-64, Driver 311.06

Jacob Klein

Message 35336 - Posted: 25 Feb 2014, 13:04:40 UTC

Anyone at GPUGrid care to fix this, like we did the previous suspend/resume problems? I'm willing to help test.

Richard Haselgrove

Message 35347 - Posted: 25 Feb 2014, 19:22:02 UTC - in response to Message 35336.  

Have we any more complete idea of the cause yet? I've recently upgraded to the WHQL version of the driver (334.89) for my GTX 670: no crashes yet, but then I don't routinely suspend tasks once they've started. What I have noticed is the reduced CPU demand, and a welcome reduction in the runtime of the SIMAP tasks running at the same time.

I note that stderr says

The file exists.
 (0x50) - exit code 80 (0x50)

but MJH's FAQ says

* -80 Failed to recover after an access violation (Win32)

Any signs of an access violation from Windows, Jacob?
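One hedged way to read those two numbers side by side: 0x50 is just 80 in hex, and 80 is the Windows system error code ERROR_FILE_EXISTS, whose message text is "The file exists." The FAQ entry quoted above is for -80, which is a different value. The sketch below only illustrates that distinction; the function name and the two small lookup tables are made up for this example and are not how BOINC or the application reports errors.

```python
# Illustrative lookup only (hypothetical helper, not part of BOINC or the app).
# 80 is the Windows system error code ERROR_FILE_EXISTS.
WINDOWS_SYSTEM_ERRORS = {80: "ERROR_FILE_EXISTS (The file exists.)"}
# Entry copied from the FAQ line quoted in this post.
APP_FAQ_CODES = {-80: "Failed to recover after an access violation (Win32)"}


def describe_exit_code(code: int) -> str:
    """Return a readable guess for an exit code seen in a task result."""
    if code in APP_FAQ_CODES:
        # Show negative codes as unsigned 32-bit hex as well.
        return f"{code} (0x{code & 0xFFFFFFFF:08x}): app FAQ: {APP_FAQ_CODES[code]}"
    if code in WINDOWS_SYSTEM_ERRORS:
        return f"{code} (0x{code:x}): Windows: {WINDOWS_SYSTEM_ERRORS[code]}"
    return f"{code}: unknown"


print(describe_exit_code(0x50))   # the value shown in stderr
print(describe_exit_code(-80))    # the value listed in the FAQ
```

On a Windows machine, `ctypes.FormatError(80)` should return the same message text, which would suggest the "(0x50)" in the result is a plain Windows error code rather than the FAQ's -80.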

I'd be interested if the problem could be narrowed down to a more immediate cause. Candidates are

Windows (I see Jacob using v8.1 - I have 7 here)
Driver
BOINC client (I see Jacob using alpha client v7.3.2)
BOINC API (linked into application)
Application

and of course any combination of the above, plus probably more besides. My instinctive reaction on seeing the thread title was 'API', but I'm not so sure having looked at the full error messages.

Jacob Klein

Message 35348 - Posted: 25 Feb 2014, 21:02:54 UTC - in response to Message 35347.  
Last modified: 25 Feb 2014, 21:05:37 UTC

I was able to get a task to fail by:

- Run BOINC such that the GPU Task is processing
- Right-click tray, choose "Snooze GPU"
- Verify task now says "GPU suspended"
- Right-click tray, "Exit", with "Stop running tasks" checked, click OK
- Start BOINC

They don't fail all the time, but... if you try those exact steps over and over, eventually you might get a failure.
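For anyone who wants to repeat those steps without clicking through the tray menu each time, here is a rough sketch using `boinccmd`. It is an approximation, not the exact tray behaviour: it assumes `boinccmd` is on the PATH and can reach the running client, that `--set_gpu_mode never <seconds>` is close enough to "Snooze GPU", and that `--quit` is close enough to "Exit" with "Stop running tasks" checked. Starting BOINC again afterwards is left as a manual step, and the function names are invented for this example.

```python
import subprocess


def repro_commands(snooze_seconds=300):
    """Build boinccmd calls approximating 'Snooze GPU' followed by 'Exit'."""
    return [
        # Suspend GPU work for a while (approximates "Snooze GPU").
        ["boinccmd", "--set_gpu_mode", "never", str(snooze_seconds)],
        # List tasks so the suspended state can be checked by eye.
        ["boinccmd", "--get_tasks"],
        # Ask the client to shut down (approximates tray "Exit").
        ["boinccmd", "--quit"],
    ]


def run_once(snooze_seconds=300):
    """Run one snooze/exit cycle; restart BOINC by hand afterwards."""
    for cmd in repro_commands(snooze_seconds):
        subprocess.run(cmd, check=True)
```

Calling `run_once()`, restarting BOINC, and checking whether the task resumes or errors would be one cycle. Since the failure is intermittent, several cycles may be needed before anything shows up.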

I'd like this thread to focus on failures that are a result of those steps above. I hope we can solve it, but we'll need help from MJH.

Dagorath

Message 35352 - Posted: 25 Feb 2014, 23:04:03 UTC - in response to Message 35348.  

They don't fail all the time, but... if you try those exact steps over and over, eventually you might get a failure.


I have caused GPUgrid tasks to fail on restart by stopping and restarting BOINC quickly 3 or 4 times in a row on Linux, but that was last year, not with the current app and drivers. If I think of it I'll try to replicate it on a newly started task, but I'm not going to try it on a task I've put an hour into.

If a single stop BOINC and restart cycle is causing crashes then that's worth fixing but if it happens only after several stop and restart cycles in quick succession then I wonder if it's worth fixing as that is not a likely operating scenario.

BOINC <<--- credit whores, pedants, alien hunters

Jacob Klein

Message 35353 - Posted: 25 Feb 2014, 23:05:37 UTC - in response to Message 35352.  

I run applications that I have set up as "exclusive applications" in BOINC. And sometimes I shut down BOINC.

These, even in combination, should be supported by the projects.
And I hope to have this issue resolved eventually. :)

MJH

Message 35354 - Posted: 25 Feb 2014, 23:29:42 UTC

I've scheduled some time to sort this out in a week or so, when I'll also be putting out Maxwell support.

Matt

Jacob Klein

Message 35356 - Posted: 26 Feb 2014, 1:34:09 UTC

Thank you.

I'm not sure how exactly to help, but I'm definitely willing. Last time, we iterated app versions with debug text to solve it, right? We might have to do something similar here.

Beyond

Message 35457 - Posted: 2 Mar 2014, 17:05:00 UTC

I don't think this is a driver issue. I'd been error free for a long time, but in the last 5 days I have been seeing errors in SANTI_MAR WUs only. Some of them occur whenever BOINC is exited (gracefully, by exit dialog) for any reason. No other WU types are affected. At first I thought the exit error was only on 1GB cards, but now I see from other users that it's happening on 660 Ti cards also. The SANTI_MAR WUs also seem to be particularly sensitive to other conditions and are failing at too high a rate IMO.

skgiven
Volunteer moderator
Volunteer tester

Message 35478 - Posted: 3 Mar 2014, 8:26:52 UTC - in response to Message 35457.  
Last modified: 3 Mar 2014, 8:29:26 UTC

I've had 10 SANTI_MAR failures on the same Linux system in the past 3 weeks:
http://www.gpugrid.net/results.php?hostid=159186&offset=0&show_names=1&state=5&appid=

Other than that, there has only been the one SANTI_bax2 failure and 2 WUs I aborted since Nov.

They are all,
    Exit status 255 (0xff) Unknown error number



TJ

Message 35497 - Posted: 4 Mar 2014, 12:15:47 UTC

Almost every other day I get an error on a Santi WU on my 660. On the 770 and 780Ti, no errors (yet). I agree with Beyond (nice new picture of dog) that it is not the drivers. Santis seem to be "special".
Coincidentally, I found a cruncher's task list with a Titan where all the Santis failed on that system, but the recent Noelia WUs finished okay.
Greetings from TJ

Jacob Klein

Message 35566 - Posted: 8 Mar 2014, 23:57:01 UTC - in response to Message 35354.  
Last modified: 9 Mar 2014, 0:00:51 UTC

Matt:

It has been a while -- Have you made any progress on this?

I'm still regularly failing tasks during suspend and resume operations, especially SANTI_MAR tasks. It's especially painful to see 2 tasks fail simultaneously, which happens to me because I have 2 GPUs dedicated to GPUGrid computing. When the tasks fail, 10-20 hours of work are instantly lost to "Computation Error". Frustrating.

We need a fix! Please help!



MJH wrote (25 Feb 2014, 23:29:42 UTC):
I've scheduled some time to sort this out in a week or so, when I'll also be putting out Maxwell support.

Matt

Jacob Klein

Message 35583 - Posted: 10 Mar 2014, 15:08:22 UTC
Last modified: 10 Mar 2014, 15:13:12 UTC

I just had another one fail. I had 19 hours invested into it, and needed to restart my machine. I had suspended the task, I had closed BOINC, I restarted the machine, I resumed the task, and poof, Computation Error.

19 hours, wasted. This is very very frustrating.

Stderr output

<core_client_version>7.3.10</core_client_version>
<![CDATA[
<message>
The file exists.
 (0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203M] VERSION [42]
# SWAN Device 1	:
#	Name		: GeForce GTX 460
#	ECC		: Disabled
#	Global mem	: 1024MB
#	Capability	: 2.1
#	PCI ID		: 0000:08:00.0
#	Device clock	: 1526MHz
#	Memory clock	: 1900MHz
#	Memory width	: 256bit
#	Driver version	: r334_00 : 33489
# GPU 0 : 67C
# GPU 1 : 66C
# GPU 2 : 78C
# GPU 1 : 67C
# GPU 1 : 68C
# GPU 0 : 68C
# GPU 1 : 69C
# GPU 1 : 70C
# GPU 0 : 69C
# GPU 2 : 79C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203M] VERSION [42]
# SWAN Device 1	:
#	Name		: GeForce GTX 460
#	ECC		: Disabled
#	Global mem	: 1024MB
#	Capability	: 2.1
#	PCI ID		: 0000:08:00.0
#	Device clock	: 1526MHz
#	Memory clock	: 1900MHz
#	Memory width	: 256bit
#	Driver version	: r334_00 : 33489
# GPU 0 : 66C
# GPU 1 : 65C
# GPU 2 : 73C
# GPU 1 : 66C
# GPU 2 : 75C
# GPU 0 : 67C
# BOINC suspending at user request (exit)

</stderr_txt>
]]>
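When comparing several of these stderr dumps, it can help to boil each one down to a few numbers. The sketch below is one possible way to do that, assuming the line formats seen in the output above ("# GPU n : nnC" for temperatures and the "BOINC suspending at user request (exit)" marker). The function name and the short sample are made up for illustration.

```python
import re
from collections import defaultdict

# Line formats assumed from the stderr output quoted in this post.
TEMP_RE = re.compile(r"^# GPU (\d+) : (\d+)C$")
SUSPEND_MARK = "# BOINC suspending at user request (exit)"


def summarize_stderr(text):
    """Return (max temperature per GPU index, count of exit-suspend markers)."""
    max_temp = defaultdict(int)
    suspends = 0
    for raw in text.splitlines():
        line = raw.strip()
        match = TEMP_RE.match(line)
        if match:
            gpu, temp = int(match.group(1)), int(match.group(2))
            max_temp[gpu] = max(max_temp[gpu], temp)
        elif line == SUSPEND_MARK:
            suspends += 1
    return dict(max_temp), suspends


# Shortened sample in the same format as the log above.
sample = """# GPU 0 : 67C
# GPU 2 : 78C
# GPU 2 : 79C
# BOINC suspending at user request (exit)
# GPU 1 : 66C
# BOINC suspending at user request (exit)
"""
print(summarize_stderr(sample))
```

In the failed task above, the suspend marker is the last line written, with nothing logged after it, so a count like this only confirms how many exits the task went through; it does not show what went wrong on resume.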

Jacob Klein

Message 35709 - Posted: 17 Mar 2014, 15:26:50 UTC

And another one today.

The file exists.
(0x50) - exit code 80 (0x50)

MJH?

Beyond

Message 35792 - Posted: 21 Mar 2014, 20:59:19 UTC - in response to Message 35583.  

I just had another one fail. I had 19 hours invested into it, and needed to restart my machine. I had suspended the task, I had closed BOINC, I restarted the machine, I resumed the task, and poof, Computation Error.

19 hours, wasted. This is very very frustrating.

This same thing happens here on every SANTI_MAR WU when I have to exit BOINC and reboot for an update or whatever. 100% chance of error. Frustrating is the word.

©2025 Universitat Pompeu Fabra