Message boards : Number crunching : Problem - Tasks error when exiting/resuming using 334.67 drivers

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 34967 - Posted: 10 Feb 2014 | 13:41:49 UTC
Last modified: 10 Feb 2014 | 13:43:06 UTC

MJH / Admins:

I'm getting several task errors (Windows 8.1 x64, 334.67 drivers) that say:

<core_client_version>7.2.39</core_client_version>
<![CDATA[
<message>
The file exists.
(0x50) - exit code 80 (0x50)
</message>


and the last line in the stderr.txt file is:
# BOINC suspending at user request (exit)


I think that suspending/resuming tasks isn't working very well: tasks are erroring out when they are resumed.
http://www.gpugrid.net/result.php?resultid=7747671
http://www.gpugrid.net/result.php?resultid=7749480
http://www.gpugrid.net/result.php?resultid=7750550
http://www.gpugrid.net/result.php?resultid=7751319

Can you please look into this? I'm not sure if it's the application, or if it's the new BETA drivers, or if it's an issue that has always been there. But I would like it fixed!

Hoping you agree, and available to help,
Jacob

PS: I originally posted this in the 8.15 app thread, but decided to create a new thread here. Also, I'm not the only one having this problem.

Jacob Klein
Message 35019 - Posted: 13 Feb 2014 | 16:21:53 UTC - in response to Message 34967.
Last modified: 13 Feb 2014 | 16:22:31 UTC

MJH:

Have you noticed an increase in instability, with 334.67 drivers, when suspending/resuming tasks, or shutting down and restarting BOINC quickly? If so, is there any way to determine if the application is the problem, or if the driver is the problem?

Killersocke
Joined: 18 Oct 13
Posts: 53
Credit: 406,647,419
RAC: 0
Message 35105 - Posted: 17 Feb 2014 | 21:17:56 UTC
Last modified: 17 Feb 2014 | 21:18:20 UTC

I can confirm the same problems here with the 332.21 driver.

Name 589x-SANTI_MAR422cap310-12-32-RND9315_0
Workunit 5177762

Name 369x-SANTI_MAR422cap310-8-32-RND5608_0
Workunit 5175511

Simulation unstable. Flag 9 value 375
# The simulation has become unstable. Terminating to avoid lock-up

Jacob Klein
Message 35242 - Posted: 23 Feb 2014 | 2:13:36 UTC

This happened again: suspending the task, then closing BOINC, resulted in the task erroring:
http://www.gpugrid.net/result.php?resultid=7810949

Can an admin please help to resolve this issue, or will it go unanswered?
I'm willing to offer whatever it takes to help test to get it resolved.

MJH?

Mumak
Joined: 7 Dec 12
Posts: 92
Credit: 225,897,225
RAC: 0
Message 35314 - Posted: 24 Feb 2014 | 9:26:34 UTC

Same issue here too...

lukeu
Joined: 14 Oct 11
Posts: 31
Credit: 75,720,504
RAC: 0
Message 35315 - Posted: 24 Feb 2014 | 9:34:09 UTC

Snap! GTX660, Win7-64, Driver 311.06

Jacob Klein
Message 35336 - Posted: 25 Feb 2014 | 13:04:40 UTC

Anyone at GPUGrid care to fix this, like we did the previous suspend/resume problems? I'm willing to help test.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,261,851
RAC: 8,782,442
Message 35347 - Posted: 25 Feb 2014 | 19:22:02 UTC - in response to Message 35336.

Have we any more complete idea of the cause yet? I've recently upgraded to the WHQL version of the driver (334.89) for my GTX 670: no crashes yet, but then I don't routinely suspend tasks once they've started. What I have noticed is the reduced CPU demand, and a welcome reduction in the runtime of the SIMAP tasks running at the same time.

I note that stderr says

The file exists.
(0x50) - exit code 80 (0x50)

but MJH's FAQ says

* -80 Failed to recover after an access violation (Win32)

Any signs of an access violation from Windows, Jacob?
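
As a side note on the two numbers: a minimal sketch of how a signed exit code maps to the 32-bit hex form shown in the error message. The helper name `as_hex32` is made up for illustration. On Windows, system error code 80 is `ERROR_FILE_EXISTS`, whose message text is "The file exists."; that the client is simply translating +80 as a system error code is an assumption, not something confirmed here.

```python
def as_hex32(code: int) -> str:
    """Format a signed exit code as 32-bit two's-complement hex."""
    return hex(code & 0xFFFFFFFF)


# +80 is what stderr reports; -80 is what the FAQ entry describes.
print(as_hex32(80))   # 0x50
print(as_hex32(-80))  # 0xffffffb0
```

So the reported `(0x50)` is +80, not the FAQ's -80, which would have shown up as `0xffffffb0`.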

I'd be interested if the problem could be narrowed down to a more immediate cause. Candidates are

Windows (I see Jacob using v8.1 - I have 7 here)
Driver
BOINC client (I see Jacob using alpha client v7.3.2)
BOINC API (linked into application)
Application

and of course any combination of the above, plus probably more besides. My instinctive reaction on seeing the thread title was 'API', but I'm not so sure having looked at the full error messages.

Jacob Klein
Message 35348 - Posted: 25 Feb 2014 | 21:02:54 UTC - in response to Message 35347.
Last modified: 25 Feb 2014 | 21:05:37 UTC

I was able to get a task to fail by:

- Run BOINC such that the GPU Task is processing
- Right-click tray, choose "Snooze GPU"
- Verify task now says "GPU suspended"
- Right-click tray, "Exit", with "Stop running tasks" checked, click OK
- Start BOINC

They don't fail all the time, but... if you try those exact steps over and over, eventually you might get a failure.
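
For anyone who wants to repeat those steps without clicking through the tray menu, here is a rough sketch using `boinccmd`. It assumes `boinccmd` is on the PATH and that `--set_gpu_mode never <seconds>` and `--quit` behave close enough to "Snooze GPU" and "Exit" in the Manager; that equivalence is an assumption and has not been verified against this bug. The function names are made up.

```python
import subprocess

BOINCCMD = "boinccmd"  # assumption: boinccmd is on PATH


def repro_commands(snooze_seconds: int = 3600) -> list[list[str]]:
    """Build the command lines that approximate 'Snooze GPU' then 'Exit'."""
    return [
        # Temporarily disable GPU computing, similar to "Snooze GPU".
        [BOINCCMD, "--set_gpu_mode", "never", str(snooze_seconds)],
        # List tasks so the suspended state can be checked by eye.
        [BOINCCMD, "--get_tasks"],
        # Ask the client to shut down.
        [BOINCCMD, "--quit"],
    ]


def run_once() -> None:
    """Run one suspend/exit cycle; restarting BOINC is left to the user."""
    for argv in repro_commands():
        subprocess.run(argv, check=True)
```

Call `run_once()`, start BOINC again by hand, and check whether the task resumed or went to "Computation error".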

I'd like this thread to focus on failures that are a result of those steps above. I hope we can solve it, but we'll need help from MJH.

Dagorath
Joined: 16 Mar 11
Posts: 509
Credit: 179,005,236
RAC: 0
Message 35352 - Posted: 25 Feb 2014 | 23:04:03 UTC - in response to Message 35348.

They don't fail all the time, but... if you try those exact steps over and over, eventually you might get a failure.


I have caused GPUgrid tasks to fail on restart by stopping and restarting BOINC quickly 3 or 4 times in a row on Linux, but that was last year, not with the current app and drivers. If I think of it I'll try to replicate it on a newly started task, but I'm not going to try it on a task I've put an hour into.

If a single stop-and-restart cycle of BOINC is causing crashes, then that's worth fixing. But if it happens only after several stop-and-restart cycles in quick succession, then I wonder if it's worth fixing, as that is not a likely operating scenario.

____________
BOINC <<--- credit whores, pedants, alien hunters

Jacob Klein
Message 35353 - Posted: 25 Feb 2014 | 23:05:37 UTC - in response to Message 35352.

I run applications that I have set up as "exclusive applications" in BOINC. And sometimes I shut down BOINC.

These, even in combination, should be supported by the projects.
And I hope to have this issue resolved eventually. :)

MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 35354 - Posted: 25 Feb 2014 | 23:29:42 UTC

I've scheduled some time to sort this out in a week or so, when I'll also be putting out Maxwell support.

Matt

Jacob Klein
Message 35356 - Posted: 26 Feb 2014 | 1:34:09 UTC

Thank you.

I'm not sure how exactly to help, but I'm definitely willing. Last time, we iterated app versions with debug text to solve it, right? We might have to do something similar here.

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 35457 - Posted: 2 Mar 2014 | 17:05:00 UTC

I don't think this is a driver issue. I'd been error free for a long time but in the last 5 days have been seeing errors in SANTI_MAR WUs only. Some of them occur whenever BOINC is exited (gracefully, by exit dialog) for any reason. No other WU types are affected. At first I thought the exit error was only on 1GB cards, but now I see from other users that it's happening on 660 Ti cards also. The SANTI_MAR WUs also seem to be particularly sensitive to other conditions and are failing at too high a rate IMO.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 35478 - Posted: 3 Mar 2014 | 8:26:52 UTC - in response to Message 35457.
Last modified: 3 Mar 2014 | 8:29:26 UTC

I've had 10 SANTI_MAR failures on the same Linux system in the past 3 weeks:
http://www.gpugrid.net/results.php?hostid=159186&offset=0&show_names=1&state=5&appid=

Other than that there has only been the one SANTI_bax2 failure and 2 WU's I aborted since Nov.

They are all,

    Exit status 255 (0xff) Unknown error number


____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 35497 - Posted: 4 Mar 2014 | 12:15:47 UTC

I get an error on a Santi WU almost every other day on my 660. On the 770 and 780Ti no errors (yet). I agree with Beyond (nice new picture of dog) that it is not the drivers. Santi's seem to be "special".
Coincidentally, I found a cruncher's task list for a system with a Titan: all the Santi's failed on that system, but the recent Noelia's finished okay.
____________
Greetings from TJ

Jacob Klein
Message 35566 - Posted: 8 Mar 2014 | 23:57:01 UTC - in response to Message 35354.
Last modified: 9 Mar 2014 | 0:00:51 UTC

Matt:

It has been a while -- Have you made any progress on this?

I'm still regularly failing tasks during suspend and resume operations, especially SANTI_MAR tasks. It's especially painful to see 2 tasks fail simultaneously, which happens to me, because I have 2 GPUs dedicated to GPUGrid computing. Then when the tasks fail, instantly 10-20 hours of work, dead, to "Computation Error". Frustrating.

We need a fix! Please help!



MJH wrote on 25 Feb 2014 | 23:29:42 UTC:

I've scheduled some time to sort this out in a week or so, when I'll also be putting out Maxwell support.

Matt

Jacob Klein
Message 35583 - Posted: 10 Mar 2014 | 15:08:22 UTC
Last modified: 10 Mar 2014 | 15:13:12 UTC

I just had another one fail. I had 19 hours invested into it, and needed to restart my machine. I had suspended the task, I had closed BOINC, I restarted the machine, I resumed the task, and poof, Computation Error.

19 hours, wasted. This is very very frustrating.

Stderr output

<core_client_version>7.3.10</core_client_version>
<![CDATA[
<message>
The file exists.
(0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203M] VERSION [42]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r334_00 : 33489
# GPU 0 : 67C
# GPU 1 : 66C
# GPU 2 : 78C
# GPU 1 : 67C
# GPU 1 : 68C
# GPU 0 : 68C
# GPU 1 : 69C
# GPU 1 : 70C
# GPU 0 : 69C
# GPU 2 : 79C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203M] VERSION [42]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r334_00 : 33489
# GPU 0 : 66C
# GPU 1 : 65C
# GPU 2 : 73C
# GPU 1 : 66C
# GPU 2 : 75C
# GPU 0 : 67C
# BOINC suspending at user request (exit)

</stderr_txt>
]]>
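
A small sketch for reading a stderr dump like the one above: it splits the text into application sessions (each one starts with a `# GPU [` header line) and shows the last message of the final session. The function name `split_sessions` is made up, and the sample is a shortened copy of the log above. The point is only that the log ends on a clean suspend message, with nothing written by the failed restart; what happens inside the app at that moment is not visible from this output.

```python
def split_sessions(stderr_text: str) -> list[list[str]]:
    """Split stderr.txt lines into app sessions, each starting at a '# GPU [' header."""
    sessions: list[list[str]] = []
    for line in stderr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # A header line (or the very first line) opens a new session.
        if line.startswith("# GPU [") or not sessions:
            sessions.append([])
        sessions[-1].append(line)
    return sessions


sample = """\
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203M] VERSION [42]
# GPU 0 : 67C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203M] VERSION [42]
# GPU 0 : 66C
# BOINC suspending at user request (exit)
"""

sessions = split_sessions(sample)
print(len(sessions))     # 2
print(sessions[-1][-1])  # # BOINC suspending at user request (exit)
```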

Jacob Klein
Message 35709 - Posted: 17 Mar 2014 | 15:26:50 UTC

And another one today.

The file exists.
(0x50) - exit code 80 (0x50)

MJH?

Beyond
Message 35792 - Posted: 21 Mar 2014 | 20:59:19 UTC - in response to Message 35583.

I just had another one fail. I had 19 hours invested into it, and needed to restart my machine. I had suspended the task, I had closed BOINC, I restarted the machine, I resumed the task, and poof, Computation Error.

19 hours, wasted. This is very very frustrating.

This same thing happens here on every SANTI_MAR WU when I have to exit BOINC and reboot for an update or whatever. 100% chance of error. Frustrating is the word.

Jacob Klein
Message 35927 - Posted: 27 Mar 2014 | 11:47:41 UTC
Last modified: 27 Mar 2014 | 11:49:41 UTC

MJH: Please please please help.

I just threw away another several hours of GPUGrid work, because I had to restart BOINC, and the 2 GPUGrid tasks died. :( This time, I didn't suspend the tasks, I just exited BOINC normally. Then, upon restart, both tasks died. Surely this is fixable?!?!?

Name 1211-GIANNI_ntl-1-4-RND3734_0
Workunit 5485267
Created 26 Mar 2014 | 21:32:15 UTC
Sent 27 Mar 2014 | 0:06:01 UTC
Received 27 Mar 2014 | 11:46:13 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 80 (0x50) Unknown error number
Computer ID 153764
Report deadline 1 Apr 2014 | 0:06:01 UTC
Run time 0.00
CPU time 0.00
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.15 (cuda42)
Stderr output

<core_client_version>7.3.11</core_client_version>
<![CDATA[
<message>
The file exists.
(0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203M] VERSION [42]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r334_89 : 33523
# GPU 0 : 58C
# GPU 1 : 47C
# GPU 2 : 67C
# GPU 0 : 60C
# GPU 1 : 50C
# GPU 2 : 69C
# GPU 0 : 61C
# GPU 1 : 52C
# GPU 0 : 62C
# GPU 1 : 55C
# GPU 2 : 70C
# GPU 0 : 63C
# GPU 1 : 56C
# GPU 2 : 71C
# GPU 1 : 57C
# GPU 0 : 64C
# GPU 1 : 59C
# GPU 2 : 72C
# GPU 0 : 65C
# GPU 1 : 61C
# GPU 1 : 62C
# GPU 1 : 63C
# GPU 2 : 73C
# GPU 0 : 66C
# GPU 1 : 64C
# GPU 2 : 74C
# GPU 1 : 65C
# GPU 0 : 67C
# GPU 1 : 66C
# GPU 2 : 75C
# GPU 0 : 68C
# GPU 2 : 76C
# GPU 1 : 67C
# BOINC suspending at user request (exit)

</stderr_txt>
]]>


Name 1733-GIANNI_ntl-3-4-RND9094_0
Workunit 5485140
Created 26 Mar 2014 | 21:06:01 UTC
Sent 27 Mar 2014 | 6:35:37 UTC
Received 27 Mar 2014 | 11:46:13 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 80 (0x50) Unknown error number
Computer ID 153764
Report deadline 1 Apr 2014 | 6:35:37 UTC
Run time 0.00
CPU time 0.00
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.15 (cuda55)
Stderr output

<core_client_version>7.3.11</core_client_version>
<![CDATA[
<message>
The file exists.
(0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3203M] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:09:00.0
# Device clock : 1124MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r334_89 : 33523
# GPU 0 : 64C
# GPU 1 : 65C
# GPU 2 : 74C
# GPU 2 : 75C
# GPU 0 : 65C
# GPU 1 : 66C
# GPU 0 : 66C
# GPU 2 : 76C
# GPU 1 : 67C
# BOINC suspending at user request (exit)

</stderr_txt>
]]>

MJH
Project administrator
Project developer
Project scientist
Message 35928 - Posted: 27 Mar 2014 | 13:09:29 UTC - in response to Message 35927.

Jacob

Try the acemdshort app 820. Should fix the problem.

Matt

Jacob Klein
Message 35934 - Posted: 27 Mar 2014 | 18:12:43 UTC - in response to Message 35928.
Last modified: 27 Mar 2014 | 18:13:05 UTC

What was the problem, and what was the fix? When do you think it will land on the Long queue?

I will try to monitor application version numbers more closely, as I usually get a variety of Long/Short tasks.

MJH
Project administrator
Project developer
Project scientist
Message 35941 - Posted: 27 Mar 2014 | 19:56:23 UTC - in response to Message 35934.

The problem, I think, is a false positive from the test that checks whether the WU has got stuck in a crash loop, which was introduced in 815.

I fixed that a while ago but only rolled it out with 820.

Let's see...
Matt
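
To illustrate how a crash-loop test can give a false positive, here is a purely hypothetical sketch. It is not the ACEMD code, and the actual check is not described in this thread. It assumes a restart counter kept in a marker file that is supposed to be cleared on a clean exit; the names `start`, `clean_exit`, `survives` and the threshold `MAX_RESTARTS` are all made up. If a clean "suspend at user request" exit does not clear the marker, a few ordinary exit/restart cycles look like a crash loop.

```python
import os
import tempfile

MAX_RESTARTS = 2  # hypothetical threshold


def start(marker: str) -> bool:
    """Return True if the task may run, False if it looks stuck in a crash loop."""
    count = 0
    if os.path.exists(marker):
        with open(marker) as f:
            count = int(f.read() or 0)
    if count >= MAX_RESTARTS:
        return False
    # Record this start; a clean exit is expected to remove the marker again.
    with open(marker, "w") as f:
        f.write(str(count + 1))
    return True


def clean_exit(marker: str, clear_marker: bool) -> None:
    """Simulate a clean exit that does or does not clear the restart marker."""
    if clear_marker and os.path.exists(marker):
        os.remove(marker)


def survives(cycles: int, clear_marker: bool) -> bool:
    """Run several start/clean-exit cycles and report whether every start was allowed."""
    with tempfile.TemporaryDirectory() as d:
        marker = os.path.join(d, "restart.count")
        for _ in range(cycles):
            if not start(marker):
                return False
            clean_exit(marker, clear_marker)
        return True


print(survives(5, clear_marker=True))   # True
print(survives(5, clear_marker=False))  # False
```

In this toy model the task is rejected on the third start even though it never crashed, which would match the "fails eventually after a few cycles" reports, but again only under the stated assumptions.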

Jacob Klein
Message 36010 - Posted: 30 Mar 2014 | 22:42:14 UTC - in response to Message 35941.

When do you plan on deploying the 8.20 app, to Long-queue? People are still getting the "File already exists" error, losing tons of work, daily. If you were still testing it, then why was it not contained to the Beta-queue? Since it's already on Short, I think it should already be on Long too.

Sick of losing work because of this . . .

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 36016 - Posted: 31 Mar 2014 | 9:08:49 UTC - in response to Message 36010.

Bugs get through beta-queue testing from time to time. So it's obviously better if we only lose the work on the short queue and not the work from both queues. But I guess at this point 820 looks stable enough, so I will suggest to Matt to push it to long.

MJH
Project administrator
Project developer
Project scientist
Message 36017 - Posted: 31 Mar 2014 | 10:37:37 UTC - in response to Message 36010.

Jacob,

820 for cuda6 is on long now.

Matt

Richard Haselgrove
Message 36018 - Posted: 31 Mar 2014 | 11:05:34 UTC - in response to Message 36017.

Jacob,

820 for cuda6 is on long now.

Matt

Have you been able to find a way of preventing the server from allocating cuda55 or cuda42 to Maxwell (CC 5.0) cards yet?

Doesn't waste any actual computing time, but the downloads are a bit of a pain - and having several hours of expected crunching suddenly disappear rather confuses BOINC's scheduler. :-D

MJH
Project administrator
Project developer
Project scientist
Message 36019 - Posted: 31 Mar 2014 | 11:20:11 UTC - in response to Message 36018.


Have you been able to find a way of preventing the server from allocating cuda55 or cuda42 to Maxwell (CC 5.0) cards yet?


No idea, although haven't looked deeply into it yet.

Matt

MJH
Project administrator
Project developer
Project scientist
Message 36020 - Posted: 31 Mar 2014 | 11:20:15 UTC - in response to Message 36018.


Have you been able to find a way of preventing the server from allocating cuda55 or cuda42 to Maxwell (CC 5.0) cards yet?


No idea, although haven't looked deeply into it yet.

Matt

Richard Haselgrove
Message 36021 - Posted: 31 Mar 2014 | 11:31:40 UTC - in response to Message 36020.


Have you been able to find a way of preventing the server from allocating cuda55 or cuda42 to Maxwell (CC 5.0) cards yet?


No idea, although haven't looked deeply into it yet.

Matt

It should be possible, by setting a maximum compute_capability for the two unwanted plan_classes.
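
A rough sketch of what such a cap amounts to. As far as I know, BOINC's plan class specification encodes compute capability as major*100 + minor and offers a `max_nvidia_compcap` setting, but treat both the encoding and the setting name as assumptions to check against the server documentation. The function names and the cap value 399 are made up for illustration.

```python
from typing import Optional


def compcap(major: int, minor: int) -> int:
    """Encode a compute capability such as 3.0 as 300 (assumed BOINC-style encoding)."""
    return major * 100 + minor


def plan_class_allowed(major: int, minor: int, max_compcap: Optional[int] = None) -> bool:
    """Return True if a card with this compute capability may get the plan class."""
    return max_compcap is None or compcap(major, minor) <= max_compcap


# With a cap just below 4.0 on cuda42/cuda55, Kepler still qualifies and Maxwell does not.
print(plan_class_allowed(3, 0, max_compcap=399))  # True  (GTX 660 Ti, CC 3.0)
print(plan_class_allowed(5, 0, max_compcap=399))  # False (Maxwell, CC 5.0)
```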

Jacob Klein
Message 36022 - Posted: 31 Mar 2014 | 12:09:55 UTC - in response to Message 36017.
Last modified: 31 Mar 2014 | 12:11:22 UTC

Jacob,

820 for cuda6 is on long now.

Matt



Finally!!

I noticed that it was only deployed for the cuda6 plan classes; are there any plans to update the app for the other plan classes?

Also, please continue to make stability a priority. It is so very frustrating to lose progress. Some of the tasks that fail say they only had a couple seconds of run-time, where I believe they may have actually had several hours invested. Perhaps that masked the severity of the issue to you guys, not sure. But I hope bug-fixing becomes a high(er) priority.

Regards,
Jacob

Jacob Klein
Message 36083 - Posted: 4 Apr 2014 | 2:33:01 UTC

Had to chime in again to say THANK YOU for fixing this. BOINC Task Stability is obviously very important to me, and this bug had been plaguing me for weeks. The new 8.20 app seems to be suspending/exiting/resuming much better for me thus far.

Thank you!

Wdethomas
Joined: 6 Feb 10
Posts: 38
Credit: 274,204,838
RAC: 0
Message 36143 - Posted: 7 Apr 2014 | 19:03:44 UTC

This has not been fixed. All my WUs are CUDA 55, and if the power goes out, the work units are lost.

Variable
Joined: 20 Nov 13
Posts: 21
Credit: 413,046,105
RAC: 734,706
Message 36145 - Posted: 7 Apr 2014 | 19:43:02 UTC

It looks like I've started getting some errors on my machine as well over the last few days. It's not running overly hot, not sure what's going on. This is the output from the last one:

Stderr output
<core_client_version>7.2.33</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 760] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 760
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3404MHz
# Memory width : 256bit
# Driver version : r334_00 : 33489
# GPU 0 : 44C
# GPU 0 : 45C
# GPU 0 : 47C
# GPU 0 : 48C
# GPU 0 : 49C
# GPU 0 : 50C
# GPU 0 : 51C
# GPU 0 : 52C
# GPU 0 : 53C
# GPU 0 : 54C
# GPU 0 : 55C
# GPU 0 : 56C
# GPU 0 : 57C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 76000)
# GPU [GeForce GTX 760] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 760
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3404MHz
# Memory width : 256bit
# Driver version : r334_00 : 33489
# GPU 0 : 56C
# GPU 0 : 57C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 174000)
# GPU [GeForce GTX 760] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 760
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3404MHz
# Memory width : 256bit
# Driver version : r334_00 : 33489
# GPU 0 : 56C
# GPU 0 : 57C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 175000)
# GPU [GeForce GTX 760] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 760
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3404MHz
# Memory width : 256bit
# Driver version : r334_00 : 33489
# The simulation has become unstable. Terminating to avoid lock-up (1)

</stderr_txt>
]]>
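
To summarise a log like this one, a short sketch that pulls out the restart steps and counts the instability messages. The function name `restart_steps` is made up, and the sample contains only the relevant lines copied from the output above.

```python
import re


def restart_steps(stderr_text: str) -> list[int]:
    """Return the step numbers from every 'Attempting restart (step N)' line."""
    return [int(s) for s in re.findall(r"Attempting restart \(step (\d+)\)", stderr_text)]


sample = """\
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 76000)
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 174000)
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 175000)
# The simulation has become unstable. Terminating to avoid lock-up (1)
"""

print(restart_steps(sample))                 # [76000, 174000, 175000]
print(sample.count("has become unstable"))   # 4
```

Three restarts were attempted, and the fourth instability ended the task.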

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 36146 - Posted: 7 Apr 2014 | 21:46:59 UTC - in response to Message 36145.
Last modified: 7 Apr 2014 | 21:50:15 UTC

It looks like I've started getting some errors on my machine as well over the last few days. It's not running overly hot, not sure what's going on.

I have been seeing that too recently on one of my previously stable GTX 660s. But the other one that I had previously underclocked from 993 MHz to 967 MHz has been stable. So it appears that the work units have just gotten a little harder, and now I am underclocking both of them. I would suggest reducing your GPU clock to 1000 MHz or so. (It is not a heat issue; mine were around 66 C).

petnek
Joined: 30 May 09
Posts: 3
Credit: 32,491,012
RAC: 0
Message 36219 - Posted: 11 Apr 2014 | 4:54:24 UTC

I have the same issue on two different GPUs with different drivers.

On GTX 275:

<core_client_version>7.2.39</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -59 (0xffffffc5)


On Quadro FX 3800:
<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
The file exists. (0x50) - exit code 80 (0x50)


On both I am running short tasks.

Please fix this failure!

TJ
Message 36220 - Posted: 11 Apr 2014 | 8:23:12 UTC
Last modified: 11 Apr 2014 | 8:24:05 UTC

Perhaps this helps a little. Yesterday I needed to reboot all my systems for the necessary Windows updates after running for 26 days.
The first thing I do is set BOINC to accept no new work so the queue can empty. Eventually I needed to go to bed, but WUs were still running. I suspended all work in BOINC Manager and then did a cold boot (install updates, then power off the system). After starting the PCs I went to BOINC Manager again and resumed work. Everything worked fine without errors.

I know this is not the option that Jacob, the original poster, wants, but at least in my case it did not result in loss of work.

Edit: I need to mention I am still using 331.82 graphics driver
____________
Greetings from TJ

MJH
Project administrator
Project developer
Project scientist
Message 36223 - Posted: 11 Apr 2014 | 10:11:57 UTC - in response to Message 36083.


Thank you!


Thank you too, for your help in diagnosing it.
On to the next problem!

Matt

Jacob Klein
Message 36252 - Posted: 12 Apr 2014 | 14:38:14 UTC - in response to Message 36223.


Thank you!


Thank you too, for your help in diagnosing it.
On to the next problem!

Matt


I thought this problem was fixed -- why are we still receiving 8.15 tasks? I just had 2 more fail, losing several hours of work, presumably because they were 8.15 instead of 8.20. Upsetting.

Variable
Message 36279 - Posted: 14 Apr 2014 | 13:48:03 UTC

I downclocked my card slightly (~50MHz), or more precisely reduced the overclock, and haven't gotten any more errors since. Not sure if that's causal or coincidental since I haven't bumped it back up yet to test.

Jacob Klein
Message 36280 - Posted: 14 Apr 2014 | 14:05:15 UTC - in response to Message 36279.
Last modified: 14 Apr 2014 | 14:05:49 UTC

Variable: Your issue(s) are different than the one posted in this thread (see post 1). If you continue to have problems, please create a new thread.

Thanks,
Jacob

Jacob Klein
Message 36427 - Posted: 19 Apr 2014 | 11:16:21 UTC - in response to Message 36252.
Last modified: 19 Apr 2014 | 11:16:40 UTC

And... another 8.15 task crashed just now, losing tons of work. Why are we still using 8.15?!?


Thank you!


Thank you too, for your help in diagnosing it.
On to the next problem!

Matt


I thought this problem was fixed -- why are we still receiving 8.15 tasks? I just had 2 more fail, losing several hours of work, presumably because they were 8.15 instead of 8.20. Upsetting.

Wdethomas
Message 36439 - Posted: 19 Apr 2014 | 16:23:50 UTC

Power went out yesterday, I lost work units. Power went out today, I lost work units. This needs to get fixed!!!!!

Jacob Klein
Message 36707 - Posted: 28 Apr 2014 | 12:34:28 UTC
Last modified: 28 Apr 2014 | 12:35:36 UTC

MJH:

Although the 8.41 app appears to have improved the situation, I am still occasionally getting what appears to be the same error. I think the scenario is suspending activity, then restarting BOINC. Can you see if there's some scenario/condition that still causes the task to fail?

Error summary:
Exit status 80 (0x50) Unknown error number
The file exists.
(0x50) - exit code 80 (0x50)

Last message logged in stderr.txt:
# BOINC suspending at user request (exit)

Task results and stderr.txt:

http://www.gpugrid.net/result.php?resultid=9339200

Name I188-NATHAN_RPS1_adapt4-1-5-RND2310_0
Workunit 6566597
Created 25 Apr 2014 | 22:04:16 UTC
Sent 26 Apr 2014 | 11:18:41 UTC
Received 27 Apr 2014 | 4:06:56 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 80 (0x50) Unknown error number
Computer ID 153764
Report deadline 1 May 2014 | 11:18:41 UTC
Run time 38,039.02
CPU time 6,213.84
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.41 (cuda60)
Stderr output

<core_client_version>7.3.15</core_client_version>
<![CDATA[
<message>
The file exists.
(0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:09:00.0
# Device clock : 1124MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : DM337_50 : 33761
# GPU 0 : 68C
# GPU 1 : 69C
# GPU 2 : 77C
# GPU 0 : 69C
# GPU 1 : 70C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : DM337_50 : 33761
# GPU 0 : 70C
# GPU 1 : 69C
# GPU 2 : 78C
# GPU 1 : 70C
# GPU 1 : 71C
# GPU 2 : 79C
# GPU 0 : 71C
# GPU 1 : 72C
# GPU 0 : 72C
# GPU 0 : 73C
# GPU 2 : 80C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : DM337_50 : 33761
# GPU 0 : 70C
# GPU 1 : 65C
# GPU 2 : 74C
# GPU 0 : 71C
# GPU 1 : 68C
# GPU 2 : 75C
# GPU 0 : 72C
# GPU 1 : 70C
# GPU 1 : 72C
# GPU 2 : 76C
# GPU 2 : 77C
# GPU 1 : 73C
# GPU 2 : 79C
# GPU 1 : 74C
# GPU 1 : 75C
# GPU 1 : 76C
# GPU 1 : 77C
# GPU 1 : 78C
# GPU 1 : 79C
# GPU 1 : 80C
# GPU 0 : 74C
# GPU 0 : 75C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : DM337_50 : 33761
# GPU 0 : 60C
# GPU 1 : 58C
# GPU 2 : 55C
# GPU 0 : 65C
# GPU 1 : 63C
# GPU 2 : 70C
# GPU 0 : 68C
# GPU 1 : 67C
# GPU 2 : 74C
# GPU 0 : 70C
# GPU 1 : 69C
# GPU 2 : 75C
# GPU 0 : 71C
# GPU 1 : 70C
# GPU 2 : 76C
# GPU 0 : 72C
# GPU 1 : 73C
# GPU 0 : 73C
# GPU 2 : 77C
# GPU 2 : 78C
# GPU 2 : 79C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : DM337_50 : 33761
# GPU 0 : 59C
# GPU 1 : 58C
# GPU 2 : 55C
# GPU 0 : 61C
# GPU 1 : 63C
# GPU 0 : 63C
# GPU 1 : 67C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : DM337_50 : 33761
# GPU 0 : 61C
# GPU 1 : 64C
# GPU 2 : 55C
# GPU 0 : 63C
# GPU 1 : 68C
# GPU 0 : 64C
# GPU 1 : 70C
# GPU 0 : 65C
# GPU 1 : 71C
# GPU 2 : 56C
# GPU 0 : 66C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : DM337_50 : 33761
# GPU 0 : 63C
# GPU 1 : 66C
# GPU 2 : 62C
# GPU 0 : 65C
# GPU 1 : 70C
# GPU 0 : 66C
# GPU 1 : 74C
# GPU 0 : 67C
# GPU 1 : 78C
# GPU 0 : 69C
# GPU 0 : 71C
# GPU 0 : 73C
# GPU 2 : 71C
# GPU 2 : 73C
# GPU 2 : 74C
# GPU 2 : 75C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : DM337_50 : 33761
# GPU 0 : 61C
# GPU 1 : 63C
# GPU 2 : 57C
# GPU 0 : 63C
# GPU 1 : 67C
# GPU 0 : 64C
# GPU 1 : 70C
# GPU 0 : 66C
# GPU 1 : 71C
# GPU 1 : 73C
# GPU 0 : 67C
# GPU 1 : 74C
# GPU 1 : 75C
# GPU 0 : 71C
# GPU 0 : 72C
# GPU 2 : 69C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : DM337_50 : 33761
# GPU 0 : 59C
# GPU 1 : 59C
# GPU 2 : 56C
# GPU 0 : 65C
# GPU 1 : 64C
# GPU 2 : 64C
# GPU 0 : 68C
# GPU 1 : 67C
# GPU 2 : 69C
# GPU 0 : 69C
# GPU 1 : 69C
# GPU 2 : 71C
# GPU 0 : 71C
# GPU 2 : 72C
# GPU 0 : 73C
# GPU 1 : 72C
# GPU 2 : 73C
# GPU 1 : 73C
# GPU 2 : 74C
# GPU 2 : 75C
# GPU 2 : 76C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : DM337_50 : 33761
# GPU 0 : 62C
# GPU 1 : 65C
# GPU 2 : 57C
# GPU 0 : 63C
# GPU 1 : 68C
# BOINC suspending at user request (exit)

</stderr_txt>
]]>

Jozef J
Send message
Joined: 7 Jun 12
Posts: 112
Credit: 1,118,845,172
RAC: 0
Level
Met
Message 36835 - Posted: 14 May 2014 | 16:20:05 UTC

I have been having a problem computing on my GTX 680 cards for about a month now. Every day, tasks collapse in a weird way: the whole Windows system slows down and, according to GPU-Z, the card stops computing. The entire system behaves as if in slow motion. The only thing that helps is suspending computation on the graphics card, aborting every task, and fetching new ones. After about 6-12 aborted tasks, another 3 start and run normally. These are weird errors, and they affect only the NVIDIA 600-series cards; on the 700-series cards, computing goes perfectly.
I have been fighting this problem for months, while computing for other projects works without problems.
It's not overheating cards or a weak PSU. I'm not able to compute GPUGRID normally on these 680s, so I'm considering selling them or moving to another project.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 36856 - Posted: 17 May 2014 | 8:01:05 UTC
Last modified: 17 May 2014 | 8:02:41 UTC

MJH:

The v8.41 version of the application still has the occasional "The file exists. (0x50) - exit code 80 (0x50)" error, trashing loads of work :( Can you please invest some time to fix it?

http://www.gpugrid.net/result.php?resultid=10318262


Name A2ART4Ex05x95-GERARD_A2ART4E-13-14-RND0991_0
Workunit 7496762
Created 14 May 2014 | 5:52:04 UTC
Sent 16 May 2014 | 13:57:32 UTC
Received 17 May 2014 | 3:24:11 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 80 (0x50) Unknown error number
Computer ID 153764
Report deadline 21 May 2014 | 13:57:32 UTC
Run time 24,161.19
CPU time 6,302.88
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.41 (cuda60)
Stderr output

<core_client_version>7.3.19</core_client_version>
<![CDATA[
<message>
The file exists.
(0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:09:00.0
# Device clock : 1124MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : DM337_50 : 33761
# GPU 0 : 67C
# GPU 1 : 75C
# GPU 2 : 74C
# GPU 0 : 68C
# GPU 1 : 76C
# GPU 0 : 69C
# GPU 0 : 70C
# GPU 1 : 77C
# GPU 0 : 71C
# GPU 0 : 72C
# GPU 2 : 75C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:09:00.0
# Device clock : 1124MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : DM337_50 : 33761
# GPU 0 : 66C
# GPU 1 : 71C
# GPU 2 : 58C
# GPU 0 : 67C
# GPU 2 : 62C
# GPU 2 : 66C
# GPU 2 : 67C
# GPU 0 : 68C
# GPU 1 : 72C
# GPU 2 : 68C
# GPU 2 : 69C
# GPU 2 : 70C
# GPU 0 : 69C
# GPU 1 : 73C
# GPU 2 : 71C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:09:00.0
# Device clock : 1124MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : DM337_50 : 33761
# GPU 0 : 66C
# GPU 1 : 71C
# GPU 2 : 65C
# GPU 0 : 67C
# GPU 1 : 72C
# GPU 2 : 67C
# GPU 2 : 68C
# GPU 0 : 68C
# GPU 2 : 69C
# GPU 1 : 73C
# GPU 0 : 69C
# GPU 2 : 70C
# GPU 2 : 71C
# GPU 1 : 74C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:09:00.0
# Device clock : 1124MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : DM337_50 : 33761
# GPU 0 : 68C
# GPU 1 : 73C
# GPU 2 : 68C
# GPU 2 : 69C
# GPU 2 : 70C
# GPU 0 : 69C
# GPU 1 : 74C
# GPU 2 : 71C
# GPU 0 : 70C
# GPU 1 : 75C
# GPU 2 : 72C
# GPU 2 : 73C
# GPU 1 : 76C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : DM337_50 : 33761
# GPU 0 : 57C
# GPU 1 : 68C
# GPU 2 : 61C
# GPU 0 : 61C
# GPU 1 : 69C
# GPU 0 : 64C
# GPU 1 : 70C
# GPU 0 : 65C
# GPU 1 : 71C
# GPU 0 : 66C
# GPU 1 : 72C
# GPU 0 : 67C
# GPU 1 : 73C
# GPU 0 : 69C
# GPU 0 : 70C
# GPU 2 : 67C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : DM337_50 : 33761
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : DM337_50 : 33761
# GPU 0 : 61C
# GPU 1 : 53C
# GPU 2 : 67C
# GPU 0 : 64C
# GPU 1 : 58C
# GPU 2 : 69C
# GPU 0 : 66C
# GPU 1 : 61C
# GPU 0 : 67C
# GPU 1 : 64C
# GPU 2 : 70C
# GPU 0 : 68C
# GPU 1 : 65C
# GPU 1 : 67C
# GPU 2 : 71C
# GPU 0 : 69C
# GPU 1 : 69C
# GPU 0 : 70C
# GPU 1 : 70C
# GPU 1 : 71C
# GPU 2 : 72C
# GPU 0 : 71C
# GPU 1 : 72C
# GPU 2 : 73C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : DM337_50 : 33761
# GPU 0 : 61C
# GPU 1 : 53C
# GPU 2 : 67C
# GPU 0 : 64C
# GPU 1 : 57C
# GPU 2 : 68C
# GPU 0 : 66C
# GPU 1 : 60C
# GPU 2 : 69C
# GPU 0 : 67C
# GPU 1 : 63C
# GPU 2 : 70C
# GPU 0 : 68C
# GPU 1 : 64C
# GPU 0 : 69C
# GPU 1 : 67C
# GPU 1 : 68C
# GPU 2 : 71C
# GPU 1 : 69C
# GPU 0 : 70C
# GPU 1 : 70C
# GPU 1 : 72C
# GPU 2 : 72C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : DM337_50 : 33761
# GPU 0 : 54C
# GPU 1 : 58C
# GPU 2 : 59C
# GPU 1 : 62C
# GPU 1 : 64C
# GPU 0 : 60C
# GPU 1 : 66C
# GPU 0 : 62C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : DM337_50 : 33761
# GPU 0 : 58C
# GPU 1 : 53C
# GPU 2 : 58C
# GPU 0 : 60C
# GPU 1 : 58C
# GPU 0 : 63C
# GPU 1 : 62C

</stderr_txt>
]]>

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 36857 - Posted: 17 May 2014 | 8:46:09 UTC - in response to Message 36856.

MJH:

The v8.41 version of the application still has the occasional "The file exists. (0x50) - exit code 80 (0x50)" error, trashing loads of work :( Can you please invest some time to fix it?

http://www.gpugrid.net/result.php?resultid=10318262

+2

http://www.gpugrid.net/result.php?resultid=10328606

http://www.gpugrid.net/result.php?resultid=10328572

These failed after a simple system restart.

Wdethomas
Send message
Joined: 6 Feb 10
Posts: 38
Credit: 274,204,838
RAC: 0
Level
Asn
Message 36984 - Posted: 1 Jun 2014 | 20:03:13 UTC

Every time the lights go out I lose all the units that are being worked on. If I restart the system using the proper procedures, no problem. This has been going on for months and I am really getting sick of it. Bought a UPS, now let's see.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 36985 - Posted: 2 Jun 2014 | 3:51:56 UTC - in response to Message 36984.

Is your error:


The file exists.
(0x50) - exit code 80 (0x50)


If not, then create a new thread please.

This thread is about that error.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 37242 - Posted: 7 Jul 2014 | 20:07:12 UTC
Last modified: 7 Jul 2014 | 20:09:57 UTC

This is *STILL* an issue. When can we finally get it fully fixed? :(

http://www.gpugrid.net/result.php?resultid=12800989
Outcome Computation error
Client state Compute error
Exit status 80 (0x50) Unknown error number
Run time 32,087.62

Stderr output
<core_client_version>7.4.8</core_client_version>
<![CDATA[
<message>
The file exists.
(0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
...
...
...
# BOINC suspending at user request (exit)
</stderr_txt>
]]>


http://www.gpugrid.net/result.php?resultid=12796113
Outcome Computation error
Client state Compute error
Exit status 80 (0x50) Unknown error number
Run time 2,221.71
Stderr output
<core_client_version>7.4.8</core_client_version>
<![CDATA[
<message>
The file exists.
(0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
...
...
...
# BOINC suspending at user request (exit)
</stderr_txt>
]]>

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Message 37325 - Posted: 20 Jul 2014 | 21:08:19 UTC - in response to Message 37242.

Jacob,

You are in luck. It's time for another round of GPUGRID development. Remind me, please, of the circumstances under which this is occurring.

Matt

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 37327 - Posted: 20 Jul 2014 | 22:34:15 UTC

I'm on the road, but will be home tonight. I'll try to re-review, probably tomorrow. Thanks!

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 37338 - Posted: 21 Jul 2014 | 14:39:34 UTC - in response to Message 37325.

Jacob,

You are in luck. It's time for another round of GPUGRID development. Remind me, please, of the circumstances under which this is occurring.

Matt

Hi Matt, I don't know if we need to make a new post for this, but I have a request.
Is it possible, in the stderr output file, to show only the temperature of the GPU that did the job? At the moment the temperature changes of every card are shown.
Thank you.
____________
Greetings from TJ

GPUGRID Role account
Send message
Joined: 15 Feb 07
Posts: 134
Credit: 1,349,535,983
RAC: 0
Level
Met
Message 37339 - Posted: 21 Jul 2014 | 15:22:47 UTC - in response to Message 37338.

Tricky - the GPU ordering from the temperature query interface doesn't correspond to the CUDA ordering.
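
One possible way around the ordering mismatch is to match the two enumerations on something both sides report, such as the PCI bus ID that already appears in the stderr output ("PCI ID : 0000:09:00.0"). The sketch below is only an illustration in Python: it assumes the temperature interface can also report a PCI ID per sensor, which the thread does not confirm, and the function name and the third bus ID in the example are made up.

```python
from typing import Dict, List, Optional


def map_cuda_to_sensor(cuda_pci_ids: List[str],
                       sensor_pci_ids: List[str]) -> Dict[int, Optional[int]]:
    """Map each CUDA device index to the sensor index with the same PCI ID.

    Both lists are indexed by their own enumeration order. A CUDA device
    with no matching sensor maps to None. (Hypothetical helper.)
    """
    sensor_index = {pci.lower(): i for i, pci in enumerate(sensor_pci_ids)}
    return {i: sensor_index.get(pci.lower())
            for i, pci in enumerate(cuda_pci_ids)}


if __name__ == "__main__":
    # CUDA order as seen in the stderr logs of this thread.
    cuda = ["0000:09:00.0", "0000:08:00.0"]
    # Sensor order is an assumption; the third ID is invented for the demo.
    sensors = ["0000:08:00.0", "0000:09:00.0", "0000:0a:00.0"]
    print(map_cuda_to_sensor(cuda, sensors))
```

With such a mapping, the application could log only the sensor that belongs to the device it is actually running on.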

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 37353 - Posted: 22 Jul 2014 | 13:21:25 UTC

MJH:

I've reviewed the notes in the thread. The main posts that detail the problem are:
http://www.gpugrid.net/forum_thread.php?id=3621&nowrap=true#35348
http://www.gpugrid.net/forum_thread.php?id=3621&nowrap=true#37242

It is not easy to reproduce on demand. I suspect that your best bet is to investigate/walk the code, to find an area that could result in:
<message>
The file exists.
(0x50) - exit code 80 (0x50)
</message>

It seems to happen more frequently when the task is suspended before BOINC is shut down, but suspending the task might not be a requirement of the bug.

Testing should involve suspending BOINC, and then shutting BOINC down, and then starting BOINC back up. Also, to test the "power outage" scenario, I think testing could involve right clicking boincmgr.exe in Task Manager, and clicking "End process tree".

I hope this helps. The focus should be on code areas that could result in that error message.

Regards,
Jacob

GPUGRID Role account
Send message
Joined: 15 Feb 07
Posts: 134
Credit: 1,349,535,983
RAC: 0
Level
Met
Message 37357 - Posted: 22 Jul 2014 | 15:00:47 UTC - in response to Message 37353.

That exit circumstance is the failsafe exit that stops a WU getting stuck in an endless cycle of abort - resume, without making any progress. It should only trigger if the machine has been up for a few minutes (from which we infer that the WU crashed the machine).

Matt

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 37359 - Posted: 22 Jul 2014 | 15:28:51 UTC

Perhaps you could give me even more clues on how to reproduce the error on demand? It seems that it is currently too stringent, causing otherwise-healthy tasks to fail when starting BOINC.

Vagelis Giannadakis
Send message
Joined: 5 May 13
Posts: 187
Credit: 349,254,454
RAC: 0
Level
Asp
Message 37361 - Posted: 22 Jul 2014 | 15:46:46 UTC - in response to Message 37359.

He said:

It should only trigger if the machine has been up for a few minutes

So, you could try suspending or closing BOINC and then resuming it, both with and without shutting down the machine in between.
____________

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 37362 - Posted: 22 Jul 2014 | 16:00:32 UTC
Last modified: 22 Jul 2014 | 16:04:45 UTC

Matt,

Could you please give me more details about that exit algorithm? Maybe even pseudocode or something, please? Details, like "If it restarts x times without saving a checkpoint" or "If it restarts x times during a computer-uptime-session" or "If it restarts x times during the course of the task", etc.

... Just so I can easily reproduce the issue on demand, and thus help you test/solve it.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 37387 - Posted: 24 Jul 2014 | 12:34:05 UTC - in response to Message 37362.

Matt,

Could you please give me more details about that exit algorithm? Maybe even pseudocode or something, please? Details, like "If it restarts x times without saving a checkpoint" or "If it restarts x times during a computer-uptime-session" or "If it restarts x times during the course of the task", etc.

... Just so I can easily reproduce the issue on demand, and thus help you test/solve it.


I was able to get another task to error for that reason... so it is still possible, if enough testing is done. Again, could you provide details on the exit algorithm?

GPUGRID Role account
Send message
Joined: 15 Feb 07
Posts: 134
Credit: 1,349,535,983
RAC: 0
Level
Met
Message 37388 - Posted: 24 Jul 2014 | 13:09:35 UTC - in response to Message 37387.

Jacob,

When the simulation starts computing, ACEMD puts a file called "canary" in the slot directory, which it then removes the first time it writes a restart file set.

When ACEMD is starting up it looks for the "canary" file - if it finds it that means the simulation aborted for some reason very soon after it started before making significant progress. In this case, if the system has been booted for less than 10 minutes we interpret this as meaning that the last instance of ACEMD crashed the machine and so abort the WU as bad.

Matt
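
The logic described above can be sketched roughly as follows. This is only an illustration in Python, not ACEMD's actual code: the function names and the way uptime is passed in are assumptions; only the file name "canary" and the 10-minute window come from the post.

```python
import os
import tempfile

CANARY_NAME = "canary"      # file name taken from the post above
UPTIME_LIMIT_S = 10 * 60    # "booted for less than 10 minutes"


def should_abort_on_startup(slot_dir: str, uptime_s: float) -> bool:
    """Return True if the failsafe as described would abort the WU."""
    canary_present = os.path.exists(os.path.join(slot_dir, CANARY_NAME))
    return canary_present and uptime_s < UPTIME_LIMIT_S


def on_compute_start(slot_dir: str) -> None:
    """Drop the canary when the simulation starts computing."""
    with open(os.path.join(slot_dir, CANARY_NAME), "w"):
        pass


def on_first_checkpoint(slot_dir: str) -> None:
    """Remove the canary the first time a restart file set is written."""
    try:
        os.remove(os.path.join(slot_dir, CANARY_NAME))
    except FileNotFoundError:
        pass


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as slot:
        on_compute_start(slot)
        # Restarted shortly after boot, before any checkpoint.
        print(should_abort_on_startup(slot, uptime_s=120))
        # Same leftover canary, but the machine has been up for an hour.
        print(should_abort_on_startup(slot, uptime_s=3600))
        on_first_checkpoint(slot)
        print(should_abort_on_startup(slot, uptime_s=120))
```

Note that in this sketch nothing distinguishes a crash from a clean exit before the first checkpoint: either way the canary is left behind.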

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 37389 - Posted: 24 Jul 2014 | 13:30:33 UTC - in response to Message 37388.
Last modified: 24 Jul 2014 | 13:35:42 UTC

Alright.... So, it looks like the slot directory does get the canary file when the tasks are started within the session. And, by utilizing the <checkpoint_debug> flag in cc_config.xml, I believe I see the file being removed whenever the task's first checkpoint of the session is performed.

So, I've tried closing BOINC (normally) about 2 seconds after startup, which leaves the canary files in my slot directories. But, upon starting BOINC, with those files in the directories, it does not fail the tasks.

How can I get these tasks to easily fail on-demand? Is there more to the logic that decides when to fail them?

EDIT: I just re-read your post... I see "if the system has been booted for less than 10 minutes".... hmm... Let me restart Windows, and perform the same test.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 37391 - Posted: 24 Jul 2014 | 13:52:29 UTC
Last modified: 24 Jul 2014 | 13:54:05 UTC

Hurray! I've been able to make all 3 of my tasks fail, essentially on-demand! All of them with error: "The file exists. (0x50) - exit code 80 (0x50)" ... This genuinely excites me!

Here's what I did:
- restarted my computer
- monitored Task Manager's Performance tab on the CPU selection, to make sure "Up time" was less than 10 minutes
- started BOINC
- saw the canary files
- exited BOINC
- confirmed the canary files were still present
- started BOINC again
- ...and watched the tasks fail.

Good thing I didn't mind failing them :)
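
For anyone repeating these steps, a small helper like the following could list which slot directories currently hold a canary file. It is a sketch in Python; the BOINC data directory path is an assumption you must supply yourself, and the only things taken from the thread are that the file is called "canary" and sits in the task's slot directory.

```python
from pathlib import Path
from typing import List


def find_canaries(data_dir: str) -> List[Path]:
    """Return paths of all 'canary' files directly inside slots/<n>/.

    data_dir is the BOINC data directory (the one containing 'slots').
    Returns an empty list if there is no slots directory.
    """
    slots = Path(data_dir) / "slots"
    if not slots.is_dir():
        return []
    return sorted(p for p in slots.glob("*/canary") if p.is_file())


if __name__ == "__main__":
    import sys
    # Pass your own data directory as the first argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in find_canaries(target):
        print(path)
```

Running it once right after starting BOINC and once after exiting should show whether the canary files were left behind.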

Next thing I'll do (later today if I find time) will be to test whether it is "must see canary on task start within 10 minutes of up-time" or "must see canary on task start within 10 minutes of logged-in time"

Either way, though... This algorithm doesn't jive well. Are you able to make changes to it? Perhaps we could work together to develop a better algorithm that hopefully still accomplishes your goals, without killing tasks?

Let me know,
Thanks,
Jacob

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 37392 - Posted: 24 Jul 2014 | 14:36:00 UTC - in response to Message 37391.
Last modified: 24 Jul 2014 | 14:36:20 UTC

Either way, though... This algorithm doesn't jive well. Are you able to make changes to it? Perhaps we could work together to develop a better algorithm that hopefully still accomplishes your goals, without killing tasks?


It might be a matter of:
1) Removing the canary file on a normal shutdown of BOINC (this could solve the majority of the issues!)
2) Consider removing the 10-minute limit, since... Maybe the machine restarted, and had been sitting at a login screen for several hours, before the user logged in to start BOINC

Thoughts?

GPUGRID Role account
Send message
Joined: 15 Feb 07
Posts: 134
Credit: 1,349,535,983
RAC: 0
Level
Met
Message 37393 - Posted: 24 Jul 2014 | 16:09:11 UTC - in response to Message 37392.

Jacob,

Can you explain exactly the circumstances under which you are getting a false activation of the trap? It sounds to me something like:

* You've stopped BOINC because you want the machine for something else. Some of the WUs have only just started running, and haven't reached their first checkpoint, so leave canary files.
* You turn off the machine
* Later, you turn it back on again and the WUs that had barely started are incorrectly assumed to have been defective and aborted.

Is this really such a common occurrence? The window of vulnerability for a WU is pretty narrow - the interval between starting and first checkpoint should only be a few minutes.

Anyway, you've hit on a reasonable improvement - to remove the canary if the tasks are responding to a suspend request from the client.

Matt
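
A minimal sketch of the improvement agreed on above, assuming the application has some hook that runs when the client asks it to suspend or exit. The hook name `on_suspend_request` and the helper are hypothetical; only the "canary" file name comes from the thread.

```python
import os
import tempfile

CANARY_NAME = "canary"


def remove_canary(slot_dir: str) -> bool:
    """Delete the canary if present; return True if one was removed."""
    try:
        os.remove(os.path.join(slot_dir, CANARY_NAME))
        return True
    except FileNotFoundError:
        return False


def on_suspend_request(slot_dir: str) -> None:
    """Hypothetical hook run when the client asks the task to suspend/exit.

    A clean, client-requested stop is not evidence that the WU crashed
    the machine, so the canary is cleared before the process exits.
    """
    remove_canary(slot_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as slot:
        # Simulate a task that has started but not yet checkpointed.
        open(os.path.join(slot, CANARY_NAME), "w").close()
        on_suspend_request(slot)
        print(os.path.exists(os.path.join(slot, CANARY_NAME)))
```

An ungraceful kill (power loss, "End process tree") never reaches the hook, so the canary would still be there and the failsafe would still apply in those cases.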

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 37394 - Posted: 24 Jul 2014 | 16:28:09 UTC - in response to Message 37393.
Last modified: 24 Jul 2014 | 16:39:30 UTC

Matt,

I do all sorts of crazy fun stuff with my computer. Sometimes, I suspend BOINC, because I need the CPUs for something else. A lot of times, I actually close BOINC, because I want the CPUs and the memory, for my main game, iRacing. :) But I think the culprit scenario is likely a bit different. Here goes.

The "triggering" scenario goes something like this:
- I'm doing something that requires a restart. Maybe I'm installing new software. Go with that as the assumption. Let's say Windows Update required a restart, and I clicked OK to restart Windows. Canary files are not present, because tasks checkpointed before I clicked OK.
- I restart, log in, and immediately pause or exit BOINC (bolded for emphasis as the condition that doesn't jive well with the current canary implementation), because I want resources available. Maybe I realized I have to update additional software, that I know will require a restart, and I want to make this installation go quicker. Or maybe I HAVE A RACE RIGHT NOW (and so, close BOINC, to give me resources for iRacing). So, BOINC gets closed. Canary files are present, because tasks started before I closed. Right?
- So, later, I start BOINC. And then cry. Because all my GPUGrid work is lost. I have 3 GPUs, and all 3 tasks (which could have been up to 30 hours of work) are lost. I weep the tears of a thousand kernels, swept away in an erroneous exit condition. :)

Personally, I think the exit condition might not be needed at all. Have you seen a reason to require it? I assume you want to keep it.

If the tasks are responding to a suspend request from the client (ie: BOINC is closed normally, right? That's what you meant, right?), then... Yes, removing the canary file should solve the problem for my scenario above. It won't solve all the problems (as, I could kill BOINC in Task Manager, and then canary files would still be present, and also I think upgrading BOINC causes the tasks to be killed ungracefully), but it should solve the normal scenarios (normal shutdowns).

Can you implement it? I'd love to test it.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 37395 - Posted: 24 Jul 2014 | 18:39:11 UTC - in response to Message 37393.
Last modified: 24 Jul 2014 | 18:42:27 UTC

I've noticed that GPUGrid tasks fail with the "file exists" error when I'm restarting my PC immediately after a restart. I thought that I should wait for the workunits to make their first checkpoint to avoid this error, but I didn't think that it was a protective algorithm.
Two (or more) fast system restarts are needed (for me) when the USB controllers on my motherboard become unusable in Windows XP after a Windows 7 session on that PC, and I have to physically switch off the power to the PC to fix it. Fast system restarts are also needed when updating different drivers / software in succession, or when fixing other hardware-related problems (for example: I have a PCIe ethernet controller card in this motherboard. At some point the ethernet card disappeared from Device Manager, so there was no network connectivity on this PC, which is crucial. I had to restart the PC several times, and make changes in the BIOS, to fix it).
So this problem can be solved by making this protective algorithm complete: it should delete the canary file during a graceful shutdown.
EDIT: an additional safety algorithm could be this: the workunit should abort itself when it's progressing very slowly (for example: if it couldn't finish in 5 days)
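
The extra safety check suggested in the EDIT could look something like this. It is a hedged sketch only: the names, the one-hour grace period, and the idea of projecting total runtime from the fraction done are assumptions for illustration; the 5-day figure is from the post.

```python
FIVE_DAYS_S = 5 * 24 * 3600


def projected_total_runtime(elapsed_s: float, fraction_done: float) -> float:
    """Linear projection of total runtime from progress so far."""
    if fraction_done <= 0:
        return float("inf")
    return elapsed_s / fraction_done


def too_slow(elapsed_s: float, fraction_done: float,
             limit_s: float = FIVE_DAYS_S,
             min_elapsed_s: float = 3600) -> bool:
    """True if the task looks unable to finish within limit_s.

    The decision is deferred until min_elapsed_s has passed, so a task
    is not judged on its first few minutes.
    """
    if elapsed_s < min_elapsed_s:
        return False
    return projected_total_runtime(elapsed_s, fraction_done) > limit_s


if __name__ == "__main__":
    print(too_slow(7200, 0.5))     # on track: about 4 hours in total
    print(too_slow(86400, 0.1))    # about 10 days projected
```

Unlike the canary check, a test of this kind does not depend on how or when the machine was restarted.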

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 37396 - Posted: 24 Jul 2014 | 21:26:22 UTC - in response to Message 37395.

Despite having a primary SSD and secondary Boinc data drive on my main Win7 system, I still use a 30sec cc_config start delay:


    <cc_config>
      <options>
        <start_delay>30</start_delay>
      </options>
    </cc_config>


After system installations or updates, followed by a system restart, there is still a bit to be done, so if Boinc immediately tries to start loading and running numerous tasks, the WUs are competing for resources with each other and the system. If you restart within 30sec of a previous restart, tasks might be forcibly shut down even before they start running, never mind checkpoint.

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 37418 - Posted: 26 Jul 2014 | 3:51:05 UTC

Any progress?

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 37585 - Posted: 16 Aug 2014 | 3:53:51 UTC - in response to Message 37418.

Matt,

Has there been any progress on improving the canary-file-detection? I almost got bit by it again, when I installed a round of Windows updates, logged into Windows (which launches BOINC), immediately exited BOINC, so I could install round 2 of updates.

Good thing I remembered the canary issue, and remembered to wait until it deleted the files before closing BOINC. But closing BOINC normally should have deleted the canary files.

Please fix this.

GPUGRID Role account
Joined: 15 Feb 07
Posts: 134
Credit: 1,349,535,983
RAC: 0
Level
Met
Message 37588 - Posted: 16 Aug 2014 | 8:55:21 UTC - in response to Message 37585.

Jacob,

It's on the todo list. It'll get done early September, after vacaciones.

Matt

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 37867 - Posted: 9 Sep 2014 | 16:12:57 UTC

Sure hope this gets fixed. Updating my machines from 7.4.8 to 7.4.18, and carefully shutting down 7.4.8 before installing the new client, yielded 3 aborted GPUGrid WUs out of 7. This happens only with GPUGrid WUs; none of the other projects that I run (many) behave this way.

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 38134 - Posted: 28 Sep 2014 | 15:13:53 UTC - in response to Message 37588.

Jacob,

It's on the todo list. It'll get done early September, after vacaciones.

Matt



Early September? 2014?

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Message 38137 - Posted: 28 Sep 2014 | 21:08:46 UTC - in response to Message 38134.

It's coming with the 6.5 app, which is under testing on beta now.

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 38139 - Posted: 28 Sep 2014 | 22:14:43 UTC - in response to Message 38137.

Thank you. Are there minimum requirements for getting tasks on that beta app?
