NOELIA tasks - when suspended or exited, often crash drivers

Jacob Klein

Message 32317 - Posted: 27 Aug 2013, 12:45:36 UTC - in response to Message 31978.  
Last modified: 27 Aug 2013, 12:47:55 UTC

I greatly appreciate the stability my machine has had over the past couple weeks, due to not suspending any NOELIA tasks.

When she gets back, please consider investigating what causes the driver reset and watchdog timeouts, when NOELIA tasks are suspended or BOINC is exited while one is running. I believe some exit logic in the code is not returning quickly enough.

Thank you,
Jacob Klein
Stefan
Project administrator
Project developer
Project tester
Project scientist

Message 32327 - Posted: 27 Aug 2013, 17:14:46 UTC - in response to Message 32317.  

Did you try nanoprobe's suggested fix?
Jacob Klein

Message 32329 - Posted: 27 Aug 2013, 17:35:52 UTC - in response to Message 32327.  
Last modified: 27 Aug 2013, 17:36:11 UTC

His suggested fix is to disable TDR, which I use for games and for other GPU applications. I rely on it. So, no, I didn't try it.

So far as I know, the bug is in the NOELIA tasks.
skgiven
Volunteer moderator
Volunteer tester
Message 32334 - Posted: 27 Aug 2013, 20:13:55 UTC - in response to Message 32329.  
Last modified: 27 Aug 2013, 21:44:41 UTC

Even with a 20-second delay configured in the registry, NOELIA WUs still trigger a driver restart on suspending, changing app, CPU/BOINC Snooze, or closing BOINC.
Old news, hitherto not acted upon...
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help
flashawk

Message 32340 - Posted: 28 Aug 2013, 1:00:05 UTC

Strange, that's never happened to me and I've suspended them dozens of times.
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 32408 - Posted: 28 Aug 2013, 19:22:35 UTC - in response to Message 32329.  

Jacob Klein wrote: "His suggested fix is to disable TDR, which I use for games and for other GPU applications. I rely on it. So, no, I didn't try it. So far as I know, the bug is in the NOELIA tasks."

Same here: the watchdog saves me from real GPU errors often enough that I don't want to disable it.

MrS
Scanning for our furry friends since Jan 2002
skgiven
Volunteer moderator
Volunteer tester
Message 32416 - Posted: 28 Aug 2013, 20:32:19 UTC - in response to Message 32408.  

I don't disable it either; I use a 20-second delay (but I don't game). I've had numerous experiences where the mouse pointer freezes for a few seconds and then everything carries on as before (without a driver restart and without WUs crashing). Before using the delay, I had numerous driver crashes!
MJH
Message 32458 - Posted: 29 Aug 2013, 14:04:07 UTC - in response to Message 32416.  

The next beta will have additional critical section locking that will hopefully mitigate this problem.

MJH
Jacob Klein

Message 32463 - Posted: 29 Aug 2013, 14:41:23 UTC - in response to Message 32458.  

Thank you a million times over for setting aside some time to solve this.
I am ecstatic - cannot wait to test your change!

When it's ready, please let us know which version of the app to use, and which task types to look for to test against.

Thank you,
Jacob
MJH
Message 32474 - Posted: 29 Aug 2013, 15:39:30 UTC - in response to Message 32463.  

Try out 8.02. Give it a damn good suspending.

MJH
Jacob Klein

Message 32478 - Posted: 29 Aug 2013, 17:04:35 UTC - in response to Message 32474.  
Last modified: 29 Aug 2013, 17:05:37 UTC

MJH wrote: "Try out 8.02. Give it a damn good suspending."


Awesome - Initial testing looks very promising! I cannot immediately make it crash. I will do more testing (especially with the exclusive app logic that suspends tasks) later tonight. Edit: I may have been able to make it still crash. Will test more later.

What did you change/fix? I'm a developer, and am very curious about what the change was. Also, is it a change that could improve exit-logic for non-NOELIA tasks?
MJH
Message 32480 - Posted: 29 Aug 2013, 17:12:36 UTC - in response to Message 32478.  


Jacob Klein wrote: "What did you change/fix? I'm a developer, and am very curious about what the change was."


The problem stems from BOINC killing off the process while a GPU operation is underway. The fix is to add BOINC critical section assertions around GPU operations. In the old app, not all GPU operations were so locked.
http://boinc.berkeley.edu/trac/wiki/BasicApi

There may be other circumstances under which a driver hang can be induced, but this should substantially reduce the incidence rate.
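
For readers curious what "critical section assertions around GPU operations" might look like, here is a minimal sketch. It is not the actual ACEMD code. The real BOINC API exposes boinc_begin_critical_section() and boinc_end_critical_section() (see the BasicApi page linked above); the two *_stub functions below are stand-ins so the example builds without BOINC, and CriticalSectionGuard and run_guarded_gpu_step are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Stand-ins for BOINC's boinc_begin_critical_section() and
// boinc_end_critical_section(). They only track nesting depth so the
// sketch can be built and checked without the BOINC library.
static int g_critical_depth = 0;
void begin_critical_section_stub() { ++g_critical_depth; }
void end_critical_section_stub() { --g_critical_depth; }

// Hypothetical RAII guard: enters the critical section on construction
// and leaves it on destruction, so every exit path (early return,
// exception) releases it.
class CriticalSectionGuard {
public:
    CriticalSectionGuard() { begin_critical_section_stub(); }
    ~CriticalSectionGuard() { end_critical_section_stub(); }
    CriticalSectionGuard(const CriticalSectionGuard&) = delete;
    CriticalSectionGuard& operator=(const CriticalSectionGuard&) = delete;
};

// Hypothetical GPU step. In a real app the kernel launch and the
// synchronisation that waits for it would sit inside the guarded scope,
// so a suspend or quit request is deferred until the GPU is idle again.
bool run_guarded_gpu_step() {
    CriticalSectionGuard guard;
    // ... launch kernel, copy results back, synchronise ...
    return g_critical_depth > 0;  // true: the work ran inside the section
}
```

In a real application the stubs would presumably be replaced by the two BOINC calls. Keeping each guarded scope short matters, because the client cannot suspend or stop the task while the section is held.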


Jacob Klein wrote: "Also, is it a change that could improve exit-logic for non-NOELIA tasks?"


It'll be good for all WUs. Indeed, it's not obvious why those poor NOELIAs always took the brunt of it.

MJH
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 32494 - Posted: 29 Aug 2013, 18:58:44 UTC

Hey MJH, glad to have you back! The project feels alive again.. thanks!

MrS
Scanning for our furry friends since Jan 2002
Jacob Klein

Message 32573 - Posted: 31 Aug 2013, 16:00:15 UTC - in response to Message 32474.  

MJH wrote: "Try out 8.02. Give it a damn good suspending."


8.04 KLEBE tasks are still causing driver resets :(

My scenario is that I have 2 of them running - 1 on my GTX 460 and 1 on my GTX 660 Ti, and I'm choosing "Suspend GPU" from the system tray.

Can you please see if you need to add any more critical section mutexes?
Thanks,
Jacob
MJH
Message 32584 - Posted: 1 Sep 2013, 9:49:42 UTC - in response to Message 32573.  


Jacob Klein wrote: "8.04 KLEBE tasks are still causing driver resets :("


As frequently as before?

MJH
Jacob Klein

Message 32585 - Posted: 1 Sep 2013, 11:59:06 UTC - in response to Message 32584.  
Last modified: 1 Sep 2013, 12:02:07 UTC

Frequency is quite hard to conclusively prove.
I'll admit, though, that it feels like it is crashing less when suspending single tasks.
So, I think you're heading in the right direction, but have more work to do.

The main crashes I'm seeing now are when I choose "Snooze GPU" from the system tray; I still *sometimes* get a driver reset when it tries to suspend 2 running NOELIA tasks.

Are there any more critical sections that need to be specified?
Hype

Message 33577 - Posted: 21 Oct 2013, 19:25:02 UTC
Last modified: 21 Oct 2013, 19:26:02 UTC

I have 2 WUs that I can no longer run: as soon as I resume them, the NVIDIA driver crashes.

http://www.gpugrid.net/workunit.php?wuid=4864368
http://www.gpugrid.net/workunit.php?wuid=4862898

This one also just crashed with "computation error":

http://www.gpugrid.net/workunit.php?wuid=4856494

Actually, half of the WUs I tried to run crashed... :(
Jacob Klein

Message 33578 - Posted: 21 Oct 2013, 19:29:25 UTC - in response to Message 33577.  
Last modified: 21 Oct 2013, 19:30:15 UTC

Hype,

This thread (titled: When suspended or exited, often crash drivers) discusses an issue that occurred when tasks were suspended or BOINC was exited normally, where the drivers would often crash. So far as we all know, recent GPUGrid application versions have actually fixed the issue in this thread.

I assume you have a different error. Could you please read this other thread, http://www.gpugrid.net/forum_thread.php?id=3491, as it describes a currently-open issue that might explain the behavior you are seeing. If your issue is still different, then please open a new thread.

Thanks,
Jacob
robertmiles
Message 33596 - Posted: 24 Oct 2013, 0:35:57 UTC

It is unclear whether these are crashing the driver; there is no message saying that they have.

Task http://www.gpugrid.net/result.php?resultid=7393220

Several of these errors, usually with the screen going black for about a second:

10/23/2013 6:21:36 PM | GPUGRID | Task trypsin_lig_12_2-NOELIA_RCrep_eq-0-1-RND5589_0 exited with zero status but no 'finished' file
10/23/2013 6:21:36 PM | GPUGRID | If this happens repeatedly you may need to reset the project.
10/23/2013 6:21:36 PM | GPUGRID | Restarting task trypsin_lig_12_2-NOELIA_RCrep_eq-0-1-RND5589_0 using acemdbeta version 814 (cuda42) in slot 1
10/23/2013 6:22:41 PM | GPUGRID | Task trypsin_lig_12_2-NOELIA_RCrep_eq-0-1-RND5589_0 exited with zero status but no 'finished' file
10/23/2013 6:22:41 PM | GPUGRID | If this happens repeatedly you may need to reset the project.
10/23/2013 6:22:41 PM | GPUGRID | Restarting task trypsin_lig_12_2-NOELIA_RCrep_eq-0-1-RND5589_0 using acemdbeta version 814 (cuda42) in slot 1

I reset the project as indicated. The task then disappeared from my computer.

Also, a number of the recent NOELIA workunits on this computer have given error -97:

http://www.gpugrid.net/result.php?resultid=7393428
http://www.gpugrid.net/result.php?resultid=7393415
http://www.gpugrid.net/result.php?resultid=7393189

SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963.
ACEMD beta version v8.14 (cuda42)


and one SANTI_MAR workunit:

http://www.gpugrid.net/result.php?resultid=7392716

Also error -97.
This one might be from overheating the GPU. No overclocking done.
Short runs (2-3 hours on fastest card) v8.14 (cuda42)
robertmiles
Message 33598 - Posted: 24 Oct 2013, 1:14:51 UTC

More of the same type of problem, on a different NOELIA task.

http://www.gpugrid.net/result.php?resultid=7393628

I've noticed that the problem occurs most often when I do something that affects most of the screen, such as opening or closing a program that uses most of the screen. When it happens, I see the screen go black for about a second.

I aborted this task.
©2025 Universitat Pompeu Fabra