NOELIA tasks - when suspended or exited, often crash drivers
| Author | Message |
|---|---|
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
I greatly appreciate the stability my machine has had over the past couple weeks, due to not suspending any NOELIA tasks. When she gets back, please consider investigating what causes the driver reset and watchdog timeouts, when NOELIA tasks are suspended or BOINC is exited while one is running. I believe some exit logic in the code is not returning quickly enough. Thank you, Jacob Klein |
|
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
Did you try nanoprobe's suggested fix? |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
His suggested fix is to disable TDR, which I use for games and for other GPU applications. I rely on it. So, no, I didn't try it. So far as I know, the bug is in the NOELIA tasks. |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Even with a 20-second delay configured in the registry, NOELIA's WUs still trigger a driver restart when suspended, when changing app, on CPU/BOINC snooze, or when closing BOINC. Old news, hitherto not acted upon... FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
|
Joined: 18 Jun 12 Posts: 297 Credit: 3,572,627,986 RAC: 0
|
Strange, that's never happened to me and I've suspended them dozens of times. |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
"His suggested fix is to disable TDR, which I use for games and for other GPU applications. I rely on it. So, no, I didn't try it." Same here: the watchdog saves me from real GPU errors often enough that I don't want to disable it. MrS Scanning for our furry friends since Jan 2002 |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
I don't disable it either; I use a 20-second delay (but I don't game). I've had numerous experiences where the mouse arrow freezes for a few seconds and then everything is as it was (without a driver restart and without WUs crashing). Prior to using it I had numerous crashy-the-driver experiences! |
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
The next beta will have additional critical section locking that will hopefully mitigate this problem. MJH |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Thank you a million times over for setting aside some time to solve this. I am ecstatic - cannot wait to test your change! When it's ready, please let us know which version of the app to use, and which task types to look for to test against. Thank you, Jacob |
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
Try out 8.02. Give it a damn good suspending. MJH |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
"Try out 8.02. Give it a damn good suspending." Awesome - initial testing looks very promising! I cannot immediately make it crash. I will do more testing (especially with the exclusive app logic that suspends tasks) later tonight. Edit: I may have been able to make it still crash. Will test more later. What did you change/fix? I'm a developer, and am very curious about what the change was. Also, is it a change that could improve exit logic for non-NOELIA tasks? |
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
The problem stems from BOINC killing off the process while a GPU operation is underway. The fix is to add BOINC critical section assertions around GPU operations. In the old app, not all GPU operations were so locked. See http://boinc.berkeley.edu/trac/wiki/BasicApi. There may be other circumstances under which a driver hang can be induced, but this should substantially reduce the incidence rate.
It'll be good for all WUs. Indeed, it's not obvious why those poor NOELIAs always took the brunt of it. MJH |
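For readers who have not used the API, here is a minimal sketch of the pattern described above: every GPU operation is bracketed by a BOINC critical section, so the client should not kill or suspend the process halfway through it. `boinc_begin_critical_section()` and `boinc_end_critical_section()` are the names of the BOINC API calls documented on the BasicApi page linked above; in this sketch they are replaced by local stand-ins that only record the call order, so the example compiles without BOINC. `CriticalSection`, `run_gpu_step`, and `run_steps` are hypothetical names chosen for illustration. This is not the actual ACEMD code.

```cpp
#include <string>
#include <vector>

// Records the order of calls so the pattern can be inspected.
static std::vector<std::string> g_trace;

// Stand-ins (assumption) for the BOINC API calls of the same name.
// A real application would include boinc_api.h and link against BOINC
// instead of defining these two functions itself.
void boinc_begin_critical_section() { g_trace.push_back("begin"); }
void boinc_end_critical_section()   { g_trace.push_back("end"); }

// Hypothetical RAII guard: the critical section is entered in the
// constructor and left in the destructor, so it is always closed,
// even on an early return or an exception.
class CriticalSection {
public:
    CriticalSection()  { boinc_begin_critical_section(); }
    ~CriticalSection() { boinc_end_critical_section(); }
    CriticalSection(const CriticalSection&) = delete;
    CriticalSection& operator=(const CriticalSection&) = delete;
};

// Hypothetical GPU step. In a real app the body would launch kernels,
// copy memory, and synchronize with the device.
void run_gpu_step(int step) {
    CriticalSection guard;  // no kill/suspend is expected inside this scope
    g_trace.push_back("gpu_step_" + std::to_string(step));
}   // guard is destroyed here, ending the critical section

// Runs n guarded steps and returns the recorded call order.
std::vector<std::string> run_steps(int n) {
    g_trace.clear();
    for (int i = 0; i < n; ++i) {
        run_gpu_step(i);
    }
    return g_trace;
}
```

Calling `run_steps(2)` yields a trace in which each `gpu_step_N` entry sits between a `begin` and an `end`. Keeping each guarded region short matters: a suspend or exit request has to wait until the current critical section ends, so one long section would delay the response.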
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
Hey MJH, glad to have you back! The project feels alive again... thanks! MrS |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
"Try out 8.02. Give it a damn good suspending." 8.04 KLEBE tasks are still causing driver resets :( My scenario is that I have 2 of them running - 1 on my GTX 460 and 1 on my GTX 660 Ti - and I'm choosing "Suspend GPU" from the system tray. Can you please see if you need to add any more critical section mutexes? Thanks, Jacob |
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
As frequently as before? MJH |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Frequency is quite hard to prove conclusively. I'll admit, though, that it feels like it is crashing less when suspending single tasks. So, I think you're heading in the right direction, but have more work to do. The main crashes I'm seeing now are when I choose "Snooze GPU" from the system tray; I still *sometimes* get a driver reset when it tries to suspend 2 running NOELIA tasks. Are there any more critical sections that need to be specified? |
|
Joined: 21 Nov 11 Posts: 10 Credit: 8,509,903 RAC: 0
|
I've got 2 WUs that I can no longer start, because as soon as I resume them the NVIDIA driver crashes: http://www.gpugrid.net/workunit.php?wuid=4864368 http://www.gpugrid.net/workunit.php?wuid=4862898 This one also just crashed with "computation error": http://www.gpugrid.net/workunit.php?wuid=4856494 Actually, half of the WUs I tried to do crashed... :( |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Hype, This thread (titled: When suspended or exited, often crash drivers) discusses an issue that occurred when tasks were suspended or BOINC was exited normally, where the drivers would often crash. So far as we all know, recent GPUGrid application versions have actually fixed the issue in this thread. I assume you have a different error. Could you please read this other thread, http://www.gpugrid.net/forum_thread.php?id=3491, as it describes a currently-open issue that might explain the behavior you are seeing. If your issue is still different, then please open a new thread. Thanks, Jacob |
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
Unclear if these are crashing the driver; there is no message saying that it has.

Task http://www.gpugrid.net/result.php?resultid=7393220

Several of these errors, usually with the screen going black for about a second:

10/23/2013 6:21:36 PM | GPUGRID | Task trypsin_lig_12_2-NOELIA_RCrep_eq-0-1-RND5589_0 exited with zero status but no 'finished' file
10/23/2013 6:21:36 PM | GPUGRID | If this happens repeatedly you may need to reset the project.
10/23/2013 6:21:36 PM | GPUGRID | Restarting task trypsin_lig_12_2-NOELIA_RCrep_eq-0-1-RND5589_0 using acemdbeta version 814 (cuda42) in slot 1
10/23/2013 6:22:41 PM | GPUGRID | Task trypsin_lig_12_2-NOELIA_RCrep_eq-0-1-RND5589_0 exited with zero status but no 'finished' file
10/23/2013 6:22:41 PM | GPUGRID | If this happens repeatedly you may need to reset the project.
10/23/2013 6:22:41 PM | GPUGRID | Restarting task trypsin_lig_12_2-NOELIA_RCrep_eq-0-1-RND5589_0 using acemdbeta version 814 (cuda42) in slot 1

I reset the project as indicated. The task then disappeared from my computer.

Also, a number of the recent NOELIA workunits on this computer have given error -97:

http://www.gpugrid.net/result.php?resultid=7393428
http://www.gpugrid.net/result.php?resultid=7393415
http://www.gpugrid.net/result.php?resultid=7393189

SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963.
ACEMD beta version v8.14 (cuda42)

and one SANTI_MAR workunit:

http://www.gpugrid.net/result.php?resultid=7392716

Also error -97. This one might be from overheating the GPU. No overclocking done.
Short runs (2-3 hours on fastest card) v8.14 (cuda42) |
robertmiles Joined: 16 Apr 09 Posts: 503 Credit: 769,991,668 RAC: 0
|
More of the same type of problem, on a different NOELIA task. http://www.gpugrid.net/result.php?resultid=7393628 I've noticed that the problem occurs most often when I do something that affects most of the screen, such as opening or closing a program that uses most of the screen - I see the screen going black for about a second. I aborted this task. |