NOELIA tasks - when suspended or exited, often crash drivers
| Author | Message |
|---|---|
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Guys, Hello Matt,

The crash is on suspend. I've seen it happen when:
- I click "Activity -> Suspend GPU"
- I right-click the tray icon and choose "Snooze GPU"
- I manually suspend the task by clicking the task "Suspend" button in BOINC
- BOINC suspends work because I started an app that is configured as an <exclusive_app> in my cc_config.xml file

I do use the "Leave applications in memory while suspended" setting, so I never lose my CPU tasks' work, and I don't believe that option affects the GPU tasks. However, next time I get a NOELIA task, I will try testing with that option off.

Have you been able to reproduce the issue? |
|
Joined: 27 Nov 11 Posts: 11 Credit: 1,021,749,297 RAC: 0
|
Starting to get a few funky NOELIA tasks as well. GTX 580 + GTX 670, Win7 x64, BOINC 7.0.64
http://www.gpugrid.net/result.php?resultid=6900910
http://www.gpugrid.net/result.php?resultid=6902281
http://www.gpugrid.net/result.php?resultid=6891033
Heck, even a Nathan:
http://www.gpugrid.net/result.php?resultid=6899851
Any thoughts? |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
JugNut, This thread is focusing on the issue "NOELIA tasks - when suspended or exited, often crash drivers", trying to reproduce it, and trying to fix it. For your other issues, please consider creating a separate thread. Thanks, Jacob |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
For me a driver restart happens every time with the current Noelias (or better: I have not observed it not happening), but not with other WUs. Win 8, drivers 320.18 and 314.22 (the last 2 WHQLs), "leave apps in memory" active (but I read it doesn't apply to GPUs, as it would be far too risky to leave something dirty or to run out of memory), and a GTX660Ti (Kepler).

There's been some discussion about whether disabling the driver watchdog helps. Simply increasing the timer didn't help for me (the screen just kept freezing longer), whereas SK said it did help in his case. Another user said disabling the watchdog altogether would help, but I haven't tried this myself.

MrS
Scanning for our furry friends since Jan 2002 |
|
Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0
|
"For me driver restart happens everytime with the current Noelias (or better: I have not observed it not happening) but not with other WUs. Win 8 drivers 320.18 and 314.22 (the last 2 WHQLs), "leave apps in memory" active (but I read it doesn't apply to GPUs, as it would be far too risky to leave something dirty or to run out of memory) and a GTX660Ti (Kepler)."

Try a test with the watchdog disabled. It seems to be working for me. I'm also running XP, but I don't know if that has anything to do with it. |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Driver restart capability was introduced to mainstream desktops with the release of Vista, through the Windows Display Driver Model (WDDM). Driver restarting is unlikely to be an issue in XP, as its display driver architecture was very different. There are some differences between Vista (WDDM 1.0), W7 (WDDM 1.1) and W8 (WDDM 1.2).
http://en.wikipedia.org/wiki/Windows_Display_Driver_Model |
nate Joined: 6 Jun 11 Posts: 124 Credit: 2,928,865 RAC: 0
|
To answer your earlier questions, Jacob: unfortunately, we have not been able to reproduce the error. We only have one box running right now to test Windows 7, and it has not received a NOELIA, since those tasks are now dwindling in number. Of course we don't doubt that the problem is real, considering how many people have confirmed it; we just haven't been able to troubleshoot it yet. Even so, if it is being caused by the driver watchdog or some other Windows bug, we might not be able to do much about it. It will be interesting to see if it still occurs with the watchdog disabled. That should tell us a lot. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Thank you for responding, Nate.

I doubt the issue is a Windows bug or nVidia bug, since it only happens on those NOELIA (klebe) tasks, but I guess you never know. My understanding is that the bug isn't being caused by a driver watchdog; the bug is causing the driver watchdog to be tripped.

Since you think it would be valuable to see whether it occurs with the watchdog disabled, I will invest effort into learning how to do that on Windows 8, so that I am prepared to do additional testing when I get one of these units again. http://msdn.microsoft.com/en-us/library/windows/hardware/ff570088%28v=vs.85%29.aspx appears to have very useful information, including registry key settings in some of the child nodes in the tree at the left, which should hopefully also apply to Windows 8.

You had mentioned previously, I thought, that you do have a way to test work units before issuing them, even on the beta application. If possible, you might take some of those NOELIA (klebe) tasks and run them (and suspend them several times) locally on that Windows 7 box, in a non-Production environment. That may prove beneficial to the user base.

Again, I'll do what I can to provide more information to you.

Thanks,
Jacob |
|
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
Yes, we do test the WUs, but unfortunately (at the moment) we test them locally on our Linux machines, not running under GPUGrid, so we don't catch such problems. We now have a Windows machine, though, which we should slowly start using for that purpose. A bug such as this would probably have slipped past even that kind of inspection, since it takes a bit of fiddling around to trigger. It is something we need to do, though. |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
I'd also suggest setting up a test box with several OSes. Of course you can't test everything in advance, but if problems are reported under specific configurations you could react more quickly by just booting into an affected OS.

Otherwise I agree with Jacob: "My understanding is that the bug isn't being caused by the driver watchdog; the bug is causing the driver watchdog to be tripped."

MrS
Scanning for our furry friends since Jan 2002 |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
NOELIA tasks are still very much an issue for me. Suspending them, or closing BOINC, results in graphics driver resets nearly every time. I wish there were a way to opt out of receiving them. They are that difficult to work with. The klebe ones give me the most trouble. Can anything be done?

Windows 8.1 Preview x64
nVidia GeForce 326.01 x64 Beta
GTX 660 Ti
GTX 460 |
|
Joined: 1 Dec 12 Posts: 24 Credit: 60,122,950 RAC: 0
|
Do not suspend, don't exit BOINC, haha. ;) Nope I don't know, but I recognize your story/experiences. A suspend will also lead to a driver error at my own system. So that's why I'm touching nothing in BOINC when it's crunching @ GPUGRID. :P |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Since I cannot trust any GPUGrid.net units to shut down gracefully anymore, here has been my workaround:
- Set cc_config.xml to stop computation while certain apps/games are running, using <exclusive_app> entries.
- Before I launch an exclusive app/game, right-click the BOINC Manager tray icon and select "Snooze GPU".
- Wait 10 seconds. Sometimes all 3 of my GPUs will crash, and I'll get 3 separate TDR errors with flickering windows, followed by Windows balloons saying that the GPU driver has crashed and restarted. Sometimes I get none. But 10 seconds is how long I have to wait to find out.
- If there were driver crashes, shut down any GPU monitoring software and restart it (since its values are messed up from the driver crashing).
- Now launch the app/game. Because "Snooze GPU" keeps the tasks snoozed for 1 hour, and because I have <exclusive_app> entries in place, I can be sure that computation will not resume while my app/game is running.
- When I'm done with my app/game, BOINC will resume crunching on its own (after the 1-hour timeout from the "Snooze GPU" command has expired).

It is VERY UNFORTUNATE that I have to do this tedious workaround any time I want to use my GPU.

Has GPUGrid.net made ANY PROGRESS in finding out the cause of these driver crashes when GPUGrid.net tasks are suspended or shut down on Windows??

- Jacob |
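The <exclusive_app> step in the workaround above can be sketched as a small script that writes out the relevant part of cc_config.xml. This is only an illustration: the executable names are placeholders, the helper name `build_cc_config` is made up for this example, and the <exclusive_gpu_app> element (which suspends GPU work only) is included as an optional extra that the post itself does not mention.

```python
import xml.etree.ElementTree as ET


def build_cc_config(exclusive_apps, exclusive_gpu_apps=()):
    """Return cc_config.xml text with <exclusive_app> entries.

    exclusive_apps:     executables that suspend all BOINC work while running
    exclusive_gpu_apps: executables that suspend only GPU work (optional)
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for exe in exclusive_apps:
        ET.SubElement(options, "exclusive_app").text = exe
    for exe in exclusive_gpu_apps:
        ET.SubElement(options, "exclusive_gpu_app").text = exe
    # Pretty-print when available (ET.indent was added in Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder names; substitute the real executables of your apps/games.
    print(build_cc_config(["mygame.exe"], ["gpu_benchmark.exe"]))
```

The resulting text would go into cc_config.xml in the BOINC data directory, and the client has to re-read its config files (or be restarted) before the entries take effect. The manual "Snooze GPU" step is still needed, because the crash happens at the moment of suspension regardless of what triggers it.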
|
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
Unfortunately no :( There is really no time right now to focus on this. I understand it is quite a big problem and we are aware of it. |
|
Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0
|
"This is VERY UNFORTUNATE that I have to do this tedious workaround any time I want to use my GPU."

It's a Windows problem, not a GPUGrid problem.
|
|
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
Hm, good to know. But it is a Windows GPUGrid problem and not generally a problem all GPU Boinc projects have on Windows, right? |
|
Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0
|
"Hm, good to know."

It can be a problem with any GPU project and many games. http://msdn.microsoft.com/en-us/library/windows/hardware/ff553893%28v=vs.85%29.aspx

I've posted a fix that worked for me in a couple of different places on this site, but I'll post it here again for anyone to try. My fix goes a little farther than the Windows suggestion, and it works for me. I can suspend and restart tasks, reboot the computer with tasks running, even do a hard shutdown and restart. The tasks always restart from where they were, with no errors or driver timeouts/restarts. YMMV, but this has worked well for me on ATI and Nvidia cards.

Copy and paste the entire code below (including the "Windows Registry Editor Version 5.00" line) into Notepad. Save it as "timeout fix.reg", or under any other name you like as long as it ends with the .reg extension. Then right-click the file and open it with Registry Editor. You'll get warnings about editing the registry. Just click Yes and the entries will be added to your registry. Reboot and you should be good to go. This should stop the "driver has stopped responding" messages and the WU errors that occur when the driver restarts. It will not affect anything else in the registry if it doesn't work.

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog]
"DisableBugCheck"="1"

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display]
"EaRecovery"="0"
|
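For anyone who would rather generate the file than paste it by hand, here is a minimal sketch that renders the same two registry entries as .reg text. The key paths and values are copied from the post above; the function name `render_reg` and the output file name are just illustrative, and the script only writes a file, it does not touch the registry.

```python
# Entries exactly as posted above: {registry key: {value name: string value}}
WATCHDOG_FIX = {
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog": {
        "DisableBugCheck": "1",
    },
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display": {
        "EaRecovery": "0",
    },
}


def render_reg(entries):
    """Render string-valued registry entries in .reg file syntax."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key, values in entries.items():
        lines.append(f"[{key}]")
        for name, value in values.items():
            lines.append(f'"{name}"="{value}"')
        lines.append("")  # blank line between key sections
    return "\r\n".join(lines)


if __name__ == "__main__":
    text = render_reg(WATCHDOG_FIX)
    # newline="" keeps the explicit \r\n line endings unchanged.
    with open("timeout_fix.reg", "w", newline="") as handle:
        handle.write(text)
    print(text)
```

As with the hand-made file, you would still open the generated .reg file with Registry Editor and reboot. Back up the registry first if you are unsure.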
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
It's a kind-of-a-Windows-problem which gets triggered by Noelia's tasks, certainly not by the "good old" trouble-free Nathans, and I don't think by the Santi SRs either, which I'm running now.

Since nanoprobe's fix of completely turning off the driver watchdog and recovery cures these errors, they seem to be triggered by the GPU not responding to the driver for > 2 s, the default watchdog timeout. SK once set the watchdog timeout to 20 s and reported no further errors. I tried with 5 s and got a frozen display for ~20 s, and then the driver reset.

So: is Noelia doing anything special upon ending / suspending WUs? Anything that takes > 2 s to complete?

MrS
Scanning for our furry friends since Jan 2002 |
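Raising the timeout, as opposed to disabling the watchdog, can be sketched in the same way. This assumes the timeout discussed above is the WDDM `TdrDelay` value (a DWORD in seconds, default 2) under the `GraphicsDrivers` key, which is where Microsoft documents the TDR settings. The function name is made up for this example, and the 20 s figure is simply the value SK is reported to have used.

```python
GRAPHICS_DRIVERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
)


def tdr_delay_reg(seconds):
    """Return .reg text that sets TdrDelay to the given number of seconds."""
    if not isinstance(seconds, int) or not 0 < seconds <= 0xFFFFFFFF:
        raise ValueError("TdrDelay must be a positive 32-bit integer")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{GRAPHICS_DRIVERS_KEY}]",
        # .reg files write DWORD values as eight hexadecimal digits.
        f'"TdrDelay"=dword:{seconds:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # 20 seconds, the value reported to have stopped the resets for SK.
    print(tdr_delay_reg(20))
```

A longer delay only helps if the GPU eventually responds within it. The 5 s experiment described above, which still ended in a reset after a long freeze, suggests the stall can last well beyond a few seconds.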
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
Interesting: TJ is reporting no driver reset problems with the current Noelias over there. I can't test it myself since I've currently got a healthy supply of POEMs.

MrS
Scanning for our furry friends since Jan 2002 |
GDF Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0 |
We have turned down the priority on Noelia tasks. You should get fewer and fewer until she gets back.

gdf |
©2025 Universitat Pompeu Fabra