NOELIA tasks - when suspended or exited, often crash drivers

Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 30419 - Posted: 26 May 2013, 16:50:51 UTC - in response to Message 30418.  
Last modified: 26 May 2013, 16:51:44 UTC

Guys,

Is the crash on the suspend or the restart? Do you have "keep application in memory when suspended" set? What if you change to the alternative?

Matt


Hello Matt,

The crash is on suspend.

I've seen it happen when:
- I click "Activity -> Suspend GPU"
- I right-click the tray to choose "Snooze GPU"
- I manually suspend the task by clicking the task "suspend" button in BOINC
- as well as when BOINC suspends work because I start an app that is configured as an <exclusive_app> in my cc_config.xml file.

I do use the "Leave applications in memory while suspended" setting, so I never lose my CPU tasks' work, and I don't believe that option affects the GPU tasks. However, next time I get a NOELIA task, I will try testing with that option off.

Have you been able to reproduce the issue?
ID: 30419 · Rating: 0
JugNut

Joined: 27 Nov 11
Posts: 11
Credit: 1,021,749,297
RAC: 0
Message 30439 - Posted: 27 May 2013, 7:37:38 UTC
Last modified: 27 May 2013, 7:40:43 UTC

ID: 30439 · Rating: 0
Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 30444 - Posted: 27 May 2013, 14:05:57 UTC - in response to Message 30439.  
Last modified: 27 May 2013, 14:06:32 UTC

JugNut,

This thread is focusing on the issue "NOELIA tasks - when suspended or exited, often crash drivers", trying to reproduce it, and trying to fix it.

For your other issues, please consider creating a separate thread.

Thanks,
Jacob
ID: 30444 · Rating: 0
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester

Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 30497 - Posted: 28 May 2013, 18:42:03 UTC

For me, a driver restart happens every time with the current Noelias (or better: I have not observed it not happening), but not with other WUs. Win 8, drivers 320.18 and 314.22 (the last 2 WHQLs), "leave apps in memory" active (but I read it doesn't apply to GPUs, as it would be far too risky to leave something dirty or to run out of memory), and a GTX 660 Ti (Kepler).

There's been some discussion whether disabling the driver watchdog helps. Simply increasing the timer didn't help for me (the screen just kept freezing longer), whereas SK said it did help in his case. Another user said disabling the watchdog altogether would help, but I haven't tried this myself.

MrS
Scanning for our furry friends since Jan 2002
ID: 30497 · Rating: 0
nanoprobe

Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 30516 - Posted: 29 May 2013, 1:27:54 UTC - in response to Message 30497.  

For me, a driver restart happens every time with the current Noelias (or better: I have not observed it not happening), but not with other WUs. Win 8, drivers 320.18 and 314.22 (the last 2 WHQLs), "leave apps in memory" active (but I read it doesn't apply to GPUs, as it would be far too risky to leave something dirty or to run out of memory), and a GTX 660 Ti (Kepler).

There's been some discussion whether disabling the driver watchdog helps. Simply increasing the timer didn't help for me (the screen just kept freezing longer), whereas SK said it did help in his case. Another user said disabling the watchdog altogether would help, but I haven't tried this myself.

MrS

Try a test with the watchdog disabled. Seems to be working for me. I'm also running XP but I don't know if that has anything to do with it.
ID: 30516 · Rating: 0
skgiven
Volunteer moderator
Volunteer tester

Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 30519 - Posted: 29 May 2013, 8:25:39 UTC - in response to Message 30516.  

Driver restart capability was introduced to mainstream desktops with the release of Vista, through the Windows Display Driver Model (WDDM). Driver restarting is unlikely to be an issue in XP, as the display driver architecture was very different.
There are some differences between Vista (WDDM 1.0), W7 (1.1) and W8 (1.2).

http://en.wikipedia.org/wiki/Windows_Display_Driver_Model


ID: 30519 · Rating: 0
nate

Joined: 6 Jun 11
Posts: 124
Credit: 2,928,865
RAC: 0
Message 30522 - Posted: 29 May 2013, 9:02:28 UTC

To answer your earlier questions, Jacob: unfortunately, we have not been able to reproduce the error. We only have one box testing Windows 7 right now, and it has not received a NOELIA, since those tasks are now dwindling in number. We of course don't doubt that the problem is real, considering how many people have confirmed it; we just haven't been able to troubleshoot it yet. Even so, if it is being caused by the driver watchdog or some other Windows bug, we might not be able to do much about it. It will be interesting to see if it still occurs with the watchdog disabled. That should tell us a lot.
ID: 30522 · Rating: 0
Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 30523 - Posted: 29 May 2013, 9:45:41 UTC - in response to Message 30522.  
Last modified: 29 May 2013, 10:11:49 UTC

Thank you for responding, Nate.

I doubt the issue is a Windows bug or nVidia bug, since it only happens on those NOELIA (klebe) tasks, but I guess you never know. My understanding is that the bug isn't being caused by a driver watchdog; the bug is causing the driver watchdog to be tripped.

Since you think it would be valuable to see whether it still occurs with the watchdog disabled, I will invest the effort to learn how to do that on Windows 8, so that I'm prepared to do additional testing when I get one of these units again. http://msdn.microsoft.com/en-us/library/windows/hardware/ff570088%28v=vs.85%29.aspx appears to have very useful information, including registry key settings in some of the child nodes of the tree on the left, which should hopefully also apply to Windows 8.

You had mentioned previously, I thought, that you do have a way to test work units before issuing them even on the beta application. If possible, you might take some of those NOELIA (klebe) tasks, and run them (and suspend them several times) locally on that Windows 7 box, in a non-Production environment. That may prove to be beneficial to the user base.

Again, I'll do what I can to provide more information to you.

Thanks,
Jacob
ID: 30523 · Rating: 0
Stefan
Project administrator
Project developer
Project tester
Project scientist

Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 30524 - Posted: 29 May 2013, 10:26:56 UTC - in response to Message 30523.  
Last modified: 29 May 2013, 10:30:23 UTC

Yes, we do test the WUs, but unfortunately (at the moment) we test them locally on our Linux machines, not through GPUGrid. So we don't catch such problems.
We now have a Windows machine, though, which we should slowly start using for that purpose. A bug such as this one would still have slipped past that kind of inspection, as it requires a bit of fiddling around to trigger. It is something we need to do, though.
ID: 30524 · Rating: 0
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester

Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 30549 - Posted: 30 May 2013, 10:00:54 UTC

I'd also suggest setting up a test box with several OSes. Of course you can't test everything in advance, but if problems are reported under specific configurations, you could react more quickly by just booting into an affected OS.

Otherwise I agree with Jacob: "My understanding is that the bug isn't being caused by the driver watchdog; the bug is causing the driver watchdog to be tripped."

MrS
Scanning for our furry friends since Jan 2002
ID: 30549 · Rating: 0
Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 31557 - Posted: 16 Jul 2013, 13:52:37 UTC - in response to Message 29318.  

NOELIA tasks are still very much an issue for me.
Suspending them, or closing BOINC, results in graphics driver resets nearly every time.

I wish there was a way to opt not to receive them.
They are that difficult to work with.

The klebe ones give me the most trouble.
Can anything be done?

Windows 8.1 Preview x64
nVidia GeForce 326.01 x64 Beta
GTX 660 Ti
GTX 460
ID: 31557 · Rating: 0
FoldingNator

Joined: 1 Dec 12
Posts: 24
Credit: 60,122,950
RAC: 0
Message 31672 - Posted: 19 Jul 2013, 23:12:20 UTC - in response to Message 31557.  

Do not suspend, don't exit BOINC, haha. ;)

Nope, I don't know, but I recognize your story/experiences. A suspend also leads to a driver error on my own system. That's why I touch nothing in BOINC while it's crunching @ GPUGRID. :P
ID: 31672 · Rating: 0
Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 31849 - Posted: 6 Aug 2013, 18:19:08 UTC - in response to Message 31672.  
Last modified: 6 Aug 2013, 18:21:59 UTC

Since I cannot trust any GPUGrid.net units to shut down gracefully anymore, here is my workaround:

- Set cc_config.xml to stop computation while certain apps/games are running, using <exclusive_app> entries.
- Before I launch an exclusive app/game, right-click the BOINC Manager tray icon, and select "Snooze GPU".
- Wait 10 seconds. Sometimes all 3 of my GPUs will crash, and I'll get 3 separate TDR errors with flickering Windows, resulting in Windows balloons saying that the GPU driver has crashed and restarted. Sometimes I get none. But waiting 10 seconds is how long I have to wait to find out.
- If there were driver crashes, shut down any GPU monitoring software, and restart that software (since the values are messed up from the driver crashing).
- Now launch the app/game. Because "Snooze GPU" keeps them snoozed for 1 hour, and because I have <exclusive_app> entries in place, I can be sure that the computation will not resume while my app/game is running.
- When I'm done with my app/game, BOINC will resume crunching on its own (after the 1-hour timeout has expired from the "Snooze GPU" command).
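
For anyone setting up the first step: the <exclusive_app> entries go inside the <options> section of cc_config.xml. Below is a minimal Python sketch that generates such a file's contents. The function name build_cc_config and the executable names are hypothetical placeholders, not taken from the setup described above. <exclusive_gpu_app> is the related BOINC option that suspends only GPU work while the listed program runs.

```python
import xml.etree.ElementTree as ET


def build_cc_config(exclusive_apps, exclusive_gpu_apps):
    """Return cc_config.xml text with the given exclusive-app entries.

    exclusive_apps:     executables that suspend ALL BOINC work while running.
    exclusive_gpu_apps: executables that suspend only GPU work while running.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for exe in exclusive_apps:
        ET.SubElement(options, "exclusive_app").text = exe
    for exe in exclusive_gpu_apps:
        ET.SubElement(options, "exclusive_gpu_app").text = exe
    # ET.indent only exists on Python 3.9+; skip pretty-printing otherwise.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "game.exe" is a placeholder; use the real executable name of your app.
    print(build_cc_config(["game.exe"], ["game.exe"]))
```

The output would be saved as cc_config.xml in the BOINC data directory, after which the client needs to re-read its config files (or be restarted) to pick up the change.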

This is VERY UNFORTUNATE that I have to do this tedious workaround any time I want to use my GPU.

Has GPUGrid.net made ANY PROGRESS into finding out the cause of these driver crashes when GPUGrid.net tasks are suspended or shut down on Windows??


- Jacob
ID: 31849 · Rating: 0
Stefan
Project administrator
Project developer
Project tester
Project scientist

Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 31862 - Posted: 7 Aug 2013, 13:04:41 UTC - in response to Message 31849.  

Unfortunately no :( There is really no time right now to focus on this. I understand it is quite a big problem and we are aware of it.
ID: 31862 · Rating: 0
nanoprobe

Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 31863 - Posted: 7 Aug 2013, 13:22:55 UTC - in response to Message 31849.  

This is VERY UNFORTUNATE that I have to do this tedious workaround any time I want to use my GPU.

Has GPUGrid.net made ANY PROGRESS into finding out the cause of these driver crashes when GPUGrid.net tasks are suspended or shut down on Windows??


It's a Windows problem, not a GPUGrid problem.
ID: 31863 · Rating: 0
Stefan
Project administrator
Project developer
Project tester
Project scientist

Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 31864 - Posted: 7 Aug 2013, 13:25:37 UTC
Last modified: 7 Aug 2013, 13:26:19 UTC

Hm, good to know.
But it is a Windows GPUGrid problem, and not generally a problem that all GPU BOINC projects have on Windows, right?
ID: 31864 · Rating: 0
nanoprobe

Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 31872 - Posted: 7 Aug 2013, 15:38:31 UTC - in response to Message 31864.  

Hm, good to know.
But it is a Windows GPUGrid problem, and not generally a problem that all GPU BOINC projects have on Windows, right?

It can be a problem with any GPU project and many games.

http://msdn.microsoft.com/en-us/library/windows/hardware/ff553893%28v=vs.85%29.aspx

I've posted a fix that worked for me in a couple of different places on this site, but I'll post it here again for anyone to try. My fix goes a little further than the Windows suggestion, and it works for me. I can suspend and restart tasks, reboot the computer with tasks running, even do a hard shutdown and restart. The tasks always restart from where they were, with no errors or driver timeouts/restarts. YMMV, but this has worked well for me on ATI and Nvidia cards.

Copy and paste the entire code below (including the "Windows Registry Editor Version 5.00" line) into Notepad. Save it as timeout fix.reg, or any other name you like as long as it ends with the .reg extension.
Then right-click the file and open it with Registry Editor. You'll get warnings about editing the registry. Just click Yes and the entries will be added to your registry. Reboot and you should be good to go. This should stop the "driver has stopped responding" messages and the WU errors that occur when the driver restarts. If it doesn't work, it will not affect anything else in the registry.


Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog]
"DisableBugCheck"="1"

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display]
"EaRecovery"="0"
ID: 31872 · Rating: 0
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester

Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 31883 - Posted: 7 Aug 2013, 18:50:39 UTC

It's a kind-of-a-Windows-problem which gets triggered by Noelia's tasks, but certainly not by the "good old" trouble-free Nathans, and I don't think by the Santi SRs either, which I'm running now.

Since nanoprobe's fix of completely turning off the driver watchdog and recovery cures these errors, they seem to be triggered by the GPU not responding to the driver for > 2 s, the default watchdog timeout. SK once set the watchdog timeout to 20 s and reported no further errors. I tried 5 s and got a frozen display for ~20 s, and then the driver reset.
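
For reference, the watchdog timeout discussed here corresponds, as far as Microsoft's TDR documentation describes it, to the TdrDelay value (a REG_DWORD in seconds, default 2) under the GraphicsDrivers registry key. Below is a minimal Python sketch that produces .reg file text for a longer timeout. The function name tdr_delay_reg is hypothetical, and whether a longer delay actually helps with these tasks is not confirmed; the reports in this thread differ.

```python
# Registry key documented by Microsoft for TDR (Timeout Detection and Recovery).
TDR_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_delay_reg(seconds):
    """Return .reg file text that sets TdrDelay to the given number of seconds."""
    if not isinstance(seconds, int) or not 0 < seconds <= 0xFFFFFFFF:
        raise ValueError("seconds must be a positive 32-bit integer")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[" + TDR_KEY + "]",
        # .reg files encode REG_DWORD values as 8 hexadecimal digits.
        '"TdrDelay"=dword:{:08x}'.format(seconds),
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # 20 seconds is the value SK is reported to have used.
    print(tdr_delay_reg(20))
```

The printed text can be saved with a .reg extension and imported the same way as any other registry file; a reboot is needed before the driver picks up the new value.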

So: is Noelia doing anything special upon ending / suspending WUs? Anything that takes >2s to complete?

MrS
Scanning for our furry friends since Jan 2002
ID: 31883 · Rating: 0
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester

Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 31920 - Posted: 9 Aug 2013, 21:00:29 UTC

Interesting: TJ is reporting no driver reset problems with the current Noelias over there.
Can't test myself since I've currently got a healthy supply of POEMs.

MrS
Scanning for our furry friends since Jan 2002
ID: 31920 · Rating: 0
GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 14 Mar 07
Posts: 1958
Credit: 629,356
RAC: 0
Message 31978 - Posted: 12 Aug 2013, 22:36:56 UTC - in response to Message 31920.  

We have turned down the priority of Noelia tasks. You should get fewer and fewer until she gets back.

gdf
ID: 31978 · Rating: 0

©2025 Universitat Pompeu Fabra