NOELIA tasks - when suspended or exited, often crash drivers

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 29318 - Posted: 4 Apr 2013 | 23:31:13 UTC

Devs,

If a NOELIA short run task is running, then:
- when I exit BOINC, or
- I suspend the task
...
it often crashes the NVIDIA driver, and leads to Computation Errors on tasks that are running across all GPUs, causing me to lose work, even from other projects.

This sounds very similar to what was happening when NOELIA tasks were in Beta.
Could you please investigate, and see if you can reproduce the issue?
Again, this is causing me to lose work for other projects. :(

Windows 8 x64, BOINC 7.0.60 Beta, nVidia 314.22 WHQL, GTX 660 Ti (usually runs 2 GPUGrid tasks), GTX 460 (usually runs 2 World Community Grid HCC tasks)

nucleon
Joined: 18 Dec 11
Posts: 10
Credit: 172,348,621
RAC: 0
Message 29319 - Posted: 5 Apr 2013 | 12:04:46 UTC - in response to Message 29318.

Thanks.

You figured out why some of my machines are crashing. I'll make sure the task doesn't suspend.

-- Craig

tomba
Joined: 21 Feb 09
Posts: 497
Credit: 700,690,702
RAC: 0
Message 29629 - Posted: 1 May 2013 | 16:37:04 UTC - in response to Message 29318.

... GTX 660 Ti (usually runs 2 GPUGrid tasks)

How do you get a GTX 660 TI to run TWO GPUGrid tasks??

Jacob Klein
Message 29630 - Posted: 1 May 2013 | 16:44:38 UTC - in response to Message 29629.
Last modified: 1 May 2013 | 16:46:00 UTC

How do you get a GTX 660 TI to run TWO GPUGrid tasks??

You use an app_config.xml file.
I'd recommend doing plenty of research beforehand, though, using the following links:
http://boinc.berkeley.edu/wiki/Client_configuration#Application_configuration
http://www.gpugrid.net/forum_thread.php?id=3319
http://www.gpugrid.net/forum_thread.php?id=3331

And if you happen to notice any tasks completing immediately while still granting credit, which is a bug we're still tracking down, then please discontinue the use of the app_config.xml file, and post your results/info here:
http://www.gpugrid.net/forum_thread.php?id=3332

Regards,
Jacob

tomba
Message 29631 - Posted: 1 May 2013 | 16:52:32 UTC - in response to Message 29630.

How do you get a GTX 660 TI to run TWO GPUGrid tasks??

You use an app_config.xml file.
Jacob

Blimey!! I'm on the case!!!
Tom

Jozef J
Joined: 7 Jun 12
Posts: 112
Credit: 1,140,895,172
RAC: 315,774
Message 29676 - Posted: 4 May 2013 | 14:07:35 UTC

it often crashes the NVIDIA driver, and leads to Computation Errors on tasks that are running across all GPUs, causing me to lose work, even from other projects
I have exactly the same problem: a few times per week or month the NVIDIA driver crashes, with a blue screen, on my Win 8 64-bit machine, and for two months I have not been able to get above 620k RAC. Before that I was reaching 650k and climbing higher on the same hardware configuration.
I've tried everything, and the problem clearly lies with NVIDIA and GPUGRID. All the problems started about two months ago, when everyone began looking for the reason they were having problems with the NOELIA tasks and CUDA 4.2. The inability to make crunching available on NVIDIA TITAN cards probably just confirms the problems between NVIDIA and GPUGRID. Or perhaps the project already has enough people crunching for it.

Jacob Klein
Message 30033 - Posted: 16 May 2013 | 21:18:00 UTC - in response to Message 29676.
Last modified: 16 May 2013 | 21:18:48 UTC

GPUGrid.net Devs:

This is still a big problem for me.

If NOELIA tasks are suspended (which happens for me a lot, because I have several <exclusive_app> settings configured in cc_config.xml), it often crashes the driver, which crashes the game/program I'm about to run too!

The error is:
Display driver stopped responding and has recovered

You should be able to easily reproduce this by letting a NOELIA task run for a bit, then hit the Suspend button. Do that over and over, 30 times, and see if you get any errors.

My only workaround right now is, if I know I'm going to be suspending the GPU because I want to run a certain application, I have to suspend it manually, to let it crash the driver, before I run the application. You have made it so I cannot use <exclusive_app> or <exclusive_gpu_app> effectively anymore.
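For readers following along, the <exclusive_app> and <exclusive_gpu_app> entries mentioned above go in the <options> section of cc_config.xml in the BOINC data directory. A minimal sketch, assuming two placeholder executable names (mygame.exe and videoeditor.exe are hypothetical, not programs named in this thread):

```xml
<cc_config>
  <options>
    <!-- Suspend all BOINC computation while this program runs (hypothetical name) -->
    <exclusive_app>mygame.exe</exclusive_app>
    <!-- Suspend only GPU computation while this program runs (hypothetical name) -->
    <exclusive_gpu_app>videoeditor.exe</exclusive_gpu_app>
  </options>
</cc_config>
```

After editing the file, restarting the BOINC client is the safe way to make sure the change is picked up.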

PLEASE FIX THIS!
- Jacob Klein

Dylan
Joined: 16 Jul 12
Posts: 98
Credit: 386,043,752
RAC: 0
Message 30035 - Posted: 17 May 2013 | 0:49:08 UTC

I have this exact same problem. My specs are
Windows 8 x64
2 GTX 670
314.07 WHQL driver
Boinc version 7.0.64 x64

My CPU is an Intel i7-3820 overclocked to 4.5 GHz, which runs 5 WUs of World Community Grid, with the rest of the cores left free to feed the GPUs.

tomba
Message 30290 - Posted: 24 May 2013 | 14:34:41 UTC

Today I replaced my GTX 460 with a GTX 660. My first WU is a Noelia, which looks like it will complete in 12 hours; 25% done in three hours. Much better!

I added the app_config.xml file posted here, which BOINC included in its startup.

I am disappointed not to have seen a second WU running yet. Any thoughts?

BOINC log below.



Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 30292 - Posted: 24 May 2013 | 14:45:33 UTC - in response to Message 30290.

Today I replaced my GTX 460 with a GTX 660. My first WU is a Noelia, which looks like it will complete in 12 hours; 25% done in three hours. Much better!
I added the app_config.xml file posted here, which BOINC included in its startup.
I am disappointed not to have seen a second WU running yet. Any thoughts?

You didn't say what you have in your app config, just "posted here". I don't see it in this thread at least. Apparently you're trying to run 2 WUs concurrently. If so, they won't make the 24 hour deadline. The new NATHANS are even longer. Are you trying to increase your credit? Even if they run without problem, you will end up with lower credit than running 1X on a GTX 660.

tomba
Message 30299 - Posted: 24 May 2013 | 15:33:06 UTC - in response to Message 30292.

You didn't say what you have in your app config, just "posted here". I don't see it in this thread at least.

Thank you for responding. You're right. In this thread there is only a pointer to another thread. Sorry for the confusion.

Apparently you're trying to run 2 WUs concurrently. If so, they won't make the 24 hour deadline. The new NATHANS are even longer. Are you trying to increase your credit? Even if they run without problem, you will end up with lower credit than running 1X on a GTX 660.

Ah! That's not what I had understood: that 50% + 50% = 100% but no bonuses... I just wonder why there has been so much kerfuffle here on a 'feature' (2x) that benefits no-one.

Whatever, if only for a challenge I'd like to give 2x a try. Can you tell me why it does not work for the .XML file below? Thanks.

<app_config>
    <app>
        <name>acemdlong</name>
        <max_concurrent>9999</max_concurrent>
        <gpu_versions>
            <gpu_usage>1</gpu_usage>
            <cpu_usage>0.001</cpu_usage>
        </gpu_versions>
    </app>
    <app>
        <name>acemd2</name>
        <max_concurrent>9999</max_concurrent>
        <gpu_versions>
            <gpu_usage>1</gpu_usage>
            <cpu_usage>0.001</cpu_usage>
        </gpu_versions>
    </app>
    <app>
        <name>acemdshort</name>
        <max_concurrent>9999</max_concurrent>
        <gpu_versions>
            <gpu_usage>1</gpu_usage>
            <cpu_usage>0.001</cpu_usage>
        </gpu_versions>
    </app>
</app_config>


Beyond
Message 30302 - Posted: 24 May 2013 | 15:50:16 UTC - in response to Message 30299.

Apparently you're trying to run 2 WUs concurrently. If so, they won't make the 24 hour deadline. The new NATHANS are even longer. Are you trying to increase your credit? Even if they run without problem, you will end up with lower credit than running 1X on a GTX 660.

Ah! That's not what I had understood: that 50% + 50% = 100% but no bonuses... I just wonder why there has been so much kerfuffle here on a 'feature' (2x) that benefits no-one.

Jacob was running 2X on his 660 Ti with 3GB on the MUCH shorter NATHAN WUs that are now unfortunately gone. If your GPU won't make the 24hr deadline (including DL, UL & reporting time), then you will miss the 24hr bonus and your credit will take a significant hit. That's even if everything runs optimally: errors are likely to be more frequent. Running 1X should be better for the project too as the time from WU generation to WU completion will most likely be less, an issue here.

captainjack
Joined: 9 May 13
Posts: 171
Credit: 3,608,319,747
RAC: 18,464,951
Message 30304 - Posted: 24 May 2013 | 16:15:39 UTC

tomba,

To answer your question about the app_config, the

<gpu_usage>1</gpu_usage>

statement tells BOINC how many GPUs to use for each task. Currently it is set to use 1 full GPU for each task, so only 1 task will run on each GPU at a time. If you set it to
<gpu_usage>0.5</gpu_usage>

that would tell BOINC to use half of a GPU for each task which would allow 2 tasks to run on each GPU.
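Applied to the acemdlong entry from tomba's file, the change might look like the sketch below. Only gpu_usage differs from the posted file; the other values are copied unchanged, and whether 0.001 is a sensible cpu_usage for a given machine is a separate question.

```xml
<app_config>
    <app>
        <name>acemdlong</name>
        <max_concurrent>9999</max_concurrent>
        <gpu_versions>
            <!-- 0.5 GPU per task, so BOINC can schedule two tasks on one GPU -->
            <gpu_usage>0.5</gpu_usage>
            <!-- copied unchanged from the posted file -->
            <cpu_usage>0.001</cpu_usage>
        </gpu_versions>
    </app>
</app_config>
```

The same one-line edit would go into the acemd2 and acemdshort blocks. BOINC has to re-read the file (or be restarted) before the new value takes effect.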

Be sure to post your test results so we can see if it helped.

tomba
Message 30305 - Posted: 24 May 2013 | 17:05:28 UTC - in response to Message 30304.

tomba,

To answer your question about the app_config, the
<gpu_usage>1</gpu_usage>

statement tells BOINC how many GPUs to use for each task. Currently it is set to use 1 full GPU for each task, so only 1 task will run on each GPU at a time. If you set it to
<gpu_usage>0.5</gpu_usage>

that would tell BOINC to use half of a GPU for each task which would allow 2 tasks to run on each GPU.

Be sure to post your test results so we can see if it helped.


Wow! That works! Thank you!! I'm now running two Noelias:



...and below are the pre- and post- 2x results from my GPU Monitor gadget.

I will certainly report back on results! Many thanks.




tomba
Message 30307 - Posted: 24 May 2013 | 17:31:52 UTC - in response to Message 30305.

I will certainly report back on results!

It's very early days but, for both running Noelia WUs, the "Remaining (estimated)" time is counting down much faster than one per second...

Jacob Klein
Message 30310 - Posted: 24 May 2013 | 18:30:03 UTC - in response to Message 30307.

tomba,

When comparing "before" and "after", to see how much faster the tasks are processed, assuming the tasks have the same amount of work to be done in them, then... it's best to look at the "Run Time" value of the results, after they're done.

Also, this thread is about NOELIA tasks crashing. I'm interested in your results, but I have created a thread that documents the performance testing/results of app_config changes; could you please post to it instead? It's called "GPU Task Performance (vs. CPU core usage, app_config, multiple GPU tasks on 1 GPU, etc.)", and is located here: http://www.gpugrid.net/forum_thread.php?id=3331

I'd like to keep this thread focused on the NOELIA problems, which crash the drivers. It's an ongoing issue, and I'm hoping the admins will please look into it :(

Regards,
Jacob

Jacob Klein
Message 30375 - Posted: 26 May 2013 | 0:23:50 UTC - in response to Message 29318.
Last modified: 26 May 2013 | 0:26:26 UTC

Developers:

Has anything been done about this?

NOELIA tasks are still, when suspended, sometimes:
- crashing my drivers
- yielding computation errors even for other GPU tasks
- crashing the whole OS with DPC Watchdog Timeout errors

Please fix this obnoxious behavior!
I originally posted this thread over 6 weeks ago.
Where is the response?



Devs,

If a NOELIA short run task is running, then:
- when I exit BOINC, or
- I suspend the task
...
it often crashes the NVIDIA driver, and leads to Computation Errors on tasks that are running across all GPUs, causing me to lose work, even from other projects.

This sounds very similar to what was happening when NOELIA tasks were in Beta.
Could you please investigate, and see if you can reproduce the issue?
Again, this is causing me to lose work for other projects. :(

Windows 8 x64, BOINC 7.0.60 Beta, nVidia 314.22 WHQL, GTX 660 Ti (usually runs 2 GPUGrid tasks), GTX 460 (usually runs 2 World Community Grid HCC tasks)

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 30405 - Posted: 26 May 2013 | 11:46:59 UTC - in response to Message 30375.

I will forward it. From a quick forum search it seems to be W7/W8 and driver related. So there might not be much we can do. Are you certain it only happens with Noelias and no other WUs?
Unfortunately you need to keep in mind that at this point there are no dedicated GPUGrid (software) "developers". Our manpower is very limited and the user base very big. This means that, with all the hardware/software combinations that work in GPUGrid, there are bound to be problems that we do not have the manpower to fix. We also prefer to dedicate more time to science, for obvious reasons.
However since this seems to happen to a few users I will pass it along to MJH who might know something more. Let's hope we find a fix.

Jacob Klein
Message 30406 - Posted: 26 May 2013 | 11:56:48 UTC - in response to Message 30405.
Last modified: 26 May 2013 | 11:57:16 UTC

I will forward it. From a quick forum search it seems to be W7/W8 and driver related. So there might not be much we can do. Are you certain it only happens with Noelias and no other WUs?
Unfortunately you need to keep in mind that at this point there are no dedicated GPUGrid (software) "developers". Our manpower is very limited and the user base very big. This means that, with all the hardware/software combinations that work in GPUGrid, there are bound to be problems that we do not have the manpower to fix. We also prefer to dedicate more time to science, for obvious reasons.
However since this seems to happen to a few users I will pass it along to MJH who might know something more. Let's hope we find a fix.


The issue happens sometimes when GPU tasks are suspended. This means it will hopefully be easy for you guys to reproduce.

I believe I've only seen the problem on NOELIA tasks. For reference, I'm using Windows 8 x64, with the new v320.18 WHQL drivers.

It should be a matter of letting the task run for some time (15 seconds), then suspending it, then repeating that several times; hopefully you'll see the problem after a few tries. I'd be curious to know if you (or anyone in GPUGrid) can reproduce it?

Thanks,
Jacob

MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 30418 - Posted: 26 May 2013 | 16:43:49 UTC - in response to Message 30406.

Guys,

Is the crash on the suspend or the restart? Do you have "keep application in memory when suspended" set? What if you change to the alternative?

Matt

Jacob Klein
Message 30419 - Posted: 26 May 2013 | 16:50:51 UTC - in response to Message 30418.
Last modified: 26 May 2013 | 16:51:44 UTC

Guys,

Is the crash on the suspend or the restart? Do you have "keep application in memory when suspended" set? What if you change to the alternative?

Matt


Hello Matt,

The crash is on suspend.

I've seen it happen when:
- I click "Activity -> Suspend GPU"
- I right-click the tray to choose "Snooze GPU"
- I manually suspend the task by clicking the task "suspend" button in BOINC
- as well as when BOINC suspends work due to me starting an app that is configured as an <exclusive_app> in my cc_config.xml file.

I do use the "Leave applications in memory while suspended" setting, so I never lose my CPU tasks' work, and I don't believe that option affects the GPU tasks. However, next time I get a NOELIA task, I will try testing with that option off.
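For testing with that option off, the simplest route is to untick it in BOINC Manager's computing preferences. The same preference can also be overridden locally with a global_prefs_override.xml file in the BOINC data directory; a minimal sketch, assuming the leave_apps_in_memory element name used in BOINC's preference files:

```xml
<global_preferences>
    <!-- 0 = remove suspended tasks from memory, 1 = leave them in memory -->
    <leave_apps_in_memory>0</leave_apps_in_memory>
</global_preferences>
```

Whether this setting has any effect on GPU tasks is exactly the open question here, so treat it as a test aid rather than a fix.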

Have you been able to reproduce the issue?

JugNut
Joined: 27 Nov 11
Posts: 11
Credit: 1,021,749,297
RAC: 0
Message 30439 - Posted: 27 May 2013 | 7:37:38 UTC
Last modified: 27 May 2013 | 7:40:43 UTC

Starting to get a few funky NOELIA tasks as well. GTX 580 + GTX 670, win7 x64,
boinc 7.0.64

http://www.gpugrid.net/result.php?resultid=6900910
http://www.gpugrid.net/result.php?resultid=6902281
http://www.gpugrid.net/result.php?resultid=6891033

Heck even a Nathan.

http://www.gpugrid.net/result.php?resultid=6899851

Any thoughts?

Jacob Klein
Message 30444 - Posted: 27 May 2013 | 14:05:57 UTC - in response to Message 30439.
Last modified: 27 May 2013 | 14:06:32 UTC

JugNut,

This thread is focusing on the issue "NOELIA tasks - when suspended or exited, often crash drivers", trying to reproduce it, and trying to fix it.

For your other issues, please consider creating a separate thread.

Thanks,
Jacob

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 30497 - Posted: 28 May 2013 | 18:42:03 UTC

For me, a driver restart happens every time with the current Noelias (or better: I have not observed it not happening), but not with other WUs. Win 8, drivers 320.18 and 314.22 (the last 2 WHQLs), "leave apps in memory" active (but I read it doesn't apply to GPUs, as it would be far too risky to leave something dirty or to run out of memory) and a GTX660Ti (Kepler).

There's been some discussion whether disabling the driver watchdog helps. Simply increasing the timer didn't help for me (the screen just kept freezing longer), whereas SK said it did help in his case. Another user said disabling the watchdog altogether would help, but I haven't tried this myself.

MrS
____________
Scanning for our furry friends since Jan 2002

nanoprobe
Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 30516 - Posted: 29 May 2013 | 1:27:54 UTC - in response to Message 30497.

For me, a driver restart happens every time with the current Noelias (or better: I have not observed it not happening), but not with other WUs. Win 8, drivers 320.18 and 314.22 (the last 2 WHQLs), "leave apps in memory" active (but I read it doesn't apply to GPUs, as it would be far too risky to leave something dirty or to run out of memory) and a GTX660Ti (Kepler).

There's been some discussion whether disabling the driver watchdog helps. Simply increasing the timer didn't help for me (the screen just kept freezing longer), whereas SK said it did help in his case. Another user said disabling the watchdog altogether would help, but I haven't tried this myself.

MrS

Try a test with the watchdog disabled. Seems to be working for me. I'm also running XP but I don't know if that has anything to do with it.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 30519 - Posted: 29 May 2013 | 8:25:39 UTC - in response to Message 30516.

Driver restart capability was introduced to mainstream desktops with the release of Vista, through the Windows Display Driver Model (WDDM). Driver restarting is unlikely to be an issue in XP, as the display driver architecture was very different.
There are some differences between Vista (WDDM 1.0), W7 (1.1) and W8 (1.2).

http://en.wikipedia.org/wiki/Windows_Display_Driver_Model


____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

nate
Joined: 6 Jun 11
Posts: 124
Credit: 2,928,865
RAC: 0
Message 30522 - Posted: 29 May 2013 | 9:02:28 UTC

To answer your questions from earlier, Jacob: we have not been able to reproduce the error, unfortunately. We only have one box running right now testing Windows 7, and we have not received a NOELIA since those tasks are now dwindling in number. We of course don't doubt that it is real, considering how many people have confirmed it; we just haven't been able to troubleshoot it yet. Even so, if it is being caused by the driver watchdog or some other Windows bug, we might not be able to do much about it. It will be interesting to see if it still occurs with the watchdog disabled. That should tell us a lot.

Jacob Klein
Message 30523 - Posted: 29 May 2013 | 9:45:41 UTC - in response to Message 30522.
Last modified: 29 May 2013 | 10:11:49 UTC

Thank you for responding, Nate.

I doubt the issue is a Windows bug or nVidia bug, since it only happens on those NOELIA (klebe) tasks, but I guess you never know. My understanding is that the bug isn't being caused by a driver watchdog; the bug is causing the driver watchdog to be tripped.

Since you think it may be valuable information, to you, to see if it occurs with the watchdog disabled, then... I will invest effort into knowing how to do that with Windows 8, so that I can be prepared to do additional testing when I get one of these units again. http://msdn.microsoft.com/en-us/library/windows/hardware/ff570088%28v=vs.85%29.aspx appears to have very useful information, including registry key settings in some of the child nodes in the tree at the left... that should also hopefully be applicable to Windows 8.

You had mentioned previously, I thought, that you do have a way to test work units before issuing them even on the beta application. If possible, you might take some of those NOELIA (klebe) tasks, and run them (and suspend them several times) locally on that Windows 7 box, in a non-Production environment. That may prove to be beneficial to the user base.

Again, I'll do what I can to provide more information to you.

Thanks,
Jacob

Stefan
Project administrator
Project developer
Project tester
Project scientist
Message 30524 - Posted: 29 May 2013 | 10:26:56 UTC - in response to Message 30523.
Last modified: 29 May 2013 | 10:30:23 UTC

Yes we do test the WU's, but unfortunately (at the moment) we test them locally on our Linux machines and not running GPUGrid. So we don't catch such problems.
We now have a Windows machine though, which we should slowly start using for that purpose. A bug such as this would have slipped past even that kind of inspection, though, as it requires a bit of fiddling around. It is something we need to do regardless.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 30549 - Posted: 30 May 2013 | 10:00:54 UTC

I'd also suggest setting up a test box with several OS's. Of course you can't test everything in advance, but if problems are reported under specific configurations you could react more quickly by just booting into an affected OS.

Otherwise I agree with Jacob: "My understanding is that the bug isn't being caused by the driver watchdog; the bug is causing the driver watchdog to be tripped."

MrS
____________
Scanning for our furry friends since Jan 2002

Jacob Klein
Message 31557 - Posted: 16 Jul 2013 | 13:52:37 UTC - in response to Message 29318.

NOELIA tasks are still very much an issue for me.
Suspending them, or closing BOINC, results in graphics driver resets nearly every time.

I wish there was a way to opt not to receive them.
They are that difficult to work with.

The klebe ones give me the most trouble.
Can anything be done?

Windows 8.1 Preview x64
nVidia GeForce 326.01 x64 Beta
GTX 660 Ti
GTX 460

FoldingNator
Joined: 1 Dec 12
Posts: 24
Credit: 60,122,950
RAC: 0
Message 31672 - Posted: 19 Jul 2013 | 23:12:20 UTC - in response to Message 31557.

Do not suspend, don't exit BOINC, haha. ;)

Nope, I don't know, but I recognize your story/experiences. A suspend will also lead to a driver error on my own system. So that's why I'm touching nothing in BOINC when it's crunching @ GPUGRID. :P

Jacob Klein
Message 31849 - Posted: 6 Aug 2013 | 18:19:08 UTC - in response to Message 31672.
Last modified: 6 Aug 2013 | 18:21:59 UTC

Since I cannot trust any GPUGrid.net units to shut down gracefully anymore, here has been my workaround:

- Set cc_config.xml to stop computation while certain apps/games are running, using <exclusive_app> entries.
- Before I launch an exclusive app/game, right-click the BOINC Manager tray icon, and select "Snooze GPU".
- Wait 10 seconds. Sometimes all 3 of my GPUs will crash, and I'll get 3 separate TDR errors with flickering Windows, resulting in Windows balloons saying that the GPU driver has crashed and restarted. Sometimes I get none. But waiting 10 seconds is how long I have to wait to find out.
- If there were driver crashes, shut down any GPU monitoring software, and restart that software (since the values are messed up from the driver crashing).
- Now launch the app/game. Because "Snooze GPU" keeps them snoozed for 1 hour, and because I have <exclusive_app> entries in place, I can be sure that the computation will not resume while my app/game is running.
- When I'm done with my app/game, BOINC will resume crunching on its own (after the 1-hour timeout has expired from the "Snooze GPU" command).

This is VERY UNFORTUNATE that I have to do this tedious workaround any time I want to use my GPU.

Has GPUGrid.net made ANY PROGRESS into finding out the cause of these driver crashes when GPUGrid.net tasks are suspended or shut down on Windows??


- Jacob

Stefan
Project administrator
Project developer
Project tester
Project scientist
Message 31862 - Posted: 7 Aug 2013 | 13:04:41 UTC - in response to Message 31849.

Unfortunately no :( There is really no time right now to focus on this. I understand it is quite a big problem and we are aware of it.

nanoprobe
Message 31863 - Posted: 7 Aug 2013 | 13:22:55 UTC - in response to Message 31849.

This is VERY UNFORTUNATE that I have to do this tedious workaround any time I want to use my GPU.

Has GPUGrid.net made ANY PROGRESS into finding out the cause of these driver crashes when GPUGrid.net tasks are suspended or shut down on Windows??


It's a Windows problem, not a GPUGrid problem.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Message 31864 - Posted: 7 Aug 2013 | 13:25:37 UTC
Last modified: 7 Aug 2013 | 13:26:19 UTC

Hm, good to know.
But it is a Windows GPUGrid problem and not generally a problem all GPU Boinc projects have on Windows, right?

nanoprobe
Message 31872 - Posted: 7 Aug 2013 | 15:38:31 UTC - in response to Message 31864.

Hm, good to know.
But it is a Windows GPUGrid problem and not generally a problem all GPU Boinc projects have on Windows, right?

It can be a problem with any GPU project and many games.

http://msdn.microsoft.com/en-us/library/windows/hardware/ff553893%28v=vs.85%29.aspx

I've posted a fix that worked for me a couple of different places on this site but I'll post it here again for anyone to try. My fix goes a little farther than the Windows suggestion and it works for me. I can suspend and restart tasks, reboot the computer with tasks running, even do a hard shut down and restart. The tasks always restart from where they were with no errors or driver timeout/restarts. YMMV but this has worked well for me on ATI and Nvidia cards.

Copy and paste the entire code below (including the
Windows Registry Editor Version 5.00 part) into notepad. Rename it timeout fix.reg or something else if you'd like as long as it ends with the .reg extension.
After renaming it right click on it and open it with registry editor. You'll get warnings about editing the registry. Just click yes and the code will be added to your registry. Reboot and you should be good to go. This should stop the driver has stopped responding messages and the errors to the WUs when the driver restarts. It will not affect anything else in the registry if it doesn't work.


Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog]
"DisableBugCheck"="1"

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display]
"EaRecovery"="0"

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 31883 - Posted: 7 Aug 2013 | 18:50:39 UTC

It's a kind-of-a-Windows-problem which gets triggered by Noelia's tasks, certainly not by the "good old" trouble-free Nathans, and I don't think by the Santi SRs either, which I'm running now.

Since nanoprobe's fix of completely turning off the driver watchdog and recovery cures these errors, they seem to be triggered by the GPU not responding to the driver for > 2 s, the default watchdog timeout. SK once set the watchdog timeout to 20 s and reported no further errors. I tried with 5 s and got a frozen display for ~20 s, and then the driver reset.
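
For reference, a longer timeout like SK's can be set without turning the watchdog off. The sketch below assumes the timeout in question is the TdrDelay value that Microsoft documents for Timeout Detection and Recovery on Vista and later; the thread does not name the exact value SK used. TdrDelay is a DWORD given in seconds, so 20 s is hex 14. Save it as a .reg file, import it, and reboot:

```
Windows Registry Editor Version 5.00

; Assumption: TdrDelay is the watchdog timeout discussed in this thread.
; The value is in seconds; 0x14 = 20. The Windows default is 2.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:00000014
```

Deleting the TdrDelay value again should bring back the default 2 s behaviour.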

So: is Noelia doing anything special upon ending / suspending WUs? Anything that takes >2s to complete?

MrS
____________
Scanning for our furry friends since Jan 2002

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 31920 - Posted: 9 Aug 2013 | 21:00:29 UTC

Interesting: TJ is reporting no driver reset problems with the current Noelias over there.
Can't test myself since I've currently got a healthy supply of POEMs.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Message 31978 - Posted: 12 Aug 2013 | 22:36:56 UTC - in response to Message 31920.

We have turned down the priority on Noelia tasks. You should get fewer and fewer until she gets back.

gdf

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 32317 - Posted: 27 Aug 2013 | 12:45:36 UTC - in response to Message 31978.
Last modified: 27 Aug 2013 | 12:47:55 UTC

I greatly appreciate the stability my machine has had over the past couple weeks, due to not suspending any NOELIA tasks.

When she gets back, please consider investigating what causes the driver reset and watchdog timeouts, when NOELIA tasks are suspended or BOINC is exited while one is running. I believe some exit logic in the code is not returning quickly enough.

Thank you,
Jacob Klein

Stefan
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Message 32327 - Posted: 27 Aug 2013 | 17:14:46 UTC - in response to Message 32317.

Did you try nanoprobe's suggested fix?

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 32329 - Posted: 27 Aug 2013 | 17:35:52 UTC - in response to Message 32327.
Last modified: 27 Aug 2013 | 17:36:11 UTC

His suggested fix is to disable TDR, which I use for games and for other GPU applications. I rely on it. So, no, I didn't try it.

So far as I know, the bug is in the NOELIA tasks.

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 32334 - Posted: 27 Aug 2013 | 20:13:55 UTC - in response to Message 32329.
Last modified: 27 Aug 2013 | 21:44:41 UTC

Even with a 20-second delay configured in the registry, Noelia's WUs still trigger a driver restart when suspending, changing app, snoozing CPU/BOINC, or closing BOINC.
Old news, hitherto not acted upon...
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

flashawk
Send message
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Level
Arg
Message 32340 - Posted: 28 Aug 2013 | 1:00:05 UTC

Strange, that's never happened to me and I've suspended them dozens of times.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 32408 - Posted: 28 Aug 2013 | 19:22:35 UTC - in response to Message 32329.

His suggested fix is to disable TDR, which I use for games and for other GPU applications. I rely on it. So, no, I didn't try it.

So far as I know, the bug is in the NOELIA tasks.

Same here: the watchdog saves me from real GPU errors often enough that I don't want to disable it.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 32416 - Posted: 28 Aug 2013 | 20:32:19 UTC - in response to Message 32408.

I don't disable it either; I use a 20-second delay (but I don't game). I've had numerous experiences where the mouse arrow freezes for a few seconds and then everything is as it was (without a driver restart and without WUs crashing). Prior to using it I had numerous crashy-the-driver experiences!
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Message 32458 - Posted: 29 Aug 2013 | 14:04:07 UTC - in response to Message 32416.

The next beta will have additional critical section locking that will hopefully mitigate this problem.

MJH

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 32463 - Posted: 29 Aug 2013 | 14:41:23 UTC - in response to Message 32458.

Thank you a million times over for setting aside some time to solve this.
I am ecstatic - cannot wait to test your change!

When it's ready, please let us know which version of the app to use, and which task types to look for to test against.

Thank you,
Jacob

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Message 32474 - Posted: 29 Aug 2013 | 15:39:30 UTC - in response to Message 32463.

Try out 8.02. Give it a damn good suspending.

MJH

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 32478 - Posted: 29 Aug 2013 | 17:04:35 UTC - in response to Message 32474.
Last modified: 29 Aug 2013 | 17:05:37 UTC

Try out 8.02. Give it a damn good suspending.

MJH


Awesome - Initial testing looks very promising! I cannot immediately make it crash. I will do more testing (especially with the exclusive app logic that suspends tasks) later tonight. Edit: I may have been able to make it still crash. Will test more later.

What did you change/fix? I'm a developer, and am very curious about what the change was. Also, is it a change that could improve exit-logic for non-NOELIA tasks?

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Message 32480 - Posted: 29 Aug 2013 | 17:12:36 UTC - in response to Message 32478.


What did you change/fix? I'm a developer, and am very curious about what the change was.


The problem stems from BOINC killing off the process while a GPU operation is underway. The fix is to add BOINC critical section assertions around GPU operations. In the old app, not all GPU operations were so locked.
http://boinc.berkeley.edu/trac/wiki/BasicApi

There may be other circumstances under which a driver hang can be induced, but this should substantially reduce the incidence rate.


Also, is it a change that could improve exit-logic for non-NOELIA tasks?


It'll be good for all WUs. Indeed, it's not obvious why those poor NOELIAs always took the brunt of it.

MJH

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 32494 - Posted: 29 Aug 2013 | 18:58:44 UTC

Hey MJH, glad to have you back! The project feels alive again.. thanks!

MrS
____________
Scanning for our furry friends since Jan 2002

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 32573 - Posted: 31 Aug 2013 | 16:00:15 UTC - in response to Message 32474.

Try out 8.02. Give it a damn good suspending.

MJH


8.04 KLEBE tasks are still causing driver resets :(

My scenario is that I have 2 of them running - 1 on my GTX 460 and 1 on my GTX 660 Ti, and I'm choosing "Suspend GPU" from the system tray.

Can you please see if you need to add any more critical section mutexes?
Thanks,
Jacob

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Message 32584 - Posted: 1 Sep 2013 | 9:49:42 UTC - in response to Message 32573.


8.04 KLEBE tasks are still causing driver resets :(


As frequently as before?

MJH

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 32585 - Posted: 1 Sep 2013 | 11:59:06 UTC - in response to Message 32584.
Last modified: 1 Sep 2013 | 12:02:07 UTC

Frequency is quite hard to conclusively prove.
I'll admit, though, that it feels like it is crashing less when suspending single tasks.
So, I think you're heading in the right direction, but have more work to do.

The main crashes I'm seeing now are when I choose "Snooze GPU" from the system tray; I still *sometimes* get a driver reset when it tries to suspend 2 running NOELIA tasks.

Are there any more critical sections that need to be specified?

Hype
Send message
Joined: 21 Nov 11
Posts: 10
Credit: 8,509,903
RAC: 0
Level
Ser
Message 33577 - Posted: 21 Oct 2013 | 19:25:02 UTC
Last modified: 21 Oct 2013 | 19:26:02 UTC

I've got 2 WUs which I can't start anymore, because as soon as I resume them the NVIDIA driver crashes.

http://www.gpugrid.net/workunit.php?wuid=4864368
http://www.gpugrid.net/workunit.php?wuid=4862898

This one also just crashed with "computation error":

http://www.gpugrid.net/workunit.php?wuid=4856494

Actually half of the WU's I tried to do crashed... :(

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 33578 - Posted: 21 Oct 2013 | 19:29:25 UTC - in response to Message 33577.
Last modified: 21 Oct 2013 | 19:30:15 UTC

Hype,

This thread (titled: When suspended or exited, often crash drivers) discusses an issue that occurred when tasks were suspended or BOINC was exited normally, where the drivers would often crash. So far as we all know, recent GPUGrid application versions have actually fixed the issue in this thread.

I assume you have a different error. Could you please read this other thread, http://www.gpugrid.net/forum_thread.php?id=3491, as it describes a currently-open issue that might explain the behavior you are seeing. If your issue is still different, then please open a new thread.

Thanks,
Jacob

Profile robertmiles
Send message
Joined: 16 Apr 09
Posts: 503
Credit: 755,434,080
RAC: 186,180
Level
Glu
Message 33596 - Posted: 24 Oct 2013 | 0:35:57 UTC

Unclear if these are crashing the driver; there is no message saying that it has.

Task http://www.gpugrid.net/result.php?resultid=7393220

Several of these errors, usually with the screen going black for about a second:

10/23/2013 6:21:36 PM | GPUGRID | Task trypsin_lig_12_2-NOELIA_RCrep_eq-0-1-RND5589_0 exited with zero status but no 'finished' file
10/23/2013 6:21:36 PM | GPUGRID | If this happens repeatedly you may need to reset the project.
10/23/2013 6:21:36 PM | GPUGRID | Restarting task trypsin_lig_12_2-NOELIA_RCrep_eq-0-1-RND5589_0 using acemdbeta version 814 (cuda42) in slot 1
10/23/2013 6:22:41 PM | GPUGRID | Task trypsin_lig_12_2-NOELIA_RCrep_eq-0-1-RND5589_0 exited with zero status but no 'finished' file
10/23/2013 6:22:41 PM | GPUGRID | If this happens repeatedly you may need to reset the project.
10/23/2013 6:22:41 PM | GPUGRID | Restarting task trypsin_lig_12_2-NOELIA_RCrep_eq-0-1-RND5589_0 using acemdbeta version 814 (cuda42) in slot 1

I reset the project as indicated. The task then disappeared from my computer.

Also, a number of the recent NOELIA workunits on this computer have given error -97:

http://www.gpugrid.net/result.php?resultid=7393428
http://www.gpugrid.net/result.php?resultid=7393415
http://www.gpugrid.net/result.php?resultid=7393189

SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963.
ACEMD beta version v8.14 (cuda42)


and one SANTI_MAR workunit:

http://www.gpugrid.net/result.php?resultid=7392716

Also error -97.
This one might be from overheating the GPU. No overclocking done.
Short runs (2-3 hours on fastest card) v8.14 (cuda42)

Profile robertmiles
Send message
Joined: 16 Apr 09
Posts: 503
Credit: 755,434,080
RAC: 186,180
Level
Glu
Message 33598 - Posted: 24 Oct 2013 | 1:14:51 UTC

More of the same type of problem, on a different NOELIA task.

http://www.gpugrid.net/result.php?resultid=7393628

I've noticed that the problem occurs most often when I do something that affects most of the screen, such as opening or closing a program that uses most of the screen. I see the screen going black for about a second.

I aborted this task.

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 33601 - Posted: 24 Oct 2013 | 8:33:44 UTC - in response to Message 33598.

Could be a driver issue, try updating to the latest (beta) driver.
____________
Greetings from TJ

Profile robertmiles
Send message
Joined: 16 Apr 09
Posts: 503
Credit: 755,434,080
RAC: 186,180
Level
Glu
Message 33655 - Posted: 28 Oct 2013 | 18:25:01 UTC - in response to Message 33596.

Unclear if these are crashing the driver; there is no message saying that it has.

Task http://www.gpugrid.net/result.php?resultid=7393220

Several of these errors, usually with the screen going black for about a second:

10/23/2013 6:21:36 PM | GPUGRID | Task trypsin_lig_12_2-NOELIA_RCrep_eq-0-1-RND5589_0 exited with zero status but no 'finished' file
10/23/2013 6:21:36 PM | GPUGRID | If this happens repeatedly you may need to reset the project.


I couldn't find the beta test drivers. However, when the 331.58 and 331.65 drivers came out, I installed them. No such crashes under either of these.

Profile robertmiles
Send message
Joined: 16 Apr 09
Posts: 503
Credit: 755,434,080
RAC: 186,180
Level
Glu
Message 33684 - Posted: 30 Oct 2013 | 20:26:08 UTC - in response to Message 33655.

Unclear if these are crashing the driver; there is no message saying that it has.

Task http://www.gpugrid.net/result.php?resultid=7393220

Several of these errors, usually with the screen going black for about a second:

10/23/2013 6:21:36 PM | GPUGRID | Task trypsin_lig_12_2-NOELIA_RCrep_eq-0-1-RND5589_0 exited with zero status but no 'finished' file
10/23/2013 6:21:36 PM | GPUGRID | If this happens repeatedly you may need to reset the project.


I couldn't find the beta test drivers. However, when the 331.58 and 331.65 drivers came out, I installed them. No such crashes under either of these.


Correction - the 331.65 driver makes such crashes less frequent (perhaps every other day), but does not stop them completely.

Profile robertmiles
Send message
Joined: 16 Apr 09
Posts: 503
Credit: 755,434,080
RAC: 186,180
Level
Glu
Message 33686 - Posted: 30 Oct 2013 | 21:53:46 UTC

Crashes less frequent with the 331.65 driver, but not gone. GPU workunits for other BOINC projects still running properly.

Another -97 error:
http://www.gpugrid.net/result.php?resultid=7410463

Another with repeated Windows crashes until I aborted the workunit:
http://www.gpugrid.net/result.php?resultid=7410730

That computer now has GPUGRID on No new tasks. I plan to leave it there until I install some update that looks likely to fix this problem.

My other computer (with a slower GPU, Windows 7, and probably an older driver) still has GPUGRID enabled, since it isn't showing this problem.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 33687 - Posted: 30 Oct 2013 | 21:56:48 UTC - in response to Message 33686.

some update that looks likely to fix this problem.


??? explain please ::grabs popcorn::

Profile robertmiles
Send message
Joined: 16 Apr 09
Posts: 503
Credit: 755,434,080
RAC: 186,180
Level
Glu
Message 33785 - Posted: 5 Nov 2013 | 20:15:34 UTC - in response to Message 33687.

some update that looks likely to fix this problem.


??? explain please ::grabs popcorn::


I'm waiting for an Nvidia driver release, a BOINC release, or a GPUGRID application release before I enable GPUGRID workunits on that computer again.

Profile robertmiles
Send message
Joined: 16 Apr 09
Posts: 503
Credit: 755,434,080
RAC: 186,180
Level
Glu
Message 33977 - Posted: 22 Nov 2013 | 1:57:41 UTC - in response to Message 33785.
Last modified: 22 Nov 2013 | 1:59:58 UTC

some update that looks likely to fix this problem.


??? explain please ::grabs popcorn::


I'm waiting for an Nvidia driver release, a BOINC release, or a GPUGRID application release before I enable GPUGRID workunits on that computer again.


I've now installed BOINC 7.2.28 on the computer with the problem. I tried installing the 331.82 Nvidia driver a few times; it never installed correctly. I'm back to the 331.65 driver.

GPUGRID workunits have run properly on that computer for the last few days, with no more driver crashes.

Profile robertmiles
Send message
Joined: 16 Apr 09
Posts: 503
Credit: 755,434,080
RAC: 186,180
Level
Glu
Message 33993 - Posted: 23 Nov 2013 | 19:27:51 UTC - in response to Message 33977.

some update that looks likely to fix this problem.


??? explain please ::grabs popcorn::


I'm waiting for an Nvidia driver release, a BOINC release, or a GPUGRID application release before I enable GPUGRID workunits on that computer again.


I've now installed BOINC 7.2.28 on the computer with the problem. I tried installing the 331.82 Nvidia driver a few times; it never installed correctly. I'm back to the 331.65 driver.

GPUGRID workunits have run properly on that computer for the last few days, with no more driver crashes.


This wasn't enough to fully fix the problem; however, the driver crashes no longer crash Windows as well. I'll watch to see if the new crashes cause enough problems that I need to put GPUGRID back on No new tasks on that computer.

Dagorath
Send message
Joined: 16 Mar 11
Posts: 509
Credit: 179,005,236
RAC: 0
Level
Ile
Message 34002 - Posted: 24 Nov 2013 | 0:56:55 UTC

I've been trying to crash NOELIA tasks on my 660 Ti by repeatedly (20x) suspending and resuming them, but I can't. This seems like a Windows bug to me. I suggest installing Linux to fix it, or a script that peeks at the names of the GPUGrid tasks in your cache every 60 secs and aborts them if they're NOELIA. Either way would be a path of lesser resistance leading to greater productivity and oneness with the Buddha ;)

____________
BOINC <<--- credit whores, pedants, alien hunters

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 34003 - Posted: 24 Nov 2013 | 1:12:40 UTC - in response to Message 34002.

The 8.14 application version fixed the problem that this thread described. It has been resolved for a while now.

Dagorath
Send message
Joined: 16 Mar 11
Posts: 509
Credit: 179,005,236
RAC: 0
Level
Ile
Message 34004 - Posted: 24 Nov 2013 | 1:24:50 UTC - in response to Message 34003.

But I couldn't crash them with the previous version either.

____________
BOINC <<--- credit whores, pedants, alien hunters

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 34013 - Posted: 24 Nov 2013 | 10:38:12 UTC - in response to Message 34004.

It was a Windows-only fix.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help
