Author |
Message |
|
Hi guys, the title says it all: all my CUDA apps are crashing while computing.
I checked all my temps and benchmarked my graphics card to see if I had a temperature problem. No problem with my 9800 GX2 (running at 80°C at 100% usage); everything is OK except BOINC CUDA computation.
I'm running Windows 7 64-bit and the 196.21 GPU drivers with BOINC Manager 6.10.18.
If someone could give me some help, thank you in advance (sorry for my poor English).
Apps which crash:
- SETI CUDA: 100% crashing (no WU finished)
- Einstein CUDA: 100% crashing (crashes the moment a second GPU is active)
- GPUGrid: 85% crashing. Some WUs complete 100%, but I am not able to compute on both GPUs; if I do, I get compute errors on all WUs. |
|
|
liveonc
Joined: 1 Jan 10 Posts: 292 Credit: 41,567,650 RAC: 0
|
Did you enable SLI? I don't own either a 9800 GX2 or a GTX 295, but I tried running my two GTX 260s in SLI once, before I read that it isn't supported. At that time I got 3 errors out of every 4 WUs.
____________
|
|
|
|
I disabled the SLI switch and I'll give it a try for the next 48 hours. Thanks for the reply and the help. |
|
|
Jeremy
Joined: 15 Feb 09 Posts: 55 Credit: 3,542,733 RAC: 0
|
Your CPU overclock is a bit aggressive. Does it pass Linpack/Intel Burn Test stress testing? I had issues with my overclock that only showed in GPU apps for some reason. Double-check your system stability if the SLI switch doesn't do anything for you. You need to be able to run FurMark and Intel Burn Test at the same time for 10-15 minutes without errors in either. Errors may not show in games or other BOINC apps, but they'll show in the CUDA apps.
I had to throw a little more voltage at my CPU to get the whole thing 100% stable.
____________
C2Q, GTX 660ti |
|
|
|
Thank you Jeremy. My system is not overclocked as much as displayed (3.6 GHz), and everything has tested rock-stable for a long time. I ran CPU burn tests, GPU burn tests, and memory tests. Everything seems OK, overclocked or not: I get the same computation errors with the system at stock clocks.
I just don't understand what is happening; I never had this problem before switching from Vista 64 to Windows 7 64. Maybe it's related to the GPU drivers, I just don't know. |
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
|
Perhaps one of your settings has changed, or needs to change?
Check that you have disabled "Use GPU When Computer is in Use", and make sure you use "Leave Applications in Memory While Suspended".
- Both are in BOINC Manager, Advanced View, Advanced Preferences: under Processor Usage, and then Disk and Memory Usage.
Most of your failures are with Beta tasks, and many are after a short time, so it looks worse than it is.
Are you running any other GPU tasks? If so, stick to one at a time (Especially for Beta tests)! You might even want to disable Betas in your online preferences.
One last thing to try: set BOINC to use 75% of the CPUs rather than 100% (only applicable if you are running CPU tasks as well as GPUGrid tasks).
Good Luck, |
|
|
Zydor
Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
I noticed you have 196.21 loaded. NVIDIA acknowledged a bug in 196.21 within days of release that prevented overclocking. It certainly stopped my 9800GTX in its tracks; it did not seem to affect everyone, but it was (is) widespread. It may be that some side effect of that is causing your issues. Pure guess of course, but given that the bug affects o/c, it's not impossible.
They released 196.34 days afterwards, a beta release whose only change is the fix for the o/c bug in the 196.21 WHQL driver. Maybe worth a shot.
Regards
Zy |
|
|
Jeremy
Joined: 15 Feb 09 Posts: 55 Credit: 3,542,733 RAC: 0
|
Just to clarify, that bug only affects software overclocks. If the overclock is burned into the BIOS the card won't be affected by that bug.
____________
C2Q, GTX 660ti |
|
|
liveonc
Joined: 1 Jan 10 Posts: 292 Credit: 41,567,650 RAC: 0
|
I must agree that 3.6 GHz is a bit aggressive on a Q6600, especially for 24/7 use. I only have mine up to 3.0 GHz. I've OC'd a Q6600 for 24/7 use up to 3.2 GHz, but unless you're liquid-cooled, I don't see how it's a good idea to OC that much, & the 9800 GX2 as well. I got errors, & lots of them, when I OC'd an 8800GT to more than Core Clock 700MHz (vs. 600MHz standard), Shader Clock 1728MHz (vs. 1500MHz standard) & 2000MHz (vs. 1800MHz standard).
____________
|
|
|
|
Thank you all for the answers.
Liveonc, my GPU isn't OCed, and my CPU is very well watercooled with top-quality watercooling parts. With the OCCT test that really makes your CPU burn (the Linpack test, I think), my CPU temperature stays, for the hottest core, at 57°C, which is very good under such a burn test.
I really don't think it's CPU related, but I built a new machine for testing purposes, and I am investigating where the problem is.
At this moment, I can say that my suspicion about a graphics driver issue under Windows 7 64 seems to be right. I restored a Ghost image of Vista and there are no more errors with this OS. |
|
|
|
I found the answer to my question about these compute errors.
Listen carefully, owners of dual-GPU video cards.
The problem seems to be driver related. GPU compute apps and BOINC detect two GPUs but seem to give the same GPU two jobs, which makes one of the two jobs crash.
To prevent such crashes, go to the NVIDIA control panel and deactivate the SLI switch. BOINC still recognizes two GPUs, and the driver then assigns one job to each GPU chip instead of two jobs to the same GPU. |
|
|
|
I always get:
11/03/2010 16:00:16 GPUGRID Output file g106-TONI_CAPBIND99SB-16-100-RND9786_0_1 for task g106-TONI_CAPBIND99SB-16-100-RND9786_0 absent
11/03/2010 16:00:16 GPUGRID Output file g106-TONI_CAPBIND99SB-16-100-RND9786_0_2 for task g106-TONI_CAPBIND99SB-16-100-RND9786_0 absent
11/03/2010 16:00:16 GPUGRID Output file g106-TONI_CAPBIND99SB-16-100-RND9786_0_3 for task g106-TONI_CAPBIND99SB-16-100-RND9786_0 absent
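When the event log fills up with lines like these, it can help to group the missing output files by task so you can see how many tasks are actually affected. A minimal Python sketch, assuming every line has exactly the format of the three lines quoted above; the function name `absent_outputs_by_task` is hypothetical and not part of BOINC:

```python
import re
from collections import defaultdict

# Pattern inferred only from the three log lines quoted in this thread;
# other BOINC versions may format these messages differently.
ABSENT_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<project>\S+) Output file (?P<file>\S+) "
    r"for task (?P<task>\S+) absent$"
)


def absent_outputs_by_task(lines):
    """Group missing output file names by (project, task)."""
    grouped = defaultdict(list)
    for line in lines:
        match = ABSENT_RE.match(line.strip())
        if match:  # lines that do not fit the pattern are ignored
            key = (match["project"], match["task"])
            grouped[key].append(match["file"])
    return dict(grouped)


TASK = "g106-TONI_CAPBIND99SB-16-100-RND9786_0"
log = [
    f"11/03/2010 16:00:16 GPUGRID Output file {TASK}_1 for task {TASK} absent",
    f"11/03/2010 16:00:16 GPUGRID Output file {TASK}_2 for task {TASK} absent",
    f"11/03/2010 16:00:16 GPUGRID Output file {TASK}_3 for task {TASK} absent",
]

missing = absent_outputs_by_task(log)
for (project, task), files in missing.items():
    print(f"{project} {task}: {len(files)} output file(s) absent")
```

Here all three messages belong to a single task, so this is one failed WU rather than three separate failures.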
|
|
|
|
I have the same problem here, but starting with Collatz WUs. I have two GPUs (GTX260 and 8800GT, so no SLI) and I get this error for all GPU tasks from every project. It seems that BOINC couldn't find a CUDA device. This happened without any interaction with the computer (the user was miles away :-)). Is it BOINC or driver related?
A reboot solves the problem!
BOINC 6.10.36, NVIDIA 196.21, Win7 64-bit |
|
|
DataC
Joined: 15 Feb 10 Posts: 9 Credit: 16,891,220 RAC: 0
|
Hi. I was told to post here, but I've seen many other people in this forum who have had similar problems and apparently never got them resolved...
Every work unit that downloads fails in 10 seconds or less. None of my other CUDA apps crash or fail, and I have tried just about every version of my video card driver that I can find: nada.
If anyone has any ideas, PLEASE let me know. Here are my system specs:
OS : Windows 7/Windows Server 2008 R2 Version 6.1 Build 7600
Number of processors : 2
Processor type : x86 Intel Pentium, Level 6, Revision 5898
CPU speed: 2898 MHz
Total physical memory: 4124988 KB
Available physical memory: 3037680 KB
Total virtual memory: 2097024 KB
Available virtual memory: 1940944 KB
Number of CUDA devices : 1
Device 0 : GeForce 9800 GT |
|
|
GDF (Volunteer moderator, Project administrator, Project developer, Project tester, Volunteer developer, Volunteer tester, Project scientist)
Joined: 14 Mar 07 Posts: 1957 Credit: 629,356 RAC: 0
|
Try installing the 196.75 driver.
gdf |
|
|
MarkJ (Volunteer moderator, Volunteer tester)
Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0
|
I had a couple die on me this morning.
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 295"
# Clock rate: 1.24 GHz
# Total amount of global memory: 939524096 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 1: "GeForce GTX 295"
# Clock rate: 1.24 GHz
# Total amount of global memory: 939524096 bytes
# Number of multiprocessors: 30
# Number of cores: 240
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 3e7 in file '../swan/swanlib_nv.cpp' in line 186.
GDF, have you guys tried the 197.13 drivers? Is there any point in updating to them? Currently I am running 196.21.
Now that the NDA with NVIDIA has expired, is it possible to use the CUDA 3.0 DLLs in the faint hope they will fix something?
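The stderr excerpt above can also be read mechanically. A minimal Python sketch, assuming the SWAN error code is printed in hexadecimal (`3e7` contains a letter, so it cannot be decimal); the regular expressions are guesses based on this one excerpt, not a documented format. Under that assumption 0x3e7 is 999, which the CUDA driver API header lists as `CUDA_ERROR_UNKNOWN`, so the code by itself says little about the root cause.

```python
import re

# Lines copied from the stderr excerpt in this post (property lines omitted).
stderr_text = """\
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 295"
# Device 1: "GeForce GTX 295"
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 3e7 in file '../swan/swanlib_nv.cpp' in line 186.
"""

# Patterns inferred from this excerpt only (assumption, not a spec).
COUNT_RE = re.compile(r"^# There are (\d+) devices supporting CUDA$", re.MULTILINE)
DEVICE_RE = re.compile(r'^# Device (\d+): "(.+)"$', re.MULTILINE)
ERROR_RE = re.compile(
    r"Cuda driver error ([0-9a-fA-F]+) in file '([^']+)' in line (\d+)"
)

declared = int(COUNT_RE.search(stderr_text).group(1))
devices = {int(index): name for index, name in DEVICE_RE.findall(stderr_text)}

error = ERROR_RE.search(stderr_text)
error_code = int(error.group(1), 16)  # assumes the code is hexadecimal
error_file = error.group(2)
error_line = int(error.group(3))

print(f"{declared} device(s) declared, {len(devices)} listed: {devices}")
print(f"driver error {error_code} at {error_file}:{error_line}")
```

Both halves of the GTX 295 are listed, so the driver did enumerate the card correctly before the task failed.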
____________
BOINC blog |
|
|