SANTI WU Killed My GPU
| Author | Message |
|---|---|
|
Joined: 21 Feb 09 Posts: 497 Credit: 700,690,702 RAC: 0
|
Win7 Home, ASUS GTX 660, GPUGrid 24/7: Overnight my PC died. Restarted. When BOINC came up I got three black screens, a beep and a STOP 116. Restarted in safe mode w/ networking. Found the STOP was about the GPU. Unchecked the two BOINC items in msconfig and did a normal boot. All was well. Started boincmgr manually. It died in the same way. Rebooted. Found about 15 files relating to the current WU and deleted them. Started boincmgr. The current WU was still there but I just had time to do a suspend. Immediately the WU errored and I got a new WU. All is now working properly. Here is the offending WU. What the heck happened?? |
|
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0
|
I don't know what happened, but I have seen that many, many times on my 660 with Santi SRs, so I only run LRs on the 660 now. However, today I also had a fatal CUDA driver error, but in my case it only caused the GPU clock to downclock. I did a reboot and it is okay again. It seems that in your case the error was not handled properly and BOINC got into a sort of loop. Luckily you were able to find the offending files and delete them. Greetings from TJ |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
"Here is the offending WU. What the heck happened??" I see you are still on the 327.23 drivers. Why not try 331.65? I think they implement a later version of CUDA, and they work fine on my GTX 660s on the Longs, including many Santis. |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
|
|
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0
|
"Sounds familiar to this problem, doesn't it?" I don't think so, as there was no power outage in my case; I cannot speak for Tomba. And even when one of my PSUs burnt out, the main fuse blew, and all PCs were abruptly shut down, the GPUGRID WU started nicely without problems after I rebooted the systems. So I think this is caused by something in the Santi WUs, as I have seen it only with these. And I watch my systems closely. Greetings from TJ |
|
Joined: 9 Jul 10 Posts: 1 Credit: 296,167,092 RAC: 0
|
In the last 24 hours I have had a very similar issue where the nVidia kernel keeps crashing when doing GPU units. Today I've removed the nVidia drivers and cleaned my system of any profiles or configs left over from them. Then I reloaded the latest 331.65 drivers for my dual GTX 670s in SLI mode and suspended all GPU jobs. Then I re-enabled them one at a time (only running GPU for S@H and GPUGRID); the S@H ones have been running for over an hour now. I then disabled S@H and tried it with GPUGRID, and within seconds the screen locked and then reported the kernel crash from nVidia. Thankfully I could re-suspend the GPU and gain control again. I suspended the active GPU unit and allowed another one to start working, and it has not locked up since. I've now aborted the offending unit. As the title of this thread says, it was a SANTI that was causing the issue, more precisely this one: I223-SANTI_baxbim2-21-32-RND7738_0 http://www.gpugrid.net/workunit.php?wuid=4919966 The interesting thing now is that I have another SANTI one working just fine: I104-SANTI_baxbim2-23-32-RND3377_0, but we'll wait and see over the next 24 hours. Storm |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
Storm wrote: Today I've removed the nVidia drivers and cleaned my system of any profile or configs left from them. Then reloaded the latest 331.65 drivers for my dual 670 GTX in SLI mode and suspended all GPU jobs. FYI: SLI is not recommended while crunching on the GPUs. |
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
New Beta app coming later today that should help with this. MJH |
|
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0
|
Yesterday another Santi LR resulted in this error: SWAN : FATAL : Cuda driver error 715 in file 'swanlibnv2.cpp' in line 1969. As a result the GPU clock was downclocked, so I needed to reboot again. There was no power failure or power cut. It is something with Santi's WUs and a GTX660, as I have not yet seen this error on my GTX770, which has been running since August. Greetings from TJ |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
"It is something with Santi's WU and a GTX660, as I have not yet seen this error on my GTX770, running since August." Maybe I have just been lucky, but ever since implementing my under-clocking and over-volting fix (http://www.gpugrid.net/forum_thread.php?id=3466&nowrap=true#33677), I have not had an error, or even an instance of slow-running. The GTX 660s may be susceptible to problems if, for example, they are factory overclocked more, or don't have heatsinks as large as the others relative to their heat output. But it appears that they can be fixed, and there is nothing inherently wrong with the chip itself for running the current work units. By the way, I now suspect that the slow-running is triggered by hitting the GPU power limit, which probably causes the self-protective circuitry in the chip to reduce its clock rate. Then it never resumes the higher rate until you reboot. Increasing the power limit helps avoid this problem, as long as you monitor the resulting temperature (the limit is there for a reason). |
|
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0
|
My 660 is not factory overclocked, and I have set the clock speed a little lower than what it would be without intervention. The temperature is okay, I think, as it runs steadily at 63°C. My 660 has also run 21 days continuously without any error, so indeed, as you say, Jim, the 660 is okay for this project. Greetings from TJ |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
Yes, you look good. |
|
Joined: 14 Oct 11 Posts: 31 Credit: 81,420,504 RAC: 0
|
Just chipping in with a "me too" post: GTX 660, 311.06 drivers, 112x-SANTI_MAR419cap-0-8-RND0309. The driver crashed about 4 times, followed by a spontaneous reboot. It did that twice before I was able to locate and terminate this WU through the BOINC GUI. (The GPU was underclocked during the 2nd attempt.) |
Coleslaw Joined: 24 Jul 08 Posts: 36 Credit: 363,857,679 RAC: 0
|
I have a box with two GT 430s that is having this problem: Q8200 CPU, 4 GB RAM, Win 7 Pro x64. Other GPU projects run fine. I also have a BETA from GPUGrid running on it OK. It is when I resume the SANTI work units that I get a BSOD after the drivers crash a few times. This happened with driver version 331.82. And of course I kill the BETA by forgetting to exit BOINC when I'm running some updates.
|
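
For anyone comparing failures like the ones reported in this thread, it may help to pull the error code, source file and line number out of a task's output so reports can be lined up side by side. The Python below is only a rough sketch: the pattern is modeled on the single error line TJ quoted above, and the names `SWAN_FATAL`, `SwanError` and `parse_swan_fatal` are made up for illustration. Other SWAN messages may well be worded differently.

```python
import re
from typing import NamedTuple, Optional

# Assumption: the pattern is modeled only on the one error line quoted in
# this thread; other SWAN messages may not match it.
SWAN_FATAL = re.compile(
    r"SWAN\s*:\s*FATAL\s*:\s*Cuda driver error (?P<code>\d+) "
    r"in file '(?P<file>[^']+)' in line (?P<line>\d+)"
)


class SwanError(NamedTuple):
    """Fields extracted from one SWAN fatal error line (hypothetical type)."""
    code: int
    source_file: str
    line: int


def parse_swan_fatal(text: str) -> Optional[SwanError]:
    """Return the first SWAN fatal CUDA driver error found in text, or None."""
    match = SWAN_FATAL.search(text)
    if match is None:
        return None
    return SwanError(int(match["code"]), match["file"], int(match["line"]))


# The error line exactly as quoted earlier in the thread.
sample = "SWAN : FATAL : Cuda driver error 715 in file 'swanlibnv2.cpp' in line 1969."
print(parse_swan_fatal(sample))
```

Feeding it the saved output of several failed tasks would show whether the same error code and source line keep turning up across different cards and driver versions.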
©2025 Universitat Pompeu Fabra