All WUs on GTX660 failing
| Author | Message |
|---|---|
|
Joined: 21 Mar 09 Posts: 35 Credit: 591,434,551 RAC: 0
|
Have just built a new system with a GTX660 with the intention of running GPUGRID (encouraged by another system with a GTX660 that is working well). So far all 4 WUs that I have got start processing but fail after between 2 and 10 minutes. I think they have all had the "Driver has recovered after stopped working" message in Windows, which I guess is an indication that the GPU hardware has failed. Both the old and new cards are factory overclocked (old one 1006 (1072 boost), new one 1033 (1098 boost)). Memory on both is 1502.3 (which I think is not overclocked). GPU Core clocks reported by GPU-Z when running are 1162.7 (new) and 1123.5 (old) respectively. The new 660 is fine running PrimeGrid (two concurrent - pegged at 99% busy according to GPU-Z). Environmentals on both systems seem fine (temp about 60 degrees). I tried reducing the clocks on the new card to the same level as the old one (1006 (1072 boost)). Still failed (but did run for nearly 10 mins - longer than the others). So does this look like I just need to keep backing off the factory overclock, or is there anything else to look at? Here are the two systems: old (working) system: 145220 new (failing) system: 155065 Thanks.. |
|
Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 |
Have just built a new system with a GTX660 with the intention of running GPUGRID (encouraged by another system with a GTX660 that is working well). So far all 4 WUs that I have got start processing but fail after between 2 and 10 minutes. I think they have all had the "Driver has recovered after stopped working" message in Windows, which I guess is an indication that the GPU hardware has failed. Both the old and new cards are factory overclocked (old one 1006 (1072 boost), new one 1033 (1098 boost)). Memory on both is 1502.3 (which I think is not overclocked). GPU Core clocks reported by GPU-Z when running are 1162.7 (new) and 1123.5 (old) respectively. The new 660 is fine running PrimeGrid (two concurrent - pegged at 99% busy according to GPU-Z). Environmentals on both systems seem fine (temp about 60 degrees). I'd suggest drivers. I notice the failing system has an ATI as well as an Nvidia card. From memory you had to install the ATI driver first, followed by Nvidia. Also get drivers from Nvidia.com or GeForce.com and do a clean install; don't rely on Windows to install drivers. BOINC blog |
|
Joined: 21 Mar 09 Posts: 35 Credit: 591,434,551 RAC: 0
|
Thanks Mark. All my systems have both ATI and NVIDIA cards and this hasn't caused me issues so far (at least not like this!), and all of the drivers are explicitly downloaded and installed. One difference is that the working system has 310.90 while the failing system has 314.07. I might try the older driver on the new system and see if that helps. |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
I'd suggest drivers. I notice the failing system has an ATI as well as an Nvidia card. From memory you had to install the ATI driver first, followed by Nvidia. Also get drivers from Nvidia.com or GeForce.com and do a clean install; don't rely on Windows to install drivers. Eight of my systems have both ATI/AMD and NVidia. It makes no difference which driver is installed first. What does make a difference on at least some AMD chipset MBs is that the ATI/AMD GPU should be in PCIe slot 0 (master) and the NVidia GPU(s) in other slots. |
|
Joined: 21 Mar 09 Posts: 35 Credit: 591,434,551 RAC: 0
|
The system is Intel chipset. Just tried with card set to reference clocks (980/1033). No help - WU crashed in less than 2 minutes. Will try to install different drivers and see whether that helps... |
|
Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0
|
Driver resets are a known issue in windows that few people know about. There is a simple regedit that should cure the driver timeouts. I have used it on all of my ATI cards and it has worked perfectly. I don't see any reason why it shouldn't work on Nvidia cards too. I posted it once before but I don't know if anyone here tried it. I can post it again if you'd like to try it. The regedit will not do anything else to your system if it doesn't work. |
|
Joined: 21 Mar 09 Posts: 35 Credit: 591,434,551 RAC: 0
|
Well it's looking like hardware. I swapped the two cards and the original (that works fine in the old system) is working fine in the new system, while the new card (that was failing in the new system) is also failing in the old system (errored the GPUGRID WU that was in process within a minute or two). ...Now to try to explain this to try to organise a replacement... Thanks to all for your help. |
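The card swap described above is a small two-by-two isolation test: a fault that follows the card points at the card, and a fault that stays with the machine points at the system. A minimal Python sketch of that reasoning follows. The function name and the labels `old_660`, `new_660`, `old_system` and `new_system` are hypothetical, chosen only for illustration.

```python
def diagnose_swap(results):
    """Classify a fault from swap-test results.

    `results` maps (card, system) -> True if the work unit ran OK,
    False if it failed. All names here are illustrative assumptions.
    """
    cards = sorted({card for card, _ in results})
    systems = sorted({system for _, system in results})

    # A card is suspect if it fails in every system it was tried in.
    bad_cards = [
        c for c in cards
        if all(not ok for (card, _), ok in results.items() if card == c)
    ]
    # A system is suspect if every card tried in it fails.
    bad_systems = [
        s for s in systems
        if all(not ok for (_, system), ok in results.items() if system == s)
    ]

    if bad_cards and not bad_systems:
        return "card: " + ", ".join(bad_cards)
    if bad_systems and not bad_cards:
        return "system: " + ", ".join(bad_systems)
    return "inconclusive"


# The outcomes reported in the post above.
observed = {
    ("old_660", "old_system"): True,
    ("old_660", "new_system"): True,
    ("new_660", "new_system"): False,
    ("new_660", "old_system"): False,
}
print(diagnose_swap(observed))
```

With the outcomes reported in the post, the failure follows the new card into both machines, so the sketch blames the card rather than either system.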
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
Driver resets are a known issue in windows that few people know about. There is a simple regedit that should cure the driver timeouts. I have used it on all of my ATI cards and it has worked perfectly. I don't see any reason why it shouldn't work on Nvidia cards too. I posted it once before but I don't know if anyone here tried it. I can post it again if you'd like to try it. The regedit will not do anything else to your system if it doesn't work. Sure, post it, and thanks in advance. |
|
Joined: 26 Aug 11 Posts: 100 Credit: 2,863,609,686 RAC: 292
|
Well it's looking like hardware. I swapped the two cards and the original (that works fine in the old system) is working fine in the new system, while the new card (that was failing in the new system) is also failing in the old system (errored the GPUGRID WU that was in process within a minute or two). I had a similar problem with a factory overclocked GTX660Ti - it was failing on both my systems. I managed to get Amazon to swap it for a standard clocked card and the failures stopped. |
|
Joined: 21 Mar 09 Posts: 35 Credit: 591,434,551 RAC: 0
|
Thanks - I'll certainly try to negotiate something like that (this card is even failing when I set it back to reference clocks!). The reason I got the card was not because of its overclock, but because it has good cooling. Interestingly, the card has actually started working now, but I think this is related to the type of WU: all of the failing WUs were NOELIA_klebe, which generate TDP % well into the 90's according to GPU-Z. The WU that is working OK (so far) is a NOELIA-1MG, which GPU-Z reports as 98% GPU busy but with TDP % only in the mid-60's. It also looks like this will take twice as long to run as the _klebe_'s - no way to get the 24 hour bonus here! For comparison, TDP % is around the mid-70's running a pair of PRIMEGRID PPS Sieves concurrently. @nanoprobe - I'll try the timer delay registry setting - thanks. Only some of the failures have been accompanied by the "Stopped Responding" message, however - the rest have just quietly stopped. |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
98% GPU usage on W7 sounds high and 60% TDP sounds low. What's the memory controller load? 1%, perchance? |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
Try to run some fairly recent 3D Mark - if this is not stable you've certainly got a problem. And one which is easy to reproduce for the RMA guys. MrS Scanning for our furry friends since Jan 2002 |
|
Joined: 12 Dec 11 Posts: 34 Credit: 86,423,547 RAC: 0
|
Is this a machine you are dedicating to GPUGRID? You may have a hardware issue, but seeing as Windows 7 is the least efficient OS for the project and you are having potential issues with the software, I would encourage you to try Linux. This would help narrow down if it is a software or hardware issue, and if it succeeds, you will have a very efficient dedicated machine for GPUGRID. |
|
Joined: 21 Mar 09 Posts: 35 Credit: 591,434,551 RAC: 0
|
Yes the memory load was 1% (also on the WU running now on the "good" 660 - it's been running for almost 24 hours and is saying it is 59% complete). Looks like you have seen or heard of this before?! The WU eventually failed after more than 1000,000 seconds with: <core_client_version>7.0.64</core_client_version> <![CDATA[ <message> The system cannot find the path specified. (0x3) - exit code 3 (0x3) </message> <stderr_txt> MDIO: cannot open file "output.restart.coor" SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574. Assertion failed: a, file swanlibnv2.cpp, line 59 This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information. </stderr_txt> ]]> |
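For anyone comparing several failed tasks, the key line in the stderr output above can be pulled out mechanically. The Python sketch below is only an illustration: the function name is hypothetical and the pattern is inferred from the single log line in this post, so other application versions may word the message differently. In the CUDA driver API, code 999 is `CUDA_ERROR_UNKNOWN`, which does not identify a specific cause on its own.

```python
import re

# From the CUDA driver API headers: 999 is CUDA_ERROR_UNKNOWN.
# Only the code seen in this thread is listed here.
CUDA_DRIVER_ERRORS = {999: "CUDA_ERROR_UNKNOWN"}

# Pattern inferred from the one log line quoted in the post (an assumption).
_PATTERN = re.compile(
    r"Cuda driver error (\d+) in file '([^']+)' in line (\d+)"
)


def parse_cuda_error(stderr_text):
    """Return the error code, symbolic name, source file and line, or None."""
    match = _PATTERN.search(stderr_text)
    if match is None:
        return None
    code = int(match.group(1))
    return {
        "code": code,
        "name": CUDA_DRIVER_ERRORS.get(code, "unlisted"),
        "file": match.group(2),
        "line": int(match.group(3)),
    }


sample = (
    'MDIO: cannot open file "output.restart.coor" '
    "SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574."
)
print(parse_cuda_error(sample))
```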
|
Joined: 18 Jun 12 Posts: 297 Credit: 3,572,627,986 RAC: 0
|
The WU eventually failed after more than 1000,000 seconds That sucks, what a waste of time and power. You ran that work unit for 11.5 days? Or maybe you misplaced a decimal point, that really sucks. I think it's 111,587.05, still that's 31 hours. |
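The figures in this exchange are easy to check. A short Python sketch (the helper name is made up for illustration) converts the run times mentioned in the thread from seconds into hours and days:

```python
def describe_seconds(seconds):
    """Convert a run time in seconds to (hours, days)."""
    hours = seconds / 3600
    days = hours / 24
    return hours, days


# Run times mentioned in the thread, in seconds.
for s in (1_000_000, 111_587.05, 100_000):
    hours, days = describe_seconds(s)
    print(f"{s:,.2f} s = {hours:.1f} h = {days:.2f} days")
```

One million seconds comes to roughly 11.6 days, while 111,587.05 seconds is about 31 hours, which matches the numbers quoted in the reply.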
|
Joined: 21 Mar 09 Posts: 35 Credit: 591,434,551 RAC: 0
|
Whoops my mistake - should have been 100,000 secs. Sorry.. |
|
Joined: 21 Mar 09 Posts: 35 Credit: 591,434,551 RAC: 0
|
Hi MrS, I ran the test from the free version of 3DMark 11. It ran fine, with results in the midrange of those with the same processor and graphics card. Also ran FurMark. It also seemed to run fine. It actually ran with TDP approaching 110%, and you could see that the card had reduced both core clock and voltage. Temps stabilised at 68 degrees with factory fan settings. From memory this would appear to be stressing the card more than GPUGrid was doing when the WUs failed. |
|
Joined: 21 Mar 09 Posts: 35 Credit: 591,434,551 RAC: 0
|
Just read the thread "Current Noelia WUs" and that explains some of what I am seeing (long-running Noelia_MGs), but of course my 660s are 2GB. It does explain the fact that the Noelia_MGs took twice as long to complete on my (1 GB) 560 Ti. I tried uninstalling the drivers (again) and reinstalled (310.90). Fetched another WU - got a Noelia_klebe. It actually ran for 15 minutes before crashing (maybe a bit of FurMark helped burn it in :-) ). Environmentals looked OK (certainly much less than FurMark) - GPU 93%, TDP 90%, Mem Controller 34%, Temp 65 degrees. Then I turned off long tasks and got a short WU. It also crashed, just shy of 10 minutes: 145220. I'm just about out of ideas here.. |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
I find the 314 driver to be more reliable than the 310 driver on a GTX660 (W7), and I'm not keen on the 320.x drivers, nor would you be if you read the release notes. BTW, that 145220 WU failed on another system and has not been resent so far. |
|
Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0
|
Driver resets are a known issue in windows that few people know about. There is a simple regedit that should cure the driver timeouts. I have used it on all of my ATI cards and it has worked perfectly. I don't see any reason why it shouldn't work on Nvidia cards too. I posted it once before but I don't know if anyone here tried it. I can post it again if you'd like to try it. The regedit will not do anything else to your system if it doesn't work. Been away for a few days and I'm at work now. I'll post it when I get home this evening. |
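The registry edit itself is never posted in this thread, so what follows is an assumption about what was meant, not the poster's actual fix. The Windows setting that governs how long the OS waits before resetting an unresponsive display driver is `TdrDelay`, a DWORD under `HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers`, with a default of 2 seconds. The Python sketch below (the function name is hypothetical) only builds the text of a `.reg` file and does not touch the registry; the value of 8 seconds is an arbitrary example.

```python
def tdr_delay_reg(seconds):
    """Return .reg file text that sets TdrDelay to `seconds`.

    This only generates text; importing it into the registry is a
    separate, manual step and normally needs a reboot to take effect.
    """
    if not isinstance(seconds, int) or not 0 < seconds <= 0xFFFFFFFF:
        raise ValueError("seconds must be a positive 32-bit integer")
    return "\n".join([
        "Windows Registry Editor Version 5.00",
        "",
        r"[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]",
        # DWORD values in .reg files are written as 8 hex digits.
        f'"TdrDelay"=dword:{seconds:08x}',
        "",
    ])


print(tdr_delay_reg(8))
```

A longer timeout only hides resets caused by slow kernels. It would not be expected to help a card that fails in two different machines, as reported earlier in the thread.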