Message boards : Graphics cards (GPUs) : All WUs on GTX660 failing

Author Message
dyeman
Joined: 21 Mar 09
Posts: 35
Credit: 591,434,551
RAC: 0
Message 31427 - Posted: 12 Jul 2013 | 11:40:36 UTC

I have just built a new system with a GTX660, intending to run GPUGRID (encouraged by another system with a GTX660 that is working well). So far all 4 WUs I have received start processing but fail after 2 to 10 minutes. I think they have all produced the "Driver has recovered after stopped working" message in Windows, which I guess indicates that the GPU hardware has failed. Both the old and new cards are factory overclocked: the old one at 1006 (1072 boost), the new one at 1033 (1098 boost). Memory on both is 1502.3 (which I think is not overclocked). GPU core clocks reported by GPU-Z when running are 1162.7 (new) and 1123.5 (old). The new 660 is fine running PrimeGrid (two tasks concurrently, pegged at 99% busy according to GPU-Z). Environmentals on both systems seem fine (temperature about 60 degrees).

I tried reducing the clocks on the new card to the same level as the old one (1006 (1072 boost)). Still failed (but did run for nearly 10 mins - longer than the others).

So does this look like I just need to keep backing off the factory overclock, or is there anything else to look at? Here are the two systems:

old (working) system: 145220
new (failing) system: 155065

Thanks..

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 31429 - Posted: 12 Jul 2013 | 12:47:49 UTC - in response to Message 31427.

I'd suggest drivers. I notice the failing system has an ATI card as well as an Nvidia card. From memory, you had to install the ATI driver first, followed by the Nvidia one. Also get the drivers from Nvidia.com or GeForce.com and do a clean install; don't rely on Windows to install drivers.

____________
BOINC blog

dyeman
Message 31433 - Posted: 12 Jul 2013 | 14:35:44 UTC

Thanks Mark,
All my systems have both ATI and NVIDIA cards and this hasn't caused me issues so far (at least not like this!), and all of the drivers are explicitly downloaded and installed. One difference is that the working system has driver 310.90 while the failing system has 314.07. I might try the older driver on the new system and see if that helps.

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 31439 - Posted: 12 Jul 2013 | 18:17:40 UTC - in response to Message 31429.
Last modified: 12 Jul 2013 | 18:18:46 UTC

Eight of my systems have both ATI/AMD and NVidia cards. It makes no difference which driver is installed first. What does make a difference, on at least some AMD chipset motherboards, is that the ATI/AMD GPU should be in PCIe slot 0 (master) and the NVidia GPU(s) in the other slots.

dyeman
Message 31453 - Posted: 13 Jul 2013 | 1:49:39 UTC - in response to Message 31439.

The system is Intel chipset.

Just tried with card set to reference clocks (980/1033). No help - WU crashed in less than 2 minutes.

Will try to install different drivers and see whether that helps...

nanoprobe
Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 31454 - Posted: 13 Jul 2013 | 3:39:03 UTC

Driver resets are a known issue in Windows that few people know about. There is a simple registry edit that should cure the driver timeouts. I have used it on all of my ATI cards and it has worked perfectly. I don't see any reason why it shouldn't work on Nvidia cards too. I posted it once before but I don't know if anyone here tried it. I can post it again if you'd like to try it. If it doesn't work, the registry edit will not do anything else to your system.

dyeman
Message 31457 - Posted: 13 Jul 2013 | 9:40:49 UTC - in response to Message 31454.

Well it's looking like hardware. I swapped the two cards and the original (that works fine in the old system) is working fine in the new system, while the new card (that was failing in the new system) is also failing in the old system (errored the GPUGRID WU that was in process within a minute or two).

...Now to work out how to explain this so I can organise a replacement...

Thanks to all for your help.

Beyond
Message 31463 - Posted: 13 Jul 2013 | 11:56:14 UTC - in response to Message 31454.

Sure, post it, and thanks in advance.

TheFiend
Joined: 26 Aug 11
Posts: 100
Credit: 2,569,652,477
RAC: 2,368,022
Message 31465 - Posted: 13 Jul 2013 | 13:06:16 UTC - in response to Message 31457.

I had a similar problem with a factory overclocked GTX660Ti - it was failing on both my systems. I managed to get Amazon to swap it for a standard clocked card and the failures stopped.

dyeman
Message 31494 - Posted: 14 Jul 2013 | 5:39:57 UTC - in response to Message 31465.

Thanks - I'll certainly try to negotiate something like that (this card is even failing when I set it back to reference clocks!). The reason I got the card was not its overclock but its good cooling.

Interestingly, the card has actually started working now, but I think this is related to the type of WU: all of the failing WUs were NOELIA_klebe, which generate TDP % well into the 90s according to GPU-Z. The WU that is working OK (so far) is a NOELIA-1MG, which GPU-Z reports as 98% GPU busy but with TDP % only in the mid-60s. It also looks like this will take twice as long to run as the _klebe_ WUs - no way to get the 24 hour bonus here! For comparison, TDP % is around the mid-70s when running a pair of PrimeGrid PPS Sieve tasks concurrently.

@nanoprobe - I'll try the timer delay registry setting - thanks. Only some of the failures have been accompanied by the "Stopped Responding" message, however - the rest have just quietly stopped.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 31499 - Posted: 14 Jul 2013 | 10:43:04 UTC - in response to Message 31494.

98% GPU usage on W7 sounds high and 60% TDP sounds low.
What's the memory controller load?
1%, perchance?
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 31509 - Posted: 14 Jul 2013 | 14:28:38 UTC

Try to run some fairly recent 3D Mark - if this is not stable you've certainly got a problem. And one which is easy to reproduce for the RMA guys.

MrS
____________
Scanning for our furry friends since Jan 2002

matlock
Joined: 12 Dec 11
Posts: 34
Credit: 86,423,547
RAC: 0
Message 31521 - Posted: 14 Jul 2013 | 19:02:14 UTC

Is this a machine you are dedicating to GPUGRID? You may have a hardware issue, but seeing as Windows 7 is the least efficient OS for the project and you are having potential issues with the software, I would encourage you to try Linux. This would help narrow down if it is a software or hardware issue, and if it succeeds, you will have a very efficient dedicated machine for GPUGRID.

dyeman
Message 31531 - Posted: 15 Jul 2013 | 1:16:40 UTC - in response to Message 31499.
Last modified: 15 Jul 2013 | 1:20:11 UTC

Yes the memory load was 1% (also on the WU running now on the "good" 660 - it's been running for almost 24 hours and is saying it is 59% complete). Looks like you have seen or heard of this before?!

The WU eventually failed after more than 1000,000 seconds with:

<core_client_version>7.0.64</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified.
(0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
MDIO: cannot open file "output.restart.coor"
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574.
Assertion failed: a, file swanlibnv2.cpp, line 59

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

</stderr_txt>
]]>
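
When several failed tasks need comparing, the key fields of the SWAN line in a stderr dump like the one above can be pulled out with a regular expression. The following is only a minimal sketch: the pattern is inferred from the single line quoted in this post, so other application versions may word the message differently, and `parse_swan_error` is a hypothetical helper name. Code 999 is, as far as is known, the CUDA driver API's generic "unknown error" value, so on its own it does not separate a hardware fault from a driver fault.

```python
import re

# Pattern based on the one SWAN line quoted in the post above (an assumption:
# other application versions may format the message differently).
SWAN_ERROR = re.compile(
    r"SWAN : FATAL : Cuda driver error (?P<code>\d+) "
    r"in file '(?P<file>[^']+)' in line (?P<line>\d+)"
)


def parse_swan_error(stderr_text):
    """Return (code, file, line) for the first SWAN fatal error, or None."""
    match = SWAN_ERROR.search(stderr_text)
    if match is None:
        return None
    return int(match["code"]), match["file"], int(match["line"])


sample = """MDIO: cannot open file "output.restart.coor"
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574.
Assertion failed: a, file swanlibnv2.cpp, line 59"""

print(parse_swan_error(sample))
```

Running the same function over the stderr of each failed task would show whether they all die at the same source line or in different places.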

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Message 31532 - Posted: 15 Jul 2013 | 3:09:43 UTC - in response to Message 31531.

"The WU eventually failed after more than 1000,000 seconds"


That sucks, what a waste of time and power. You ran that work unit for 11.5 days? Or maybe you misplaced a decimal point, that really sucks.

I think it's 111,587.05, still that's 31 hours.
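
The two figures being compared here are easy to check. This is a small sketch of the arithmetic only; `describe_runtime` is a hypothetical helper, and the two inputs are the numbers quoted in the posts above.

```python
def describe_runtime(seconds):
    """Convert a run time in seconds to (hours, days)."""
    hours = seconds / 3600
    days = hours / 24
    return hours, days


for label, secs in [("as first posted", 1_000_000), ("flashawk's figure", 111_587.05)]:
    hours, days = describe_runtime(secs)
    print(f"{label}: {secs:,.2f} s = {hours:.1f} h = {days:.2f} days")
```

One million seconds works out to roughly 11.6 days, while 111,587.05 seconds is about 31 hours, which matches the estimates in the post.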

dyeman
Message 31533 - Posted: 15 Jul 2013 | 3:24:35 UTC - in response to Message 31532.

Whoops my mistake - should have been 100,000 secs. Sorry..

dyeman
Message 31534 - Posted: 15 Jul 2013 | 5:05:40 UTC - in response to Message 31509.

Hi MrS,
I ran the test from the free version of 3DMark 11. It ran fine, with results in the midrange of those with the same processor and graphics card. I also ran FurMark, which seemed to run fine too. It actually ran with TDP approaching 110%, and you could see that the card had reduced both core clock and voltage. Temps stabilised at 68 degrees with factory fan settings. From memory, this would appear to be stressing the card more than GPUGrid was doing when the WUs failed.

dyeman
Message 31538 - Posted: 15 Jul 2013 | 10:06:10 UTC - in response to Message 31499.

Just read the thread "Current Noelia WUs" and that explains some of what I am seeing (long-running Noelia_MGs), but of course my 660s are 2GB. It does explain the fact that the Noelia_MGs took twice as long to complete on my (1 GB) 560 Ti.

I tried uninstalling the drivers (again) and reinstalled 310.90. Fetched another WU - got a Noelia_klebe. It actually ran for 15 minutes before crashing (maybe a bit of FurMark helped burn it in :-) ). Environmentals looked OK (certainly much less than FurMark) - GPU 93%, TDP 90%, Mem Controller 34%, Temp 65 degrees.

Then I turned off long tasks and got a short WU. It also crashed, just shy of 10 minutes: 145220

I'm just about out of ideas here..



skgiven
Volunteer moderator
Volunteer tester
Message 31541 - Posted: 15 Jul 2013 | 14:17:31 UTC - in response to Message 31538.
Last modified: 15 Jul 2013 | 14:18:44 UTC

I find the 314 driver to be more reliable than the 310 driver on a GTX660 (W7), and I'm not keen on the 320.x drivers and nor would you be if you read the release notes.

BTW. That 145220 WU failed on another system and was not resent, so far.

nanoprobe
Message 31542 - Posted: 15 Jul 2013 | 15:30:23 UTC - in response to Message 31463.

Been away for a few days and I'm at work now. I'll post it when I get home this evening.

dyeman
Message 31552 - Posted: 16 Jul 2013 | 2:49:26 UTC - in response to Message 31541.

I'm trying another tack here - I wonder if there is some bad memory on the card? After the short WU failed, I underclocked the memory a bit. The next short WU worked OK! There were a couple after that which did not, but they have failed on other machines also. I've lowered the memory clock a bit more and will see how it goes. I've found a few memory testing tools which I will try also - if there's a memory problem, hopefully I can find a way to show it reliably so that the seller can verify the problem.

I notice that there are two BIOS versions for this card - one for cards with Hynix memory, the other for cards with Samsung memory. My card appears to be Samsung.

dyeman
Message 31569 - Posted: 17 Jul 2013 | 0:57:51 UTC - in response to Message 31552.

Ran some memory tests (memtestcl and memtestG80) with no errors (but probably need to run for a day or two to do a thorough check). Then set the card to reference GPU core clock (980/1045), and memory underclocked from 6008 to 5494. It has processed a NOELIA klebe successfully, and is now nearly 2 hours into a NOELIA 7MG (that is working properly with 38% memory controller load).

I also installed the 320.49 driver but this doesn't appear to have made any difference one way or the other (WUs were still crashing before I underclocked the memory).
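
To put the clock changes in this post into proportion, the reductions can be expressed as percentages. This is a rough sketch only: `percent_reduction` is a hypothetical helper, and the inputs are the figures quoted in the thread (memory 6008 to 5494, core 1033 factory to 980 reference), assumed to be in the same units on each side.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


# Figures as quoted in the thread; units assumed consistent within each pair.
print(f"memory: {percent_reduction(6008, 5494):.1f}% below the starting clock")
print(f"core:   {percent_reduction(1033, 980):.1f}% below the factory overclock")
```

On these numbers the memory underclock is about 8.6%, noticeably larger than the roughly 5.1% step from the factory core clock down to reference.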

dskagcommunity
Joined: 28 Apr 11
Posts: 460
Credit: 842,161,339
RAC: 1,630,920
Message 31587 - Posted: 17 Jul 2013 | 13:18:00 UTC
Last modified: 17 Jul 2013 | 13:19:23 UTC

Hm, that's normal here. If WUs begin to fail, overvolting by 0.025 V and underclocking the memory by 100 MHz was the recommended solution here in the forum. It has worked well for me for a long time now.
____________
DSKAG Austria Research Team: http://www.research.dskag.at



nanoprobe
Message 31602 - Posted: 17 Jul 2013 | 22:36:44 UTC - in response to Message 31463.

Sorry for the delay. Copy and paste the entire code below (including the Windows Registry Editor Version 5.00 line) into Notepad. Save it as fix.reg, or any other name you like as long as it ends with the .reg extension.
After saving it, right-click on it and open it with Registry Editor. You'll get warnings about editing the registry. Just click Yes and the entries will be added to your registry. Reboot and you should be good to go. This should stop the "driver has stopped responding" messages and the WU errors that occur when the driver restarts.


Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog]
"DisableBugCheck"="1"

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display]
"EaRecovery"="0"
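
For anyone who would rather generate the file than paste it by hand, the same text can be produced with a short script. This is a sketch under stated assumptions: `build_reg_file` is a hypothetical helper, the key paths and values are copied from the post above, and plain text with CRLF line endings is assumed to be acceptable to Registry Editor. Nothing in the thread confirms whether these Watchdog values change driver timeout behaviour on a given Windows version, so treat it as untested.

```python
def build_reg_file(entries):
    """Render {key_path: {value_name: string_data}} as .reg file text."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key_path, values in entries.items():
        lines.append(f"[{key_path}]")
        for name, data in values.items():
            lines.append(f'"{name}"="{data}"')
        lines.append("")  # blank line after each key section
    return "\r\n".join(lines)


# Key paths and values copied from the post above.
WATCHDOG = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog"
entries = {
    WATCHDOG: {"DisableBugCheck": "1"},
    WATCHDOG + r"\Display": {"EaRecovery": "0"},
}

reg_text = build_reg_file(entries)

# newline="" keeps the explicit CRLF line endings exactly as built.
with open("fix.reg", "w", newline="") as handle:
    handle.write(reg_text)
```

The script only writes fix.reg to the current directory; it does not touch the registry, so the file still has to be imported manually as described in the post.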

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 31606 - Posted: 17 Jul 2013 | 23:20:11 UTC - in response to Message 31531.

I run 320.18 and 320.49 with the GTX660s, and the errors are Noelias or a short run from Santi. Almost no Nathans error out.
Some WUs have an MCU load of 1% and took 50 hours on my rig, but without error. I guess it's luck whether the Noelias finish or error. I don't make all the fuss with the drivers, as other projects' WUs seem to do well with them.
____________
Greetings from TJ
