Message boards : Graphics cards (GPUs) : suddenly too many errors

Viktor Svantner
Joined: 13 Feb 11
Posts: 24
Credit: 5,999,975,589
RAC: 2,022,157
Level
Tyr
Message 42037 - Posted: 27 Oct 2015 | 21:26:41 UTC
Last modified: 27 Oct 2015 | 21:28:48 UTC

I have recently had too many errors on my 24/7 rig with three GTX 970s:
https://www.gpugrid.net/results.php?userid=73475

Links:
https://www.gpugrid.net/result.php?resultid=14642662
https://www.gpugrid.net/result.php?resultid=14642713

My 2nd 24/7 rig, with two GTX 980 Tis, sometimes fails like this:
https://www.gpugrid.net/result.php?resultid=14641885

ANY HELP APPRECIATED, as my 1st rig is switched off at the moment because of these troubles.

PS: No changes made to the hardware so far. Previously no problem, suddenly this.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2057
Credit: 15,019,107,069
RAC: 6,047,791
Level
Trp
Message 42038 - Posted: 27 Oct 2015 | 23:43:24 UTC - in response to Message 42037.
Last modified: 27 Oct 2015 | 23:44:05 UTC

> I have recently had too many errors on my 24/7 rig with three GTX 970s:
> https://www.gpugrid.net/results.php?userid=73475
>
> Links:
> https://www.gpugrid.net/result.php?resultid=14642662
Excerpt from this task's stderr.txt:
# GPU 1 : 78C # GPU 0 : 89C # GPU 1 : 80C # GPU 0 : 91C # GPU 0 : 93C # GPU 0 : 95C # GPU 1 : 82C
These temperatures are too high. You'll fry your cards.
But the error message at the end of the output file says:
SWAN : FATAL Unable to load module .mshake_kernel.cu. (999)
This usually happens when a task is stopped too soon after it has started.

> https://www.gpugrid.net/result.php?resultid=14642713
Another excerpt from this task's stderr.txt:
# GPU 2 : 63C # GPU 0 : 93C # GPU 1 : 83C # GPU 0 : 94C # GPU 1 : 84C # GPU 0 : 95C # GPU 1 : 85C
95°C is way too high!
I suspect that these cards have non-standard cooling with axial fans and exhaust their heat inside the case, heating each other.
You should use only one such card in this computer, or at least move one of the cards so that there is at least one empty slot between the two cards for proper airflow, and install some fans that remove the hot air from the case.
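
For anyone who wants to check their own stderr.txt for the same pattern, here is a minimal Python sketch. It assumes the temperature entries always look like the excerpts above ("# GPU 0 : 93C"). The function names and the 80°C default limit are illustrative choices, not part of GPUGrid or BOINC.

```python
import re
from collections import defaultdict

# Matches entries such as "# GPU 0 : 93C" in a stderr.txt excerpt.
TEMP_RE = re.compile(r"#\s*GPU\s+(\d+)\s*:\s*(\d+)C")


def peak_temps(stderr_text):
    """Return {gpu_index: highest temperature seen, in Celsius}."""
    peaks = defaultdict(int)
    for gpu, temp in TEMP_RE.findall(stderr_text):
        peaks[int(gpu)] = max(peaks[int(gpu)], int(temp))
    return dict(peaks)


def too_hot(stderr_text, limit_c=80):
    """Return the sorted indices of GPUs whose peak exceeds limit_c."""
    peaks = peak_temps(stderr_text)
    return sorted(gpu for gpu, temp in peaks.items() if temp > limit_c)
```

Run on the second excerpt above, `peak_temps` gives 95°C for GPU 0, 85°C for GPU 1 and 63°C for GPU 2, so `too_hot` reports GPUs 0 and 1.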

> My 2nd 24/7 rig, with two GTX 980 Tis, sometimes fails like this:
> https://www.gpugrid.net/result.php?resultid=14641885
Perhaps you should lower its GPU clock a little to increase its stability.

> ANY HELP APPRECIATED, as my 1st rig is switched off at the moment because of these troubles.
>
> PS: No changes made to the hardware so far. Previously no problem, suddenly this.
You will permanently damage your cards (if you have not already) if you run them above 80°C.
Every 10°C rise in temperature halves the lifetime of the card, but above 80°C every 5°C rise does the same.
Above 90°C there's a high chance of an immediate fatal failure of the GPU chip.
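
The rule of thumb above can be turned into a rough number. The sketch below only illustrates that rule; it is not a validated reliability model. The 60°C reference temperature and the function names are assumptions made for the example.

```python
def halvings(temp_c):
    """Cumulative lifetime halvings at temp_c under the rule of thumb:
    one per 10°C up to 80°C, then one per 5°C above 80°C."""
    below = min(temp_c, 80) / 10
    above = max(temp_c - 80, 0) / 5
    return below + above


def relative_lifetime(temp_c, reference_c=60):
    """Lifetime at temp_c as a fraction of the lifetime at reference_c.

    reference_c is an assumed baseline, chosen only for illustration.
    """
    return 0.5 ** (halvings(temp_c) - halvings(reference_c))


for t in (70, 80, 90, 95):
    print(f"{t}°C: {relative_lifetime(t):.4f} of the lifetime at 60°C")
```

Under these assumptions, a card running at 80°C gets a quarter of the lifetime it would have at 60°C, and a card at 95°C gets about 1/32 of it.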

Viktor Svantner
Joined: 13 Feb 11
Posts: 24
Credit: 5,999,975,589
RAC: 2,022,157
Level
Tyr
Message 42040 - Posted: 28 Oct 2015 | 6:18:18 UTC - in response to Message 42038.
Last modified: 28 Oct 2015 | 6:46:26 UTC

So, I guess that I have to bring the temperatures down and it should be fine. I just don't understand why this happened after, say, 8 months of normal operation under the same conditions.

BTW, this one had fine temperatures but still ended in an error:
https://www.gpugrid.net/result.php?resultid=14644966
or
https://www.gpugrid.net/result.php?resultid=14641885

Any clues here? Or does this just happen sometimes?

Viktor Svantner
Joined: 13 Feb 11
Posts: 24
Credit: 5,999,975,589
RAC: 2,022,157
Level
Tyr
Message 42042 - Posted: 28 Oct 2015 | 11:01:39 UTC - in response to Message 42038.

or this one, just recently:

https://www.gpugrid.net/result.php?resultid=14645743


Retvari Zoltan
Joined: 20 Jan 09
Posts: 2057
Credit: 15,019,107,069
RAC: 6,047,791
Level
Trp
Message 42043 - Posted: 28 Oct 2015 | 11:18:10 UTC - in response to Message 42042.
Last modified: 28 Oct 2015 | 11:19:07 UTC

> or this one, just recently:
>
> https://www.gpugrid.net/result.php?resultid=14645743
This card is overclocked too far. Perhaps you should reduce the memory clock to 3505MHz and, if that doesn't help, lower the GPU clock in 20MHz steps until it becomes stable.

Betting Slip
Joined: 5 Jan 09
Posts: 669
Credit: 2,498,095,550
RAC: 0
Level
Phe
Message 42044 - Posted: 28 Oct 2015 | 11:20:26 UTC - in response to Message 42042.
Last modified: 28 Oct 2015 | 11:26:40 UTC

> or this one, just recently:
>
> https://www.gpugrid.net/result.php?resultid=14645743


Bear with me on this one, Viktor.

Install your graphics driver again, over the top of the existing installation. I had a situation like this a few years ago with dual cards, and a new driver always needed to be installed twice.

It could be what Retvari said about OC.

Hey, it's worth a shot. Don't forget to suspend any running GPUGrid WUs first.

Viktor Svantner
Joined: 13 Feb 11
Posts: 24
Credit: 5,999,975,589
RAC: 2,022,157
Level
Tyr
Message 42046 - Posted: 28 Oct 2015 | 14:19:39 UTC - in response to Message 42044.
Last modified: 28 Oct 2015 | 14:20:01 UTC

O.K., I'll try. It is so annoying. As I said before, there was no problem for months with the same OC, and now this. BTW, it's factory overclocked.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2057
Credit: 15,019,107,069
RAC: 6,047,791
Level
Trp
Message 42047 - Posted: 28 Oct 2015 | 16:46:07 UTC - in response to Message 42046.

> BTW, it's factory overclocked.
Everyone who uses factory-overclocked cards (including me) should remember:
The factory built these cards for playing games 4-5 hours per day, not for crunching 24 hours a day, 7 days a week.
Nothing severe happens when there's a glitch in a frame while you are playing, but when such a glitch occurs while crunching a workunit, it results in an error, and you lose the current workunit along with the time and the electricity spent on it. If this happens too often, the time lost to failed workunits can easily exceed the time gained by the faster processing, making the overclocking counter-productive.
So in terms of overclocking: less is more.
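
The break-even point in this argument can be sketched with a little arithmetic. The model below is a simplification with assumed numbers: every failed workunit is treated as costing a full workunit's run time, and the stock card is assumed never to fail. The function names are made up for the illustration.

```python
def useful_throughput(speedup, failure_rate):
    """Valid results per unit time, relative to a stock card that never fails.

    speedup:      processing speed relative to stock (1.10 means 10% faster)
    failure_rate: fraction of workunits that end in an error (0.0 to 1.0)
    """
    return speedup * (1.0 - failure_rate)


def break_even_failure_rate(speedup):
    """Failure rate at which the overclock gains exactly nothing."""
    return 1.0 - 1.0 / speedup


# Assumed example: a 10% overclock that makes 15% of workunits fail.
print(f"break-even failure rate: {break_even_failure_rate(1.10):.1%}")
print(f"relative throughput:     {useful_throughput(1.10, 0.15):.3f}")
```

With a 10% speedup the break-even failure rate is about 9%: if more than roughly one workunit in eleven fails, the overclocked card returns fewer valid results than the stock one.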

fzb
Joined: 27 Mar 09
Posts: 1
Credit: 54,924,244
RAC: 2,511
Level
Thr
Message 42050 - Posted: 28 Oct 2015 | 18:32:41 UTC

This might sound trivial, but if it worked for months and the temps are higher now, you might try cleaning off the dust that has gathered on the cards/coolers and checking the temps after that.

Dayle Diamond
Joined: 5 Dec 12
Posts: 77
Credit: 1,462,592,439
RAC: 139,624
Level
Met
Message 42055 - Posted: 29 Oct 2015 | 5:52:24 UTC - in response to Message 42050.

I second FZB on this.

I too recently had a work unit become unstable and fail on me after months of quiet.

Just cleaned out my two 970s tonight after at least six months left alone in my case. Used a 'Datavac electric duster' instead of an air can. In high dust environments like mine, the cans just don't last too long.

Disgusting amounts of dust flew everywhere.

I'm getting much cooler temps!

Viktor Svantner
Joined: 13 Feb 11
Posts: 24
Credit: 5,999,975,589
RAC: 2,022,157
Level
Tyr
Message 42056 - Posted: 29 Oct 2015 | 6:03:01 UTC - in response to Message 42047.

I buy OC cards not because of the OC, but because of the build quality. The temperatures were high because of the 3-way SLI setup (no room between the cards); I have already made changes.

Thanks heaps to all of you for the help.
