suddenly too many errors
| Author | Message |
|---|---|
|
Joined: 13 Feb 11 Posts: 25 Credit: 7,516,466,698 RAC: 0
|
Recently I have had too many errors on my 24/7 rig with three GTX 970s: https://www.gpugrid.net/results.php?userid=73475 Examples: https://www.gpugrid.net/result.php?resultid=14642662 https://www.gpugrid.net/result.php?resultid=14642713 My second 24/7 rig, with two GTX 980 Tis, sometimes fails like this: https://www.gpugrid.net/result.php?resultid=14641885 ANY HELP IS WELCOME, as my first rig is switched off at the moment because of these problems. PS: No changes have been made to the hardware so far. There were no problems before, then suddenly this. |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
"I have had recently too many errors with my 24/7 rig containing triple GTX 970:"

Excerpt from this task's stderr.txt: `# GPU 1 : 78C # GPU 0 : 89C # GPU 1 : 80C # GPU 0 : 91C # GPU 0 : 93C # GPU 0 : 95C # GPU 1 : 82C`

These temperatures are too high. You'll fry your cards. But the error message at the end of the output file says: `SWAN : FATAL Unable to load module .mshake_kernel.cu. (999)` It usually happens when you stop a task too soon after it has started.

"https://www.gpugrid.net/result.php?resultid=14642713"

Another excerpt from this task's stderr.txt: `# GPU 2 : 63C # GPU 0 : 93C # GPU 1 : 83C # GPU 0 : 94C # GPU 1 : 84C # GPU 0 : 95C # GPU 1 : 85C`

95°C is way too high! I suspect that these cards have non-standard cooling with axial fans, which dump their heat inside the case, so the cards heat each other. You should use only one such card in this computer, or at least replace one of the cards so that there is at least one slot of space between the two cards for proper airflow, and install some fans to remove the hot air from the case.

"On my 2nd rig 24/7 with double GTX 980Ti goes sometimes like:"

You will damage your cards permanently (if you haven't already) if you run them above 80°C. Every 10°C rise in temperature halves the lifetime of the card, but above 80°C every 5°C rise does the same. Above 90°C there's a high chance of an immediate fatal failure of the GPU chip. |
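The temperature readings quoted from stderr.txt above can be checked automatically. The following is a minimal Python sketch, assuming the readings always look like `# GPU <n> : <temp>C` as in the excerpts; the function names and the 80°C default limit (taken from the advice in this post) are illustrative choices, not part of GPUGRID or BOINC.

```python
import re
from collections import defaultdict

# Matches readings such as "# GPU 0 : 93C", the form seen in the quoted excerpts.
TEMP_LINE = re.compile(r"#\s*GPU\s+(\d+)\s*:\s*(\d+)C")


def max_temps(stderr_text):
    """Return {gpu_index: highest temperature seen, in degrees C}."""
    peaks = defaultdict(int)
    for gpu, temp in TEMP_LINE.findall(stderr_text):
        peaks[int(gpu)] = max(peaks[int(gpu)], int(temp))
    return dict(peaks)


def overheating(stderr_text, limit_c=80):
    """Return the GPUs whose peak temperature exceeds limit_c.

    Hypothetical helper: the 80 C default mirrors the rule of thumb
    given in the post, not any vendor specification.
    """
    return {gpu: t for gpu, t in max_temps(stderr_text).items() if t > limit_c}


# First excerpt quoted in the post above.
sample = (
    "# GPU 1 : 78C # GPU 0 : 89C # GPU 1 : 80C # GPU 0 : 91C "
    "# GPU 0 : 93C # GPU 0 : 95C # GPU 1 : 82C"
)

if __name__ == "__main__":
    print(max_temps(sample))
    print(overheating(sample))
```

For the first excerpt, both GPU 0 (peak 95°C) and GPU 1 (peak 82°C) are over the 80°C limit.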
|
Joined: 13 Feb 11 Posts: 25 Credit: 7,516,466,698 RAC: 0
|
So I guess I have to bring the temperatures down and it should be fine. I just don't understand why this happened after about 8 months of normal operation under the same conditions. BTW, this one was fine on temperatures but also ended in an error: https://www.gpugrid.net/result.php?resultid=14644966 or https://www.gpugrid.net/result.php?resultid=14641885 Any clues here? Or does this just happen sometimes? |
|
Joined: 13 Feb 11 Posts: 25 Credit: 7,516,466,698 RAC: 0
|
or this one, just recently: https://www.gpugrid.net/result.php?resultid=14645743 |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
"or this one, just recently:" This card is overclocked too far. Perhaps you should reduce the memory clock to 3505MHz and, if that doesn't help, lower the GPU clock in 20MHz decrements until it is stable. |
|
Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0
|
"or this one, just recently:" Bear with me on this one, Viktor. Install your graphics driver again, over the top of the existing one. I had a situation like this a few years ago with dual cards, where a new driver always needed to be installed twice. It could also be what Retvari said about the OC. Hey, it's worth a shot. Don't forget to suspend any running GPUGrid WUs. |
|
Joined: 13 Feb 11 Posts: 25 Credit: 7,516,466,698 RAC: 0
|
O.K., I'll try. It is so annoying. As I said before, with the same OC there was no problem for months, and now this. BTW, it's factory overclocked. |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
"BTW, it's factory overclocked." Everyone who uses factory-overclocked cards (including me) should remember: the factory made these cards for playing games 4-5 hours a day, not for crunching 24 hours a day, 7 days a week. Nothing severe happens when there's a glitch in a frame while you are playing, but when such a glitch occurs while crunching a workunit, it results in an error, and you lose that workunit along with the time and electricity spent on it. If this happens too often, the time lost to failed workunits can easily exceed the time gained by the faster processing, making the overclock counter-productive. So in terms of overclocking: less is more. |
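The trade-off described above can be put into rough numbers. This is a back-of-the-envelope Python sketch under two simplifying assumptions that are not from the thread: every failed workunit wastes a full run's worth of time, and the card never fails at stock clocks. The function names and the example percentages are hypothetical, not measurements.

```python
def relative_throughput(speed_gain, failure_rate):
    """Valid results per hour relative to stock clocks (1.0 = break-even).

    speed_gain:   fractional speed-up from the overclock (0.10 = 10% faster)
    failure_rate: fraction of overclocked workunits that end in an error
    Assumes a failed workunit costs a full run and returns nothing.
    """
    return (1.0 + speed_gain) * (1.0 - failure_rate)


def break_even_failure_rate(speed_gain):
    """Failure rate at which the overclock stops paying off.

    Derived from (1 + g) * (1 - f) = 1, which gives f = g / (1 + g).
    """
    return speed_gain / (1.0 + speed_gain)


if __name__ == "__main__":
    for gain, fails in [(0.10, 0.05), (0.10, 0.15)]:
        ratio = relative_throughput(gain, fails)
        verdict = "worth it" if ratio > 1.0 else "counter-productive"
        print(f"gain {gain:.0%}, failures {fails:.0%}: {ratio:.3f}x ({verdict})")
    print(f"break-even failure rate at 10% gain: {break_even_failure_rate(0.10):.1%}")
```

Under these assumptions, a 10% overclock stops paying off once roughly one workunit in eleven fails, which is why a small, stable overclock can beat a larger unstable one.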
|
Joined: 27 Mar 09 Posts: 1 Credit: 103,615,743 RAC: 0
|
This might sound trivial, but if it worked for months and now the temps are higher, you might try cleaning off the dust that has gathered on the cards/coolers and checking the temps after that. |
|
Joined: 5 Dec 12 Posts: 84 Credit: 1,663,883,415 RAC: 0
|
I second FZB on this. I too recently had a work unit become unstable and fail on me after months of quiet. Just cleaned out my two 970s tonight after at least six months left alone in my case. Used a 'Datavac electric duster' instead of an air can. In high dust environments like mine, the cans just don't last too long. Disgusting amounts of dust flew everywhere. I'm getting much cooler temps! |
|
Joined: 13 Feb 11 Posts: 25 Credit: 7,516,466,698 RAC: 0
|
I buy OC cards not for the OC but for the build quality. The temperatures were high because of the 3-way SLI setup (no room between the cards); I have already made changes. Thanks heaps to all of you for the help. |