Message boards : Graphics cards (GPUs) : Getting Errors recently on one card

caffeineyellow5
Message 42182 - Posted: 17 Nov 2015 | 20:23:19 UTC

I have had these cards installed and untouched since November 5, and until recently every problem I saw could be traced to heat. Now that I have settled the heat issue on 2 of the cards in the machine, the third card, which runs significantly cooler, is having errors, and they are not the lock-up type of error I associate with heat. After two errored results on that same card in a short period of time, I turned the clock down to negative numbers, thinking there might be an overclock issue from the card defaults, but even with negative clock numbers the errors have continued for 2 more days.

Any ideas? Like I said, it worked fine for almost 9 days brand new, and then this started on one of the 3 cards and has continued. I can't even say the failures were all one type of WU, because 5 were NOELIA and one was SDOERR. The system is not anywhere it can be touched or bumped, and it sits on a concrete floor on rubber feet, so shaking would not be a factor either.

I really want to avoid reseating unless that is the only thing that solves this kind of error. The failing card is the lowest one on the board, and to reseat it I have to unplug them all, since each card's release is blocked by the card above it, and it is a very heavy system.

The system is this one: https://www.gpugrid.net/show_host_detail.php?hostid=263632

The errored work units from this system are these: https://www.gpugrid.net/results.php?hostid=263632&show_names=1&state=5

And the error tasks I am referring to are these:
https://www.gpugrid.net/result.php?resultid=14687930
https://www.gpugrid.net/result.php?resultid=14686553
https://www.gpugrid.net/result.php?resultid=14684984
https://www.gpugrid.net/result.php?resultid=14684937
https://www.gpugrid.net/result.php?resultid=14682165
https://www.gpugrid.net/result.php?resultid=14691871

As you can see, the common error on these is usually as follows:
ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
04:37:23 (8132): called boinc_finish
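
For anyone who wants to compare the stderr output of several failed tasks, here is a minimal Python sketch that tallies error signatures. It assumes you have copied each task's stderr text by hand into a dictionary; the helper names (`error_signatures`, `tally`) and the `"example"` key are made up for illustration, and the only log format it knows is the `ERROR: file ... line ...:` line quoted above.

```python
import re
from collections import Counter

# Matches lines shaped like the one quoted in this post:
#   ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
ERROR_LINE = re.compile(
    r"^ERROR: file (?P<file>\S+) line (?P<line>\d+): (?P<msg>.*)$"
)


def error_signatures(stderr_text):
    """Return (file, line, message) tuples for each ERROR line in one stderr."""
    found = []
    for raw in stderr_text.splitlines():
        match = ERROR_LINE.match(raw.strip())
        if match:
            found.append(
                (match.group("file"), int(match.group("line")),
                 match.group("msg").strip())
            )
    return found


def tally(stderr_by_result):
    """Count how many results contain each distinct error signature."""
    counts = Counter()
    for text in stderr_by_result.values():
        # set() so a signature repeated inside one task is counted once
        for signature in set(error_signatures(text)):
            counts[signature] += 1
    return counts


# Sample input: only the excerpt quoted in this post, under a placeholder key.
sample = {
    "example": (
        "ERROR: file force.cpp line 513: TCL evaluation of [calcforces]\n"
        "04:37:23 (8132): called boinc_finish\n"
    ),
}

if __name__ == "__main__":
    for signature, count in tally(sample).most_common():
        print(count, signature)
```

If every failed task shares one signature regardless of WU type, that at least points away from a problem with a particular batch of work units.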
____________
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!"
Ephesians 6:18-20, please ;-)
http://tbc-pa.org

Retvari Zoltan
Message 42184 - Posted: 18 Nov 2015 | 0:26:17 UTC - in response to Message 42182.

Well, I don't know the reason for those errors, but your other, perfectly working card is too hot.
If it stays this hot (above 80°C) long term, it won't keep working this perfectly for long.

caffeineyellow5
Message 42186 - Posted: 18 Nov 2015 | 8:43:01 UTC - in response to Message 42184.

The other 2 do get warmer during the day and then cool at night. Soon, in the winter months, they will stay below 80 day and night. I should bring the top limit down; I have it set at 87 now and was thinking of 83.

I just switched out 3 980s for 3 980 Ti Classifieds. They run at the same temperatures as the other cards, but they are rated for a higher maximum running temperature. The 980s were 87 max, with 83 or below recommended. The new ones are rated about 4 degrees higher, but I was not sure whether the recommended numbers should stay the same no matter which card is in there. The new cards do have better fans and a different architecture. I will bring the max temps down some.

Those 980s are now in my other machines. They ran at an 87 max in MSI Afterburner for over a year with seemingly no problems, both in the winter months and with cooler A/C air in the summer. It is really the fall and spring that get to them, lol: no A/C on and no cold air from outside.
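
To keep an eye on which cards sit above a chosen limit, something like the following Python sketch could work. It assumes `nvidia-smi` is installed and on the PATH (I use MSI Afterburner, so this is an alternative, not what I actually run); the function names and the sample readings are made up for illustration and are not measurements from this host.

```python
import subprocess

# nvidia-smi query that prints one "index, name, temperature" row per GPU
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temps(csv_text):
    """Parse 'index, name, temperature' rows into (index, name, temp_c) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, rest = line.split(",", 1)
        name, temp = rest.rsplit(",", 1)
        rows.append((int(index), name.strip(), int(temp)))
    return rows


def over_limit(rows, limit_c=80):
    """Return the rows whose temperature is strictly above limit_c."""
    return [row for row in rows if row[2] > limit_c]


def read_temps():
    """Ask the driver for current temperatures (needs nvidia-smi on PATH)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temps(result.stdout)


# Made-up readings for illustration only.
sample = "0, GPU A, 84\n1, GPU B, 82\n2, GPU C, 68\n"

if __name__ == "__main__":
    for index, name, temp in over_limit(parse_temps(sample), limit_c=80):
        print(f"GPU {index} ({name}) is at {temp} C")
```

The default of 80 is the figure mentioned in the reply above; passing `limit_c=83` or `limit_c=87` checks against the Afterburner settings discussed here instead.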

I suppose I will turn it off and reseat all the cards just to troubleshoot.
