Message boards : Graphics cards (GPUs) : 2x GPU - One GPU Errors Frequently

Redirect Left
Joined: 8 Dec 12
Posts: 23
Credit: 178,296,781
RAC: 24,884
Message 50395 - Posted: 5 Sep 2018 | 0:59:46 UTC
Last modified: 5 Sep 2018 | 1:18:18 UTC

Hi there.

I am currently running two GPUs, a GTX 760 and a GTX 670. Neither is amazing for number crunching, but they're decent for games and my PC is on 24/7, so spare time is donated to various BOINC projects and it doesn't matter if tasks take a while.

Anywho, tasks on the GTX 670 have a habit of erroring out for an unknown reason. An example of one of these tasks is here: http://www.gpugrid.net/result.php?resultid=18676941 - the error is related to the simulation becoming unstable.

It isn't a PSU / power issue: the draw on the PSU is less than 65% of its output on the rails that feed the graphics cards. I've tried re-installing the drivers, but that didn't fix anything. Both GPUs are 2GB GDDR5 VRAM editions.

Is it possible the GTX 670 isn't properly supported by GPUGrid anymore? If this is the case, is it possible to exclude the 670 from GPUGrid, but continue using the 760?

- Cheers!

tullio
Joined: 8 May 18
Posts: 157
Credit: 40,517,545
RAC: 6,256
Message 50396 - Posted: 5 Sep 2018 | 8:02:27 UTC - in response to Message 50395.

I have a GTX 750 Ti on a Linux box and a GTX 1050 Ti on a Windows 10 PC, neither overclocked. On the Linux box the GPU temperature reaches at most 63 C; on the Windows PC it reaches 80 C and then the task crashes with an error message similar to yours.
Tullio

Keith Myers
Joined: 13 Dec 17
Posts: 329
Credit: 251,529,463
RAC: 544,060
Message 50410 - Posted: 5 Sep 2018 | 20:34:32 UTC

The first thing I would do is increase the fan speed on the 670 to 100% and see whether the errors become less frequent.

The other thing would be to run another BOINC client and exclude the 670 GPU in its cc_config.xml file.

<ignore_nvidia_dev>N</ignore_nvidia_dev>
Ignore (don't use) a specific NVIDIA GPU. You can ignore more than one. Replaces <ignore_cuda_dev/>. Requires a client restart.
Example: <ignore_nvidia_dev>0</ignore_nvidia_dev> will ignore the first NVIDIA GPU in the system.
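
As a rough sketch of where that option goes: in cc_config.xml (in the BOINC data directory), the setting sits inside the <options> element. The device number 1 below is only an assumption; BOINC lists the number it assigns to each GPU in the startup messages of its event log, so the GTX 670 may well be device 0 on this machine.

```xml
<!-- cc_config.xml: hide one NVIDIA GPU from this BOINC client. -->
<cc_config>
  <options>
    <!-- Assumption: the GTX 670 is device 1. Check the event log
         at client startup for the actual device number. -->
    <ignore_nvidia_dev>1</ignore_nvidia_dev>
  </options>
</cc_config>
```

Note that <ignore_nvidia_dev> hides the card from every project that client runs. To keep the 670 working for other projects and exclude it only from GPUGrid, BOINC also has an <exclude_gpu> option, which may suit the original question better. The project URL below is assumed from the link in the first post and has to match the URL the client is attached with; the device number is again a guess.

```xml
<!-- cc_config.xml: stop one project from using one GPU. -->
<cc_config>
  <options>
    <exclude_gpu>
      <!-- Assumption: project URL exactly as shown in the client. -->
      <url>http://www.gpugrid.net/</url>
      <!-- Assumption: the GTX 670 is device 1. -->
      <device_num>1</device_num>
      <type>NVIDIA</type>
    </exclude_gpu>
  </options>
</cc_config>
```

Either change should take effect after a client restart.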

JoergF
Joined: 20 Apr 15
Posts: 283
Credit: 1,102,216,607
RAC: 14,656
Message 50411 - Posted: 5 Sep 2018 | 21:02:19 UTC

In addition to the suggestions above: what if you let MSI Afterburner limit the GPU temperature to 70°C max, provided that the temperature of this card shows up in there? If it doesn't, reduce both the GPU and memory clocks manually by maybe 50MHz and see where the temperature settles.
____________
I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2076
Credit: 15,115,994,483
RAC: 4,972,493
Message 50515 - Posted: 14 Sep 2018 | 21:24:37 UTC - in response to Message 50395.

... tasks on the GTX 670 have a habit of erroring, with an unknown reason. An example of one of these tasks is here; http://www.gpugrid.net/result.php?resultid=18676941 - the error is related to the simulation becoming unstable.
This is the typical error message for too high GPU clocks at the given GPU temperature.
You should reduce the GPU clock speed of your GTX 670 (or its power target in MSI Afterburner).
Judging by the stderr.txt of your tasks, your other GPU goes up to 93°C, which is dangerously high. This surely reduces the lifetime of your card.
You should increase the airflow to that card. If the two cards are next to each other, I strongly recommend physically removing the older card.
