
Message boards : Number crunching : Failures on Tesla k80

TribbleRED
Joined: 30 Aug 19
Posts: 5
Credit: 223,403,448
RAC: 493,511
Message 56154 - Posted: 27 Dec 2020 | 10:12:18 UTC
Last modified: 27 Dec 2020 | 10:13:02 UTC

I can't seem to make heads or tails of these failures. Any help would be appreciated.

GPUgrid is a new project for a K80 that is already installed, running, and contributing.

Driver ver: 452.39 - Datacenter Driver for Windows
Release Date: 2020.9.30

Each WU runs for 200 to 700 seconds and then fails.

Exit code 195 - unknown error

http://www.gpugrid.net/result.php?resultid=32303507
http://www.gpugrid.net/result.php?resultid=32227640
http://www.gpugrid.net/result.php?resultid=32226165

Thanks in advance

ServicEnginIC
Joined: 24 Sep 10
Posts: 459
Credit: 2,126,379,742
RAC: 1,088,278
Message 56156 - Posted: 27 Dec 2020 | 15:55:25 UTC

All three of your failed tasks were resent to other hosts and finished successfully, so we can rule out that the tasks themselves were defective.
The actual errors for the failed tasks were:

#32303507

ACEMD failed:
Particle coordinate is nan

Here "nan" is an abbreviation of "not a number" (NaN), so the message means that a particle coordinate is no longer a valid number.

#32227640
ACEMD failed:
Error invoking kernel: CUDA_ERROR_LAUNCH_FAILED (719)

#32226165
ACEMD failed:
Error invoking kernel: CUDA_ERROR_LAUNCH_FAILED (719)
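
As a rough illustration of the "Particle coordinate is nan" failure listed above, here is a minimal Python sketch of how one invalid floating-point result turns into a NaN coordinate and how such a check could look. This is not ACEMD's code; the function name and the sample coordinates are hypothetical.

```python
import math

def first_nan_particle(coords):
    """Return the index of the first particle with a NaN coordinate, or None."""
    for i, xyz in enumerate(coords):
        if any(math.isnan(c) for c in xyz):
            return i
    return None

# An invalid operation such as inf - inf yields NaN instead of raising an error.
bad = float("inf") - float("inf")

# Hypothetical coordinates: any arithmetic that touches NaN stays NaN,
# so one corrupted value spreads to everything derived from it.
coords = [(0.0, 1.0, 2.0), (3.0, bad + 1.0, 5.0)]

print(first_nan_particle(coords))
```

Once a value like this appears, every later step that uses it is also NaN, which is why a simulation has to abort rather than continue.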

These errors may indicate that the GPUs are running an overly aggressive user or factory overclock.
GPUGrid tasks are very demanding, and GPUs that succeed on other projects may fail at GPUGrid for this reason.
Try reducing the overclock if that is within your control, and test whether it helps.

On the other hand, you have also successfully processed task #32226173, so perhaps your settings are only slightly beyond the optimum.

Power requirements must also be taken into account for your host #572833.
Two NVIDIA Tesla K80s will demand 600 watts (300 watts each) at full performance.
The Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz will add another 145 watts.
Adding power for the motherboard and peripherals plus a small safety margin, a PSU of at least 1000 watts (1200 is better) should be used for this host.
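
The power estimate above can be written out as a short calculation. This is only a sketch using the figures quoted in this post; the 100 W allowance for motherboard and peripherals and the 20% safety margin are assumptions picked for illustration, not values from the thread.

```python
def psu_recommendation(component_watts, overhead_watts, margin):
    """Return (estimated load, load with safety margin) in watts."""
    load = sum(component_watts.values()) + overhead_watts
    return load, load * (1 + margin)

# Component figures as quoted in the post.
components = {
    "2x Tesla K80 (300 W each)": 600,
    "Xeon E5-2697 v3": 145,
}

# Assumed values: 100 W for motherboard and peripherals, 20% margin.
load, recommended = psu_recommendation(components, overhead_watts=100, margin=0.2)
print(f"estimated load: {load} W, with margin: {recommended:.0f} W")
```

With these assumptions the result lands just above 1000 W, in line with the minimum suggested above. Changing the component list changes the outcome accordingly.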

Keith Myers
Joined: 13 Dec 17
Posts: 812
Credit: 1,084,289,831
RAC: 1,517,561
Message 56157 - Posted: 27 Dec 2020 | 17:37:19 UTC

Since the Tesla K80 is a fanless design meant for server chassis airflow, my first question is whether you have enough forced-air cooling for the card.

Your errors suggest the card is overheating.

TribbleRED
Message 56158 - Posted: 27 Dec 2020 | 19:33:21 UTC - in response to Message 56156.


It has the same overclock as all the others... maybe it's just the bin for this particular card. I'll look into it further.

The Tesla K80 has two GPUs on one PCB, so it runs and reports as two GPUs even though it is just one card. The chassis has 2x 1250 W PSUs.

TribbleRED
Message 56159 - Posted: 27 Dec 2020 | 19:35:30 UTC - in response to Message 56157.

GPU1 sits below 57 °C while GPU2 hasn't risen above 66 °C, so this and the other passively cooled devices I run in this rack are not overheating. Plenty of forced airflow.

ServicEnginIC
Message 56160 - Posted: 27 Dec 2020 | 19:56:32 UTC - in response to Message 56159.

Thank you for your feedback.
Taking a closer look at this system, which is reported as having "56 processors", I deduce that it is based on twin Xeon CPUs, since each one has 14 cores (28 threads).
Great host.

jjch
Joined: 10 Nov 13
Posts: 59
Credit: 14,607,247,215
RAC: 3,167,703
Message 56161 - Posted: 28 Dec 2020 | 4:05:07 UTC - in response to Message 56160.
Last modified: 28 Dec 2020 | 4:05:54 UTC

It's not abnormal for GPUgrid computations to fail periodically. I have seen 20 failures in the last 4 days but my total error rate over all my systems is only about 3%.

Recently I had several Toni tasks fail and completely stall. They ran 2-3 days before I noticed it and had to abort them. Those seem to have cleared up now.

What you need to watch out for is tasks repeatedly failing one after another. That is more likely to be a driver or OS problem. If you notice that scenario, try rebooting and see if it clears up.

Since driver version 452.39 is the latest for a Tesla on Windows 10, I would suggest a full uninstall using DDU, followed by a reinstall. Download the NVIDIA driver directly from the NVIDIA website. Don't trust Windows to install it.

If you do not already have a copy of DDU you can find the download here: https://www.guru3d.com/files-details/display-driver-uninstaller-download.html

I would not recommend manually overclocking these. The best way to get the most performance is to keep them cool and let them run up to their maximum boost clock.

I would recommend using GPU-Z to monitor both GPUs on each card. Check for any abnormal behavior: spiking temps, GPU workload, power fluctuations, etc. GPU-Z can be found here: https://www.techpowerup.com/gpuz/
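
For logging the same kind of readings on a headless server, one option is `nvidia-smi`, which is installed with the NVIDIA driver. The sketch below parses the CSV output of a query such as `nvidia-smi --query-gpu=index,temperature.gpu,power.draw,clocks.sm --format=csv,noheader,nounits`. The sample text is made up for illustration, and the parser assumes every field is numeric (some boards report `[N/A]` for certain fields).

```python
import csv
import io

def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not rec:
            continue  # skip blank lines
        idx, temp, power, clock = (v.strip() for v in rec)
        rows.append({
            "index": int(idx),
            "temp_c": float(temp),
            "power_w": float(power),
            "sm_mhz": float(clock),
        })
    return rows

def hot_gpus(rows, limit_c):
    """Return the indices of GPUs whose temperature exceeds limit_c."""
    return [r["index"] for r in rows if r["temp_c"] > limit_c]

# Made-up sample output for a dual-GPU card (index, temp, power, SM clock).
sample = "0, 57, 118.42, 823\n1, 66, 131.07, 823\n"
rows = parse_gpu_csv(sample)
print(hot_gpus(rows, limit_c=60))
```

Running the query on a timer and appending the rows to a file gives a simple history to compare against the time a task failed.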

A few other thoughts: Make sure you have the latest Windows updates. Check for any other software applications that may be running and interfering. Make sure you have enough disk space and memory, including the limits set in BOINC. Check your system BIOS along with any other system firmware and update if needed.

TribbleRED
Message 56162 - Posted: 28 Dec 2020 | 8:44:44 UTC - in response to Message 56160.
Last modified: 28 Dec 2020 | 8:58:28 UTC

Thank you, sir. Indeed, a twin E5-2697 v3. I have two of these and have been testing projects on them to find where they excel, as a testbed for larger deployments. I have taken advantage of the older K80 because one of its two GPUs smokes any of the overclocked 1660 Supers I have (only with certain projects), and that has piqued my curiosity to explore projects like GPUgrid with legacy hardware to see, locally, what I might find. As it stands, even with two K80 cores it doesn't appear so far to be an efficient card (in my arsenal) for GPUgrid, even when it succeeds in completing tasks.

Thank you both for your help.

TribbleRED
Message 56163 - Posted: 28 Dec 2020 | 8:55:47 UTC - in response to Message 56161.

All well advised. I'll look again at the possibilities of driver issues when I return to run more GPUgrid tasks on this card hopefully within the week. I have another K80 incoming soon for another node of the same configuration.

For any overclocking I use MSI Afterburner, as it does not require an MSI-branded card and is a powerful overclocking tool for what it is. If you haven't used it before with your NVIDIA cards, go check it out.

jjch
Message 56174 - Posted: 28 Dec 2020 | 23:08:15 UTC - in response to Message 56163.

I definitely had a problem with one of the NVIDIA drivers for my RTX 4000 cards. It may have been version 452.57, and it would crash my server running Windows Server 2019. After I upgraded to version 460.89 the problem went away.

That server also seems to have better performance than another one that is still running 452.39 so I need to get on it and upgrade the others too. I don't have any Tesla cards so I can't test your issue.

If reinstalling 452.39 for the Tesla doesn't help, you could try going back one or two versions. Version 451.82 is the first previous and 451.48 is the one before that.

I do use EVGA Precision XOC and X1 for the newer Quadros, but I only set an aggressive fan curve and let the cards boost on their own. That doesn't apply to the Tesla, since it is cooled by the server airflow.

Just be sure your server is pushing enough air to keep them as cool as it can. On the HP/HPE ProLiant servers I set the cooling to "Increased cooling" even for the blower-style cards, as it can still push more air past them.

I haven't had much luck manually overclocking most of these newer cards. It is more troublesome and not much of a benefit for me.

From what I have found they run out of power and hit the Power limits anyway. If you use GPU-Z you can see that on the PerfCap Reason line.
It will show Pwr, VRel etc.

The K80s are aging tech, and the FP32 performance is about 4 Tflops per GPU.
They are in about the range of a Quadro M4000 or GTX 1050. If they are cheap or free I would definitely give them a go though.

Right now I needed several single-slot GPUs, so I went with the Quadro RTX 4000s. If you watch eBay you can find them for around $700, occasionally less, and sometimes even new. These are close to 7 Tflops each, and I have found that two of these beat a single RTX 5000 for about the same money.
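
The Tflops figures above can be sanity-checked with the usual peak-throughput estimate of 2 FLOPs (one fused multiply-add) per CUDA core per clock. The core counts and boost clocks below are not from this thread; they are quoted from memory of the public spec sheets, so treat them as assumptions and check them against NVIDIA's data sheets.

```python
def peak_fp32_tflops(cuda_cores, clock_mhz):
    """Theoretical peak FP32 rate: 2 FLOPs per core per cycle (one FMA)."""
    return 2 * cuda_cores * clock_mhz * 1e6 / 1e12

# Assumed specs: one of the two GPUs on a Tesla K80 at its top boost clock.
k80_per_gpu = peak_fp32_tflops(cuda_cores=2496, clock_mhz=875)

# Assumed specs: Quadro RTX 4000 at its rated boost clock.
rtx4000 = peak_fp32_tflops(cuda_cores=2304, clock_mhz=1545)

print(f"K80 per GPU: {k80_per_gpu:.2f} TFLOPS, RTX 4000: {rtx4000:.2f} TFLOPS")
```

These are theoretical ceilings. Real task throughput depends on sustained clocks, memory bandwidth, and how well the application suits the architecture.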

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2304
Credit: 16,114,486,240
RAC: 1,809,662
Message 56176 - Posted: 29 Dec 2020 | 0:18:28 UTC
Last modified: 29 Dec 2020 | 0:40:27 UTC

The K80 (released in 2014) is based on the Kepler architecture.
I thought the new app didn't support GPUs that old, but I couldn't find any reference.
EDIT: I was wrong. I've found a host with four Tesla K40c cards, and it's working.
There is also a host with a working GTX 670, and another with a working GTX 680.

Stephen Uitti
Joined: 17 Mar 14
Posts: 3
Credit: 63,759,846
RAC: 30,101
Message 56245 - Posted: 6 Jan 2021 | 15:10:47 UTC - in response to Message 56176.

On Jan 2, a GTX 650 Ti with driver 440.95.01 on Linux Mint 19 got a NaN error like the one above on an ACEMD unit. This system is not overclocked in any way. Cooling is better than stock, and temperatures are nominal.
Workunit 27006917
This host has routinely computed ACEMD units for months without error.

In syslog, there were these entries:

Jan 2 16:01:56 pensar boinc[1066]: mv: cannot stat 'slots/5/output.idx': No such file or directory
Jan 2 16:01:56 pensar boinc[1066]: mv: cannot stat 'slots/5/output.dcd': No such file or directory
Jan 2 16:01:56 pensar boinc[1066]: mv: cannot stat 'slots/5/COLVAR': No such file or directory
Jan 2 16:01:56 pensar boinc[1066]: mv: cannot stat 'slots/5/log.file': No such file or directory
Jan 2 16:01:56 pensar boinc[1066]: mv: cannot stat 'slots/5/HILLS': No such file or directory
Jan 2 16:01:56 pensar boinc[1066]: mv: cannot stat 'slots/5/output.xstfile': No such file or directory

The filesystem was not out of space. I assume that output files weren't written due to the error.

I wouldn't even have noticed this error, but GPUGRID stopped giving this system more units. No error, just this in syslog:
Jan 2 00:23:13 pensar boinc[1066]: 02-Jan-2021 00:23:13 [GPUGRID] Sending scheduler request: Requested by project.
Jan 2 00:23:13 pensar boinc[1066]: 02-Jan-2021 00:23:13 [GPUGRID] Requesting new tasks for NVIDIA GPU
Jan 2 00:23:16 pensar boinc[1066]: 02-Jan-2021 00:23:16 [GPUGRID] Scheduler request completed: got 0 new tasks
Jan 2 00:23:16 pensar boinc[1066]: 02-Jan-2021 00:23:16 [GPUGRID] No tasks sent
Jan 2 00:23:16 pensar boinc[1066]: 02-Jan-2021 00:23:16 [GPUGRID] This computer has reached a limit on tasks in progress

possibly due to the error task, then later

Jan 3 00:06:02 pensar boinc[1066]: 03-Jan-2021 00:06:02 [GPUGRID] Sending scheduler request: Requested by project.
Jan 3 00:06:02 pensar boinc[1066]: 03-Jan-2021 00:06:02 [GPUGRID] Requesting new tasks for NVIDIA GPU
Jan 3 00:06:05 pensar boinc[1066]: 03-Jan-2021 00:06:05 [GPUGRID] Scheduler request completed: got 0 new tasks
Jan 3 00:06:05 pensar boinc[1066]: 03-Jan-2021 00:06:05 [GPUGRID] No tasks sent
Jan 3 00:06:05 pensar boinc[1066]: 03-Jan-2021 00:06:05 [GPUGRID] Project has no tasks available

At first I believed this message, but GPUGRID has had plenty of tasks available.
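
To compare the reasons the scheduler gives over time, the syslog excerpts above can be filtered with a short script. This is a minimal sketch that assumes the line layout shown in this post (a BOINC timestamp, the project name in square brackets, then the message); the function names are hypothetical.

```python
import re

# Matches the BOINC part of a syslog line, e.g.
# "02-Jan-2021 00:23:16 [GPUGRID] No tasks sent"
LINE_RE = re.compile(
    r"\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2} \[(?P<project>[^\]]+)\] (?P<message>.*)$"
)

def project_messages(lines, project):
    """Return the messages logged for one project, in order."""
    out = []
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("project") == project:
            out.append(m.group("message"))
    return out

def no_work_reasons(lines, project="GPUGRID"):
    """Return the message that follows each 'No tasks sent' line."""
    msgs = project_messages(lines, project)
    return [msgs[i + 1] for i, m in enumerate(msgs[:-1]) if m == "No tasks sent"]

# Lines copied from the syslog excerpts in this post.
sample_lines = [
    "Jan 2 00:23:16 pensar boinc[1066]: 02-Jan-2021 00:23:16 [GPUGRID] Scheduler request completed: got 0 new tasks",
    "Jan 2 00:23:16 pensar boinc[1066]: 02-Jan-2021 00:23:16 [GPUGRID] No tasks sent",
    "Jan 2 00:23:16 pensar boinc[1066]: 02-Jan-2021 00:23:16 [GPUGRID] This computer has reached a limit on tasks in progress",
    "Jan 3 00:06:05 pensar boinc[1066]: 03-Jan-2021 00:06:05 [GPUGRID] Scheduler request completed: got 0 new tasks",
    "Jan 3 00:06:05 pensar boinc[1066]: 03-Jan-2021 00:06:05 [GPUGRID] No tasks sent",
    "Jan 3 00:06:05 pensar boinc[1066]: 03-Jan-2021 00:06:05 [GPUGRID] Project has no tasks available",
]

reasons = no_work_reasons(sample_lines)
print(reasons)
```

Listing the reasons side by side makes it easier to see that the message changed from a per-host task limit on Jan 2 to a project-wide shortage on Jan 3.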

The system is set up with PrimeGrid as a lower priority project, and it has been crunching those since.

When I get a chance, I'll reset the project and reboot the system.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 544
Credit: 4,460,126,357
RAC: 1,610,525
Message 56246 - Posted: 6 Jan 2021 | 16:35:33 UTC - in response to Message 56245.

GPUGRID is out of new work for the moment. There are only a few hundred tasks out in the field, and anything you get now will be resends.

We need to wait for the admins to add more work.