Failures on Tesla K80

TribbleRED · Joined: 30 Aug 19 · Posts: 7 · Credit: 808,646,881 · RAC: 129
Message 56154 - Posted: 27 Dec 2020, 10:12:18 UTC - Last modified: 27 Dec 2020, 10:13:02 UTC

I can't seem to make heads or tails of these failures. Any help would be appreciated.

GPUgrid is a new project for an already installed, running, and contributing K80.
Driver version: 452.39 - Datacenter Driver for Windows
Release date: 2020.9.30

The WU runs 200~700 seconds and then fails with exit code 195 - unknown error.

http://www.gpugrid.net/result.php?resultid=32303507
http://www.gpugrid.net/result.php?resultid=32227640
http://www.gpugrid.net/result.php?resultid=32226165

Thanks in advance

ServicEnginIC · Joined: 24 Sep 10 · Posts: 595 · Credit: 13,083,686,510 · RAC: 31,373
Message 56156 - Posted: 27 Dec 2020, 15:55:25 UTC

All three of your failed tasks have been resent to other hosts and finished successfully, so we can rule out that the tasks themselves were defective. The true errors for the failed tasks were:

#32303507 ACEMD failed: Particle coordinate is nan
(where "nan" is short for "not a number")
#32227640 ACEMD failed: Error invoking kernel: CUDA_ERROR_LAUNCH_FAILED (719)
#32226165 ACEMD failed: Error invoking kernel: CUDA_ERROR_LAUNCH_FAILED (719)

These errors may indicate that a too aggressive user or factory overclock is being used on the GPUs. GPUGrid tasks are very demanding, and GPUs that succeed on other projects may fail at GPUGrid for this reason. Try reducing the overclock if that is within your control, and test whether it helps. On the other hand, you have also successfully processed task #32226173, so perhaps your setting is only slightly beyond the optimal one.

Power requirements must also be taken into account for your host #572833. Two NVIDIA Tesla K80s will demand 600 watts (300 watts each) at full performance. The Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz will add 145 watts more. Adding power for the motherboard and peripherals plus a bit of safety margin, a PSU of at least 1000 watts (1200 is better) should be used for this host.
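The power estimate in this post can be written out as a small calculation. This is only a rough sketch: the 300 W and 145 W figures come from the post, while the 100 W allowance for motherboard and peripherals and the 20% safety margin are assumptions chosen for illustration, as is the helper name `psu_estimate_watts`.

```python
def psu_estimate_watts(component_watts, overhead_watts=100.0, margin=0.20):
    """Rough PSU sizing: sum component power, add a fixed overhead, apply a margin.

    overhead_watts and margin are illustrative assumptions, not measured values.
    """
    load = sum(component_watts) + overhead_watts
    return load * (1.0 + margin)


# Figures quoted in the post: two Tesla K80 boards at 300 W each, one Xeon at 145 W.
components = [300.0, 300.0, 145.0]
estimate = psu_estimate_watts(components)
print(f"Estimated minimum PSU: {estimate:.0f} W")
```

With these assumed allowances the result lands near the 1000 W minimum suggested in the post. For a host with a single K80 board and two CPUs, as described later in the thread, the list would instead be `[300.0, 145.0, 145.0]`.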

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 0
Message 56157 - Posted: 27 Dec 2020, 17:37:19 UTC

Since the Tesla K80 is a fanless design meant for server chassis airflow, my first question is whether you have enough assisted forced cooling for the card. Your errors suggest the card is overheating.

TribbleRED · Joined: 30 Aug 19 · Posts: 7 · Credit: 808,646,881 · RAC: 129
Message 56158 - Posted: 27 Dec 2020, 19:33:21 UTC - in response to Message 56156.

[quote]
All three of your failed tasks have been resent to other hosts and finished successfully, so we can rule out that the tasks themselves were defective. The true errors for the failed tasks were:

#32303507 ACEMD failed: Particle coordinate is nan
(where "nan" is short for "not a number")
#32227640 ACEMD failed: Error invoking kernel: CUDA_ERROR_LAUNCH_FAILED (719)
#32226165 ACEMD failed: Error invoking kernel: CUDA_ERROR_LAUNCH_FAILED (719)

These errors may indicate that a too aggressive user or factory overclock is being used on the GPUs. GPUGrid tasks are very demanding, and GPUs that succeed on other projects may fail at GPUGrid for this reason. Try reducing the overclock if that is within your control, and test whether it helps. On the other hand, you have also successfully processed task #32226173, so perhaps your setting is only slightly beyond the optimal one.

Power requirements must also be taken into account for your host #572833. Two NVIDIA Tesla K80s will demand 600 watts (300 watts each) at full performance. The Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz will add 145 watts more. Adding power for the motherboard and peripherals plus a bit of safety margin, a PSU of at least 1000 watts (1200 is better) should be used for this host.
[/quote]

It has the same overclock as all the others... maybe it's just the bin for this particular card. I'll look into it further.

The Tesla K80 has 2x GPUs on one PCB, so it runs and reports as two GPUs even though it is just one card.

The chassis has 2x 1250 watt PSUs.

TribbleRED · Joined: 30 Aug 19 · Posts: 7 · Credit: 808,646,881 · RAC: 129
Message 56159 - Posted: 27 Dec 2020, 19:35:30 UTC - in response to Message 56157.

[quote]
Since the Tesla K80 is a fanless design meant for server chassis airflow, my first question is whether you have enough assisted forced cooling for the card. Your errors suggest the card is overheating.
[/quote]

GPU1 sits below 57°C while GPU2 hasn't risen above 66°C, so this and the other passively cooled devices I run in this rack are not overheating. Plenty of forced airflow.

ServicEnginIC · Joined: 24 Sep 10 · Posts: 595 · Credit: 13,083,686,510 · RAC: 31,373
Message 56160 - Posted: 27 Dec 2020, 19:56:32 UTC - in response to Message 56159.

[quote]
GPU1 sits below 57°C while GPU2 hasn't risen above 66°C, so this and the other passively cooled devices I run in this rack are not overheating. Plenty of forced airflow.
[/quote]

Thank you for your feedback.
Taking a closer look at this system, which is reported as "56 processors", I deduce that it is based on twin Xeon CPUs, since each one has 14 cores and 28 threads.
Great host.

jjch · Joined: 10 Nov 13 · Posts: 101 · Credit: 15,776,211,122 · RAC: 0
Message 56161 - Posted: 28 Dec 2020, 4:05:07 UTC - in response to Message 56160. Last modified: 28 Dec 2020, 4:05:54 UTC

It's not abnormal for GPUgrid computations to fail periodically. I have seen 20 failures in the last 4 days, but my total error rate over all my systems is only about 3%.

Recently I had several Toni tasks fail and completely stall. They ran 2-3 days before I noticed and had to abort them. Those seem to have cleared up now.

What you need to watch out for is tasks repeatedly failing one after another. That is more likely to be a driver or OS problem. If you notice that scenario, try rebooting and see if it clears up.

Since driver version 452.39 is the latest for a Tesla on Windows 10, I would suggest a full uninstall using DDU and a reinstall. Download the Nvidia driver directly from the Nvidia website. Don't trust Windows to install it.

If you do not already have a copy of DDU you can find the download here: https://www.guru3d.com/files-details/display-driver-uninstaller-download.html

I would not recommend manually overclocking these. The best way to get the most performance is to keep them cool and let them run to their maximum boost clock.

I would recommend using GPU-Z to monitor both GPUs on each card. Check for any abnormal behavior: spiking temps, GPU workload, power fluctuations, etc.

GPU-Z can be found here: https://www.techpowerup.com/gpuz/

A few other thoughts:
Make sure you have the latest Windows updates.
Check for any other software applications that may be running and interfering.
Make sure you have enough disk space and memory, including the settings in BOINC.
Check your system BIOS along with any other system firmware and update if needed.
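The monitoring advice above refers to GPU-Z, which is a GUI tool. For a headless server, a similar one-shot check can be sketched with `nvidia-smi`, which ships with the NVIDIA driver. This swaps in a different tool from the one the post recommends. The 80 °C alert threshold and the helper names are assumptions for illustration, not values from the thread.

```python
import csv
import io
import subprocess

# Standard nvidia-smi --query-gpu properties.
QUERY_FIELDS = ["index", "temperature.gpu", "power.draw", "clocks.sm", "utilization.gpu"]


def parse_smi_csv(text):
    """Turn nvidia-smi CSV output (noheader, nounits) into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        rows.append({f: v.strip() for f, v in zip(QUERY_FIELDS, record)})
    return rows


def read_gpus():
    """Query every GPU once; a K80 board shows up as two entries."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_smi_csv(out)


def flag_hot(rows, limit_c=80.0):
    """Return indices of GPUs at or above limit_c (an assumed threshold)."""
    hot = []
    for row in rows:
        try:
            temp = float(row["temperature.gpu"])
        except ValueError:
            continue  # the field can be reported as unavailable on some GPUs
        if temp >= limit_c:
            hot.append(row["index"])
    return hot


if __name__ == "__main__":
    try:
        gpus = read_gpus()
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi not available on this machine")
    else:
        for gpu in gpus:
            print(gpu)
        print("GPUs over limit:", flag_hot(gpus))
```

Running this from a scheduled task every few seconds and logging the output gives a record of temperature, power draw, and clocks leading up to a failed task.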

TribbleRED · Joined: 30 Aug 19 · Posts: 7 · Credit: 808,646,881 · RAC: 129
Message 56162 - Posted: 28 Dec 2020, 8:44:44 UTC - in response to Message 56160. Last modified: 28 Dec 2020, 8:58:28 UTC

[quote]
Thank you for your feedback.
Taking a closer look at this system, which is reported as "56 processors", I deduce that it is based on twin Xeon CPUs, since each one has 14 cores and 28 threads.
Great host.
[/quote]

Thank you, sir. Indeed, a twin E5-2697 v3. I have two of these, and I have been testing projects on them to find where they excel, as a testbed for larger deployments.

I have taken advantage of the older K80 because one of the two K80 GPUs smokes any of the overclocked 1660 Supers I have (only with certain projects), and that has piqued my curiosity to explore projects like GPUgrid with legacy hardware to see, locally, what I might find.

As it stands, even with two K80 cores it doesn't appear so far to be an efficient card (in my arsenal) for GPUgrid, even when it succeeds in completing tasks.

Thank you both for your help.

TribbleRED · Joined: 30 Aug 19 · Posts: 7 · Credit: 808,646,881 · RAC: 129
Message 56163 - Posted: 28 Dec 2020, 8:55:47 UTC - in response to Message 56161.

[quote]
It's not abnormal for GPUgrid computations to fail periodically. I have seen 20 failures in the last 4 days, but my total error rate over all my systems is only about 3%.

Recently I had several Toni tasks fail and completely stall. They ran 2-3 days before I noticed and had to abort them. Those seem to have cleared up now.

What you need to watch out for is tasks repeatedly failing one after another. That is more likely to be a driver or OS problem. If you notice that scenario, try rebooting and see if it clears up.

Since driver version 452.39 is the latest for a Tesla on Windows 10, I would suggest a full uninstall using DDU and a reinstall. Download the Nvidia driver directly from the Nvidia website. Don't trust Windows to install it.

If you do not already have a copy of DDU you can find the download here: https://www.guru3d.com/files-details/display-driver-uninstaller-download.html

I would not recommend manually overclocking these. The best way to get the most performance is to keep them cool and let them run to their maximum boost clock.

I would recommend using GPU-Z to monitor both GPUs on each card. Check for any abnormal behavior: spiking temps, GPU workload, power fluctuations, etc.

GPU-Z can be found here: https://www.techpowerup.com/gpuz/

A few other thoughts:
Make sure you have the latest Windows updates.
Check for any other software applications that may be running and interfering.
Make sure you have enough disk space and memory, including the settings in BOINC.
Check your system BIOS along with any other system firmware and update if needed.
[/quote]

All well advised. I'll look again at the possibility of driver issues when I return to run more GPUgrid tasks on this card, hopefully within the week.

I have another K80 incoming soon for another node of the same configuration. For any overclocking I use MSI Afterburner, as it does not require an MSI-branded card and is a powerful overclocking tool for what it is. If you haven't used it before for your Nvidia cards, go check it out.

jjch · Joined: 10 Nov 13 · Posts: 101 · Credit: 15,776,211,122 · RAC: 0
Message 56174 - Posted: 28 Dec 2020, 23:08:15 UTC - in response to Message 56163.

I definitely had a problem with one of the NVIDIA drivers for my RTX 4000 cards. It may have been version 452.57, and it would crash my server running Windows Server 2019. After I upgraded to version 460.89 the problem went away. That server also seems to have better performance than another one that is still running 452.39, so I need to get on it and upgrade the others too.

I don't have any Tesla cards so I can't test your issue. If reinstalling 452.39 for the Tesla doesn't help, you could try going back one or two versions. Version 451.82 is the immediately previous one and 451.48 is the one before that.

I do use EVGA Precision XOC and X1 for the newer Quadros, but I only set an aggressive fan curve and let the cards boost on their own. That doesn't apply to the Tesla since it is cooled by the server airflow. Just be sure your server is pushing enough air to keep the cards as cool as it can. On the HP/HPE ProLiant servers I set the cooling to "Increased cooling" even for the blower-style cards, as it can still push more air past them.

I haven't had much luck manually overclocking most of these newer cards. It is more troublesome and not much of a benefit for me. From what I have found, they run out of power and hit the power limits anyway. If you use GPU-Z you can see that on the PerfCap Reason line. It will show Pwr, VRel, etc.

The K80s are aging tech and the FP32 performance is about 4 Tflops per GPU. They are in about the range of a Quadro M4000 or GTX 1050. If they are cheap or free I would definitely give them a go, though.

Right now I needed several single-slot GPUs, so I went with the Quadro RTX 4000s. If you watch eBay you can find them for around $700, occasionally less, and sometimes even new. These are close to 7 Tflops each, and I have found that two of these beat a single RTX 5000 for about the same money.
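The Tflops figures quoted above are consistent with the usual back-of-the-envelope estimate for peak FP32 throughput: 2 operations (one fused multiply-add) per CUDA core per clock. The sketch below assumes spec-sheet core counts and boost clocks that are not stated in the thread (2496 cores at 875 MHz per K80 GPU, 2304 cores at 1545 MHz for the Quadro RTX 4000). Treat them as assumptions and check them against the vendor datasheets.

```python
def peak_fp32_tflops(cuda_cores, boost_mhz):
    """Theoretical peak: 2 FLOPs (one FMA) per core per cycle."""
    return 2 * cuda_cores * boost_mhz * 1e6 / 1e12


# Assumed spec-sheet values, not taken from the thread.
k80_per_gpu = peak_fp32_tflops(cuda_cores=2496, boost_mhz=875)
rtx4000 = peak_fp32_tflops(cuda_cores=2304, boost_mhz=1545)

print(f"K80 (per GPU): {k80_per_gpu:.2f} Tflops")
print(f"Quadro RTX 4000: {rtx4000:.2f} Tflops")
```

These are theoretical peaks only. Sustained throughput on GPUGrid tasks depends on clocks actually held under load, memory bandwidth, and how well the application suits the architecture.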

Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
Message 56176 - Posted: 29 Dec 2020, 0:18:28 UTC - Last modified: 29 Dec 2020, 0:40:27 UTC

The K80 (released in 2014) is based on the Kepler architecture.
I think the new app doesn't support GPUs that old, but I didn't find any reference.
EDIT: I was wrong. I've found a host with 4 Tesla K40c cards, and it's working. Another host has a working GTX 670, and another has a working GTX 680.

Stephen Uitti · Joined: 17 Mar 14 · Posts: 4 · Credit: 77,427,636 · RAC: 0
Message 56245 - Posted: 6 Jan 2021, 15:10:47 UTC - in response to Message 56176.

On Jan 2, a GTX 650 Ti with driver 440.95.01 on Linux Mint 19 got a nan as above, on an ACEMD unit. This system is not overclocked in any way. Cooling is better than stock, and temperatures are nominal.

Workunit 27006917

This host has routinely computed ACEMD units for months without error. In syslog, there were these entries:

Jan 2 16:01:56 pensar boinc[1066]: mv: cannot stat 'slots/5/output.idx': No such file or directory
Jan 2 16:01:56 pensar boinc[1066]: mv: cannot stat 'slots/5/output.dcd': No such file or directory
Jan 2 16:01:56 pensar boinc[1066]: mv: cannot stat 'slots/5/COLVAR': No such file or directory
Jan 2 16:01:56 pensar boinc[1066]: mv: cannot stat 'slots/5/log.file': No such file or directory
Jan 2 16:01:56 pensar boinc[1066]: mv: cannot stat 'slots/5/HILLS': No such file or directory
Jan 2 16:01:56 pensar boinc[1066]: mv: cannot stat 'slots/5/output.xstfile': No such file or directory

The filesystem was not out of space. I assume that the output files weren't written due to the error.

I wouldn't even have noticed this error, but GPUGRID stopped giving this system more units. No error, just this in syslog:

Jan 2 00:23:13 pensar boinc[1066]: 02-Jan-2021 00:23:13 [GPUGRID] Sending scheduler request: Requested by project.
Jan 2 00:23:13 pensar boinc[1066]: 02-Jan-2021 00:23:13 [GPUGRID] Requesting new tasks for NVIDIA GPU
Jan 2 00:23:16 pensar boinc[1066]: 02-Jan-2021 00:23:16 [GPUGRID] Scheduler request completed: got 0 new tasks
Jan 2 00:23:16 pensar boinc[1066]: 02-Jan-2021 00:23:16 [GPUGRID] No tasks sent
Jan 2 00:23:16 pensar boinc[1066]: 02-Jan-2021 00:23:16 [GPUGRID] This computer has reached a limit on tasks in progress

possibly due to the error task. Then later:

Jan 3 00:06:02 pensar boinc[1066]: 03-Jan-2021 00:06:02 [GPUGRID] Sending scheduler request: Requested by project.
Jan 3 00:06:02 pensar boinc[1066]: 03-Jan-2021 00:06:02 [GPUGRID] Requesting new tasks for NVIDIA GPU
Jan 3 00:06:05 pensar boinc[1066]: 03-Jan-2021 00:06:05 [GPUGRID] Scheduler request completed: got 0 new tasks
Jan 3 00:06:05 pensar boinc[1066]: 03-Jan-2021 00:06:05 [GPUGRID] No tasks sent
Jan 3 00:06:05 pensar boinc[1066]: 03-Jan-2021 00:06:05 [GPUGRID] Project has no tasks available

At first I believed this message, but GPUGRID has had plenty of tasks available. The system is set up with PrimeGrid as a lower-priority project, and it has been crunching those since.

When I get a chance, I'll reset the project and reboot the system.
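Picking the scheduler's stated reasons out of syslog by eye gets tedious once there are many requests. The following is a minimal sketch of how the entries quoted in this post could be summarized. It assumes the exact line layout shown in the post (syslog prefix, then the BOINC timestamp, then the project name in brackets). The function and pattern names are made up for illustration and are not part of BOINC.

```python
import re

# Matches the BOINC part of a syslog line as it appears in the post.
LINE_RE = re.compile(
    r"boinc\[\d+\]:\s+(?P<stamp>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2})"
    r"\s+\[(?P<project>[^\]]+)\]\s+(?P<msg>.*)$"
)
GOT_RE = re.compile(r"got (\d+) new tasks")


def scheduler_replies(lines):
    """Group each 'got N new tasks' reply with the notes that follow it."""
    replies = []
    current = None
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not a BOINC project message (e.g. the 'mv:' errors)
        msg = m.group("msg").strip()
        got = GOT_RE.search(msg)
        if got:
            current = {
                "time": m.group("stamp"),
                "project": m.group("project"),
                "tasks": int(got.group(1)),
                "notes": [],
            }
            replies.append(current)
        elif msg.startswith("Sending scheduler request"):
            current = None  # a new exchange begins; stop collecting notes
        elif current is not None and m.group("project") == current["project"]:
            current["notes"].append(msg)
    return replies


# Lines copied from the post above.
SAMPLE = """\
Jan 2 00:23:13 pensar boinc[1066]: 02-Jan-2021 00:23:13 [GPUGRID] Sending scheduler request: Requested by project.
Jan 2 00:23:13 pensar boinc[1066]: 02-Jan-2021 00:23:13 [GPUGRID] Requesting new tasks for NVIDIA GPU
Jan 2 00:23:16 pensar boinc[1066]: 02-Jan-2021 00:23:16 [GPUGRID] Scheduler request completed: got 0 new tasks
Jan 2 00:23:16 pensar boinc[1066]: 02-Jan-2021 00:23:16 [GPUGRID] No tasks sent
Jan 2 00:23:16 pensar boinc[1066]: 02-Jan-2021 00:23:16 [GPUGRID] This computer has reached a limit on tasks in progress
Jan 3 00:06:02 pensar boinc[1066]: 03-Jan-2021 00:06:02 [GPUGRID] Sending scheduler request: Requested by project.
Jan 3 00:06:02 pensar boinc[1066]: 03-Jan-2021 00:06:02 [GPUGRID] Requesting new tasks for NVIDIA GPU
Jan 3 00:06:05 pensar boinc[1066]: 03-Jan-2021 00:06:05 [GPUGRID] Scheduler request completed: got 0 new tasks
Jan 3 00:06:05 pensar boinc[1066]: 03-Jan-2021 00:06:05 [GPUGRID] No tasks sent
Jan 3 00:06:05 pensar boinc[1066]: 03-Jan-2021 00:06:05 [GPUGRID] Project has no tasks available
"""

replies = scheduler_replies(SAMPLE.splitlines())
for reply in replies:
    print(reply["time"], reply["project"], reply["tasks"], "; ".join(reply["notes"]))
```

On a real system the input would come from the syslog file rather than the embedded sample. Listing the notes side by side makes it easier to see that the two exchanges gave different reasons for sending no work.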

Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1117 · Credit: 40,876,970,595 · RAC: 0
Message 56246 - Posted: 6 Jan 2021, 16:35:33 UTC - in response to Message 56245.

GPUGRID is out of new work for the moment. There are only a few hundred tasks out in the field, and anything you get now will be resends. You need to wait for the admins to add more work.