Message boards : Number crunching : GPU-Utilization low (& variating)
Joined: 5 May 22 · Posts: 24 · Credit: 12,458,305 · RAC: 0
Sun Jun 12 01:33:47 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 69C P0 N/A / N/A | 3267MiB / 4096MiB | 55% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2666 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 12881 C bin/python 3261MiB |
+-----------------------------------------------------------------------------+
Sun Jun 12 01:33:57 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 68C P0 N/A / N/A | 3267MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2666 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 12881 C bin/python 3261MiB |
+-----------------------------------------------------------------------------+
Sun Jun 12 01:34:07 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 67C P0 N/A / N/A | 3267MiB / 4096MiB | 6% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2666 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 12881 C bin/python 3261MiB |
+-----------------------------------------------------------------------------+
Sun Jun 12 01:34:17 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 66C P0 N/A / N/A | 3267MiB / 4096MiB | 11% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2666 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 12881 C bin/python 3261MiB |
+-----------------------------------------------------------------------------+
Sun Jun 12 01:34:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 69C P0 N/A / N/A | 3267MiB / 4096MiB | 4% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2666 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 12881 C bin/python 3261MiB |
+-----------------------------------------------------------------------------+
Across different samples the GPU utilization reads low, often below 10%, and it fluctuates. Is this normal? Other applications (like Collatz) were driving the NVIDIA GPU at 100% utilization.
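For watching utilization over time without the full table, nvidia-smi can emit just the fields of interest. The averaging below is only a sketch: the percentages are stand-in values copied from the five samples above, not live readings.

```shell
# Compact periodic sampling (needs the NVIDIA driver; shown for reference):
#   nvidia-smi --query-gpu=timestamp,utilization.gpu,temperature.gpu \
#       --format=csv,noheader -l 10
#
# Averaging captured utilization percentages (stand-in values taken
# from the five samples above):
samples='55
0
6
11
4'
echo "$samples" | awk '{ sum += $1; n++ } END { printf "avg %.0f%%\n", sum / n }'
# prints "avg 15%"
```

Over a window like this, a bursty workload such as the Python app averages well below 100% even though individual samples spike.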
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Yes, this is normal for the Python application.
Joined: 5 May 22 · Posts: 24 · Credit: 12,458,305 · RAC: 0
With another application (acemd3) the GPU utilization is a continuous 100%, but it looks like the run time would be 5-6 days for the work unit:

Sun Jun 12 15:34:45 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 71C P0 N/A / N/A | 180MiB / 4096MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2666 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 40431 C bin/acemd3 174MiB |
+-----------------------------------------------------------------------------+
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Yes, it's normal. acemd3 keeps the GPU fully utilized; the Python app shows low, intermittent utilization.
Joined: 5 May 22 · Posts: 24 · Credit: 12,458,305 · RAC: 0
It looks like the GPU/driver gets kicked out while acemd3 is running:

Sat Jun 18 11:10:09 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 93C P0 N/A / N/A | 180MiB / 4096MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2745 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 5433 C bin/acemd3 174MiB |
+-----------------------------------------------------------------------------+
Sat Jun 18 11:10:14 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... ERR! | 00000000:02:00.0 Off | N/A |
|ERR! ERR! ERR! N/A / ERR! | GPU is lost | ERR! ERR! |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Is this overheating or power related?
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
Looks like the card fell off the bus. Have you looked at the system logs to determine why? Normally you can find the reason there: memory-corruption errors in the system logs, communication issues in the dmesg log. What about dropping back to the stable 510 drivers?
Joined: 5 May 22 · Posts: 24 · Credit: 12,458,305 · RAC: 0
Jun 18 11:02:31 mx kernel: [130120.250658] NVRM: GPU at PCI:0000:02:00: GPU-793994bc-1295-4395-dc48-7dd3d7b431e2
Jun 18 11:02:31 mx kernel: [130120.250663] NVRM: Xid (PCI:0000:02:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Jun 18 11:02:31 mx kernel: [130120.250665] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
Jun 18 11:02:31 mx kernel: [130120.250687] NVRM: A GPU crash dump has been created. If possible, please run
Jun 18 11:03:58 mx kernel: [    7.229407] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
Jun 18 11:10:14 mx kernel: [  396.557669] NVRM: GPU at PCI:0000:02:00: GPU-793994bc-1295-4395-dc48-7dd3d7b431e2
Jun 18 11:10:14 mx kernel: [  396.557674] NVRM: Xid (PCI:0000:02:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Jun 18 11:10:14 mx kernel: [  396.557676] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
Jun 18 11:10:14 mx kernel: [  396.557697] NVRM: A GPU crash dump has been created. If possible, please run
Jun 18 20:11:43 mx kernel: [    6.301529] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver

Yes, it says the GPU has fallen off the bus. It could be difficult to downgrade the driver version. Not sure why this occurs. Maybe the Linux driver also draws more power than the Windows driver.
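The Xid number in those lines is the key datum: Xid 79 is NVIDIA's code for "GPU has fallen off the bus". A simple grep pulls the Xid events out of a saved kernel log; the file here is a stand-in built from abbreviated copies of the lines above, so the snippet runs without a GPU.

```shell
# Stand-in log built from abbreviated copies of the lines above:
cat > /tmp/kern.sample <<'EOF'
NVRM: Xid (PCI:0000:02:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
NVRM: Xid (PCI:0000:02:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
EOF

# On a live system, grep the kernel journal instead:
#   journalctl -k | grep 'NVRM: Xid'
grep -c 'NVRM: Xid' /tmp/kern.sample   # prints 2 (one per bus-drop event)
```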
Joined: 5 May 22 · Posts: 24 · Credit: 12,458,305 · RAC: 0
I would think it is overheating related: after start-up, while the temperature stays below 92 °C, it does not crash. It crashes quite soon after 93 °C is reached.

Sat Jun 18 15:56:49 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05 Driver Version: 510.73.05 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 93C P0 N/A / N/A | 180MiB / 4096MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 15623 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 17273 C bin/acemd3 174MiB |
+-----------------------------------------------------------------------------+
Sat Jun 18 15:56:54 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05 Driver Version: 510.73.05 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... ERR! | 00000000:02:00.0 Off | N/A |
|ERR! ERR! ERR! N/A / ERR! | GPU is lost | ERR! ERR! |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
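Given how quickly the crash follows once 93 °C is reached, one stopgap is a watchdog that suspends BOINC work before that point. The polling command is real nvidia-smi syntax, but the 90 °C threshold and the boinccmd call are assumptions to adapt; the comparison logic is demonstrated on stand-in readings so the snippet runs anywhere.

```shell
# Watchdog sketch (requires nvidia-smi and boinccmd; the 90 C threshold
# is an assumption, chosen below the observed ~93 C crash point):
#   while sleep 10; do
#     t=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader)
#     [ "$t" -ge 90 ] && boinccmd --set_run_mode never
#   done
#
# The comparison itself, on stand-in readings:
for t in 69 88 93; do
  if [ "$t" -ge 90 ]; then echo "$t C: suspend"; else echo "$t C: ok"; fi
done
```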
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
92 °C is, I believe, the standard throttle temperature for current Nvidia cards. The question is why the card runs so hot. Are the fans on the card not ramping up to handle the higher temperatures under acemd3 compute load? Looking at your hosts and the cards listed, I assume you are using a laptop, since I see a mobile MX variant. Have you tried one of the laptop cooling pads with built-in fans to assist the laptop's cooling? Have you overclocked the card? Have you set the card fan speeds to 100%? Reduce the core and memory clock speeds if possible, though that may not be an option on a laptop.
ServicEnginIC · Joined: 24 Sep 10 · Posts: 592 · Credit: 11,972,186,510 · RAC: 1,447
I would think it is overheating related, as per when it starts up I agree that the problem is most likely produced by overheating. GPU detaching from bus when reaching 93 ºC could be caused by two reasons: - A hard-coded GPU self-protection mechanism actuating. - An electromechanical problem due to soldering getting its melting point at any GPU pin, causing it to miss a good electrical conductivity (this being very dangerous in the long run!). IMHO, running ACEMD3 / ACEMD4 tasks, due to its high optimization to squeeze the maximum power from the GPU, should be limited (if anything) to very well refrigerated laptops. Previous wise advices from Keith Myers could be useful. Python tasks are currently less GPU power demanding, making them more appropriate to run at requirements-complying laptops. Running certain apps or not, can be selected at Project preferences page. Additionally, I've specifically treated laptop overheating problems at: -Message #52937 -Message #57435 |
Joined: 5 May 22 · Posts: 24 · Credit: 12,458,305 · RAC: 0
There is probably very little to do if the laptop cooling is not good enough for this high a GPU load. There is maybe one fan on top of the CPU inside this laptop and some type of heat pipe from the GPU to the CPU fan area, but no separate fan for the GPU.
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
Deselect the acemd3 tasks. They use all of a GPU to its maximum capabilities and will overwhelm a weak heat-pipe laptop cooling solution. The only fix is to try one of the laptop cooling aids. Definitely elevate the laptop off its table or resting place to get airflow through the intake vents and out the side or back vents; assisted airflow to the intake vents is needed. Try opening the laptop so the two sides make a V and standing it up vertically on the V. I saw pictures last year of mining operations using that method on laptops; it forms a natural chimney cooling effect. Or change to the Python-on-GPU tasks if you have at least 32 GB of memory and plenty of drive space; they use the GPU only occasionally, in quick bursts, so it does not get hot at all. If you have neither, it's best to move on to other CPU and GPU projects, as your hardware is insufficient for GPUGrid.
Joined: 5 May 22 · Posts: 24 · Credit: 12,458,305 · RAC: 0
Also, based on the discussion at the link below, the suggestion is that the laptop has a "defective GPU": https://forums.developer.nvidia.com/t/unable-to-determine-the-device-handle-for-gpu-000000-0-unknown-error/214227/8
Joined: 5 May 22 · Posts: 24 · Credit: 12,458,305 · RAC: 0
It looks like the GPU draws more power than the 90 W charger can provide, and the battery level goes down even while the laptop is plugged into the charger. The original adapter was 65 W, but even with the 90 W adapter the battery drains slowly.
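The slow drain is consistent with a simple power budget: when the combined system draw exceeds the adapter rating, the battery covers the deficit. All the wattages below are illustrative assumptions, since this card does not report its power draw (Pwr:Usage shows N/A in the nvidia-smi output above).

```shell
# Hypothetical power budget: adapter rating vs. estimated component draw.
adapter=90                 # W, the current charger (assumption: rating on label)
cpu=45; gpu=35; rest=20    # W, rough illustrative guesses
total=$((cpu + gpu + rest))
if [ "$total" -gt "$adapter" ]; then
  echo "deficit: $((total - adapter)) W from battery"
else
  echo "within budget"
fi
# prints "deficit: 10 W from battery"
```

With numbers like these, even a 90 W adapter falls short under full load, which would explain why stepping up from 65 W only slowed the drain rather than stopping it.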
Joined: 10 Nov 13 · Posts: 101 · Credit: 15,773,211,122 · RAC: 0
You may still need a higher-power charger; depending on the laptop, the next step up is probably 130 W. That said, I would hate for you to fry your laptop computing for science. While it's nice to contribute to society, it's not so fun to repair or replace your system.

There are some things you can do to maximize cooling and reduce heat and power draw, if you haven't already done them. Make sure your laptop cooling system is completely clean: no dust bunnies in the fan, heatsink, or other vents. Carefully use a vacuum and/or canned air. Also make sure you are not blocking any of the ventilation with the surface you set the laptop on or with other items. It really should be on a cooling pad if you can get one.

It helps if the ambient temperature is reasonable. The cooler the better, but it doesn't have to be freezing. If you are sweating, your laptop is cooking itself.

If you are inclined to do so, you may be able to refresh the thermal paste with something better. Don't attempt this unless you know how to get your laptop apart and back together properly.

Another thing to try is downclocking the GPU, and maybe the CPU, to reduce heat generation and power draw. If you don't already have one, there are a number of programs available to do this.

I hope a few of these things are helpful.
©2025 Universitat Pompeu Fabra