ATM free energy calculation -> GPU overheated and kicked off bus
| Author | Message |
|---|---|
|
Joined: 5 May 22 Posts: 24 Credit: 12,458,305 RAC: 0
|
The NVIDIA GPU was running at 93C and was then suddenly kicked off the bus, due to overheating, I am guessing:
Mon Mar 20 06:00:36 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 93C P0 N/A / N/A | 170MiB / 4096MiB | 98% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2706 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 5626 C python 164MiB |
+-----------------------------------------------------------------------------+
Mon Mar 20 06:00:41 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... ERR! | 00000000:02:00.0 Off | N/A |
|ERR! ERR! ERR! N/A / ERR! | GPU is lost | ERR! ERR! |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
|
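The two snapshots above (93C at 98% utilization, then "GPU is lost" five seconds later) suggest watching the temperature and pausing GPU work before the card drops out. A minimal sketch of that idea follows. It assumes `nvidia-smi` and `boinccmd` are on the PATH; the 85C limit and the 600-second pause are illustrative values, not figures from this thread, and the helper names are hypothetical.

```python
import subprocess

# Query flags for machine-readable nvidia-smi output: "temp, util" per GPU.
QUERY = ["nvidia-smi",
         "--query-gpu=temperature.gpu,utilization.gpu",
         "--format=csv,noheader,nounits"]


def parse_reading(line):
    """Return (temp_c, util_pct), or None if the line is not two integers
    (for example an error message once the GPU has dropped off the bus)."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 2:
        return None
    try:
        return int(parts[0]), int(parts[1])
    except ValueError:
        return None


def decide(reading, limit_c=85):
    """Map a reading to an action. limit_c is an assumed threshold."""
    if reading is None:
        return "lost"      # nothing left to throttle; a reboot may be needed
    temp_c, _util = reading
    return "suspend" if temp_c >= limit_c else "ok"


def read_gpu():
    """Run nvidia-smi once and parse the first GPU's line."""
    result = subprocess.run(QUERY, capture_output=True, text=True)
    lines = result.stdout.strip().splitlines()
    return parse_reading(lines[0]) if lines else None


if __name__ == "__main__":
    action = decide(read_gpu())
    print(action)
    if action == "suspend":
        # Pause BOINC GPU computing for 600 seconds (assumed duration).
        subprocess.run(["boinccmd", "--set_gpu_mode", "never", "600"])
```

Run from cron or a loop, this would only pause GPU work; it cannot bring back a GPU that has already fallen off the bus.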
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
Yes, fan not running will do that. Fix the fan so that it runs and try again. |
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
|
What brand and model GPU? E.g., MSI and Gigabyte fans don't last very long but EVGA fans do. Easy to replace. Might just need to blow out the dust and clean the PCIe connector with isopropyl alcohol. |
ServicEnginIC Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,447
|
"The NVIDIA GPU running at 93C, then suddenly kicked off from the bus due to overheating"
This symptom would also fit an overheated crunching laptop. More details are given in my Message #52937. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
it's a laptop with a very old GPU. just look at his profile and hosts. this is the same problem he had with the acemd3 tasks where the GPU overheated and dropped off the bus. overheating not necessarily due to fans not spinning, though possible. the fans might not be connected to the GPU itself. sometimes laptops use fans that share the GPU and CPU or are chassis controlled. time to retire this system IMO. or use it for PythonGPU only (lower GPU utilization and heat output). laptops in general aren't good candidates for BOINC due to the limited cooling under 24/7 use. laptop cooling systems really aren't designed for that.
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
I just saw the reported 0 rpm fan speed in the nvidia-smi output and commented. Didn't look into the actual hardware. Yes, don't use this hardware for anything but Python, and even that is questionable. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"... laptops in general aren't good candidates for BOINC due to the limited cooling under 24/7 use. laptop cooling systems really aren't designed for that."
Recently I was playing with the idea of buying a new laptop with an RTX 3070 or even a 3080 inside, but exactly what you are saying stopped me from doing it. |
|
Joined: 5 May 22 Posts: 24 Credit: 12,458,305 RAC: 0
|
Run only the selected applications
ACEMD 3: no
ACEMD 4: no
ATM (beta): no
Quantum Chemistry (CPU): yes
Quantum Chemistry (CPU, beta): yes
Python Runtime (CPU, beta): yes
Python Runtime (GPU, beta): yes
I tried to switch ATM and ACEMD off because they are mostly what causes the overheating. But even with them switched off, boincmgr still downloads them, so I cannot avoid the problem. The laptop and GPU models may be too old, but the main problem is overheating. I don't know if it is possible to get the GPU back on the bus after it has been kicked off due to overheating. I always have to reboot to get it back. |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
You missed one critical setting in Project Preferences.
If no work for selected applications is available, accept work from other applications? no
If you don't set this to no, you will get any and ALL other applications when your desired app tasks aren't available. Just an FYI, there hasn't been any Quantum Chemistry work in about 3 years. |
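The effect of that preference can be pictured with a toy model. This is only a sketch of the behaviour described in the post, not the project's scheduler code; the function name `allowed_apps` and its arguments are hypothetical.

```python
def allowed_apps(selected, available, accept_other):
    """Toy model: which applications the host may receive work for.

    selected     -- apps ticked "yes" in Project Preferences
    available    -- apps the project currently has work for
    accept_other -- the "accept work from other applications?" setting
    """
    wanted = [app for app in available if app in selected]
    if wanted or not accept_other:
        # Either preferred work exists, or the fallback is disabled.
        return wanted
    # No preferred work and the fallback is enabled: anything goes.
    return list(available)


if __name__ == "__main__":
    selected = {"Quantum Chemistry (CPU)", "Python Runtime (GPU, beta)"}
    available = ["ATM (beta)", "ACEMD 3"]
    print(allowed_apps(selected, available, accept_other=True))
    print(allowed_apps(selected, available, accept_other=False))
```

With the fallback left on, a host that deselected ATM and ACEMD can still be sent both whenever its selected apps have no work, which matches what the original poster saw.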
|
Joined: 3 May 20 Posts: 19 Credit: 1,043,759,208 RAC: 39
|
I am using a gaming laptop with an RTX 3060 to crunch ATM, which works fine. As mentioned before, the heat needs to be controlled more carefully in a laptop. When I let the GPU crunch, the CPU stays at idle, so the only heat is generated by the GPU. Since Python tasks also use a few CPU threads at the same time, I manually set the CPU frequency to 1300 MHz. This speeds up the calculation, because otherwise the CPU would stay at 400 MHz, but it doesn't increase the heat much compared with leaving it at idle. The GPU runs at 80 degrees C. The system is an Asus with a Ryzen 7 5800H. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"I am using a gaming laptop with an RTX 3060 to crunch ATM ... When I let the GPU crunch the CPU stays in idle so the only heat is generated by the GPU."
How can the CPU stay idle while crunching an ATM task? |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
"I am using a gaming laptop with an RTX 3060 to crunch ATM ... When I let the GPU crunch the CPU stays in idle so the only heat is generated by the GPU."
All of these Python-based applications are very 'bursty'. IOW, they very infrequently use the CPU and GPU, flipping back and forth between the two processing elements. |
|
Joined: 3 May 20 Posts: 19 Credit: 1,043,759,208 RAC: 39
|
Exactly! That's why I wrote "otherwise in idle", meaning only the ATM task is being crunched. |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
By throttling the CPU speed down to idle to save watts and heat, the only consequence is longer-running tasks, which may risk missing the credit bonuses. |
|
Joined: 3 May 20 Posts: 19 Credit: 1,043,759,208 RAC: 39
|
Not necessarily. With this setup my ATM beta tasks finish after around 6 hours on this Linux host, awarding 1.1 million credit, and neither the CPU nor the GPU overheats. If I were to let them loose the way they were programmed, the CPU would stay at 400 MHz even when the ATM task needs it. So manually raising it to 1300 MHz decreases CPU calculation times. That way the CPU temp doesn't exceed 75 degrees and the GPU stays at around 80. What temps you're willing to accept depends on the project you run and your personal boldness. I like the CPU to peak at 75 degrees and the GPU at a little over 80. That's why I usually run only GPU or only CPU work. Both together may be too much. An exception is Milkyway, which can be run in parallel if the CPU gets throttled to 1000 MHz. This is just an example to show that ATM tasks can be run on a laptop.
My Intel 1280P, GTX 1650, Win 11 laptop behaves differently. If I run only CPU work on it, the temps go straight up to 93 degrees. If I use the GPU together with the CPU, the CPU gets throttled automatically to around 2200 MHz, keeping the CPU temp at around 75 and the GPU at around 73. Sometimes it needs a little kick from the Intel Extreme Tuning Utility though. |
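The post does not say which tool was used to hold the CPU at 1300 MHz on the Linux host. One possible way, sketched here purely as an assumption, is the kernel's cpufreq sysfs interface, where `scaling_min_freq` and `scaling_max_freq` take values in kHz. The helper names are hypothetical, writing needs root, and the active governor or driver may override or reject the values.

```python
import glob


def mhz_to_khz(mhz):
    """cpufreq sysfs files expect kHz."""
    return int(mhz * 1000)


def planned_writes(target_mhz, cpufreq_dirs):
    """List the (path, value) pairs that would pin each core to target_mhz.

    The max-then-min order is a guess; the right order can depend on
    the limits currently in force.
    """
    khz = str(mhz_to_khz(target_mhz))
    writes = []
    for d in sorted(cpufreq_dirs):
        writes.append((d + "/scaling_max_freq", khz))
        writes.append((d + "/scaling_min_freq", khz))
    return writes


def apply(writes, dry_run=True):
    """Print the plan, or perform it when dry_run is False (needs root)."""
    for path, value in writes:
        if dry_run:
            print(f"would write {value} to {path}")
        else:
            with open(path, "w") as f:
                f.write(value)


if __name__ == "__main__":
    dirs = glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq")
    apply(planned_writes(1300, dirs))  # dry run by default
```

Keeping the default dry run makes it easy to check the paths on a given laptop before changing anything.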
©2025 Universitat Pompeu Fabra