# Recently Python app work unit failed with computing error

---

Joined: 5 May 22 · Posts: 24 · Credit: 12,458,305 · RAC: 0

Work unit e00009a00848-ABOU_rnd_ppod_expand_demos23-0-1-RND9304_0 ended with "Error while computing" at 262,081.10. What does this mean, and what does it suggest is causing the error?

---

Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428

Result 33000731 (easier that way): it looks like the task failed and attempted a restart multiple times. The number (262,081.10) is how many seconds it wasted doing all that - roughly 73 hours - not a happy bunny. It's not immediately obvious what ultimately went wrong, but I'll keep looking. Your GPU (NVIDIA GeForce MX250, 4042 MB, driver 510.73) is unusual and may be a little short on memory (6 GB is recommended), and I'm not an expert on the Debian drivers.

---

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891

That's a mobile GPU in a laptop. It's not usually recommended to even try running GPU tasks on a laptop.

---

Joined: 5 May 22 · Posts: 24 · Credit: 12,458,305 · RAC: 0

If you find out what is causing the problem, please let me know as well.

---

Joined: 5 May 22 · Posts: 24 · Credit: 12,458,305 · RAC: 0

It looks like overheating, and the GPU driver was kicked out of operation:

```
Wed Aug 24 17:28:52 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05 Driver Version: 510.73.05 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 93C P0 N/A / N/A | 3267MiB / 4096MiB | 94% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1491 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 3604 C bin/python 3261MiB |
+-----------------------------------------------------------------------------+
Wed Aug 24 17:28:57 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05 Driver Version: 510.73.05 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 90C P0 N/A / N/A | 3267MiB / 4096MiB | 11% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1491 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 3604 C bin/python 3261MiB |
+-----------------------------------------------------------------------------+
Wed Aug 24 17:29:02 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05 Driver Version: 510.73.05 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... ERR! | 00000000:02:00.0 Off | N/A |
|ERR! ERR! ERR! N/A / ERR! | GPU is lost | ERR! ERR! |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```
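
The third snapshot above is the telling one: after running in the low 90s °C the card suddenly reports ERR! everywhere and "GPU is lost", which is consistent with a thermal shutdown. A minimal way to keep a timestamped trace of this while crunching is to poll nvidia-smi periodically; the sketch below assumes a single NVIDIA GPU at index 0 and nvidia-smi on the PATH, and is only a diagnostic aid, not part of the GPUGRID app.

```python
# Poll temperature, utilisation and memory every few seconds so a thermal
# shutdown like the one above leaves a trace in the terminal or a log file.
import subprocess
import time

QUERY = "temperature.gpu,utilization.gpu,memory.used,memory.total"

while True:
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    # Once the driver drops the card ("GPU is lost"), nvidia-smi typically
    # returns a non-zero exit code or error text instead of the usual CSV line.
    line = result.stdout.strip() or result.stderr.strip() or "no output from nvidia-smi"
    print(time.strftime("%Y-%m-%d %H:%M:%S"), line)
    time.sleep(5)
```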

---

Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428

> It looks like overheating, and the GPU driver was kicked out of operation.

Putting the laptop on a cooling stand, with fans blowing air directly into the laptop's ventilation inlet slots, may help. The slots are often underneath, but check your particular machine. And give them a good clean while you're down there!

---

Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0

I run laptop GPUs. Take a small air blower to their vent slots and blow the dust out. I do it daily.

---

JohnMD · Joined: 4 Dec 10 · Posts: 5 · Credit: 26,860,106 · RAC: 0

> Result 33000731 (easier that way)

It is clear that these GPU tasks fail on MX-series cards with insufficient GPU memory. I am unable to find any such requirements published - I can't even find details of the applications. Can anyone help?

---

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423

> Result 33000731 (easier that way)

I can't speak for Windows behavior, but the last several Linux tasks I processed used a little more than 3 GB of GPU memory per task when running one task per GPU. With the help of a CUDA MPS server I can push two tasks concurrently on a 6 GB GTX 1060, as some of the memory gets shared. I would say 3 GB per task is the minimum, and at least 4 GB to be comfortable. System memory use is also quite high, about 8 GB per task. But you should keep monitoring it; the project could change the requirements at any time if they want to run larger jobs or change the direction of their research.
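
Taking the figures above at face value (roughly 3 GB of GPU memory and 8 GB of system RAM per task - the poster's observations, not published requirements), a rough headroom check before raising the task count could look like the sketch below. It assumes nvidia-smi on the PATH, the psutil package installed, and a single GPU.

```python
# Rough estimate of how many Python tasks would fit, based on the per-task
# figures quoted above. These figures may change if the project runs larger jobs.
import subprocess
import psutil

PER_TASK_GPU_MIB = 3 * 1024      # ~3 GB GPU memory per task (poster's estimate)
PER_TASK_RAM_MIB = 8 * 1024      # ~8 GB system memory per task (poster's estimate)

free_gpu_mib = int(subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()[0])

free_ram_mib = psutil.virtual_memory().available // (1024 * 1024)

tasks = min(free_gpu_mib // PER_TASK_GPU_MIB, free_ram_mib // PER_TASK_RAM_MIB)
print(f"GPU free: {free_gpu_mib} MiB, RAM free: {free_ram_mib} MiB "
      f"-> roughly {tasks} task(s) would fit")
```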

---

Joined: 5 May 19 · Posts: 36 · Credit: 711,308,218 · RAC: 60

I have two laptops, the older one running an Nvidia GT 1060 and the newer one an RTX 3060. The older laptop, though reaching a GPU temperature of ~90 °C, completes most Python tasks, though it may take up to 40 hours. The newer laptop, however, fails the vast majority of Python tasks, and that started quite recently. For example:

```
29/08/2022 9:32:39 PM | GPUGRID | Computation for task e00008a01599-ABOU_rnd_ppod_expand_demos24_3-0-1-RND7566_0 finished
```

or another task, with more logging detail:

```
29/08/2022 9:44:40 PM | GPUGRID | [task] Process for e00014a00897-ABOU_rnd_ppod_expand_demos24_3-0-1-RND0029_3 exited, exit code 195, task state 1
```

What could be the reason? The GPU has enough graphics memory, and the laptop has 16 GB of RAM - same as the older one. There is enough disk space, and the temperature doesn't rise above 55-60 °C. Acemd3 tasks never fail on the newer laptop (at least, I don't remember such failures) and usually finish in under 15 hours. It's only Python, and only recently. Why could that be? What should I enable in the logs to diagnose this better?

---

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423

> I have two laptops, the older one running an Nvidia GT 1060 and the newer one an RTX 3060. [...] The newer laptop, however, fails the vast majority of Python tasks, and that started quite recently.

This is the more specific error you are getting on that system:

```
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
```

and I've also seen this in your errors:

```
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes.
```

Researching your first error, and considering the second one, it's likely that memory is your problem. Are you trying to run multiple tasks at a time? If not, others with Windows systems have mentioned increasing the pagefile size to solve these issues. Have you done that as well?
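
For what it's worth, the cuDNN error above often appears when the GPU itself runs out of memory while cuDNN is selecting a convolution algorithm. One way to check headroom from the same kind of environment the task runs in is PyTorch's own memory query; this is only a diagnostic sketch (PyTorch 1.10+ with CUDA assumed), not part of the GPUGRID application.

```python
# Query free/total GPU memory from PyTorch, the framework the Python tasks use.
import torch

if torch.cuda.is_available():
    free_b, total_b = torch.cuda.mem_get_info()   # bytes on the current device
    print(f"GPU memory free: {free_b / 2**30:.2f} GiB of {total_b / 2**30:.2f} GiB")
else:
    print("CUDA not available - the task would fall back to CPU or fail outright")
```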

---

Joined: 5 May 19 · Posts: 36 · Credit: 711,308,218 · RAC: 60

Thank you so much! Well, I suspected memory, but both laptops have a Windows-managed pagefile which is set to 40+ GB. Is that enough, or should I increase it even more on the newer system? Besides, if it's Windows-managed, doesn't the pagefile grow automatically on demand, even if that may cause issues for processes requesting memory allocations in the meantime? Anyway, I have set the pagefile to 64 GB now, as someone did in a similar situation. I expect that to be enough for the WUs to complete without issues. Thanks again.

---

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891

The problem with a Windows-managed pagefile is that it probably doesn't grow quickly enough when the Python application starts loading all its spawned processes. It probably responds initially to the initiating BOINC wrapper app, but that then loads the Python libraries, which have huge memory footprints on Windows. So the pagefile might not be large enough at the moment the Python dependencies are requesting large memory allocations. I have been recommending that Windows users set a custom, statically sized pagefile of 32 GB minimum / 64 GB maximum, and that seems to cover the Python application; tasks then complete successfully. But with your current 64 GB size, you have probably resolved the issue.
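
If you want to sanity-check the pagefile/swap the tasks actually see, a quick psutil query works on both Windows and Linux. Note that psutil derives its Windows swap figures from the commit charge, so a system-managed pagefile can report less than its configured maximum; treat the numbers as approximate.

```python
# Check whether the current pagefile/swap is anywhere near the 32-64 GB
# static size recommended above for the Windows Python tasks.
import psutil

swap = psutil.swap_memory()
total_gb = swap.total / 2**30
print(f"Pagefile/swap size: {total_gb:.1f} GB, {swap.percent}% used")
if total_gb < 32:
    print("Below the ~32 GB reported as needed for these tasks on Windows")
```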

---

Joined: 15 Jul 14 · Posts: 5 · Credit: 85,726,648 · RAC: 0

> I can't speak for Windows behavior, but the last several Linux tasks I processed used a little more than 3 GB of GPU memory per task when running one task per GPU.

On Linux I can still successfully run Python tasks on a GTX 1060 3GB. Using the Intel graphics for the display helps. I also noticed the Python app's memory usage decreased by ~100 MiB when I moved the Xorg process to the Intel GPU; I don't know exactly why.

```
Tue Aug 30 08:22:46 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| 28% 60C P2 47W / 60W | 2740MiB / 3019MiB | 95% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1672379 C bin/python 2731MiB |
+-----------------------------------------------------------------------------+
```

Before I switched to Intel graphics, it was roughly: Xorg 169 MiB, xfwm4 2 MiB, Python 2835 MiB, if I remember correctly.

---

Joined: 5 May 19 · Posts: 36 · Credit: 711,308,218 · RAC: 60

> The problem with a Windows-managed pagefile is that it probably doesn't grow quickly enough when the Python application starts loading all its spawned processes.

Thanks for the details. I was thinking about setting only the minimum size of the pagefile, but then decided to avoid system hiccups while the pagefile grows. After all, that's Windows (: Since then my second laptop has been processing tasks without issues. Once again, thanks to all who pointed me in the right direction.

---

Joined: 26 Dec 13 · Posts: 86 · Credit: 1,292,358,731 · RAC: 0

Can anyone give some insight into the reason for these WU crashes?

https://www.gpugrid.net/result.php?resultid=33020923
https://www.gpugrid.net/result.php?resultid=33021416
https://www.gpugrid.net/result.php?resultid=33020419

I couldn't find any understandable clues myself :/

---

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423

> Can anyone give some insight into the reason for these WU crashes?

Look at the error below the traceback section. First one:

```
ValueError: Object arrays cannot be loaded when allow_pickle=False
```

Second and third one:

```
BrokenPipeError: [WinError 232] The pipe is being closed
```

In at least two of these cases, other people who ran the same WUs also had errors. It is likely just a problem with the tasks themselves and not your system.
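
For context, the first error comes from NumPy itself: since version 1.16.3, np.load() refuses to deserialize object (pickled) arrays unless the caller explicitly passes allow_pickle=True. A toy reproduction is below; the file name is purely illustrative, and the real fix belongs in the project's task code, not on the volunteer side.

```python
# Reproduce "Object arrays cannot be loaded when allow_pickle=False".
import numpy as np

# Save an object array - this is what triggers the restriction on load.
np.save("demo.npy", np.array([{"key": "value"}], dtype=object))

try:
    np.load("demo.npy")                    # allow_pickle defaults to False
except ValueError as e:
    print(e)                               # the error seen in the failed tasks

data = np.load("demo.npy", allow_pickle=True)   # works, but is the app's call to make
print(data)
```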

---

Joined: 26 Dec 13 · Posts: 86 · Credit: 1,292,358,731 · RAC: 0

Thanks a lot!

---

Joined: 26 Dec 13 · Posts: 86 · Credit: 1,292,358,731 · RAC: 0

There are more and more problem WUs -_-

https://www.gpugrid.net/result.php?resultid=33022242 - BrokenPipeError: [WinError 232] The pipe is being closed
https://www.gpugrid.net/result.php?resultid=33022703 - BrokenPipeError: [WinError 109] The pipe has been ended

---

Joined: 10 Nov 13 · Posts: 101 · Credit: 15,773,211,122 · RAC: 0

I have been seeing a number of the same failures for quite some time as well.

https://www.gpugrid.net/result.php?resultid=33024302
https://www.gpugrid.net/result.php?resultid=33024910
https://www.gpugrid.net/result.php?resultid=33025156

It works out to about a 26% error rate. There really isn't a good indication of why these are failing, but I would expect it's either a Python error or something inherent in simulations that don't work. The scientists will need to figure out these issues and let us know.

Jeff