# Recently Python app work unit failed with computing error

---

Joined: 5 May 22 · Posts: 24 · Credit: 12,458,305 · RAC: 0

Work unit e00009a00848-ABOU_rnd_ppod_expand_demos23-0-1-RND9304_0 ended with "Error while computing" at 262,081.10. What does this mean, and what does it suggest is causing the error?

---

Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428

Result 33000731 (easier that way): it looks like the task failed and attempted a restart multiple times. The number (262,081.10) is how many seconds it wasted doing all that - roughly 73 hours - not a happy bunny. It's not immediately obvious what ultimately went wrong, but I'll keep looking. Your GPU (NVIDIA GeForce MX250, 4042 MB, driver 510.73) is unusual and may be a little short on memory (6 GB is recommended), and I'm not an expert on the Debian drivers.

---

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891

That's a mobile GPU in a laptop. It's not usually recommended to even try running GPU tasks on a laptop.

---

Joined: 5 May 22 · Posts: 24 · Credit: 12,458,305 · RAC: 0

If you find out what is causing the problem, please let me know as well.

---

Joined: 5 May 22 · Posts: 24 · Credit: 12,458,305 · RAC: 0

It looks like overheating, and the GPU driver was kicked out of operation:

```
Wed Aug 24 17:28:52 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05 Driver Version: 510.73.05 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 93C P0 N/A / N/A | 3267MiB / 4096MiB | 94% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1491 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 3604 C bin/python 3261MiB |
+-----------------------------------------------------------------------------+
Wed Aug 24 17:28:57 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05 Driver Version: 510.73.05 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 90C P0 N/A / N/A | 3267MiB / 4096MiB | 11% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1491 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 3604 C bin/python 3261MiB |
+-----------------------------------------------------------------------------+
Wed Aug 24 17:29:02 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05 Driver Version: 510.73.05 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... ERR! | 00000000:02:00.0 Off | N/A |
|ERR! ERR! ERR! N/A / ERR! | GPU is lost | ERR! ERR! |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```
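
The third snapshot above is the telling one: after running in the low 90s °C the card suddenly reports ERR! everywhere and "GPU is lost", which is consistent with a thermal shutdown. A minimal way to keep a timestamped trace of this while crunching is to poll nvidia-smi periodically; the sketch below assumes a single NVIDIA GPU at index 0 and nvidia-smi on the PATH, and is only a diagnostic aid, not part of the GPUGRID app.

```python
# Poll temperature, utilisation and memory every few seconds so a thermal
# shutdown like the one above leaves a trace in the terminal or a log file.
import subprocess
import time

QUERY = "temperature.gpu,utilization.gpu,memory.used,memory.total"

while True:
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    # Once the driver drops the card ("GPU is lost"), nvidia-smi typically
    # returns a non-zero exit code or error text instead of the usual CSV line.
    line = result.stdout.strip() or result.stderr.strip() or "no output from nvidia-smi"
    print(time.strftime("%Y-%m-%d %H:%M:%S"), line)
    time.sleep(5)
```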

---

Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428

> It looks like overheating, and the GPU driver was kicked out of operation.

Putting the laptop on a cooling stand, with fans blowing air directly into the laptop's ventilation inlet slots, may help. The slots are often underneath, but check your particular machine. And give them a good clean while you're down there!

---

Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0

I run laptop GPUs. Take a small air blower to their vent slots and blow the dust out. I do it daily.

---

JohnMD · Joined: 4 Dec 10 · Posts: 5 · Credit: 26,860,106 · RAC: 0

> Result 33000731 (easier that way)

It is clear that these GPU tasks fail on MX-series cards with insufficient GPU memory. I am unable to find any such requirements published - I can't even find details of the applications. Can anyone help?

---

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423

> Result 33000731 (easier that way)

I can't speak for Windows behavior, but the last several Linux tasks I processed used a little more than 3 GB of GPU memory per task when running one task per GPU. With the help of a CUDA MPS server I can push two tasks concurrently on a 6 GB GTX 1060, as some of the memory gets shared. I would say 3 GB per task is the minimum, and at least 4 GB to be comfortable. System memory use is also quite high, about 8 GB per task. But you should keep monitoring it; the project could change the requirements at any time if they want to run larger jobs or change the direction of their research.
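
Taking the figures above at face value (roughly 3 GB of GPU memory and 8 GB of system RAM per task - the poster's observations, not published requirements), a rough headroom check before raising the task count could look like the sketch below. It assumes nvidia-smi on the PATH, the psutil package installed, and a single GPU.

```python
# Rough estimate of how many Python tasks would fit, based on the per-task
# figures quoted above. These figures may change if the project runs larger jobs.
import subprocess
import psutil

PER_TASK_GPU_MIB = 3 * 1024      # ~3 GB GPU memory per task (poster's estimate)
PER_TASK_RAM_MIB = 8 * 1024      # ~8 GB system memory per task (poster's estimate)

free_gpu_mib = int(subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()[0])

free_ram_mib = psutil.virtual_memory().available // (1024 * 1024)

tasks = min(free_gpu_mib // PER_TASK_GPU_MIB, free_ram_mib // PER_TASK_RAM_MIB)
print(f"GPU free: {free_gpu_mib} MiB, RAM free: {free_ram_mib} MiB "
      f"-> roughly {tasks} task(s) would fit")
```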

---

Joined: 5 May 19 · Posts: 36 · Credit: 711,308,218 · RAC: 60

I have two laptops, the older one running an Nvidia GT 1060 and the newer one an RTX 3060. The older laptop, though reaching a GPU temperature of ~90 °C, completes most Python tasks, though it may take up to 40 hours. The newer laptop, however, fails the vast majority of Python tasks, and that started quite recently. For example:

```
29/08/2022 9:32:39 PM | GPUGRID | Computation for task e00008a01599-ABOU_rnd_ppod_expand_demos24_3-0-1-RND7566_0 finished
```

or another task, with more logging detail:

```
29/08/2022 9:44:40 PM | GPUGRID | [task] Process for e00014a00897-ABOU_rnd_ppod_expand_demos24_3-0-1-RND0029_3 exited, exit code 195, task state 1
```

What could be the reason? The GPU has enough graphics memory, and the laptop has 16 GB of RAM - same as the older one. There is enough disk space, and the temperature doesn't rise above 55-60 °C. Acemd3 tasks never fail on the newer laptop (at least, I don't remember such failures) and usually finish in under 15 hours. It's only Python, and only recently. Why could that be? What should I enable in the logs to diagnose this better?

---

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423

> I have two laptops, the older one running an Nvidia GT 1060 and the newer one an RTX 3060. [...] The newer laptop, however, fails the vast majority of Python tasks, and that started quite recently.

This is the more specific error you are getting on that system:

```
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
```

and I've also seen this in your errors:

```
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes.
```

Researching your first error, and considering the second one, it's likely that memory is your problem. Are you trying to run multiple tasks at a time? If not, others with Windows systems have mentioned increasing the pagefile size to solve these issues. Have you done that as well?
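
For what it's worth, the cuDNN error above often appears when the GPU itself runs out of memory while cuDNN is selecting a convolution algorithm. One way to check headroom from the same kind of environment the task runs in is PyTorch's own memory query; this is only a diagnostic sketch (PyTorch 1.10+ with CUDA assumed), not part of the GPUGRID application.

```python
# Query free/total GPU memory from PyTorch, the framework the Python tasks use.
import torch

if torch.cuda.is_available():
    free_b, total_b = torch.cuda.mem_get_info()   # bytes on the current device
    print(f"GPU memory free: {free_b / 2**30:.2f} GiB of {total_b / 2**30:.2f} GiB")
else:
    print("CUDA not available - the task would fall back to CPU or fail outright")
```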

---

Joined: 5 May 19 · Posts: 36 · Credit: 711,308,218 · RAC: 60

Thank you so much! Well, I suspected memory, but both laptops have a Windows-managed pagefile which is set to 40+ GB. Is that enough, or should I increase it even more on the newer system? Besides, if it's Windows-managed, doesn't the pagefile grow automatically on demand, even if that may cause issues for processes requesting memory allocations in the meantime? Anyway, I have set the pagefile to 64 GB now, as someone did in a similar situation. I expect that to be enough for the WUs to complete without issues. Thanks again.

---

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891

The problem with a Windows-managed pagefile is that it probably doesn't grow quickly enough when the Python application starts loading all its spawned processes. It probably responds initially to the initiating BOINC wrapper app, but that then loads the Python libraries, which have huge memory footprints on Windows. So the pagefile might not be large enough at the moment the Python dependencies are requesting large memory allocations. I have been recommending that Windows users set a custom, statically sized pagefile of 32 GB minimum / 64 GB maximum, and that seems to cover the Python application; tasks then complete successfully. But with your current 64 GB size, you have probably resolved the issue.
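
If you want to sanity-check the pagefile/swap the tasks actually see, a quick psutil query works on both Windows and Linux. Note that psutil derives its Windows swap figures from the commit charge, so a system-managed pagefile can report less than its configured maximum; treat the numbers as approximate.

```python
# Check whether the current pagefile/swap is anywhere near the 32-64 GB
# static size recommended above for the Windows Python tasks.
import psutil

swap = psutil.swap_memory()
total_gb = swap.total / 2**30
print(f"Pagefile/swap size: {total_gb:.1f} GB, {swap.percent}% used")
if total_gb < 32:
    print("Below the ~32 GB reported as needed for these tasks on Windows")
```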

---

Joined: 15 Jul 14 · Posts: 5 · Credit: 85,726,648 · RAC: 0

> I can't speak for Windows behavior, but the last several Linux tasks I processed used a little more than 3 GB of GPU memory per task when running one task per GPU.

On Linux I can still successfully run Python tasks on a GTX 1060 3GB. Using the Intel graphics for the display helps. I also noticed the Python app's memory usage decreased by ~100 MiB when I moved the Xorg process to the Intel GPU; I don't know exactly why.

```
Tue Aug 30 08:22:46 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| 28% 60C P2 47W / 60W | 2740MiB / 3019MiB | 95% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1672379 C bin/python 2731MiB |
+-----------------------------------------------------------------------------+
```

Before I switched to Intel graphics, it was roughly: Xorg 169 MiB, xfwm4 2 MiB, Python 2835 MiB, if I remember correctly.

---

Joined: 5 May 19 · Posts: 36 · Credit: 711,308,218 · RAC: 60

> The problem with a Windows-managed pagefile is that it probably doesn't grow quickly enough when the Python application starts loading all its spawned processes.

Thanks for the details. I was thinking about setting only the minimum size of the pagefile, but then decided to avoid system hiccups while the pagefile grows. After all, that's Windows (: Since then my second laptop has been processing tasks without issues. Once again, thanks to all who pointed me in the right direction.

---

Joined: 26 Dec 13 · Posts: 86 · Credit: 1,292,358,731 · RAC: 0

Can anyone give some insight into the reason for these WU crashes?

https://www.gpugrid.net/result.php?resultid=33020923
https://www.gpugrid.net/result.php?resultid=33021416
https://www.gpugrid.net/result.php?resultid=33020419

I couldn't find any understandable clues myself :/

---

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423

> Can anyone give some insight into the reason for these WU crashes?

Look at the error below the traceback section. First one:

```
ValueError: Object arrays cannot be loaded when allow_pickle=False
```

Second and third one:

```
BrokenPipeError: [WinError 232] The pipe is being closed
```

In at least two of these cases, other people who ran the same WUs also had errors. It is likely just a problem with the tasks themselves and not your system.
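
For context, the first error comes from NumPy itself: since version 1.16.3, np.load() refuses to deserialize object (pickled) arrays unless the caller explicitly passes allow_pickle=True. A toy reproduction is below; the file name is purely illustrative, and the real fix belongs in the project's task code, not on the volunteer side.

```python
# Reproduce "Object arrays cannot be loaded when allow_pickle=False".
import numpy as np

# Save an object array - this is what triggers the restriction on load.
np.save("demo.npy", np.array([{"key": "value"}], dtype=object))

try:
    np.load("demo.npy")                    # allow_pickle defaults to False
except ValueError as e:
    print(e)                               # the error seen in the failed tasks

data = np.load("demo.npy", allow_pickle=True)   # works, but is the app's call to make
print(data)
```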

---

Joined: 26 Dec 13 · Posts: 86 · Credit: 1,292,358,731 · RAC: 0

Thanks a lot!

---

Joined: 26 Dec 13 · Posts: 86 · Credit: 1,292,358,731 · RAC: 0

There are more and more problem WUs -_-

https://www.gpugrid.net/result.php?resultid=33022242 - BrokenPipeError: [WinError 232] The pipe is being closed
https://www.gpugrid.net/result.php?resultid=33022703 - BrokenPipeError: [WinError 109] The pipe has been ended

---

Joined: 10 Nov 13 · Posts: 101 · Credit: 15,773,211,122 · RAC: 0

I have been seeing a number of the same failures for quite some time as well.

https://www.gpugrid.net/result.php?resultid=33024302
https://www.gpugrid.net/result.php?resultid=33024910
https://www.gpugrid.net/result.php?resultid=33025156

It works out to about a 26% error rate. There really isn't a good indication of why these are failing, but I would expect it's either a Python error or something inherent in simulations that don't work. The scientists will need to figure out these issues and let us know.

Jeff