Message boards : Graphics cards (GPUs) : Python apps for GPU hosts errors

Joined: 21 Oct 21 · Posts: 4 · Credit: 223,165,413 · RAC: 43
I'm fairly certain I've been running these "Python apps for GPU hosts" tasks successfully before. Now I see 85-90% of them ending with an "Error while computing" status. When I check, I am one of 4-8 hosts with the same status, although not necessarily the same underlying error. Examples: http://www.gpugrid.net/workunit.php?wuid=27392690 and http://www.gpugrid.net/result.php?resultid=33277602. Anyway, the error I'm seeing is:
Define learner
Created Learner.
Look for a progress_last_chk file - if exists, adjust target_env_steps
Define train loop
Traceback (most recent call last):
File "C:\ProgramData\BOINC\slots\3\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 196, in get_data
self.next_batch = self.batches.__next__()
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
Last in the traceback is the following, and I'm not sure whether it is the original exception. If it is, can I adjust max_split_size_mb (how and where), and what would be a good value for it?

RuntimeError: CUDA out of memory. Tried to allocate 202.00 MiB (GPU 0; 2.00 GiB total capacity; 1.23 GiB already allocated; 0 bytes free; 1.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
19:02:23 (17760): python.exe exited; CPU time 1095.984375

Thoughts, suggestions... Thanks in advance.
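For anyone wondering how max_split_size_mb is normally applied: PyTorch reads it from the PYTORCH_CUDA_ALLOC_CONF environment variable before the first CUDA allocation, so it would have to be present in the environment the BOINC science app inherits (for example a system-wide variable) - the project does not expose it as a setting, and no allocator option can make a 2 GB card hold a workload that needs more. A minimal sketch, with 128 MiB as a purely illustrative value:

```python
# Illustrative only: PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA
# allocation, so set it in the environment (or at the very top of the script)
# before torch touches the GPU.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after the variable is set

if torch.cuda.is_available():
    free_b, total_b = torch.cuda.mem_get_info(0)  # free/total VRAM in bytes
    print(f"GPU 0: {free_b / 2**20:.0f} MiB free of {total_b / 2**20:.0f} MiB")
```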

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
The answer is simple: GPUs with only 2 GB VRAM are too small for processing Python tasks.

Joined: 21 Oct 21 · Posts: 4 · Credit: 223,165,413 · RAC: 43
> GPUs with only 2 GB VRAM are too small for processing Python tasks.

OK, sure. Then what has changed? I was running these tasks successfully up until 2 (or maybe 3) weeks ago, and my system is the same. Did I miss something?

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
The latest series of 1000 tasks uses more VRAM, as posted by the researcher.

Joined: 21 Oct 21 · Posts: 4 · Credit: 223,165,413 · RAC: 43
> The latest series of 1000 tasks uses more VRAM as posted by the researcher.

Thanks, that figures... I have missed something. Is there a link or forum post from the researcher that you could point me to?

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
> GPUs with only 2 GB VRAM are too small for processing Python tasks.

About 3 weeks ago, ACEMD3 tasks were distributed for a while, but no Pythons. Maybe you crunched ACEMD3 tasks at that time? They need nowhere near as much VRAM as the Pythons do.

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
> The latest series of 1000 tasks uses more VRAM as posted by the researcher.

Hm, that's strange - here it seems to be the other way round. From what I can see on my Quadro P5000, with 4 Pythons running concurrently, VRAM use was nearly 16 GB before; now it's below 12 GB.

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
> The latest series of 1000 tasks uses more VRAM as posted by the researcher.

Most recently, 4 Pythons running concurrently on the P5000 use roughly 9.8 GB of VRAM - so it keeps getting lower.

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
> The latest series of 1000 tasks uses more VRAM as posted by the researcher.

You should check which stage the tasks are at for a more insightful picture of what's happening. When a task first starts, for about the first 5 minutes, it is only extracting the archive and uses no VRAM during that time. From about 5 to 10 minutes it uses a reduced amount, 2-3 GB. Then, after 10-15 minutes or so, it reaches the main process and uses the full amount, about 3-4 GB.

So far I have noticed two main sizes in the new batches. I have some tasks using about 3 GB (which is the same as a few weeks ago) and some tasks using about 4 GB, which lines up more with the recent tasks. I have not noticed any key indicator in the file names to determine which tasks use the lower amount of VRAM and which use more.
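If you want to watch those stage transitions yourself, a rough sketch along these lines (assuming the nvidia-ml-py / pynvml package is installed and the card of interest is GPU 0) logs the device-wide VRAM use every 30 seconds; note that it counts everything on the card, not just the GPUGRID task:

```python
# Minimal VRAM logger using pynvml (the NVML bindings that nvidia-smi also uses).
# GPU index 0 and the 30-second interval are assumptions - adjust as needed.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    while True:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # .used/.free/.total in bytes
        print(f"{time.strftime('%H:%M:%S')}  "
              f"used {mem.used / 2**30:.2f} GiB of {mem.total / 2**30:.2f} GiB")
        time.sleep(30)
finally:
    pynvml.nvmlShutdown()
```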

Joined: 21 Oct 21 · Posts: 4 · Credit: 223,165,413 · RAC: 43
> about 3 weeks ago, ACEMD3 tasks were distributed for a while, but no Pythons.

Fair enough - I can't say for sure which application(s) were shown in my GPUGRID task list. I think my assumption was based on the processes seen in (Windows) Task Manager, where I would see dozens of Python processes while a GPUGRID task was running; maybe applications other than "Python apps for GPU hosts" use Python too(?). And, going back further than 3 weeks, never until now have I seen so many tasks failing.

As luck would have it, I processed one "Python apps for GPU hosts" task successfully overnight, and another is currently running past the usual failure point. It would still be nice to see a link or forum post from the researcher(s) with requirements and release notes for the applications.

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
> you should check at which stage of running the tasks are on for a more insightful picture on what's happening.

On the Quadro P5000, the status at this moment is as follows:
task 1: 82% - 19:58 hrs
task 2: 31% - 7:09 hrs
task 3: 14% - 2:43 hrs
task 4: 22% - 4:36 hrs
VRAM use: 9,834 MB - and this even includes a few hundred MB for the monitor.
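To separate what the four tasks use from what the desktop itself uses, a per-process breakdown is one option - sketched below with pynvml, GPU 0 assumed. Be aware that Windows WDDM drivers often cannot attribute memory per process, in which case only the device-wide total is meaningful:

```python
# Per-process VRAM attribution via pynvml. Under Windows WDDM the per-process
# figure is frequently unavailable and is reported as None.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    used = proc.usedGpuMemory  # bytes, or None when the driver cannot report it
    shown = f"{used / 2**20:.0f} MiB" if used is not None else "N/A (WDDM)"
    print(f"pid {proc.pid}: {shown}")
pynvml.nvmlShutdown()
```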

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
All the pertinent information about the Python tasks is always posted in the main thread in News: https://www.gpugrid.net/forum_thread.php?id=5233

The statement about the memory reduction for the next series is here: https://www.gpugrid.net/forum_thread.php?id=5233&nowrap=true#59838

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
> This statement about the memory reduction for the next series is here.

From what I can see on all my hosts that crunch Pythons, the VRAM requirement of the recent tasks has dropped considerably.

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Maybe this is some change affecting Windows only. All my tasks are still using 3-4 GB each.

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
I'm still seeing 3-4 GB each for the Python tasks on my Linux Ubuntu hosts as well.

Joined: 8 Aug 19 · Posts: 252 · Credit: 458,054,251 · RAC: 0
> Maybe this is some change affecting windows only.

No guys, my Windows hosts are using the same ~4 GB of graphics memory on the latest released WUs. Earlier I noticed that some "exp" tasks used over 6 GB, so there must be some variance among tasks. I wonder if he maybe saw some ACEMD tasks go through and mistook them for PythonGPUs. Running a PythonGPU on 2 GB seems almost impossible to me.

"Together we crunch / To check out a hunch / And wish all our credit / Could just buy us lunch" - Piasa Tribe, Illini Nation

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
> you should check at which stage of running the tasks are on for a more insightful picture on what's happening.

The 4 Pythons that have each been running for several hours right now are using even less VRAM than the ones reported above from 2 days ago - total VRAM use is 8,840 MB. So there seems to be quite some variance between these Pythons.

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
I don't believe your numbers. Whatever utility you are using in Windows is not reporting correctly, or, more likely, you are misinterpreting what it displays or looking at the wrong numbers. I will believe what nvidia-smi.exe shows.

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
> I don't believe your numbers. Whatever utility you are using in Windows is not reporting correctly or more likely you are interpreting what it displays or looking at the wrong numbers.

The utility I use is GPU-Z, so maybe it does indeed show wrong figures; I cannot tell for sure, of course. As I said in another thread about a week ago, nvidia-smi unfortunately does not work here, no idea why - it fails with an "access denied" error.

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
There must be some way to run the command in a Windows terminal with elevated rights. nvidia-smi is a user-level application that Nvidia provides with all driver distributions. https://www.minitool.com/news/elevated-command-prompt.html
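For what it's worth, once nvidia-smi runs at all, its query mode is easy to script. A rough sketch - the fallback path below is a common but not universal install location on Windows, so adjust to wherever nvidia-smi.exe lives, or rely on PATH:

```python
# Query per-GPU memory counters through nvidia-smi's machine-readable output.
import shutil
import subprocess

exe = (shutil.which("nvidia-smi")
       or r"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe")

out = subprocess.run(
    [exe, "--query-gpu=memory.used,memory.total", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for i, line in enumerate(out.strip().splitlines()):
    used_mib, total_mib = (int(x) for x in line.split(","))
    print(f"GPU {i}: {used_mib} MiB used of {total_mib} MiB total")
```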