Experimental Python tasks (beta) - task description

Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 58116 - Posted: 15 Dec 2021, 19:29:54 UTC
Last modified: 15 Dec 2021, 19:48:28 UTC

Task e1a15-ABOU_rnd_ppod_3-0-1-RND2976_3 was the first to run after the reset, but unfortunately it failed too.

Edit - so did e1a14-ABOU_rnd_ppod_3-0-1-RND3383_2, on the same machine.

This host also has 16 GB of system RAM; the GPU is a GTX 1660 Ti.
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,909,595
RAC: 4,232,576
Message 58117 - Posted: 15 Dec 2021, 19:40:45 UTC - in response to Message 58114.  
Last modified: 15 Dec 2021, 19:43:12 UTC

I reset the project on my host; it still failed.

WU: http://gpugrid.net/workunit.php?wuid=27102456

I see that ServicEnginIC and I both had the same error. We also both have only 16GB of system memory on our hosts.

Aurum previously reported very high system memory use, but didn't say whether it was real or virtual.

However, I can elaborate further to confirm that it's real.

https://i.imgur.com/XwAj4s3.png

A lot of it seems to stem from the ~4GB used by the python run.py process, plus ~184MB for each of the 32 multiprocessing spawns that appear to be running. I'm not sure whether these are intended to run, or whether they are an artifact of setup that never got cleaned up.

I'm not certain, but it's possible that the task ultimately failed due to lack of resources, with both RAM and swap maxed out. Maybe the next system that gets it will succeed with its 64GB TR setup?

abouh, is it intended to keep this much system memory in use during these tasks, or is this just something left over that was supposed to be cleaned up? It might be helpful to know the exact system requirements so people with unsuitable hardware do not try to run these tasks. If these tasks are going to use this much memory and all of the CPU cores, we should be prepared for that ahead of time.
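If anyone wants to double-check real vs. virtual usage on their own host, something like this quick psutil sketch should do it (illustrative only; it just totals every process with run.py in its command line, so adjust the marker if your setup looks different):

# Rough sketch: total resident (real) vs. virtual memory for the python
# task's process tree. Assumes psutil is installed and that the main
# process shows "run.py" in its command line; both are assumptions.
import psutil

def python_task_memory(marker="run.py"):
    rss = vms = 0
    for proc in psutil.process_iter(["cmdline"]):
        try:
            cmdline = " ".join(proc.info["cmdline"] or [])
            if marker in cmdline:
                mem = proc.memory_info()
                rss += mem.rss   # resident set size (real RAM)
                vms += mem.vms   # virtual size
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return rss, vms

if __name__ == "__main__":
    rss, vms = python_task_memory()
    print(f"real: {rss / 2**30:.1f} GiB, virtual: {vms / 2**30:.1f} GiB")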
Keith Myers

Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Message 58118 - Posted: 15 Dec 2021, 23:25:46 UTC - in response to Message 58117.  

I couldn't get your imgur image to load, just a spinner.
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,909,595
RAC: 4,232,576
Message 58119 - Posted: 16 Dec 2021, 0:13:31 UTC - in response to Message 58118.  

Yeah, I get a message that Imgur is over capacity (first time I've ever seen that). Their site must be undergoing maintenance or getting hammered. It was working earlier; I guess just try again a little later.
mmonnin

Joined: 2 Jul 16
Posts: 338
Credit: 7,987,341,558
RAC: 178,897
Message 58120 - Posted: 16 Dec 2021, 0:26:37 UTC

I've had two tasks complete on a host that was previously erroring out:

https://www.gpugrid.net/workunit.php?wuid=27102460
https://www.gpugrid.net/workunit.php?wuid=27101116

Between 12:45:58 UTC and 19:44:33 UTC, a task failed and then completed without any changes, resets, or anything else from me.

Wildly different runtime/credit ratios; I would expect something in between.

Run time (s)    Credit        Credit/sec
3,389.26        264,786.85    78
49,311.35       34,722.22     0.70

CUDA task, for comparison:
26,635.40       420,000.00    15.77
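(The credit/sec column is just credit divided by run time; trivial to check:)

# quick arithmetic check of the ratios quoted above
for runtime, credit in [(3389.26, 264786.85), (49311.35, 34722.22), (26635.40, 420000.00)]:
    print(f"{runtime:>10.2f} s  {credit:>11.2f} credit  {credit / runtime:.2f} credit/s")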
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58123 - Posted: 16 Dec 2021, 9:44:51 UTC - in response to Message 58117.  

Hello everyone,

The reset was only to solve the error reported in e1a12-ABOU_rnd_ppod_3-0-1-RND1575_0 and other jobs, related to a dependency called "pinocchio". I have checked the jobs reported to have errors after the reset, and it seems this particular error is not present in those jobs.

Regarding the memory usage, it is real, as you report. The ~4GB comes from the main script containing the AI agent and the training process. The 32 multiprocessing spawns are intended; each one contains an instance of the environment the agent interacts with to learn. Some RL environments run on the GPU, but unfortunately the one we are working with at the moment does not. I get a total of 15GB locally when running one job, which could probably explain some job failures. Running all these environments in parallel is also more CPU-intensive, as mentioned. The process to train the AI interleaves phases of data collection from interactions with the environment instances (CPU-intensive) with phases of learning (GPU-intensive).

I will test locally whether the AI agent still learns when interacting with fewer instances of the environment at the same time; that could help reduce the memory requirements a bit in future jobs. However, for now the most immediate jobs will have similar requirements.
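In case it helps to picture it, the structure is roughly like the following sketch (a simplified illustration only, not the actual task code; the environment, the "learning" step and all the numbers are placeholders):

# Simplified sketch of the interleaved collect/learn loop described above.
# Illustration only: the environment, learning step and numbers are
# placeholders, not the real GPUGRID task code.
import multiprocessing as mp
import random

NUM_ENVS = 32          # parallel environment instances (the CPU-heavy spawns)
STEPS_PER_PHASE = 128  # env steps each worker collects before a learning phase

def env_worker(conn):
    """One environment instance; returns a fake rollout for each 'collect' request."""
    state = random.random()
    while True:
        msg = conn.recv()
        if msg == "stop":
            break
        rollout = []
        for _ in range(STEPS_PER_PHASE):        # CPU-intensive data collection
            action = random.random()
            next_state, reward = random.random(), random.random()
            rollout.append((state, action, reward, next_state))
            state = next_state
        conn.send(rollout)

def learn(batch):
    """Stand-in for the GPU-intensive learning phase (a GPU model update in the real task)."""
    return len(batch)

if __name__ == "__main__":
    pipes, procs = [], []
    for _ in range(NUM_ENVS):
        parent, child = mp.Pipe()
        p = mp.Process(target=env_worker, args=(child,), daemon=True)
        p.start()
        pipes.append(parent)
        procs.append(p)

    for _ in range(10):                         # a few training iterations
        for conn in pipes:                      # collection phase (CPU)
            conn.send("collect")
        batch = [t for conn in pipes for t in conn.recv()]
        learn(batch)                            # learning phase (GPU in the real task)

    for conn in pipes:
        conn.send("stop")
    for p in procs:
        p.join()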


abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58124 - Posted: 16 Dec 2021, 10:15:12 UTC - in response to Message 58120.  

Yes, I was progressively testing how many steps the agents could be trained for, and I forgot to increase the credits proportionally to the training steps. I will correct that in the very next batch; sorry, and thanks for bringing it to our attention.
PDW

Joined: 7 Mar 14
Posts: 18
Credit: 6,575,125,525
RAC: 1,038
Message 58125 - Posted: 16 Dec 2021, 10:23:45 UTC - in response to Message 58123.  

On mine, free memory (as reported in top) dropped from approximately 25,500 (when running an ACEMD task) to 7,000.
That I can manage.

However, the task also spawns one process for each thread (x) the machine has, and anywhere from 1 to x of these processes can be running at any one time. The value of x is based on the machine's thread count, not on what BOINC is configured to use; in addition, BOINC has no idea these processes exist, so they cannot be taken into account for scheduling purposes. The result is that the machine can at times be loading the CPU up to twice as much as expected. That I can't manage unless I run only one of these tasks and the machine is doing nothing else, which isn't going to happen.
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,909,595
RAC: 4,232,576
Message 58127 - Posted: 16 Dec 2021, 14:18:23 UTC - in response to Message 58123.  

Thanks for the clarification.

I agree with PDW that running work on all CPU threads, when BOINC expects at most one CPU thread to be used, will be problematic for most users who run CPU work from other projects.

In my case, I did notice that each spawn used only a little CPU, but I'm not sure if this is the case for everyone. You could in theory tell BOINC how much CPU these tasks are using by setting a cpu_usage value over 1 in app_config.xml for the Python tasks. For example, it looks like only ~10% of a thread was being used per spawn, so for my 32-thread CPU that would equate to about 4 threads' worth (3.2 rounded up). So maybe something like:

<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <cpu_usage>4</cpu_usage>
      <gpu_usage>1</gpu_usage>
    </gpu_versions>
  </app>
</app_config>

You'd have to pick a cpu_usage value appropriate for your own CPU use, and test to see if it works as desired.
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 58132 - Posted: 16 Dec 2021, 16:56:20 UTC - in response to Message 58127.  

I agree with PDW that running work on all CPU threads, when BOINC expects at most one CPU thread to be used, will be problematic for most users who run CPU work from other projects.

The normal way of handling that is to use the [MT] (multi-threaded) plan class mechanism in BOINC - these trial apps are being issued using the same [cuda1121] plan class as the current ACEMD production work.

Having said that, it might be quite tricky to devise a combined [CUDA + MT] plan class. BOINC code usually expects a simple-minded either/or solution, not a combination. And I don't really like the standard MT implementation, which defaults to using every possible CPU core in the volunteer's computer. Not polite.

MT can be tamed by using an app_config.xml or app_info.xml file, but you may need to tweak both <cpu_usage> (for BOINC scheduling purposes) and something like a command line parameter to control the spawning behaviour of the app.
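To illustrate the point (this is a guess at how such an app might size its worker pool, not anything confirmed for this particular application): a Python app that builds its pool from os.cpu_count() will use every hardware thread unless it is given an explicit override, for example via a hypothetical --num-workers command line parameter:

# Guesswork illustration: a worker pool sized from os.cpu_count() uses every
# hardware thread on the machine unless an explicit override is provided.
# The --num-workers flag here is hypothetical, not an option of the real app.
import argparse
import os
from multiprocessing import Pool

def simulate_env_step(seed):
    # placeholder CPU-bound work standing in for one environment step
    return sum(i * seed for i in range(10_000))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-workers", type=int, default=os.cpu_count(),
                        help="worker processes; defaults to every hardware thread")
    args = parser.parse_args()

    # With the default, a 48-thread host gets 48 workers regardless of how
    # many CPUs BOINC has budgeted for the task.
    with Pool(processes=args.num_workers) as pool:
        results = pool.map(simulate_env_step, range(256))
    print(f"ran {len(results)} steps on {args.num_workers} workers")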
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,909,595
RAC: 4,232,576
Message 58134 - Posted: 16 Dec 2021, 18:20:00 UTC

Given the current state of these beta tasks, I have done the following on my 7x GPU, 48-thread system: allowed only 3 Python beta tasks to run at a time, since the systems only have 64GB of RAM and each one is using ~20GB.

app_config.xml

<app_config>
  <app>
    <name>acemd3</name>
    <gpu_versions>
      <cpu_usage>1.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <cpu_usage>5.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
    <max_concurrent>3</max_concurrent>
  </app>
</app_config>


I'll see how it works out when more Python beta tasks flow, and adjust as the project adjusts its settings.

abouh, before you start releasing more beta tasks, could you give us a heads-up on what we should expect and/or what you changed about them?
Keith Myers

Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Message 58135 - Posted: 16 Dec 2021, 18:22:58 UTC

I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,909,595
RAC: 4,232,576
Message 58136 - Posted: 16 Dec 2021, 18:52:22 UTC - in response to Message 58135.  

I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.


Good to know Keith.

Did you by chance get a look at GPU utilization? Or CPU thread utilization of the spawns?
Keith Myers

Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Message 58137 - Posted: 16 Dec 2021, 19:14:26 UTC - in response to Message 58136.  

I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.


Good to know Keith.

Did you by chance get a look at GPU utilization? Or CPU thread utilization of the spawns?

GPU utilization was at 3%. Each spawn used up about 170MB of memory and fluctuated around 13-17% CPU utilization.
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,909,595
RAC: 4,232,576
Message 58138 - Posted: 16 Dec 2021, 19:18:43 UTC - in response to Message 58137.  

Good to know, so what I experienced was pretty similar.

I'm sure you also had some other CPU tasks running. I wonder if the CPU utilization of the spawns would be higher if no other CPU tasks were running.
Keith Myers

Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Message 58140 - Posted: 16 Dec 2021, 21:00:08 UTC - in response to Message 58138.  

Yes, primarily Universe and a few TN-Grid tasks were also running.
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58141 - Posted: 17 Dec 2021, 10:17:36 UTC - in response to Message 58134.  

I will send some more tasks later today with similar requirements to the last ones, with 32 reinforcement learning environments running in parallel (one per spawned process) for the agent to interact with.

For one job, locally I see around 15GB of system memory in use, and each CPU at 13%-17% utilisation, as mentioned. For the GPU, the usage fluctuates between low use (5%-10%) during the phases in which the agent collects data from the environments, and short high-utilisation peaks of a few seconds when the agent uses the data to learn (I get between 50% and 80%).

I will try to train the agents for a bit longer than in the last tasks. I have already corrected the credits of the tasks, in proportion to the number of interactions between the agent and the environments occurring in each task.

Ian&Steve C.

Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,909,595
RAC: 4,232,576
Message 58143 - Posted: 17 Dec 2021, 16:48:28 UTC - in response to Message 58141.  

I got 3 of them just now. All failed with tracebacks after several minutes of run time; it seems there are still some coding bugs in the application. All wingmen are failing similarly:

https://gpugrid.net/workunit.php?wuid=27102526
https://gpugrid.net/workunit.php?wuid=27102527
https://gpugrid.net/workunit.php?wuid=27102525


The GPU (2080 Ti) was loaded to ~10-13% utilization, but at base clocks of 1350MHz and only ~65W power draw. GPU memory use was 2-4GB. System memory reached ~25GB utilization while 2 tasks were running at the same time. CPU thread utilization was ~25-30% across all 48 threads (EPYC 7402P); it didn't cap at 32 threads, and that's about twice as much CPU utilization as expected, but maybe that's due to the relatively low clock speed of 3.35GHz. (I paused other CPU processing during this time.)
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,909,595
RAC: 4,232,576
Message 58144 - Posted: 17 Dec 2021, 16:54:43 UTC - in response to Message 58143.  
Last modified: 17 Dec 2021, 16:58:05 UTC

The new one I just got seems to be doing better: less CPU use, and it looks like I'm seeing the mentioned 60-80% spikes on the GPU occasionally.

This one succeeded on the same host as the above three.

https://gpugrid.net/workunit.php?wuid=27102535
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58145 - Posted: 17 Dec 2021, 17:21:35 UTC - in response to Message 58144.  
Last modified: 17 Dec 2021, 17:26:54 UTC

I normally test the jobs locally first, and then run a couple of small batches of tasks on GPUGRID in case some error occurs that did not appear locally. The first small batch failed, so I could fix the error in the second one. Now that the second batch has succeeded, I will send a bigger batch of tasks.