Message boards : News : Experimental Python tasks (beta) - task description
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 295,172
Task e1a15-ABOU_rnd_ppod_3-0-1-RND2976_3 was the first to run after the reset, but unfortunately it failed too. Edit - so did e1a14-ABOU_rnd_ppod_3-0-1-RND3383_2, on the same machine. This host also has 16 GB of system RAM; the GPU is a GTX 1660 Ti.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,535,595 · RAC: 4,302,611
I reset the project on my host; it still failed. WU: http://gpugrid.net/workunit.php?wuid=27102456

I see that ServicEnginIC and I both had the same error, and we also both have only 16GB of system memory on our hosts. Aurum previously reported very high system memory use, but didn't elaborate on whether it was real or virtual. I can confirm that it's real: https://i.imgur.com/XwAj4s3.png

A lot of it seems to stem from the ~4GB used by the python run.py process, plus another ~184MB for each of the 32 multiproc spawns that appear to be running. I'm not sure whether these are intended to run, or whether they are an artifact of setup that never got cleaned up. I'm not certain, but it's possible that the task ultimately failed due to lack of resources, with both RAM and swap maxed out. Maybe the next system that gets it, a 64GB Threadripper host, will succeed.

abouh, is it intended that these tasks keep this much system memory in use? Or is this just something left over that was supposed to be cleaned up? It would be helpful to know the exact system requirements so that people with unsupported hardware don't try to run these tasks. If these tasks are going to use this much memory and all of the CPU cores, we should be prepared for that ahead of time.
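As a rough sanity check on those numbers (only a back-of-the-envelope sketch using the figures reported above, not a measurement of the actual task), the main process plus the 32 spawns alone come to almost 10GB before the OS, desktop, and any other BOINC work are counted, which would fit with a 16GB host running out of RAM and swap:

# Back-of-the-envelope estimate from the figures reported above (assumed, not re-measured here).
main_process_gb = 4.0      # python run.py
per_spawn_gb = 0.184       # ~184MB per multiproc spawn
num_spawns = 32

total_gb = main_process_gb + num_spawns * per_spawn_gb
print(f"~{total_gb:.1f} GB")   # roughly 9.9 GB for one task, before anything else on the host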
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
I couldn't get your imgur image to load, just a spinner.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,535,595 · RAC: 4,302,611
Yeah, I get a message that Imgur is over capacity (first time I've ever seen that). Their site must be having maintenance or getting hammered. It was working earlier. I guess just try again a little later.
Joined: 2 Jul 16 · Posts: 338 · Credit: 7,987,341,558 · RAC: 178,897
I've had two tasks complete on a host that was previously erroring out:
https://www.gpugrid.net/workunit.php?wuid=27102460
https://www.gpugrid.net/workunit.php?wuid=27101116

Between 12:45:58 UTC and 19:44:33 UTC a task failed and then one completed, without any changes, resets, or other intervention from me. The runtime/credit ratios are wildly different; I would expect something in between.

Run time | Credit | Credit/sec
---|---|---
3,389.26 | 264,786.85 | 78/s
49,311.35 | 34,722.22 | 0.70/s
26,635.40 (CUDA) | 420,000.00 | 15.77/s
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Hello everyone,

The reset was only to solve the error reported in e1a12-ABOU_rnd_ppod_3-0-1-RND1575_0 and other jobs, related to a dependency called "pinocchio". I have checked the jobs reported to have errors after resetting, and it seems this error is not present in those jobs.

Regarding the memory usage, it is real, as you report. The ~4GB is from the main script containing the AI agent and the training process. The 32 multiproc spawns are intended: each one contains an instance of the environment the agent interacts with to learn. Some RL environments run on the GPU, but unfortunately the one we are working with at the moment does not. I get a total of 15GB locally when running one job. This could probably explain some job failures. Running all these environments in parallel is also more CPU intensive, as mentioned. The process that trains the AI interleaves phases of data collection from interactions with the environment instances (CPU intensive) with phases of learning (GPU intensive).

I will test locally whether the AI agent still learns by interacting with fewer instances of the environment at the same time, which could help reduce the memory requirements a bit in future jobs. However, for now the most immediate jobs will have similar requirements.
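As a rough illustration of that interleaved pattern (a minimal, self-contained Python sketch, not the actual GPUGRID run.py; all names and figures here are made up), each spawned process hosts one dummy environment, and the main process alternates a CPU-bound collection phase with the point where the GPU-bound learning burst would happen:

# Hypothetical sketch of the collect/learn loop described above, NOT the real run.py.
# One spawned process per environment instance; the main process alternates
# CPU-bound data collection with the (here omitted) GPU-bound learning step.
import multiprocessing as mp
import random

NUM_ENVS = 32          # matches the ~32 spawns observed per task
STEPS_PER_PHASE = 128  # arbitrary for this sketch

def env_worker(conn):
    """One dummy environment per process; step whenever an action arrives."""
    state = 0.0
    while True:
        action = conn.recv()
        if action is None:                   # shutdown signal from the main process
            break
        state += action                      # stand-in for env.step(action)
        conn.send((state, random.random()))  # observation and reward

def main():
    ctx = mp.get_context("spawn")
    pipes, procs = [], []
    for _ in range(NUM_ENVS):
        parent, child = ctx.Pipe()
        proc = ctx.Process(target=env_worker, args=(child,), daemon=True)
        proc.start()
        pipes.append(parent)
        procs.append(proc)

    for update in range(3):                  # a few collect/learn cycles
        batch = []                           # collection phase: CPU-intensive
        for _ in range(STEPS_PER_PHASE):
            for pipe in pipes:
                pipe.send(1.0)               # the agent's action would go here
            batch.extend(pipe.recv() for pipe in pipes)
        # Learning phase: on the real app this is the short GPU-intensive burst.
        print(f"update {update}: collected {len(batch)} transitions")

    for pipe in pipes:                       # tell workers to exit, then clean up
        pipe.send(None)
    for proc in procs:
        proc.join()

if __name__ == "__main__":
    main()

With one long-lived process per environment, the per-spawn memory footprint and the bursty CPU-then-GPU load pattern described above fall out naturally.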
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Yes, I was progressively testing how many steps the agents could be trained for, and I forgot to increase the credits proportionally to the training steps. I will correct that in the very next batch. Sorry, and thanks for bringing it to our attention.
Joined: 7 Mar 14 · Posts: 18 · Credit: 6,575,125,525 · RAC: 1,038
On mine, free memory (as reported in top) dropped from approximately 25,500 MB (when running an ACEMD task) to 7,000 MB. That I can manage. However, the task also spawns one process per machine thread (x), and anywhere from 1 to x of them can be running at any one time. The value of x is based on the machine's thread count, not on what BOINC is configured for; in addition, BOINC has no idea these processes exist, so it cannot take them into account for scheduling purposes. The result is that the machine can at times be loading the CPU up to twice as much as expected. That I can't manage unless I run only one of these tasks and the machine is doing nothing else, which isn't going to happen.
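Purely as a hypothetical illustration of that mismatch (we don't know how run.py actually sizes its worker pool): a pool sized from os.cpu_count() starts one worker per hardware thread, no matter how many CPUs BOINC has budgeted for the task.

# Hypothetical sketch only, not the actual GPUGRID app: a pool sized from the
# hardware sees every thread, regardless of BOINC's CPU budget for the task.
import multiprocessing as mp
import os

if __name__ == "__main__":
    hardware_threads = os.cpu_count() or 1   # e.g. 48 on a 48-thread machine
    boinc_budget = 1                         # what BOINC thinks the task will use
    with mp.Pool(processes=hardware_threads) as pool:
        pool.map(abs, range(1000))           # stand-in for real work
    print(f"BOINC budgeted {boinc_budget} CPU; the pool started {hardware_threads} workers")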
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,535,595 · RAC: 4,302,611
Thanks for the clarification. I agree with PDW that running work on all CPU threads when BOINC expects at most 1 CPU thread to be used will be problematic for most users who run CPU work from other projects.

In my case, I did notice that each spawn used only a little CPU, but I'm not sure if this is the case for everyone. You could in theory tell BOINC how much CPU these tasks are using by setting a value over 1 in app_config.xml for the Python tasks. For example, it looks like only ~10% of a thread was being used per spawn, so for my 32-thread CPU that would equate to about 4 threads' worth (3.2, rounded up). So maybe something like:

<app>
  <name>PythonGPU</name>
  <gpu_versions>
    <cpu_usage>4</cpu_usage>
    <gpu_usage>1</gpu_usage>
  </gpu_versions>
</app>

You'd have to pick a cpu_usage value appropriate for your CPU use, and test to see if it works as desired.
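(A trivial sketch of that estimate, using the figures assumed in the post above; measure your own per-spawn CPU use before settling on a number.)

# Rough cpu_usage estimate for app_config, from the assumed figures above.
import math

num_spawns = 32             # environment processes per task
per_spawn_fraction = 0.10   # ~10% of one thread each, as observed here

cpu_usage = math.ceil(num_spawns * per_spawn_fraction)   # 3.2 -> 4
print(cpu_usage)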
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 295,172
> I agree with PDW that running work on all CPU threads when BOINC expects at most 1 CPU thread to be used will be problematic for most users who run CPU work from other projects.

The normal way of handling that is to use the [MT] (multi-threaded) plan class mechanism in BOINC - these trial apps are being issued under the same [cuda1121] plan class as the current ACEMD production work. Having said that, it might be quite tricky to devise a combined [CUDA + MT] plan class; BOINC code usually expects a simple-minded either/or solution, not a combination. And I don't really like the standard MT implementation, which defaults to using every possible CPU core in the volunteer's computer. Not polite.

MT can be tamed with an app_config.xml or app_info.xml file, but you may need to tweak both <cpu_usage> (for BOINC scheduling purposes) and something like a command line parameter to control the spawning behaviour of the app.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,535,595 · RAC: 4,302,611
Given the current state of these beta tasks, I have done the following on my 7-GPU, 48-thread system: allowed only 3 Python beta tasks to run at once, since the system only has 64GB of RAM and each process is using ~20GB.

app_config.xml:

<app_config>
  <app>
    <name>acemd3</name>
    <gpu_versions>
      <cpu_usage>1.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <cpu_usage>5.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
    <max_concurrent>3</max_concurrent>
  </app>
</app_config>

I'll see how it works out when more Python beta tasks flow, and adjust as the project adjusts its settings.

abouh, before you start releasing more beta tasks, could you give us a heads-up on what we should expect and/or what you changed about them?
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,535,595 · RAC: 4,302,611
> I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.

Good to know, Keith. Did you by chance get a look at GPU utilization? Or CPU thread utilization of the spawns?
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
> I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.

GPU utilization was at 3%. Each spawn used up about 170MB of memory and fluctuated around 13-17% CPU utilization.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,535,595 · RAC: 4,302,611
Good to know. So what I experienced was pretty similar. I'm sure you also had some other CPU tasks running too; I wonder if CPU utilization of the spawns would be higher if no other CPU tasks were running.
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
Yes, primarily Universe and a few TN-Grid tasks were also running.
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
I will send some more tasks later today with similar requirements to the last ones, with 32 reinforcement learning environments (the multiprocessing spawns) running in parallel for the agent to interact with. For one job, locally I see around 15GB of system memory use, and 13%-17% utilisation on each CPU thread, as mentioned. GPU usage fluctuates between low utilisation (5%-10%) during the phases in which the agent collects data from the environments, and short high-utilisation peaks of a few seconds when the agent uses the data to learn (I get between 50% and 80%).

I will try to train the agents for a bit longer than in the last tasks. I have already corrected the credits of the tasks, in proportion to the number of interactions between the agent and the environments occurring in each task.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,535,595 · RAC: 4,302,611
I got 3 of them just now. All failed with tracebacks after several minutes of run time, so it seems there are still some coding bugs in the application. All wingmen are failing similarly:
https://gpugrid.net/workunit.php?wuid=27102526
https://gpugrid.net/workunit.php?wuid=27102527
https://gpugrid.net/workunit.php?wuid=27102525

The GPU (2080 Ti) was loaded at ~10-13% utilization, but at base clocks (1350MHz) and only ~65W power draw, with 2-4GB of GPU memory in use. System memory reached ~25GB utilization while 2 tasks were running at the same time. CPU thread utilization was ~25-30% across all 48 threads (EPYC 7402P); it didn't cap at 32 threads, which is about twice as much CPU utilization as expected, but maybe that's due to the relatively low clock speed of 3.35GHz. (I paused other CPU processing during this time.)
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,535,595 · RAC: 4,302,611
The new one I just got seems to be doing better: less CPU use, and it looks like I'm occasionally seeing the mentioned 60-80% spikes on the GPU. This one succeeded on the same host as the above three: https://gpugrid.net/workunit.php?wuid=27102535
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
I normally test the jobs locally first, and then run a couple of small batches of tasks on GPUGrid in case some error occurs that did not appear locally. The first small batch failed, so I could fix the error in the second one. Now that the second batch has succeeded, I will send a bigger batch of tasks.