Message boards : News : Experimental Python tasks (beta) - task description

Joined: 8 Aug 19 · Posts: 252 · Credit: 458,054,251 · RAC: 0
Good to see Zoltan here again, welcome back!

~~~~~~~~~~~~

I need to correct what I reported on the program data folder to KAMasud earlier. The folder is not hidden (as Erich56 noted) but is a system folder, so in Windows I've had to enable access to system files and folders on a new install in order to see it. Just in case you're still having trouble.

Joined: 8 Aug 19 · Posts: 252 · Credit: 458,054,251 · RAC: 0
> they already fixed the overused CPU issue. it's now capped at 4x CPU threads and hard coded in the run.py script. but that is in addition to the 32 threads for the agents. there is no way to reduce that unless abouh wanted to use fewer agents, but i don't think he does at this time.

I am enjoying watching abouh gain prowess at scripting with each run, using less and less resources as they evolve. Real progress. Godspeed to abouh and crew.

Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
> My reason to reduce their numbers is to run two tasks at the same time to increase GPU usage, because I need the full heat output of my GPUs to heat our apartment. As I saw it in "Task Manager", the CPU usage of the spawned tasks drops when I start the second task (my CPU doesn't have that many threads). Is there a way to control the number of spawned threads?
>
> there is no reason to do this anymore.

Could the GPU usage be increased somehow?

> it's now capped at 4x CPU threads and hard coded in the run.py script. but that is in addition to the 32 threads for the agents.

I confirm that. I looked into that script, though I'm not very familiar with Python. I've even tried to modify num_env_processes in conf.yaml, but that file gets overwritten every time I restart the task, even though I removed the rights of the boinc user and the boinc group to write to it. :)

> if you want to run python tasks, you need to account for this and just tell BOINC to reserve some extra CPU resources by setting a larger value for the cpu_usage in app_config. i use values between 8-10. but you can experiment with what you are happy with. on my python dedicated system, I stop all other CPU projects as that gives the best performance.

That's clear, I did that.
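
A minimal app_config.xml along the lines of the advice quoted above might look like the sketch below. The PythonGPU app name and element layout match the configs posted later in this thread; the cpu_usage value of 8 is simply the low end of the suggested 8-10 range, and gpu_usage 1.0 assumes one Python task per GPU.

```xml
<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <!-- reserve 8 CPU threads per Python task so BOINC does not overcommit the CPU -->
      <cpu_usage>8</cpu_usage>
      <!-- one Python task per GPU -->
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
  </app>
</app_config>
```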

Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0
Good to see you Zoltan.

Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0
> Good to see Zoltan here again, welcome back!

Pop, there used to be two Program folders as I remember, Program and Program 32. Now there is a hidden Program System folder. Three in all.

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
> My reason to reduce their numbers is to run two tasks at the same time to increase GPU usage, because I need the full heat output of my GPUs to heat our apartment. As I saw it in "Task Manager", the CPU usage of the spawned tasks drops when I start the second task (my CPU doesn't have that many threads). Is there a way to control the number of spawned threads?
>
> there is no reason to do this anymore.

If you need the heat output of the GPU, then you need to run a different project. Or only run ACEMD3 tasks when they are available. You will not get it from the Python tasks in their current state.

You can increase the GPU use by adding more tasks concurrently. But not to the extent that you expect or need. I run 4x tasks on my A4000s but they still don't even have full utilization. Usually only like 40% and ~100W avg power draw. Two tasks aren't gonna cut it for increasing utilization by any substantial amount.
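
For illustration only (this exact file is not posted in the thread): running four Python tasks per GPU as described above would typically be done by setting gpu_usage to 0.25 in app_config.xml. The cpu_usage value below is a placeholder; CPU, RAM and VRAM headroom still limit how many tasks are practical.

```xml
<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <!-- 0.25 GPU per task lets the BOINC scheduler run four tasks on each GPU -->
      <gpu_usage>0.25</gpu_usage>
      <!-- placeholder CPU reservation per task; adjust to your own system -->
      <cpu_usage>4</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```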

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
> Good to see you Zoltan.

+1

Joined: 8 Aug 19 · Posts: 252 · Credit: 458,054,251 · RAC: 0
> ...I need the full heat output of my GPUs to heat our apartment...

It's been a bit chilly in my basement "computer lab/mancave" running these this winter, but I'm saving power ($) so I'm bearing it. I just hope they last into summer so I can stay cool here in the humid Mississippi river valley of Illinois. I've had some success running Einstein GPU tasks concurrently with Pythons and saw full GPU usage, although there is of course a longer completion time for both tasks.

Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
> If you need the heat output of the GPU, then you need to run a different project.

I came to that conclusion, again.

> Or only run ACEMD3 tasks when they are available.

I caught 2 or 3, that's why I put 3 hosts back to GPUGrid.

> You will not get it [the full GPU heat output] from the Python tasks in their current state.

That's regrettable, but it could be OK for me this spring.

My main issue with the Python app is that I think there's no point in running that many spawned (training) threads, as their total (combined) memory access operations cause a massive amount of CPU L3 cache misses, hindering each other's performance. Before I put my i9-12900F host back to GPUGrid, I ran 7 TN-Grid tasks + 1 FAH GPU task simultaneously on that host; the average processing time was 4080-4200 sec for the TN-Grid tasks. Now I run 1 GPUGrid task + 1 TN-Grid task simultaneously, and the processing time of the TN-Grid task went up to 4660-4770 sec. Compared to the 6 other TN-Grid tasks plus a FAH task, the GPUGrid Python task causes a 14% performance loss. You can see the change in processing times for yourself here. If I run only 1 TN-Grid task (no GPU tasks) on that host, the processing time is 3800 seconds. Compared to that, running a GPUGrid Python task causes a 22% performance loss. Perhaps this app should do a short benchmark of the CPU it's actually running on to establish the ideal number of training threads, or give advanced users like me :) some control over that number so we can do that benchmarking on our respective systems.

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
I don't think you understand what the intention of the researcher is here. he wants 32 agents and the whole experiment is designed around 32 agents. and agent training happens on the CPU, so each agent needs its own process. you can't just arbitrarily reduce this number without the researcher making the change for everyone. it would fundamentally change the research. you could only reduce the number of agents with a new/different experiment. or make MASSIVE changes to the code to push it all into the GPU, but likely most GPUs wouldn't have enough VRAM to run it and everyone would be complaining about that instead.

Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Hello everyone, this is exactly correct: agents collect data from their interaction with the environment (running on the CPU), and the data is subsequently used to update the neural network that controls action selection (on the GPU). Having multiple agents allows data to be collected in parallel, speeding up training.

Joined: 6 Mar 18 · Posts: 38 · Credit: 1,340,042,080 · RAC: 27
I think I am going a bit mad. I set the app_config file to use 0.33 GPU to try to get more units running at the same time, then remembered 2 is the max. However, with this config, running 2 seemed to go faster: units completed 25% in about 3 hours, and normally I think the units take a lot longer than this. I will need to take a week or so to double-check this, though. What's the optimal config at the moment? This is my current one:

```xml
<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <cpu_usage>8</cpu_usage>
      <gpu_usage>0.5</gpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

Joined: 8 Aug 19 · Posts: 252 · Credit: 458,054,251 · RAC: 0
Ryan, here's what works for me:

```xml
<app_config>
  <app>
    <name>PythonGPU</name>
    <max_concurrent>1</max_concurrent>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemd3</name>
    <max_concurrent>2</max_concurrent>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
  <project_max_concurrent>2</project_max_concurrent>
  <report_results_immediately/>
</app_config>
```

You can change the numbers whenever ACEMDs are available and allow them to run concurrently with a Python. You will need to adjust the CPU figures to match your present app_config. (Many thanks, Richard Hazelgrove, for helping me upthread.)

Joined: 6 Mar 18 · Posts: 38 · Credit: 1,340,042,080 · RAC: 27
Thanks. Is 1 CPU per Python unit enough? What times are you getting per unit? When I run 8 threads per unit and other tasks on the spare threads, my CPU is always running at 100%.

Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0
It is not about how many threads your machine has, it is about how many tasks you can run alongside a Python. I have a six-core, twelve-thread CPU but can only run three Einstein WUs alongside it, and my CPU peaks at 82%. A fine balancing act is required, and sometimes a GPUGrid WU arrives and I have to suspend other work. I have also reached the limit of my 16 GB RAM sometimes (other times not). These AI WUs seem to be outdoing us. Monitoring is also required. Pop will explain.

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 731
Anybody else getting sent Python tasks for the old 1121 app? I have been using the newer 1131 app and it has worked fine on all tasks. I don't even have the old 1121 app anymore, since I did a project reset to use the new Python job file for reduced CPU usage. The 1121 app tasks are instantly erroring out.

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
> Anybody else getting sent Python tasks for the old 1121 app?

Not so far.

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 731
Based on the number of _x reissues of these tasks and everyone else erroring out, it must be a scheduler issue.

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
I've received some of them so far. they fail within like 10 seconds. looks like someone at the project put the old v4.01 linux app up. these seem incompatible with the new experiment. I'm guessing someone enabled that application by accident. abouh, you probably need to pull this app version back down to prevent it from being sent out, and leave the working v4.03 up.