Message boards : News : Experimental Python tasks (beta) - task description

Joined: 8 Aug 19 · Posts: 252 · Credit: 458,054,251 · RAC: 0
Good to see Zoltan here again, welcome back!

~~~~~~~~~~~~

I need to correct what I reported on the program data folder to KAMasud earlier. The folder is not hidden (as Erich56 noted) but is a system folder, so in Windows I've had to enable access to system files and folders on a new install in order to see it. Just in case you're still having trouble.

Joined: 8 Aug 19 · Posts: 252 · Credit: 458,054,251 · RAC: 0
> they already fixed the overused CPU issue. it's now capped at 4x CPU threads and hard coded in the run.py script. but that is in addition to the 32 threads for the agents. there is no way to reduce that unless abouh wanted to use fewer agents, but i don't think he does at this time.

I am enjoying watching abouh gain prowess at scripting with each run, using less and less resources as they evolve. Real progress. Godspeed to abouh and crew.

Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
> My reason to reduce their numbers is to run two tasks at the same time to increase GPU usage, because I need the full heat output of my GPUs to heat our apartment. As I saw it in "Task Manager", the CPU usage of the spawned tasks drops when I start the second task (my CPU doesn't have that many threads). Is there a way to control the number of spawned threads?
>
> there is no reason to do this anymore.

Could the GPU usage be increased somehow?

> it's now capped at 4x CPU threads and hard coded in the run.py script. but that is in addition to the 32 threads for the agents.

I confirm that. I looked into that script, though I'm not very familiar with Python. I've even tried to modify num_env_processes in conf.yaml, but that file gets overwritten every time I restart the task, even though I removed the rights of the boinc user and the boinc group to write to it. :)

> if you want to run python tasks, you need to account for this and just tell BOINC to reserve some extra CPU resources by setting a larger value for the cpu_usage in app_config. i use values between 8-10. but you can experiment with what you are happy with. on my python dedicated system, I stop all other CPU projects as that gives the best performance.

That's clear, I did that.
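
A minimal app_config.xml along the lines of the advice quoted above might look like the sketch below. The PythonGPU app name and element layout match the configs posted later in this thread; the cpu_usage value of 8 is simply the low end of the suggested 8-10 range, and gpu_usage 1.0 assumes one Python task per GPU.

```xml
<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <!-- reserve 8 CPU threads per Python task so BOINC does not overcommit the CPU -->
      <cpu_usage>8</cpu_usage>
      <!-- one Python task per GPU -->
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
  </app>
</app_config>
```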

Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0
Good to see you Zoltan.

Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0
> Good to see Zoltan here again, welcome back!

Pop, there used to be two Program folders as I remember, Program and Program 32. Now there is a hidden Program System folder. Three in all.

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
> My reason to reduce their numbers is to run two tasks at the same time to increase GPU usage, because I need the full heat output of my GPUs to heat our apartment. As I saw it in "Task Manager", the CPU usage of the spawned tasks drops when I start the second task (my CPU doesn't have that many threads). Is there a way to control the number of spawned threads?
>
> there is no reason to do this anymore.

If you need the heat output of the GPU, then you need to run a different project. Or only run ACEMD3 tasks when they are available. You will not get it from the Python tasks in their current state.

You can increase the GPU use by adding more tasks concurrently. But not to the extent that you expect or need. I run 4x tasks on my A4000s but they still don't even have full utilization. Usually only like 40% and ~100W avg power draw. Two tasks aren't gonna cut it for increasing utilization by any substantial amount.
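
For illustration only (this exact file is not posted in the thread): running four Python tasks per GPU as described above would typically be done by setting gpu_usage to 0.25 in app_config.xml. The cpu_usage value below is a placeholder; CPU, RAM and VRAM headroom still limit how many tasks are practical.

```xml
<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <!-- 0.25 GPU per task lets the BOINC scheduler run four tasks on each GPU -->
      <gpu_usage>0.25</gpu_usage>
      <!-- placeholder CPU reservation per task; adjust to your own system -->
      <cpu_usage>4</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```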

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
> Good to see you Zoltan.

+1

Joined: 8 Aug 19 · Posts: 252 · Credit: 458,054,251 · RAC: 0
> ...I need the full heat output of my GPUs to heat our apartment...

It's been a bit chilly in my basement "computer lab/mancave" running these this winter, but I'm saving power ($) so I'm bearing it. I just hope they last into summer so I can stay cool here in the humid Mississippi river valley of Illinois. I've had some success running Einstein GPU tasks concurrently with Pythons and saw full GPU usage, although there is of course a longer completion time for both tasks.

Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
> If you need the heat output of the GPU, then you need to run a different project.

I came to that conclusion, again.

> Or only run ACEMD3 tasks when they are available.

I caught 2 or 3, that's why I put 3 hosts back to GPUGrid.

> You will not get it [the full GPU heat output] from the Python tasks in their current state.

That's regrettable, but it could be OK for me this spring.

My main issue with the Python app is that I think there's no point in running that many spawned (training) threads, as their total (combined) memory access operations cause a massive amount of CPU L3 cache misses, hindering each other's performance. Before I put my i9-12900F host back to GPUGrid, I ran 7 TN-Grid tasks + 1 FAH GPU task simultaneously on that host; the average processing time was 4080-4200 sec for the TN-Grid tasks. Now I run 1 GPUGrid task + 1 TN-Grid task simultaneously, and the processing time of the TN-Grid task went up to 4660-4770 sec. Compared to the 6 other TN-Grid tasks plus a FAH task, the GPUGrid Python task causes a 14% performance loss. You can see the change in processing times for yourself here. If I run only 1 TN-Grid task (no GPU tasks) on that host, the processing time is 3800 seconds. Compared to that, running a GPUGrid Python task causes a 22% performance loss. Perhaps this app should do a short benchmark of the CPU it's actually running on to establish the ideal number of training threads, or give advanced users like me :) some control over that number so we can do that benchmarking on our respective systems.

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
I don't think you understand what the intention of the researcher is here. he wants 32 agents and the whole experiment is designed around 32 agents. and agent training happens on the CPU, so each agent needs its own process. you can't just arbitrarily reduce this number without the researcher making the change for everyone. it would fundamentally change the research. you could only reduce the number of agents with a new/different experiment. or make MASSIVE changes to the code to push it all into the GPU, but likely most GPUs wouldn't have enough VRAM to run it and everyone would be complaining about that instead.

Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Hello everyone, this is exactly correct: agents collect data from their interaction with the environment (running on the CPU), and the data is subsequently used to update the neural network that controls action selection (on the GPU). Having multiple agents allows data to be collected in parallel, speeding up training.

Joined: 6 Mar 18 · Posts: 38 · Credit: 1,340,042,080 · RAC: 27
I think I am going a bit mad. I set the app_config file to use 0.33 GPU to try to get more units running at the same time, then remembered 2 is the max. However, with this config, running 2 seemed to go faster: units completed 25% in about 3 hours, and normally I think the units take a lot longer than this. I will need to take a week or so to double-check this, though. What's the optimal config at the moment? This is my current one:

```xml
<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <cpu_usage>8</cpu_usage>
      <gpu_usage>0.5</gpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

Joined: 8 Aug 19 · Posts: 252 · Credit: 458,054,251 · RAC: 0
Ryan, here's what works for me:

```xml
<app_config>
  <app>
    <name>PythonGPU</name>
    <max_concurrent>1</max_concurrent>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemd3</name>
    <max_concurrent>2</max_concurrent>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
  <project_max_concurrent>2</project_max_concurrent>
  <report_results_immediately/>
</app_config>
```

You can change the numbers whenever ACEMDs are available and allow them to run concurrently with a Python. You will need to adjust the CPU figures to match your present app_config. (Many thanks, Richard Hazelgrove, for helping me upthread.)

Joined: 6 Mar 18 · Posts: 38 · Credit: 1,340,042,080 · RAC: 27
Thanks. Is 1 CPU per Python unit enough? What times are you getting per unit? When I run 8 threads per unit and other tasks on the spare threads, my CPU is always running at 100%.

Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0
It is not about how many threads your machine has, it is about how many tasks you can run alongside a Python. I have a six-core, twelve-thread CPU but can only run three Einstein WUs alongside it, and my CPU peaks at 82%. A fine balancing act is required, and sometimes a GPUGrid WU arrives and I have to suspend other work. I have also reached the limit of my 16 GB RAM sometimes (other times not). These AI WUs seem to be outdoing us. Monitoring is also required. Pop will explain.

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 731
Anybody else getting sent Python tasks for the old 1121 app? I have been using the newer 1131 app and it has worked fine on all tasks. I don't even have the old 1121 app anymore, since I did a project reset to use the new Python job file for reduced CPU usage. The 1121 app tasks are instantly erroring out.

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
> Anybody else getting sent Python tasks for the old 1121 app?

Not so far.

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 731
Based on the number of _x reissues of these tasks and everyone else erroring out, it must be a scheduler issue.

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 5,269
I've received some of them so far. they fail within like 10 seconds. looks like someone at the project put the old v4.01 linux app up. these seem incompatible with the new experiment. I'm guessing someone enabled that application by accident. abouh, you probably need to pull this app version back down to prevent it from being sent out, and leave the working v4.03 up.