Message boards :
News :
Experimental Python tasks (beta) - task description

---
Joined: 18 Jul 13 · Posts: 79 · Credit: 218,778,292 · RAC: 12,880

The file name is conf.yaml; the parameters are start_env_steps and target_env_steps.

---
Joined: 1 Jan 15 · Posts: 1168 · Credit: 12,317,898,501 · RAC: 91,654

> File name is conf.yaml

I had already aborted the task mentioned above when I read your posting. But I looked up the figures in a task which is in progress right now. It says: `start_env_steps: 25000000`, `sticky_actions: true`, `target_env_steps: 50000000`. So what exactly do the figures mean? In this case, has about half of the task been processed?
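Laid out as it would appear in conf.yaml, the quoted fragment reads as follows (a reconstruction from the post above; surrounding keys are omitted):

```yaml
# Reconstructed fragment of conf.yaml as quoted above; other keys omitted.
start_env_steps: 25000000
sticky_actions: true
target_env_steps: 50000000
```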

---
Joined: 18 Jul 13 · Posts: 79 · Credit: 218,778,292 · RAC: 12,880

I think it means that previous crunchers have already crunched up to 25,000,000 steps and your workunit will continue to 50,000,000.

---
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0

Yes, this is exactly what it means. Most parameters in the config file define the specifics of the agent training process. In this case these parameters specify that the initial AI agent will be loaded from a previous agent that has already taken 25,000,000 steps in its simulated environment, so it is not taking completely random actions. The agent will continue the process, interacting 25,000,000 more times with the environment and learning from its successes and failures. Other parameters specify the type of algorithm used for learning, the number of copies of the environment used to speed up the interactions (32), and many other things.
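As a rough sketch of how those numbers relate (illustrative only: the field names follow conf.yaml, but the helper function and the batching arithmetic are assumptions, not GPUGRID's actual code):

```python
# Hypothetical sketch of how start_env_steps / target_env_steps drive a
# resumed training run. Field names follow conf.yaml; everything else is
# illustrative, not the project's actual implementation.
conf = {
    "start_env_steps": 25_000_000,   # steps already taken by previous crunchers
    "target_env_steps": 50_000_000,  # this workunit stops here
    "num_envs": 32,                  # parallel environment copies per interaction
}

def remaining_interactions(conf):
    """Batched environment interactions this workunit still has to run."""
    steps_to_go = conf["target_env_steps"] - conf["start_env_steps"]
    return steps_to_go // conf["num_envs"]

print(remaining_interactions(conf))  # → 781250 batched steps of 32 envs each
```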

---
Joined: 1 Jan 15 · Posts: 1168 · Credit: 12,317,898,501 · RAC: 91,654

What I have noticed within the past few days is that the runtime of the Python tasks has increased. Whereas until a short time ago some tasks on all of my hosts finished in under 24 hours, now every task takes more than 24 hours.

---
Joined: 18 Jul 13 · Posts: 79 · Credit: 218,778,292 · RAC: 12,880

Try reducing the number of simultaneously running workunits.

---
Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 42,316

I've rarely had a short runner in weeks; now almost all tasks take more than 24 hours, usually missing by just a few minutes, which is disheartening. But I won't be reducing the compute load, since I only run a single Python task on each host alongside work from multiple other projects. I just accept the lesser credit while still maintaining a full load of my other projects, which aren't impacted too much by the single Python task.

---
Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0

What I am noticing is that my two machines running no other project are completing the tasks which others have errored out on. I think Python loves to run free, without companions to keep it company.

---
Joined: 1 Jan 15 · Posts: 1168 · Credit: 12,317,898,501 · RAC: 91,654

> What I am noticing is that my two machines running no other project are completing the tasks which others have errored out on. I think Python loves to run free, without companions to keep it company.

This is exactly my observation, too.

---
Joined: 30 Oct 19 · Posts: 7 · Credit: 405,900 · RAC: 0

The only thing I noticed: the biggest lie in the new Python tasks is "0.9 CPU". My current task, and the one before it, were/are using 20 out of the 24 cores on my 5900X... Please support the Tensor Cores as soon as possible; my 4090 is getting bored :/

---
Joined: 18 Jul 13 · Posts: 79 · Credit: 218,778,292 · RAC: 12,880

Some errored tasks crash because someone was trying to run them on a GTX 680 with 2 GB of VRAM.

---
Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0

Task 33145039 is an example: seven computers have crashed this workunit. Richard, or someone else who can read the files, can find out why.

---
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0

Hello! I just checked the failed submissions of this job, and in each case it failed for a different reason:

1. ERROR: Cannot set length for output file: there is not enough space on the disk
2. DefaultCPUAllocator: not enough memory (GPU memory?)
3. RuntimeError: Unable to find a valid cuDNN algorithm to run convolution (GPU not supported by CUDA?)
4. Failed to establish a new connection (the connection failed to install the only pipeable dependency)
5. AssertionError: assert ports_found (some port configuration missing?)
6. BrokenPipeError: [WinError 232] The pipe is being closed (for some reason multiprocessing broke; I am guessing not enough memory, since Windows uses much more memory than Linux when running multiprocessing)
7. lbzip2: Cannot exec: No such file or directory

It is quite unlikely that a job fails 7 times, but since each machine has a different configuration, it is very difficult to cover all cases. That is the reason why jobs are resubmitted multiple times after failure: to be fault tolerant.
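The resubmission policy described above can be sketched like this (a toy model, not the actual BOINC/GPUGRID scheduler; all names are made up for illustration):

```python
# Toy model of fault tolerance through resubmission: each retry may land on
# a differently configured host, so host-specific failures get another chance.
# This is illustrative only, not the project's actual scheduling code.
def run_on_host(host_ok):
    """Stand-in for running the workunit on one host."""
    if not host_ok:
        raise RuntimeError("host-specific failure")
    return "result"

def submit_with_retries(hosts, max_retries=8):
    """Resubmit the job to successive hosts until one succeeds or we give up."""
    for attempt, host_ok in enumerate(hosts[:max_retries], start=1):
        try:
            return attempt, run_on_host(host_ok)
        except RuntimeError:
            continue  # resubmit to the next host
    return None

# Seven mismatched hosts fail before an eighth succeeds:
print(submit_with_retries([False] * 7 + [True]))  # → (8, 'result')
```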

---
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0

These tasks alternate between GPU usage and CPU usage; would it make such a big difference to use Tensor Cores for mixed precision? You would be trading precision for speed, but only speeding up the GPU phases. I was looking at the PyTorch documentation (the Python package we use to train the AI agents, which supports using Tensor Cores for mixed precision) on automatic mixed precision, and it says: "Your network may fail to saturate the GPU(s) with work, and is therefore CPU bound. Amp's effect on GPU performance won't matter."
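A back-of-the-envelope way to see the point from the PyTorch docs is Amdahl's law: speeding up only the GPU phases caps the overall gain. The 30% GPU fraction and 2x Tensor Core speedup below are illustrative assumptions, not measurements of these tasks.

```python
# Amdahl's law: only the GPU fraction of the runtime is accelerated by
# mixed precision, so CPU-bound phases dominate the overall speedup.
# The numbers used here are illustrative assumptions.
def overall_speedup(gpu_fraction, gpu_speedup):
    """Total speedup when only the GPU fraction of wall time is accelerated."""
    return 1.0 / ((1.0 - gpu_fraction) + gpu_fraction / gpu_speedup)

# If only 30% of wall time is GPU work, even a 2x faster GPU phase
# yields a modest overall gain:
print(round(overall_speedup(0.3, 2.0), 3))  # → 1.176
```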

---
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,876,970,595 · RAC: 9,834

You'd need to find a way to get the task loaded fully onto the GPU. The environment training that you're doing on the CPU: can you do that same processing on the GPU? Probably.

---
Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0

> These tasks alternate between GPU usage and CPU usage; would it make such a big difference to use Tensor Cores for mixed precision? You would be trading precision for speed, but only speeding up the GPU phases.

Thank you.

---
Joined: 8 Aug 19 · Posts: 252 · Credit: 458,054,251 · RAC: 0

I'm being curious here... These Python apps don't seem to report their virtual memory usage accurately on my hosts: they show 7.4 GB while my commit charge shows 52 GB+ (with 16 GB RAM). They report more CPU time than the amount of time it actually took my hosts to finish them. They're also causing the CPU usage to max out around 50% when there are no other CPU tasks running, no matter what my BOINC manager CPU usage limit is. Could anyone please explain this to a confused codger?

"Together we crunch / To check out a hunch / And wish all our credit / Could just buy us lunch" (Piasa Tribe - Illini Nation)

---
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0

These tasks are a bit particular, because they use multiprocessing and also interleave stages of CPU utilisation with stages of GPU utilisation. The multiprocessing nature of the tasks is responsible for the wrong CPU time (BOINC takes into account the time of all threads). That, together with the fact that the tasks use a Python library for machine learning called PyTorch, accounts for the large virtual memory (every thread commits virtual memory when the package is imported, even though it is not used later). The switching between CPU and GPU phases could be what is causing the CPUs to sit at 50%. Other hosts have found configurations that improve resource utilisation by running more than one task; some configurations are shared in this forum.
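As toy arithmetic for the CPU-time discrepancy: if the reported figure sums CPU time across all threads, a multithreaded task can easily report more CPU hours than its wall-clock runtime. The 20 threads and 40% per-thread utilisation below are illustrative assumptions loosely based on the reports in this thread, not measured values.

```python
# Toy arithmetic (illustrative assumptions): summing CPU time over all
# threads can exceed wall-clock runtime for a multithreaded task.
def reported_cpu_time(wall_hours, threads, utilisation_per_thread):
    """CPU hours reported if per-thread CPU time is summed."""
    return wall_hours * threads * utilisation_per_thread

# A 24-hour task keeping 20 cores ~40% busy reports far more CPU time
# than its actual runtime:
print(reported_cpu_time(24, 20, 0.4))  # → 192.0 CPU hours for a 24-hour task
```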

---
Joined: 3 Jul 16 · Posts: 31 · Credit: 2,250,309,169 · RAC: 50

> The multiprocessing nature of the tasks is responsible for the wrong CPU time (BOINC takes into account the time of all threads).

I don't think so. The CPU time should be correct; it's just that the overall runtime is faulty. You can easily see that if you compare the runtime to the send and receive times.

Greetings, Jens
©2026 Universitat Pompeu Fabra