Message boards :
News :
Experimental Python tasks (beta) - task description

---
Joined: 18 Jul 13 · Posts: 79 · Credit: 218,778,292 · RAC: 12,880

The file name is conf.yaml; the parameters are start_env_steps and target_env_steps.

---
Joined: 1 Jan 15 · Posts: 1168 · Credit: 12,317,898,501 · RAC: 91,654

> File name is conf.yaml

I had already aborted the task mentioned above when I read your posting. But I looked up the figures in a task which is in progress right now. It says: `start_env_steps: 25000000`, `sticky_actions: true`, `target_env_steps: 50000000`. So what exactly do the figures mean? In this case, has about half of the task been processed?
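Laid out as it would appear in conf.yaml, the quoted fragment reads as follows (a reconstruction from the post above; surrounding keys are omitted):

```yaml
# Reconstructed fragment of conf.yaml as quoted above; other keys omitted.
start_env_steps: 25000000
sticky_actions: true
target_env_steps: 50000000
```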

---
Joined: 18 Jul 13 · Posts: 79 · Credit: 218,778,292 · RAC: 12,880

I think it means that previous crunchers have already crunched up to 25,000,000 steps and your workunit will continue to 50,000,000.

---
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0

Yes, this is exactly what it means. Most parameters in the config file define the specifics of the agent training process. In this case these parameters specify that the initial AI agent will be loaded from a previous agent that has already taken 25,000,000 steps in its simulated environment, so it is not taking completely random actions. The agent will continue the process, interacting 25,000,000 more times with the environment and learning from its successes and failures. Other parameters specify the type of algorithm used for learning, the number of copies of the environment used to speed up the interactions (32), and many other things.
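As a rough sketch of how those numbers relate (illustrative only: the field names follow conf.yaml, but the helper function and the batching arithmetic are assumptions, not GPUGRID's actual code):

```python
# Hypothetical sketch of how start_env_steps / target_env_steps drive a
# resumed training run. Field names follow conf.yaml; everything else is
# illustrative, not the project's actual implementation.
conf = {
    "start_env_steps": 25_000_000,   # steps already taken by previous crunchers
    "target_env_steps": 50_000_000,  # this workunit stops here
    "num_envs": 32,                  # parallel environment copies per interaction
}

def remaining_interactions(conf):
    """Batched environment interactions this workunit still has to run."""
    steps_to_go = conf["target_env_steps"] - conf["start_env_steps"]
    return steps_to_go // conf["num_envs"]

print(remaining_interactions(conf))  # → 781250 batched steps of 32 envs each
```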

---
Joined: 1 Jan 15 · Posts: 1168 · Credit: 12,317,898,501 · RAC: 91,654

What I have noticed within the past few days is that the runtime of the Python tasks has increased. Whereas until a short time ago some tasks on all of my hosts finished in under 24 hours, now every task takes more than 24 hours.

---
Joined: 18 Jul 13 · Posts: 79 · Credit: 218,778,292 · RAC: 12,880

Try reducing the number of simultaneously running workunits.

---
Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 42,316

I've rarely had a short runner in weeks; now almost all tasks take more than 24 hours, usually missing by just a few minutes, which is disheartening. But I won't be reducing the compute load, since I only run a single Python task on each host alongside work from multiple other projects. I just accept the lesser credit while still maintaining a full load of my other projects, which aren't impacted too much by the single Python task.

---
Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0

What I am noticing is that my two machines running no other project are completing the tasks which others have errored out on. I think Python loves to run free, without companions to keep it company.

---
Joined: 1 Jan 15 · Posts: 1168 · Credit: 12,317,898,501 · RAC: 91,654

> What I am noticing is that my two machines running no other project are completing the tasks which others have errored out on. I think Python loves to run free, without companions to keep it company.

This is exactly my observation, too.

---
Joined: 30 Oct 19 · Posts: 7 · Credit: 405,900 · RAC: 0

The only thing I noticed: the biggest lie in the new Python tasks is "0.9 CPU". My current task, and the one before it, were/are using 20 out of the 24 cores on my 5900X... Please support the Tensor Cores as soon as possible; my 4090 is getting bored :/

---
Joined: 18 Jul 13 · Posts: 79 · Credit: 218,778,292 · RAC: 12,880

Some errored tasks crash because someone was trying to run them on a GTX 680 with 2 GB of VRAM.

---
Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0

Task 33145039 is an example: seven computers have crashed this workunit. Richard, or someone else who can read the files, can find out why.

---
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0

Hello! I just checked the failed submissions of this job, and in each case it failed for a different reason:

1. ERROR: Cannot set length for output file: there is not enough space on the disk
2. DefaultCPUAllocator: not enough memory (GPU memory?)
3. RuntimeError: Unable to find a valid cuDNN algorithm to run convolution (GPU not supported by CUDA?)
4. Failed to establish a new connection (the connection failed to install the only pipeable dependency)
5. AssertionError: assert ports_found (some port configuration missing?)
6. BrokenPipeError: [WinError 232] The pipe is being closed (for some reason multiprocessing broke; I am guessing not enough memory, since Windows uses much more memory than Linux when running multiprocessing)
7. lbzip2: Cannot exec: No such file or directory

It is quite unlikely that a job fails 7 times, but since each machine has a different configuration, it is very difficult to cover all cases. That is the reason why jobs are resubmitted multiple times after failure: to be fault tolerant.
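The resubmission policy described above can be sketched like this (a toy model, not the actual BOINC/GPUGRID scheduler; all names are made up for illustration):

```python
# Toy model of fault tolerance through resubmission: each retry may land on
# a differently configured host, so host-specific failures get another chance.
# This is illustrative only, not the project's actual scheduling code.
def run_on_host(host_ok):
    """Stand-in for running the workunit on one host."""
    if not host_ok:
        raise RuntimeError("host-specific failure")
    return "result"

def submit_with_retries(hosts, max_retries=8):
    """Resubmit the job to successive hosts until one succeeds or we give up."""
    for attempt, host_ok in enumerate(hosts[:max_retries], start=1):
        try:
            return attempt, run_on_host(host_ok)
        except RuntimeError:
            continue  # resubmit to the next host
    return None

# Seven mismatched hosts fail before an eighth succeeds:
print(submit_with_retries([False] * 7 + [True]))  # → (8, 'result')
```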

---
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0

These tasks alternate between GPU usage and CPU usage; would it make such a big difference to use Tensor Cores for mixed precision? You would be trading precision for speed, but only speeding up the GPU phases. I was looking at the PyTorch documentation (the Python package we use to train the AI agents, which supports using Tensor Cores for mixed precision) on automatic mixed precision, and it says: "Your network may fail to saturate the GPU(s) with work, and is therefore CPU bound. Amp's effect on GPU performance won't matter."
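A back-of-the-envelope way to see the point from the PyTorch docs is Amdahl's law: speeding up only the GPU phases caps the overall gain. The 30% GPU fraction and 2x Tensor Core speedup below are illustrative assumptions, not measurements of these tasks.

```python
# Amdahl's law: only the GPU fraction of the runtime is accelerated by
# mixed precision, so CPU-bound phases dominate the overall speedup.
# The numbers used here are illustrative assumptions.
def overall_speedup(gpu_fraction, gpu_speedup):
    """Total speedup when only the GPU fraction of wall time is accelerated."""
    return 1.0 / ((1.0 - gpu_fraction) + gpu_fraction / gpu_speedup)

# If only 30% of wall time is GPU work, even a 2x faster GPU phase
# yields a modest overall gain:
print(round(overall_speedup(0.3, 2.0), 3))  # → 1.176
```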

---
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,876,970,595 · RAC: 9,834

You'd need to find a way to get the task loaded fully onto the GPU. The environment training that you're doing on the CPU: can you do that same processing on the GPU? Probably.

---
Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0

> These tasks alternate between GPU usage and CPU usage; would it make such a big difference to use Tensor Cores for mixed precision? You would be trading precision for speed, but only speeding up the GPU phases.

Thank you.

---
Joined: 8 Aug 19 · Posts: 252 · Credit: 458,054,251 · RAC: 0

I'm being curious here... These Python apps don't seem to report their virtual memory usage accurately on my hosts: they show 7.4 GB while my commit charge shows 52 GB+ (with 16 GB RAM). They report more CPU time than the amount of time it actually took my hosts to finish them. They're also causing the CPU usage to max out around 50% when there are no other CPU tasks running, no matter what my BOINC manager CPU usage limit is. Could anyone please explain this to a confused codger?

"Together we crunch / To check out a hunch / And wish all our credit / Could just buy us lunch" (Piasa Tribe - Illini Nation)

---
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0

These tasks are a bit particular, because they use multiprocessing and also interleave stages of CPU utilisation with stages of GPU utilisation. The multiprocessing nature of the tasks is responsible for the wrong CPU time (BOINC takes into account the time of all threads). That, together with the fact that the tasks use a Python library for machine learning called PyTorch, accounts for the large virtual memory (every thread commits virtual memory when the package is imported, even though it is not used later). The switching between CPU and GPU phases could be what is causing the CPUs to sit at 50%. Other hosts have found configurations that improve resource utilisation by running more than one task; some configurations are shared in this forum.
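As toy arithmetic for the CPU-time discrepancy: if the reported figure sums CPU time across all threads, a multithreaded task can easily report more CPU hours than its wall-clock runtime. The 20 threads and 40% per-thread utilisation below are illustrative assumptions loosely based on the reports in this thread, not measured values.

```python
# Toy arithmetic (illustrative assumptions): summing CPU time over all
# threads can exceed wall-clock runtime for a multithreaded task.
def reported_cpu_time(wall_hours, threads, utilisation_per_thread):
    """CPU hours reported if per-thread CPU time is summed."""
    return wall_hours * threads * utilisation_per_thread

# A 24-hour task keeping 20 cores ~40% busy reports far more CPU time
# than its actual runtime:
print(reported_cpu_time(24, 20, 0.4))  # → 192.0 CPU hours for a 24-hour task
```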

---
Joined: 3 Jul 16 · Posts: 31 · Credit: 2,250,309,169 · RAC: 50

> The multiprocessing nature of the tasks is responsible for the wrong CPU time (BOINC takes into account the time of all threads).

I don't think so. The CPU time should be correct; it's just that the overall runtime is faulty. You can easily see that if you compare the runtime to the send and receive times.

Greetings, Jens
©2026 Universitat Pompeu Fabra