Experimental Python tasks (beta) - task description

Message boards : News : Experimental Python tasks (beta) - task description

Previous · 1 . . . 36 · 37 · 38 · 39 · 40 · 41 · 42 . . . 50 · Next

kotenok2000

Send message
Joined: 18 Jul 13
Posts: 79
Credit: 218,778,292
RAC: 12,880
Level
Leu
Scientific publications
wat
Message 59579 - Posted: 12 Nov 2022, 1:12:40 UTC - in response to Message 59578.  
Last modified: 12 Nov 2022, 1:12:54 UTC

The file name is conf.yaml, and the parameters are
start_env_steps and target_env_steps.
ID: 59579
Erich56

Send message
Joined: 1 Jan 15
Posts: 1168
Credit: 12,317,898,501
RAC: 91,654
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59580 - Posted: 12 Nov 2022, 6:37:03 UTC - in response to Message 59579.  
Last modified: 12 Nov 2022, 6:59:46 UTC

The file name is conf.yaml, and the parameters are
start_env_steps and target_env_steps.

I had already aborted the task mentioned above by the time I read your posting.

But I looked up the figures in a task which is in progress right now. It says:

start_env_steps: 25000000
sticky_actions: true
target_env_steps: 50000000

So what exactly do these figures mean: in this case, that about half of the task has been processed?
ID: 59580
kotenok2000

Send message
Joined: 18 Jul 13
Posts: 79
Credit: 218,778,292
RAC: 12,880
Level
Leu
Scientific publications
wat
Message 59581 - Posted: 12 Nov 2022, 11:31:19 UTC - in response to Message 59580.  

I think it means that previous crunchers have already crunched up to 25000000 steps and your workunit will continue to 50000000.
ID: 59581
abouh

Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59582 - Posted: 12 Nov 2022, 17:11:52 UTC - in response to Message 59581.  
Last modified: 12 Nov 2022, 17:15:31 UTC

Yes, this is exactly what it means. Most parameters in the config file define the specifics of the agent training process.

In this case, these parameters specify that the initial AI agent will be loaded from a previous agent that has already taken 25,000,000 steps in its simulated environment, so it is not taking completely random actions. The agent will continue the process, interacting 25,000,000 more times with the environment and learning from its successes and failures.

Other parameters specify the type of algorithm used for learning, the number of copies of the environment used to speed up the interactions (32), and many other things.
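Read this way, the progress question above has a simple answer. A minimal sketch (not GPUGRID's actual code; the only field names assumed are the start_env_steps and target_env_steps values quoted from the conf.yaml snippet earlier in the thread):

```python
# Toy sketch: how far along the overall training is when a workunit
# starts, based on the two conf.yaml fields quoted in the thread.
# This is an illustration, not the project's own bookkeeping code.

def workunit_progress(start_env_steps: int, target_env_steps: int) -> float:
    """Fraction of the total step budget already covered by previous
    crunchers before this workunit begins."""
    return start_env_steps / target_env_steps

# Values from the conf.yaml snippet in the thread:
done_before = workunit_progress(25_000_000, 50_000_000)
print(f"{done_before:.0%} of the steps were done by previous crunchers")

# This workunit then runs the remaining steps itself:
remaining = 50_000_000 - 25_000_000
print(f"{remaining:,} steps remain for this workunit")
```

So for the task quoted above, the first half of the training was done by earlier hosts, and the workunit in progress carries it from step 25,000,000 to 50,000,000.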
ID: 59582
Erich56

Send message
Joined: 1 Jan 15
Posts: 1168
Credit: 12,317,898,501
RAC: 91,654
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59583 - Posted: 15 Nov 2022, 20:17:08 UTC

What I have noticed within the past few days is that the runtime of the Python tasks has increased.
Whereas until a short time ago some tasks on each of my hosts finished in under 24 hrs, now every task takes > 24 hrs.
ID: 59583
kotenok2000

Send message
Joined: 18 Jul 13
Posts: 79
Credit: 218,778,292
RAC: 12,880
Level
Leu
Scientific publications
wat
Message 59584 - Posted: 15 Nov 2022, 23:56:22 UTC - in response to Message 59583.  

Try reducing the number of simultaneously running workunits.
ID: 59584
Keith Myers
Avatar

Send message
Joined: 13 Dec 17
Posts: 1424
Credit: 9,189,946,190
RAC: 42,316
Level
Tyr
Scientific publications
watwatwatwatwat
Message 59585 - Posted: 16 Nov 2022, 2:48:12 UTC

I've rarely had a short runner in recent weeks. Now almost all tasks take more than 24 hours.

Usually missing by just a few minutes, which is disheartening.

But I won't be reducing the compute load, since I only run a single Python task on each host along with work from multiple other projects.

I just accept the lesser credit while still maintaining a full load of my other projects, which aren't impacted too much by the single Python task.
ID: 59585
KAMasud

Send message
Joined: 27 Jul 11
Posts: 138
Credit: 539,953,398
RAC: 0
Level
Lys
Scientific publications
watwat
Message 59586 - Posted: 16 Nov 2022, 7:16:47 UTC

What I am noticing is that my two machines running no other project are completing the tasks which others have errored out on. I think Python loves to run free, without companions to keep it company.
ID: 59586
Erich56

Send message
Joined: 1 Jan 15
Posts: 1168
Credit: 12,317,898,501
RAC: 91,654
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59587 - Posted: 16 Nov 2022, 12:59:30 UTC - in response to Message 59586.  

What I am noticing is that my two machines running no other project are completing the tasks which others have errored out on. I think Python loves to run free, without companions to keep it company.

This is exactly my observation, too.
ID: 59587
Asghan

Send message
Joined: 30 Oct 19
Posts: 7
Credit: 405,900
RAC: 0
Level

Scientific publications
wat
Message 59589 - Posted: 22 Nov 2022, 9:35:28 UTC
Last modified: 22 Nov 2022, 9:36:04 UTC

The only thing I noticed:
The biggest lie for the new Python tasks is "0.9 CPU".
My current task, and the one before it, were/are using 20 of the 24 threads on my 5900X...

Please support the Tensor Cores as soon as possible, my 4090 is getting bored :/
ID: 59589
kotenok2000

Send message
Joined: 18 Jul 13
Posts: 79
Credit: 218,778,292
RAC: 12,880
Level
Leu
Scientific publications
wat
Message 59590 - Posted: 22 Nov 2022, 17:10:26 UTC - in response to Message 59589.  

Some errored tasks crash because someone was trying to run them on a GTX 680 with 2 GB of VRAM.
ID: 59590
KAMasud

Send message
Joined: 27 Jul 11
Posts: 138
Credit: 539,953,398
RAC: 0
Level
Lys
Scientific publications
watwat
Message 59591 - Posted: 23 Nov 2022, 8:08:36 UTC
Last modified: 23 Nov 2022, 8:28:39 UTC

Task 33145039, for example: seven computers have crashed on this work unit. Richard, or someone else who can read the files, can find out why.
ID: 59591
abouh

Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59592 - Posted: 23 Nov 2022, 9:42:09 UTC - in response to Message 59591.  

Hello! I just checked the failed submissions of this job, and in each case it failed for a different reason.

1. ERROR: Cannot set length for output file : There is not enough space on the disk
2. DefaultCPUAllocator: not enough memory (host RAM; DefaultCPUAllocator is PyTorch's CPU allocator)
3. RuntimeError: Unable to find a valid cuDNN algorithm to run convolution (GPU not supported by cuda?)
4. Failed to establish a new connection (connection failed to install the only pipeable dependency)
5. AssertionError. assert ports_found (some port configuration missing?)
6. BrokenPipeError: [WinError 232] The pipe is being closed (for some reason multiprocessing broke, I am guessing not enough memory since windows uses much more memory than linux when running multiprocessing)
7. lbzip2: Cannot exec: No such file or directory

It is quite unlikely for a job to fail 7 times, but each machine has a different configuration, and it is very difficult to cover all cases. That is why jobs are resubmitted multiple times after failure, to be fault tolerant.
ID: 59592
abouh

Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59593 - Posted: 23 Nov 2022, 10:05:58 UTC - in response to Message 59589.  
Last modified: 23 Nov 2022, 10:34:57 UTC

These tasks alternate between GPU usage and CPU usage; would it make such a big difference to use Tensor Cores for mixed precision? You would be trading precision for speed, but only speeding up the GPU phases.

I was looking at the PyTorch documentation (the Python package we use to train the AI agents, which supports using Tensor Cores for mixed precision) for automatic mixed precision, and it says:

(if) Your network may fail to saturate the GPU(s) with work, and is therefore CPU bound. Amp’s effect on GPU performance won’t matter.
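That quoted warning is essentially Amdahl's law. A toy calculation makes the point; the phase times below are made-up illustrative numbers, not measurements of the actual GPUGRID tasks:

```python
# Toy Amdahl's-law estimate of why mixed precision (Tensor Cores)
# may not help much for a task that alternates CPU and GPU phases:
# speeding up only the GPU phase is bounded by the CPU share.

def overall_speedup(cpu_time: float, gpu_time: float, gpu_speedup: float) -> float:
    """Whole-task speedup when only the GPU portion gets faster."""
    old_total = cpu_time + gpu_time
    new_total = cpu_time + gpu_time / gpu_speedup
    return old_total / new_total

# Suppose (hypothetically) 80% of the runtime is CPU-bound
# environment simulation, and Tensor Cores double GPU-phase speed:
print(round(overall_speedup(cpu_time=80, gpu_time=20, gpu_speedup=2.0), 3))

# Even an infinitely fast GPU phase caps the overall gain at
# 1 / 0.8 = 1.25x in this scenario:
print(overall_speedup(cpu_time=80, gpu_time=20, gpu_speedup=float("inf")))
```

With those assumed numbers, a 2x faster GPU phase yields only about an 11% overall speedup, which is why the CPU-bound caveat in the documentation matters here.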

ID: 59593
Ian&Steve C.

Send message
Joined: 21 Feb 20
Posts: 1116
Credit: 40,876,970,595
RAC: 9,834
Level
Trp
Scientific publications
wat
Message 59594 - Posted: 23 Nov 2022, 13:15:31 UTC - in response to Message 59593.  

You'd need to find a way to get the task fully loaded onto the GPU. The environment processing that you're doing on the CPU: can you do that same processing on the GPU? Probably.
ID: 59594
KAMasud

Send message
Joined: 27 Jul 11
Posts: 138
Credit: 539,953,398
RAC: 0
Level
Lys
Scientific publications
watwat
Message 59598 - Posted: 26 Nov 2022, 7:39:46 UTC - in response to Message 59593.  

These tasks alternate between GPU usage and CPU usage; would it make such a big difference to use Tensor Cores for mixed precision? You would be trading precision for speed, but only speeding up the GPU phases.

I was looking at the PyTorch documentation (the Python package we use to train the AI agents, which supports using Tensor Cores for mixed precision) for automatic mixed precision, and it says:

(if) Your network may fail to saturate the GPU(s) with work, and is therefore CPU bound. Amp’s effect on GPU performance won’t matter.

-----------------
Thank you.
ID: 59598
Pop Piasa
Avatar

Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 59627 - Posted: 21 Dec 2022, 3:43:17 UTC

I'm being curious here...

These Python apps don't seem to report their virtual memory usage accurately on my hosts. They show 7.4GB while my commit charge shows 52GB+ (with 16GB RAM).

They report more CPU time than the amount of time it actually took my hosts to finish them.

They're also causing the CPU usage to max out around 50% when there are no other CPU tasks running, no matter what my BOINC Manager CPU usage limit is.

Could anyone please explain this to a confused codger?
"Together we crunch
To check out a hunch
And wish all our credit
Could just buy us lunch"


Piasa Tribe - Illini Nation
ID: 59627
abouh

Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59639 - Posted: 22 Dec 2022, 7:50:45 UTC - in response to Message 59627.  
Last modified: 22 Dec 2022, 7:52:43 UTC

These tasks are a bit particular, because they use multiprocessing and also interleave stages of CPU utilisation with stages of GPU utilisation.

The multiprocessing nature of the tasks is responsible for the wrong CPU time (BOINC takes into account the time of all threads). That, together with the fact that the tasks use a Python library for machine learning called PyTorch, accounts for the large virtual memory (every worker process commits virtual memory when the package is imported, even though it is not all used later).
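The CPU-time effect can be sketched numerically. The worker count comes from the environment-copy figure mentioned earlier in the thread; the utilisation figure is an assumption for illustration, not a measurement:

```python
# Rough sketch of why reported CPU time can exceed wall-clock time
# for a multiprocess task: the accounting sums CPU time across all
# worker processes. Illustrative numbers only, not BOINC's actual
# bookkeeping code.

def reported_cpu_hours(n_workers: int, wall_hours: float, avg_utilisation: float) -> float:
    """CPU time summed over all workers, as a scheduler might report it."""
    return n_workers * wall_hours * avg_utilisation

# 32 environment workers (the copy count mentioned earlier in the
# thread), each assumed ~50% busy over a 24-hour run:
print(reported_cpu_hours(32, 24.0, 0.5))   # far more than 24 wall hours
```

Under those assumptions the task would report 384 CPU-hours against only 24 hours of wall time, which matches the pattern of "more CPU time than it actually took" described above.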

The switching between CPU and GPU phases could be what keeps the CPUs at around 50%.

Other hosts have found configurations that improve resource utilisation by running more than one task; some of these configurations are shared in this forum.
ID: 59639
gemini8
Avatar

Send message
Joined: 3 Jul 16
Posts: 31
Credit: 2,250,309,169
RAC: 50
Level
Phe
Scientific publications
watwat
Message 59640 - Posted: 22 Dec 2022, 8:14:50 UTC - in response to Message 59639.  

The multiprocessing nature of the tasks is responsible for the wrong CPU time (BOINC takes into account the time of all threads).

I don't think so.
The CPU time should be correct; it's just that the overall runtime is faulty.
You can easily see that if you compare the runtime to the send and receive times.
- - - - - - - - - -
Greetings, Jens
ID: 59640

©2026 Universitat Pompeu Fabra