Experimental Python tasks (beta) - task description


abouh

Message 58170 - Posted: 21 Dec 2021, 9:56:28 UTC - in response to Message 58168.  

Today I will send a couple of batches of short tasks for some final debugging of the scripts, and later I will send a big batch of debugged tasks.

abouh

Message 58171 - Posted: 21 Dec 2021, 9:57:51 UTC - in response to Message 58169.  

The idea is to make it work for Windows in the future as well, once it works smoothly on Linux.
Ian&Steve C.

Message 58172 - Posted: 21 Dec 2021, 15:44:20 UTC - in response to Message 58170.  

Thanks, it looks like they are small enough to fit on a 16 GB system now, using about 12 GB.
Ian&Steve C.

Message 58173 - Posted: 21 Dec 2021, 16:47:02 UTC - in response to Message 58172.  

Thanks, it looks like they are small enough to fit on a 16 GB system now, using about 12 GB.


Not sure what happened to it; take a look:

https://gpugrid.net/result.php?resultid=32731651
Keith Myers

Message 58174 - Posted: 21 Dec 2021, 17:16:54 UTC - in response to Message 58173.  

Looks like a needed package was not retrieved properly with a "deadline exceeded" error.
Ian&Steve C.

Message 58175 - Posted: 21 Dec 2021, 18:15:03 UTC - in response to Message 58174.  

Looks like a needed package was not retrieved properly with a "deadline exceeded" error.


It's interesting: looking at the stderr output, it appears that this app is communicating over the internet to send and receive data outside of BOINC, and to servers that don't belong to the project.

(I think the issue is that I was connected to my VPN checking something else, I left the connection active, and it might have had an issue reaching the site it was trying to access.)

Not sure how kosher that is. I don't think the BOINC devs intend or want this kind of behavior, and some people might have security concerns about the app doing these things outside of BOINC. It might be a little smoother to do all communication only between the host and the project, and only via the BOINC framework; if data needs to be uploaded elsewhere, it might be better for the project to do that on the backend.

Just my .02
Aurum

Message 58176 - Posted: 21 Dec 2021, 18:44:13 UTC - in response to Message 58161.  

1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running in GPUs with less memory should fail.


I'm getting CUDA out-of-memory failures, and all my cards have 10 to 12 GB of GDDR: 1080 Ti, 2080 Ti, 3080 Ti and 3080. There must be something else going on.

I've also stopped trying to time-slice with PythonGPU. It now gets a dedicated GPU, and I'm leaving 32 CPU threads open for it.

I keep looking for Pinocchio but have yet to see him. Where does it come from? Maybe I never got it.
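
For cards that size, one quick local check is whether some other process is already holding VRAM before the task starts. A minimal diagnostic sketch, assuming PyTorch 1.10+ (which added torch.cuda.mem_get_info); this is not part of the project's app:

```python
# Diagnostic sketch: report free vs. total VRAM per GPU.
# Assumes PyTorch >= 1.10 for torch.cuda.mem_get_info();
# not part of the GPUGRID task itself.
import torch

for dev in range(torch.cuda.device_count()):
    free_b, total_b = torch.cuda.mem_get_info(dev)
    name = torch.cuda.get_device_name(dev)
    print(f"GPU {dev} ({name}): {free_b / 2**30:.1f} GiB free "
          f"of {total_b / 2**30:.1f} GiB")
```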
Erich56

Message 58177 - Posted: 21 Dec 2021, 18:58:56 UTC - in response to Message 58171.  

The idea is to make it work for Windows in the future as well, once it works smoothly on Linux.

Okay, sounds good; thanks for the information.
Keith Myers

Message 58178 - Posted: 21 Dec 2021, 19:12:20 UTC

I'm running one of the new batch, and at first the task was only using 2.2 GB of GPU memory, but now it has gone back up to 6.6 GB of GPU memory.

Much like the previous ones. I thought the memory requirements were going to be cut in half.

It's consuming the same amount of system memory as before... maybe a couple of GB more, in fact. Up to 20 GB now.
Aurum

Message 58179 - Posted: 21 Dec 2021, 21:21:09 UTC

Just had one that's listed as "aborted by user." I didn't abort it.
https://www.gpugrid.net/result.php?resultid=32731704

It also says "Please update your install command." I've kept my computer updated; is this something I need to do?

And what's this? Something I need to do or not?
"FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`"
mmonnin

Message 58180 - Posted: 21 Dec 2021, 23:12:16 UTC

RuntimeError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 11.77 GiB total capacity; 3.05 GiB already allocated; 50.00 MiB free; 3.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):

I got that error on 4 tasks, each right around 55 minutes, on a 3080 Ti.

The same PC/GPU has completed Python tasks before: one earlier ran for 1900 seconds, and it is running one now at 9 hr. Utilization is around 2-3%, with 6.5 GB of memory shown in nvidia-smi (6.1 GB in BOINC).

The 3070 Ti has been running for 7:45 with 8% utilization and the same memory usage.
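
The allocator hint in that error text can be tried out locally; a minimal sketch, where the 128 MiB split size is an arbitrary assumption, not a tested recommendation (and volunteers can't patch the project's own scripts, so this only applies to local reproduction):

```python
# Sketch of the workaround the error message suggests: cap the caching
# allocator's split size to reduce fragmentation. The variable must be
# set before the first CUDA allocation; 128 MiB is an assumed example.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

# Subsequent CUDA allocations now use the capped split size.
x = torch.zeros(1024, 1024, device="cuda")
print(torch.cuda.memory_allocated() / 2**20, "MiB allocated")
```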
Keith Myers

Message 58181 - Posted: 22 Dec 2021, 1:34:01 UTC - in response to Message 58179.  

The Ray errors are normal and can be ignored.
I completed one of the new tasks successfully, the one I commented on before: 14 hours of compute time.

I had another one that completed successfully, but the stderr.txt was truncated and does not show the normal summary and BOINC finish statements. It feels similar to the truncation that Einstein stderr.txt outputs have.
Keith Myers

Message 58182 - Posted: 22 Dec 2021, 1:40:18 UTC - in response to Message 58176.  

1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running in GPUs with less memory should fail.


I'm getting CUDA out-of-memory failures, and all my cards have 10 to 12 GB of GDDR: 1080 Ti, 2080 Ti, 3080 Ti and 3080. There must be something else going on.

I've also stopped trying to time-slice with PythonGPU. It now gets a dedicated GPU, and I'm leaving 32 CPU threads open for it.

I keep looking for Pinocchio but have yet to see him. Where does it come from? Maybe I never got it.

I'm not doing anything at all to mitigate the Python-on-GPU tasks, other than running only one at a time. I've been successful in almost all cases, other than the very first trial ones in each evolution.
abouh

Message 58183 - Posted: 22 Dec 2021, 9:29:54 UTC - in response to Message 58178.  
Last modified: 22 Dec 2021, 9:30:08 UTC

What was halved was the amount of Agent training per task, and therefore the total amount of time required to complete it.

The GPU memory and system memory will remain the same in the next batches.
abouh

Message 58184 - Posted: 22 Dec 2021, 9:37:48 UTC - in response to Message 58175.  
Last modified: 22 Dec 2021, 9:43:47 UTC

During the task, the performance of the Agent is intermittently sent to https://wandb.ai/ to track how the agent is doing in the environment as training progresses. This helps immensely in understanding the behaviour of the agent and facilitates research, as it allows visualising the information in a structured way.

wandb provides a Python package, used extensively in machine learning research, which we import in our scripts for this purpose.
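
For readers unfamiliar with wandb, the logging pattern looks roughly like this; a sketch with a hypothetical project name and a fake reward function, not the actual GPUGRID training code:

```python
# Minimal sketch of wandb metric logging. The project name and the
# stand-in reward below are hypothetical, not the real GPUGRID scripts.
# Requires a wandb account/API key to actually send data.
import random
import wandb

def train_one_iteration(step: int) -> float:
    """Stand-in for a real RL update; returns a fake episode reward."""
    return random.random() + 0.01 * step

run = wandb.init(project="rl-training-demo")  # hypothetical project name
for step in range(100):
    # Each log call is sent to wandb.ai, where the curve can be viewed live.
    run.log({"episode_reward": train_one_iteration(step)})
run.finish()
```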
abouh

Message 58185 - Posted: 22 Dec 2021, 9:43:04 UTC - in response to Message 58176.  

Pinocchio probably only caused problems on a subset of hosts: it was due to one of the first test batches having a wrong conda environment requirements file, and it was a small batch.
Richard Haselgrove

Message 58186 - Posted: 22 Dec 2021, 10:07:45 UTC

My machines are probably just above the minimum spec for the current batches - 16 GB RAM, and 6 GB video RAM on a GTX 1660.

They've both completed and validated their first task, in around 10.5 / 11 hours.

But there's something odd about the result display in the task listing on this website: both the Run time and CPU time columns show the exact same value, and it's too large to be feasible. Task 32731629, for example, shows 926 minutes of run time, but only 626 minutes elapsed between issue and return.

Tasks currently running locally show CPU time so far about 50% above elapsed time, which is to be expected from the description of how these tasks are designed to run. I suspect that something is triggering an anti-cheat mechanism: a task specified to use a single CPU core couldn't possibly use the CPU for longer than the run time, could it? But if so, it seems odd to 'correct' the elapsed time rather than the CPU time.

I'll take a look at the sched_request file after the next one reports, to see if the 'correction' is being applied locally by the BOINC client, or on the server.
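
A rough way to pull those reported numbers out of the file; a sketch assuming the standard BOINC client's sched_request_*.xml tags (final_elapsed_time and final_cpu_time), with the file path supplied on the command line:

```python
# Sketch: extract the times the client reports for each finished result,
# assuming standard BOINC sched_request_*.xml tag names.
import re
import sys

text = open(sys.argv[1]).read()  # e.g. sched_request_www.gpugrid.net.xml
for tag in ("final_elapsed_time", "final_cpu_time"):
    for value in re.findall(rf"<{tag}>([\d.]+)</{tag}>", text):
        print(f"{tag}: {float(value) / 3600:.2f} h")
```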
mmonnin

Message 58187 - Posted: 22 Dec 2021, 11:25:13 UTC - in response to Message 58183.  

What was halved was the amount of Agent training per task, and therefore the total amount of time required to complete it.

The GPU memory and system memory will remain the same in the next batches.


Halved? I've got one at nearly 21.5 hours on a 3080 Ti and still going.
Richard Haselgrove

Message 58188 - Posted: 22 Dec 2021, 15:39:07 UTC

This shows the timing discrepancy, a few minutes before task 32731655 completed:

[screenshot of the task listing, with identical Run time and CPU time values]

The two valid tasks on host 508381 ran in sequence on the same GPU: there's no way they could have both finished within 24 hours if the displayed elapsed time was accurate.
Ian&Steve C.

Message 58189 - Posted: 22 Dec 2021, 15:47:48 UTC - in response to Message 58188.  

I still think the 5,000,000 GFLOPs count is far too low. These tasks run for 12-24 hours depending on the host (GPU speed does not seem to be a factor, since GPU utilization is so low; they are most likely CPU/memory bound), and there is a fair discrepancy in run time per task: I had a task run for 9 hours on my 3080 Ti, while another user reports 21+ hours on his 3080 Ti. I've also had several tasks get killed around 12 hours for exceeding the time limit, while others ran longer. Lots of inconsistencies here.

The low flops count is causing a lot of tasks to be killed prematurely by BOINC for exceeding the time limit when they would have completed eventually. The fact that they do not proceed past 10% completion until the end probably doesn't help.
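
For context, BOINC aborts a task once its elapsed time implies more work than the workunit's rsc_fpops_bound at the host's estimated speed. A back-of-envelope sketch; only the 5,000,000 GFLOPs figure comes from this thread, while the 10x bound factor and the host speed are assumptions:

```python
# Back-of-envelope estimate of when "exceeded time limit" fires.
# Only the 5,000,000 GFLOPs estimate comes from the thread; the 10x
# bound factor and 1.2 TFLOPS effective host speed are assumptions.
rsc_fpops_est = 5_000_000 * 1e9        # 5e15 floating-point ops per task
rsc_fpops_bound = 10 * rsc_fpops_est   # assumed bound = 10x the estimate
host_flops = 1.2e12                    # assumed effective speed estimate

est_runtime_h = rsc_fpops_est / host_flops / 3600
kill_after_h = rsc_fpops_bound / host_flops / 3600
print(f"estimated: {est_runtime_h:.1f} h, aborted after: {kill_after_h:.1f} h")
# -> estimated: 1.2 h, aborted after: 11.6 h, which is consistent with
#    tasks that really need 12-24 h being killed around the 12-hour mark.
```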