Message boards : News : Experimental Python tasks (beta) - task description
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Today I will send a couple of batches with short tasks for some final debugging of the scripts, and then later I will send a big batch of debugged tasks.
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
The idea is to make it work on Windows in the future as well, once it works smoothly on Linux.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,722,595 · RAC: 4,266,994
Thanks, looks like they are small enough to fit on a 16GB system now, using about 12GB.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,722,595 · RAC: 4,266,994
> Thanks, looks like they are small enough to fit on a 16GB system now, using about 12GB.

Not sure what happened to this one, take a look: https://gpugrid.net/result.php?resultid=32731651
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
Looks like a needed package was not retrieved properly with a "deadline exceeded" error.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,722,595 · RAC: 4,266,994
> Looks like a needed package was not retrieved properly with a "deadline exceeded" error.

It's interesting, looking at the stderr output: it appears that this app is communicating over the internet to send and receive data outside of BOINC, and to servers that do not belong to the project. (I think the issue is that I was connected to my VPN checking something else, left the connection active, and it may have had trouble reaching the site the task was trying to access.)

Not sure how kosher that is. I think the BOINC devs don't intend or want this kind of behaviour, and some people might have security concerns about the app doing these things outside of BOINC. It might be a little smoother to do all communication only between the host and the project, and only via the BOINC framework; if data needs to be uploaded elsewhere, it might be better for the project to do that on the backend. Just my .02.
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 2
> 1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running in GPUs with less memory should fail.

I'm getting CUDA out of memory failures, and all my cards have 10 to 12 GB of GDDR: 1080 Ti, 2080 Ti, 3080 Ti and 3080. There must be something else going on.

I've also stopped trying to time-slice with PythonGPU. It now has a dedicated GPU, and I'm leaving 32 CPU threads open for it.

I keep looking for Pinocchio but have yet to see him. Where does it come from? Maybe I never got it.
Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 869
> The idea is to make it work on Windows in the future as well, once it works smoothly on Linux.

Okay, sounds good; thanks for the information.
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
I'm running one of the new batch, and at first the task was only using 2.2GB of GPU memory, but now it has climbed back up to 6.6GB of GPU memory, much like the previous ones. I thought the memory requirements were going to be cut in half. It's consuming the same amount of system memory as before . . . maybe a couple of GB more in fact, up to 20GB now.
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 2
Just had one that's listed as "aborted by user." I didn't abort it. https://www.gpugrid.net/result.php?resultid=32731704

It also says "Please update your install command." I've kept my computer updated. Is this something I need to do?

And what's this? Something I need to do or not?

"FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`"
Joined: 2 Jul 16 · Posts: 338 · Credit: 7,987,341,558 · RAC: 178,897
RuntimeError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 11.77 GiB total capacity; 3.05 GiB already allocated; 50.00 MiB free; 3.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF Traceback (most recent call last):

That error appeared on 4 tasks, each right around 55 minutes, on a 3080 Ti. The same PC/GPU has completed Python tasks before: one earlier ran for 1900 seconds, and it is running one now that is at 9 hours. Utilization is around 2-3% with 6.5GB of memory in nvidia-smi, 6.1GB in BOINC. A 3070 Ti has been running for 7:45 with 8% utilization and the same memory usage.
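For what it's worth, the workaround that the error text itself points at is an allocator setting applied through the PYTORCH_CUDA_ALLOC_CONF environment variable before the first CUDA allocation. A minimal sketch follows; it is illustrative only (volunteers can't change the project's app), and the 128 MiB split size is an arbitrary example value, not a project recommendation.

```python
# Sketch of the allocator setting suggested by the error message (illustrative only).
# "max_split_size_mb:128" is an example value, not a project recommendation.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # set the variable before PyTorch touches the GPU

if torch.cuda.is_available():
    x = torch.zeros(1024, 1024, device="cuda")  # first allocation uses the configured allocator
    print(torch.cuda.memory_summary(abbreviated=True))
```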
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
The Ray errors are normal and can be ignored.

I completed one of the new tasks successfully, the one I commented on before: 14 hours of compute time. I had another one that completed successfully, but the stderr.txt was truncated and does not show the normal summary and BOINC finish statements. Feels similar to the truncation that Einstein stderr.txt outputs have.
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
> 1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running in GPUs with less memory should fail.

I'm not doing anything at all to mitigate the Python-on-GPU tasks other than running only one at a time. I've been successful in almost all cases other than the very first trial ones in each evolution.
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
What was halved was the amount of Agent training per task, and therefore the total amount of time required to complete it. The GPU memory and system memory requirements will remain the same in the next batches.
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
During the task, the performance of the Agent is intermittently sent to https://wandb.ai/ to track how the agent is doing in the environment as training progresses. It helps immensely in understanding the behaviour of the agent and facilitates research, as it allows visualising the information in a structured way. wandb provides a Python package that is extensively used in machine learning research, and we import it in our scripts for this purpose.
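For anyone curious what that looks like in code, the usual wandb pattern is roughly the sketch below. The project name and metric are invented for illustration; they are not the names the GPUGRID scripts actually use.

```python
# Minimal illustration of periodic wandb logging from a training loop.
# Project and metric names are hypothetical, not the project's real ones.
import random
import wandb

run = wandb.init(project="example-agent-training", config={"lr": 3e-4})

for step in range(100):
    episode_reward = random.random()  # stand-in for the agent's real episode return
    wandb.log({"episode_reward": episode_reward}, step=step)

run.finish()
```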
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Pinocchio probably only caused problems on a subset of hosts, as it was due to one of the first test batches having a wrong conda environment requirements file. It was a small batch.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 295,172
My machines are probably just above the minimum spec for the current batches - 16 GB RAM, and 6 GB video RAM on a GTX 1660. They've both completed and validated their first task, in around 10.5 / 11 hours.

But there's something odd about the result display in the task listing on this website - both the Run time and CPU time columns show the exact same value, and it's too large to be feasible: task 32731629, for example, shows 926 minutes of run time, but only 626 minutes between issue and return. Tasks currently running locally show CPU time so far about 50% above elapsed time, which is to be expected from the description of how these tasks are designed to run.

I suspect that something is triggering an anti-cheat mechanism: a task specified to use a single CPU core couldn't possibly use the CPU for longer than the run time, could it? But if so, it seems odd to 'correct' the elapsed time rather than the CPU time. I'll take a look at the sched_request file after the next one reports, to see if the 'correction' is being applied locally by the BOINC client, or on the server.
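To make the suspicion concrete, here is a small illustration of the correction being hypothesised. This is not actual BOINC client or server code, just the arithmetic implied by the post.

```python
# Hypothetical illustration of the suspected sanity check (not real BOINC code):
# if a one-CPU task reports more CPU time than elapsed time, the displayed
# elapsed time appears to be overwritten with the CPU time.
def suspected_correction(elapsed_min: float, cpu_min: float, ncpus: int = 1) -> tuple[float, float]:
    if cpu_min > elapsed_min * ncpus:
        elapsed_min = cpu_min  # both columns then show the same, too-large value
    return elapsed_min, cpu_min

# Figures from the post: ~626 minutes between issue and return, 926 minutes of CPU time.
print(suspected_correction(626, 926))  # -> (926, 926), matching the task listing
```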
Joined: 2 Jul 16 · Posts: 338 · Credit: 7,987,341,558 · RAC: 178,897
> What was halved was the amount of Agent training per task, and therefore the total amount of time required to complete it.

Halved? I've got one at nearly 21.5 hours on a 3080 Ti and still going.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 295,172
This shows the timing discrepancy, a few minutes before task 32731655 completed:

[screenshot of the task listing illustrating the Run time / CPU time discrepancy]

The two valid tasks on host 508381 ran in sequence on the same GPU: there's no way they could have both finished within 24 hours if the displayed elapsed time was accurate.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,722,595 · RAC: 4,266,994
I still think the 5,000,000 GFLOPs count is far too low, since these run for 12-24 hrs depending on the host (GPU speed does not seem to be a factor, since GPU utilization is so low; they are most likely CPU/memory bound), and there seems to be a bit of a discrepancy in run time per task. I had a task run for 9 hrs on my 3080 Ti, while another user reports 21+ hrs on his 3080 Ti, and I've had several tasks get killed around 12 hrs for exceeding the time limit while others ran for longer. Lots of inconsistencies here.

The low FLOPs count is causing a lot of tasks to be killed prematurely by BOINC for exceeding the time limit when they would have completed eventually. The fact that they do not proceed past 10% completion until the end probably doesn't help.
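As a rough illustration of why a low estimate bites: the BOINC client aborts a task once its elapsed time exceeds the workunit's FLOPs bound divided by the speed it credits the app with. The multiplier and device speed below are guesses chosen for illustration, not values read from the project's workunits.

```python
# Back-of-the-envelope arithmetic for the "exceeded time limit" aborts.
# bound_multiplier and assumed_app_speed are illustrative guesses, not project values.
rsc_fpops_est = 5_000_000 * 1e9   # the 5,000,000 GFLOPs estimate mentioned above
bound_multiplier = 10             # assumed ratio of rsc_fpops_bound to rsc_fpops_est
assumed_app_speed = 1e12          # assumed effective speed: 1 TFLOPS

time_limit_s = rsc_fpops_est * bound_multiplier / assumed_app_speed
print(f"time limit ~ {time_limit_s / 3600:.1f} hours")  # ~13.9 hours with these guesses
```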