Message boards : News : Experimental Python tasks (beta) - task description
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Today I will send a couple of batches with short tasks for some final debugging of the scripts, and then later I will send a big batch of debugged tasks.
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
The idea is to make it work on Windows in the future as well, once it works smoothly on Linux.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,722,595 · RAC: 4,266,994
Thanks, looks like they are small enough to fit on a 16GB system now, using about 12GB.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,722,595 · RAC: 4,266,994
> Thanks, looks like they are small enough to fit on a 16GB system now, using about 12GB.

Not sure what happened to this one, take a look: https://gpugrid.net/result.php?resultid=32731651
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
Looks like a needed package was not retrieved properly with a "deadline exceeded" error.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,722,595 · RAC: 4,266,994
> Looks like a needed package was not retrieved properly with a "deadline exceeded" error.

It's interesting, looking at the stderr output: it appears that this app is communicating over the internet to send and receive data outside of BOINC, and to servers that do not belong to the project. (I think the issue is that I was connected to my VPN checking something else, left the connection active, and it may have had trouble reaching the site the task was trying to access.)

Not sure how kosher that is. I think the BOINC devs don't intend or want this kind of behaviour, and some people might have security concerns about the app doing these things outside of BOINC. It might be a little smoother to do all communication only between the host and the project, and only via the BOINC framework; if data needs to be uploaded elsewhere, it might be better for the project to do that on the backend. Just my .02.
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 2
> 1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running in GPUs with less memory should fail.

I'm getting CUDA out of memory failures, and all my cards have 10 to 12 GB of GDDR: 1080 Ti, 2080 Ti, 3080 Ti and 3080. There must be something else going on.

I've also stopped trying to time-slice with PythonGPU. It now has a dedicated GPU, and I'm leaving 32 CPU threads open for it.

I keep looking for Pinocchio but have yet to see him. Where does it come from? Maybe I never got it.
Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 869
> The idea is to make it work on Windows in the future as well, once it works smoothly on Linux.

Okay, sounds good; thanks for the information.
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
I'm running one of the new batch, and at first the task was only using 2.2GB of GPU memory, but now it has climbed back up to 6.6GB of GPU memory, much like the previous ones. I thought the memory requirements were going to be cut in half. It's consuming the same amount of system memory as before . . . maybe a couple of GB more in fact, up to 20GB now.
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 2
Just had one that's listed as "aborted by user." I didn't abort it. https://www.gpugrid.net/result.php?resultid=32731704

It also says "Please update your install command." I've kept my computer updated. Is this something I need to do?

And what's this? Something I need to do or not?

"FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`"
Joined: 2 Jul 16 · Posts: 338 · Credit: 7,987,341,558 · RAC: 178,897
RuntimeError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 11.77 GiB total capacity; 3.05 GiB already allocated; 50.00 MiB free; 3.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF Traceback (most recent call last):

That error appeared on 4 tasks, each right around 55 minutes, on a 3080 Ti. The same PC/GPU has completed Python tasks before: one earlier ran for 1900 seconds, and it is running one now that is at 9 hours. Utilization is around 2-3% with 6.5GB of memory in nvidia-smi, 6.1GB in BOINC. A 3070 Ti has been running for 7:45 with 8% utilization and the same memory usage.
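For what it's worth, the workaround that the error text itself points at is an allocator setting applied through the PYTORCH_CUDA_ALLOC_CONF environment variable before the first CUDA allocation. A minimal sketch follows; it is illustrative only (volunteers can't change the project's app), and the 128 MiB split size is an arbitrary example value, not a project recommendation.

```python
# Sketch of the allocator setting suggested by the error message (illustrative only).
# "max_split_size_mb:128" is an example value, not a project recommendation.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # set the variable before PyTorch touches the GPU

if torch.cuda.is_available():
    x = torch.zeros(1024, 1024, device="cuda")  # first allocation uses the configured allocator
    print(torch.cuda.memory_summary(abbreviated=True))
```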
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
The Ray errors are normal and can be ignored.

I completed one of the new tasks successfully, the one I commented on before: 14 hours of compute time. I had another one that completed successfully, but the stderr.txt was truncated and does not show the normal summary and BOINC finish statements. Feels similar to the truncation that Einstein stderr.txt outputs have.
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
> 1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running in GPUs with less memory should fail.

I'm not doing anything at all to mitigate the Python-on-GPU tasks other than running only one at a time. I've been successful in almost all cases other than the very first trial ones in each evolution.
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
What was halved was the amount of Agent training per task, and therefore the total amount of time required to complete it. The GPU memory and system memory requirements will remain the same in the next batches.
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
During the task, the performance of the Agent is intermittently sent to https://wandb.ai/ to track how the agent is doing in the environment as training progresses. It helps immensely in understanding the behaviour of the agent and facilitates research, as it allows visualising the information in a structured way. wandb provides a Python package that is extensively used in machine learning research, and we import it in our scripts for this purpose.
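For anyone curious what that looks like in code, the usual wandb pattern is roughly the sketch below. The project name and metric are invented for illustration; they are not the names the GPUGRID scripts actually use.

```python
# Minimal illustration of periodic wandb logging from a training loop.
# Project and metric names are hypothetical, not the project's real ones.
import random
import wandb

run = wandb.init(project="example-agent-training", config={"lr": 3e-4})

for step in range(100):
    episode_reward = random.random()  # stand-in for the agent's real episode return
    wandb.log({"episode_reward": episode_reward}, step=step)

run.finish()
```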
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Pinocchio probably only caused problems on a subset of hosts, as it was due to one of the first test batches having a wrong conda environment requirements file. It was a small batch.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 295,172
My machines are probably just above the minimum spec for the current batches - 16 GB RAM, and 6 GB video RAM on a GTX 1660. They've both completed and validated their first task, in around 10.5 / 11 hours.

But there's something odd about the result display in the task listing on this website - both the Run time and CPU time columns show the exact same value, and it's too large to be feasible: task 32731629, for example, shows 926 minutes of run time, but only 626 minutes between issue and return. Tasks currently running locally show CPU time so far about 50% above elapsed time, which is to be expected from the description of how these tasks are designed to run.

I suspect that something is triggering an anti-cheat mechanism: a task specified to use a single CPU core couldn't possibly use the CPU for longer than the run time, could it? But if so, it seems odd to 'correct' the elapsed time rather than the CPU time. I'll take a look at the sched_request file after the next one reports, to see if the 'correction' is being applied locally by the BOINC client, or on the server.
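To make the suspicion concrete, here is a small illustration of the correction being hypothesised. This is not actual BOINC client or server code, just the arithmetic implied by the post.

```python
# Hypothetical illustration of the suspected sanity check (not real BOINC code):
# if a one-CPU task reports more CPU time than elapsed time, the displayed
# elapsed time appears to be overwritten with the CPU time.
def suspected_correction(elapsed_min: float, cpu_min: float, ncpus: int = 1) -> tuple[float, float]:
    if cpu_min > elapsed_min * ncpus:
        elapsed_min = cpu_min  # both columns then show the same, too-large value
    return elapsed_min, cpu_min

# Figures from the post: ~626 minutes between issue and return, 926 minutes of CPU time.
print(suspected_correction(626, 926))  # -> (926, 926), matching the task listing
```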
Joined: 2 Jul 16 · Posts: 338 · Credit: 7,987,341,558 · RAC: 178,897
> What was halved was the amount of Agent training per task, and therefore the total amount of time required to complete it.

Halved? I've got one at nearly 21.5 hours on a 3080 Ti and still going.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 295,172
This shows the timing discrepancy, a few minutes before task 32731655 completed:

[screenshot of the task listing illustrating the Run time / CPU time discrepancy]

The two valid tasks on host 508381 ran in sequence on the same GPU: there's no way they could have both finished within 24 hours if the displayed elapsed time was accurate.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,722,595 · RAC: 4,266,994
I still think the 5,000,000 GFLOPs count is far too low, since these run for 12-24 hrs depending on the host (GPU speed does not seem to be a factor, since GPU utilization is so low; they are most likely CPU/memory bound), and there seems to be a bit of a discrepancy in run time per task. I had a task run for 9 hrs on my 3080 Ti, while another user reports 21+ hrs on his 3080 Ti, and I've had several tasks get killed around 12 hrs for exceeding the time limit while others ran for longer. Lots of inconsistencies here.

The low FLOPs count is causing a lot of tasks to be killed prematurely by BOINC for exceeding the time limit when they would have completed eventually. The fact that they do not proceed past 10% completion until the end probably doesn't help.
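As a rough illustration of why a low estimate bites: the BOINC client aborts a task once its elapsed time exceeds the workunit's FLOPs bound divided by the speed it credits the app with. The multiplier and device speed below are guesses chosen for illustration, not values read from the project's workunits.

```python
# Back-of-the-envelope arithmetic for the "exceeded time limit" aborts.
# bound_multiplier and assumed_app_speed are illustrative guesses, not project values.
rsc_fpops_est = 5_000_000 * 1e9   # the 5,000,000 GFLOPs estimate mentioned above
bound_multiplier = 10             # assumed ratio of rsc_fpops_bound to rsc_fpops_est
assumed_app_speed = 1e12          # assumed effective speed: 1 TFLOPS

time_limit_s = rsc_fpops_est * bound_multiplier / assumed_app_speed
print(f"time limit ~ {time_limit_s / 3600:.1f} hours")  # ~13.9 hours with these guesses
```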