Experimental Python tasks (beta) - task description

Richard Haselgrove

Message 58190 - Posted: 22 Dec 2021, 16:27:52 UTC - in response to Message 58189.  

Because this project still uses DCF, the 'exceeded time limit' problem should go away as soon as you can get a single task to complete. Both my machines with finished tasks are now showing realistic estimates, but with DCFs of 5+ and 10+ - I agree, the FLOPs estimate should be increased by that sort of multiplier to keep estimates balanced against other researchers' work for the project.
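For readers unfamiliar with DCF: BOINC's Duration Correction Factor multiplies the raw size-based runtime estimate. A minimal sketch of that relationship (the function name and device speed are illustrative, not BOINC source code):

```python
# Sketch of how BOINC's Duration Correction Factor (DCF) rescales
# runtime estimates. Simplified: real BOINC also tracks per-app-version
# speeds; names here are illustrative.

def estimated_runtime(rsc_fpops_est, device_flops, dcf):
    """Estimated task duration in seconds."""
    return rsc_fpops_est / device_flops * dcf

# A 5e15-FLOP task on a device benchmarked at 1 TFLOPS:
base = estimated_runtime(5e15, 1e12, dcf=1.0)        # 5000 s
corrected = estimated_runtime(5e15, 1e12, dcf=10.0)  # 50000 s
```

This is why a single completed task helps: it updates the DCF, and all subsequent estimates for the project scale accordingly.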

The screen shot also shows how the 'remaining time' estimate gets screwed up when the running value reaches something like 10 hours at 10%. Roll on intermediate progress reports and checkpoints.
Ian&Steve C.

Message 58191 - Posted: 22 Dec 2021, 17:05:06 UTC
Last modified: 22 Dec 2021, 17:05:49 UTC

My system that completed a few tasks had a DCF of 36+.

Checkpointing also still isn't working. I had some tasks running for ~3 hrs; I restarted BOINC and they resumed at 5 min.
Richard Haselgrove

Message 58192 - Posted: 22 Dec 2021, 18:52:57 UTC - in response to Message 58191.  

> checkpointing also still isn't working.

See my screenshot.

"CPU time since checkpoint: 16:24:44"
Richard Haselgrove

Message 58193 - Posted: 22 Dec 2021, 18:59:00 UTC

I've checked a sched_request when reporting.

<result>
    <name>e1a26-ABOU_rnd_ppod_11-0-1-RND6936_0</name>
    <final_cpu_time>55983.300000</final_cpu_time>
    <final_elapsed_time>36202.136027</final_elapsed_time>

That's task 32731632. So it's the server applying the 'sanity(?) check' "elapsed time not less than CPU time". That's right for a single-core GPU task, but not for a task with multithreaded CPU elements.
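The failing check can be illustrated with the numbers above (the helper name is mine, not BOINC's):

```python
# Why "elapsed time >= CPU time" is a poor sanity check for multithreaded
# tasks: with several cores busy, summed CPU time exceeds wall-clock time.

def passes_sanity_check(elapsed, cpu_time):
    return elapsed >= cpu_time

# Values from the sched_request above:
elapsed, cpu = 36202.136027, 55983.3
assert not passes_sanity_check(elapsed, cpu)  # flagged, though legitimate

# Average effective core usage is roughly cpu / elapsed:
cores_used = cpu / elapsed  # ~1.55 cores on average
```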
abouh

Message 58194 - Posted: 23 Dec 2021, 10:07:59 UTC - in response to Message 58187.  

As mentioned by Ian&Steve C., GPU speed only partially influences task completion time.

During the task, the agent first interacts with the environments for a while, then uses the GPU to process the collected data and learn from it, then interacts again with the environments, and so on.

In the last batch, I reduced the total number of agent-environment interactions gathered and processed before ending the task, compared to the previous batch, which should have reduced the completion time.
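The alternating loop described above can be sketched roughly like this (pure-Python stand-ins; the real task uses a full RL stack with CPU-side environments and a GPU learner):

```python
# Illustrative collect/learn loop: CPU-bound environment stepping
# alternates with GPU-bound learning. Halving total_interactions halves
# the number of update cycles, which is how a smaller batch target
# shortens completion time.

def collect_interactions(n):
    # stand-in for CPU-side environment stepping
    return list(range(n))

def learn(batch):
    # stand-in for the GPU update on the collected data
    pass

def run_task(total_interactions, batch_size):
    processed, updates = 0, 0
    while processed < total_interactions:
        batch = collect_interactions(batch_size)  # CPU phase
        learn(batch)                              # GPU phase
        processed += len(batch)
        updates += 1
    return updates
```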
abouh

Message 58195 - Posted: 23 Dec 2021, 10:09:32 UTC
Last modified: 23 Dec 2021, 10:19:03 UTC

I will look into the reported issues before sending the next batch, to see if I can find a solution for both the problem of jobs being killed due to “exceeded time limit” and the progress and checkpointing problems.

From what Ian&Steve C. mentioned, I understand that increasing the "Estimated Computation Size", however BOINC calculates that, could solve the problem of jobs being killed?

Thank you very much for your feedback. Happy holidays to everyone!
Richard Haselgrove

Message 58196 - Posted: 23 Dec 2021, 13:16:56 UTC - in response to Message 58195.  

> From what Ian&Steve C. mentioned, I understand that increasing the "Estimated Computation Size", however BOINC calculates that, could solve the problem of jobs being killed?

The jobs reach us with a workunit description:

<workunit>
    <name>e1a24-ABOU_rnd_ppod_11-0-1-RND1891</name>
    <app_name>PythonGPU</app_name>
    <version_num>401</version_num>
    <rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>4000000000.000000</rsc_memory_bound>
    <rsc_disk_bound>10000000000.000000</rsc_disk_bound>
    <file_ref>
        <file_name>e1a24-ABOU_rnd_ppod_11-0-run</file_name>
        <open_name>run.py</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>e1a24-ABOU_rnd_ppod_11-0-data</file_name>
        <open_name>input.zip</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>e1a24-ABOU_rnd_ppod_11-0-requirements</file_name>
        <open_name>requirements.txt</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>e1a24-ABOU_rnd_ppod_11-0-input_enc</file_name>
        <open_name>input</open_name>
        <copy_file/>
    </file_ref>
</workunit>

It's the fourth line, '<rsc_fpops_est>', which causes the problem. The job size is given as the estimated number of floating point operations to be calculated, in total. BOINC uses this, along with the estimated speed of the device it's running on, to estimate how long the task will take. For a GPU app, it's usually the speed of the GPU that counts, but in this case - although it's described as a GPU app - the dominant factor might be the speed of the CPU. BOINC doesn't take any direct notice of that.

The jobs are killed when they reach the duration calculated from the next line, '<rsc_fpops_bound>'. A quick and dirty fix while testing might be to increase that value even above the current 50x the original estimate, but that removes a valuable safeguard during normal running.
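A rough sketch of the arithmetic, assuming a device benchmarked at 1 TFLOPS (function names are illustrative; real BOINC also folds in DCF and per-app-version speed estimates):

```python
# How <rsc_fpops_est> and <rsc_fpops_bound> translate into an estimated
# duration and a kill limit (simplified).

def duration_estimate(fpops_est, device_flops):
    return fpops_est / device_flops

def kill_limit(fpops_bound, device_flops):
    return fpops_bound / device_flops

est = duration_estimate(5e15, 1e12)   # ~5000 s estimated
limit = kill_limit(2.5e17, 1e12)      # aborted past ~250000 s

# bound / est gives the 50x safety margin mentioned above
margin = limit / est                  # 50.0
```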
abouh

Message 58197 - Posted: 23 Dec 2021, 15:57:01 UTC - in response to Message 58196.  
Last modified: 23 Dec 2021, 21:34:36 UTC

I see, thank you very much for the info. I asked Toni to help me adjust the "rsc_fpops_est" parameter. Hopefully the next jobs won't be aborted by the server.

Also, I checked the progress and the checkpointing problems. They were caused by format errors.

The Python scripts were logging the progress into a "progress.txt" file, but apparently BOINC wants just a file named "progress", without extension.

Similarly, checkpoints were being generated, but were not identified correctly since they were not called "restart.chk".

I will work on fixing these issues before the next batch of tasks.
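A hedged sketch of what the task-side fix might look like, given the file names in the post (the helper functions and the atomic-rename pattern are my illustration, not the actual GPUGRID code):

```python
# The BOINC wrapper looks for a file named exactly "progress" (a fraction
# in [0, 1]) and a checkpoint named "restart.chk". File names are from
# the post; everything else here is illustrative.
import os
import pickle

def report_progress(fraction):
    # Write to a temp file and rename, so the wrapper never reads a
    # half-written value.
    with open("progress.tmp", "w") as f:
        f.write(f"{fraction:.4f}")
    os.replace("progress.tmp", "progress")

def save_checkpoint(state):
    with open("restart.chk.tmp", "wb") as f:
        pickle.dump(state, f)
    os.replace("restart.chk.tmp", "restart.chk")
```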
Keith Myers
Message 58198 - Posted: 23 Dec 2021, 19:35:37 UTC - in response to Message 58197.  

Thanks @abouh for working with us in debugging your application and work units.

It's nice to have an attentive and easy-to-work-with researcher.

Looking forward to the next batch.
ServicEnginIC
Message 58200 - Posted: 23 Dec 2021, 21:20:01 UTC - in response to Message 58194.  

Thank you for your kind support.

> During the task, the agent first interacts with the environments for a while, then uses the GPU to process the collected data and learn from it, then interacts again with the environments, and so on.

This behavior can be seen in some tests described in my Managing non-high-end hosts thread.
abouh

Message 58201 - Posted: 24 Dec 2021, 10:02:52 UTC

I just sent another batch of tasks.

I tested locally and the progress and the restart.chk files are correctly generated and updated.

The rsc_fpops_est job parameter should now be higher, too.

Please let us know if you think the success rate of tasks can be improved in any other way. Thanks a lot for your help.
ServicEnginIC
Message 58202 - Posted: 24 Dec 2021, 10:35:31 UTC - in response to Message 58201.  

> I just sent another batch of tasks.

Thank you very much for this kind of Christmas present!

Merry Christmas to all crunchers worldwide 🎄✨
Richard Haselgrove

Message 58203 - Posted: 24 Dec 2021, 11:38:42 UTC
Last modified: 24 Dec 2021, 12:09:40 UTC

1,000,000,000 GFLOPs - initial estimate 1690d 21:37:58. That should be enough!

I'll watch this one through, but after that I'll be away for a few days - happy holidays, and we'll pick up again on the other side.

Edit: Progress %age jumps to 10% after the initial unpacking phase, then increments in 0.9% steps. That'll do.
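One mapping consistent with these observations, assuming the remaining 90% is split over 100 learning steps (the step count is my guess from the 0.9% increments):

```python
# Illustrative progress mapping: 10% reported after unpacking, then the
# remaining 90% spread evenly over the learning steps.

def progress(steps_done, total_steps=100, setup_fraction=0.10):
    return setup_fraction + (1 - setup_fraction) * steps_done / total_steps

progress(0)    # 0.10  - right after unpacking
progress(1)    # 0.109 - one 0.9% increment
progress(100)  # 1.0   - task complete
```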
ServicEnginIC
Message 58204 - Posted: 24 Dec 2021, 12:51:06 UTC - in response to Message 58201.  

> I tested locally and the progress and the restart.chk files are correctly generated and updated.
> rsc_fpops_est job parameter should be higher too now.

On a preliminary look at one new Python GPU task received today:
- Progress estimation is now working properly, updating in 0.9% increments.
- Estimated computation size has been raised to 1,000,000,000 GFLOPs, as also confirmed by Richard Haselgrove.
- Checkpointing also seems to be working; a checkpoint is stored about every two minutes.
- The learning cycle period has dropped to 11 seconds, from the 21 seconds observed in the previous task (watched with sudo nvidia-smi dmon).
- GPU dedicated RAM usage seems to have been reduced, but I don't know if it's enough for running on 4 GB RAM GPUs (?)
- Current progress for task e1a20-ABOU_rnd_ppod_13-0-1-RND1192_0 is 28.9% after 2 hours and 13 minutes running. This leads to a total true execution time of about 7 hours and 41 minutes on my Host #569442.
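The last figure is just elapsed time divided by fraction done:

```python
# Extrapolating total runtime from fractional progress: 28.9% after
# 2 h 13 min implies roughly 460 minutes, i.e. about 7 h 40 min total
# (assuming progress is linear).

def total_runtime_minutes(elapsed_min, fraction_done):
    return elapsed_min / fraction_done

total = total_runtime_minutes(2 * 60 + 13, 0.289)  # ~460.2 min
```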

Well done!
Keith Myers
Message 58208 - Posted: 24 Dec 2021, 16:43:12 UTC

Same observed behavior: GPU memory halved, progress indicator normal, and GFLOPs in line with actual usage.

Well done.
ServicEnginIC
Message 58209 - Posted: 24 Dec 2021, 17:38:21 UTC - in response to Message 58204.  

> - GPU dedicated RAM usage seems to have been reduced, but I don't know if it's enough for running on 4 GB RAM GPUs (?)

I'm answering my own question: I enabled Python GPU task requests on my GTX 1650 SUPER 4 GB system, and happened to catch the previously failed task e1a21-ABOU_rnd_ppod_13-0-1-RND2308_1.
This task has passed the initial processing steps and has reached the learning cycle phase.
At this point, memory usage is right at the limit of the 4 GB of available GPU RAM.
Waiting to see whether this task succeeds or not.
System RAM usage remains very high: 99% of the 16 GB available on this system is currently in use.
Richard Haselgrove

Message 58210 - Posted: 24 Dec 2021, 22:56:33 UTC - in response to Message 58204.  

> - Current progress for task e1a20-ABOU_rnd_ppod_13-0-1-RND1192_0 is 28.9% after 2 hours and 13 minutes running. This leads to a total true execution time of about 7 hours and 41 minutes on my Host #569442.

That's roughly the figure I got in the early stages of today's tasks. But task 32731884 has just finished with

<result>
    <name>e1a17-ABOU_rnd_ppod_13-0-1-RND0389_3</name>
    <final_cpu_time>59637.190000</final_cpu_time>
    <final_elapsed_time>39080.805144</final_elapsed_time>

That's very similar to (and on the same machine as) the one I reported in message 58193. So I don't think the task duration has changed much: maybe the progress %age isn't quite linear (but not enough to worry about).
abouh

Message 58218 - Posted: 29 Dec 2021, 8:31:14 UTC

Hello,

Reviewing which jobs failed in the last batches, I have seen this error several times:

21:28:07 (152316): wrapper (7.7.26016): starting
21:28:07 (152316): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")
[152341] INTERNAL ERROR: cannot create temporary directory!
[152345] INTERNAL ERROR: cannot create temporary directory!
21:28:08 (152316): /usr/bin/flock exited; CPU time 0.147100
21:28:08 (152316): app exit status: 0x1
21:28:08 (152316): called boinc_finish(195


I have found an issue from Richard Haselgrove talking about this error: https://github.com/BOINC/boinc/issues/4125

It seems like the users getting this error could simply solve it by setting PrivateTmp=true. Is that correct? What is the appropriate way to modify that?
ServicEnginIC
Message 58219 - Posted: 29 Dec 2021, 9:15:02 UTC - in response to Message 58218.  

> It seems like the users getting this error could simply solve it by setting PrivateTmp=true. Is that correct? What is the appropriate way to modify that?

Right.
I gave a step-by-step solution, based on Richard Haselgrove's finding, in my Message #55986.
It worked fine for all my hosts.
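For reference, on a systemd-managed install the fix described in that message generally amounts to a unit override like this (the unit name boinc-client and paths may differ per distro; this mirrors the PrivateTmp discussion in BOINC issue #4125):

```shell
# Give the BOINC client service a private /tmp so the miniconda
# installer can create its temporary directory.
sudo mkdir -p /etc/systemd/system/boinc-client.service.d
sudo tee /etc/systemd/system/boinc-client.service.d/override.conf <<'EOF'
[Service]
PrivateTmp=true
EOF
sudo systemctl daemon-reload
sudo systemctl restart boinc-client
```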
abouh

Message 58220 - Posted: 29 Dec 2021, 9:26:29 UTC - in response to Message 58219.  

Thank you!

©2025 Universitat Pompeu Fabra