Experimental Python tasks (beta) - task description

Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
I added the discussed changes and deployed them to the PythonGPUbeta app. More specifically:

1. I changed the 7za.exe executable to (I believe) the latest version; in any case, a much newer one than the one previously used.
2. I now compress the conda-environment files to .txz. I use the default --compress-level (4), because I tried with 9 and the compressed file size was the same.

As Aleksey mentioned, the unpacking still needs to be done in two steps, but at least the files sent out are now smaller thanks to the more efficient compression.

Did anyone catch any of the PythonGPUbeta jobs? They seemed to work.

Regarding what bibi mentioned, /Windows/System32/cmd.exe seems to be present on all Windows machines so far; at least I have not seen any job fail because of it. I have sent 64 test jobs in total.
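For readers unfamiliar with the two-step unpacking mentioned above, a minimal sketch is shown below. The filenames are hypothetical, and 7za.exe is assumed to be reachable on the PATH (or in the task's slot directory).

```python
# Hypothetical sketch of the two-step unpacking: 7za first undoes the xz
# compression of the .txz, then a second pass extracts the resulting tar.
import subprocess

archive = "conda-env.txz"     # hypothetical name of the compressed conda environment
inner_tar = "conda-env.tar"   # assumption: 7-Zip derives the intermediate .tar name from the archive name

# Step 1: .txz -> .tar (xz decompression only)
subprocess.run(["7za.exe", "x", "-y", archive], check=True)

# Step 2: .tar -> files (untar into the current directory)
subprocess.run(["7za.exe", "x", "-y", inner_tar], check=True)
```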

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 662
No, I haven't been lucky enough yet to snag any of the beta tasks.

Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 318
One of my Linux machines has just crashed two tasks in succession with

UnboundLocalError: local variable 'features' referenced before assignment

https://www.gpugrid.net/results.php?hostid=508381

Edit: make that three. And a fourth looks to be heading in the same direction; many other users have tried it already.
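For context, this class of error is easy to reproduce. The snippet below is only an illustration of the failure mode (not the project's actual code): a local such as 'features' is bound on one branch but read unconditionally.

```python
def build_features(samples):
    if samples:                              # 'features' is only bound on this branch
        features = [s * 2 for s in samples]
    return features                          # UnboundLocalError when samples is empty

def build_features_fixed(samples):
    features = []                            # bind a default before the branch
    if samples:
        features = [s * 2 for s in samples]
    return features
```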

Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Thanks for the warning, Richard. I have just fixed the error; it should not be present in the jobs starting a few minutes from now.

Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 318
Yes, the next one has got well into the work zone: 1.99%. Thank you.

Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0
Just an observation: BOINC does not seem to count a GPUGrid task as a task. Yesterday my finger brushed against Moo's "allow new WU's" and it promptly downloaded 12 WUs. All 12 were running, with the GPUGrid task also running. I've never seen that before. I took remedial action; none errored out.

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
I tried to run 1 Python on a second BOINC instance.

So far, the Pythons have run on the "regular" instance, one task each on two RTX 3070s, without problems. Runtime was about 22-23 hours. On the "regular" instance I now run two PrimeGrid tasks, ones that use only the GPU and no CPU. Hence, running Pythons in addition would be a nice supplement, using a lot of CPU and only part of the GPU.

After I started a Python on the second BOINC instance, all ran normally for a short while: CPU usage climbed to close to 100%, VRAM usage was close to 4 GB, system RAM some 8 GB. However, after a few minutes, CPU usage for the Python went down to about 15%. RAM and VRAM usage stayed at the same level as before. The progress bar in the BOINC Manager showed some 2.980% after about 3 hours. So it was clear that something was going wrong, and I aborted the task. Stderr can be seen here: https://www.gpugrid.net/result.php?resultid=33056430

I then started another task, just to rule out that the earlier problem was a one-timer. However, the same problem occurred again. What's going wrong?

FYI, recently I ran altogether three Pythons on two RTX 3070s, which means that on one of them two Pythons were crunched simultaneously. No problem at all; the total runtime for each of the two tasks was just a little longer than for one task per GPU.

Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0
My question is: how can 13 tasks run on a 12-thread machine? Is it a good idea to run other tasks? Also, why was BOINC not taking the GPUGrid task into account?

Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 318
If the 13th task is assessed (by the project and BOINC in conjunction) to require less than 1.0000 of a CPU, it will be allowed to run in parallel with a fully occupied CPU. Because it is a GPU task, it will run at a slightly higher CPU priority, so it will steal CPU cycles from the pure CPU tasks, but on a modern multitasking OS they won't notice the difference.
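This is not BOINC source code, just the arithmetic of the explanation above with hypothetical numbers: a GPU task reserving less than a whole CPU does not displace any of the pure CPU tasks.

```python
n_threads = 12           # logical CPUs on the host
cpu_tasks = 12           # pure CPU tasks, each reserving 1.0 CPU
gpu_task_cpu = 0.987     # hypothetical fractional CPU reservation of the GPU task

assert cpu_tasks * 1.0 <= n_threads                    # the CPU tasks alone fill the machine
total_committed = cpu_tasks * 1.0 + gpu_task_cpu       # slight overcommitment is tolerated
print(f"{cpu_tasks + 1} tasks running, {total_committed:.3f} CPUs committed on {n_threads} threads")
```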

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 4,772
Erich56 wrote:
I tried to run 1 Python on a second BOINC instance.

I think you're trying to do too much at once. 22-24 hrs is incredibly slow for a single task on a 3070; my 3060 does them in 13 hrs while running 3 tasks at a time (4.3 hrs effective speed).

If you want any kind of reasonable performance, you need to stop processing other projects on the same system, or at the very least adjust your app_config file to reserve more CPU for your Python task and prevent BOINC from running too much extra work from other projects. Switch to Linux for even better performance.
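For reference, an app_config.xml along the lines suggested above might look like the sketch below, placed in the GPUGRID project directory. The app name "PythonGPU" and the numbers are assumptions; check client_state.xml (or the event log) for the actual app name and adjust the values to your hardware.

```xml
<app_config>
  <app>
    <name>PythonGPU</name>           <!-- assumed app name; verify before use -->
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>     <!-- two Python tasks share one GPU -->
      <cpu_usage>4.0</cpu_usage>     <!-- reserve 4 CPU threads per task -->
    </gpu_versions>
  </app>
</app_config>
```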

Joined: 10 Nov 13 · Posts: 101 · Credit: 15,773,211,122 · RAC: 0
Erich56, the first two tasks I checked you didn't let finish extracting. The others look a bit inconclusive, but you restarted the tasks, so that could be it.

Leave them alone and let them run. If they stall at 2% for an extended time, check the stderr file to see if there is an error that should be addressed. Look to see whether they are actually running before you abort. If a task is working, it should get to the "Created Learner." step and continue running from there. There are some jobs that just fail with an unknown cause, but these haven't gotten that far yet.

8 GB of system memory is on the low side to run the Python apps successfully. It can be done, but you really shouldn't be running anything else. Also, the Python apps need up to 48 GB of swap space configured on Windows systems; if you haven't already done it, I would suggest increasing it.

Simplify your troubleshooting and cut down on the variables. Run only one BOINC instance and one Python task, and see how that goes first. After you confirm that's working, you can possibly run an additional Python task or maybe a different GPU project at the same time. While you generally want to maximize the usage of your system, it's not good to slam it to the ceiling either.
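A quick way to do that check without aborting anything is to scan the stderr.txt files in the BOINC slot directories. The sketch below is only an example; the data-directory path is the Windows default and may differ on your installation.

```python
from pathlib import Path

BOINC_DATA = Path(r"C:\ProgramData\BOINC")   # assumption: default BOINC data directory on Windows

for stderr_file in BOINC_DATA.glob("slots/*/stderr.txt"):
    text = stderr_file.read_text(errors="replace")
    status = "reached 'Created Learner.'" if "Created Learner." in text else "not there yet"
    print(f"slot {stderr_file.parent.name}: {status}")
```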

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
Ian&Steve C. wrote:
I think you're trying to do too much at once. 22-24 hrs is incredibly slow for a single task on a 3070; my 3060 does them in 13 hrs while running 3 tasks at a time (4.3 hrs effective speed).

I agree, at the moment it may be "too much at once" :-)

FYI, I recently bought another PC with 2 CPUs (8-core/8-HT each) and 1 GPU, upgraded the RAM from 128 GB to 256 GB and created a 128 GB ramdisk; and on an existing PC with a 10-core/10-HT CPU plus 2 RTX 3070s I upgraded the RAM from 64 GB to 128 GB (the maximum possible on this MoBo). So no surprise that I am now just testing what's possible. And by doing this I keep finding out, of course, that sometimes I am expecting too much.

As for the (low) speed of my two RTX 3070s: I have always been very conservative about GPU temperatures, which means I have them run at about 60-61°C, not higher. With two such GPUs inside the same box, heat of course is an issue. Despite good airflow, in order to keep the GPUs at the above-mentioned temperature I need to throttle them down to about 50-65% (different for each GPU). This explains the longer runtimes of the Pythons. If I had two boxes with one RTX 3070 inside each, I am sure there would be no need for throttling.

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
jjch wrote:
Erich56 ...

Thanks for taking the time to deal with my problem.

Well, by now it has become clear to me what the cause of the failure was: obviously, running a PrimeGrid GPU task and a Python on the same GPU does not work for the Python. After a PrimeGrid task finished, I started another Python, and it is running well.

As for memory, you may have misunderstood: when I mentioned the 8 GB, I meant that I could see in the Windows Task Manager that the Python was using 8 GB. Total RAM on this machine is 64 GB, so more than enough. The same goes for the swap space: I had set it manually to 100 GB minimum and 150 GB maximum, so also more than enough. Anyway, the problem has been identified.

Whereas I had no problem running two Pythons on the same GPU (even three might work), it is NOT possible to have a Python run alongside a PrimeGrid task. So for me, this was a good learning process :-)

Again, thanks for your time investigating my failed tasks.

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
I just discovered the following problem on the PC which consists of:

2 CPUs, Xeon E5, 8-core/16-HT each
1 GPU, Quadro P5000
128 GB ramdisk
128 GB system memory

Until a few days ago, I ran 2 Pythons simultaneously (with a setting of 0.5 gpu usage in app_config.xml). Now, while only 1 Python is running and I push the update button in the BOINC Manager to fetch another Python, the BOINC event log tells me that no Pythons are available. That is not the case, though, as the server status page shows some 550 tasks for download; besides, I just downloaded one on another PC. BTW, the Python task uses only some 50% of the processor, which seems logical with 2 CPUs inside.

So I tried to download tasks from other projects, and in all cases the event log says: "not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full)". How can that be the case? In the BOINC computing preferences I have now set "store at least" to 10 days of work, and "store up to an additional" also to 10 days. However, this did not solve the problem. There is about 94 GB of free space on the ramdisk, and some 150 GB of free system RAM.

What also catches my eye: the one running Python, which right now shows 45% progress after some 10 hours, shows a remaining runtime of 34 days! Before, as on my other machines, the remaining runtime for Pythons was indicated as 1-2 days. Could this be the reason why nothing else can be downloaded and I get the message "job cache full"?

Can anyone help me get out of this problem?

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
I just discovered the following problem on the PC which consists of: ...

Meanwhile, the problem has become even worse: after downloading 1 Python, it starts, and in the BOINC Manager it shows a remaining runtime of about 60 days (!!!). In reality, the task proceeds at normal speed and will be finished within 24 hours, like all other tasks before on this machine.

Hence, nothing else can be downloaded. When trying to download tasks from other projects, it shows "not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full)". When I try to download a second Python, it says "no tasks are available for Python apps for GPU hosts", which is not correct; there are some 150 available for download at the moment.

Can anyone give me advice on how to get this problem solved?

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 662
It can't. Due to the dual nature of the Python tasks, BOINC has no mechanism to correctly show the estimated time to completion. The tasks do not take the time shown to complete and can in fact be returned well within the standard 5-day deadline.
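A simplified illustration of why the inflated estimate blocks work fetch, using the numbers from the posts above: the client compares the estimated remaining runtime of queued work against the configured buffer, not the real runtime.

```python
buffer_days = 10 + 10            # "store at least" + "store up to an additional"
estimated_remaining_days = 60    # BOINC's (wrong) estimate for the running Python task
actual_remaining_days = 1        # roughly what the task really needs

if estimated_remaining_days >= buffer_days:
    print("not requesting tasks: don't need (job cache full)")   # what the event log reports
else:
    print("requesting more tasks")
```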

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
It can't. Due to the dual nature of the Python tasks, BOINC has no mechanism to correctly show the estimated time to completion.

But how come that on three other of my systems, on which I have been running Pythons for a while, the "remaining runtimes" are shown pretty correctly (+/- 24 hours)? And on the machine in question, the time was also indicated correctly until recently. Something must have happened yesterday, but I do not know what.

If your assumption were right, no BOINC instance could run more than one Python in parallel. Didn't you say somewhere here in the forum that you are running 3 Pythons in parallel? How can a second and a third task be downloaded if the first one shows a remaining runtime of 30 or 60 days? What are the remaining runtimes shown for your Pythons once they get started?

Joined: 4 Mar 18 · Posts: 53 · Credit: 2,815,476,011 · RAC: 0
Let me offer another possible "solution". (I am running two Python tasks on my system.)

I found I had to set my resource share much, much higher for GPUGrid to share effectively with other projects. I originally had resource shares of 160 for GPUGrid vs. 10 for Einstein and 40 for TN-Grid. Since the Python tasks 'use' so much CPU time in particular (at least reported CPU time), it seems to affect the resource-share calculations as well. I had to move my resource share for GPUGrid to 2,000 to get it both to run two at once and to get BOINC to share with Einstein and TN-Grid roughly the way I wanted. (Nothing magic about my resource-share ratios; I'm just providing an example of how extreme I went to get it to balance the way I wanted.)

Regarding the estimated time to completion, I have not seen them correct on my system yet, though it is getting better. At first the Python tasks were starting at 1338 days (!) and are now starting at 23 days. Interesting to hear that some of yours are showing correctly! What setup are you using on the hosts showing correct times?
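As a rough illustration of why such an extreme value was needed: BOINC grants work in proportion to resource share, so with the numbers quoted above the split works out as follows.

```python
# Proportional split of the resource shares mentioned in the post above.
shares = {"GPUGrid": 2000, "Einstein": 10, "TN-Grid": 40}
total = sum(shares.values())
for project, share in shares.items():
    print(f"{project}: {share / total:.1%} of the long-term resource allocation")
```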

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 662
No, that was my teammate who is running 3X concurrent on his GPUs. He runs nothing but GPUGrid on those hosts.

I, OTOH, run multiple projects at the same time on my hosts, so the GPUGrid tasks have to share resources. That is a balancing act. I run a custom client that allows me to get around the normal BOINC client and project limitations; I can ask for as much or as little work as I want on any host.

Currently, I am running a single task on half a GPU in each host. I tried to run 2X on the GPU, but I don't have enough resources to support 2 tasks on the host and run all my other projects at the same time. But the task runs well sharing the GPU with my other GPU projects, and that keeps GPU utilization much higher than running only the Python task.

The GPUGrid tasks start up with multiple hundreds of days expected before completion. That drops down to only a couple of days once the task gets over 90% completion.

This is what BoincTasks is showing for the 5 tasks I am currently running on my hosts, with estimated completion times:

GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00014a06316-ABOU_rnd_ppod_expand_demos25-0-1-RND9172_3 01:05:30 (02:57:04) 90.11 3.970 157d,17:33:34 10/7/2022 4:27:00 PM 3C + 0.5NV (d1) Running High P. Darksider
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00005a00032-ABOU_rnd_ppod_expand_demos25_2-0-1-RND9669_0 13:30:26 (04d,00:21:21) 237.79 34.660 27d,12:31:49 10/7/2022 4:02:16 AM 3C + 0.5NV (d2) Running High P. Numbskull
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00012a04847-ABOU_rnd_ppod_expand_demos25-0-1-RND2344_4 13:27:51 (01d,09:45:50) 83.59 48.520 10d,20:41:45 10/7/2022 4:05:00 AM 3C + 0.5NV (d1) Running High P. Pipsqueek
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00015a05913-ABOU_rnd_ppod_expand_demos25-0-1-RND9942_0 21:04:49 (05d,14:22:40) 212.49 39.610 28d,03:53:33 10/6/2022 8:04:45 PM 3C + 0.5NV (d2) Running High P. Rocinante
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00008a00044-ABOU_rnd_ppod_expand_demos25_2-0-1-RND2891_2 01:23:31 (02:53:39) 69.30 3.970 22d,07:56:42 10/7/2022 4:09:00 PM 3C + 0.5NV (d0) Running High P. Serenity

I'll finish all of the tasks in under 24 hours on the high-clocked hosts for maximum credit awards. I'll miss the 24-hour bonus by half an hour or so on the server hosts because of their slower clocks.

Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
Regarding the estimated time to completion, I have not seen them correct on my system yet, though it is getting better. At first the Python tasks were starting at 1338 days (!) and are now starting at 23 days. Interesting to hear that some of yours are showing correctly! What setup are you using on the hosts showing correct times?

On one of my hosts a new Python started some 25 minutes ago; the "remaining time" is shown as 13 hrs. No particular setup. In past years this host crunched numerous ACEMD tasks; since a few weeks ago it has been crunching Pythons. GTX 980 Ti. Besides that, 2 "Theory" tasks from LHC are running.