Message boards :
News :
Experimental Python tasks (beta) - task description
| Author | Message |
|---|---|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
Can the http://www.gpugrid.net/apps.php link be put next to the Server status link? I'd like to see this change in the website design too. It would be much easier to access than having to manually edit the URL or find the one apps link on the main project JoinUs page. |
|
Joined: 8 Aug 19 Posts: 252 Credit: 458,054,251 RAC: 0
|
> Can http://www.gpugrid.net/apps.php link be put next to Server status link?

You might want to repost that in the wish-list thread so it's there when the webmaster gets around to updating the site; I fear they may be too busy at this time. I went ahead and put a link in my browser until then. Thanks for posting that page link. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
Right now: ~14,200 "unsent" Python tasks in the queue. Now down to fewer than 500; these went much quicker than I anticipated, only about 3 weeks.
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
So, again, what will the status of the expected new application be? Beta to start with? Removal of wandb? A new nthreads value? A new job_xxx.xml file? A new compilation for Ada devices? |
|
Joined: 6 Mar 18 Posts: 38 Credit: 1,340,042,080 RAC: 27
|
Will the new app be fine on 1 CPU core, or will it still require many? On my Windows box at the moment I have to manually allocate 24 cores to the WU so it does not get starved when other projects run at the same time. |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
Pretty sure you are confusing cores with processes. The app will still spin up 32 Python processes, but processes are not cores. From testing of the modified job.xml file, the new app will probably need as few as 4 cores/threads to run. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
There are two separate mechanisms by which this app spins up multiple processes/threads, and the fix will only reduce one of them. Since each task trains 32 agents at once, those 32 processes still spin up; the fix I helped uncover only addresses the unnecessary extra CPU usage from the n-cores extra processes. I've been running with those capped at 4, and it seems fine. About Ada support: since this app is not really an "app" (it's not a compiled binary, just a script), it already works fine with Ada, according to other users running it on their 40-series cards. It's the Acemd3 app that needs to be recompiled for Ada.
|
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
The job_xxx.xml will also remain the same, since the instructions are as simple as: 1. unpack the conda Python environment with all required dependencies; 2. run the provided Python script; 3. return the result files. So I am only changing the provided Python script. As Ian mentioned, it is not a compiled app. The only difference is that the packed conda environment contains cuda10 (10.2.89) or cuda11 (11.3.1) depending on the host GPU. Is that enough to support Ada GPUs? |
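The three steps above could be sketched as a minimal wrapper. This is a hypothetical illustration, not the actual GPUGRID job script: the function name, file names, and the use of `sys.executable` (standing in for the unpacked environment's own interpreter) are all assumptions.

```python
import os
import subprocess
import sys
import tarfile

def run_job(env_tarball, script, workdir):
    """Hypothetical sketch of the three job steps; not the real GPUGRID wrapper."""
    # 1. Unpack the conda environment shipped with the task.
    with tarfile.open(env_tarball) as tar:
        tar.extractall(os.path.join(workdir, "env"))
    # 2. Run the provided Python script. The real job would invoke the
    #    unpacked interpreter (e.g. env/bin/python); sys.executable stands in.
    subprocess.run([sys.executable, script], check=True, cwd=workdir)
    # 3. Return result files: BOINC uploads whatever the job spec names,
    #    so report what now exists in the working directory.
    return sorted(os.listdir(workdir))
```

Because the science logic lives entirely in the script run at step 2, changing the science does not require recompiling or redeploying an app binary.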
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
Only 75 jobs in the queue! Thank you all for your support :) I imagine they will all be processed today. As I mentioned in an earlier post, the next steps will be the following: 1. release a new version of our reinforcement learning library (https://github.com/PyTorchRL/pytorchrl), used in the Python scripts to instantiate and train the AI agents; 2. send a small batch of PythonGPUBeta jobs with the new Python script, also using the new version of the library; 3. if everything goes well, start sending PythonGPU tasks again. I am interested in your feedback on whether or not the new script configuration helps efficiency. On my machine it seems to work fine. |
|
Joined: 6 Mar 18 Posts: 38 Credit: 1,340,042,080 RAC: 27
|
Yes, it spins up that many processes, but if I leave the app at defaults it gets choked, because BOINC will only allocate 1 thread to it and the other running projects will take up the other 31 threads. I manually allocate it 24 threads, as that is about what I observed it using when I ran that task and nothing else; this stops it from getting choked when running multiple projects. What I would like to see is the app download and allocate however many threads it needs to complete the task automatically, without needing a custom app_config file. |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
> Yea it spins up that many processes but if I leave the app at default it will get choked because Boinc will only allocate 1 thread to it and the other projects running will take up the other 31 threads.

I second that. |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
I just released the new version of the python library and sent the beta tasks. |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
Is there any BOINC-specifiable WU parameter for that? I could not find one, but I would also like to avoid hosts having to manually change configuration if possible. |
|
Joined: 18 Jul 13 Posts: 79 Credit: 210,528,292 RAC: 0
|
Use this app_config.xml:
<app_config>
    <app>
        <name>PythonGPU</name>
        <plan_class>cuda1131</plan_class>
        <gpu_versions>
            <cpu_usage>8</cpu_usage>
            <gpu_usage>1</gpu_usage>
        </gpu_versions>
        <max_concurrent>1</max_concurrent>
        <fraction_done_exact/>
    </app>
</app_config> |
|
Joined: 6 Mar 18 Posts: 38 Credit: 1,340,042,080 RAC: 27
|
Just grabbed one of the beta units and it still says "Running (0.999 CPUs and 1 GPU)", but it seems to be fluctuating between 50% and 100% load on my 32-thread CPU. If the app spins up a ton of processes that need their own threads, can the app reflect that and allocate however many threads are needed, please? For example, it should say "Running (32 CPUs and 1 GPU)", or however many it needs. That would simplify things and, I assume, cut down on failed units from users who do not know the app spins up more than one process and run it on a single thread with other apps taking up the remainder. Thanks. Edit: after an initial 100% utilisation spike it has now settled down at around 30-40% CPU utilisation. |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
But this is on the client side. On the server side, I see I can adjust these parameters for a given app: https://boinc.berkeley.edu/trac/wiki/JobIn I am open to implementing both solutions: 1. force from the server side that hosts have more than 1 CPU, 4-8 for example (the tasks spawn 32 Python processes, but 32 CPUs are not required to run them successfully), in case that is possible; so far I could not find any server option to specify it. 2. make it explicit that 32 processes are being created; I can add it to the logs, but where else can I mention it so users are aware? |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
I don't see any parameter on the JobIn page that allocates the number of CPUs a task will tie up. I don't know how the CPU resource is calculated; it must be internal to BOINC. Richard Haselgrove probably knows the answer. It varies among projects, I've noticed. I think it is calculated internally in BOINC from the client's benchmark rating and the rsc_fpops_est value the work generator assigns to tasks. Users have been able to override the project default values with their own via the app_config mechanism. But these values don't actually control how an app runs: only the science app determines how much resource the task takes. The cpu_usage value only helps the client determine how many tasks can be run, for scheduling purposes, and how much work should be downloaded. I'm currently running one of the beta tasks, and either it runs faster or the workunit is smaller than normal; probably the latter, it being beta. I notice 3 processes running run.py on the task, along with the 32 spawned processes. I don't remember the previous app spinning up more than the one run.py process. I wonder if the 3 run.py processes are tied to my <cpu_usage>3.0</cpu_usage> setting in app_config.xml. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
As you said earlier in your comment, cpu_usage only tells BOINC how much is being used; it does not exert any kind of "control" over the application directly. The previous tasks spun up a run.py child process for every core. These would be linked to the parent process; you can see them in htop. I have not been able to get any of these beta tasks myself to see what might be going on (I got some very early in the morning before I got up, but they errored because of my custom edits). There might still be a problem with them; some other users that got them seem to have errored as well.
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
I reset the project on all hosts prior to the release of the beta tasks to start with a clean slate. I have one of the beta tasks running well so far: 6.5 hrs in at 75% completion. GPUGRID 1.12 Python apps for GPU hosts beta (cuda1131) e00001a00027-ABOU_rnd_ppod_expand_demos29_betatest-0-1-RND7327_1 06:22:55 (15:21:33) 240.67 79.210 78d,21:06:03 1/30/2023 3:14:52 AM 0.998C + 1NV (d0) Running High P. Darksider. I looked at this task in htop and it is different than before. I am not talking about the 32 spawned Python processes; I was referring to 3 separate run.py process PIDs that are each using about 20% CPU, besides the main one. I hadn't configured my app_config.xml for PythonGPUbeta before I picked up the task, so I ended up with the default 0.998C core-usage value rather than the 3.0 CPU value I normally use for the regular Python-on-GPU tasks. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
What you're showing in your screenshot is exactly what I saw before: the "green" processes represent the child processes. Before, you would see as many child threads as cores; on my 16-core system there would be 16 children, on the 24-core system 24 children, on the 64-core system 64 children, and so on, for each running task. If you move the selected line by pushing the down arrow, or select one of the child processes with the cursor, you should see the top line as white text, which is the parent main process. This is all normal. Check my screenshots in this message: https://www.gpugrid.net/forum_thread.php?id=5233&nowrap=true#59239
|
©2025 Universitat Pompeu Fabra