Experimental Python tasks (beta)

Author	Message
abouh · Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Message 58640 - Posted: 13 Apr 2022, 15:16:35 UTC - in response to Message 58639.
Has the allowed limit changed to 30,000,000,000 bytes?

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 42,316
Message 58641 - Posted: 13 Apr 2022, 16:19:28 UTC
Appears so.
<rsc_disk_bound>30000000000.000000</rsc_disk_bound>

Greg _BE · Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0
Message 58642 - Posted: 13 Apr 2022, 19:28:33 UTC - in response to Message 58634. Last modified: 13 Apr 2022, 19:30:53 UTC
Quote:
  The sizes of all the app files (including the compressed environment) are:
    2.0G for windows with cuda102
    2.7G for windows with cuda1131
    1.8G for linux with cuda102
    2.6G for linux with cuda1131
  The additional task-specific data ranges from a few KB to a few MB. I did not expect 7.8G compressed (not even after unpacking the environment). Is that the case for all PythonGPU tasks now?
  Regarding CPU/GPU usage, this app actually uses a combination of both due to the nature of the problem we are tackling (training AI agents to develop intelligent behaviour in a simulated environment with reinforcement learning techniques). Interactions with the agent environment happen on the CPU; learning happens on the GPU.
Note: I was commenting on Rosetta at home CPU pythons. What yours do, I don't know. I guess I had better add your project and see what happens.
I re-added your project to my system, so if I am home when a task is sent out, I'll have a look.

abouh · Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Message 58643 - Posted: 14 Apr 2022, 7:36:34 UTC - in response to Message 58642.
Thank you! I have added the subtask weights to the PythonGPUbeta app. Currently testing it with a small batch of tasks.

abouh · Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Message 58644 - Posted: 14 Apr 2022, 8:42:41 UTC Last modified: 14 Apr 2022, 9:20:16 UTC
Testing was successful, so we can add the weights to the PythonGPU app's job.xml file.

Greg _BE · Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0
Message 58655 - Posted: 15 Apr 2022, 21:20:06 UTC
abouh, can you have a look at my comments in a thread I created? The 4.0 task's percentage done did not increase while I watched it for 10 minutes. Time to completion kept jumping around, 1 second up, 1 second down. 40 minutes run time vs CPU time? That's a hell of a lot of setup time!
Here are the local host task details:
  Application: Python apps for GPU hosts 4.03 (cuda1131)
  Workunit name: e2a18-ABOU_rnd_ppod_avoid_cnn_4-0-1-RND3898
  State: Running
  Received: 4/15/2022 12:06:46 PM
  Report deadline: 4/20/2022 12:06:46 PM
  Estimated app speed: 53.74 GFLOPs/sec
  Estimated task size: 1,000,000,000 GFLOPs
  Resources: 0.987 CPUs + 1 NVIDIA GPU (GTX 1050)
  CPU time at last checkpoint: 06:44:35
  CPU time: 06:47:39
  Elapsed time: 06:05:04
  Estimated time remaining: 198d,09:49:25
  Fraction done: 7.880%
  Virtual memory size: 7,230.02 MB
  Working set size: 2,057.87 MB
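The roughly 198-day estimate in the task details above can be approximately reproduced from the other figures in the same listing. The sketch below assumes that the remaining time is simply the unfinished fraction multiplied by (estimated task size / estimated app speed). That formula is an assumption made for illustration, not a statement of how the BOINC client actually computes its estimate, and both function names are hypothetical.

```python
def estimated_remaining_seconds(task_size_gflops, app_speed_gflops_per_sec, fraction_done):
    """Assumed model: remaining = (1 - fraction_done) * task_size / app_speed."""
    total_seconds = task_size_gflops / app_speed_gflops_per_sec
    return (1.0 - fraction_done) * total_seconds


def format_days_hms(seconds):
    """Format a duration in the 'Nd,HH:MM:SS' style used in the task details."""
    seconds = int(round(seconds))
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d,{hours:02d}:{minutes:02d}:{secs:02d}"


# Figures copied from the task details in the post above.
remaining = estimated_remaining_seconds(
    task_size_gflops=1_000_000_000,
    app_speed_gflops_per_sec=53.74,
    fraction_done=0.0788,
)
print(format_days_hms(remaining))
```

Under this assumed model the result lands at about 198 days, close to the figure reported in the post. If the model is right, the very long estimate would follow from the nominal task size of 1,000,000,000 GFLOPs and the low estimated app speed rather than from the observed progress.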

mikey · Joined: 2 Jan 09 · Posts: 303 · Credit: 7,322,550,090 · RAC: 524
Message 58666 - Posted: 17 Apr 2022, 20:16:19 UTC - in response to Message 58652.
Quote:
  You can delete the previous post about ACEMD3. I posted that incorrectly here.
Some forums let you put a double space or a double period to delete your own post, but you must still do it within the editing time.

Greg _BE · Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0
Message 58669 - Posted: 18 Apr 2022, 12:27:00 UTC - in response to Message 58666.
Mikey, I know. But the time limit to edit that post had expired. I came back days later, not within the 30-60 minutes allowed.

Werinbert · Joined: 12 May 13 · Posts: 5 · Credit: 100,032,540 · RAC: 0
Message 58672 - Posted: 18 Apr 2022, 19:31:43 UTC
I am now running a Python task. It has very low GPU usage, most often around 5 to 10%, occasionally getting up to 20%. Is this normal? Should I wait until I move my GPU from an old 3770K to a 12500 computer, with better CPU capabilities, to do these tasks?

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 42,316
Message 58673 - Posted: 18 Apr 2022, 23:12:34 UTC - in response to Message 58672.
This is normal for Python on GPU tasks. The tasks run on both the CPU and the GPU during parts of the computation, for the inferencing and machine learning segments. Read the posts by the admin developer explaining what the process involves:
- Cyclical GPU load is expected in Reinforcement Learning algorithms. Whenever GPU load is lower, CPU usage should increase. It is correct.

abouh · Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Message 58674 - Posted: 19 Apr 2022, 8:21:52 UTC - in response to Message 58655. Last modified: 19 Apr 2022, 8:24:36 UTC
Sorry for the late reply, Greg _BE. I hid the ACEMD3 posts.
I checked your job e2a18-ABOU_rnd_ppod_avoid_cnn_4-0-1-RND3898. Did the progress get stuck, or was it just increasing slowly? The job was finally completed by another Windows 10 host, but the reported CPU time looks wrong: it says 668566.9 seconds.
I am not sure, but maybe one problem is that we ask only for 0.987 CPUs, since that was ideal for ACEMD jobs. In reality Python tasks use more. I will look into it.

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 58675 - Posted: 19 Apr 2022, 8:25:47 UTC
New tasks are being issued this morning, allocated to the old Linux v4.01 'Python app for GPU hosts' issued in October 2021. All are failing with "ModuleNotFoundError: No module named 'yaml'".

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 58676 - Posted: 19 Apr 2022, 8:38:26 UTC - in response to Message 58674.
Quote:
  I am not sure, but maybe one problem is that we ask only for 0.987 CPUs, since that was ideal for ACEMD jobs. In reality Python tasks use more. I will look into it.
Asking for 1.00 CPUs (or above) would make a significant difference, because that would prompt the BOINC client to reduce the number of tasks being run for other projects.
It would be problematic to increase the CPU demand above 1.00, because the CPU loading is dynamic - BOINC has no provision for allowing another project to utilise the cycles available during periods when the GPUGrid app is quiescent.
Normally, a GPU app is given a higher process priority for CPU usage than a pure CPU app, so the operating system should allocate resources to your advantage, but that can be problematic when the wrapper app is in use. That was changed recently: I'll look into the situation with your server version and our current client versions.

abouh · Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Message 58677 - Posted: 19 Apr 2022, 9:23:32 UTC - in response to Message 58675. Last modified: 19 Apr 2022, 9:24:44 UTC
Definitely only the latest version, 4.03, should be sent. Thanks for letting us know.

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 58678 - Posted: 19 Apr 2022, 12:01:04 UTC
BOINC GPU apps, wrapper apps, and process priority
The basic rule for BOINC applications (originally CPU only) has been to run applications at idle priority, to avoid interfering with foreground use of the computer.
Since the introduction of GPU apps into BOINC around 2008, the CPU portion of a GPU app has been automatically run at a slightly higher process priority (below normal) - an attempt to avoid highly-productive GPU work being throttled by competition for CPU resources.
Normally, the BOINC client manages these two different process priorities directly. But when a wrapper app is interpolated between the client and a worker app, it's the wrapper which sets the priority for the worker app.
It was a user on this project who first noticed (Issue 3764 - May 2020) that the process priority of a GPU app wasn't being set correctly when it was executing under the control of a wrapper app. Many false starts later (PRs 3826, 3948, 3988, 3999), a fully consistent set of process priority tools was developed, effective from about 25 September 2020.
But in order for these tools to be useful, compatible versions of both the BOINC client and the wrapper application have to be used. So far as I can tell, BOINC client for Windows v7.16.20 (current) is compliant; Wrapper version 26203 is compliant; but no full public release versions of the BOINC client for Linux are yet compliant (Gianfranco Costamagna's prototyping PPA client should be).
This project appears to be using wrapper code 26016 for Windows, and wrapper code 26198 for Linux. Unless these have been patched locally, neither wrapper will yet allow full process control management.
It's not urgent, but with the new Python apps running in a mixed CPU/GPU environment, it might be helpful to update the project's wrapper codebase.
Fortunately, the basic server platform is unaffected by all this.

abouh · Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Message 58696 - Posted: 21 Apr 2022, 15:04:23 UTC - in response to Message 58675.
We have deprecated v4.01. Hopefully, if everything went fine, the error
Quote:
  All are failing with "ModuleNotFoundError: No module named 'yaml'".
should not happen any more, and all jobs should use v4.03.

Greg _BE · Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0
Message 58752 - Posted: 27 Apr 2022, 18:52:49 UTC Last modified: 27 Apr 2022, 19:41:00 UTC
abouh, I finally got another Python task. But here is something interesting: the CPU value according to BOINC Tasks is 221%! How can you get more than 100% of a single core?
Another observation: elapsed time vs CPU time. The two are off by about 5 hours, 4:01 vs 8:54 currently.
Progress is not moving very fast. In the time it has taken me to write this, it has been stuck at 7.88%. Now 4:16 vs 9:24 and still 7.88%! 15 minutes and no progress? If this hasn't changed in the next hour, I am also aborting this task. BTW, 46 checkpoints in the 4 hours of run time.
https://www.gpugrid.net/workunit.php?wuid=27219917
  Exit status 195 (0xc3) EXIT_CHILD_FAILED Computer ID 589200 Exception: The wandb backend process has shutdown GeForce GTX 1050 (2047MB) driver: 512.15
  Exit status 203 (0xcb) EXIT_ABORTED_VIA_GUI Computer ID 590211 Run time 241,306.00 CPU time 1,471.50 GeForce RTX 3080 Ti (4095MB) driver: 497.
The point of this information is:
1) I have a GTX 1050 and a 1080. My previous Python task failed with the same exit error as the first person's attempt at this task. What is EXIT_CHILD_FAILED? Something on your end or on our end?
2) Person 2 probably aborted because of the way BOINC reads the data to determine the time. I killed my first Python task because it showed 160+ days to completion.
*I give up. No progress in 30 minutes since I started this post*
  Computer: DESKTOP-LFM92VN
  Project: GPUGRID
  Name: e5a13-ABOU_rnd_ppod_avoid_cnn_4-0-1-RND0256_2
  Application: Python apps for GPU hosts 4.03 (cuda1131)
  Workunit name: e5a13-ABOU_rnd_ppod_avoid_cnn_4-0-1-RND0256
  State: Running
  Received: 4/27/2022 4:35:18 PM
  Report deadline: 5/2/2022 4:35:18 PM
  Estimated app speed: 3,171.20 GFLOPs/sec
  Estimated task size: 1,000,000,000 GFLOPs
  Resources: 0.987 CPUs + 1 NVIDIA GPU (device 1)
  CPU time at last checkpoint: 09:58:18
  CPU time: 10:08:59
  Elapsed time: 04:37:57
  Estimated time remaining: 161d,06:23:41
  Fraction done: 7.880%
  Virtual memory size: 6,429.20 MB
  Working set size: 1,072.13 MB
  Directory: slots/12
  Process ID: 16828
  Debug State: 2 - Scheduler: 2
That's 4:01 to 4:38 and still at 7.88%. Checkpoints count up. CPU is 219%. This is all messed up. I join the abort team.
------------
Something about the other task that failed with exit child. A few extracts:
  wandb: Network error (ReadTimeout), entering retry loop.
  Exception in thread StatsThr:
  Traceback (most recent call last):
    File "D:\data\slots\13\lib\site-packages\psutil\_common.py", line 449, in wrapper
      ret = self._cache[fun]
  AttributeError: 'Process' object has no attribute '_cache'
  During handling of the above exception, another exception occurred:
  (followed by line this and line that, etc)
And then this:
  OSError: [WinError 1455] The paging file is too small for this operation to complete
But the next person who got it has this kind of setup:
  CPU type: AuthenticAMD AMD Ryzen 5 5600X 6-Core Processor [Family 25 Model 33 Stepping 0]
  Number of processors: 12
  Coprocessors: NVIDIA NVIDIA GeForce RTX 3080 (4095MB) driver: 512.15
  Operating System: Microsoft Windows 11 x64 Edition, (10.00.22000.00
I run GTX and Win10 with a Ryzen 7 2800 and 7.16.20 BOINC.

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 42,316
Message 58753 - Posted: 27 Apr 2022, 19:35:21 UTC
Quote:
  But here is something interesting, the CPU value according to BOINC Tasks is 221%! How can you get more than 100% of a single core?
Because the task was actually using a little more than two cores to process the work. That is why I have set Python tasks to allocate 3 CPU threads for BOINC scheduling.
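The percentage in question can be checked against the CPU time and elapsed time reported in Message 58752. The sketch below assumes the displayed value is CPU time divided by elapsed (wall-clock) time, expressed as a percentage of one core; the helper names are hypothetical and this is not taken from the BOINC Tasks source.

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_percent(cpu_time, elapsed_time):
    """CPU time accumulated per unit of wall-clock time, as a percentage of one core."""
    return 100.0 * hms_to_seconds(cpu_time) / hms_to_seconds(elapsed_time)


# Values from the task details in Message 58752:
# CPU time 10:08:59, elapsed time 04:37:57.
usage = cpu_percent("10:08:59", "04:37:57")
print(f"{usage:.0f}% of one core")
```

A value above 100% means the process accumulated more than one second of CPU time per second of wall-clock time, which is only possible when more than one core is in use. For these figures the ratio comes out at about 2.19, consistent with the "CPU is 219%" reading in the same post.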

Greg _BE · Joined: 30 Jun 14 · Posts: 153 · Credit: 129,654,684 · RAC: 0
Message 58754 - Posted: 27 Apr 2022, 19:45:18 UTC - in response to Message 58753. Last modified: 27 Apr 2022, 19:46:26 UTC
Quote:
  But here is something interesting, the CPU value according to BOINC Tasks is 221%! How can you get more than 100% of a single core?
  Because the task was actually using a little more than two cores to process the work. Why I have set Python task to allocate 3 cpu threads for BOINC scheduling.
Ok...interesting, but what accounts for the lack of progress in 30 minutes on the task that I just killed, and for the exit child error and blow-up on the previous Python task?
I mean really...0% progress to 2 decimal places, 7.88% for more than 30 minutes? I don't know of any project that can't even advance 1/100th of a percent in 30 minutes. I've seen my share of slow tasks in other projects, but this one...wow....
And how do you go about setting just Python to 3 CPU cores? That's beyond my knowledge level.

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 42,316
Message 58755 - Posted: 27 Apr 2022, 22:31:48 UTC - in response to Message 58754.
You use an app_config.xml file in the project like this:
<app_config>
  <app>
    <name>acemd3</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemd4</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>3.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>PythonGPUbeta</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>3.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
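For anyone who would rather generate this file than type it by hand, the following is a minimal sketch that builds the same structure with Python's standard library. The function name build_app_config is hypothetical, the app names and usage values are copied from the post above, and the resulting text would still have to be saved as app_config.xml in the project's directory as the post describes. ET.indent requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_app_config(cpu_usage_by_app, gpu_usage=1.0):
    """Build app_config XML text with one <app> entry per application name."""
    root = ET.Element("app_config")
    for app_name, cpu_usage in cpu_usage_by_app.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = app_name
        gpu_versions = ET.SubElement(app, "gpu_versions")
        ET.SubElement(gpu_versions, "gpu_usage").text = f"{gpu_usage:.1f}"
        ET.SubElement(gpu_versions, "cpu_usage").text = f"{cpu_usage:.1f}"
    ET.indent(root, space="  ")  # pretty-print; available from Python 3.9
    return ET.tostring(root, encoding="unicode")


# App names and CPU allocations taken from the post above.
config_xml = build_app_config(
    {"acemd3": 1.0, "acemd4": 1.0, "PythonGPU": 3.0, "PythonGPUbeta": 3.0}
)
print(config_xml)
```

Keeping the per-app CPU allocation in a single dictionary makes it easy to change the Python apps from 3.0 to another value without touching the ACEMD entries.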
