Experimental Python tasks (beta) - task description

---
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 318

> Regarding cpu_usage, I remember having this discussion with Toni and I think the reason why we set the number of cores to that number is because with a single core the jobs can actually be executed. Even if they create 32 threads. Definitely do not require 32 cores. Is there an advantage of setting it to an arbitrary number higher than 1? Couldn't that cause some allocation problems? Sorry, it is a bit outside of my knowledge zone...

This is a consequence of the handling of GPU plan_classes in the released BOINC server code. In the raw BOINC code, the cpu_usage value is derived from an obscure (and, in all honesty, irrelevant and meaningless) calculation of the ratio of flops that will be performed on the CPU and on the GPU - the GPU, in particular, being assumed to be processing at an arbitrary fraction of its theoretical peak speed. In short, it's useless. I don't think the raw BOINC code expects you to make manual alterations to the calculated value. If you've found a way of over-riding and fixing it - great. More power to your elbow.

The current issue arises because the Python app is neither a pure GPU app nor a pure multi-threaded CPU app. It operates in both modes - and the BOINC developers didn't think of that. I think you need to create a special, new plan_class name for this application, and experiment on that. Don't meddle with the existing plan_classes - that will mess up the other GPUGrid lines of research.

I'm running with a manual override which devotes the whole GPU, plus 3 CPUs, to the Python tasks. That seems to work reasonably well: it keeps enough work from other BOINC projects off the CPU while Python is running.

---
Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0

> Regarding cpu_usage, I remember having this discussion with Toni and I think the reason why we set the number of cores to that number is because with a single core the jobs can actually be executed. Even if they create 32 threads. Definitely do not require 32 cores. Is there an advantage of setting it to an arbitrary number higher than 1? Couldn't that cause some allocation problems? Sorry, it is a bit outside of my knowledge zone...

Could you tell us a bit more about this manual override? Just now it is sprawled over five cores, ten threads. If it sees the sixth core free, it grabs that one also.

---
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 662

If you run other projects concurrently, then it is advisable to limit the number of cores the Python tasks occupy for scheduling. I am not talking about the number of threads each task uses, since that is fixed.

Just create an app_config.xml file, place it in the GPUGrid project directory, and either re-read config files from the Manager or restart BOINC. The file minimally just needs this:

```xml
<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>3.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

This tells the BOINC client not to overcommit other projects' CPU usage, as the Python app gets 3 cores reserved for its use. I have found that to be plenty, even when running 95% of all CPU cores on 3 other CPU projects along with 2 other GPU projects, which also use some or all of a CPU core to process their GPU tasks.

---
Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0

> If you run other projects concurrently, then it is advisable to limit the number of cores the Python tasks occupy for scheduling. I am not talking about the number of threads each task uses, since that is fixed.

Thank you Keith. Why is it using so many cores, plus is it something like OpenIFS on CPDN?

---
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 318

> Thank you Keith. Why is it using so many cores, plus is it something like OpenIFS on CPDN?

Yes - or nbody at MilkyWay. This Python task shares characteristics of a cuda (GPU) plan class and an MT (multithreaded) plan class, and works best if treated as such.

---
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 318

Possible bad workunit: 27278732

```
ValueError: Expected value argument (Tensor of shape (1024,)) to be within the support (IntegerInterval(lower_bound=0, upper_bound=17)) of the distribution Categorical(logits: torch.Size([1024, 18])), but found invalid values:
```

---
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0

Interesting, I had never seen this error before, thank you!

---
Joined: 11 Dec 08 · Posts: 26 · Credit: 648,944,294 · RAC: 434

Thanks Richard. Is 3 CPU cores enough to not slow down the GPU?

---
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 4,772

I'm noticing an interesting difference in application behavior between different systems. abouh, can you help explain the reason?

I can see that each running task spawns 32x processes (multiprocessing.spawn) as well as [number of cores]x processes for the main run.py application. So on my 8-core/16-thread Intel system, a single running task spawns 8x run.py processes and 32x multiprocessing.spawn threads; on my 24-core/48-thread AMD EPYC system, a single running task spawns 24x run.py processes and 32x multiprocessing.spawn threads.

What is confusing is the utilization of each thread between these systems. The EPYC system uses ~600-800% CPU for the run.py processes (~20-40% each thread), whereas the Intel system uses ~120% CPU for the run.py processes (~2-5% each thread). I replicated the same high CPU use on another EPYC system (in a VM) where I've constrained it to the same 8 cores/16 threads, and again it uses a much larger share of the CPU than the Intel system.

Is the application coded in some way that forces more work to be done on more modern processors? As far as I can tell, the increased CPU use isn't making the overall task run any faster; the Intel system is just as productive with far less CPU use. I was trying to run some Python tasks on my Plex VM to let it use the GPU, since Plex doesn't use it very much, but the CPU use is making that troublesome.
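In case others want to reproduce this kind of per-process measurement, here is a minimal sketch (an illustration added here, assuming the psutil package is available on the host) that samples the CPU share of each run.py worker a task has spawned:

```python
# Minimal sketch, assuming psutil is installed: sample the CPU share of every
# process whose command line mentions run.py (the workers a task spawns).
import time
import psutil

workers = [p for p in psutil.process_iter(["pid", "cmdline"])
           if "run.py" in " ".join(p.info["cmdline"] or [])]

for p in workers:
    p.cpu_percent(None)   # prime the per-process counters
time.sleep(2.0)           # sampling window

for p in workers:
    print(f"PID {p.pid:>7}  {p.cpu_percent(None):6.1f}% CPU")
```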

---
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 4,772

Or perhaps the Broadwell-based Intel CPU is able to hardware-accelerate some tasks that the EPYC has to do in software, leading to higher CPU use?

---
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0

The application is not coded in any specific way to force more work to be done on more modern processors. Maybe Python handles it under the hood somehow?

---
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 4,772

> Maybe Python handles it under the hood somehow?

It might be related to PyTorch, actually. I did some more digging and it seems like AMD has worse performance due to some kind of CPU detection issue in the MKL (or maybe it is deliberate on Intel's part). Do you know what version of MKL your package uses? And are you able to set specific env variables in your package?

If your MKL is version <=2020.0, setting MKL_DEBUG_CPU_TYPE=5 might help this issue on AMD CPUs. But it looks like this will not be effective if you are on a newer version of the MKL, as Intel has since removed this variable.
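For context, a minimal sketch (an illustration, not the project's actual code) of how this override is typically applied: MKL reads the variable only once, when the library is first loaded, so it has to be set before the first import that pulls MKL in.

```python
# Minimal sketch, not GPUGrid code: the override must be in the environment
# before the first numpy/torch import, because MKL reads MKL_DEBUG_CPU_TYPE
# only when it is initialised. Honored only by MKL <= 2020.0.
import os
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

import numpy as np  # MKL is loaded here, after the override is already set

# Any MKL-backed math from here on should take the AVX2 paths on AMD CPUs.
a = np.random.rand(1024, 1024)
print(np.dot(a, a).shape)
```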

---
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 4,772

To add: I was able to inspect your MKL version as 2019.0.4, and I tried setting the env variable by adding `os.environ["MKL_DEBUG_CPU_TYPE"] = "5"` to the run.py main program, but it had no effect. Either I didn't put the command in the right place (I inserted it below line 433 in the run.py script), or the issue is something else entirely.

Edit: you also might consider compiling your scripts into binaries to prevent inquisitive minds from messing about in your program ;)
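For anyone who wants to check the same thing on their own host, a small sketch of one way to report the MKL build a Python environment is using (`np.show_config()` is standard NumPy; the `mkl` module is the optional mkl-service package and may not be installed):

```python
# Minimal sketch: report which BLAS/MKL build this Python environment uses.
import numpy as np

np.show_config()  # prints the BLAS/LAPACK backend details, including MKL if linked

try:
    import mkl  # optional mkl-service package, present in MKL-enabled conda envs
    print(mkl.get_version_string())
except ImportError:
    print("mkl-service not installed; see np.show_config() output above")
```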

---
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 662

Should the environment variable for fixing AMD computation in the MKL library be in the task package or just in the host environment? Or both? I would have thought the latter, as the system calls the MKL library makes eventually have to be passed through to the CPU.

Add `export MKL_DEBUG_CPU_TYPE=5` to your .bashrc script. So you need to set up the OS environment variable first, then pass it through to the Python code with `os.environ["MKL_DEBUG_CPU_TYPE"]`.

Of course, if the embedded MKL package is the later version where the variable is now ignored, it is a moot point to use the variable to fix the intentional hamstringing of AMD processors.

[Edit] Looks like there is a workaround for the Intel MKL check of whether it is running on an Intel processor: https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html

So make the fake shared library and use LD_PRELOAD= to load it. That might be the easiest method to get the math libraries to use the advanced SIMD instructions like AVX2.

---
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 4,772

I didn’t explicitly state it in my previous reply. But I tried all that already and it didn’t make any difference. I even ran run.py standalone outside of BOINC to be sure that the env variable was set. Neither the env variable being set nor the fake Intel library made any difference at all. But the embedded MKL version is actually an old one. It’s from 2019 as I mentioned before. So it should accept the debug variable. I just think now that it’s probably not the reason.

---
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 662

Ohh . . . . OK. Didn't know you had tried all the previous existing fixes. So must be something else going on in the code I guess. Just thought I would throw it out there in case you hadn't seen the other fixes. |

---
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0

I could definitely set the env variable depending on package version in my scripts if that made AI agents train faster. No need to create binaries. I am fine with any user who feels like tinkering with the code; it always provides useful information. :)

---
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 662

I don't know whether the math functions being used by the Python libraries go any higher than SSE2 or not. But if they do, the MKL functions default to SSE2 whenever the MKL library is called and detects any non-Intel CPU. Probably the only way to know for sure is to examine the code and see whether it tries to run any SIMD instructions higher than SSE2, then implement the fix and see if the computations on the CPU are sped up. Depending on the math function being called, the speedup with the fix in place can be orders of magnitude.

Based on Ian's experiment running on his Intel host, the lower CPU usage didn't make the tasks run any faster. But less CPU usage per task (when the tasks run the same at either high or low CPU usage) would still be beneficial when also running other CPU tasks, since the Python tasks then aren't taking resources away from those processes.
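One hedged way to act on that suggestion is a standalone micro-benchmark (separate from the GPUGrid scripts): time an MKL-heavy operation once with the override active and once without, and compare the two runs on the AMD host.

```python
# Standalone benchmark sketch, not part of the GPUGrid package. Comment the
# override line out for the baseline run, leave it in for the second run, and
# compare the timings; a large gap on an AMD CPU points at the SSE2 fallback.
import os
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"  # must be set before numpy loads MKL

import time
import numpy as np

a = np.random.rand(4096, 4096)
b = np.random.rand(4096, 4096)

start = time.perf_counter()
np.dot(a, b)
print(f"4096x4096 matmul took {time.perf_counter() - start:.2f} s")
```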

---
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 4,772

> I could definitely set the env variable depending on package version in my scripts if that made AI agents train faster.

Was my location for the variable in the script right or appropriate? I inserted it below line 433. Does the script inherit the OS variables already? Just wanted to make sure I had it set properly. I figured the script runs in its own environment outside of BOINC (in Python); that's why I tried adding it to the script.

---
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 4,772

It's hard to say whether it's faster or not, since it's not a true apples-to-apples comparison. So far it doesn't feel faster, but that's comparing different CPUs and different GPUs. Maybe my EPYC system seems similarly fast because the EPYC is just brute-forcing it; it has much higher IPC than the old Broadwell-based Intel.