Experimental Python tasks (beta)

Author	Message
Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 9,834	Message 59238 - Posted: 12 Sep 2022, 14:58:40 UTC - in response to Message 59237. Last modified: 12 Sep 2022, 15:33:37 UTC
Not sure if it would have made a difference, but I would have placed your code before line 433, only after importing os and sys
"""
if __name__ == "__main__":
    import sys
    sys.stderr.write("Starting!!\n")
    import os
    os.environ["MKL_DEBUG_CPU_TYPE"] = "5"
    import platform
"""
thanks :) I'll try anyway
edit - nope, no different.
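The placement question above comes down to ordering: an environment variable can only influence a library if it is set before that library is loaded. A minimal sketch of that ordering check follows. The helper name `set_mkl_workaround` and the list of module names are hypothetical, and the sketch cannot tell whether the installed MKL build honors `MKL_DEBUG_CPU_TYPE` at all.

```python
import os
import sys


def set_mkl_workaround(env=None):
    """Set MKL_DEBUG_CPU_TYPE and report MKL-backed modules that were imported too early.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    The module names checked here are an assumption for illustration.
    """
    env = os.environ if env is None else env
    # Anything already present in sys.modules was imported before the variable was set.
    already_loaded = [name for name in ("numpy", "torch") if name in sys.modules]
    env["MKL_DEBUG_CPU_TYPE"] = "5"
    return already_loaded


if __name__ == "__main__":
    late = set_mkl_workaround()
    if late:
        sys.stderr.write("Warning: already imported: %s\n" % ", ".join(late))
```

If the returned list is not empty, the variable was set too late to affect those modules in the current process.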

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 9,834	Message 59239 - Posted: 12 Sep 2022, 15:35:31 UTC - in response to Message 59237. Last modified: 12 Sep 2022, 15:37:58 UTC
really unfortunate to use so much more resources on AMD than Intel. It's something about the multithreaded nature of the main run.py process itself. on intel it uses about 2-5% per process, and more run.py processes spin up the more cores you have. with AMD, it uses like 20-40% per process, so with high core count CPUs, that makes total CPU utilization crazy high.
here is what it looks like running 4x python tasks (2 GPUs, 2 tasks each) on an intel 8-core, 16-thread system. what you're seeing is the 4 main run.py processes and their multithreaded components. notice that the total CPU used by each main process is a little more than 100%, this equates to a full thread for each process.
now here is what it looks like running only 2x python tasks (1 GPU, 2 tasks each) on an AMD EPYC system with 24-cores, 48-threads. you can see the main run.py multithread components each using 20-40%, and each main process cumulatively using 600-800% CPU. that's 6-8 whole threads occupied for a single process, making it roughly 6-8x more resource intensive to run on AMD than Intel.
I even swapped my 8c/16t intel CPU for a 16c/32t one, and while it spun up more multithreaded components for the main run.py, each one was still only 2-5% used, making it only about 150% CPU used from each main process.
something definitely weird going on with these tasks between AMD and Intel.
the CPU used by the 32x multiprocessing.spawns is about the same between intel and AMD. it's only the threads that stem from the main run.py process that are showing this huge difference.

Diplomat Joined: 1 Sep 10 Posts: 15 Credit: 894,769,989 RAC: 5,774	Message 59240 - Posted: 12 Sep 2022, 15:57:00 UTC - in response to Message 59225.
No. You cannot alter the task configuration. It will always create 32 spawned processes for each task during computation. If the task is interfering with your other cpu tasks then you have a choice, either stop the Python tasks or reduce your other cpu tasks. All you can do for making the Python task run reasonably well is assign 3-5 cpu cores for BOINC scheduling to keep other cpu work off the host. You can do that through an app_config.xml file in the project directory. Like this:
<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>3.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
does it improve GPU utilization? on average I see barely 20% with seldom spikes up to 35%
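For anyone who prefers to generate the file rather than type it, the app_config.xml quoted above can be produced with Python's standard library. This is only a sketch: the function name `build_app_config` is hypothetical, the default values are the ones from the quoted post, and `ET.indent` requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name="PythonGPU", gpu_usage=1.0, cpu_usage=3.0):
    """Return app_config.xml text with the same structure as the example in the thread."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    # gpu_usage is the fraction of a GPU reserved per task (0.5 means two tasks per GPU).
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    # cpu_usage is the number of CPU cores BOINC budgets for each task.
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    ET.indent(root)  # pretty-print; available in Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_app_config())
```

The output would be saved as app_config.xml in the project directory, after which the client has to re-read its config files or be restarted.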

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 9,834	Message 59241 - Posted: 12 Sep 2022, 16:01:23 UTC - in response to Message 59240.
does it improve GPU utilization? on average I see barely 20% with seldom spikes up to 35%
not directly. but if your GPU is being bottlenecked by not enough CPU resources then it could help. the best configuration so far is to not run ANY other CPU or GPU work. run only these tasks, and run 2 at a time to occupy a little more GPU.

gemini8 Joined: 3 Jul 16 Posts: 31 Credit: 2,250,309,169 RAC: 50	Message 59248 - Posted: 13 Sep 2022, 7:48:09 UTC - in response to Message 59241.
Hi everyone.
the best configuration so far is to not run ANY other CPU or GPU work. run only these tasks, and run 2 at a time to occupy a little more GPU.
I'm thinking about putting all other BOINC CPU work into a VM instead of running it directly on the host. You could limit the VM to 90 per cent of the processing power through the VM settings. This would leave the rest for the Python stuff, so on a sixteen-thread CPU it could use 160% of one thread's power or 10% of the CPU. If this wasn't enough, the VM could be adjusted to use only eighty per cent (leaving 320% of one thread's power or 20% of the CPU for the Python work) and so on. Repeat [adjust and try] until the machine runs fine.
Plus, you could run other GPU stuff on your GPU to have it fully utilized, which should prevent high temperature variations, which I see as unnecessary stress for a GPU. MilkyWay has a small VRAM footprint and doesn't use a full GPU, and maybe I'll try WCG OPNG as well.
- - - - - - - - - -
Greetings, Jens
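The percentages in the VM idea above follow from simple arithmetic, sketched here in Python. The function name `leftover_for_python` is hypothetical, and the sketch assumes the VM cap is a hard limit on total CPU.

```python
def leftover_for_python(threads, vm_cap_percent):
    """Return (percent of one thread, percent of the whole CPU) left outside the VM.

    threads: number of hardware threads on the host
    vm_cap_percent: share of total CPU the VM is allowed to use (0-100)
    """
    cpu_share = 100 - vm_cap_percent        # share of the whole CPU left over
    thread_percent = threads * cpu_share    # the same amount in "percent of one thread"
    return thread_percent, cpu_share


if __name__ == "__main__":
    for cap in (90, 80):
        per_thread, whole = leftover_for_python(16, cap)
        print(f"VM capped at {cap}%: {per_thread}% of one thread, {whole}% of the CPU")
```

For the sixteen-thread example this reproduces the 160% and 320% figures from the post.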

Erich56 Joined: 1 Jan 15 Posts: 1168 Credit: 12,317,898,501 RAC: 91,654	Message 59251 - Posted: 13 Sep 2022, 19:24:52 UTC - in response to Message 59248.
... and maybe I'll try WCG OPNG as well.
forget about WCG OPNG for the time being. Most of the time no tasks are available; and if tasks are available for a short period of time, it's extremely hard to get them downloaded. The downloads get stuck most of the time, and only manual intervention helps.

Erich56 Joined: 1 Jan 15 Posts: 1168 Credit: 12,317,898,501 RAC: 91,654	Message 59254 - Posted: 14 Sep 2022, 18:08:39 UTC
Question: can a running Python task be interrupted for the time a Windows Update takes place (with rebooting of the PC), or does this damage the task?

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 42,316	Message 59255 - Posted: 14 Sep 2022, 19:56:05 UTC - in response to Message 59254. Last modified: 14 Sep 2022, 20:18:46 UTC
Question: can a running Python task be interrupted for the time a Windows Update takes place (with rebooting of the PC), or does this damage the task?
Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty. They save checkpoints reliably, and these are replayed to get the task back to the point it had reached before the interruption.
Just be advised that the replay process takes a few minutes after restart. The task will show 2% completion upon restart but will eventually jump back to the progress point it was at and continue calculating until the end. Just be patient and let the task run.
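The checkpoint-and-resume behaviour described above can be illustrated with a generic sketch. This is not the project's actual mechanism: the function name, the JSON file format, and the per-step checkpoint interval are all assumptions made for the example.

```python
import json
import os
import tempfile


def run_with_checkpoints(total_steps, ckpt_path, stop_after=None):
    """Run up to total_steps, resuming from ckpt_path if it exists.

    stop_after simulates an interruption after that many steps in this run.
    Returns the last completed step.
    """
    step = 0
    if os.path.exists(ckpt_path):
        # Resume: pick up from the last step recorded before the interruption.
        with open(ckpt_path) as f:
            step = json.load(f)["step"]
    done_this_run = 0
    while step < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return step  # simulated reboot or pause
        step += 1
        done_this_run += 1
        # Write to a temporary file and rename it, so a crash never leaves a half-written checkpoint.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"step": step}, f)
        os.replace(tmp_path, ckpt_path)
    return step


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "checkpoint.json")
        print(run_with_checkpoints(10, path, stop_after=4))  # interrupted run
        print(run_with_checkpoints(10, path))                # resumed run
```

The second call starts from the saved step instead of zero, which corresponds to the progress figure jumping back to its earlier value after a restart.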

Jim1348 Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0	Message 59259 - Posted: 15 Sep 2022, 11:42:38 UTC - in response to Message 59255.
Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.
I have a problem that they fail on reboot however. Is that common?
http://www.gpugrid.net/results.php?hostid=583702
That is only on Windows though. I have not seen it yet on Linux, but I don't reboot often there.

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 42,316	Message 59260 - Posted: 15 Sep 2022, 15:59:48 UTC - in response to Message 59259.
Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.
I have a problem that they fail on reboot however. Is that common?
http://www.gpugrid.net/results.php?hostid=583702
That is only on Windows though. I have not seen it yet on Linux, but I don't reboot often there.
Guess it must be only on Windows. No problem restarting a task after a reboot on Ubuntu.

abouh Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0	Message 59261 - Posted: 16 Sep 2022, 8:03:30 UTC - in response to Message 59259. Last modified: 16 Sep 2022, 8:09:53 UTC
The restart is supposed to work fine on Windows as well. Could you provide more information about when this error happens please? Does it happen systematically every time you interrupt and try to resume a task?
Is there anyone for whom the Windows checkpointing works fine? I tested locally and it worked.

Jim1348 Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0	Message 59262 - Posted: 16 Sep 2022, 15:48:36 UTC - in response to Message 59261. Last modified: 16 Sep 2022, 16:36:09 UTC
Could you provide more information about when this error happens please? Does it happen systematically every time you interrupt and try to resume a task?
I can pause and restart them with no problem. The error occurred only on reboot. But I think I have found it. I was using a large write cache, PrimoCache, set with an 8 GB cache size and 1 hour latency. By disabling that, I am able to reboot without a problem. So there was probably a delay in flushing the cache on reboot that caused the error.
But I used the write cache to protect my SSD, since I was seeing writes of around 370 GB a day, too much for me. But this time I am seeing only 200 GB/day. That is still a lot, but not fatal for some time. It seems that the work units vary in how much they will write. I will monitor it. I use SsdReady to monitor the writes to disk; the free version is OK.
PS - I can set PrimoCache to only a 1 GB write-cache size with a 5 minute latency, and it reboots without a problem. Whether that is good enough to protect the SSD will have to be determined by monitoring the actual writes to disk. PrimoCache gives a measure of that. (SsdReady gives the OS writes, but not the actual writes to disk.)
PPS: I should point out that the reason a write cache can cut down on the writes to disk is the nature of scientific algorithms. Much of the time, they read from a location, do a calculation, and then write back to the same location. The cache can store those repeated writes and send to the disk only the net changes at the end of each flush period. If you have a large enough cache, and set the write-delay to infinite, you essentially have a ramdisk. But the cache can be good enough, with less memory than a ramdisk would require.
(And now it seems that 2 GB and 10 minutes works OK.)
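The effect described in the PPS, where repeated writes to the same location collapse into one disk write per flush, can be shown with a toy model. This is a simplified sketch, not a model of PrimoCache: the function name is hypothetical, the flush interval is counted in application writes rather than minutes, and every block is assumed to be the same size.

```python
def disk_writes(write_log, flush_every):
    """Count block writes that reach the disk behind a write-back cache.

    write_log: sequence of block ids written by the application, in order
    flush_every: number of application writes between cache flushes
    """
    dirty = set()   # blocks changed since the last flush
    flushed = 0
    for count, block in enumerate(write_log, start=1):
        dirty.add(block)            # rewriting a dirty block costs nothing extra
        if count % flush_every == 0:
            flushed += len(dirty)   # each dirty block is written to disk once
            dirty.clear()
    flushed += len(dirty)           # final flush, for example at shutdown
    return flushed


if __name__ == "__main__":
    # 1000 application writes that keep hitting the same 4 blocks.
    log = [0, 1, 2, 3] * 250
    for interval in (1, 100, 1000):
        print(interval, disk_writes(log, interval))
```

With a flush after every write all 1000 writes reach the disk; with longer intervals only the distinct dirty blocks per interval do. The final flush is also the step that has to complete before a reboot, which matches the failure described above.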

Erich56 Joined: 1 Jan 15 Posts: 1168 Credit: 12,317,898,501 RAC: 91,654	Message 59265 - Posted: 18 Sep 2022, 13:41:51 UTC
Question for the experts here:
One of my PCs has 2 RTX3070 inside, Pythons are running quite well. The interesting thing is that VRAM usage of one GPU always is about 3.7GB, usage of the other always is about 4.3GB. So with one of the GPUs I could (try to) process 2 Pythons simultaneously, with the other not (VRAM of the RTX3070 is 8GB).
Is it possible to arrange for such a setting via app_config.xml?
BTW, I know what the app_config.xml looks like for running 2 Pythons on both GPUs (<gpu_usage>0.5</gpu_usage>), but I have no idea how to configure the xml according to my wishes as outlined above. Can anyone help?

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 9,834	Message 59266 - Posted: 18 Sep 2022, 13:54:46 UTC - in response to Message 59265. Last modified: 18 Sep 2022, 14:51:19 UTC
Sorry. There is no way to configure an app_config to differentiate between devices. You can only have different settings for different applications.
The only option, which you might not want to do, is to run two different BOINC clients on the same system. To the project this will look like two different computers each having one GPU. Then you could configure one to run 2x and the other to run 1x.
But the amount of VRAM used by the Python app is likely the same between your cards. But the first GPU will always have more VRAM used because it’s running your display. a second task won't use 4.3GB again. most likely only another +3.6GB

Erich56 Joined: 1 Jan 15 Posts: 1168 Credit: 12,317,898,501 RAC: 91,654	Message 59267 - Posted: 18 Sep 2022, 14:53:37 UTC - in response to Message 59266.
Sorry. There is no way to configure an app_config to differentiate between devices. You can only have different settings for different applications. The only option, which you might not want to do, is to run two different BOINC clients on the same system, to the project this will look like two different computers each having one GPU. Then you could configure one to run 2x and the other to run 1x. But the amount of VRAM used by the Python app is likely the same between your cards. But the first GPU will always have more vram used because it’s running your display.
In fact, I have 2 BOINC clients on this PC; I had to establish the second one with the BOINC DataDir on the SSD, since the first one is on the 32GB Ramdisk, which would not allow Python tasks to download ("not enough disk space"). However, next week I will double the RAM on this PC, from 64 to 128GB, and then I will increase the Ramdisk size to at least 64GB; this should make it possible to download Python - at least that's what I hope.
So then I could run 1 Python on each of the 2 GPUs on the SSD client, and a third Python on the Ramdisk client. The only two questions now are: how do I tell the Ramdisk client to run only 1 Python (although 2 GPUs are available)? And how do I tell the Ramdisk client to choose the GPU with the lower amount of VRAM usage (i.e. the one that's NOT running the display)?
In fact, I would prefer to run 2 Pythons on the Ramdisk client and 1 Python on the SSD client; however, the question is whether I could download 2 Pythons on the 64GB Ramdisk - the only thing I could do is to try.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 9,834	Message 59268 - Posted: 18 Sep 2022, 15:22:30 UTC - in response to Message 59267. Last modified: 18 Sep 2022, 16:18:36 UTC
please read the BOINC documentation for client configuration. all of the options and what they do are in there.
https://boinc.berkeley.edu/wiki/Client_configuration
you will need to change several things to run multiple clients at the same time. you need to start them on different ports, as well as add several things to cc_config.
you will also need to exclude the GPU you don't want to use from each client. either use the <exclude_gpu> section (where BOINC can see the device but won't use it for a given project) or use the <ignore_nvidia_dev> tag (where BOINC won't see this device at all for any project)
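As a rough illustration of the two GPU-exclusion options named above, the following sketch builds a cc_config.xml with Python's standard library. The element names (`exclude_gpu`, `ignore_nvidia_dev`, `allow_multiple_clients`) are taken from the BOINC client configuration page linked in the post; the function name, the project URL, and the device numbers are assumptions and must be replaced with the values your own client reports. `ET.indent` requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_cc_config(project_url, excluded_device=None, ignored_device=None):
    """Return cc_config.xml text for one of several BOINC clients on the same host.

    excluded_device: GPU number hidden from this project only (<exclude_gpu>)
    ignored_device: GPU number hidden from the client entirely (<ignore_nvidia_dev>)
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    # Needed when more than one client runs on the same machine.
    ET.SubElement(options, "allow_multiple_clients").text = "1"
    if excluded_device is not None:
        exclude = ET.SubElement(options, "exclude_gpu")
        ET.SubElement(exclude, "url").text = project_url
        ET.SubElement(exclude, "device_num").text = str(excluded_device)
    if ignored_device is not None:
        ET.SubElement(options, "ignore_nvidia_dev").text = str(ignored_device)
    ET.indent(root)  # pretty-print; available in Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical example: this client should not use GPU 0 for the project.
    print(build_cc_config("http://www.gpugrid.net/", excluded_device=0))
```

Each client would get its own file in its own data directory. Starting the clients on different ports is a separate step that this sketch does not cover.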

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 9,834	Message 59269 - Posted: 18 Sep 2022, 16:26:35 UTC - in response to Message 59267. Last modified: 18 Sep 2022, 16:30:24 UTC
personally I would stop running the ram disk. it's just extra complication and eats up ram space that the Python tasks crave. your biggest benefit will be moving to linux, it's easily 2x faster, maybe more.
I don't know how you have your systems set up, but i see your longest runtimes on your 3070 are like 24hrs. that's crazy long. are you not leaving enough CPU available? are you running other CPU work at the same time?
for comparison, I built a Linux machine dedicated to these tasks. 2x RTX 3060 and a 24-core EPYC CPU and 128GB system ram. I am not running any other work on it, only PythonGPU, to give these tasks the optimum conditions to run as fast as possible. with 12GB of VRAM, i can run 3x per GPU and it completes tasks in about 13hrs at the longest, for an effective longest completion time of about 1 task every 4.3hrs per GPU, which means at minimum, this system with 2x GPUs (6x tasks running) completes about 11 tasks per day (1,155,000 cred) + the bonus of some tasks completing earlier. you can see that my 3060 in this system is 6x more productive than your 3070. that's an insane difference.
doing this uses about 80-90% of the CPU, and ~56GB of system ram. I have enough spare VRAM to add another GPU, but maybe not enough CPU power to support more than 1 more task. if I want another GPU i will probably need a more powerful (more cores) CPU.
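The throughput figures in the post above can be checked with a few lines of arithmetic. In this sketch the function name is hypothetical, and the credit per task (105,000) is not stated in the thread; it is inferred from 1,155,000 credits for 11 tasks.

```python
def tasks_per_day(gpus, tasks_per_gpu, hours_per_task):
    """Tasks completed per day when every slot finishes one task in hours_per_task."""
    return gpus * tasks_per_gpu * 24 / hours_per_task


if __name__ == "__main__":
    linux_box = tasks_per_day(gpus=2, tasks_per_gpu=3, hours_per_task=13)
    print(f"2x RTX 3060, 3 tasks each, 13 h: {linux_box:.2f} tasks/day")
    print(f"effective interval per GPU: {13 / 3:.1f} h per task")

    # Inferred, not stated in the thread: 1,155,000 credits / 11 tasks.
    credit_per_task = 105_000
    print(f"credit/day at minimum: {int(linux_box) * credit_per_task:,}")

    # Per-GPU comparison with one task per GPU taking 24 h.
    per_3060 = tasks_per_day(1, 3, 13)
    per_3070 = tasks_per_day(1, 1, 24)
    print(f"per-GPU ratio: {per_3060 / per_3070:.1f}x")
```

The per-GPU ratio comes out at about 5.5, which is the basis for the rough "6x" figure.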

Erich56 Joined: 1 Jan 15 Posts: 1168 Credit: 12,317,898,501 RAC: 91,654	Message 59270 - Posted: 18 Sep 2022, 17:01:28 UTC - in response to Message 59268.
... either use the <exclude_gpu> section (where BOINC can see the device but won't use it for a given project) or use the <ignore_nvidia_dev> tag (where BOINC won't see this device at all for any project)
thanks very much for your hints :-)
One other thing that I now noticed when reading the stderr of the 3 Pythons that failed a short time after start:
"RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes"
So the reason why the tasks crashed after a few seconds was not insufficient VRAM (this would probably have come up a little later), but the lack of system RAM. In fact, I remember that right after the start of the 4 Pythons, the Meminfo tool showed a rapid decrease of free system RAM, and shortly thereafter the free RAM was going up again (i.e. after 3 tasks had crashed, thus releasing memory).
Any idea how much system RAM, roughly, a Python task takes?

Erich56 Joined: 1 Jan 15 Posts: 1168 Credit: 12,317,898,501 RAC: 91,654	Message 59271 - Posted: 18 Sep 2022, 17:23:23 UTC - in response to Message 59270.
One other thing that I now noticed when reading the stderr of the 3 Pythons that failed a short time after start:
"RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes"
So the reason why the tasks crashed after a few seconds was not insufficient VRAM (this would probably have come up a little later), but the lack of system RAM. In fact, I remember that right after the start of the 4 Pythons, the Meminfo tool showed a rapid decrease of free system RAM, and shortly thereafter the free RAM was going up again (i.e. after 3 tasks had crashed, thus releasing memory).
Any idea how much system RAM, roughly, a Python task takes?
From what I can see in the Windows Task Manager on this PC and on others running Python tasks, RAM usage of a Python can be from about 1GB to 6GB (!) How come it varies that much?

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 9,834	Message 59272 - Posted: 18 Sep 2022, 17:34:44 UTC - in response to Message 59271. Last modified: 18 Sep 2022, 17:35:32 UTC
you should figure 7-8GB per python task. that's what it seems to use on my linux system. i would imagine it uses a little when the task starts up, then slowly increases once it gets to running full out. that might be the reason for the variance of 1GB in the beginning, and 6+GB by the time it gets to running the main program.
these tasks work in 3 phases from what i've seen
Phase 1: extraction phase. just extracting the compressed package. usually takes about 5 minutes, depending on CPU speed. uses only a single core.
Phase 2: pre-processing and/or pre-loading. uses a large % of CPU power, GPU gets intermittently used, and VRAM preloads to about 60% of what will be eventually used. (in my case, VRAM preloads about 2100MB). this also lasts about 5 mins.
Phase 3: main program. CPU use drops down, and VRAM use loads up to 100% of what is needed (in my case 3600MB per task).
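The per-task figures quoted above (roughly 3600MB of VRAM and 7-8GB of system RAM) can be turned into a quick capacity estimate. This is a back-of-the-envelope sketch: the function name and defaults are hypothetical, it treats 12GB as 12288MB, it assumes the display (if any) runs on the first GPU, and it ignores memory needed by the operating system and CPU limits.

```python
def tasks_that_fit(gpu_vram_mb, ram_gb, vram_per_task_mb=3600,
                   ram_per_task_gb=8, display_vram_mb=0):
    """Estimate how many Python tasks a host can hold at once.

    gpu_vram_mb: list with the VRAM of each GPU in MB
    ram_gb: system RAM available to the tasks in GB
    display_vram_mb: VRAM taken by the display on the first GPU
    Returns (total tasks, tasks per GPU).
    """
    per_gpu = []
    for index, vram in enumerate(gpu_vram_mb):
        usable = vram - (display_vram_mb if index == 0 else 0)
        per_gpu.append(usable // vram_per_task_mb)
    by_ram = ram_gb // ram_per_task_gb   # system RAM is shared by all tasks
    return min(sum(per_gpu), by_ram), per_gpu


if __name__ == "__main__":
    # Two 12GB cards and 128GB of RAM, as in the dedicated Linux machine described earlier.
    print(tasks_that_fit([12288, 12288], ram_gb=128))
    # Same cards but only 16GB of RAM: system memory becomes the limit.
    print(tasks_that_fit([12288, 12288], ram_gb=16))
```

The first case gives three tasks per GPU, which agrees with the 3x per GPU setup reported for the 12GB cards; the second shows how system RAM rather than VRAM can cap the task count.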
