Experimental Python tasks (beta)

Erich56 - Message 59418 - Posted: 10 Oct 2022, 11:24:20 UTC - in response to Message 59409.

> Ian&Steve C. wrote: even if you solve the problem, you won't get more tasks until you change the GPUGRID task to use 0.5 GPU for 2x.

As said before, I had made this change in the app_config.xml. After a few days of running other projects on this host, I tried GPUGRID again. I got 2 tasks downloaded (although I would have expected 4, since I had tweaked the coproc_info.xml to show 2 GPUs, so obviously this tweak has no effect, for whatever reason).

Then, the next disappointment: although 2 Pythons were downloaded, only one started; the other one stayed in "ready to start" status. A look at the status line of the inactive task revealed why: it says "0.988 CPUs + 1 NVIDIA GPU", although in the app_config.xml I have set "<gpu_usage>0.5</gpu_usage>". In fact, I am using exactly the same app_config.xml on another host (with fewer hardware resources), and there it works: 2 Pythons are crunched simultaneously, and the status line of each task says "0.988 CPUs + 0.5 NVIDIA GPUs".

FYI, the complete app_config reads as follows:

    <app_config>
      <app>
        <name>PythonGPU</name>
        <max_concurrent>2</max_concurrent>
        <gpu_versions>
          <gpu_usage>0.5</gpu_usage>
          <cpu_usage>1.0</cpu_usage>
        </gpu_versions>
      </app>
    </app_config>

What could be the reason why neither the above-mentioned entry in the coproc_info.xml nor the "0.5 GPU" entry in the app_config.xml has the expected effect? I have been using these changes to 0.5 GPU (or even 0.33 and 0.25 GPU, when crunching WCG OPNG tasks) in various projects, and it always worked. Why does it not work with GPUGRID on this particular host? This is especially annoying since this host has 2 CPUs and hence would be ideal for crunching 2 Pythons in parallel. Actually, I think that even 3 Pythons would work well (the VRAM of the GPU is 16GB, so no problem from this side).

Can anyone give me hints as to what I could do?

kotenok2000 - Message 59419 - Posted: 10 Oct 2022, 11:28:53 UTC - Last modified: 10 Oct 2022, 11:29:20 UTC

You can reduce the hard drive requirement by 1.93 GB if you remove these files from E:\programdata\BOINC\slots\1\Lib\site-packages\torch\lib once windows_fix.py has finished disabling ASLR and making .nv_fatb sections read-only.

    05.01.2022 10:28    70 403 584 cudnn_ops_train64_8.dll_bak
    05.01.2022 10:23    88 405 504 cudnn_ops_infer64_8.dll_bak
    03.08.2022 04:04     1 329 664 torch_cuda_cpp.dll_bak
    05.01.2022 11:21    81 487 360 cudnn_cnn_train64_8.dll_bak
    05.01.2022 10:36   129 872 896 cudnn_adv_infer64_8.dll_bak
    05.01.2022 10:46    97 293 824 cudnn_adv_train64_8.dll_bak
    03.08.2022 05:05   871 934 464 torch_cuda_cu.dll_bak
    05.01.2022 11:15   736 718 848 cudnn_cnn_infer64_8.dll_bak

Can you distribute these DLLs already patched with the Python environment, or does the NVIDIA license agreement forbid it?
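As a quick sanity check of the 1.93 GB figure, the sizes in the listing can simply be added up. This is only a sketch: the byte counts are copied from the post, and the names `BAK_FILE_SIZES` and `total_gib` are made up for this illustration.

```python
# Sizes in bytes, copied from the directory listing in the post above.
BAK_FILE_SIZES = {
    "cudnn_ops_train64_8.dll_bak": 70_403_584,
    "cudnn_ops_infer64_8.dll_bak": 88_405_504,
    "torch_cuda_cpp.dll_bak": 1_329_664,
    "cudnn_cnn_train64_8.dll_bak": 81_487_360,
    "cudnn_adv_infer64_8.dll_bak": 129_872_896,
    "cudnn_adv_train64_8.dll_bak": 97_293_824,
    "torch_cuda_cu.dll_bak": 871_934_464,
    "cudnn_cnn_infer64_8.dll_bak": 736_718_848,
}


def total_gib(sizes):
    """Return the combined size in GiB (1 GiB = 1024**3 bytes)."""
    return sum(sizes.values()) / 1024**3


total_bytes = sum(BAK_FILE_SIZES.values())
print(f"{total_bytes:,} bytes = {total_gib(BAK_FILE_SIZES):.2f} GiB")
# prints: 2,077,446,144 bytes = 1.93 GiB
```

So the "1.93 GB" in the post matches the listing when a gigabyte is taken as 1024**3 bytes.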

kotenok2000 - Message 59420 - Posted: 10 Oct 2022, 11:50:45 UTC - in response to Message 59386.

> I just discovered the following problem on the PC which consists of:
> 2 CPUs Xeon E5, 8-core / 16-HT each
> 1 GPU Quadro P5000
> 128 GB Ramdisk
> 128 GB system memory
>
> Until a few days ago, I ran 2 Pythons simultaneously (with a setting in the app_config.xml: 0.5 gpu usage). Now, while only 1 Python is running and I push the update button in the BOINC Manager to fetch another Python, the BOINC event log tells me that no Pythons are available. That is not the case, though: the server status page shows some 550 tasks for download; besides, I just downloaded one on another PC.
> BTW: the Python task uses only some 50% of the processor, which seems logical with 2 CPUs inside.
> So I tried to download tasks from other projects, and in all cases the event log says: not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full).
> How can that be the case? In the BOINC computing preferences, I have now set "store at least ... days of work" to 10 days, and "store up to an additional" also to 10 days. However, this did not solve the problem. There is about 94GB free space on the Ramdisk, and some 150GB free system RAM.
> What also catches my eye: the one running Python, which right now shows 45% progress after some 10 hours, shows a remaining runtime of 34 days! Before, like on my other machines, the remaining runtime for Pythons was indicated as 1-2 days. Could this entry be the reason why nothing else can be downloaded and I get the message "job cache full"?
> Can anyone help me to get out of this problem?
>
> Meanwhile, the problem has become even worse: after downloading 1 Python, it starts, and the BOINC Manager shows a remaining runtime of about 60 days (!!!). In reality, the task proceeds at normal speed and will be finished within 24 hours, like all other tasks before on this machine. Hence, nothing else can be downloaded. When trying to download tasks from other projects, it shows: not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full). When I try to download a second Python, it says "no tasks are available for Python apps for GPU hosts", which is not correct; there are some 150 available for download at the moment.
> Can anyone give me advice on how to get this problem solved?

You can add <fraction_done_exact/> to your app_config.xml

Ian&Steve C. - Message 59421 - Posted: 10 Oct 2022, 12:20:24 UTC - in response to Message 59418.

> > Ian&Steve C. wrote: even if you solve the problem, you won't get more tasks until you change the GPUGRID task to use 0.5 GPU for 2x.
>
> As said before, I had made this change in the app_config.xml. After a few days of running other projects on this host, I tried GPUGRID again. I got 2 tasks downloaded (although I would have expected 4, since I had tweaked the coproc_info.xml to show 2 GPUs, so obviously this tweak has no effect, for whatever reason).
>
> Then, the next disappointment: although 2 Pythons were downloaded, only one started; the other one stayed in "ready to start" status. A look at the status line of the inactive task revealed why: it says "0.988 CPUs + 1 NVIDIA GPU", although in the app_config.xml I have set "<gpu_usage>0.5</gpu_usage>". In fact, I am using exactly the same app_config.xml on another host (with fewer hardware resources), and there it works: 2 Pythons are crunched simultaneously, and the status line of each task says "0.988 CPUs + 0.5 NVIDIA GPUs".
>
> FYI, the complete app_config reads as follows:
>
>     <app_config>
>       <app>
>         <name>PythonGPU</name>
>         <max_concurrent>2</max_concurrent>
>         <gpu_versions>
>           <gpu_usage>0.5</gpu_usage>
>           <cpu_usage>1.0</cpu_usage>
>         </gpu_versions>
>       </app>
>     </app_config>
>
> What could be the reason why neither the above-mentioned entry in the coproc_info.xml nor the "0.5 GPU" entry in the app_config.xml has the expected effect? I have been using these changes to 0.5 GPU (or even 0.33 and 0.25 GPU, when crunching WCG OPNG tasks) in various projects, and it always worked. Why does it not work with GPUGRID on this particular host? This is especially annoying since this host has 2 CPUs and hence would be ideal for crunching 2 Pythons in parallel. Actually, I think that even 3 Pythons would work well (the VRAM of the GPU is 16GB, so no problem from this side).
>
> Can anyone give me hints as to what I could do?

several things.

first, after changing gpu_usage to 0.5 in your app_config file, did you restart boinc or click "read config files" in the Options toolbar menu? you need to do this for any changes in your app_config to take effect. also, even if you did click this, tasks downloaded as 1.0 GPU will not change their label to 0.5, but they will be treated as 0.5 internally. to see this reflected in the task labeling you need to restart boinc.

next, this line:

    <max_concurrent>2</max_concurrent>

this will prevent more than 2 tasks from running. even if you download 4, only 2 will run. just letting you know in case this is not what you intended.

Erich56 - Message 59422 - Posted: 10 Oct 2022, 12:46:51 UTC - in response to Message 59421.

> several things.
>
> first, after changing gpu_usage to 0.5 in your app_config file, did you restart boinc or click "read config files" in the Options toolbar menu? you need to do this for any changes in your app_config to take effect. also, even if you did click this, tasks downloaded as 1.0 GPU will not change their label to 0.5, but they will be treated as 0.5 internally. to see this reflected in the task labeling you need to restart boinc.
>
> next, this line:
>
>     <max_concurrent>2</max_concurrent>
>
> this will prevent more than 2 tasks from running. even if you download 4, only 2 will run. just letting you know in case this is not what you intended.

After changing an app_config file, I always click "read config files" in the Options toolbar menu. As said before, I have worked with app_config.xml files very often for several years, so I am sure I am doing it correctly.

I know that tasks downloaded as 1.0 GPU will keep this label. That is not the issue here, though, because I had set the 0.5 GPU even before I started downloading Pythons. Since then, 5 Pythons were downloaded (3 of them finished and uploaded, 1 active, another one waiting to start), and all of them show 1.0 GPU, for unknown reasons.

I know the meaning of <max_concurrent>2</max_concurrent>; thanks for the hint anyway.

So, as said before: it's totally unclear to me why the app_config does not work in this case. I am seeing this problem for the first time in all these years :-(

What I could still try, once the currently running Python is finished, is to restart BOINC. Maybe this helps; however, I doubt it.

Ian&Steve C. - Message 59423 - Posted: 10 Oct 2022, 12:49:28 UTC - in response to Message 59422. Last modified: 10 Oct 2022, 13:01:27 UTC

what does your event log say about your app_config file? maybe you have some whitespace error in it that's causing boinc to not read it properly. when you click read config files, does boinc give any error/warning/complaint about the GPUGRID app_config file?

or check that the file is properly named 'app_config.xml', that there's no typo, and that it is located in your gpugrid project folder

Erich56 - Message 59424 - Posted: 10 Oct 2022, 13:39:51 UTC - in response to Message 59423. Last modified: 10 Oct 2022, 13:40:38 UTC

> what does your event log say about your app_config file? maybe you have some whitespace error in it that's causing boinc to not read it properly. when you click read config files, does boinc give any error/warning/complaint about the GPUGRID app_config file?
>
> or check that the file is properly named 'app_config.xml', that there's no typo, and that it is located in your gpugrid project folder

I have now double- and triple-checked everything you mentioned above. Also, there was no error/warning/complaint after clicking read config files. So this really is a huge conundrum :-(

What I did now was spoof the GPU count info in the coproc_info.xml, which caused a total of 4 Pythons to be downloaded, but only 2 are running (okay, I want to be modest: 2 is better than 1). However, this cannot be the ultimate solution, since the GPU spoofing will have unwanted effects with other GPU projects.

So, the bottom line: I have no idea what else I can do to get this app_config to work the way it's supposed to.

Ian&Steve C. - Message 59425 - Posted: 10 Oct 2022, 13:45:00 UTC - in response to Message 59424.

but what does the event log say? does it claim to find the gpugrid app_config file?

what you're describing sounds like BOINC is not reading the file, which can be because there's an error in the file or because you don't have the file in the right location.

please confirm which directory contains your GPUGRID app_config file, and post the Event Log output after clicking "read config files"

Ian&Steve C. - Message 59426 - Posted: 10 Oct 2022, 13:49:58 UTC - in response to Message 59424. Last modified: 10 Oct 2022, 13:51:50 UTC

> What I did now was spoof the GPU count info in the coproc_info.xml, which caused a total of 4 Pythons to be downloaded, but only 2 are running (okay, I want to be modest: 2 is better than 1). However, this cannot be the ultimate solution, since the GPU spoofing will have unwanted effects with other GPU projects.
>
> So, the bottom line: I have no idea what else I can do to get this app_config to work the way it's supposed to.

this is exactly what I would expect with the config you've described.

2x GPU spoofed = 4 tasks can download. if you have 2 running on a single GPU, then it's properly using 0.5 per GPU. the only way 2x can run on a single GPU is if the value 0.5 is being used. and only 2 are running because of your max_concurrent statement (which you need for the spoofed GPU setup, otherwise it will try to run on the nonexistent second GPU and cause errors).

if you want to run 3x on a single GPU now, leave the GPU spoofing in place, change app_config to max_concurrent of 3, and change gpu_usage to 0.33

unless you know how to edit BOINC code and recompile a custom client, you will need to spoof the GPUs to get more tasks to download, since the project enforces 2x tasks per GPU. there's no other solution.
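The arithmetic in this post can be written down as a small sketch. It is a deliberately simplified model of what the post describes, not BOINC's actual scheduler: the two-tasks-per-GPU download cap is taken from the post, and the function names are invented for this illustration.

```python
import math

# Per-GPU download cap that the project enforces, according to the post above.
TASKS_PER_GPU_LIMIT = 2


def downloadable_tasks(reported_gpus):
    """Tasks the server will send, based on the GPU count the client reports."""
    return TASKS_PER_GPU_LIMIT * reported_gpus


def runnable_tasks(real_gpus, gpu_usage, max_concurrent):
    """Tasks that can run at once on the GPUs that physically exist."""
    # Small epsilon guards against floating-point rounding in 1 / gpu_usage.
    per_gpu = math.floor(1.0 / gpu_usage + 1e-9)
    return min(max_concurrent, per_gpu * real_gpus)


print(downloadable_tasks(reported_gpus=2))                          # 4
print(runnable_tasks(real_gpus=1, gpu_usage=0.5, max_concurrent=2))   # 2
print(runnable_tasks(real_gpus=1, gpu_usage=0.33, max_concurrent=3))  # 3
```

In this model, max_concurrent is what keeps the client from scheduling work on the spoofed GPU that does not exist, which is the reason given in the post for keeping that line.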

Erich56 - Message 59427 - Posted: 10 Oct 2022, 14:06:14 UTC - in response to Message 59425.

> but what does the event log say? does it claim to find the gpugrid app_config file?
>
> what you're describing sounds like BOINC is not reading the file, which can be because there's an error in the file or because you don't have the file in the right location.
>
> please confirm which directory contains your GPUGRID app_config file, and post the Event Log output after clicking "read config files"

Sorry, I had goofed before. The event log does complain, indeed:

    10.10.2022 15:49:42 | GPUGRID | Found app_config.xml
    10.10.2022 15:49:42 | GPUGRID | Missing </app> in app_config.xml

However, this does not make any sense, because </app> is not missing, is it?

    <app_config>
      <app>
        <name>PythonGPU</name>
        <fraction_done_exact>
        <max_concurrent>3</max_concurrent>
        <gpu_versions>
          <gpu_usage>0.5</gpu_usage>
          <cpu_usage>1.0</cpu_usage>
        </gpu_versions>
      </app>
    </app_config>

(I had added the <fraction_done_exact> in the meantime.)

As already said, this is exactly the same app_config which I use on another host, and there it works. I copied it. And yes, the file is contained in the GPUGRID project folder.

Ian&Steve C. - Message 59428 - Posted: 10 Oct 2022, 14:11:54 UTC - in response to Message 59427.

the line <fraction_done_exact> is not right. that's breaking your file.

it needs to be <fraction_done_exact/>. you're missing the '/' before the close of the tag

Erich56 - Message 59429 - Posted: 10 Oct 2022, 14:29:55 UTC - in response to Message 59428.

> the line <fraction_done_exact> is not right. that's breaking your file.
>
> it needs to be <fraction_done_exact/>. you're missing the '/' before the close of the tag

OMG, shame on me :-(

Many thanks for your valuable help. What puzzles me is how this error could arise from copying the file from another host (on which everything works fine). Of course, it would have helped if the entry in the event log had been a little clearer; it was referring to something else. But anyway, the mistake was clearly on my side, and thanks again for your patience :-)

BTW, now 3 Pythons are running concurrently. Still, the load on the Quadro P5000 is moderate, while the load on the 2 Xeon E5 is 100% each. I will have to observe whether it wouldn't make more sense to run only 2 Pythons.

[CSF] Aleksey Belkov - Message 59430 - Posted: 10 Oct 2022, 18:19:58 UTC - in response to Message 59397. Last modified: 10 Oct 2022, 18:23:58 UTC

Good day, abouh

I still see that unpacking is done in 2 steps:

    ".\7za.exe" x pythongpu_windows_x86_64__cuda1131.txz -y
    ".\7za.exe" x pythongpu_windows_x86_64__cuda1131.tar -y

Is there any problem with implementing a pipelined unpacking process?
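To illustrate what single-pass unpacking means here, the sketch below uses Python's standard `tarfile` module in streaming mode, so the intermediate .tar file is never written to disk. This is not how the GPUGRID wrapper works; it is only an illustration of the idea, and `unpack_txz` is a name made up for this example. As far as I know, 7-Zip can achieve something similar by piping one invocation into another with its `-so` and `-si` switches, but that is not tested here.

```python
import tarfile
from pathlib import Path


def unpack_txz(archive, dest):
    """Decompress and extract a .txz archive in one streaming pass."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # "r|xz" reads the xz stream sequentially, without seeking and
    # without creating an intermediate .tar file.
    with tarfile.open(archive, mode="r|xz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: use the safe extraction filter.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return dest
```

A call such as `unpack_txz("pythongpu_windows_x86_64__cuda1131.txz", "slot_dir")` would then replace both 7za steps; the destination directory name is a placeholder.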

Keith Myers - Message 59431 - Posted: 10 Oct 2022, 18:35:03 UTC - in response to Message 59429.

The app_config.xml code you posted is not valid according to the XML validator.

An error has been found!

Errors in the XML document:
10: 3 The element type "fraction_done_exact" must be terminated by the matching end-tag "</fraction_done_exact>".

XML document:
 1 <app_config>
 2 <app>
 3 <name>PythonGPU</name>
 4 <fraction_done_exact>
 5 <max_concurrent>3</max_concurrent>
 6 <gpu_versions>
 7 <gpu_usage>0.5</gpu_usage>
 8 <cpu_usage>1.0</cpu_usage>
 9 </gpu_versions>
10 </ app>
11 </app_config>

You should always check the syntax of your XML files with the validator.
https://www.xmlvalidation.com/index.php
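The same kind of check can also be done offline. The sketch below uses Python's standard `xml.etree.ElementTree` parser to test whether a file's text is well-formed XML; `check_xml` is a name invented for this example. Note the assumption: it only tests XML well-formedness, and the BOINC client uses its own parser, so passing this check does not guarantee that BOINC accepts the file.

```python
import xml.etree.ElementTree as ET

# The app_config.xml as posted in the thread, with the unclosed tag.
BROKEN = """<app_config>
  <app>
    <name>PythonGPU</name>
    <fraction_done_exact>
    <max_concurrent>3</max_concurrent>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
"""

# The fix from the thread: make the tag self-closing.
FIXED = BROKEN.replace("<fraction_done_exact>", "<fraction_done_exact/>")


def check_xml(text):
    """Return 'well-formed' or a message describing the first parse error."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return f"not well-formed: {err}"
    return "well-formed"


print(check_xml(BROKEN))  # reports a mismatched tag near </app>
print(check_xml(FIXED))   # well-formed
```

This also shows why the event log complained about </app>: with <fraction_done_exact> left open, the closing </app> is the first place where a parser notices that the tags do not match.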

Richard Haselgrove - Message 59432 - Posted: 10 Oct 2022, 18:44:15 UTC - in response to Message 59431.

And you shouldn't have a mid-line break, as shown in line 10.

KAMasud - Message 59435 - Posted: 11 Oct 2022, 4:44:15 UTC

We "Boincers" are like cows. If there are no WUs, we move on to greener pastures. Forget about running several WUs on one GPU; give my GPUs something to run.

Erich56 - Message 59436 - Posted: 11 Oct 2022, 5:58:26 UTC - in response to Message 59431.

> You should always check the syntax of your XML files with the validator.
> https://www.xmlvalidation.com/index.php

Thanks, Keith, for the link. To be frank, I didn't know that such a validator exists.

Keith Myers - Message 59437 - Posted: 11 Oct 2022, 6:16:40 UTC - in response to Message 59436.

It's been around and published since the early Seti days, when we all had to do a lot of XML writing for custom app_info's and app_config's.

kotenok2000 - Message 59438 - Posted: 11 Oct 2022, 12:39:14 UTC - in response to Message 59435.

You can run something like this

    cd e:\Program Files\BOINC
    e:
    :loop
    TIMEOUT /T 10
    boinccmd.exe --project https://www.gpugrid.net update
    TIMEOUT /T 120
    goto loop

or write something like that for bash.
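For hosts without cmd.exe, the same polling loop can be sketched in Python. It calls the same `boinccmd --project <URL> update` command as the batch file; the function names, the `rounds` parameter and the injectable `run` and `sleep` arguments are choices made for this illustration, and the sketch assumes `boinccmd` is on the PATH.

```python
import subprocess
import time

PROJECT_URL = "https://www.gpugrid.net"


def update_command(boinccmd="boinccmd"):
    """Build the same project-update command the batch file runs."""
    return [boinccmd, "--project", PROJECT_URL, "update"]


def poll(interval_s=130, rounds=None, run=subprocess.run, sleep=time.sleep):
    """Ask the client to contact the project, wait, and repeat.

    rounds=None loops forever, like the batch file; 130 s matches its
    10 s + 120 s of waiting per cycle.
    """
    done = 0
    while rounds is None or done < rounds:
        run(update_command(), check=False)  # ignore a failed update, retry later
        sleep(interval_s)
        done += 1
    return done


# To run it indefinitely, call poll() from your own script.
```

Passing `run` and `sleep` as parameters is only there so the loop can be tried out without a BOINC client installed.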

Ian&Steve C. - Message 59439 - Posted: 11 Oct 2022, 14:06:07 UTC

hey abouh,

I've noticed some new task names containing 'demos25_2-0-1'. this differs from the majority of the previous tasks, labelled as just 'demos25-0-1'. can you briefly explain what is different about these tasks?

also, over the past few days (and mostly with these _2 tasks) the majority of the tasks have been either "early ending" or pre-coded to run a smaller number of iterations, leading to very short runtimes (on the order of minutes instead of hours).

Thanks :)
