Experimental Python tasks (beta) - task description
| Author | Message |
|---|---|
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Ian&Steve C. wrote: even if you solve the problem, you won't get more tasks until you change the GPUGRID task to use 0.5 GPU for 2x. As said before, I had made this change in the app_config.xml. After a few days of running other projects on this host, I tried GPUGRID again. In the end, I got 2 tasks downloaded, although I would have expected 4, since I had tweaked the coproc_info.xml to show 2 GPUs (so obviously this tweak has no effect, for whatever reason). Then, the next disappointment: although 2 Pythons were downloaded, only one started; the other one stayed in "ready to start" status. A look at the status line of the inactive task revealed why: it says "0.988 CPUs + 1 NVIDIA GPU", although in the app_config.xml I have set "<gpu_usage>0.5</gpu_usage>". In fact, I am using exactly the same app_config.xml on another host (with fewer hardware resources), and there it works: 2 Pythons are crunched simultaneously, and the status line of each task says "0.988 CPUs + 0.5 NVIDIA GPUs". FYI, the complete app_config reads as follows: <app_config> <app> <name>PythonGPU</name> <max_concurrent>2</max_concurrent> <gpu_versions> <gpu_usage>0.5</gpu_usage> <cpu_usage>1.0</cpu_usage> </gpu_versions> </app> </app_config> What could be the reason why neither the above-mentioned entry in the coproc_info.xml nor the "0.5 GPU" entry in the app_config.xml has the expected effect? I have been using these changes to 0.5 GPU (or even 0.33 and 0.25 GPU when crunching WCG OPNG tasks) in various projects, and it always worked. Why does it not work with GPUGRID on this particular host? This is especially annoying since this host has 2 CPUs and hence would be ideal for crunching 2 Pythons in parallel. Actually, I think that even 3 Pythons would work well (the VRAM of the GPU is 16GB, so no problem from this side). Can anyone give me hints as to what I could do? |
|
Joined: 18 Jul 13 Posts: 79 Credit: 210,528,292 RAC: 0
|
You can reduce the hard drive requirement by 1.93 GB if you remove these files from E:\programdata\BOINC\slots\1\Lib\site-packages\torch\lib once windows_fix.py has finished disabling ASLR and making .nv_fatb sections read-only: 05.01.2022 10:28 70 403 584 cudnn_ops_train64_8.dll_bak 05.01.2022 10:23 88 405 504 cudnn_ops_infer64_8.dll_bak 03.08.2022 04:04 1 329 664 torch_cuda_cpp.dll_bak 05.01.2022 11:21 81 487 360 cudnn_cnn_train64_8.dll_bak 05.01.2022 10:36 129 872 896 cudnn_adv_infer64_8.dll_bak 05.01.2022 10:46 97 293 824 cudnn_adv_train64_8.dll_bak 03.08.2022 05:05 871 934 464 torch_cuda_cu.dll_bak 05.01.2022 11:15 736 718 848 cudnn_cnn_infer64_8.dll_bak Can you distribute these DLLs already patched with the Python environment, or does the NVIDIA license agreement forbid it? |
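To see how much space the leftover backups take on a given host before deleting anything, a small read-only sketch like the one below could help. It only lists `*.dll_bak` files and adds up their sizes; the function names are made up for illustration, and the directory you pass in is whatever slot path applies on your machine. `POSTED_SIZES` simply repeats the byte counts from the listing above.

```python
from pathlib import Path

# Byte sizes copied from the directory listing in the post above.
POSTED_SIZES = [
    70_403_584, 88_405_504, 1_329_664, 81_487_360,
    129_872_896, 97_293_824, 871_934_464, 736_718_848,
]


def find_backup_dlls(lib_dir):
    """Return a sorted list of (name, size_in_bytes) for *.dll_bak files in lib_dir.

    Read-only: nothing is deleted or modified.
    """
    lib = Path(lib_dir)
    return sorted(
        (p.name, p.stat().st_size) for p in lib.glob("*.dll_bak") if p.is_file()
    )


def total_gib(sizes):
    """Sum a list of byte sizes and convert to GiB (2**30 bytes)."""
    return sum(sizes) / 2**30


# Example (hypothetical call, path taken from the post):
# entries = find_backup_dlls(r"E:\programdata\BOINC\slots\1\Lib\site-packages\torch\lib")
# print(total_gib([size for _, size in entries]))
```

Summing the sizes from the post this way gives roughly 1.93 GiB, which matches the figure quoted above.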
|
Joined: 18 Jul 13 Posts: 79 Credit: 210,528,292 RAC: 0
|
I just discovered the following problem on the PC which consists of: You can add <fraction_done_exact/> to your app_config.xml |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
Ian&Steve C. wrote: several things. first, after changing gpu_usage to 0.5 in your app_config file, did you restart boinc or click "read config files" in the Options toolbar menu? you need to do this for any changes in your app_config to take effect. also, even if you did click this, tasks downloaded as 1.0 GPU will not change their label to 0.5, but they will be treated as 0.5 internally. to see this reflected in the task labeling you need to restart boinc. next, this line: <max_concurrent>2</max_concurrent> will prevent more than 2 tasks from running. even if you download 4, only 2 will run. just letting you know in case this is not what you intended.
|
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
several things. After changing an app_config file, I always click "read config files" in the Options toolbar menu. As said before, I have worked with app_config.xml files very often for several years, so I am sure I am doing it correctly. I know that tasks downloaded as 1.0 GPU will keep this label. That is not the question here, though, because I had set the 0.5 GPU even before I started downloading Pythons. Since then, 5 Pythons have been downloaded (3 of them finished and uploaded, 1 active, another one waiting to start), and all of them show 1.0 GPU, for some unknown reason. I know the meaning of <max_concurrent>2</max_concurrent>; thanks for the hint anyway. So, as said before: it's totally unclear to me why the app_config does not work in this case. I am seeing this problem for the first time in all these years :-( What I could still try, once the currently running Python has finished, is to restart BOINC. Maybe this helps; however, I doubt it. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
what does your event log say about your app_config file? maybe you have some whitespace error in it that's causing boinc to not read it properly. when you click read config files, does boinc give any error/warning/complaint about the GPUGRID app_config file? or check that the file is named exactly 'app_config.xml', with no typo, and that it is located in your gpugrid project folder
|
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
what does your event log say about your app_config file? maybe you have some whitespace error in it that's causing boinc to not read it properly. when you click read config files, does boinc give any error/warning/complaint about the GPUGRID app_config file? I have now double- and triple-checked everything you mentioned above. Also, there is no error/warning/complaint after clicking read config files. So this really is a huge conundrum :-( What I did now was to spoof the GPU count info in the coproc_info.xml, which caused a total of 4 Pythons to download, but only 2 are running (okay, I want to be modest: 2 is better than 1). However, this cannot be the ultimate solution, since the GPU spoofing will have unwanted effects with other GPU projects. So, bottom line: no idea what else I can do to get this app_config to work the way it's supposed to. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
but what does the event log say? does it claim to find the gpugrid app_config file? what you're describing sounds like BOINC is not reading the file. which can be because there's an error in the file or because you don't have the file in the right location. please confirm which directory contains your GPUGRID app_config file, and post the Event Log output after clicking "read config files"
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
this is exactly what I would expect with the config you've described. 2x GPU spoofed = 4 tasks can download. if you have 2 running on a single GPU, then it's properly using 0.5 GPU per task. the only way 2x can run on a single GPU is if the value 0.5 is being used. and only 2 are running because of your max_concurrent statement (which you need for the spoofed GPU setup, otherwise it will try to run on the nonexistent second GPU and cause errors). if you want to run 3x on a single GPU now, leave the GPU spoofing in place, change app_config to max_concurrent of 3, and change gpu_usage to 0.33. unless you know how to edit BOINC code and recompile a custom client, you will need to spoof the GPUs to get more tasks to download, since the project enforces 2x tasks per GPU. there's no other solution.
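The arithmetic in this post can be written out as a rough sketch. This is only a simplified model of what is described in the thread (2 tasks handed out per reported GPU, and each running task reserving `gpu_usage` of one real GPU); it is not BOINC's actual scheduler logic, and the function names are invented for illustration.

```python
import math

# Per the post above: the project hands out at most 2 tasks per reported GPU.
TASKS_PER_REPORTED_GPU = 2


def max_downloadable(reported_gpus):
    """Tasks the project would send, given the GPU count the client reports."""
    return reported_gpus * TASKS_PER_REPORTED_GPU


def max_running(real_gpus, gpu_usage, max_concurrent):
    """Tasks that can run at once under this simplified model.

    Each task reserves gpu_usage of a GPU, so one GPU fits floor(1 / gpu_usage)
    tasks; max_concurrent caps the total. The small epsilon guards against
    floating-point results just below a whole number.
    """
    per_gpu = math.floor(1.0 / gpu_usage + 1e-9)
    return min(max_concurrent, real_gpus * per_gpu)


# Two spoofed GPUs, one real GPU, gpu_usage 0.5, max_concurrent 2:
# max_downloadable(2) gives 4 and max_running(1, 0.5, 2) gives 2.
# For 3x on one GPU: max_running(1, 0.33, 3) gives 3.
```

Under this model the max_concurrent cap is what keeps the client from scheduling work onto the spoofed, nonexistent second GPU.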
|
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
but what does the event log say? does it claim to find the gpugrid app_config file? what you're describing sounds like BOINC is not reading the file. which can be because there's an error in the file or because you don't have the file in the right location. Sorry, I had goofed before. The event log does complain, indeed: 10.10.2022 15:49:42 | GPUGRID | Found app_config.xml 10.10.2022 15:49:42 | GPUGRID | Missing </app> in app_config.xml However, this does not make any sense, because </app> is not missing, is it? <app_config> <app> <name>PythonGPU</name> <fraction_done_exact> <max_concurrent>3</max_concurrent> <gpu_versions> <gpu_usage>0.5</gpu_usage> <cpu_usage>1.0</cpu_usage> </gpu_versions> </app> </app_config> (I had added the <fraction_done_exact> in the meantime.) As already said, this is exactly the same app_config which I use on another host, and there it works. I copied it. And yes, the file is in the GPUGRID project folder. |
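Since the complaint was easy to overlook in the event log, here is a minimal sketch for filtering copied log lines down to the ones that mention app_config.xml. It assumes the "timestamp | project | message" layout seen in the two lines quoted above; the function name is made up for illustration.

```python
# The two event-log lines quoted in the post above.
SAMPLE_LOG = [
    "10.10.2022 15:49:42 | GPUGRID | Found app_config.xml",
    "10.10.2022 15:49:42 | GPUGRID | Missing </app> in app_config.xml",
]


def app_config_messages(log_lines, project="GPUGRID"):
    """Return (timestamp, message) pairs for lines about app_config.xml.

    Assumes each line looks like 'timestamp | project | message'; lines that
    do not split into three fields are skipped.
    """
    hits = []
    for line in log_lines:
        parts = [part.strip() for part in line.split("|", 2)]
        if len(parts) != 3:
            continue
        timestamp, line_project, message = parts
        if line_project == project and "app_config.xml" in message:
            hits.append((timestamp, message))
    return hits


for timestamp, message in app_config_messages(SAMPLE_LOG):
    print(timestamp, message)
```

Anything other than a plain "Found app_config.xml" line for the project is worth a closer look.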
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
the line <fraction_done_exact> is not right. that's breaking your file. it needs to be <fraction_done_exact/>. you're missing the '/' before the close of the tag
|
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
the line <fraction_done_exact> is not right. that's breaking your file. OMG, shame on me :-( Many thanks for your valuable help. What puzzles me is how this error could happen when copying the file from another host (on which everything works fine). Of course, it would have helped if the entry in the event log had been a little clearer; it was referring to something else. But anyway, the mistake was clearly on my side, and thanks again for your patience :-) BTW, now 3 Pythons are running concurrently. Still, the load on the Quadro P5000 is moderate, while the load on the 2 Xeon E5 is 100% each. I will have to observe whether it wouldn't make more sense to run only 2 Pythons. |
|
Joined: 26 Dec 13 Posts: 86 Credit: 1,292,358,731 RAC: 0
|
Good day, abouh. I still see that unpacking is done in 2 steps: ".\7za.exe" x pythongpu_windows_x86_64__cuda1131.txz -y ".\7za.exe" x pythongpu_windows_x86_64__cuda1131.tar -y Is there any problem with implementing a pipelined unpacking process? |
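To illustrate what a pipelined (single-pass) unpack means, here is a minimal sketch using Python's standard `tarfile` module in stream mode. It is not how the GPUGRID app unpacks its environment (the app calls 7za, as shown above); it only shows the idea of decompressing the xz layer and extracting the tar entries in one go, without writing the intermediate .tar to disk. The function name is made up, and the archive is assumed to be trusted.

```python
import tarfile


def extract_txz_single_pass(archive_path, dest_dir):
    """Extract a .txz (tar + xz) archive in one streaming pass.

    Mode "r|xz" reads the archive as a forward-only stream, so xz
    decompression and tar extraction happen together and no intermediate
    .tar file is created. Only use this on archives you trust.
    """
    with tarfile.open(archive_path, mode="r|xz") as tar:
        tar.extractall(path=dest_dir)


# Example (file name taken from the post, destination is hypothetical):
# extract_txz_single_pass("pythongpu_windows_x86_64__cuda1131.txz", "slot_dir")
```

The saving is mainly disk space and I/O: the uncompressed .tar never has to exist next to the .txz and the extracted files.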
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 662
|
The app_config.xml code you posted is not valid, according to the XML validator. An error has been found! Errors in the XML document: 10: 3 The element type "fraction_done_exact" must be terminated by the matching end-tag "</fraction_done_exact>". XML document: 1 <app_config> 2 <app> 3 <name>PythonGPU</name> 4 <fraction_done_exact> 5 <max_concurrent>3</max_concurrent> 6 <gpu_versions> 7 <gpu_usage>0.5</gpu_usage> 8 <cpu_usage>1.0</cpu_usage> 9 </gpu_versions> 10 </ app> 11 </app_config> You should always check your syntax of your XML files at the validator. https://www.xmlvalidation.com/index.php |
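If you would rather check a file locally than paste it into a website, a quick well-formedness check can be sketched with Python's standard library. This only tests that the XML is well-formed; it says nothing about whether BOINC accepts the individual settings. `BROKEN` and `FIXED` reproduce the config from this thread with and without the unclosed tag, and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The config from the thread, with the unclosed <fraction_done_exact> tag.
BROKEN = """<app_config>
  <app>
    <name>PythonGPU</name>
    <fraction_done_exact>
    <max_concurrent>3</max_concurrent>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>"""

# Same config with the self-closing form suggested earlier in the thread.
FIXED = BROKEN.replace("<fraction_done_exact>", "<fraction_done_exact/>")


def check_well_formed(xml_text):
    """Return None if xml_text parses, else a (line, column, message) tuple."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


print("broken:", check_well_formed(BROKEN))
print("fixed: ", check_well_formed(FIXED))
```

As with the online validator, the parser only notices the problem when it reaches the closing </app> tag, which is why the reported position points there rather than at the line with the actual mistake.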
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 318
|
And you shouldn't have a mid-line break, as shown in line 10. |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
We "Boincers" are like cows. If there are no WUs, we move on to greener pastures. Forget about running several WUs on one GPU; give my GPUs something to run. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
You should always check your syntax of your XML files at the validator. Thanks, Keith, for the link. To be frank, I didn't know that such a validator existed. |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 662
|
Been around and published since early Seti days when we all had to do a lot of XML writing for custom app_info's and app_config's |
|
Joined: 18 Jul 13 Posts: 79 Credit: 210,528,292 RAC: 0
|
You can run something like this:

    cd e:\Program Files\BOINC
    e:
    :loop
    TIMEOUT /T 10
    boinccmd.exe --project https://www.gpugrid.net update
    TIMEOUT /T 120
    goto loop

or write something like that for bash. |
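For anyone not on Windows, the same polling loop could be sketched in Python instead of batch or bash. It calls `boinccmd --project <URL> update` with the same 10 s and 120 s pauses as the script above. The function name and the `cycles`, `run` and `sleep` parameters are additions for illustration (they make the loop stoppable and testable); it assumes `boinccmd` is on the PATH or that you pass its full path.

```python
import subprocess
import time


def update_loop(project_url, boinccmd="boinccmd", pause_before=10,
                pause_after=120, cycles=None,
                run=subprocess.run, sleep=time.sleep):
    """Repeatedly ask the BOINC client to contact the project.

    cycles=None loops forever, like the batch script; an integer stops after
    that many update requests. Returns the number of requests made.
    """
    done = 0
    while cycles is None or done < cycles:
        sleep(pause_before)
        # Equivalent to: boinccmd --project <project_url> update
        run([boinccmd, "--project", project_url, "update"], check=False)
        sleep(pause_after)
        done += 1
    return done


# Example (runs until interrupted with Ctrl+C):
# update_loop("https://www.gpugrid.net")
```

Keep the pauses generous; hammering the project server with update requests does not make tasks appear any faster.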
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
hey abouh, I've noticed some new task names containing 'demos25_2-0-1'. this differs from the majority of the previous tasks, which were labelled as just 'demos25-0-1'. can you briefly explain what is different about these tasks? also, over the past few days (and mostly with these _2 tasks) the majority of the tasks have been either "early ending" or pre-coded to run a smaller number of iterations, leading to very short runtimes (on the order of minutes instead of hours). Thanks :)
|
©2025 Universitat Pompeu Fabra