Experimental Python tasks (beta) - task description

Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 59418 - Posted: 10 Oct 2022, 11:24:20 UTC - in response to Message 59409.  

Ian&Steve C. wrote:
even if you solve the problem, you won't get more tasks until you change the GPUGRID task to use 0.5 GPU for 2x.

As said before, I had already made this change in the app_config.xml.

After a few days of running other projects on this host, I tried GPUGRID again.
In the end, 2 tasks were downloaded, although I would have expected 4, since I had tweaked the coproc_info.xml to show 2 GPUs (so obviously this tweak has no effect, for whatever reason).

Then, the next disappointment:
although 2 Pythons were downloaded, only one started; the other one stayed in "ready to start" status.
A look at the status line of the inactive task revealed why: it says "0.988 CPUs + 1 NVIDIA GPU", although in the app_config.xml I have set "<gpu_usage>0.5</gpu_usage>".

In fact, I am using exactly the same app_config.xml on another host (with fewer hardware resources), and there it works: 2 Pythons are crunched simultaneously, and the status line of each task says "0.988 CPUs + 0.5 NVIDIA GPUs".

FYI, the complete app_config reads as follows:

<app_config>
<app>
<name>PythonGPU</name>
<max_concurrent>2</max_concurrent>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
</app_config>


What could be the reason why neither the above-mentioned entry in the coproc_info.xml nor the "0.5 GPU" entry in the app_config.xml has the expected effect?

I have been using these changes to 0.5 GPU (or even 0.33 and 0.25 GPU when crunching WCG OPNG tasks) in various projects, and it always worked.
Why does it not work with GPUGRID on this particular host?
This is especially annoying since this host has 2 CPUs and hence would be ideal for crunching 2 Pythons in parallel. Actually, I think that even 3 Pythons would work well (the GPU has 16 GB of VRAM, so no problem from this side).

Can anyone give me hints as to what I could do?

ID: 59418
kotenok2000

Joined: 18 Jul 13
Posts: 79
Credit: 210,528,292
RAC: 0
Level
Leu
Message 59419 - Posted: 10 Oct 2022, 11:28:53 UTC
Last modified: 10 Oct 2022, 11:29:20 UTC

You can reduce the hard drive requirement by 1.93 GB if you remove these files from E:\programdata\BOINC\slots\1\Lib\site-packages\torch\lib once windows_fix.py has finished disabling ASLR and making .nv_fatb sections read-only.
05.01.2022 10:28 70 403 584 cudnn_ops_train64_8.dll_bak
05.01.2022 10:23 88 405 504 cudnn_ops_infer64_8.dll_bak
03.08.2022 04:04 1 329 664 torch_cuda_cpp.dll_bak
05.01.2022 11:21 81 487 360 cudnn_cnn_train64_8.dll_bak
05.01.2022 10:36 129 872 896 cudnn_adv_infer64_8.dll_bak
05.01.2022 10:46 97 293 824 cudnn_adv_train64_8.dll_bak
03.08.2022 05:05 871 934 464 torch_cuda_cu.dll_bak
05.01.2022 11:15 736 718 848 cudnn_cnn_infer64_8.dll_bak
Can you distribute these DLLs already patched with the Python environment, or does the NVIDIA license agreement forbid it?
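
The 1.93 GB figure can be checked against the listing above. A minimal Python sketch, assuming the sizes in the listing are bytes and that "GB" here means binary gigabytes (2**30 bytes); the dictionary and function names are only illustrative:

```python
# File sizes in bytes, copied from the directory listing in the post.
BAK_FILES = {
    "cudnn_ops_train64_8.dll_bak": 70_403_584,
    "cudnn_ops_infer64_8.dll_bak": 88_405_504,
    "torch_cuda_cpp.dll_bak": 1_329_664,
    "cudnn_cnn_train64_8.dll_bak": 81_487_360,
    "cudnn_adv_infer64_8.dll_bak": 129_872_896,
    "cudnn_adv_train64_8.dll_bak": 97_293_824,
    "torch_cuda_cu.dll_bak": 871_934_464,
    "cudnn_cnn_infer64_8.dll_bak": 736_718_848,
}


def total_gib(sizes):
    """Return the combined size of all files in GiB (2**30 bytes)."""
    return sum(sizes.values()) / 2**30


print(f"{total_gib(BAK_FILES):.2f} GiB")  # 1.93 GiB
```

The eight backup files add up to 2,077,446,144 bytes, which matches the stated saving.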
ID: 59419
kotenok2000

Joined: 18 Jul 13
Posts: 79
Credit: 210,528,292
RAC: 0
Level
Leu
Message 59420 - Posted: 10 Oct 2022, 11:50:45 UTC - in response to Message 59386.  

I just discovered the following problem on the PC which consists of:

2 CPUs Xeon E5 8-core / 16-HT each.
1 GPU Quadro P5000
128 GB Ramdisk
128 GB system memory

Until a few days ago, I ran 2 Pythons simultaneously (with a setting in the app_config.xml: 0.5 gpu usage).

Now, while only 1 Python is running, when I push the update button in the BOINC manager to fetch another Python, the BOINC event log tells me that no Pythons are available. That is not the case, though, as the server status page shows some 550 tasks for download; besides, I just downloaded one on another PC.
BTW: the Python task uses only some 50% of the processor, which seems logical with 2 CPUs inside.

So I tried to download tasks from other projects, and in all cases the event log says:
not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full).
How can that be the case?
In the BOINC computing preferences, I have now set "store at least" to 10 days of work, and "store up to an additional" also to 10 days. However, this did not solve the problem.

There is about 94GB free space on the Ramdisk, and some 150GB free system RAM.

What also catches my eye: the one running Python, which right now shows 45% progress after some 10 hours, shows a remaining runtime of 34 days!
Before, like on my other machines, remaining runtime for Pythons was indicated as 1-2 days.
Could this entry be the cause why nothing else can be downloaded and I get the message "job cache full"?

Can anyone help me to get out of this problem?


Meanwhile, the problem has become even worse:

After downloading 1 Python, it starts, and in the BOINC manager it shows a remaining runtime of about 60 days (!!!). In reality, the task proceeds at normal speed and will be finished within 24 hours, like all other tasks before on this machine.

Hence, nothing else can be downloaded.
When trying to download tasks from other projects, it shows
not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full).

When I try to download a second Python, it says "no tasks are available for Python apps for GPU hosts", which is not correct; there are some 150 available for download at the moment.

Can anyone give me advice how to get this problem solved?


You can add <fraction_done_exact/> to your app_config.xml
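
As far as the BOINC client configuration documentation describes it, <fraction_done_exact/> makes the client base its remaining-time estimate only on the fraction done that the application reports. A rough Python sketch of that kind of linear extrapolation, using the figures from the quoted post (45% after about 10 hours); the function name is illustrative and this is not the client's actual code:

```python
def remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Extrapolate remaining runtime linearly from the reported progress."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in the range (0, 1]")
    # Total time is elapsed / fraction_done; subtract what has already run.
    return elapsed_hours * (1.0 - fraction_done) / fraction_done


print(f"{remaining_hours(10.0, 0.45):.1f} h")  # 12.2 h
```

Under this assumption the estimate would be about half a day rather than 34 days, which is consistent with the task actually finishing within 24 hours.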
ID: 59420
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Level
Trp
Message 59421 - Posted: 10 Oct 2022, 12:20:24 UTC - in response to Message 59418.  

several things.

first, after changing gpu_usage to 0.5 in your app_config file, did you restart boinc or click "read config files" in the Options toolbar menu? you need to do this for any changes in your app_config to take effect. also, even if you did click this, tasks downloaded as 1.0 GPU will not change their label to 0.5, but they will be treated as 0.5 internally. to see this reflected in the task labeling you need to restart boinc.

next, this line:
<max_concurrent>2</max_concurrent>

this will prevent more than 2 tasks from running. even if you download 4, only 2 will run. just letting you know in case this is not what you intended.
ID: 59421
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 59422 - Posted: 10 Oct 2022, 12:46:51 UTC - in response to Message 59421.  

After changing an app_config file, I always click "read config files" in the Options toolbar menu. As said before, I have worked with app_config.xml files very often for several years, so I am sure I am doing it correctly.

I know that tasks downloaded as 1.0 GPU will keep this label.
That is not the issue here, though, because I had set the 0.5 GPU even before I started downloading Pythons. Since then, 5 Pythons have been downloaded (3 of them finished and uploaded, 1 active, another one waiting to start), and all of them show 1.0 GPU, for unknown reasons.

I know the meaning of
<max_concurrent>2</max_concurrent>
thanks for the hint anyway.

So, as said before: it's totally unclear to me why the app_config does not work in this case. This is the first time in all these years that I have seen this problem :-(
What I could still try, once the currently running Python has finished, is to restart BOINC. Maybe this helps, but I doubt it.
ID: 59422
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Level
Trp
Message 59423 - Posted: 10 Oct 2022, 12:49:28 UTC - in response to Message 59422.  
Last modified: 10 Oct 2022, 13:01:27 UTC

what does your event log say about your app_config file? maybe you have some whitespace error in it that's causing boinc to not read it properly. when you click read config files, does boinc give any error/warning/complaint about the GPUGRID app_config file?

also check that the file is named exactly 'app_config.xml', with no typo, and that it is located in your gpugrid project folder
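
The naming check can be done by eye, or with a few lines of Python. This is only a sketch: the helper name is made up, and the project folder has to be supplied by the user because its location depends on the BOINC installation.

```python
from pathlib import Path


def find_app_config(project_dir):
    """Return (exact_match_found, near_misses) for app_config.xml in project_dir.

    near_misses lists files whose names look like a misnamed app_config.xml,
    for example 'app_config.xml.txt' created by a text editor.
    """
    names = [p.name for p in Path(project_dir).iterdir() if p.is_file()]
    exact = "app_config.xml" in names
    near = sorted(
        n for n in names if n != "app_config.xml" and "app_config" in n.lower()
    )
    return exact, near
```

Calling find_app_config on the GPUGRID project folder should return (True, []) when the file is named correctly and no stray copies are present.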
ID: 59423
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 59424 - Posted: 10 Oct 2022, 13:39:51 UTC - in response to Message 59423.  
Last modified: 10 Oct 2022, 13:40:38 UTC

I have now double- and triple-checked everything you mentioned above.
Also, there was no error/warning/complaint after clicking read config files.
So this really is a huge conundrum :-(

What I did now was spoof the GPU count in the coproc_info.xml, which caused a total of 4 Pythons to be downloaded, but only 2 are running (okay, I want to be modest: 2 is better than 1).
However, this cannot be the ultimate solution, since the GPU spoofing will have unwanted effects with other GPU projects.

So, bottom line: I have no idea what else I can do to get this app_config to work the way it's supposed to.
ID: 59424
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Level
Trp
Message 59425 - Posted: 10 Oct 2022, 13:45:00 UTC - in response to Message 59424.  

but what does the event log say? does it claim to find the gpugrid app_config file? what you're describing sounds like BOINC is not reading the file. which can be because there's an error in the file or because you don't have the file in the right location.

please confirm which directory contains your GPUGRID app_config file, and post the Event Log output after clicking "read config files"
ID: 59425
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Level
Trp
Message 59426 - Posted: 10 Oct 2022, 13:49:58 UTC - in response to Message 59424.  
Last modified: 10 Oct 2022, 13:51:50 UTC



this is exactly what I would expect with the config you've described.

2x GPU spoofed = 4 tasks can download. if you have 2 running on a single GPU, then it's properly using 0.5 GPU per task. the only way 2x can run on a single GPU is if the value 0.5 is being used. and only 2 are running because of your max_concurrent statement (which you need for the spoofed GPU setup, otherwise it will try to run on the nonexistent second GPU and cause errors).

if you want to run 3x on a single GPU now, leave the GPU spoofing in place, change app_config to max_concurrent of 3, and change gpu_usage to 0.33

unless you know how to edit BOINC code and recompile a custom client, you will need to spoof the GPUs to get more tasks to download since the project enforces 2x tasks per GPU. there's no other solution.
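
The arithmetic in this post can be written out as a small Python sketch. It is a simplified model of the behaviour described above (2 tasks per reported GPU on the server side, gpu_usage and max_concurrent on the client side), not BOINC's actual scheduling code; all names are illustrative.

```python
import math

# Per the post: the project sends at most 2 tasks per GPU the client reports.
TASKS_PER_REPORTED_GPU = 2


def downloadable_tasks(reported_gpus: int) -> int:
    """Tasks the server will send, given the (possibly spoofed) GPU count."""
    return TASKS_PER_REPORTED_GPU * reported_gpus


def running_tasks(real_gpus: int, gpu_usage: float, max_concurrent: int) -> int:
    """Tasks that run at once on the physically present GPUs."""
    # The small epsilon guards against floating-point results just below an integer.
    per_gpu = math.floor(1.0 / gpu_usage + 1e-9)
    return min(max_concurrent, real_gpus * per_gpu)


print(downloadable_tasks(2))      # 4: two reported GPUs, 2 tasks each
print(running_tasks(1, 0.5, 2))   # 2: one real GPU shared by two tasks
print(running_tasks(1, 0.33, 3))  # 3: one real GPU shared by three tasks
```

In this model max_concurrent is what keeps the client from scheduling work on the spoofed second GPU.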
ID: 59426
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 59427 - Posted: 10 Oct 2022, 14:06:14 UTC - in response to Message 59425.  

Sorry, I goofed before. The event log does indeed complain:

10.10.2022 15:49:42 | GPUGRID | Found app_config.xml
10.10.2022 15:49:42 | GPUGRID | Missing </app> in app_config.xml

however, this does not make any sense, because </app> is not missing, is it?

<app_config>
<app>
<name>PythonGPU</name>
<fraction_done_exact>
<max_concurrent>3</max_concurrent>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
</app_config>

(I had added the <fraction_done_exact> in the meantime.)
As already said, this is exactly the same app_config that I use on another host, and there it works. I copied it.

And yes, the file is in the GPUGRID project folder.


ID: 59427
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Level
Trp
Message 59428 - Posted: 10 Oct 2022, 14:11:54 UTC - in response to Message 59427.  

the line <fraction_done_exact> is not right. that's breaking your file.

it needs to be <fraction_done_exact/>. you're missing the '/' before the close of the tag
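
The difference between the two forms can be reproduced locally with Python's standard XML parser. This is only a sketch: BOINC uses its own parser, so the wording of the error differs from the event log message, and the helper name is made up.

```python
import xml.etree.ElementTree as ET

# The file as posted, with the unclosed <fraction_done_exact> tag.
BROKEN = """<app_config>
<app>
<name>PythonGPU</name>
<fraction_done_exact>
<max_concurrent>3</max_concurrent>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
"""

# Same file with the tag written as a self-closing element.
FIXED = BROKEN.replace("<fraction_done_exact>", "<fraction_done_exact/>")


def xml_error(text):
    """Return None if text is well-formed XML, else the parser's error message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None


print("broken:", xml_error(BROKEN))
print("fixed: ", xml_error(FIXED))
```

Because the unclosed tag swallows everything after it, a parser only notices the problem when it reaches </app>, which is presumably why the event log complained about a missing </app> rather than about the line that was actually wrong.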
ID: 59428
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 59429 - Posted: 10 Oct 2022, 14:29:55 UTC - in response to Message 59428.  

OMG, shame on me :-(

Many thanks for your valuable help.

What I am wondering is how this error could happen when copying the file from another host (on which everything works fine).
Of course, it would have helped if the entry in the event log had been a little clearer; it pointed to something else.

But anyway, the mistake was clearly on my side, and thanks again for your patience :-)

BTW, now 3 Pythons are running concurrently. Still, the load on the Quadro P5000 is moderate, while the load on the 2 Xeon E5s is 100% each.
I will have to observe whether it wouldn't make more sense to run only 2 Pythons.

ID: 59429
[CSF] Aleksey Belkov

Joined: 26 Dec 13
Posts: 86
Credit: 1,292,358,731
RAC: 0
Level
Met
Message 59430 - Posted: 10 Oct 2022, 18:19:58 UTC - in response to Message 59397.  
Last modified: 10 Oct 2022, 18:23:58 UTC

Good day, abouh
I see that unpacking is still done in 2 steps:
".\7za.exe" x pythongpu_windows_x86_64__cuda1131.txz -y

".\7za.exe" x pythongpu_windows_x86_64__cuda1131.tar -y


Is there any problem with implementing a pipelined unpacking process?
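
To illustrate what a single-pass unpack means here: the sketch below uses Python's standard tarfile module in streaming mode, so the xz decompression and the tar extraction happen together and no intermediate .tar file is written. This is not how the GPUGRID wrapper works (it calls 7za.exe as shown above); the function name is made up for the example.

```python
import tarfile
from pathlib import Path


def unpack_txz(archive, dest):
    """Decompress and extract a .txz (tar + xz) archive in one streaming pass."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r|xz" reads the archive as a forward-only stream: each member is
    # decompressed and written out as it is encountered.
    with tarfile.open(archive, mode="r|xz") as tar:
        tar.extractall(dest)
    return dest
```

For example, unpack_txz("pythongpu_windows_x86_64__cuda1131.txz", ".") would replace both 7za steps, at the cost of requiring a Python interpreter before the environment itself has been unpacked.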
ID: 59430
Keith Myers

Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 662
Level
Tyr
Message 59431 - Posted: 10 Oct 2022, 18:35:03 UTC - in response to Message 59429.  

The app_config.xml code you posted is not valid, according to the XML validator.


An error has been found!

Errors in the XML document:
10: 3 The element type "fraction_done_exact" must be terminated by the matching end-tag "</fraction_done_exact>".

XML document:
1 <app_config>
2 <app>
3 <name>PythonGPU</name>
4 <fraction_done_exact>
5 <max_concurrent>3</max_concurrent>
6 <gpu_versions>
7 <gpu_usage>0.5</gpu_usage>
8 <cpu_usage>1.0</cpu_usage>
9 </gpu_versions>
10 </
app>
11 </app_config>

You should always check the syntax of your XML files with a validator.

https://www.xmlvalidation.com/index.php
ID: 59431
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 318
Level
Trp
Message 59432 - Posted: 10 Oct 2022, 18:44:15 UTC - in response to Message 59431.  

And you shouldn't have a mid-line break, as shown in line 10.
ID: 59432
KAMasud

Joined: 27 Jul 11
Posts: 138
Credit: 539,953,398
RAC: 0
Level
Lys
Message 59435 - Posted: 11 Oct 2022, 4:44:15 UTC

We "Boincers" are like cows: if there are no WUs, we move on to greener pastures. Forget about running several WUs on one GPU; give my GPUs something to run.
ID: 59435
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 59436 - Posted: 11 Oct 2022, 5:58:26 UTC - in response to Message 59431.  

Thanks, Keith, for the link. To be frank, I didn't know that such a validator existed.
ID: 59436
Keith Myers

Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 662
Level
Tyr
Message 59437 - Posted: 11 Oct 2022, 6:16:40 UTC - in response to Message 59436.  

It has been around since the early SETI days, when we all had to write a lot of XML for custom app_info and app_config files.
ID: 59437
kotenok2000

Joined: 18 Jul 13
Posts: 79
Credit: 210,528,292
RAC: 0
Level
Leu
Message 59438 - Posted: 11 Oct 2022, 12:39:14 UTC - in response to Message 59435.  

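You can run something like this:
REM Switch to the BOINC program directory (adjust drive and path to your installation).
cd /d "e:\Program Files\BOINC"
:loop
REM Short pause, then ask the client to contact the GPUGRID server.
TIMEOUT /T 10
boinccmd.exe --project https://www.gpugrid.net update
REM Wait 2 minutes before the next request.
TIMEOUT /T 120
goto loop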

or write something like that for bash.
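
For a platform-independent variant, the same loop can be sketched in Python. It assumes boinccmd is on the PATH; the function and parameter names are illustrative, and the run and sleep hooks exist only so the loop can be exercised without actually calling boinccmd.

```python
import subprocess
import time

# Same command as in the batch file above.
UPDATE_CMD = ["boinccmd", "--project", "https://www.gpugrid.net", "update"]


def update_loop(cmd, interval_s=130.0, max_updates=None,
                run=subprocess.run, sleep=time.sleep):
    """Run cmd repeatedly, pausing interval_s seconds after each call.

    With max_updates=None the loop runs until interrupted (Ctrl+C).
    Returns the number of update requests that were issued.
    """
    count = 0
    while max_updates is None or count < max_updates:
        run(cmd, check=False)  # a failed request should not stop the loop
        count += 1
        sleep(interval_s)
    return count


# To start polling, call:
# update_loop(UPDATE_CMD)
```

The default of 130 seconds mirrors the 10 + 120 second pauses of the batch file.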
ID: 59438
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Level
Trp
Message 59439 - Posted: 11 Oct 2022, 14:06:07 UTC

hey abouh,

I've noticed some new task names containing 'demos25_2-0-1'. this differs from the majority of the previous tasks, which were labelled as just 'demos25-0-1'.

can you briefly explain what is different about these tasks? also, the past few days (and mostly with these _2 tasks) the majority of the tasks have been either "early ending" or pre-coded to run a smaller number of iterations leading to very short runtimes (on the order of minutes instead of hours).

Thanks :)
ID: 59439

©2025 Universitat Pompeu Fabra