Experimental Python tasks (beta) - task description

Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 59280 - Posted: 20 Sep 2022, 10:08:45 UTC - in response to Message 59254.  

Erich56 asked:
Question: can a running Python task be interrupted for the time a Windows Update takes place (with rebooting of the PC), or does this damage the task?

I have now tried it: the two tasks, each running on an RTX3070 under Windows, did not survive a reboot :-(

Erich56
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 59281 - Posted: 20 Sep 2022, 11:58:10 UTC

Since I upgraded the RAM of one of my PCs from 64GB to 128GB yesterday (so now I have a 64GB RAM disk plus 64GB of system RAM; before, it was half of each), every GPUGRID Python task fails on this PC, which has two RTX3070s inside.

The task starts okay, RAM as well as VRAM fills up continuously, CPU usage is close to 100%, and after a while (a few minutes up to half an hour) the task fails.
The BOINC manager says "aborted by the project", and the task description says "aufgegeben" (German for "abandoned").

Interestingly, no times are shown, neither runtime nor CPU time, and there is no stderr output.

See this example:

https://www.gpugrid.net/result.php?resultid=33044774

On another machine, I have two tasks running simultaneously on one GPU with no problem at all.

Of course I suspected a defective RAM module; however, all night I had 5 LHC ATLAS tasks (3 cores each) running simultaneously without any problem. So I guess that was enough of a RAM test.

Also, hundreds of WCG GPU tasks were processed for hours this morning, again without any problem.

Any ideas, anyone?

Diplomat
Joined: 1 Sep 10
Posts: 15
Credit: 888,019,989
RAC: 131
Message 59285 - Posted: 20 Sep 2022, 18:03:37 UTC - in response to Message 59153.  
Last modified: 20 Sep 2022, 18:09:57 UTC


<app_config>

<app>
  <name>PythonGPU</name>
  <gpu_versions>
    <gpu_usage>1.0</gpu_usage>
    <cpu_usage>3.0</cpu_usage>
  </gpu_versions>
</app>

</app_config>


I'm new to config editing :) so a few more questions.

Do I need to be more specific in the <name> tag and put the full application name, like Python apps for GPU hosts 4.03 (cuda1131), from the task properties?

I ask because I don't see 3 CPUs being given to the task after a client restart:

Application: Python apps for GPU hosts 4.03 (cuda1131)
Name: e00015a03227-ABOU_rnd_ppod_expand_demos25-0-1-RND8538
State: Running
Received: Tue 20 Sep 2022 10:48:34 PM +05
Report deadline: Sun 25 Sep 2022 10:48:34 PM +05
Resources: 0.99 CPUs + 1 NVIDIA GPU
Estimated computation size: 1,000,000,000 GFLOPs
CPU time: 00:48:32
CPU time since checkpoint: 00:00:07
Elapsed time: 00:11:37
Estimated time remaining: 50d 21:42:09
Fraction done: 1.990%
Virtual memory size: 18.16 GB
Working set size: 5.88 GB
Directory: slots/8
Process ID: 5555
Progress rate: 6.840% per hour
Executable: wrapper_26198_x86_64-pc-linux-gnu


KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 539,953,398
RAC: 0
Message 59286 - Posted: 20 Sep 2022, 18:32:36 UTC - in response to Message 59260.  

Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.

I have a problem that they fail on reboot however. Is that common?
http://www.gpugrid.net/results.php?hostid=583702

That is only on Windows though. I have not seen it yet on Linux, but I don't reboot often there.

Guess it must be only on Windows. No problem restarting a task after a reboot on Ubuntu.



The restart works fine on Windows. Perhaps the five-minute pause at 2% is what is causing the confusion.

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 662
Message 59287 - Posted: 20 Sep 2022, 19:22:29 UTC - in response to Message 59281.  


Erich56 asked:
Any ideas, anyone?

Get rid of the ram disk.

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 662
Message 59288 - Posted: 20 Sep 2022, 19:25:45 UTC - in response to Message 59285.  
Last modified: 20 Sep 2022, 19:26:22 UTC


Diplomat asked:
Do I need to be more specific in the <name> tag and put the full application name, like Python apps for GPU hosts 4.03 (cuda1131), from the task properties?

I ask because I don't see 3 CPUs being given to the task after a client restart.



Any task that has already been downloaded keeps its original CPU/GPU resource assignment.

Any newly downloaded task will show the new assignment.

The name for the tasks is PythonGPU, as you show.

You should always refer to the client_state.xml file, as it is the final arbiter of the correct naming and task configuration.
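
As an illustration of looking up the short application name, here is a minimal sketch that lists the <app> entries of a client_state.xml excerpt. The sample XML below is a hypothetical, heavily trimmed stand-in (the second entry's friendly name is a placeholder), and the sketch assumes the file parses as well-formed XML with <name> and <user_friendly_name> children under each <app>; check your own file rather than relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed excerpt. A real client_state.xml is far larger
# and its exact contents depend on the BOINC client version and attached projects.
SAMPLE_CLIENT_STATE = """
<client_state>
  <app>
    <name>PythonGPU</name>
    <user_friendly_name>Python apps for GPU hosts</user_friendly_name>
  </app>
  <app>
    <name>acemd3</name>
    <user_friendly_name>placeholder friendly name</user_friendly_name>
  </app>
</client_state>
"""


def list_app_names(xml_text):
    """Return (short name, user-friendly name) pairs for every <app> element."""
    root = ET.fromstring(xml_text)
    return [
        (app.findtext("name"), app.findtext("user_friendly_name"))
        for app in root.iter("app")
    ]


for short_name, friendly_name in list_app_names(SAMPLE_CLIENT_STATE):
    # The short name is the value that goes into <name> in app_config.xml.
    print(f"{short_name:12s} {friendly_name}")
```

To try it on a real host, read the text of client_state.xml from the BOINC data directory and pass it to list_app_names instead of the sample string.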

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 662
Message 59289 - Posted: 20 Sep 2022, 19:29:24 UTC - in response to Message 59286.  

Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.

I have a problem that they fail on reboot however. Is that common?
http://www.gpugrid.net/results.php?hostid=583702

That is only on Windows though. I have not seen it yet on Linux, but I don't reboot often there.

Guess it must be only on Windows. No problem restarting a task after a reboot on Ubuntu.



The restart works fine on Windows. Perhaps the five-minute pause at 2% is what is causing the confusion.


If you interrupt the task during Stage 1, while it is downloading and unpacking the required support files, it may fail on Windows upon restart.

The reason for such a failure normally shows up in stderr.txt.

It is best to interrupt the task only once it has finished its setup, is actually calculating, and has produced at least one checkpoint.

Erich56
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 59290 - Posted: 20 Sep 2022, 19:52:08 UTC - in response to Message 59287.  


Erich56 asked:
Any ideas, anyone?

Keith Myers replied:
Get rid of the ram disk.

On the other hand, a RAM disk works perfectly on this machine:

https://www.gpugrid.net/show_host_detail.php?hostid=599484

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 662
Message 59291 - Posted: 20 Sep 2022, 20:40:28 UTC - in response to Message 59290.  


Erich56 asked:
Any ideas, anyone?

Keith Myers replied:
Get rid of the ram disk.

Erich56 replied:
On the other hand, a RAM disk works perfectly on this machine:

https://www.gpugrid.net/show_host_detail.php?hostid=599484

Then you need to investigate the differences between the two hosts.

All I'm stating is that the RAM disk is an unnecessary complication that is not needed to process the tasks.

Basic troubleshooting: reduce to the most basic, absolutely necessary configuration for the tasks to complete correctly, and then add back one extra element at a time until the tasks fail again.

Then you have identified why the tasks fail.
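
The procedure above can be sketched in a few lines. This is only an illustration: find_breaking_extra, the option names, and the fake test function are all hypothetical, and in practice "testing" a configuration means running a real task to completion, which takes hours rather than a function call.

```python
def find_breaking_extra(baseline, extras, task_succeeds):
    """Add extras to the baseline one at a time; return the first one that breaks it.

    Returns None if the tasks still succeed with every extra added back.
    """
    config = list(baseline)
    if not task_succeeds(config):
        raise ValueError("the minimal configuration already fails")
    for extra in extras:
        config.append(extra)
        if not task_succeeds(config):
            return extra
    return None


def fake_task_succeeds(config):
    # Stand-in for "run a task to completion with this setup".
    # Here we simply pretend that the hypothetical option_b is the culprit.
    return "option_b" not in config


culprit = find_breaking_extra(
    ["minimal setup"],
    ["option_a", "option_b", "option_c"],
    fake_task_succeeds,
)
print("first element that makes tasks fail:", culprit)
```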

Diplomat
Joined: 1 Sep 10
Posts: 15
Credit: 888,019,989
RAC: 131
Message 59292 - Posted: 21 Sep 2022, 16:42:56 UTC

Keith Myers thanks!

Diplomat
Joined: 1 Sep 10
Posts: 15
Credit: 888,019,989
RAC: 131
Message 59293 - Posted: 22 Sep 2022, 2:53:36 UTC

In my case the config did not take effect until I added <max_concurrent>:

<app_config>

<app>
  <name>PythonGPU</name>
  <max_concurrent>1</max_concurrent>
  <gpu_versions>
    <gpu_usage>1.0</gpu_usage>
    <cpu_usage>3.0</cpu_usage>
  </gpu_versions>
</app>

</app_config>


Now I see the expected status: Running (3 CPUs + 1 NVIDIA GPU)

Unfortunately, it doesn't help to get high GPU utilization. The completion time looks like it is going to be slightly better, though.

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 662
Message 59294 - Posted: 22 Sep 2022, 4:23:38 UTC - in response to Message 59293.  

Diplomat wrote:
Unfortunately, it doesn't help to get high GPU utilization. The completion time looks like it is going to be slightly better, though.


If you have enough CPU cores to support it and enough VRAM on the card, you can get better GPU utilization by running two tasks at once on the card. Just change gpu_usage to 0.5.
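
A minimal sketch of what that change looks like, generating the app_config.xml text with Python's standard library. The helper name build_app_config is hypothetical; the element names are the ones used in the config quoted in this thread. Note that a <max_concurrent> of 1, as in that config, would still limit the client to one running task, so the sketch leaves it out unless a value is passed explicitly.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, gpu_usage, cpu_usage, max_concurrent=None):
    """Build app_config.xml text for one app (hypothetical helper)."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    if max_concurrent is not None:
        ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(versions, "cpu_usage").text = str(cpu_usage)
    ET.indent(root)  # pretty-printing; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


gpu_usage = 0.5
cpu_usage = 3.0
config_text = build_app_config("PythonGPU", gpu_usage, cpu_usage)
print(config_text)

# With gpu_usage 0.5, two tasks share one GPU, and each budgets cpu_usage CPUs.
tasks_per_gpu = int(1 / gpu_usage)
cpus_budgeted_per_gpu = tasks_per_gpu * cpu_usage
print(f"{tasks_per_gpu} tasks per GPU, {cpus_budgeted_per_gpu} CPUs budgeted per GPU")
```

As noted earlier in the thread, tasks that were already downloaded keep their old resource assignment; only newly downloaded tasks pick up the new values.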

Erich56
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 59297 - Posted: 22 Sep 2022, 18:44:15 UTC - in response to Message 59291.  
Last modified: 22 Sep 2022, 18:47:30 UTC


Erich56 asked:
Any ideas, anyone?

Keith Myers replied:
Get rid of the ram disk.

Erich56 replied:
On the other hand, a RAM disk works perfectly on this machine:

https://www.gpugrid.net/show_host_detail.php?hostid=599484

Keith Myers replied:
Then you need to investigate the differences between the two hosts.

All I'm stating is that the RAM disk is an unnecessary complication that is not needed to process the tasks.

Basic troubleshooting: reduce to the most basic, absolutely necessary configuration for the tasks to complete correctly, and then add back one extra element at a time until the tasks fail again.

Then you have identified why the tasks fail.


I installed a RAM disk because I quite often crunch tasks that write many GB of data to disk, e.g. LHC ATLAS, the GPU tasks from WCG, the Pythons from Rosetta, and last but not least the Pythons from GPUGRID: about 200GB within 24 hours, which is a lot (so for my two RTX3070s, this would be 400GB/day).
So if the machines are running 24/7, in my opinion this is simply not good for an SSD's lifetime.
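
To put the write volume into perspective, here is a rough back-of-the-envelope sketch. The 200GB/day per task and the two concurrent tasks are the figures from this post; the endurance rating of 600 TB written is purely an assumed example value, so substitute the rating of your own SSD.

```python
GB_PER_TASK_PER_DAY = 200   # figure quoted in the post
CONCURRENT_TASKS = 2        # one Python task per RTX3070
ASSUMED_TBW_RATING = 600    # hypothetical SSD endurance rating, in TB written

daily_gb = GB_PER_TASK_PER_DAY * CONCURRENT_TASKS
yearly_tb = daily_gb * 365 / 1000
years_to_rating = ASSUMED_TBW_RATING / yearly_tb

print(f"{daily_gb} GB/day is about {yearly_tb:.0f} TB/year")
print(f"an assumed {ASSUMED_TBW_RATING} TBW rating would be reached in about {years_to_rating:.1f} years")
```

This ignores write amplification and every other workload on the same drive, so it is only an order-of-magnitude estimate.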

Over the years, my experience with RAM disks has been good. I have no idea what kind of problem the GPUGRID Pythons have with this particular RAM disk, or vice versa. As I said, on another machine with a RAM disk I also have 2 Pythons running concurrently, even on one GPU, and it works fine.

So what I did yesterday evening was let only one of the two RTX3070s crunch a Python. On the other GPU, I sometimes crunched WCG, or nothing at all.
This evening, after about 22-1/2 hours, the Python finished successfully :-)
BTW, besides the Python, 3 ATLAS tasks (3 cores each) were also running the whole time.

This means that, as far as I know now, I can obviously run Pythons on at least one of the two RTX3070s, and other projects on the other one.
I will still try to investigate further why GPUGRID Pythons don't run on both RTX3070s.

[CSF] Aleksey Belkov
Joined: 26 Dec 13
Posts: 86
Credit: 1,292,358,731
RAC: 0
Message 59307 - Posted: 24 Sep 2022, 17:46:41 UTC - in response to Message 56977.  
Last modified: 24 Sep 2022, 17:49:29 UTC

I do not know how to properly mention the project administrators in this topic, but I want to draw attention to this application's non-optimal use of disk space.
Only now did I notice what the slotX directory contains while a task is running.
I was very surprised to find there, in addition to the unpacked application files, the archive from which these files are unpacked. Moreover, the archive is present in two copies at once, apparently because of the suboptimal process of unpacking the tar.gz format.
Here you can see that the application's files themselves occupy only about half of the working directory (slotX).



Apparently, the following happens when the application starts:
1) The application's source archive (pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17) is copied from the project directory (\projects\www.gpugrid.net) to the working directory (\slots\X\).
2) The archive is then unzipped (tar.gz >> tar).
3) In the last stage, the application files are unpacked from the tar container.
At the end of the process, the now unnecessary tar and tar.gz files are, for some reason, not deleted from the working directory.
Thus, each instance of this WU not only needs ~16 GiB of space at its peak, but also occupies that space until the WU completes.

The whole process also takes much more time (copying and unpacking) and writes much more data than necessary:
Project tar.gz >> slotX (2.66 GiB) >> tar (5.48 GiB) >> app files (5.46 GiB) = 13.6 GiB
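
The total above can be checked with a few lines of arithmetic. The three sizes are the ones quoted in this post; the variable names are only for illustration.

```python
# Sizes in GiB, as quoted in the post.
copied_archive = 2.66     # tar.gz copied into slots/X
intermediate_tar = 5.48   # tar produced by unzipping the tar.gz
app_files = 5.46          # final unpacked application files

total_written = copied_archive + intermediate_tar + app_files
intermediate_only = copied_archive + intermediate_tar
share_intermediate = intermediate_only / total_written

print(f"total written per task: {total_written:.2f} GiB")
print(f"of which intermediate files: {intermediate_only:.2f} GiB ({share_intermediate:.0%})")
```

So roughly 60% of the data written per task goes into files that are not needed once unpacking has finished.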

Both can be reduced significantly by unpacking the files directly from the source archive into the working directory, without any of the intermediate stages mentioned above.
7za, which is used for unzipping/unpacking the archives, supports pipelining:

7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17" -so | 7za x -aoa -si -ttar -o"X:\BOINC\slots\0\"

Project tar.gz >> app files (5.46 GiB) = 5.46 GiB!
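
The same single-pass idea can be illustrated with Python's standard tarfile module. This is not how the GPUGRID wrapper works (the thread only establishes that it uses 7za); it is a sketch showing that a .tar.gz can be decompressed and unpacked as one stream without writing an intermediate .tar file. All file and directory names in the demo are hypothetical.

```python
import io
import os
import tarfile
import tempfile


def extract_tar_gz_streaming(archive_path, dest_dir):
    """Decompress and unpack in one pass; no intermediate .tar is written."""
    # Use the safer "data" extraction filter where this Python version supports it.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    # Mode "r|gz" reads the archive as a forward-only compressed stream.
    with tarfile.open(archive_path, mode="r|gz") as archive:
        archive.extractall(dest_dir, **extra)


# Self-contained demo with a tiny throwaway archive.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "demo_app.tar.gz")
    payload = b"print('hello from the unpacked app')\n"
    with tarfile.open(archive_path, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="app/run.py")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

    slot_dir = os.path.join(tmp, "slot0")
    os.makedirs(slot_dir)
    extract_tar_gz_streaming(archive_path, slot_dir)

    # Collect what ended up in the slot directory.
    extracted = sorted(
        os.path.relpath(os.path.join(root, name), slot_dir)
        for root, _dirs, files in os.walk(slot_dir)
        for name in files
    )
    print(extracted)
```

Only the unpacked files land in the destination directory, which corresponds to the 5.46 GiB case above.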


Moreover, if the archive used the 7z format instead of tar.gz (LZMA2 with the "5 - Normal" profile, which is the default in recent 7-Zip versions), you could not only substantially reduce the amount of data each user downloads (and consequently the load on the project's bandwidth), but also speed up unpacking.

Saving more than one GiB:



On my computer, unpacking via pipelining (as described above) with the current, 12-year-old 7za version (9.20) takes ~100 seconds.
With the recent 7za version (22.01), it takes only ~45-50 seconds.
7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.7z" -o"X:\BOINC\slots\0\"
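
Why an LZMA-based format can come out smaller than gzip can be shown with Python's standard library. The standard library has no 7z support, so this sketch uses xz (which is also LZMA2-based) as a stand-in, and the synthetic test data is deliberately chosen to favour a large dictionary. It says nothing about the actual size of the GPUGRID archive.

```python
import gzip
import lzma
import random

# Synthetic data: one 256 KiB pseudo-random block repeated four times.
# gzip's 32 KiB window cannot "see" the earlier copies; LZMA's much larger
# dictionary can, so it stores the repeats almost for free.
rng = random.Random(0)
block_size = 256 * 1024
block = rng.getrandbits(8 * block_size).to_bytes(block_size, "little")
data = block * 4

gz_size = len(gzip.compress(data))
xz_size = len(lzma.compress(data))

print(f"original: {len(data)} bytes")
print(f"gzip:     {gz_size} bytes")
print(f"xz:       {xz_size} bytes")
```

Real application archives contain far more varied data, so the gain there will be smaller than in this constructed case.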


I believe the benefits of the changes described are worth implementing them (even if not all of them and/or not all at once).
Moreover, the changes come down to updating one executable file, repacking the archive, and changing the command that unpacks it.

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 662
Message 59308 - Posted: 24 Sep 2022, 21:32:01 UTC

I believe the researcher has already been down this road with Windows not natively supporting the compression/decompression algorithms you mention.

It requires each volunteer to add support manually to their hosts.

In the quest for compatibility, a researcher tries to package applications for all attached hosts to run natively without jumping through hoops so that everyone can run the tasks.


[CSF] Aleksey Belkov
Joined: 26 Dec 13
Posts: 86
Credit: 1,292,358,731
RAC: 0
Message 59309 - Posted: 24 Sep 2022, 21:53:22 UTC - in response to Message 59308.  
Last modified: 24 Sep 2022, 21:56:53 UTC

Keith Myers wrote:
It requires each volunteer to add support manually to their hosts.

No.
Unfortunately, you did not read carefully what I wrote above.
I already mentioned there that the current Windows app already comes with 7za.exe version 9.20 (you can find it in the project folder).
So nothing would change.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 59310 - Posted: 25 Sep 2022, 2:31:01 UTC
Last modified: 25 Sep 2022, 2:42:26 UTC

Yes, I do have GPUGrid installed on my Win10 machine after all.
And 7za.exe is in the project folder, just not in the project folder on my Ubuntu machine.

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 662
Message 59311 - Posted: 25 Sep 2022, 3:07:56 UTC - in response to Message 59309.  

Keith Myers wrote:
It requires each volunteer to add support manually to their hosts.

[CSF] Aleksey Belkov replied:
No.
Unfortunately, you did not read carefully what I wrote above.
I already mentioned there that the current Windows app already comes with 7za.exe version 9.20 (you can find it in the project folder).
So nothing would change.

OK, so you can thank Richard Haselgrove for the fact that the application now packages that utility. Originally, the tasks failed because Windows does not come with that utility, and Richard helped debug the issue with the developer.

If you think the application is not using the utility correctly you should inform the developer of your analysis and code fix so that other Windows users can benefit.

[CSF] Aleksey Belkov
Joined: 26 Dec 13
Posts: 86
Credit: 1,292,358,731
RAC: 0
Message 59312 - Posted: 25 Sep 2022, 10:39:15 UTC - in response to Message 59311.  

you should inform the developer of your analysis and code fix so that other Windows users can benefit.

I have already sent abouh a PM pointing to this thread, just in case.

abouh
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59335 - Posted: 26 Sep 2022, 7:33:56 UTC - in response to Message 59307.  
Last modified: 26 Sep 2022, 7:45:45 UTC

Hello, thank you very much for your help. I would like to implement the changes if they help optimise the tasks, but let me try to summarise your ideas to see if I got them right:



Change A --> As you say, the original file .tar.gz is first copied to the working directory and then unpacked in a 2-step process (tar.gz to tar and tar to plain files) and the tar.gz and tar files lie around after that. You suggest that these files should be deleted to save space and I agree, makes sense. Probably the sequence should be:
1) move .tar.gz file from project directory to working directory.
2) unpack .tar.gz to .tar
3) delete .tar.gz file
4) unpack .tar file to plain files
5) delete .tar file
This one is straightforward to implement.
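
For illustration only, the Change A sequence (steps 2 to 5) could look like this in Python. The actual application uses 7za and its code is not shown in this thread, so the function name, the intermediate file name, and the demo archive are all hypothetical. Step 1 is assumed to have already placed the archive in the working directory.

```python
import gzip
import io
import os
import shutil
import tarfile
import tempfile


def unpack_with_cleanup(tar_gz_path, work_dir):
    """Two-step unpack that deletes each intermediate file once it is no longer needed."""
    tar_path = os.path.join(work_dir, "app_files.tar")  # hypothetical intermediate name

    # Step 2: decompress .tar.gz to .tar
    with gzip.open(tar_gz_path, "rb") as src, open(tar_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Step 3: the compressed archive is no longer needed
    os.remove(tar_gz_path)

    # Step 4: unpack the .tar into plain files
    # (use the safer "data" extraction filter where this Python version supports it)
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(tar_path, mode="r:") as archive:
        archive.extractall(work_dir, **extra)
    # Step 5: the .tar is no longer needed either
    os.remove(tar_path)


with tempfile.TemporaryDirectory() as work_dir:
    # Stand-in for step 1: a tiny demo archive created directly in the working directory.
    tar_gz_path = os.path.join(work_dir, "demo_app.tar.gz")
    payload = b"placeholder application file\n"
    with tarfile.open(tar_gz_path, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="app/run.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

    unpack_with_cleanup(tar_gz_path, work_dir)
    remaining = sorted(os.listdir(work_dir))
    print(remaining)
```

After the call, only the unpacked files remain in the working directory; the peak usage is still the .tar plus the plain files, which is what Change B addresses.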




Change B --> Additionally, you suggest replacing the copying and the 2-step unpacking process with a single-step process, using the command line you propose. So the sequence would be further simplified to:
1) unpack .tar.gz to plain files
2) delete .tar.gz file
The only problem I see here is that I believe I cannot modify the step of first copying the files from the project directory (\projects\www.gpugrid.net) to the working directory (\slots\X\). It is common to all projects, even the ones that do not contain files to be unpacked later. So, in order not to interfere with other GPUGRID projects, the sequence should be:
1) move .tar.gz file from project directory to working directory.
2) unpack .tar.gz to plain files
3) delete .tar.gz file

In this case, would the command line simply be this one, without the -o flag?
7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17" -so | 7za x -aoa -si -ttar





Change C --> Finally, you suggest using .7z compression instead of .tar.gz, together with a more recent version of 7za, to save space and unpacking time.


Is all the above correct?

I believe these changes are worth implementing, thank you very much. I will try to start with Change A and Change B and roll them out to PythonGPUbeta first to test them this week.

©2025 Universitat Pompeu Fabra