Message boards : News : Experimental Python tasks (beta) - task description
| Author | Message |
|---|---|
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Erich56 asked: Question: can a running Python task be interrupted while a Windows Update takes place (with a reboot of the PC), or does this damage the task? I have now tried it - the two tasks, running on an RTX 3070 each, on Windows did not survive a reboot :-( |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Since yesterday, when I upgraded the RAM of one of my PCs from 64GB to 128GB (so I now have a 64GB RAM disk plus 64GB system RAM; before it was half each), every GPUGRID Python task fails on this PC with its two RTX 3070s. The task starts okay, RAM as well as VRAM fills up continuously, and CPU usage is close to 100%; after a while (a few minutes up to half an hour) the task fails. The BOINC manager says "aborted by the project", and the task description says "aufgegeben" (German for "abandoned"). Interestingly, no times are shown, neither runtime nor CPU time, and there is no stderr. See this example: https://www.gpugrid.net/result.php?resultid=33044774 On another machine, I have two tasks running simultaneously on one GPU - no problem at all. I was of course thinking of a defective RAM module; however, all night I had 5 LHC ATLAS tasks (3 cores each) running simultaneously without any problem, so I guess that was RAM test enough. Hundreds of WCG GPU tasks were also processed this morning for hours, likewise without any problem. Anyone any ideas? |
|
Joined: 1 Sep 10 Posts: 15 Credit: 888,019,989 RAC: 131
|
I'm new to config editing :) A few more questions. Do I need to be more specific in the <name> tag and put the full application name, like Python apps for GPU hosts 4.03 (cuda1131) from the task properties? I ask because I don't see 3 CPUs being given to the task after a client restart:
Application: Python apps for GPU hosts 4.03 (cuda1131)
Name: e00015a03227-ABOU_rnd_ppod_expand_demos25-0-1-RND8538
State: Running
Received: Tue 20 Sep 2022 10:48:34 PM +05
Report deadline: Sun 25 Sep 2022 10:48:34 PM +05
Resources: 0.99 CPUs + 1 NVIDIA GPU
Estimated computation size: 1,000,000,000 GFLOPs
CPU time: 00:48:32
CPU time since checkpoint: 00:00:07
Elapsed time: 00:11:37
Estimated time remaining: 50d 21:42:09
Fraction done: 1.990%
Virtual memory size: 18.16 GB
Working set size: 5.88 GB
Directory: slots/8
Process ID: 5555
Progress rate: 6.840% per hour
Executable: wrapper_26198_x86_64-pc-linux-gnu |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and resumed later with no penalty. Restarting works fine on Windows. It might be the five-minute pause at 2% that is causing the confusion. |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 662
|
Get rid of the ram disk. |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 662
|
Any already-downloaded task will keep the original CPU/GPU resource assignment. Any newly downloaded task will show the NEW assignment. The name for the tasks is PythonGPU, as you show. You should always refer to the client_state.xml file, as it is the final arbiter of the correct naming and task configuration. |
|
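To make the client_state.xml check concrete, here is a minimal sketch. The snippet written below is an illustrative stand-in, not the full BOINC schema, and the real file lives in the BOINC data directory; only the grep idea carries over:

```shell
# Illustrative stand-in for the relevant fragment of client_state.xml
# (the real file is in the BOINC data directory and is much larger).
cat > client_state.xml <<'EOF'
<app_version>
  <app_name>PythonGPU</app_name>
  <avg_ncpus>3.000000</avg_ncpus>
  <coproc><type>NVIDIA</type><count>1.000000</count></coproc>
</app_version>
EOF
# Pull out the app name and its CPU/GPU resource assignment
grep -E 'app_name|avg_ncpus|count' client_state.xml
```

Running the same grep against the real file shows whether the client has actually picked up the cpu_usage/gpu_usage values from app_config.xml.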
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 662
|
Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and resumed later with no penalty. If you interrupt the task during Stage 1, while it is downloading and unpacking the required support files, it may fail on Windows upon restart; it normally reports this failure reason in stderr.txt. Best to interrupt the task only after setup, once it is actually calculating and has produced at least one checkpoint. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
On the other hand, the RAM disk works perfectly on this machine: https://www.gpugrid.net/show_host_detail.php?hostid=599484 |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 662
|
Then you need to investigate the differences between the two hosts. All I'm saying is that the RAM disk is an unnecessary complication that is not needed to process the tasks. Basic troubleshooting: reduce to the most basic configuration needed for the tasks to complete correctly, then add back one superfluous element at a time until the tasks fail again. Then you have identified why the tasks fail. |
|
Joined: 1 Sep 10 Posts: 15 Credit: 888,019,989 RAC: 131
|
Keith Myers thanks! |
|
Joined: 1 Sep 10 Posts: 15 Credit: 888,019,989 RAC: 131
|
In my case config didn't want to work until I added <max_concurrent>
<app_config>
<app>
<name>PythonGPU</name>
<max_concurrent>1</max_concurrent>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
Now I see the expected status: Running (3 CPUs + 1 NVIDIA GPU). Unfortunately it doesn't help to get high GPU utilization; completion time looks like it is going to be slightly better, though. |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 662
|
In my case config didn't want to work until I added <max_concurrent> If you have enough cpu for support and enough VRAM on the card, you can get better gpu utilization by moving to 2X tasks on the card. Just change the gpu_usage to 0.5 |
|
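As a sketch, the app_config.xml shown earlier in the thread adjusted for 2X tasks per GPU might look like the fragment below. Raising max_concurrent to 2 is my assumption so that both tasks can actually start; gpu_usage 0.5 is the change Keith describes:

```xml
<app_config>
  <app>
    <name>PythonGPU</name>
    <max_concurrent>2</max_concurrent>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>3.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

With cpu_usage at 3.0, two concurrent tasks will reserve 6 CPU threads, so this only makes sense on a host with cores and VRAM to spare.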
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
I installed a RAM disk because quite often I am crunching tasks which write many GB of data to disk, e.g. LHC ATLAS, the GPU tasks from WCG, the Pythons from Rosetta, and last but not least the Pythons from GPUGRID: about 200GB within 24 hours, which is a lot (so for my two RTX 3070s, this would be 400GB/day). If the machines are running 24/7, in my opinion this is simply not good for an SSD's lifetime. Over the years, my experience with the RAM disk has been good. No idea what kind of problem the GPUGRID Pythons have with this particular RAM disk - or vice versa. As said, on another machine with a RAM disk I also have 2 Pythons running concurrently, even on one GPU, and it works fine. So what I did yesterday evening was let only one of the two RTX 3070s crunch a Python; on the other GPU, I sometimes crunched WCG or nothing at all. This evening, after about 22-1/2 hours, the Python finished successfully :-) BTW - besides the Python, 3 ATLAS tasks (3 cores each) were also running the whole time. Which means: what I know so far is that I can obviously run Pythons on at least one of the two RTX 3070s, and other projects on the other one. Still, I will try to further investigate why GPUGRID Pythons don't run on both RTX 3070s. |
|
Joined: 26 Dec 13 Posts: 86 Credit: 1,292,358,731 RAC: 0
|
I do not know how to properly mention the project administrators in this topic in order to draw attention to the problem of non-optimal use of disk space by this application. Only now did I notice what is contained in the slotX directory while a task is running. I was very surprised to see there, in addition to the unpacked application files, the archive itself from which these files are unpacked - and the archive is present in two copies at once, apparently due to the suboptimal process of unpacking the tar.gz format. The application's files themselves occupy only half the working directory (slotX). Apparently, when the application starts, the following happens: 1) The source archive (pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17) is copied from the project directory (\projects\www.gpugrid.net) to the working directory (\slots\X\). 2) The archive is then unzipped (tar.gz >> tar). 3) Finally, the application files are unpacked from the tar container. At the end of the process, the now-unnecessary tar and tar.gz files are (for some reason) not deleted from the working directory. Thus, not only does the peak amount of space occupied by each instance of this WU reach ~16 GiB, but this space stays occupied until the WU completes. The whole process also requires both much more time (copying and unpacking) and a larger amount of written data: Project tar.gz >> slotX (2.66 GiB) >> tar (5.48 GiB) >> app files (5.46 GiB) = 13.6 GiB Both parameters can be significantly reduced by unpacking files directly into the working directory from the source archive, without the intermediate stages mentioned. 7za, which is used for unpacking the archives, supports pipelining: 7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17" -so | 7za x -aoa -si -ttar -o"X:\BOINC\slots\0\" Project tar.gz >> app files (5.46 GiB) = 5.46 GiB!
Moreover, if you use the 7z format (LZMA2 with the "5 - Normal" profile, the default in recent 7-Zip versions) for the archive instead of tar.gz, you can not only substantially reduce the amount of data downloaded by each user - and consequently the load on the project's infrastructure bandwidth, saving more than one GiB - but also speed up unpacking. On my computer, unpacking by pipelining (as mentioned above) using the current, 12-year-old 7za version (9.20) takes ~100 seconds; with a recent version of 7za (22.01), only ~45-50 seconds. 7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.7z" -o"X:\BOINC\slots\0\" I believe the result of the described changes makes them worth implementing (even if not all of them, and/or not all at once). All the changes amount to updating one executable file, repacking the archive, and changing the command used to unpack it. |
|
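The one-step pipelined extraction proposed above can be demonstrated with a small self-contained sketch. It uses the portable gzip/tar equivalents rather than 7za (which may not be installed everywhere), and the file names are toy placeholders; the point is that no intermediate .tar file ever touches the disk:

```shell
# Build a toy tar.gz standing in for the project package (names are made up)
mkdir -p demo/src demo/out
echo "payload" > demo/src/file.txt
tar -czf demo/app.tar.gz -C demo/src file.txt
# One step: decompress to stdout and untar straight from the pipe -
# no intermediate .tar is ever written to disk
gzip -dc demo/app.tar.gz | tar -x -C demo/out
cat demo/out/file.txt   # -> payload
```

This mirrors the `7za x ... -so | 7za x -si -ttar` pipeline: the first command streams the decompressed tar to stdout, the second unpacks it from stdin.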
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 662
|
I believe the researcher has already been down this road, with Windows not natively supporting the compression/decompression algorithms you mention; it would require each volunteer to add support manually to their hosts. In the quest for compatibility, a researcher tries to package applications so that all attached hosts can run them natively, without jumping through hoops, so that everyone can run the tasks. |
|
Joined: 26 Dec 13 Posts: 86 Credit: 1,292,358,731 RAC: 0
|
It requires each volunteer to add support manually to their hosts. No. Unfortunately, you have read what I wrote above inattentively: I already mentioned there that the Windows app currently ships with 7za.exe version 9.20 (you can find it in the project folder). So nothing changes for volunteers. |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
Yes, I do have GPUGrid installed on my Win10 machine after all. And 7za.exe is in the project folder, just not in the project folder on my Ubuntu machine. |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 662
|
It requires each volunteer to add support manually to their hosts. OK, so you can thank Richard Haselgrove for getting the application to package that utility. Originally the tasks failed because Windows does not ship with it, and Richard helped the developer debug the issue. If you think the application is not using the utility correctly, you should inform the developer of your analysis and code fix so that other Windows users can benefit. |
|
Joined: 26 Dec 13 Posts: 86 Credit: 1,292,358,731 RAC: 0
|
you should inform the developer of your analysis and code fix so that other Windows users can benefit. I have already sent abouh a PM pointing to this thread, just in case. |
|
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
Hello, thank you very much for your help. I would like to implement the changes if they help optimise the tasks, but let me try to summarise your ideas to see if I got them right:
Change A --> As you say, the original .tar.gz file is first copied to the working directory and then unpacked in a 2-step process (tar.gz to tar, and tar to plain files), and the tar.gz and tar files are left lying around afterwards. You suggest that these files should be deleted to save space, and I agree, that makes sense. The sequence should probably be: 1) move the .tar.gz file from the project directory to the working directory. 2) unpack .tar.gz to .tar 3) delete the .tar.gz file 4) unpack the .tar file to plain files 5) delete the .tar file This one is straightforward to implement.
Change B --> Additionally, you suggest replacing the copying and the 2-step unpacking process with a single-step process using the command line you propose. So the sequence would be further simplified to: 1) unpack .tar.gz to plain files 2) delete the .tar.gz file The only problem I see here is that I believe I cannot modify the step of first copying the files from the project directory (\projects\www.gpugrid.net) to the working directory (\slots\X\); it is generic for all projects, even those that do not contain files to be unpacked later. So as not to interfere with other GPUGRID applications, the sequence should be: 1) move the .tar.gz file from the project directory to the working directory. 2) unpack .tar.gz to plain files 3) delete the .tar.gz file In this case, would the command line simply be this one, without the -o flag? 7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17" -so | 7za x -aoa -si -ttar
Change C --> Finally, you suggest using .7z compression instead of .tar.gz to save space and unpacking time with a more recent version of 7za. Is all of the above correct? I believe these changes are worth implementing, thank you very much.
I will try to start with Change A and Change B and roll them out to PythonGPUbeta first to test them this week. |
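The Change A sequence sketched above (move, unpack stepwise, delete each intermediate) can be exercised with a toy archive. All file and directory names below are hypothetical stand-ins for the real project package:

```shell
set -e
# Toy stand-in for the project package (names are made up for the example)
mkdir -p proj slot
echo "app" > app.bin
tar -czf proj/pythongpu.tar.gz app.bin
# Change A: copy, unpack stepwise, delete each intermediate once it is used
cp proj/pythongpu.tar.gz slot/pkg.tar.gz   # 1) move/copy the archive into the slot
gzip -d slot/pkg.tar.gz                    # 2+3) tar.gz -> tar; gzip removes the .gz itself
tar -xf slot/pkg.tar -C slot               # 4) unpack the tar container
rm slot/pkg.tar                            # 5) delete the intermediate tar
ls slot                                    # only the unpacked app files remain
```

With this ordering, the slot briefly holds one intermediate at a time instead of accumulating both the .tar.gz and the .tar next to the unpacked files.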
©2025 Universitat Pompeu Fabra