Experimental Python tasks (beta) - task description

Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 0
Message 58862 - Posted: 25 May 2022, 17:09:27 UTC

As abouh has posted previously, the two resource types are used alternately: "cyclical GPU load is expected in Reinforcement Learning algorithms. Whenever GPU load is lower, CPU usage should increase." (message 58590). Any instantaneous observation won't reveal the full situation: either CPU will be high and GPU low, or vice versa.
ID: 58862
Boca Raton Community HS

Joined: 27 Aug 21
Posts: 38
Credit: 7,254,068,306
RAC: 0
Message 58863 - Posted: 25 May 2022, 19:35:37 UTC - in response to Message 58862.  

Yep, I observe the alternation. When I suspend all other work units, I can see that just one of these tasks will use a little more than half of the logical processors. It has been discussed before that although a task says it uses 1 processor (0.996, to be exact), it actually uses more. I am also running E@H work units, and I think that running both is choking the CPU. Is there a way to limit the processor count that these python tasks use? In the past, I changed the app config to use 32, but it did not seem to speed anything up, even though they were reserved for the work unit.

I am not sure there is a way to post images, but here are some links to show CPU and GPU usage when only running one python task. Is it supposed to use that much of the CPU?

https://i.postimg.cc/Kv8zcMGQ/CPU-Usage1.jpg
https://i.postimg.cc/LX4dkj0b/GPU-Usage-1.jpg
https://i.postimg.cc/tRM0PZdB/GPU-Usage-2.jpg

I am sorry for all of the questions.... just trying my best to understand.


ID: 58863
ServicEnginIC

Joined: 24 Sep 10
Posts: 595
Credit: 12,249,686,510
RAC: 1,390,367
Message 58864 - Posted: 25 May 2022, 19:37:27 UTC - in response to Message 58862.  

As abouh has posted previously, the two resource types are used alternately: "cyclical GPU load is expected in Reinforcement Learning algorithms. Whenever GPU load is lower, CPU usage should increase."

This can be seen clearly in the following two images.

Higher CPU - Lower GPU usage cycle:
[screenshot]

Higher GPU - Lower CPU usage cycle:
[screenshot]

CPU and GPU usage graphs follow an anticyclical pattern.
ID: 58864
Keith Myers

Joined: 13 Dec 17
Posts: 1424
Credit: 9,189,946,190
RAC: 42,316
Message 58865 - Posted: 26 May 2022, 1:17:49 UTC - in response to Message 58863.  

Is there a way to limit the processor count that these python tasks use? In the past, I changed the app config to use 32, but it did not seem to speed anything up, even though they were reserved for the work unit.

I am sorry for all of the questions.... just trying my best to understand.


No, there isn't, at the user level. These are not real MT tasks, nor any form that BOINC recognizes and provides configuration options for.

Your only solution is to run just one at a time via a max_concurrent statement in an app_config.xml file, and then also restrict the number of cores your other projects are allowed to use.
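For example, a minimal app_config.xml sketch (the app name below is an assumption; check the <name> entries in your client_state.xml for the exact one on your host):

<app_config>
    <app>
        <name>PythonGPU</name>
        <max_concurrent>1</max_concurrent>
    </app>
</app_config>

Drop it in the GPUGrid project folder and use Options > Read config files in the BOINC Manager to apply it without restarting the client.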

That said, I don't know why you are having such difficulties. Maybe chalk it up to Windows, I don't know.

I run 3 other cpu projects at the same time as the GPUGrid Python on GPU tasks, with 28-46 cpu cores occupied by Universe, TN-Grid or yoyo depending on the host. Every host primarily runs Universe as the major cpu project.

No impact on the python tasks while running the other cpu apps.
ID: 58865
ServicEnginIC

Joined: 24 Sep 10
Posts: 595
Credit: 12,249,686,510
RAC: 1,390,367
Message 58866 - Posted: 26 May 2022, 12:51:52 UTC - in response to Message 58865.  

No impact on the python tasks while running the other cpu apps.

Conversely, I notice a performance loss on other CPU tasks when python tasks are running.
Yesterday I processed python task e7a30-ABOU_rnd_ppod_demo_sharing_large-0-1-RND2847_2 at my host #186626.
It was received at 11:33 UTC, and the result was returned at 22:50 UTC.
During the same period, PrimeGrid PPS-MEGA CPU tasks were also being processed.
The mean processing time for eighteen (18) PPS-MEGA CPU tasks was 3,098.81 seconds.
The mean processing time for 18 other PPS-MEGA CPU tasks processed outside that period was 2,699.11 seconds.
This represents an extra processing time of about 400 seconds per task (3,098.81 - 2,699.11 = 399.70 s), or about a 12.9% performance loss (399.70 / 3,098.81 ≈ 0.129).
There is no such noticeable difference when running Gpugrid ACEMD tasks.
ID: 58866
Keith Myers

Joined: 13 Dec 17
Posts: 1424
Credit: 9,189,946,190
RAC: 42,316
Message 58867 - Posted: 26 May 2022, 17:59:57 UTC

I also notice an impact on my running Universe tasks. Running in conjunction with a python task generally adds 300 seconds to their normal computation times.
ID: 58867
captainjack

Joined: 9 May 13
Posts: 171
Credit: 4,610,796,466
RAC: 26,261
Message 58871 - Posted: 28 May 2022, 2:03:28 UTC

Windows 10 machine running task 32899765. Had a power outage. When the power came back on, task was restarted but just sat there doing nothing. The stderr.txt file showed the following error:

file pythongpu_windows_x86_64__cuda102.tar
already exists. Overwrite with
pythongpu_windows_x86_64__cuda102.tar?
(Y)es / (N)o / (A)lways / (S)kip all / A(u)to rename all / (Q)uit?


Task was stalled waiting on a response.

BOINC was stopped and the pythongpu_windows_x86_64__cuda102.tar file was removed from the slots folder.

The computer was restarted, then the task was restarted. Then the following error message appeared several times in the stderr.txt file.

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\slots\0\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
Detected memory leaks!


The page file size was increased to 64,000 MB and the machine rebooted.

Started the task again and still got the error message about the page file being too small. Then the task abended.

If you need more info about this task, please let me know.
ID: 58871
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58876 - Posted: 28 May 2022, 16:41:56 UTC - in response to Message 58871.  

Thank you captainjack for the info.


1.

Interesting that the job gets stuck with:

(Y)es / (N)o / (A)lways / (S)kip all / A(u)to rename all / (Q)uit?


The job command line is the following:

7za.exe pythongpu_windows_x86_64__cuda102.tar -y


and from the application documentation (https://info.nrao.edu/computing/guide/file-access-and-archiving/7zip/7z-7za-command-line-guide):

7-Zip will prompt the user before overwriting existing files unless the user specifies the -y


So essentially -y assumes "Yes" on all queries. Honestly, I am confused by this behaviour; thanks for pointing it out. Maybe I am missing the x command, as in

7za.exe x pythongpu_windows_x86_64__cuda102.tar -y


I will test it on the beta app.




2.

Regarding the other error

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\slots\0\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
Detected memory leaks!


is related to pytorch and nvidia, and it only affects some Windows machines. It is explained here: https://stackoverflow.com/questions/64837376/how-to-efficiently-run-multiple-pytorch-processes-models-at-once-traceback

TL;DR: Windows and Linux treat multiprocessing in Python differently, and on Windows each process commits much more memory, especially when using pytorch.

We use the script suggested in the link to mitigate the problem, but it could be that for some machines memory is still insufficient. Does that make sense in your case?
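For illustration, a minimal sketch (hypothetical, not our actual job script) of why this hits Windows harder: multiprocessing there always uses the "spawn" start method, so every worker re-imports torch and creates its own CUDA context, committing a large block of memory before doing any real work.

import torch
import torch.multiprocessing as mp

def worker(rank: int) -> None:
    # The first CUDA allocation forces this process to create its own
    # CUDA context, committing hundreds of MB of memory by itself.
    x = torch.ones(1, device="cuda")
    print(f"worker {rank}: {x.item()}")

if __name__ == "__main__":
    mp.set_start_method("spawn")  # mandatory on Windows, optional on Linux
    procs = [mp.Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()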


ID: 58876
captainjack

Joined: 9 May 13
Posts: 171
Credit: 4,610,796,466
RAC: 26,261
Message 58878 - Posted: 29 May 2022, 14:19:12 UTC

Thank you abouh for responding,

I looked through my saved messages from the task to see if there was anything else I could find that might be of value and couldn't find anything.

In regard to the "out of memory" error, I tried to read through the stackoverflow link about the memory error. It is way above my level of technical expertise at this point, but it seemed like the amount of nvidia memory might have something to do with it. I am using an antique GTX970 card. It's old but still works.

Good luck coming up with a solution. If you want me to do any more testing, please let me know.
ID: 58878
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58879 - Posted: 31 May 2022, 8:20:29 UTC
Last modified: 31 May 2022, 8:21:09 UTC

Seems like there are some possible workarounds:

https://github.com/Spandan-Madan/Pytorch_fine_tuning_Tutorial/issues/10

Basically, two users mentioned:


I think I managed to solve it (so far). Steps were:

1)- Windows + pause key
2)- Advanced system settings
3)- Advanced tab
4)- Performance - Settings button
5)- Advanced tab - Change button
6)- Uncheck the "Automatically... BLA BLA" checkbox
7)- Select the System managed size option box.


and

If it's of any value, I ended up setting the values into manual and some ridiculous amount of 360GB as the minimum and 512GB for the maximum. I also added an extra SSD and allocated all of it to Virtual memory. This solved the problem and now I can run up to 128 processes using pytorch and CUDA.
I did find out that every launch of Python and pytorch, loads some ridiculous amount of memory to the RAM and then when not used often goes into the virtual memory.


Maybe it can be helpful for someone.
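For those who prefer the command line, the same settings can presumably be applied from an elevated prompt with wmic (a sketch based on standard Windows documentation, untested here; sizes are in MB, so adjust them and the drive letter to your system):

wmic computersystem where name="%computername%" set AutomaticManagedPagefile=False
wmic pagefileset where name="C:\\pagefile.sys" set InitialSize=16384,MaximumSize=65536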
ID: 58879
bibi

Joined: 4 May 17
Posts: 15
Credit: 17,494,375,743
RAC: 352,575
Message 58880 - Posted: 1 Jun 2022, 19:13:37 UTC - in response to Message 58876.  

Hi abouh,

is there a command line like
7za.exe pythongpu_windows_x86_64__cuda102.tar.gz
without -y, to get pythongpu_windows_x86_64__cuda102.tar?

ID: 58880
WR-HW95

Joined: 16 Dec 08
Posts: 7
Credit: 1,549,469,403
RAC: 0
Message 58881 - Posted: 1 Jun 2022, 20:23:47 UTC

So what's going on here?
https://www.gpugrid.net/workunit.php?wuid=27228431
RuntimeError: CUDA out of memory. Tried to allocate 446.00 MiB (GPU 0; 11.00 GiB total capacity; 470.54 MiB already allocated; 8.97 GiB free; 492.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
22:40:37 (12736): python.exe exited; CPU time 3346.203125

All kinds of errors on other tasks, from "too old card" (1080 Ti) to out of RAM.
Atm the commit charge is 70 GB and RAM usage is 22 GB of 64 GB.
ID: 58881
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58883 - Posted: 6 Jun 2022, 9:49:06 UTC - in response to Message 58880.  

The command line

7za.exe pythongpu_windows_x86_64__cuda102.tar.gz


works fine if the job is executed without interruptions.

However, if the job is interrupted and restarted later, the command is executed again, and 7za needs to know whether or not to replace the already existing files with the new ones.

The -y flag is just there to make sure the script does not get stuck at that prompt waiting for an answer.
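So, assuming the missing x extract command from message 58876 is also added, the full restart-safe invocation would be:

7za.exe x pythongpu_windows_x86_64__cuda102.tar -y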

ID: 58883
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58884 - Posted: 6 Jun 2022, 10:29:20 UTC - in response to Message 58881.  

Unfortunately, recent versions of PyTorch do not support all GPUs; older ones might not be compatible...

Regarding this error

RuntimeError: CUDA out of memory. Tried to allocate 446.00 MiB (GPU 0; 11.00 GiB total capacity; 470.54 MiB already allocated; 8.97 GiB free; 492.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
22:40:37 (12736): python.exe exited; CPU time 3346.203125


Does it happen recurrently on the same machine, or does it depend on the job?
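In the meantime, one guess worth trying (based only on the hint in the error message, not a confirmed fix) would be to set the allocator option it mentions before torch makes its first CUDA allocation in the job script:

import os
# Hypothetical mitigation; the 128 MB split size is a guess, see the
# PyTorch Memory Management docs for PYTORCH_CUDA_ALLOC_CONF.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
import torch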
ID: 58884
Keith Myers

Joined: 13 Dec 17
Posts: 1424
Credit: 9,189,946,190
RAC: 42,316
Message 58886 - Posted: 6 Jun 2022, 20:18:53 UTC - in response to Message 58881.  

So what's going on here?
https://www.gpugrid.net/workunit.php?wuid=27228431

All kinds of errors on other tasks, from "too old card" (1080 Ti) to out of RAM.
Atm the commit charge is 70 GB and RAM usage is 22 GB of 64 GB.


The problem is not with the card but with the Windows environment.

I have no issues running the Python on GPU tasks in Linux on my 1080 Ti card.

https://www.gpugrid.net/results.php?hostid=456812
ID: 58886
STARBASEn

Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Message 58906 - Posted: 11 Jun 2022, 19:18:07 UTC

Well so far, these new python WUs have been consistently completing, and even surviving multiple reboots, OS kernel upgrades, and OS upgrades:

Kernels --> 5.17.13
OS Fedora35 --> Fedora36

3 machines w/GTX 1060, NVIDIA driver 510.73.05
ID: 58906
Keith Myers

Joined: 13 Dec 17
Posts: 1424
Credit: 9,189,946,190
RAC: 42,316
Message 58907 - Posted: 11 Jun 2022, 20:31:43 UTC

Yes, one nice thing about the Python gpu tasks is that they survive a reboot and can be restarted on a different gpu without erroring.

Very nice compared to the acemd3/4 tasks, which will error out under similar circumstances.

The Python tasks create and reread checkpoints very well. Upon restart the task will show 1% completion, but after a while it jumps forward to the point where the task was stopped, exited or suspended, and continues on to the finish.
ID: 58907
STARBASEn

Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Message 58915 - Posted: 12 Jun 2022, 15:36:57 UTC

Yes, one nice thing about the Python gpu tasks is that they survive a reboot and can be restarted on a different gpu without erroring.


Good to know as I did not try a driver update or using a different GPU on a WU in progress.

I do think BOINC needs to patch its estimated time to completion. "XXX days remaining" makes it impossible to keep any in a cache.

ID: 58915
Keith Myers

Joined: 13 Dec 17
Posts: 1424
Credit: 9,189,946,190
RAC: 42,316
Message 58919 - Posted: 12 Jun 2022, 18:37:36 UTC

I haven't had any reason to carry a cache. I have my cache level set at only one task for each host as I don't want GPUGrid to monopolize my hosts and compete with my other projects.

That said, I haven't gone 12 hours without a Python task on every host at all times.
ID: 58919
Keith Myers

Joined: 13 Dec 17
Posts: 1424
Credit: 9,189,946,190
RAC: 42,316
Message 58920 - Posted: 12 Jun 2022, 18:40:52 UTC - in response to Message 58915.  

Yes, one nice thing about the Python gpu tasks is that they survive a reboot and can be restarted on a different gpu without erroring.


Good to know as I did not try a driver update or using a different GPU on a WU in progress.

I do think BOINC needs to patch its estimated time to completion. "XXX days remaining" makes it impossible to keep any in a cache.


BOINC would have to completely rewrite that part of the code. The fact that these tasks run on both the cpu and the gpu makes them impossible for BOINC to decipher.

The closest mechanism is the MT (multi-thread) category, but that only covers cpu tasks which run solely on the cpu.
ID: 58920