Message boards :
News :
Experimental Python tasks (beta) - task description
Message board moderation
| Author | Message |
|---|---|
|
Send message Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
I notice a big difference in VRAM use between various Python tasks and/or systems, e.g.:
- GPU running 3 tasks simultaneously: 5,250 MB
- GPU running 2 tasks simultaneously: 5,012 MB
- GPU running 2 tasks simultaneously: 8,055 MB
With the third one cited above I was lucky: the VRAM of that GPU is 8,142 MB (FYI, all values include a few hundred MB for the monitor). Has anyone else had the same experience? |
|
Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
Hello Aleksey, yes, I struggled a bit with the single-command solution. A BOINC job requires specifying tasks in the following way. <task> And this is the command that should work, right?

7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.txz.1a152f102cdad20f16638f0f269a5a17" -so | 7za x -aoa -si -ttar

Isn't it actually using 7za twice? After some testing, the conclusion I arrived at is that in principle it actually requires 2 BOINC tasks, because 7za decompresses .txz to .tar, and then .tar to plain files. The only way to do it in one task would be to compress the files into a format that 7za can decompress in a single call (like zip, but we already discussed that zipped files are too big). Does anyone know if that reasoning is correct? Can BOINC wrappers execute commands like the one Aleksey suggested? |
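For comparison, outside the wrapper's constraints a single streaming pass is possible: here is a minimal Python sketch (illustrative, not the project's actual wrapper) using the standard tarfile module, which handles the xz decompression and the untarring of a .txz in one call:

```python
import tarfile

def extract_txz(archive_path, dest_dir):
    """Decompress xz and untar in one streaming pass (no intermediate .tar on disk)."""
    with tarfile.open(archive_path, mode="r:xz") as tar:
        tar.extractall(path=dest_dir)
```

This avoids both the second 7za invocation and the temporary .tar, at the cost of needing a Python interpreter available at extraction time.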
|
Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
Hello, of course, let me explain. The task names "demos25" and "demos25_2" belong to 2 different variants of the same experiment; in particular, the selection of the agents sent to GPUGrid is different. In both experiments the AI agents sent to GPUGrid learn using Reinforcement Learning, a machine learning technique that allows them to learn specific behaviours from interactions with their simulated environment (actually, to make it faster, they interact with 32 copies of the environment at the same time, the famous 32 threads). Also in both cases, when the agents "discover" something relevant, the job finishes and the info is sent back to be shared with the rest of the population. The difference is that in "demos25_2" I am experimenting with a more careful selection of the environment regions each agent is targeted to explore: I try to direct each agent to explore a different region of the environment (or one with little overlap with the rest). The result is that agents in "demos25_2" are more likely to find something relevant that the rest of the population has not found yet, and therefore more likely to finish earlier. The "demos25" experiment, by contrast, uses a more "brute force" approach, and as the population grows it becomes more difficult for new agents to discover new things. I hope the explanation makes sense; let me know if you have any other questions and I will try to answer them as well. There is also an experiment "demos25_3" in progress which is similar to "demos25_2". |
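The parallel-copies idea can be sketched in a few lines of Python (a toy illustration with made-up names and probabilities, not the project's actual pytorchrl code):

```python
import random

class ToyEnv:
    """Stand-in environment: returns reward 1.0 when the agent 'discovers' something."""
    def __init__(self, seed):
        self.rng = random.Random(seed)

    def step(self, action):
        # hypothetical 1-in-100 chance per step of a relevant discovery
        return 1.0 if self.rng.random() < 0.01 else 0.0

def step_population(envs, actions):
    """Step all environment copies; the job can finish as soon as any copy discovers something."""
    rewards = [env.step(a) for env, a in zip(envs, actions)]
    done = any(r > 0 for r in rewards)
    return rewards, done

envs = [ToyEnv(seed=i) for i in range(32)]  # the "famous 32 threads"
rewards, done = step_population(envs, [0] * 32)
```

The "demos25_2" refinement would correspond to seeding or biasing each copy toward a different region of the environment so their discoveries overlap less.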
|
Send message Joined: 18 Jul 13 Posts: 79 Credit: 210,528,292 RAC: 0
|
Each task patches several DLLs to disable ASLR and make .nv_fatb sections read-only, and leaves 1.93 GB of backup files:

05.01.2022 10:28 70 403 584 cudnn_ops_train64_8.dll_bak
05.01.2022 10:23 88 405 504 cudnn_ops_infer64_8.dll_bak
03.08.2022 04:04 1 329 664 torch_cuda_cpp.dll_bak
05.01.2022 11:21 81 487 360 cudnn_cnn_train64_8.dll_bak
05.01.2022 10:36 129 872 896 cudnn_adv_infer64_8.dll_bak
05.01.2022 10:46 97 293 824 cudnn_adv_train64_8.dll_bak
03.08.2022 05:05 871 934 464 torch_cuda_cu.dll_bak
05.01.2022 11:15 736 718 848 cudnn_cnn_infer64_8.dll_bak

Can patched DLLs be included in pythongpu_windows_x86_64__cuda1131.txz? |
|
Send message Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
I notice a big difference in VRAM use between various Python tasks and/or systems, eg: More powerful GPUs will use more VRAM than less powerful GPUs; it scales roughly with the core count of the GPU, so a 3090 would use more VRAM than, say, a 1050 Ti on the same exact task. It's just the way it works: when the GPU sets up the task, if the task has to scale to 10,000 cores instead of 2,000, it needs to use more memory.
|
|
Send message Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
more powerful GPUs will use more VRAM than less powerful GPUs, it scales roughly with core count of the GPU. Okay, I see. Many thanks for explaining :-) One thing that's a pity here is that the GPU with the largest VRAM (Quadro P5000: 16 GB) has the lowest number of cores (2,560) :-( But, as so often: one cannot have everything in life :-) |
|
Send message Joined: 18 Jul 13 Posts: 79 Credit: 210,528,292 RAC: 0
|
Is there anyone here with an NVIDIA A100 80GB? |
|
Send message Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
Is there anyone here with an NVIDIA A100 80GB? Only those with $10,000 to spare to use for free on DC, so likely no one ;) lol. Faster GPUs don't provide much benefit for these tasks since they are so CPU-bound. Sure, there's a lot of VRAM on this card, and maybe you could theoretically spin up 10-15 tasks on a single card, but unless you have A LOT of CPU power and bandwidth to feed it, you're gonna hit another bottleneck before you can hope to benefit from running that many tasks. Just 6 tasks max out my EPYC 7443P's 48 threads @ 3.9 GHz. Maybe in the future the project can get these tasks to the point where they lean more on the GPU tensor cores and a more GPU-only environment, but for now it's mostly a CPU environment with a small contribution by the GPU.
|
|
Send message Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
I just wanted to download another Python task, but the BOINC event log tells me the following:

13.10.2022 07:49:38 | GPUGRID | Message from server: Python apps for GPU hosts needs 1296.10 MB more disk space. You currently have 32082.50 MB available and it needs 33378.60 MB.

I wonder why a Python task needs 33,378 MB of free disk space. Experience has shown that a Python task takes some 8 GB of disk space while being processed. So how come it says it needs 33 GB? |
|
Send message Joined: 26 Dec 13 Posts: 86 Credit: 1,292,358,731 RAC: 0
|
Check my previous post about space usage at the PythonGPU startup stage. Previously: tar.gz >> slotX (2.66 GiB) >> tar (5.48 GiB) >> app files (~8.13 GiB) = 16.27 GiB (since the archives (tar.gz & tar) were not deleted). Now, after implementation of some improvements, peak consumption is about 13.61 GiB, and then (after the startup stage) ~8.13 GiB. In any case, it seems to require adjustment. |
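The figures above can be checked with a little arithmetic (values taken from the post; the assumption that the improvement is deleting the tar.gz before untarring is mine):

```python
# Disk use during PythonGPU startup, in GiB, per the post above
tar_gz = 2.66      # compressed archive copied into the slot
tar = 5.48         # intermediate .tar after xz decompression
app_files = 8.13   # extracted application files

old_peak = tar_gz + tar + app_files  # archives not deleted during extraction
new_peak = tar + app_files           # assumed: tar.gz removed before untarring
steady = app_files                   # after the startup stage

print(f"old peak ~ {old_peak:.2f} GiB, new peak ~ {new_peak:.2f} GiB, steady ~ {steady:.2f} GiB")
```

This reproduces the 16.27 GiB and 13.61 GiB peaks quoted, both well below the ~33 GB the server currently requests.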
|
Send message Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
In any case, it seems to require adjustment. I agree |
|
Send message Joined: 26 Dec 13 Posts: 86 Credit: 1,292,358,731 RAC: 0
|
Yeah, it seems you are right. Try using this:

<task>
<application>C:\Windows\System32\cmd.exe</application>
<command_line>/C ".\7za.exe x pythongpu_windows_x86_64__cuda1131.txz -so | .\7za.exe x -aoa -si -ttar"</command_line>
</task> |
|
Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
Patching seemed to be required to run as many threads with pytorchrl as these jobs do; otherwise Windows used a lot of memory for every new thread. The script that does the patching is relatively fast, so doing it locally would not save a lot of time. However, are you saying that after the patching some files could be deleted to further optimise disk use? If that is the case, I can look into it. These .dll_bak files? I am not very used to Windows... |
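If deleting the backups turns out to be safe, the cleanup could be sketched like this (a hypothetical helper, assuming the *_bak copies really are unneeded once the patched DLLs work):

```python
from pathlib import Path

def remove_patch_backups(app_dir, dry_run=True):
    """Find the *.dll_bak copies left by the DLL-patching step.

    With dry_run=True only reports what would be removed and how many
    bytes would be freed; with dry_run=False actually deletes them.
    """
    backups = sorted(Path(app_dir).rglob("*.dll_bak"))
    freed = sum(f.stat().st_size for f in backups)
    if not dry_run:
        for f in backups:
            f.unlink()
    return backups, freed
```

Running it with dry_run=True first would let a volunteer confirm the ~1.93 GB figure before anything is removed.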
|
Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
Does anyone know if these requirements are estimated by BOINC and adjusted over time, like completion time? Or is manual adjustment required? |
|
Send message Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
my runtime estimates have come down to basically reasonable and realistic levels now, so I think it will adjust on its own over time.
|
|
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 318
|
abouh's message 59454 was in response to a question about disk storage requirements. No, they won't adjust themselves over time: the amount of disk space required by the task is set by the server, and the amount available to the client is calculated from readings taken of the current state of the host computer. They will only change if the user adjusts the hardware or BOINC client options, or the project staff adjust the job specifications passed to the workunit generator.

On the subject of runtimes: the (calculated) runtime estimate relies on just three things:
- the job speed (sent by the server in the <app_version> specification),
- the job size (again set on the server), and
- the Duration Correction Factor (dynamically adjusted by the client).

SPEED seems to have fallen by approaching a half over the last month, but I haven't currently got a job I can verify that for. SIZE has remained the same while I've been monitoring it. DCF will have fallen dramatically - mine is now below 1. |
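The three ingredients combine as size / speed, scaled by DCF; a minimal sketch with illustrative names and numbers (not BOINC's actual code):

```python
def estimated_runtime_seconds(job_size_flops, job_speed_flops_per_sec, dcf):
    """BOINC-style runtime estimate: job size divided by app_version speed,
    scaled by the client's Duration Correction Factor."""
    return job_size_flops / job_speed_flops_per_sec * dcf

# e.g. a job sized at 1e15 FLOPs on an app_version rated at 1e11 FLOPS,
# with a DCF that has converged below 1 (all numbers made up):
est = estimated_runtime_seconds(1e15, 1e11, 0.8)
```

This also shows why estimates shrink over time even with SIZE fixed: a falling SPEED raises the estimate, while the client steadily pulling DCF down lowers it.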
|
Send message Joined: 18 Jul 13 Posts: 79 Credit: 210,528,292 RAC: 0
|
What can this output mean?

e00003a00008-ABOU_rnd_ppod_expand_demos25_9-0-1-RND2053
Update 464, num samples collected 118784, FPS 344
Algorithm: loss 0.1224, value_loss 0.0002, ivalue_loss 0.0113, rnd_loss 0.0307, action_loss 0.0846, entropy_loss 0.0043, mean_intrinsic_rewards 0.0421, min_intrinsic_rewards 0.0084, max_intrinsic_rewards 0.1857, mean_embed_dist 0.0000, max_embed_dist 0.0000, min_embed_dist 0.0000, min_external_reward 0.0000
Episodes: TrainReward 0.0000, l 360.6000, t 649.8340, UnclippedReward 0.0000, VisitedRooms 1.0000
REWARD DEMOS 25, INTRINSIC DEMOS 25, RHO 0.05, PHI 0.05, REWARD THRESHOLD 0.0, MAX DEMO REWARD -inf, INTRINSIC THRESHOLD 1000
FRAMES TO AVOID: 0
Update 465, num samples collected 122880, FPS 347
Algorithm: loss 0.1329, value_loss 0.0002, ivalue_loss 0.0098, rnd_loss 0.0317, action_loss 0.0955, entropy_loss 0.0043, mean_intrinsic_rewards 0.0414, min_intrinsic_rewards 0.0082, max_intrinsic_rewards 0.1516, mean_embed_dist 0.0000, max_embed_dist 0.0000, min_embed_dist 0.0000, min_external_reward 0.0000
Episodes: TrainReward 0.0000, l 341.3529, t 658.7952, UnclippedReward 0.0000, VisitedRooms 1.00000 |
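For anyone curious enough to track these numbers across updates, the "name value" pairs in such log lines can be pulled into a dict with a simple regex (a sketch; the field names are just those visible in the output above):

```python
import re

def parse_metrics(line):
    """Extract 'name value' pairs from a training-log line into a dict of floats."""
    return {name: float(value)
            for name, value in re.findall(r"(\w+) (-?[\d.]+)", line)}

# Example line taken from the log output above
log = ("Algorithm: loss 0.1224, value_loss 0.0002, rnd_loss 0.0307, "
       "entropy_loss 0.0043, mean_intrinsic_rewards 0.0421")
metrics = parse_metrics(log)
```

Plotting, say, `loss` and `mean_intrinsic_rewards` over successive updates would show whether training is progressing, even if the values themselves only mean something to the researcher.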
|
Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 662
|
Nothing of any meaning or consequence for you. Pertinent only to the researcher. |
|
Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
These are just the logs of the algorithm, printing out the relevant metrics during agent training. |
|
Send message Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
I have now had 5 tasks in a row which failed after some 2,100 seconds, one after the other, within about half an hour:

https://www.gpugrid.net/result.php?resultid=33098926
https://www.gpugrid.net/result.php?resultid=33100629
https://www.gpugrid.net/result.php?resultid=33100675
https://www.gpugrid.net/result.php?resultid=33100715
https://www.gpugrid.net/result.php?resultid=33100745

Does anyone have any idea what the problem is? On the same host, another task has been running for 22 hours now, but I have stopped downloading new tasks until it's clear what's going on. |
©2025 Universitat Pompeu Fabra