Python Runtime (GPU, beta)
Author | Message |
---|---|
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,909,595 RAC: 4,232,576 |
"The Obstacle Tower environment is a simulated environment for machine learning (Reinforcement Learning) research. Note that in order to research how to train and deploy embodied agents in the real world it is common to research in 3D world simulations like this one. This is the github page of the project: https://github.com/Unity-Technologies/obstacle-tower-env" Do you have any plans to utilize the Tensor cores present on many newer Nvidia GPUs? These are designed for machine learning tasks. |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 295,172 |
Thanks for the feedback - on that basis, I'll keep pushing them through. Had an odd finish: FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/6/model.state_dict.3073' "Environment shut down with return code 0" sounds like a happy ending, but "called boinc_finish(195)" is 'Child failed'. |
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 614,515 |
Tried a LOT of the PythonGPU tasks today. Still no successful run. I think they are getting further along, though: they seem to run longer before the environment collapses and errors out. |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 295,172 |
The next round of testing has started. e1a10-ABOU_PPOObstacle6-0-1-RND2533_0 - I was going to say 'is running', but it's crashed already. After only 20 seconds, I got an apparently normal finish, followed by upload failure: <file_xfer_error> |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 295,172 |
Got another from what looks like the same batch. Limit is <max_nbytes>100000000.000000</max_nbytes> I'll catch the output and see how big it is. Edit - couldn't catch it ('report immediately' operated too fast). But I watched the next one in the slot directory: the output file was created right at the end, but was cleaned up almost immediately. I read it as 169 MB, but can't be certain. |
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 |
Yes, the file should be approximately 170 MB. |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 295,172 |
Well, I got one for you to study: e1a8-ABOU_PPOObstacle7-0-1-RND2466_3 That was done by manually increasing the maximum allowed size in BOINC. I think that's an internal setting in the BOINC system - specifically, the workunit generator or its template files - rather than the Python package. I've suspended work fetch for now - please let us know when the next iteration is ready to test. Edit - this is what the upload file contained: (screenshots) It seems a bit odd to return the ObstacleTower zip back to you unchanged? |
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 |
The git-related errors, such as "ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?", should be solved now. We will study the errors related to downloading the Obstacle Tower environment. Thank you for the feedback. |
Joined: 7 Jan 17 Posts: 34 Credit: 1,371,429,518 RAC: 0 |
Got one that ended in 195 (0xc3) EXIT_CHILD_FAILED after 15 minutes: ==> WARNING: A newer version of conda exists. <== current version: 4.8.3 latest version: 4.10.3 Please update conda by running $ conda update -n base -c defaults conda 13:14:06 (11501): /usr/bin/flock exited; CPU time 470.306190 13:14:06 (11501): wrapper: running ./gpugridpy/bin/python (run.py) path: ['/var/lib/boinc-client/slots/34', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/git/ext/gitdb', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python38.zip', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/lib-dynload', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/gitdb/ext/smmap'] git path: /var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/git Traceback (most recent call last): File "run.py", line 340, in <module> main() File "run.py", line 53, in main print("GPU available: {}".format(torch.cuda.is_available())) NameError: name 'torch' is not defined 13:14:10 (11501): ./gpugridpy/bin/python exited; CPU time 1.602758 13:14:10 (11501): app exit status: 0x1 13:14:10 (11501): called boinc_finish(195) </stderr_txt> ]]> |
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 614,515 |
Got five PythonGPU tasks that finished and reported after about ten minutes and were validated. |
Joined: 22 May 20 Posts: 110 Credit: 115,525,136 RAC: 345 |
My machine is a dual boot machine (Win10/Ubuntu 20.04). Are there plans for a Windows app for these tasks or should I boot into Linux to get some of these tasks? |
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 614,515 |
Haven't seen any posts by admin types saying that Windows apps will be made. That said, the new beta apps are often tested first on Linux to get the bugs out, and then the Windows apps are generated. |
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 614,515 |
This task looks to have run through all of its parameter set to complete normally at around 3000 seconds and was validated for ~ 200K credits. https://www.gpugrid.net/result.php?resultid=32660133 |
Joined: 7 Mar 14 Posts: 18 Credit: 6,575,125,525 RAC: 1,038 |
Did you notice whether it used the GPU and, if it did, what percentage? I had one that ran for about 3 hours before failing, and I never saw the fans running during that time. |
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,909,595 RAC: 4,232,576 |
just ran this one on my RTX 3080Ti: https://www.gpugrid.net/result.php?resultid=32660184 16:19:48 (1841951): wrapper (7.7.26016): starting ran for about 2 mins and errored out. file size too big? how big could the file get in 2 minutes? lol. looks like everyone in this WU chain is having the same issue though. https://www.gpugrid.net/workunit.php?wuid=27085637 Bad WU? and I saw no evidence that it ever touched the GPU, refreshing nvidia-smi every 2 seconds showed no process running on the GPU. must still be using only the CPU. Can an admin please directly comment if these are actually using the GPU or not? I know an admin mentioned that they were only doing CPU work "as a test". Is that still the case? Having GPU tasks that only use the CPU core is very confusing. |
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 614,515 |
The ones that have partially run and were validated only used 31% of the GPU in nvidia-smi. The one task that appears to have successfully run through to normal completion was done while I was out of the house, so unfortunately I did not see it run. Will have to wait for more to observe. |
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 614,515 |
Looks like the tasks fluctuate between a few seconds at 1% utilization before returning to hovering around 10-13% utilization. I was watching one on a 2070 and it was running for almost 60 minutes in nvidia-smi. They are marked as C+G type in that program. I think I killed it when I pulled up htop to look at how much CPU it was using, because it finished with an error at the same instant htop populated the screen. |
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 |
The contents of the downloaded obstacletower.zip file are necessary to generate the data required for the machine learning agent to train. That is why the file itself is not modified; it is only used to generate the training data. The expected behaviour is for the file to be downloaded, used during the job and then deleted. It should not be returned. Some jobs have already finished successfully. Thank you for the feedback. Current jobs being tested should use around 30% GPU and around 8000MiB GPU memory. |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 295,172 |
"The expected behaviour is for the file to be downloaded, used during the job completion and then deleted. Should not be returned." That makes much more sense. Standing by for the next round of debugging... :-) |
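
On the upload failures discussed in the thread: the quoted limit was `<max_nbytes>100000000.000000</max_nbytes>` while the output file was reported as roughly 169-170 MB. A minimal sketch of that comparison follows; the helper names `exceeds_upload_limit` and `describe` are hypothetical and not part of BOINC or the project's code, and the sketch assumes sizes are counted in plain bytes.

```python
import os

# Limit quoted in the thread's <max_nbytes> element, assumed to be in bytes.
MAX_NBYTES = 100_000_000


def exceeds_upload_limit(path, max_nbytes=MAX_NBYTES):
    """Return True if the file at `path` is larger than `max_nbytes` bytes."""
    return os.path.getsize(path) > max_nbytes


def describe(size_bytes, max_nbytes=MAX_NBYTES):
    """Return a short human-readable comparison of a size against the limit."""
    size_mb = size_bytes / 1_000_000
    limit_mb = max_nbytes / 1_000_000
    verdict = "over" if size_bytes > max_nbytes else "within"
    return f"{size_mb:.0f} MB is {verdict} the {limit_mb:.0f} MB limit"


# The size observed in the slot directory, as reported in the thread.
print(describe(169_000_000))
```

Run against the reported size, this shows why the upload was rejected until the limit was raised by hand.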
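
On the `NameError: name 'torch' is not defined` traceback: the project's `run.py` is not shown in the thread, so the cause is unknown. One common way to end up there is an import that is missing or was skipped. The sketch below is only an illustration of a guarded import that reports a missing package explicitly rather than failing later with a `NameError`; the function name `gpu_available` is hypothetical.

```python
import importlib


def gpu_available():
    """Return True/False for CUDA availability, or None if torch cannot be imported."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # torch is not installed in this environment; report that explicitly.
        return None
    return bool(torch.cuda.is_available())


status = gpu_available()
if status is None:
    print("torch is not importable in this environment")
else:
    print("GPU available: {}".format(status))
```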
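
On the question of whether tasks touch the GPU at all: several posts rely on watching `nvidia-smi` by hand. A rough sketch of automating that check is below. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output; the function names and the 5% idle threshold are illustrative choices, not project settings. Fields that the driver reports as `[N/A]` are not handled.

```python
import subprocess

# nvidia-smi query for per-GPU utilization (%) and memory in use (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse 'util, mem' CSV lines into a list of (util_percent, mem_mib) tuples."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        stats.append((int(util), int(mem)))
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed stats (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


def looks_idle(stats, util_threshold=5):
    """Return True if every GPU is below the (arbitrary) utilization threshold."""
    return all(util < util_threshold for util, _ in stats)


# Example with the figures an admin quoted: about 30% GPU and about 8000 MiB.
sample = parse_gpu_stats("30, 8000\n")
print(sample, "idle" if looks_idle(sample) else "busy")
```

Calling `read_gpu_stats()` in a loop with a short sleep would mirror the "refresh every 2 seconds" approach described in the thread. Because utilization was reported to dip to 1% for a few seconds at a time, a single sample can be misleading.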
©2025 Universitat Pompeu Fabra