Message boards : News : Python Runtime (GPU, beta)
Joined: 22 May 20 · Posts: 110 · Credit: 115,525,136 · RAC: 345
That's more bad news for me, as my GPU is maxed out at 6 GB. Without upgrading my GPU, which isn't likely to happen soon, I suppose I have to give up on these types of tasks, at least for the time being. Thanks for the update, though.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,722,595 · RAC: 4,266,994
> Some jobs have already finished successfully. Thank you for the feedback. Current jobs being tested should use around 30% GPU and around 8000 MiB GPU memory.

Why such low GPU utilization? And is 8000 right, or do you mean 800? 8 GB, or 800 MB?
Joined: 22 May 20 · Posts: 110 · Credit: 115,525,136 · RAC: 345
I can only speculate about the former. But your latter question likely resolves to 8,000 MiB (mebibytes), which is just a different convention for counting bytes, if he indeed meant to write 8,000. While k (kilo), M (mega), G (giga) and T (tera) are the SI prefixes, computed in base 10 as 10^3, 10^6, 10^9 and 10^12 respectively, the binary prefixes Ki (kibi), Mi (mebi), Gi (gibi) and Ti (tebi) are computed in base 2 as 2^10, 2^20, 2^30 and 2^40. As such, M/Mi = 10^6/2^20 ≈ 95.37%, a difference of about 4.63% between the SI and binary units.

1 kB = 1000 B
1 KiB = 1024 B
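To make the prefix arithmetic concrete, here is a minimal Python sketch of the conversion (the constants are just the standard SI and binary definitions; nothing here is specific to GPUGRID):

```python
# SI (decimal) vs. binary prefixes for memory sizes.
MB = 10**6     # megabyte (SI, base 10)
MiB = 2**20    # mebibyte (binary, base 2)
GiB = 2**30    # gibibyte

reported_mib = 8000               # "8000 MiB" as reported by nvidia-smi
size_bytes = reported_mib * MiB

print(f"{reported_mib} MiB = {size_bytes} bytes")
print(f"         = {size_bytes / MB:.1f} MB (decimal)")
print(f"         = {size_bytes / GiB:.2f} GiB")

# Ratio between the two conventions: MB / MiB ~= 0.9537 (about 4.63% apart).
print(f"MB/MiB ratio: {MB / MiB:.4f}")
```

So the quoted 8,000 MiB works out to roughly 8.4 GB in decimal terms, or about 7.8 GiB.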
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,722,595 · RAC: 4,266,994
Yeah, I know the conversions and such. I'm just wondering if it's a typo: Keith ran some of these beta tasks successfully and did not report such high memory usage; he claimed they only used about 200 MB.
Joined: 22 May 20 · Posts: 110 · Credit: 115,525,136 · RAC: 345
Ah, all right. I didn't mean to offend you, if that's what I did. I still don't understand their beta testing procedure anyway. So far not many tasks have been run, and only a few of them successfully, yet almost no information has been shared, which makes the whole procedure rather opaque and leaves others in the dark, wondering about their piles of unsuccessful tasks. And the little information that is shared seems to conflict with user experience and observations. For an ML task, though, 8 GB isn't unusual.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,722,595 · RAC: 4,266,994
I agree that heavy memory use wouldn't be atypical for AI/ML work, and I also agree that the admins should be a little more transparent about what these tasks are doing and the expected behaviors. So far there seem to be tons and tons of errors, then the admins come back and say they fixed them, then there are just more errors again. I'd also like to know whether these tasks use the Tensor cores on RTX GPUs.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 295,172
I think the beta testing process is (as usual anywhere) very much an incremental one. It will have started with small test units, and as each little buglet surfaces and is solved, the process moves on to test a later segment that wasn't accessible until the previous problem had been overcome. Thus, Abouh has confirmed that yesterday's upload file size problem was caused by including a source data file in the output: "Should not be returned".

I also noted that some of Keith's successful runs were resends of tasks which had failed on other machines, some with generic problems which I would have expected to cause a failure on his machine too. So it seems that dynamic fixes may have been applied as well. Normally, a new BOINC replication task is an exact copy of its predecessor, but I don't think that can be automatically assumed during this beta phase. In particular, Keith's observation that one test task only used 200 MB of GPU memory isn't necessarily a foolproof guide to the memory demand of later tests.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,722,595 · RAC: 4,266,994
Which is why I asked for clarification, in light of the disparity between expected and observed behaviors.
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
> yeah I know the conversions and such. I'm just wondering if it's a typo, Keith ran some of these beta tasks successfully and did not report such high memory usage, he claimed it only used about 200MB

Yes, I have watched tasks complete fully to a proper BOINC finish, and I never saw more than 290 MB of GPU memory reported in nvidia-smi, at a maximum of 13% utilization. Unless nvidia-smi has an issue reporting the GPU RAM used, the 8 GB figure in that post is out of line. Or the tasks the scientist-developer mentioned haven't been released to us out of the laboratory yet.
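For anyone who wants to reproduce this kind of monitoring, here is a minimal sketch that polls nvidia-smi once per second from Python. It uses only standard nvidia-smi query options, assumes a single GPU, and is just one way of doing what `nvidia-smi -l 1` does by hand:

```python
import subprocess
import time

# Poll GPU memory use and utilization once per second for one minute.
for _ in range(60):
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.used,utilization.gpu",
        "--format=csv,noheader,nounits",
    ]).decode().strip()
    mem_mib, util_pct = out.splitlines()[0].split(", ")   # first GPU only
    print(f"GPU memory used: {mem_mib} MiB, utilization: {util_pct} %")
    time.sleep(1)
```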
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
We are progressing in our debugging and have managed to solve several errors, but as mentioned in a previous post, it is an incremental process.

We are trying to train an AI agent using reinforcement learning, which generally interleaves stages in which the agent collects data (a process that is less GPU intensive) with stages in which the agent learns from that data. The nature of the problem, in which data is progressively generated, accounts for a lower GPU utilisation than in supervised machine learning, although we will work to progressively make it more efficient once debugging is completed. Since the Obstacle Tower environment (https://github.com/Unity-Technologies/obstacle-tower-env), the source of the data, also runs on the GPU, during the learning stage the neural network and the training data, together with the environment, occupy approximately 8,000 MiB (mebibytes; it was not a typo) of GPU memory when checked locally with nvidia-smi.

Basically, the Python script has the following steps:

- Step 1: Defining the conda environment with all dependencies.
- Step 2: Downloading obstacletower.zip, a necessary file used to generate the data.
- Step 3: Initialising the data generator using the contents of obstacletower.zip.
- Step 4: Creating the AI agent and alternating data collection and training stages.
- Step 5: Returning the trained AI agent, and not obstacletower.zip.

Only after reaching steps 4 and 5 is the GPU used. Some of the jobs that succeeded but barely used the GPU were there to verify that the problems in steps 1 and 2 had indeed been solved (most of them solved by Keith Myers). We noticed that most of the recent failed jobs returned the following error at step 3:

mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
The environment does not need user interaction to launch
The Agents' Behavior Parameters > Behavior Type is set to "Default"
The environment and the Python interface have compatible versions.

We are working to solve it. If step 3 completes without errors, jobs reaching steps 4 and 5 should be using the GPU.

We hope that helps shed some light on our work and the recent results. We will try to resolve any further doubts and keep you informed of our progress.
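To make the alternation in step 4 concrete, here is a minimal, self-contained Python sketch of a collect-then-learn loop. It only illustrates the pattern described above; the environment, agent and batch size are stand-ins, not the project's actual code:

```python
import random

class ToyEnvironment:
    """Stand-in for the Obstacle Tower / Gym data generator (step 3)."""
    def reset(self):
        return 0.0
    def step(self, action):
        obs, reward = random.random(), random.random()
        done = random.random() < 0.05
        return obs, reward, done

class ToyAgent:
    """Stand-in for the RL agent created in step 4."""
    def act(self, obs):
        return random.choice([0, 1])
    def learn(self, batch):
        # In the real app this is the GPU-heavy part: the neural network
        # and the collected batch live in GPU memory during the update.
        pass

env, agent = ToyEnvironment(), ToyAgent()
for iteration in range(3):
    # Data-collection stage: mostly environment work, lighter on the GPU.
    batch, obs = [], env.reset()
    for _ in range(512):
        action = agent.act(obs)
        obs, reward, done = env.step(action)
        batch.append((obs, action, reward))
        if done:
            obs = env.reset()
    # Learning stage: the agent updates from the collected batch, which is
    # where GPU utilization and memory use spike.
    agent.learn(batch)
# Step 5: what gets returned is the trained agent, not the input data.
```

The back-and-forth between the two stages is what produces the bursty, relatively low average GPU utilization described above.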
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,722,595 · RAC: 4,266,994
Thanks for the more detailed answer. Regarding the 8 GB of memory used:

- At which step of the process does this happen?
- Was Keith's nvidia-smi screenshot that he posted in another thread, showing low memory use, from an earlier unit that did not require that much VRAM?
- Will these units fail from too little VRAM?
- What will you do, or are you doing, about GPUs with less than 8 GB of VRAM, or even with exactly 8 GB?
- Do you have some filter in the WU scheduler to avoid sending these units to GPUs with less than 8 GB?
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 295,172
> Do you have some filter in the WU scheduler to avoid sending these units to GPUs with less than 8 GB?

It's certainly possible to set such a filter: referring again to Specifying plan classes in C++, the scheduler can check a CUDA plan class specification like

```cpp
if (!strcmp(plan_class, "cuda23")) {
    if (!cuda_check(c, hu,
        100,        // minimum compute capability (1.0)
        200,        // max compute capability (2.0)
        2030,       // min CUDA version (2.3)
        19500,      // min display driver version (195.00)
        384*MEGA,   // min video RAM
        1.,         // # of GPUs used (may be fractional, or an integer > 1)
        .01,        // fraction of FLOPS done by the CPU
        .21         // estimated GPU efficiency (actual/peak FLOPS)
    )) {
        return false;
    }
}
```

We last discussed that code in connection with compute capability, but I think we're still having problems implementing filters via tools like that.
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
At step 3, initialising the environment requires a small amount of GPU memory (somewhere around 1 GB). At step 4 the AI agent is initialised and trained, and a data storage class and a neural network are created and placed on the GPU; this is when more memory is required. However, in the next round of tests we will lower the GPU memory requirements of the script while debugging step 3. Eventually, for steps 4 and 5, we expect it to require the roughly 8 GB mentioned earlier.

Keith's nvidia-smi screenshot showing a job with low memory use was from a job that returned after step 2, to verify that the problems in steps 1 and 2 had been solved.
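As a rough illustration of why the memory only shows up at step 4, here is a small PyTorch sketch that places a network and a batch of data on the GPU. PyTorch and the layer sizes are assumptions for illustration; the posts do not say which framework or architecture the project actually uses:

```python
import torch

assert torch.cuda.is_available(), "requires a CUDA-capable GPU"
device = torch.device("cuda")

# Putting a neural network and a batch of training data on the GPU is what
# drives the step-4 memory figures; the sizes here are arbitrary stand-ins.
network = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).to(device)

batch = torch.randn(8192, 4096, device=device)  # stand-in for collected data

# nvidia-smi will report more than this figure, since it also counts the
# CUDA context and PyTorch's caching allocator.
print(f"Allocated by this process: "
      f"{torch.cuda.memory_allocated(device) / 2**20:.0f} MiB")
```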
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
So was this WU https://www.gpugrid.net/result.php?resultid=32660133 one that completed after steps 1 and 2, or after steps 4 and 5? I never got to witness this one in real time.

I had the nvidia-smi polling interval set to 1 second and I never saw GPU memory usage go above 290 MB for that screenshot; it was not taken from the task linked above.

The BOINC completion percentage just went to 10%, stayed there, and never showed 100% when the task finished. I think that has historically been an issue with BOINC.
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
> The environment and the Python interface have compatible versions.

Is the reason I was able to complete a workunit properly that my local Python environment happens to match the zipped wrapper's Python interface? I use several PyPI applications that have probably set up the Python environment variables. Is there something I can dump from the host that completed the workunit properly that would help you debug the application package?
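If it helps, one low-effort way to capture the host-side Python environment for comparison is to dump the interpreter and installed package versions. This is just a generic suggestion, not something the project has asked for in this exact form:

```python
import sys
import platform
from importlib import metadata

# Dump the interpreter, OS and installed package versions so they can be
# compared against the conda environment shipped with the task.
print(sys.version)
print(platform.platform())
for dist in sorted(metadata.distributions(),
                   key=lambda d: (d.metadata["Name"] or "").lower()):
    print(f"{dist.metadata['Name']}=={dist.version}")
```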
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
This one completed the whole Python script, including steps 4 and 5. It should have used the GPU.
Joined: 13 Dec 17 · Posts: 1416 · Credit: 9,119,446,190 · RAC: 614,515
Thanks for confirming that the one I completed used the GPU.
Joined: 6 Jan 15 · Posts: 76 · Credit: 25,499,534,331 · RAC: 0
Did a check on one host running GPUGridpy units.

e4a6-ABOU_ppo_gym_demos3-0-1-RND1018_0
Run time: 4,999.53 s
GPU memory: 2,027 MiB reported by nvidia-smi

No check-pointing yet, but it works well.
Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
We sent out some jobs yesterday and almost all finished successfully. We are still working on avoiding the following error related to the Obstacle Tower environment:

mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
The environment does not need user interaction to launch
The Agents' Behavior Parameters > Behavior Type is set to "Default"
The environment and the Python interface have compatible versions.

However, to test the rest of the code we tried another set of environments that are less problematic (https://gym.openai.com/). The successful jobs used these environments. While we find and test a solution for Obstacle Tower locally, we will continue to send jobs with these environments to test the rest of the code.

Note that reinforcement learning (RL) techniques are independent of the environment. The environment represents the world where the AI agent learns intelligent behaviours; switching to another environment simply means applying the learning technique to a different problem that can be equally challenging (placing the agent in a different world). Thus, we will now finish debugging the app with these Gym environments, simply because they are less prone to errors, and once we know the environment is the only remaining possible source of problems, we will go back to the others.
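As an aside, the environment swap really is close to a one-line change from the training loop's point of view. A minimal sketch against the classic Gym API (the 4-tuple `step` return of gym releases from that era; the environment name is only an example):

```python
import gym

# Swapping the world the agent learns in is just a different `make` call;
# the surrounding collect/learn loop stays the same.
env = gym.make("CartPole-v1")

obs = env.reset()
for _ in range(200):
    action = env.action_space.sample()   # stand-in for the agent's policy
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```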
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,722,595 · RAC: 4,266,994
I had a few failures: https://www.gpugrid.net/result.php?resultid=32660680 and https://www.gpugrid.net/result.php?resultid=32660448. These seem to be bad WUs in both instances, since all wingmen are erroring in the same way.

They mainly used ~6-7% GPU utilization on my 3080 Ti, with intermittent spikes to ~20% every 10 s or so. Power use was near idle, GPU memory utilization was around 2 GB, and system memory use was around 4.8 GB. Make sure your system has enough memory in multi-GPU setups.