Experimental Python tasks (beta) - task description

Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 59825 - Posted: 28 Jan 2023, 13:19:44 UTC

Yesterday I downloaded and started 2 Pythons on my box with the Intel Xeon E5 2667v4 (2 CPUs) and the Quadro P5000 inside.

What I noticed after some time was that the progress bars in the BOINC Manager diverged more and more.
Finally, one task finished after 24 hrs + a few minutes (how nice, thus missing the <24 hrs bonus); the other task is now at 29.920%.
What I notice now, with only this one task running, is: no GPU utilization, just CPU.
Any idea how come?
I guess this task is invalid and I should abort it, right?

BTW: with the other task, which worked fine, I could not see any increase in VRAM usage. It stayed at about 3.5GB the whole time.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 5,269
Message 59826 - Posted: 28 Jan 2023, 15:00:20 UTC - in response to Message 59825.  
Last modified: 28 Jan 2023, 15:03:47 UTC

Old low(er) VRAM use tasks are still going out.

The old tasks have “test” in the WU name, and have the same VRAM use and high CPU use as before.

The new tasks have “exp” in the name, and use less CPU but more VRAM.

And the new Windows app could be acting differently than the Linux version.

Erich56
Message 59827 - Posted: 28 Jan 2023, 15:32:20 UTC

thanks for the hint regarding "old" and "new" tasks.

The 2 which I downloaded yesterday were "new" ones with "exp" in the name.

Right now, I have 4 new ones running in parallel (I was surprised that they were downloaded, as the server status page has been showing "0 unsent" for quite a while).
According to the Windows Task Manager, they seem to run well, although I cannot tell for sure at this early point whether they all use the GPU. I will be able to tell better from the progress bars after some more time (at least one looks suspicious at this time).

VRAM use at this point is 9,180 MB (including whatever the monitor uses).

Erich56
Message 59828 - Posted: 28 Jan 2023, 15:54:00 UTC - in response to Message 59827.  

According to the Windows task manager, they seem to run well, although I cannot tell for sure at this early point whether they all use the GPU. I will be able to tell better from the progress bar after some more time (while at least one looks suspicious at this time).

Is there any other way to find out whether a task is using the GPU at all, other than watching the BOINC Manager progress bars for a while and comparing the progress of the individual running tasks with each other?

Ian&Steve C.
Message 59829 - Posted: 28 Jan 2023, 16:12:24 UTC

as a reference, this is what it's looking like running 3 tasks on 4x A4000s. a good amount of variance in VRAM use. not consistent and I'm not sure if it increases over time, or some tasks just require more than others. but definitely more than before and different behavior than before.




Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 731
Message 59830 - Posted: 28 Jan 2023, 17:42:19 UTC - in response to Message 59828.  


is there any other way to find out whether a task is using the GPU at all, except for watching the BOINC Manager progress bar for a while and comparing to each other the progress of the individual running tasks?

Yes, use nvidia-smi which is installed by the Nvidia drivers.

It is located here in Windows.

C:\Program Files\NVIDIA Corporation\NVSMI

Just open a command window and navigate there and execute:

nvidia-smi

Ian&Steve C.
Message 59831 - Posted: 28 Jan 2023, 18:43:54 UTC - in response to Message 59830.  
Last modified: 28 Jan 2023, 18:46:38 UTC


is there any other way to find out whether a task is using the GPU at all, except for watching the BOINC Manager progress bar for a while and comparing to each other the progress of the individual running tasks?

Yes, use nvidia-smi which is installed by the Nvidia drivers.

It is located here in Windows.

C:\Program Files\NVIDIA Corporation\NVSMI

Just open a command window and navigate there and execute:

nvidia-smi


He might look here too. The location you gave is reported to be the one used by older installs.

C:\Windows\System32\DriverStore\FileRepository\nvdm*\nvidia-smi.exe

i think he needs to include the extension. but yes.

nvidia-smi.exe
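
For anyone who would rather script this check than type it by hand, here is a minimal Python sketch. It looks for nvidia-smi on the PATH and in the two locations quoted above, then asks it for per-GPU utilization and memory use. The function names (`find_nvidia_smi`, `parse_gpu_rows`, `query_gpus`) are made up for this illustration; only the nvidia-smi query options are real. A steady 0% utilization while Python tasks are running would suggest they have fallen back to the CPU.

```python
import csv
import glob
import io
import shutil
import subprocess

# Locations quoted in the posts above; which one exists depends on the driver install.
CANDIDATE_PATTERNS = [
    r"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe",
    r"C:\Windows\System32\DriverStore\FileRepository\nvdm*\nvidia-smi.exe",
]

# Per-GPU query; with "nounits" the memory columns are plain numbers in MiB.
QUERY_ARGS = [
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def find_nvidia_smi():
    """Return a path to nvidia-smi, or None if it cannot be found."""
    on_path = shutil.which("nvidia-smi")
    if on_path:
        return on_path
    for pattern in CANDIDATE_PATTERNS:
        matches = glob.glob(pattern)
        if matches:
            return matches[0]
    return None


def to_int(field):
    """Convert a CSV field to int; non-numeric values such as "[N/A]" become None."""
    try:
        return int(field.strip())
    except ValueError:
        return None


def parse_gpu_rows(csv_text):
    """Parse the CSV printed by nvidia-smi for QUERY_ARGS into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if len(record) != 4:
            continue  # skip blank or unexpected lines
        index, util, used, total = (to_int(f) for f in record)
        rows.append({"index": index, "util_pct": util,
                     "used_mib": used, "total_mib": total})
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed per-GPU rows."""
    exe = find_nvidia_smi()
    if exe is None:
        raise FileNotFoundError("nvidia-smi not found on PATH or in the candidate locations")
    result = subprocess.run([exe, *QUERY_ARGS], capture_output=True,
                            text=True, check=True)
    return parse_gpu_rows(result.stdout)
```

Calling `query_gpus()` every few seconds and printing the rows gives a rough utilization log. If the executable itself cannot be started (as with an "access denied" error), `subprocess.run` will raise an exception instead of returning data.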

Erich56
Message 59832 - Posted: 28 Jan 2023, 20:04:37 UTC

thank you very much, folks, for your help with the Nvidia-SMI.
BTW, on my host it is located here: C:\Program Files\NVIDIA Corporation\NVSMI

However, what I get is "access denied", even when opening the command window as administrator. No idea what the problem is.

But anyway, having been able to watch the progress bar in the BOINC Manager, by now I can clearly tell the following:

Just to explain how I started out with the Pythons last year when they were introduced:
I spoofed the GPU, which gave me the ability to run 4 Pythons simultaneously.
With this hardware:
Intel Xeon E5 2667v4 (2 CPUs), the Quadro P5000 (16GB VRAM) and 256GB system RAM
this was performing fine, over all the months.

Now, when running 4 tasks simultaneously, I notice that the 2 tasks running on "device 0" are about 3 times faster than the 2 tasks running on "device 1".
This seems to indicate very clearly that the 2 tasks on "device 1" are NOT utilizing the GPU.

Since I made no changes to the hardware, the software, or any relevant settings compared to before, the reason for this behaviour must be related to the code of the new Pythons :-( All 4 tasks are "new" ones.
Or does anyone have any other ideas?

BTW: on the other host with the 2 RTX3070 inside, I have so far downloaded and started 3 Pythons; however, they are from the "old" series. And all three are running at the same speed, i.e. utilizing the GPUs.

Erich56
Message 59833 - Posted: 28 Jan 2023, 20:17:30 UTC

Further to my posting above:

I just want to point out that the same problem exists even if only 2 of the new Pythons are crunched simultaneously (one on "device 0", the other on "device 1"); see my posting here:
https://www.gpugrid.net/forum_thread.php?id=5233&nowrap=true#59825

Keith Myers
Message 59834 - Posted: 28 Jan 2023, 21:07:22 UTC - in response to Message 59832.  

thank you very much, folks, for your help with the Nvidia-SMI.
BTW, on my host it is located here: C:\Program Files\NVIDIA Corporation\NVSMI

However, what I get is "access denied", even when opening the command window as administrator. No idea what the problem is.


The access denied is obviously a permission issue. I don't know how to view the properties of a file in Windows. Maybe right-click? Does that show you who "owns" the file?

Windows probably has the same ownership options or close enough to Linux where a file has permissions at the system level, the group level and the user level.

Maybe the Windows version of nvidia-smi.exe belongs to an Nvidia group of which the local user is not a member. Maybe investigate adding the user to that group to see if that changes whether the file can be executed.

Erich56
Message 59835 - Posted: 28 Jan 2023, 21:20:00 UTC

thank you, Keith, for your reply re the Nvidia-SMI. I will investigate further tomorrow.

However, by now, looking at the progress bars, it seems evident that the new Python version has a problem with spoofed GPUs, either by design or by accident. Maybe abouh can tell more.

Erich56
Message 59836 - Posted: 29 Jan 2023, 6:48:49 UTC

For the time being, I excluded "device 1" from GPUGRID via a setting in cc_config.xml.
So, when downloading Python tasks next time, only 2 should come in and be processed by "device 0" (with the app_config.xml setting "0.5 GPU usage").
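
For readers who want to reproduce those two settings, here is a small Python sketch that builds the corresponding XML. The element names (`exclude_gpu`, `device_num`, `gpu_versions`, `gpu_usage`, `cpu_usage`) are the standard BOINC ones. The project URL, the app name `PythonGPU` and the `cpu_usage` value of 1.0 are assumptions for illustration, so check them against what your own BOINC client reports before using the output.

```python
import xml.etree.ElementTree as ET

# Assumptions for illustration; verify both against your own client.
PROJECT_URL = "https://www.gpugrid.net/"  # must match the project URL shown by BOINC
APP_NAME = "PythonGPU"                    # placeholder app name


def build_cc_config(project_url, device_num):
    """Build a cc_config.xml tree that excludes one GPU device for one project."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    exclude = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(exclude, "url").text = project_url
    ET.SubElement(exclude, "device_num").text = str(device_num)
    return root


def build_app_config(app_name, gpu_usage, cpu_usage=1.0):
    """Build an app_config.xml tree; gpu_usage 0.5 allows 2 tasks per GPU."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(versions, "cpu_usage").text = str(cpu_usage)  # assumed value
    return root


def to_xml(root):
    """Pretty-print an element tree as a string (ET.indent needs Python 3.9+)."""
    ET.indent(root)
    return ET.tostring(root, encoding="unicode")


print(to_xml(build_cc_config(PROJECT_URL, 1)))
print(to_xml(build_app_config(APP_NAME, 0.5)))
```

The first printed document would go into cc_config.xml in the BOINC data directory, the second into app_config.xml in the project's folder. If a cc_config.xml already exists, the `exclude_gpu` block should be merged into its existing `options` section instead of overwriting the file.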

Further, I guess I could not process 4 tasks (of the new type) simultaneously anyway, as I can see from the 2 currently running tasks that they are using 12,367 MB VRAM. So not even 1 additional task would work, with the GPU having 16 GB VRAM.

On the other host with the 2 RTX3070 (8 GB VRAM each), on which I ran 4 tasks in parallel before, I will now have the problem that only 1 task per GPU can be processed, due to the higher VRAM need. Which is a pity :-(
And I guess even GPUs with 12 GB VRAM may NOT be able to process 2 new Pythons simultaneously.
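
The arithmetic behind these estimates can be written out as a short Python sketch. It assumes that the 12,367 MB observed for 2 tasks splits evenly between them and that every new task needs about the same amount, which is only a rough approximation (the figure also includes whatever the display uses). The helper name `tasks_that_fit` is made up for this example.

```python
def tasks_that_fit(total_vram_mb, per_task_mb):
    """Number of whole tasks that fit into the given amount of VRAM."""
    return int(total_vram_mb // per_task_mb)


# Observed above: 2 running tasks use 12,367 MB in total.
per_task_mb = 12367 / 2  # about 6,184 MB per task, assuming an even split

cards = [
    ("Quadro P5000, 16 GB", 16 * 1024),
    ("RTX 3070, 8 GB", 8 * 1024),
    ("generic 12 GB card", 12 * 1024),
]

for label, total_mb in cards:
    print(f"{label}: {tasks_that_fit(total_mb, per_task_mb)} task(s)")
# Quadro P5000, 16 GB: 2 task(s)
# RTX 3070, 8 GB: 1 task(s)
# generic 12 GB card: 1 task(s)
```

The 12 GB case is borderline: two tasks at this estimate would need 12,367 MB against 12,288 MB available, so a small reduction in per-task memory use could change the answer.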

Erich56
Message 59837 - Posted: 29 Jan 2023, 7:56:30 UTC

this is one of the Pythons which had only CPU utilization, but NOT GPU utilization. So I aborted it after several hours.

https://www.gpugrid.net/result.php?resultid=33271058

Does the stderr show by any chance why the GPU was not utilized?

abouh
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59838 - Posted: 30 Jan 2023, 14:05:14 UTC - in response to Message 59835.  

Hello Erich,

By design, an environment variable defines which GPU the task is supposed to use (in run.py line 429):

os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["GPU_DEVICE_NUM"]


Then, the PyTorchRL package tries to detect the specified GPU, and otherwise uses the CPU. So if no GPU is detected, what you describe can happen: the CPU is used instead and the task progresses much more slowly.

What I can do is add an additional logging message in the run.py scripts that will display whether or not the GPU device was detected. So we will know for sure.
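
To illustrate the behaviour described above, here is a minimal Python sketch of the select-then-fall-back logic with a log message added. This is not the actual run.py or PyTorchRL code: `select_device` and its `cuda_is_available` parameter are hypothetical names, and the availability check is passed in as a callable so the sketch runs without PyTorch installed.

```python
import logging
import os

log = logging.getLogger("run")


def select_device(environ, cuda_is_available):
    """Restrict CUDA to the device BOINC assigned, then report what will be used.

    environ           -- mapping such as os.environ
    cuda_is_available -- zero-argument callable, e.g. torch.cuda.is_available
    """
    device_num = environ.get("GPU_DEVICE_NUM")
    if device_num is not None:
        # Same assignment as the run.py line quoted above.
        environ["CUDA_VISIBLE_DEVICES"] = device_num
    if cuda_is_available():
        log.info("GPU device %s detected, using CUDA", device_num)
        # With a single visible device, it is always index 0 inside the process.
        return "cuda:0"
    log.warning("GPU device %s NOT detected, falling back to CPU", device_num)
    return "cpu"


# Usage with PyTorch installed (must run before anything else initializes CUDA):
#   import torch
#   device = select_device(os.environ, torch.cuda.is_available)
```

One thing this makes visible: `CUDA_VISIBLE_DEVICES` refers to physical GPUs as CUDA enumerates them. On a host with a single physical GPU, a value of 1 leaves no device visible, so the availability check returns false and the CPU branch is taken.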

Furthermore, I have found a way to reduce the GPU memory requirements at least a bit. I will start using it in the newly submitted tasks.

Ian&Steve C.
Message 59839 - Posted: 30 Jan 2023, 14:15:33 UTC - in response to Message 59838.  

Thanks abouh!

Erich56
Message 59840 - Posted: 30 Jan 2023, 14:21:52 UTC - in response to Message 59838.  

hello abouh,

thanks for your quick reply.

So, as it seems, the tasks from the new Python version detect the "real" GPU ("device 0") but do NOT detect the spoofed GPU ("device 1"), for whatever reason.

In the former Python version, both GPUs were detected without any problem.

However, I now found a workaround which also works well:

I excluded "device 1" in the cc_config.xml of BOINC, and I set the GPU usage to "0.3" in the app_config.xml of GPUGRID.
This makes it possible to run 3 Pythons simultaneously. In theory, I could even run 4 tasks by setting the GPU usage to "0.25", but from what I can see now, with 3 tasks running, the VRAM is filled with 16,307 MB out of a VRAM size of 16,384 MB.
The progress of the 3 tasks at this moment is 38%, 24% and 22% (they were downloaded at different times), so I can only hope that VRAM utilization will not increase any more in the course of task processing.

On another host, I have two Pythons running in parallel, with total VRAM use of 6,125 MB out of 6,144 MB available :-)

So, since you mention that you found a way to reduce VRAM requirements a little bit, this will definitely help :-)

Keith Myers
Message 59841 - Posted: 30 Jan 2023, 21:20:31 UTC

Abouh,

This task series and all its brethren have a configuration error, and they are all failing very fast.

https://www.gpugrid.net/result.php?resultid=33273094

I've chalked up over 40 errors today and all the wingmen are failing the series in the same way.

File "run.py", line 97, in main
demo_dtypes={prl.OBS: np.uint8, prl.ACT: np.int8, prl.REW: np.float16, "StateEmbeddings": np.int16},
TypeError: create_factory() got an unexpected keyword argument 'state_embeds_to_avoid'

Ian&Steve C.
Message 59842 - Posted: 30 Jan 2023, 21:56:05 UTC - in response to Message 59841.  
Last modified: 30 Jan 2023, 21:58:25 UTC

Keith, you need to remove your "tweaking". it's trying to replace the run.py script workaround thing that we were doing before. the old run.py script is not compatible with the new tasks.

you must have forgotten to reset the project on this one host. your other hosts have run the new tasks OK.

i have many of the new tasks running just fine.

and memory use is improved. thanks abouh.

Keith Myers
Message 59843 - Posted: 31 Jan 2023, 2:05:13 UTC - in response to Message 59842.  

Keith, you need to remove your "tweaking". it's trying to replace the run.py script workaround thing that we were doing before. the old run.py script is not compatible with the new tasks.

you must have forgotten to reset the project on this one host. your other hosts have run the new tasks OK.

i have many of the new tasks running just fine.

and memory use is improved. thanks abouh.

Nope. Absolutely NOT the case. The run.py is the one provided by the project.

Look at the link I provided, every other wingman is failing the task also. Along with all the other failed tasks.

I'm damn sure I reset the project. Resetting again.

Ian&Steve C.
Message 59844 - Posted: 31 Jan 2023, 3:35:41 UTC - in response to Message 59843.  
Last modified: 31 Jan 2023, 3:40:08 UTC

Your stderr output from the failed task in your link clearly indicates that it copied the run.py file, or was still trying to.

13:00:27 (3925992): wrapper: running /bin/cp (/home/keith/Desktop/BOINC/projects/www.gpugrid.net/run.py run.py)
13:00:28 (3925992): /bin/cp exited; CPU time 0.000962

The only reason it would be doing that is that you’re still running my edited file. A project reset would have erased that and replaced it with the standard version.
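
To check a task's stderr for this symptom without reading it line by line, a small Python sketch like the following could be used. It simply searches for wrapper lines of the form quoted above, where `cp` copies a run.py from elsewhere into the slot directory. The function name `find_run_py_copies` and the regular expression are illustrative, and treating such a line as evidence of a leftover override relies on the explanation in this post.

```python
import re

# Matches wrapper lines like:
#   ... wrapper: running /bin/cp (/some/path/run.py run.py)
CP_RUN_PY = re.compile(r"wrapper: running (\S*cp) \((\S+) run\.py\)")


def find_run_py_copies(stderr_text):
    """Return the source paths of every 'cp <source> run.py' wrapper step."""
    return [match.group(2) for match in CP_RUN_PY.finditer(stderr_text)]


sample = (
    "13:00:27 (3925992): wrapper: running /bin/cp "
    "(/home/keith/Desktop/BOINC/projects/www.gpugrid.net/run.py run.py)\n"
    "13:00:28 (3925992): /bin/cp exited; CPU time 0.000962\n"
)

for source in find_run_py_copies(sample):
    print("run.py was copied from:", source)
```

An empty result for the stderr of a freshly reset host would be consistent with the standard setup being back in place.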

The other hosts that failed, failed for different reasons. They got unlucky and hit hosts with incompatible GPUs.

©2025 Universitat Pompeu Fabra