Python Runtime (GPU, beta)

Author	Message
abouh · Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Message 57785 - Posted: 10 Nov 2021, 14:48:30 UTC - in response to Message 57782.

Thank you for the feedback. We had detected the error in https://www.gpugrid.net/result.php?resultid=32660448 but not the one in https://www.gpugrid.net/result.php?resultid=32660680

Having alternating phases of lower and higher GPU utilisation is normal in Reinforcement Learning, as the agent alternates between data collection (generally low GPU usage) and training (higher GPU memory and utilisation). Once we solve most of the errors we will focus on maximizing GPU efficiency during the training phases.

Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1117 · Credit: 40,876,970,595 · RAC: 1
Message 57786 - Posted: 10 Nov 2021, 15:04:20 UTC - in response to Message 57785. Last modified: 10 Nov 2021, 15:09:17 UTC

Have you considered creating a modified app that will use the onboard Tensor cores of RTX (and other) GPUs? It should speed things up considerably.

https://www.quora.com/Does-tensorflow-and-pytorch-automatically-use-the-tensor-cores-in-rtx-2080-ti-or-other-rtx-cards

I'm guessing that, in addition to making the needed configuration changes, you'd need to adjust your scheduler to only send to cards with Tensor cores (GeForce RTX cards, TitanV, Tesla/QuadroRTX cards from Volta forward).

Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1117 · Credit: 40,876,970,595 · RAC: 1
Message 57789 - Posted: 10 Nov 2021, 16:28:26 UTC - in response to Message 57786. Last modified: 10 Nov 2021, 16:30:20 UTC

Information for PyTorch here:
https://github.com/NVIDIA/apex
https://nvidia.github.io/apex/

abouh · Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Message 57790 - Posted: 10 Nov 2021, 17:04:13 UTC - in response to Message 57786.

We are using PyTorch to train our agents, and for now we have not considered using mixed precision, which seems to be required for the Tensor cores. It could be an interesting way to reduce memory requirements and speed up training. I have to admit that I do not know how it affects performance in reinforcement learning algorithms, but it is an interesting option.
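
One reason mixed precision can affect training results is that half-precision floats have a much narrower range than single precision, so very small gradient values can round to zero. The sketch below illustrates that effect using only the Python standard library; it does not use PyTorch and is not the project's code. The gradient value and the scale factor are made-up numbers chosen for illustration. In PyTorch itself, the automatic mixed precision package (`torch.cuda.amp`, with `autocast` and `GradScaler`) applies this kind of loss scaling automatically.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8        # illustrative small gradient value (assumption)
scale = 65536.0    # illustrative loss-scale factor, 2**16 (assumption)

# Stored directly in half precision, the value is too small and becomes 0.0.
underflowed = to_fp16(grad)

# Scaled up first, the value fits in half precision; dividing by the scale
# afterwards (in higher precision) recovers it approximately.
scaled = to_fp16(grad * scale)
recovered = scaled / scale

print(underflowed)
print(recovered)
```

Whether such rounding matters for a given reinforcement learning algorithm is an empirical question; the sketch only shows why a scaling step is usually part of mixed-precision training.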

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 57792 - Posted: 10 Nov 2021, 18:53:33 UTC. Last modified: 10 Nov 2021, 19:48:25 UTC

Getting errors in the test5 run, like
e2a16-ABOU_ppod_gym_test5-0-1-RND0379_1
e2a10-ABOU_ppod_gym_test5-0-1-RND0874_1

And on the test6 run. This time, the error seems to be in placing the expected task files in the slot directory, prior to starting the main run.
e3a17-ABOU_ppod_gym_test6-0-1-RND2029_0
e3a11-ABOU_ppod_gym_test6-0-1-RND1260_4

Both have

    File "run.py", line 393, in <module>
      main()
    File "run.py", line 106, in main
      feature_extractor_network=get_feature_extractor(args.nn),
    File "/var/lib/boinc-client/slots/4/gpugridpy/lib/python3.8/site-packages/pytorchrl/agent/actors/feature_extractors/__init__.py", line 19, in get_feature_extractor
      raise ValueError("Specified model not found!")
    ValueError: Specified model not found!

mmonnin · Joined: 2 Jul 16 · Posts: 339 · Credit: 8,281,341,558 · RAC: 437,535
Message 57794 - Posted: 10 Nov 2021, 23:56:58 UTC. Last modified: 10 Nov 2021, 23:57:13 UTC

I got one that worked today. Then 6 more that didn't, on the same PC.
https://www.gpugrid.net/workunit.php?wuid=27086033

mmonnin · Joined: 2 Jul 16 · Posts: 339 · Credit: 8,281,341,558 · RAC: 437,535
Message 57795 - Posted: 11 Nov 2021, 2:04:09 UTC

I got another. So far it is running.
Over 4 CPU threads at first, then 1 thread.
In the first 4 min: 13% completed, then back to 10%, then no more progression.
At 10%, then GPU load at 3-5%.
875 MB VRAM.
78 min so far.

ServicEnginIC · Joined: 24 Sep 10 · Posts: 595 · Credit: 13,083,686,510 · RAC: 1,647,142
Message 57796 - Posted: 11 Nov 2021, 6:49:43 UTC

I've got several GPU Python beta tasks on my triple-GPU Host #480458.
Several of them have succeeded after around 5000 seconds of execution time, but three of these tasks have exceeded this time.
Task e1a20-ABOU_ppod_gym_test-0-1-RND4563_6 failed after 11432 seconds.
Task e1a6-ABOU_ppod_gym_test-0-1-RND1186_1 failed after 18784 seconds.
Task e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 is currently running even longer.

This last task is theoretically running on device 1, but it seems to be effectively running on device 0, sharing that device with a regular ACEMD3 task, e14s132_e10s98p1f905-ADRIA_AdB_KIXCMYB_HIP-0-2-RND7676_5.

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 5
Message 57797 - Posted: 11 Nov 2021, 7:13:49 UTC

I've got the same thing going on. BOINC says the task is running on Device2 while in reality it is sharing Device0 along with an Einstein GRP task.

This is the task: https://www.gpugrid.net/result.php?resultid=32661276

ServicEnginIC · Joined: 24 Sep 10 · Posts: 595 · Credit: 13,083,686,510 · RAC: 1,647,142
Message 57798 - Posted: 11 Nov 2021, 9:30:27 UTC - in response to Message 57796.

"Task e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 is currently running even longer."

The risk of beta testing: It finally failed after 42555 seconds.
I hope this is somehow useful for debugging...

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 57799 - Posted: 11 Nov 2021, 9:50:17 UTC - in response to Message 57798.

"Task e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 is currently running even longer."

    FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/3/model.state_dict.73201'

The same for two of your predecessors on this workunit. Is there any way we could avoid re-inventing the wheel (slowly) for errors like this?

abouh · Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Message 57800 - Posted: 11 Nov 2021, 13:44:16 UTC - in response to Message 57799. Last modified: 11 Nov 2021, 13:48:31 UTC

The excessively long training time problem and the problem related to

    FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/3/model.state_dict.73201'

have been fixed now. Most jobs sent today are being completed successfully. The reported issues were very helpful for debugging.

Progress: The core research idea is to train populations of reinforcement learning agents that learn independently for a certain amount of time and, once they return to the server, pool what they have learned with other agents to create a new generation of agents equipped with the information acquired by previous generations. Each GPUgrid job is one of these agents doing some training independently. In that sense, the first few characters of the job name (the part before the first hyphen) identify the generation and the number of the agent (e.g. e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 refers to epoch, or generation, number 1 and agent number 2 within that generation).

The recent debugging has allowed more and more of these jobs to finish. An experiment currently running has already reached a 3rd generation of agents.

As mentioned in an earlier post, we are now working with OpenAI gym environments (https://gym.openai.com/).
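
The naming scheme described above can be decoded mechanically. The following is a minimal sketch, assuming every job name starts with `e<generation>a<agent>-` as in the names quoted in this thread; the pattern and the function name `parse_job_name` are illustrative and not part of the project's code.

```python
import re
from typing import Tuple

# Pattern inferred from job names quoted in this thread (an assumption),
# e.g. "e1a2-ABOU_ppod_gym_test-0-1-RND2391_3".
JOB_PREFIX = re.compile(r"^e(?P<generation>\d+)a(?P<agent>\d+)-")


def parse_job_name(name: str) -> Tuple[int, int]:
    """Return (generation, agent) decoded from the job-name prefix."""
    match = JOB_PREFIX.match(name)
    if match is None:
        raise ValueError(f"unrecognised job name: {name!r}")
    return int(match["generation"]), int(match["agent"])


print(parse_job_name("e1a2-ABOU_ppod_gym_test-0-1-RND2391_3"))    # (1, 2)
print(parse_job_name("e2a16-ABOU_ppod_gym_test5-0-1-RND0379_1"))  # (2, 16)
```

Matching on digits rather than on a fixed width keeps the sketch working for names such as e2a16 or e1a20, where the prefix is longer than four characters.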

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 5
Message 57801 - Posted: 11 Nov 2021, 15:48:48 UTC - in response to Message 57800. Last modified: 11 Nov 2021, 15:54:02 UTC

Are you working on fixing the issue that the tasks only run on Device#0 in BOINC, even when Device#0 is already occupied by another task from another project?

That leaves at least one device doing nothing because BOINC thinks it is occupied.

ServicEnginIC · Joined: 24 Sep 10 · Posts: 595 · Credit: 13,083,686,510 · RAC: 1,647,142
Message 57802 - Posted: 11 Nov 2021, 17:15:48 UTC - in response to Message 57801. Last modified: 11 Nov 2021, 17:16:26 UTC

"Are you working on fixing the issue that the tasks only run on Device#0 in BOINC?"

+1

In this other example, Device 0 is running 1 Gpugrid ACEMD3 task and 2 Python GPU tasks. Meanwhile, Device 1 and Device 2 remain idle.

Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1117 · Credit: 40,876,970,595 · RAC: 1
Message 57803 - Posted: 11 Nov 2021, 17:25:26 UTC - in response to Message 57802.

Weird, I thought this problem had been fixed already. I guess I never noticed, since I've only been running the beta tasks on my single-GPU system.

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 57804 - Posted: 11 Nov 2021, 17:31:21 UTC. Last modified: 11 Nov 2021, 17:57:23 UTC

Count me in on this, too. My client is running e8a16-ABOU_ppod_gym_test7-0-1-RND1448_0 on device 1. I have GPUGrid excluded from device 0, so I can run tasks from other projects in the faster PCIe slot while testing. But ...

Well, despite running on the wrong card, it finished and passed the GPUGrid validation test. I've swapped over the exclusion, and BOINC and GPUGrid are now in agreement that card 0 is the card to use.

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 5
Message 57805 - Posted: 11 Nov 2021, 18:32:50 UTC

Hard to tell from the error code snippet whether the tasks are hardwired to run on Device#0 or whether the error snippet is just the result of where the task actually has run.

    [nan, nan, nan, ..., nan, nan, nan]], device='cuda:0',

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 5
Message 57817 - Posted: 12 Nov 2021, 19:33:08 UTC. Last modified: 12 Nov 2021, 19:39:58 UTC

Well, I have a new python task running by itself now on Device#2. So it may mean they have fixed the issue where the tasks always ran on Device#0.

See this new output in the stderr.txt that looks like it is allocating to Device#2. It hasn't appeared in any of my other tasks until this new one.

    Found GPU: True, Number 2 - 2

abouh · Joined: 31 May 21 · Posts: 200 · Credit: 0 · RAC: 0
Message 57818 - Posted: 12 Nov 2021, 19:51:54 UTC - in response to Message 57817.

Yes, we have fixed the issue. It should be fine now. Please let us know if you encounter any new device placement errors.

We just ran the tests and, as you mention, we now print the device number in the stderr file.
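
For readers wondering what honouring the assigned device can look like in a Python task, here is a minimal sketch. It assumes the device number reaches the script as a `--device N` command-line argument; how the project's wrapper actually passes the number is not stated in this thread, and the function names are hypothetical. The resulting string has the same form as the `device='cuda:0'` seen in the error snippet quoted earlier.

```python
import argparse
from typing import List, Optional


def select_device_index(argv: Optional[List[str]] = None, default: int = 0) -> int:
    """Read the GPU index from a `--device N` argument (assumed convention).

    Unknown arguments are ignored so the sketch can sit next to other options.
    If `--device` is absent, fall back to `default`.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--device", type=int, default=default)
    args, _unknown = parser.parse_known_args(argv)
    return args.device


def device_string(index: int) -> str:
    """Build a device string of the form 'cuda:N'."""
    return f"cuda:{index}"


print(device_string(select_device_index(["--device", "2"])))  # cuda:2
print(device_string(select_device_index([])))                 # cuda:0
```

A task that always ends up on device 0 regardless of what BOINC assigned is consistent with the fallback branch being taken every time, which is why logging the chosen index, as the project now does in stderr, is useful for multi-GPU hosts.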

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 5
Message 57819 - Posted: 12 Nov 2021, 20:08:22 UTC - in response to Message 57818.

Thank you for fixing this issue. I don't know whether you test in a multi-GPU environment or not. I suspect a lot of projects don't.

But there are lots of us who run many multi-GPU hosts and have often been bitten by this bug.