Python Runtime (GPU, beta)

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,909,595
RAC: 4,232,576
Message 57716 - Posted: 2 Nov 2021, 16:45:03 UTC - in response to Message 57715.  

The Obstacle Tower environment is a simulated environment for machine learning (reinforcement learning) research. To study how to train and deploy embodied agents in the real world, it is common to work in 3D world simulations like this one. This is the GitHub page of the project: https://github.com/Unity-Technologies/obstacle-tower-env

We use it as a testbench within our efforts to train populations of interacting artificially intelligent agents able to develop complex behaviours and solve complex tasks. The environment runs on the GPU, and so do the deep learning models that learn to solve the simulation.

Most of the bugs we are trying to solve are related to the environment. It is installed via git, but the git-related issues are being solved on the server side, as mentioned. The reported stderr message "ModuleNotFoundError: No module named 'aiohttp.signals'" should be solved now. The small black screen is also related to the environment.


do you have any plans to utilize the Tensor cores present on many newer Nvidia GPUs? these are designed for machine learning tasks.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 57718 - Posted: 2 Nov 2021, 17:50:17 UTC

Thanks for the feedback - on that basis, I'll keep pushing them through.

Had an odd finish:

FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/6/model.state_dict.3073'
(raylet) /var/lib/boinc-client/slots/6/gpugridpy/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
(raylet) warnings.warn(
(raylet) Traceback (most recent call last):
(raylet) File "/var/lib/boinc-client/slots/6/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 22, in <module>
(raylet) import ray.new_dashboard.utils as dashboard_utils
(raylet) File "/var/lib/boinc-client/slots/6/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/utils.py", line 20, in <module>
(raylet) import aiohttp.signals
(raylet) ModuleNotFoundError: No module named 'aiohttp.signals'
INFO:mlagents_envs.environment:Environment shut down with return code 0.
15:21:11 (827067): ./gpugridpy/bin/python exited; CPU time 1598.264794
15:21:11 (827067): app exit status: 0x1
15:21:11 (827067): called boinc_finish(195)

"Environment shut down with return code 0" sounds like a happy ending, but "called boinc_finish(195)" is 'Child failed'.
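A small sketch of how that mismatch could be spotted automatically: it scans a stderr excerpt for the wrapper's `called boinc_finish(N)` line and reports the last code found. The helper name `final_exit_code` and the two-entry lookup table are illustrative assumptions; only the meaning of 195 ('child failed', EXIT_CHILD_FAILED) comes from this thread.

```python
import re

# Only the codes discussed in this thread are listed; extend as needed.
KNOWN_EXIT_CODES = {0: "success", 195: "EXIT_CHILD_FAILED"}

_FINISH_RE = re.compile(r"called boinc_finish\((\d+)\)")


def final_exit_code(stderr_text):
    """Return the last boinc_finish() code in the text, or None if absent."""
    matches = _FINISH_RE.findall(stderr_text)
    return int(matches[-1]) if matches else None


sample = """INFO:mlagents_envs.environment:Environment shut down with return code 0.
15:21:11 (827067): app exit status: 0x1
15:21:11 (827067): called boinc_finish(195)"""

code = final_exit_code(sample)
print(code, KNOWN_EXIT_CODES.get(code, "unknown"))
```

The environment's own "return code 0" line is deliberately ignored here, since the wrapper's final status is what the client reports.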

Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Message 57719 - Posted: 3 Nov 2021, 6:23:50 UTC

Tried a LOT of the PythonGPU tasks today. Still no joy for a successful run.

Think they are getting further along though since I think I see progress in how far they get before the environment collapses and errors out.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 57721 - Posted: 3 Nov 2021, 9:30:00 UTC

The next round of testing has started.

e1a10-ABOU_PPOObstacle6-0-1-RND2533_0 - I was going to say 'is running', but it's crashed already. After only 20 seconds, I got an apparently normal finish, followed by

upload failure: <file_xfer_error>
<file_name>e1a10-ABOU_PPOObstacle6-0-1-RND2533_0_0</file_name>
<error_code>-131 (file size too big)</error_code>

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 57722 - Posted: 3 Nov 2021, 9:36:16 UTC
Last modified: 3 Nov 2021, 10:01:49 UTC

Got another from what looks like the same batch. Limit is

<max_nbytes>100000000.000000</max_nbytes>

I'll catch the output and see how big it is.

Edit - couldn't catch it ('report immediately' operated too fast). But I watched the next one in the slot directory: the output file was created right at the end, but was cleaned up almost immediately. I read it as 169 MB, but can't be certain.
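For anyone who wants to check the numbers, here is a rough sketch of the comparison, assuming the client rejects any output file larger than `<max_nbytes>`. The function names are made up for illustration, and the 169 MB figure is my uncertain reading from above, taken as decimal megabytes.

```python
import re


def parse_max_nbytes(template_fragment):
    """Extract the <max_nbytes> value (in bytes) from a template fragment."""
    match = re.search(r"<max_nbytes>\s*([0-9.]+)\s*</max_nbytes>", template_fragment)
    if match is None:
        raise ValueError("no <max_nbytes> element found")
    return float(match.group(1))


def exceeds_limit(size_bytes, limit_bytes):
    """True if an output file of size_bytes is larger than the limit."""
    return size_bytes > limit_bytes


limit = parse_max_nbytes("<max_nbytes>100000000.000000</max_nbytes>")
observed = 169 * 1000 * 1000  # assumption: "169 MB" read as decimal megabytes
print(limit, exceeds_limit(observed, limit))
```

With a limit of 100,000,000 bytes, a file of roughly 169 MB is over by a wide margin, which would be consistent with the -131 error.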

abouh
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 57723 - Posted: 3 Nov 2021, 10:25:40 UTC - in response to Message 57722.  
Last modified: 3 Nov 2021, 10:28:53 UTC

Yes, the file should be approximately 170 MB.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 57725 - Posted: 3 Nov 2021, 10:33:04 UTC
Last modified: 3 Nov 2021, 10:45:31 UTC

Well, I got one for you to study:

e1a8-ABOU_PPOObstacle7-0-1-RND2466_3

That was done by manually increasing the maximum allowed size in BOINC. I think that's an internal setting in the BOINC system - specifically, the workunit generator or its template files - rather than the Python package.

I've suspended work fetch for now - please let us know when the next iteration is ready to test.

Edit - this is what the upload file contained:




It seems a bit odd to return the ObstacleTower zip back to you unchanged?

abouh
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 57726 - Posted: 3 Nov 2021, 10:55:09 UTC - in response to Message 57711.  
Last modified: 3 Nov 2021, 11:38:59 UTC

The git-related errors should be solved now.


ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?


We will study the errors related to downloading the Obstacle Tower environment. Thank you for the feedback.

Azmodes
Joined: 7 Jan 17
Posts: 34
Credit: 1,371,429,518
RAC: 0
Message 57727 - Posted: 3 Nov 2021, 12:19:15 UTC

Got one that ended in 195 (0xc3) EXIT_CHILD_FAILED after 15 minutes:

==> WARNING: A newer version of conda exists. <==
  current version: 4.8.3
  latest version: 4.10.3

Please update conda by running

    $ conda update -n base -c defaults conda


13:14:06 (11501): /usr/bin/flock exited; CPU time 470.306190
13:14:06 (11501): wrapper: running ./gpugridpy/bin/python (run.py)
path: ['/var/lib/boinc-client/slots/34', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/git/ext/gitdb', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python38.zip', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/lib-dynload', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/gitdb/ext/smmap']
git path: /var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/git
Traceback (most recent call last):
  File "run.py", line 340, in <module>
    main()
  File "run.py", line 53, in main
    print("GPU available: {}".format(torch.cuda.is_available()))
NameError: name 'torch' is not defined
13:14:10 (11501): ./gpugridpy/bin/python exited; CPU time 1.602758
13:14:10 (11501): app exit status: 0x1
13:14:10 (11501): called boinc_finish(195)

</stderr_txt>
]]>
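The traceback shows run.py reaching a `torch.cuda.is_available()` call with no `torch` name bound; nothing in the log says why. One defensive pattern, sketched here under the assumption that the import failed or was skipped earlier, is to check for required modules up front so the failure is reported clearly. The function name `missing_modules` and the module list are hypothetical and not taken from run.py.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical dependency list; "json" is included only as a module that
# is always present, so the output shows both outcomes.
required = ["json", "torch"]
print("missing:", missing_modules(required))
```

A job script could stop early with a readable message when the list is non-empty, instead of hitting a NameError later.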

Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Message 57728 - Posted: 3 Nov 2021, 14:24:08 UTC
Last modified: 3 Nov 2021, 14:24:55 UTC

Got five PythonGPU tasks that finished and reported as valid after about ten minutes.

bozz4science
Joined: 22 May 20
Posts: 110
Credit: 115,525,136
RAC: 345
Message 57732 - Posted: 3 Nov 2021, 17:18:12 UTC

My machine is a dual boot machine (Win10/Ubuntu 20.04). Are there plans for a Windows app for these tasks or should I boot into Linux to get some of these tasks?

Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Message 57733 - Posted: 3 Nov 2021, 19:50:45 UTC - in response to Message 57732.  

Haven't heard of any posts by admin types that Windows apps will be made.
That stated, often the new beta apps are tested first on Linux to get the bugs out and then the Windows apps are generated.

Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Message 57735 - Posted: 3 Nov 2021, 19:55:47 UTC

This task looks to have run through all of its parameter set to complete normally at around 3000 seconds and was validated for ~ 200K credits.
https://www.gpugrid.net/result.php?resultid=32660133

PDW
Joined: 7 Mar 14
Posts: 18
Credit: 6,575,125,525
RAC: 1,038
Message 57737 - Posted: 3 Nov 2021, 20:05:30 UTC - in response to Message 57735.  

Did you notice whether it used the GPU and, if it did, what percentage?

I had one that ran for about 3 hours before failing, never saw the fans running during that time.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,909,595
RAC: 4,232,576
Message 57740 - Posted: 3 Nov 2021, 20:28:14 UTC
Last modified: 3 Nov 2021, 20:41:52 UTC

just ran this one on my RTX 3080Ti: https://www.gpugrid.net/result.php?resultid=32660184

16:19:48 (1841951): wrapper (7.7.26016): starting
16:19:48 (1841951): wrapper (7.7.26016): starting
16:19:48 (1841951): wrapper: running /usr/bin/flock (/home/ian/BOINC/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /home/ian/BOINC/projects/www.gpugrid.net/miniconda &&
/home/ian/BOINC/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")

0%| | 0/35 [00:00<?, ?it/s]
Extracting : tk-8.6.8-hbc83047_0.conda: 0%| | 0/35 [00:00<?, ?it/s]
Extracting : tk-8.6.8-hbc83047_0.conda: 3%|2 | 1/35 [00:00<00:11, 3.04it/s]
Extracting : urllib3-1.25.8-py37_0.conda: 3%|2 | 1/35 [00:00<00:11, 3.04it/s]
Extracting : libedit-3.1.20181209-hc058e9b_0.conda: 6%|5 | 2/35 [00:00<00:10, 3.04it/s]
Extracting : libgcc-ng-9.1.0-hdf63c60_0.conda: 9%|8 | 3/35 [00:00<00:10, 3.04it/s]
Extracting : ld_impl_linux-64-2.33.1-h53a641e_7.conda: 11%|#1 | 4/35 [00:00<00:10, 3.04it/s]
Extracting : python-3.7.7-hcff3b4d_5.conda: 14%|#4 | 5/35 [00:00<00:09, 3.04it/s]
Extracting : python-3.7.7-hcff3b4d_5.conda: 17%|#7 | 6/35 [00:00<00:06, 4.16it/s]
Extracting : tqdm-4.46.0-py_0.conda: 17%|#7 | 6/35 [00:00<00:06, 4.16it/s]
Extracting : ca-certificates-2020.1.1-0.conda: 20%|## | 7/35 [00:00<00:06, 4.16it/s]
Extracting : wheel-0.34.2-py37_0.conda: 23%|##2 | 8/35 [00:00<00:06, 4.16it/s]
Extracting : libstdcxx-ng-9.1.0-hdf63c60_0.conda: 26%|##5 | 9/35 [00:00<00:06, 4.16it/s]
Extracting : certifi-2020.4.5.1-py37_0.conda: 29%|##8 | 10/35 [00:00<00:06, 4.16it/s]
Extracting : readline-8.0-h7b6447c_0.conda: 31%|###1 | 11/35 [00:00<00:05, 4.16it/s]
Extracting : ncurses-6.2-he6710b0_1.conda: 34%|###4 | 12/35 [00:00<00:05, 4.16it/s]
Extracting : conda-package-handling-1.6.1-py37h7b6447c_0.conda: 37%|###7 | 13/35 [00:00<00:05, 4.16it/s]
Extracting : chardet-3.0.4-py37_1003.conda: 40%|#### | 14/35 [00:00<00:05, 4.16it/s]
Extracting : zlib-1.2.11-h7b6447c_3.conda: 43%|####2 | 15/35 [00:00<00:04, 4.16it/s]
Extracting : six-1.14.0-py37_0.conda: 46%|####5 | 16/35 [00:00<00:04, 4.16it/s]
Extracting : pycparser-2.20-py_0.conda: 49%|####8 | 17/35 [00:00<00:04, 4.16it/s]
Extracting : libffi-3.3-he6710b0_1.conda: 51%|#####1 | 18/35 [00:00<00:04, 4.16it/s]
Extracting : pycosat-0.6.3-py37h7b6447c_0.conda: 54%|#####4 | 19/35 [00:00<00:03, 4.16it/s]
Extracting : cffi-1.14.0-py37he30daa8_1.conda: 57%|#####7 | 20/35 [00:00<00:03, 4.16it/s]
Extracting : _libgcc_mutex-0.1-main.conda: 60%|###### | 21/35 [00:00<00:03, 4.16it/s]
Extracting : pyopenssl-19.1.0-py37_0.conda: 63%|######2 | 22/35 [00:00<00:03, 4.16it/s]
Extracting : idna-2.9-py_1.conda: 66%|######5 | 23/35 [00:00<00:02, 4.16it/s]
Extracting : pysocks-1.7.1-py37_0.conda: 69%|######8 | 24/35 [00:00<00:02, 4.16it/s]
Extracting : xz-5.2.5-h7b6447c_0.conda: 71%|#######1 | 25/35 [00:00<00:02, 4.16it/s]
Extracting : setuptools-46.4.0-py37_0.conda: 74%|#######4 | 26/35 [00:00<00:02, 4.16it/s]
Extracting : ruamel_yaml-0.15.87-py37h7b6447c_0.conda: 77%|#######7 | 27/35 [00:00<00:01, 4.16it/s]
Extracting : cryptography-2.9.2-py37h1ba5d50_0.conda: 80%|######## | 28/35 [00:00<00:01, 4.16it/s]
Extracting : openssl-1.1.1g-h7b6447c_0.conda: 83%|########2 | 29/35 [00:00<00:01, 4.16it/s]
Extracting : sqlite-3.31.1-h62c20be_1.conda: 86%|########5 | 30/35 [00:00<00:01, 4.16it/s]
Extracting : pip-20.0.2-py37_3.conda: 89%|########8 | 31/35 [00:00<00:00, 4.16it/s]
Extracting : yaml-0.1.7-had09818_2.conda: 91%|#########1| 32/35 [00:00<00:00, 4.16it/s]
Extracting : requests-2.23.0-py37_0.conda: 94%|#########4| 33/35 [00:00<00:00, 4.16it/s]
Extracting : conda-4.8.3-py37_0.tar.bz2: 97%|#########7| 34/35 [00:00<00:00, 4.16it/s]


==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 4.10.3

Please update conda by running

$ conda update -n base -c defaults conda


16:21:21 (1841951): /usr/bin/flock exited; CPU time 61.036800
16:21:21 (1841951): wrapper: running ./gpugridpy/bin/python (run.py)
Running command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-wwv7ghqo
/home/ian/BOINC/slots/15/gpugridpy/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
warnings.warn(
Downloading...
From: https://storage.googleapis.com/obstacle-tower-build/v4.1/obstacletower_v4.1_linux.zip
To: /home/ian/BOINC/slots/15/obstacletower_v4.1_linux.zip

0%| | 0.00/170M [00:00<?, ?B/s]
1%| | 2.10M/170M [00:00<00:08, 19.9MB/s]
6%|▌ | 10.5M/170M [00:00<00:02, 56.2MB/s]
11%|█▏ | 19.4M/170M [00:00<00:02, 70.8MB/s]
16%|█▋ | 27.8M/170M [00:00<00:02, 70.6MB/s]
22%|██▏ | 37.7M/170M [00:00<00:01, 76.7MB/s]
28%|██▊ | 47.7M/170M [00:00<00:01, 79.0MB/s]
34%|███▎ | 57.1M/170M [00:00<00:01, 82.8MB/s]
38%|███▊ | 65.5M/170M [00:00<00:01, 80.4MB/s]
43%|████▎ | 73.9M/170M [00:00<00:01, 81.2MB/s]
49%|████▊ | 82.8M/170M [00:01<00:01, 83.4MB/s]
54%|█████▎ | 91.2M/170M [00:01<00:00, 80.8MB/s]
59%|█████▉ | 101M/170M [00:01<00:00, 81.3MB/s]
65%|██████▍ | 110M/170M [00:01<00:00, 83.7MB/s]
70%|██████▉ | 119M/170M [00:01<00:00, 79.0MB/s]
75%|███████▍ | 127M/170M [00:01<00:00, 80.2MB/s]
80%|████████ | 137M/170M [00:01<00:00, 79.2MB/s]
85%|████████▌ | 145M/170M [00:01<00:00, 80.1MB/s]
90%|█████████ | 154M/170M [00:01<00:00, 79.1MB/s]
96%|█████████▌| 163M/170M [00:02<00:00, 82.6MB/s]
100%|██████████| 170M/170M [00:02<00:00, 78.6MB/s]
16:21:54 (1841951): ./gpugridpy/bin/python exited; CPU time 22.798227
16:21:59 (1841951): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>e1a6-ABOU_PPOObstacle6-0-1-RND7771_2_0</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>
</message>


ran for about 2 mins and errored out. file size too big? how big could the file get in 2 minutes? lol. looks like everyone in this WU chain is having the same issue though. https://www.gpugrid.net/workunit.php?wuid=27085637 Bad WU?

and I saw no evidence that it ever touched the GPU, refreshing nvidia-smi every 2 seconds showed no process running on the GPU. must still be using only the CPU.

Can an admin please directly comment if these are actually using the GPU or not? I know an admin mentioned that they were only doing CPU work "as a test". Is that still the case? Having GPU tasks that only use the CPU core is very confusing.
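for anyone who wants to log this instead of eyeballing it, a rough sketch of sampling overall GPU utilization and memory through nvidia-smi's CSV query mode. the function names are made up, the sample string is invented for illustration, and the parser assumes plain integer fields (nvidia-smi can report "[N/A]" on some setups, which this does not handle). note this samples whole-GPU numbers, not the per-process list.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'util, mem' CSV lines into (utilization %, memory MiB) tuples."""
    rows = []
    for line in text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        rows.append((int(util), int(mem)))
    return rows


def sample_gpus():
    """Return one sample per GPU, or None if nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Invented two-GPU sample, used so the parser can be tried without a GPU.
print(parse_gpu_csv("31, 8000\n0, 12\n"))
```

calling `sample_gpus()` in a loop with a 2 second sleep would give the same view as refreshing nvidia-smi by hand.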

Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Message 57741 - Posted: 3 Nov 2021, 20:33:04 UTC

The ones that have partially run and were validated only used 31% of the GPU in nvidia-smi.

The one task that appears to have successfully run through to normal completion was done while I was out of the house and did not see it run unfortunately.

Will have to wait for more to observe.

Keith Myers
Joined: 13 Dec 17
Posts: 1416
Credit: 9,119,446,190
RAC: 614,515
Message 57743 - Posted: 3 Nov 2021, 21:41:16 UTC
Last modified: 3 Nov 2021, 21:44:30 UTC

Looks like the tasks drop to 1% utilization for a few seconds before returning to hover around 10-13% utilization. I was watching one on a 2070 and it had been running for almost 60 minutes in nvidia-smi. They are marked as C+G type in that program.

I think I killed it when I pulled up htop to look at how much cpu it was using because it finished with an error instantly at the same time as htop populated the screen.

abouh
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 57748 - Posted: 4 Nov 2021, 9:27:01 UTC - in response to Message 57725.  

The contents of the downloaded obstacletower.zip file are needed to generate the data that the machine learning agent trains on. The file itself is not modified; it is only used to generate the training data.

The expected behaviour is for the file to be downloaded, used during the job completion and then deleted. Should not be returned.

Some jobs have already finished successfully. Thank you for the feedback. Current jobs being tested should use around 30% GPU and around 8000MiB GPU memory.
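As an illustration of the intended download, use, delete behaviour, here is a minimal sketch of a context manager that removes a file once the work using it is finished, even if that work fails. This is not the actual job code: the name `temporary_asset` and the stand-in file are assumptions made for the example.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_asset(path):
    """Yield path for use, then delete the file even if the job fails."""
    try:
        yield path
    finally:
        with contextlib.suppress(FileNotFoundError):
            os.remove(path)


# Demonstration with a small stand-in file instead of the real archive.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "obstacletower_stand_in.zip")
with open(archive, "wb") as handle:
    handle.write(b"placeholder")

with temporary_asset(archive) as asset:
    print("in use:", os.path.exists(asset))
print("after job:", os.path.exists(archive))
```

Cleaning up in a `finally` block means the archive is gone before any output is collected, whether the job succeeds or not.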



Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 57750 - Posted: 4 Nov 2021, 9:31:48 UTC - in response to Message 57748.  

The expected behaviour is for the file to be downloaded, used during the job completion and then deleted. Should not be returned.

That makes much more sense. Standing by for the next round of debugging... :-)

©2025 Universitat Pompeu Fabra