Experimental Python tasks (beta) - task description

abouh

Message 56977 - Posted: 17 Jun 2021, 10:40:32 UTC

Hello everyone, just wanted to give some updates about the machine learning Python jobs that Toni mentioned earlier in the "Experimental Python tasks (beta)" thread.

What are we trying to accomplish?
We are trying to train populations of intelligent agents in a distributed computational setting to solve reinforcement learning problems. This idea is inspired by the fact that human societies are knowledgeable as a whole, while individual agents have limited information. Also, every new generation of individuals attempts to expand and refine the knowledge inherited from previous ones, and the most interesting discoveries become part of a corpus of common knowledge. The idea is that small groups of agents will train on GPUGrid machines and report their discoveries and findings. Information from multiple agents can then be pooled and conveyed to new generations of machine learning agents. To the best of our knowledge, this is the first time something of this sort has been attempted on a GPUGrid-like platform, and it has the potential to scale to solve problems unattainable in smaller scale settings.
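
To make the generational idea concrete, here is a minimal toy sketch in plain Python. It is not the project's code: the objective `reward`, the random local search in `train_agent`, and the "keep the best finding" pooling rule are all assumptions chosen only to illustrate agents training independently and handing their best discovery to the next generation.

```python
import random


def reward(x):
    # Hypothetical toy objective with its optimum at x = 3.
    # It stands in for the return an RL agent would collect.
    return -(x - 3.0) ** 2


def train_agent(start, steps, rng):
    # One agent: random local search that starts from inherited knowledge
    # and only keeps a candidate when it improves the reward.
    best = start
    for _ in range(steps):
        candidate = best + rng.uniform(-1.0, 1.0)
        if reward(candidate) > reward(best):
            best = candidate
    return best


def run_generations(n_generations=5, n_agents=8, steps=20, seed=0):
    rng = random.Random(seed)
    shared = 0.0  # common knowledge handed to each new generation
    history = [reward(shared)]
    for _ in range(n_generations):
        # Each agent trains independently (on a volunteer host, in the real setting).
        findings = [train_agent(shared, steps, rng) for _ in range(n_agents)]
        # Pool the findings: the best discovery becomes the new common knowledge.
        shared = max(findings, key=reward)
        history.append(reward(shared))
    return shared, history
```

Because an agent never returns something worse than what it inherited, the pooled reward in `history` cannot decrease from one generation to the next in this toy version.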

Why were most jobs failing a few weeks ago?
It took us some time and testing to make simple agents work, but we managed to solve the problems in the previous weeks. Now, almost all agents train successfully.

Why are GPUs being underutilized, and what are the CPUs used for?
In the previous weeks we were running small-scale tests, with small neural network models that occupied little GPU memory. Also, some reinforcement learning environments, especially simple ones like those used in the tests, run on the CPU. Our idea is to scale to more complex models and environments to exploit the GPU capacity of the grid.

More information:
We mainly use PyTorch to train our neural networks. We use TensorBoard only because it is convenient for logging, and we might remove that dependency in the future.
bozz4science

Message 56978 - Posted: 17 Jun 2021, 11:46:18 UTC
Last modified: 17 Jun 2021, 12:08:24 UTC

Highly anticipated and overdue. Needless to say, kudos to you and your team for pushing the frontier on the computational abilities of the client software. Looking forward to contributing in the future, hopefully with more than I have at hand right now.

A couple of questions though:

1. As the main ML technique used for training the individual agents is neural networks, I wonder about the specifics of the whole setup. What does the learning data set look like? Which activation function (AF) do you use? Is any optimisation or regularisation used?
2. Is it mainly about getting this kind of framework to work and then testing its accuracy? How did you determine the model's base parameters to get started? How can you be sure that the initial model setup is getting you anywhere or is optimal? Or do you ultimately want to tune the final model and compare the accuracy of various reinforcement learning approaches?
3. Is there a way to gauge the future complexity of those prospective WUs at this stage? Will runtimes be similar to the current Bandit tasks?
4. What do you want to use the trained networks for? What are you trying to predict? Or, rephrased, what main use cases or fields of research are currently imagined for the final model? What do you envision to be "problems [so far] unattainable in smaller scale settings"?
5. What is the ultimate goal of this ML project? To end up with a single latest-generation group of trained agents that is the result of the continuous reinforcement learning iterations? Or to have several and test/benchmark them against each other?

Thx! Keep up the great work!
Ian&Steve C.

Message 56979 - Posted: 17 Jun 2021, 13:26:58 UTC - in response to Message 56977.  

will you be utilizing the tensor cores present in the nvidia RTX cards? the tensor cores are designed for this kind of workload.
phi1258

Message 56989 - Posted: 18 Jun 2021, 11:21:31 UTC - in response to Message 56977.  

This is a welcome advance. Looking forward to contributing.
ServicEnginIC

Message 56990 - Posted: 18 Jun 2021, 12:04:08 UTC - in response to Message 56977.  

Thank you very much for this advance.
I understand that for this kind of "singular" research only limited general guidelines can be given, or there is a risk of them not being singular any more...
Best wishes.
_heinz

Message 56994 - Posted: 20 Jun 2021, 5:39:42 UTC
Last modified: 20 Jun 2021, 5:43:47 UTC

Wish you success.
Regards, _heinz
Erich56

Message 56996 - Posted: 21 Jun 2021, 11:28:16 UTC - in response to Message 56979.  

Ian&Steve C. wrote on June 17th:

will you be utilizing the tensor cores present in the nvidia RTX cards? the tensor cores are designed for this kind of workload.

I am curious what the answer will be.
Ian&Steve C.

Message 57000 - Posted: 22 Jun 2021, 12:17:47 UTC

Also, can the team comment on more than just GPU "under"utilization? These tasks have NO GPU utilization.

When will you start releasing tasks that do more than just CPU calculation? Are you aware that only CPU calculation is occurring and nothing happens on the GPU at all? I have never observed these new tasks using the GPU, not even the tasks that take ~1 hr to crunch. It all happens on the single CPU thread allocated for the WU: 0% GPU utilization and no gpugrid processes reported in nvidia-smi.
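
For anyone who wants to repeat this check from a script, here is a rough Python sketch. It assumes `nvidia-smi` is on the PATH and supports the `--query-compute-apps` and `--format=csv,noheader` options; the helper names `parse_compute_apps` and `list_gpu_processes` are made up for this example.

```python
import shutil
import subprocess


def parse_compute_apps(csv_text):
    # Parse rows of "pid, process_name, used_memory" into dictionaries.
    # Rows that do not have exactly three fields or a numeric pid are skipped.
    apps = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        pid, name, mem = parts
        apps.append({"pid": int(pid), "name": name, "used_memory": mem})
    return apps


def list_gpu_processes():
    # Return the compute processes reported by nvidia-smi,
    # or None when the tool is missing or the call fails.
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return parse_compute_apps(result.stdout)
```

An empty list from `list_gpu_processes()` while a Python task is running would match the observation above that no task process shows up on the GPU.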
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester

Message 57009 - Posted: 23 Jun 2021, 20:09:29 UTC

I understand this is basic research in ML. However, I wonder which problems it would be used for here. Personally I'm here for the bio-science. If the topic of the new ML research differs significantly and it seems to be successful based on first trials, I'd suggest setting it up as a separate project.

MrS
Scanning for our furry friends since Jan 2002
bozz4science

Message 57014 - Posted: 24 Jun 2021, 10:32:37 UTC

This is why I asked what "problems" are currently envisioned to be tackled by the resulting model. But in my opinion and understanding, this is an ML project specifically set up to be trained on biomedical data sets. Thus, I'd argue that the science being done is still bio-related. I would highly appreciate feedback on the many great questions in this thread so far.
Retvari Zoltan

Message 57020 - Posted: 26 Jun 2021, 7:53:10 UTC

mmonnin

Message 58044 - Posted: 10 Dec 2021, 11:32:51 UTC

I noticed some Python tasks in my task history. All failed for me and have failed so far for everyone else. Has anyone completed any?

Example:
https://www.gpugrid.net/workunit.php?wuid=27100605
Richard Haselgrove

Message 58045 - Posted: 10 Dec 2021, 11:56:26 UTC - in response to Message 58044.  

Host 132158 is getting some. The first failed with:

File "/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py", line 28, in run
      sys.stderr.write("Unable to execute '{}'. HINT: are you sure `make` is installed?\n".format(' '.join(cmd)))
  NameError: name 'cmd' is not defined
  ----------------------------------------
  ERROR: Failed building wheel for atari-py
    ERROR: Command errored out with exit status 1:
     command: /var/lib/boinc-client/slots/0/gpugridpy/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py'"'"'; __file__='"'"'/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-k6sefcno/install-record.txt --single-version-externally-managed --compile --install-headers /var/lib/boinc-client/slots/0/gpugridpy/include/python3.8/atari-py
         cwd: /tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/

Looks like a typo.
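
To illustrate why the traceback ends in a NameError rather than the real build failure, here is a small Python sketch of the same failure mode. It is not the actual atari-py `setup.py`; the tool name `definitely-not-a-real-build-tool` and both function names are made up so that the command is guaranteed to be missing.

```python
import subprocess
import sys


def build_buggy():
    # The command fails to start, which raises an OSError subclass.
    try:
        subprocess.check_call(["definitely-not-a-real-build-tool"])
    except OSError:
        # Bug: `cmd` was never defined, so formatting the hint raises
        # NameError and hides the original "command not found" error.
        sys.stderr.write("Unable to execute '{}'.\n".format(" ".join(cmd)))
        raise


def build_fixed():
    # Define the command once and reuse it in the error message.
    cmd = ["definitely-not-a-real-build-tool"]
    try:
        subprocess.check_call(cmd)
    except OSError as exc:
        message = "Unable to execute '{}'. Is it installed?".format(" ".join(cmd))
        raise RuntimeError(message) from exc
```

Calling `build_buggy()` fails with a NameError about `cmd`, while `build_fixed()` reports which command could not be executed.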
Keith Myers

Message 58058 - Posted: 11 Dec 2021, 0:23:09 UTC

Shame the tasks are misconfigured. I ran through a dozen of them on one host, all with errors. With the scarcity of work, every little bit is appreciated and can be used.

We just got put back in good graces with a whitelist at Gridcoin too.
Keith Myers

Message 58061 - Posted: 11 Dec 2021, 2:16:29 UTC

@abouh, could you check your configuration again? The tasks are failing during the build process with cmake. cmake normally isn't installed on Linux, and when it is, it is not normally on the PATH.
It probably needs to be exported into the userland environment.
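
A quick way to confirm whether the build tools are visible to the task is to look them up on the PATH before building. This is only a sketch; the helper name `missing_tools` and the default tool list are assumptions, not something the project ships.

```python
import shutil


def missing_tools(tools=("cmake", "make")):
    # Return the tools from `tools` that cannot be found on the current PATH.
    return [tool for tool in tools if shutil.which(tool) is None]
```

If `missing_tools()` returns a non-empty list inside the environment the task runs in, a build step that shells out to those tools can be expected to fail.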
abouh

Message 58104 - Posted: 14 Dec 2021, 16:55:30 UTC - in response to Message 58045.  

Hello everyone, sorry for the late reply.

We detected the "cmake" error and found a workaround that does not require installing anything. Some jobs already finished successfully last Friday without reporting this error.

The error was related to atari_py, as some users reported. More specifically, it occurred when installing this Python package from GitHub (https://github.com/openai/atari-py), which allows some Atari 2600 games to be used as a test bench for reinforcement learning (RL) agents.

Sorry for the inconvenience. Even though the AI agents part of the code has been tested and works, every time we test our agents in a new environment we need to replace the environment initialisation part of the code with one for the new environment, in this case atari_py.
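
One common way to keep the environment initialisation separate from the agent code is a small registry of environment factories. The sketch below is only an illustration in plain Python and not our actual code: `CoinFlipEnv`, `ENV_REGISTRY`, `make_env` and `run_episode` are hypothetical names, and the toy environment stands in for something like atari_py.

```python
import random


class CoinFlipEnv:
    # Hypothetical toy environment: guess a coin flip, reward 1.0 when correct.
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        return 0  # single dummy observation

    def step(self, action):
        reward = 1.0 if action == self.rng.randint(0, 1) else 0.0
        return 0, reward, True  # observation, reward, done


# Adding a new environment only requires a new entry here.
ENV_REGISTRY = {"coinflip": CoinFlipEnv}


def make_env(name, **kwargs):
    # Look up the environment factory by name and build the environment.
    if name not in ENV_REGISTRY:
        raise ValueError("Unknown environment: {}".format(name))
    return ENV_REGISTRY[name](**kwargs)


def run_episode(env, policy):
    # Agent-side loop: it only relies on reset() and step().
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total
```

With this layout the agent loop in `run_episode` stays untouched when a new environment is added, so a failure in a new environment's setup is isolated to its factory.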

I just sent another batch of 5 test jobs. 3 have already finished; the others seem to be working without problems but have not finished yet.

http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730763
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730759
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730761

http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730760
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730762
Richard Haselgrove

Message 58112 - Posted: 15 Dec 2021, 15:31:49 UTC - in response to Message 58104.  

Multiple different failure modes among the four hosts that have failed (so far) to run workunit 27102466.
abouh

Message 58114 - Posted: 15 Dec 2021, 16:12:09 UTC - in response to Message 58112.  

The error reported in the job with result ID 32730901 is due to a conda environment error detected and solved during previous testing bouts.

It is the one that mentions a dependency called "pinocchio" and reports conflicts with it.

It seems the conda misconfiguration persisted on some machines. To solve this error, it should be enough to click "reset" to reset the app.
Richard Haselgrove

Message 58115 - Posted: 15 Dec 2021, 16:56:36 UTC - in response to Message 58114.  

OK, I've reset both my Linux hosts. Fortunately I'm on a fast line for the replacement download...

©2025 Universitat Pompeu Fabra