Experimental Python tasks (beta)

Author	Message
abouh Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0	Message 56977 - Posted: 17 Jun 2021, 10:40:32 UTC
Hello everyone, I just wanted to give some updates about the machine learning Python jobs that Toni mentioned earlier in the "Experimental Python tasks (beta)" thread.
What are we trying to accomplish?
We are trying to train populations of intelligent agents in a distributed computational setting to solve reinforcement learning problems. The idea is inspired by the fact that human societies are knowledgeable as a whole, while individual agents have limited information. Also, every new generation of individuals attempts to expand and refine the knowledge inherited from previous ones, and the most interesting discoveries become part of a corpus of common knowledge. The idea is that small groups of agents will train on GPUGrid machines and report their discoveries and findings. Information from multiple agents can be pooled and conveyed to new generations of machine learning agents. To the best of our knowledge, this is the first time something of this sort has been attempted on a GPUGrid-like platform, and it has the potential to scale to solve problems unattainable in smaller-scale settings.
Why were most jobs failing a few weeks ago?
It took us some time and testing to make simple agents work, but we managed to solve the problems in the previous weeks. Now, almost all agents train successfully.
Why are GPUs being underutilized, and what are the CPUs used for?
In the previous weeks we were running small-scale tests, with small neural network models that occupied little GPU memory. Also, some reinforcement learning environments, especially simple ones like those used in the tests, run on the CPU. Our idea is to scale to more complex models and environments to exploit the GPU capacity of the grid.
More information:
We mainly use PyTorch to train our neural networks. We only use TensorBoard because it is convenient for logging. We might remove that dependency in the future.
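
The generational scheme described in Message 56977 can be pictured with a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than the project's code: the toy `score` function stands in for an RL return, `train_agent` uses random hill climbing instead of PyTorch training, and "putting information in common" is reduced to keeping the best result of each generation.

```python
import random


def score(params):
    """Toy stand-in for an RL return: higher is better, optimum at all 1.0."""
    return -sum((p - 1.0) ** 2 for p in params)


def train_agent(start, rng, steps=50, step_size=0.1):
    """Local training on one host: random hill climbing from `start`."""
    best = list(start)
    for _ in range(steps):
        candidate = [p + rng.gauss(0.0, step_size) for p in best]
        if score(candidate) > score(best):
            best = candidate  # keep only improvements
    return best


def run_generations(n_generations=5, agents_per_generation=4, n_params=3, seed=0):
    """Train successive generations that all start from shared knowledge."""
    rng = random.Random(seed)
    common_knowledge = [0.0] * n_params
    history = [score(common_knowledge)]
    for _ in range(n_generations):
        # Each agent starts from the shared knowledge and trains on its own.
        results = [train_agent(common_knowledge, rng)
                   for _ in range(agents_per_generation)]
        # Pool the findings: the best discovery seeds the next generation.
        common_knowledge = max(results, key=score)
        history.append(score(common_knowledge))
    return common_knowledge, history


if __name__ == "__main__":
    params, history = run_generations()
    print("score per generation:", [round(s, 3) for s in history])
```

Because each agent only ever accepts improvements, the shared score cannot get worse from one generation to the next in this toy version; a real setup would need a more careful rule for merging what different agents report.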

bozz4science Joined: 22 May 20 Posts: 110 Credit: 115,525,136 RAC: 0	Message 56978 - Posted: 17 Jun 2021, 11:46:18 UTC Last modified: 17 Jun 2021, 12:08:24 UTC
Highly anticipated and overdue. Needless to say, kudos to you and your team for pushing the frontier on the computational abilities of the client software. Looking forward to contributing in the future, hopefully with more than I have at hand right now. A couple of questions though:
1. As the main ML technique used for training the individual agents is neural networks, I wonder about the specifics of the whole setup. What does the learning data set look like? What AF do you use? Is any optimisation or regularisation used?
2. Is it mainly about getting this kind of framework to work and then testing its accuracy? How did you determine the model's base parameters to get started? How can you be sure that the initial model setup is getting you anywhere or is optimal? Or do you ultimately want to tune the final model and compare the accuracy of various reinforcement learning approaches?
3. Is there a way to gauge the future complexity of those prospective WUs at this stage? Similar runtimes as the current Bandit tasks?
4. What do you want to use the trained networks for? What are you trying to predict? Or, rephrased, what main use cases or fields of research are currently imagined for the final model? What do you envision to be "problems [so far] unattainable in smaller scale settings"?
5. What is the ultimate goal of this ML project? To have only one latest-generation group of trained agents at the end, as the result of the continuous reinforcement learning iterations? Or to have several and test/benchmark them against each other?
Thx! Keep up the great work!

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 9,834	Message 56979 - Posted: 17 Jun 2021, 13:26:58 UTC - in response to Message 56977.
Will you be utilizing the tensor cores present in the NVIDIA RTX cards? The tensor cores are designed for this kind of workload.

phi1258 Joined: 30 Jul 16 Posts: 4 Credit: 1,555,158,536 RAC: 0	Message 56989 - Posted: 18 Jun 2021, 11:21:31 UTC - in response to Message 56977.
This is a welcome advance. Looking forward to contributing.

ServicEnginIC Joined: 24 Sep 10 Posts: 595 Credit: 12,249,686,510 RAC: 1,390,367	Message 56990 - Posted: 18 Jun 2021, 12:04:08 UTC - in response to Message 56977.
Thank you very much for this advance. I understand that for this kind of "singular" research only limited general guidelines can be given, or there is a risk of them not being singular any more...
Best wishes.

_heinz Joined: 20 Sep 13 Posts: 16 Credit: 3,433,447 RAC: 0	Message 56994 - Posted: 20 Jun 2021, 5:39:42 UTC Last modified: 20 Jun 2021, 5:43:47 UTC
Wish you success.
regards _heinz

Erich56 Joined: 1 Jan 15 Posts: 1168 Credit: 12,317,898,501 RAC: 91,654	Message 56996 - Posted: 21 Jun 2021, 11:28:16 UTC - in response to Message 56979.
Ian&Steve C. wrote on June 17th: "Will you be utilizing the tensor cores present in the NVIDIA RTX cards? The tensor cores are designed for this kind of workload."
I am curious what the answer will be.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 9,834	Message 57000 - Posted: 22 Jun 2021, 12:17:47 UTC
Also, can the team comment on more than just GPU "under"utilization? These tasks have NO GPU utilization. When will you start releasing tasks that do more than just CPU calculation? Are you aware that only CPU calculation is occurring and nothing happens on the GPU at all? I have never observed these new tasks using the GPU, ever, even the tasks that take ~1 hr to crunch. It all happens on the single CPU thread allocated for the WU: 0% GPU utilization and no gpugrid processes reported in nvidia-smi.
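
The check described in Message 57000 (looking for compute processes in nvidia-smi) can be scripted. This is a minimal sketch, assuming an NVIDIA driver whose `nvidia-smi` supports the `--query-compute-apps` CSV query; the helper names `parse_compute_apps` and `list_gpu_compute_apps` are made up for illustration.

```python
import shutil
import subprocess

# CSV query for processes that currently hold a compute context on a GPU.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Turn 'pid, name, memory' rows into dicts; empty text gives []."""
    apps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from both ends so a comma inside the process name is harmless.
        pid, rest = line.split(",", 1)
        name, memory = rest.rsplit(",", 1)
        apps.append({
            "pid": pid.strip(),
            "process_name": name.strip(),
            "used_memory": memory.strip(),
        })
    return apps


def list_gpu_compute_apps():
    """Return the compute processes nvidia-smi reports, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return None
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    apps = list_gpu_compute_apps()
    if apps is None:
        print("nvidia-smi not available on this host")
    else:
        print(f"{len(apps)} compute process(es) on the GPU:", apps)
```

An empty list while one of these tasks is running would match the observation in the post: the work is happening on the CPU only.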

ExtraTerrestrial Apes Volunteer moderator Volunteer tester Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0	Message 57009 - Posted: 23 Jun 2021, 20:09:29 UTC
I understand this is basic research in ML. However, I wonder which problems it would be used for here. Personally I'm here for the bio-science. If the topic of the new ML research differs significantly and it seems to be successful based on first trials, I'd suggest setting it up as a separate project.
MrS
Scanning for our furry friends since Jan 2002

bozz4science Joined: 22 May 20 Posts: 110 Credit: 115,525,136 RAC: 0	Message 57014 - Posted: 24 Jun 2021, 10:32:37 UTC
This is why I asked what "problems" are currently envisioned to be tackled by the resulting model. But in my opinion and understanding, this is an ML project specifically set up to be trained on biomedical data sets. Thus, I'd argue that the science being done is still bio-related nonetheless. I would highly appreciate feedback on the many great questions raised in this thread so far.

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 57020 - Posted: 26 Jun 2021, 7:53:10 UTC
https://www.youtube.com/watch?v=yhJWAdZl-Ck

mmonnin Joined: 2 Jul 16 Posts: 339 Credit: 7,990,341,558 RAC: 103	Message 58044 - Posted: 10 Dec 2021, 11:32:51 UTC
I noticed some Python tasks in my task history. All failed for me and have failed so far for everyone else. Has anyone completed any?
Example: https://www.gpugrid.net/workunit.php?wuid=27100605

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 58045 - Posted: 10 Dec 2021, 11:56:26 UTC - in response to Message 58044.
Host 132158 is getting some. The first failed with:
  File "/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py", line 28, in run
    sys.stderr.write("Unable to execute '{}'. HINT: are you sure `make` is installed?\n".format(' '.join(cmd)))
NameError: name 'cmd' is not defined
----------------------------------------
ERROR: Failed building wheel for atari-py
ERROR: Command errored out with exit status 1:
command: /var/lib/boinc-client/slots/0/gpugridpy/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py'"'"'; __file__='"'"'/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-k6sefcno/install-record.txt --single-version-externally-managed --compile --install-headers /var/lib/boinc-client/slots/0/gpugridpy/include/python3.8/atari-py
cwd: /tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/
Looks like a typo.
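
The traceback in Message 58045 shows an error handler failing because it refers to a name, `cmd`, that is not defined at that point, so the intended hint about `make` never gets printed. The following is a minimal reproduction of that general pattern, not the actual atari-py `setup.py`; the function names and the deliberately missing command are hypothetical.

```python
import subprocess
import sys

# Hypothetical command, assumed not to exist on any host.
MISSING_TOOL = ["definitely-not-a-real-build-tool"]


def build_buggy():
    """The handler refers to `cmd`, which is never assigned anywhere."""
    try:
        subprocess.check_call(MISSING_TOOL)
    except OSError:
        # Evaluating `cmd` raises NameError, which replaces the useful hint.
        sys.stderr.write("Unable to execute '{}'.\n".format(" ".join(cmd)))
        raise


def build_fixed(cmd):
    """Same handler, but `cmd` is bound before the try block."""
    try:
        subprocess.check_call(cmd)
    except OSError:
        sys.stderr.write(
            "Unable to execute '{}'. HINT: is it installed?\n".format(" ".join(cmd))
        )
        raise


def raised_type(func, *args):
    """Return the exception class raised by func(*args), or None."""
    try:
        func(*args)
    except Exception as exc:
        return type(exc)
    return None


if __name__ == "__main__":
    print("buggy handler raises:", raised_type(build_buggy))
    print("fixed handler raises:", raised_type(build_fixed, MISSING_TOOL))
```

In the buggy version the reader sees only the `NameError`, while the underlying problem (the build tool could not be executed) is hidden behind it. That fits the later reports in this thread that the real issue was the build step.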

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 42,316	Message 58058 - Posted: 11 Dec 2021, 0:23:09 UTC
Shame the tasks are misconfigured. I ran through a dozen of them on a host with errors. With the scarcity of work, every little bit is appreciated and can be used. We just got put back in good graces with a whitelist at Gridcoin too.

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 42,316	Message 58061 - Posted: 11 Dec 2021, 2:16:29 UTC
@abouh, could you check your configuration again? The tasks are failing during the build process with cmake. cmake normally isn't installed on Linux, and when it is, it is not normally installed into the PATH environment. It probably needs to be exported into the userland environment.
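
Whether `cmake` and `make` are reachable through PATH on a given host can be checked with the standard library. A minimal sketch follows; the helper names are made up for illustration, and the check only reflects the PATH of the user running it, which may differ from the environment the BOINC client uses.

```python
import shutil


def find_build_tools(tools=("cmake", "make")):
    """Map each tool name to its resolved path on PATH, or None if absent."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_build_tools(tools=("cmake", "make")):
    """List the tools that could not be found on PATH."""
    return [tool for tool, path in find_build_tools(tools).items() if path is None]


if __name__ == "__main__":
    for tool, path in find_build_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```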

abouh Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0	Message 58104 - Posted: 14 Dec 2021, 16:55:30 UTC - in response to Message 58045.
Hello everyone, sorry for the late reply. We detected the "cmake" error and found a way around it that does not require installing anything. Some jobs already finished successfully last Friday without reporting this error.
The error was related to atari_py, as some users reported. More specifically, it came from installing this Python package from GitHub, https://github.com/openai/atari-py, which allows using some Atari 2600 games as a test bench for reinforcement learning (RL) agents.
Sorry for the inconvenience. Even though the AI agents part of the code has been tested and works, every time we need to test our agents in a new environment we need to modify the environment initialisation part of the code to include the new environment, in this case atari_py.
I just sent another batch of 5 test jobs. 3 have already finished; the others seem to be working without problems but have not yet finished.
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730763
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730759
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730761
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730760
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730762
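
The split described in Message 58104, where the agent code stays fixed and only the environment initialisation changes, is often handled with a small registry of environment constructors. The sketch below is one way to picture that under stated assumptions: `ENV_REGISTRY`, `register_env`, `make_env`, and the toy `CountdownEnv` are all hypothetical and have nothing to do with the project's actual code or with atari_py's API.

```python
ENV_REGISTRY = {}


def register_env(name):
    """Decorator that records an environment factory under `name`."""
    def decorator(factory):
        ENV_REGISTRY[name] = factory
        return factory
    return decorator


class CountdownEnv:
    """Toy environment: every step gives reward 1.0, episode lasts `length` steps."""

    def __init__(self, length=3):
        self.length = length
        self.remaining = length

    def reset(self):
        self.remaining = self.length
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        done = self.remaining <= 0
        return self.remaining, 1.0, done


@register_env("countdown")
def make_countdown(**kwargs):
    return CountdownEnv(**kwargs)


def make_env(name, **kwargs):
    """Only this lookup changes when a new environment is added."""
    if name not in ENV_REGISTRY:
        raise KeyError(f"unknown environment {name!r}; known: {sorted(ENV_REGISTRY)}")
    return ENV_REGISTRY[name](**kwargs)


def run_episode(env, policy):
    """Agent-side loop that does not care which environment it was given."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total


if __name__ == "__main__":
    print(run_episode(make_env("countdown", length=4), lambda obs: 0))
```

With this layout, adding a new test bench means registering one more factory, while `run_episode` and the agent code are left alone.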

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 58112 - Posted: 15 Dec 2021, 15:31:49 UTC - in response to Message 58104.
Multiple different failure modes among the four hosts that have failed (so far) to run workunit 27102466.

abouh Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0	Message 58114 - Posted: 15 Dec 2021, 16:12:09 UTC - in response to Message 58112.
The error reported in the job with result ID 32730901 is due to a conda environment error that was detected and solved during previous testing bouts. It is the one that mentions a dependency called "pinocchio" and reports conflicts with it. It seems the conda misconfiguration persisted on some machines. To solve this error, it should be enough to click "reset" to reset the app.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 58115 - Posted: 15 Dec 2021, 16:56:36 UTC - in response to Message 58114.
OK, I've reset both my Linux hosts. Fortunately I'm on a fast line for the replacement download...
