Message boards :
News :
Experimental Python tasks (beta) - task description
Message board moderation
Author | Message |
---|---|
Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level ![]() Scientific publications ![]() |
Hello everyone, just wanted to give some updates about the machine learning - python jobs that Toni mentioned earlier in the "Experimental Python tasks (beta) " thread. What are we trying to accomplish? We are trying to train populations of intelligent agents in a distributed computational setting to solve reinforcement learning problems. This idea is inspired in the fact that human societies are knowledgeable as a whole, while individual agents have limited information. Also, every new generation of individuals attempts to expand and refine the knowledge inherited from previous ones, and the most interesting discoveries become part of a corpus of common knowledge. The idea is that small groups of agents will train in GPUgrid machines, and report their discoveries and findings. Information of multiple agents can be put in common and conveyed to new generations of machine learning agents. To the best of our knowledge this is the first time something of this sort is attempted in a GPUGrid-like platform, and has the potential to scale to solve problems unattainable in smaller scale settings. Why most jobs were failing a few weeks ago? It took us some time and testing to make simple agents work, but we managed to solve the problems in the previous weeks. Now, almost all agents train successfully. Why are GPUs being underutilized? and why are CPU used for? In the previous weeks we were running small scale tests, with small neural networks models that occupied little GPU memory. Also, some reinforcement learning environments, especially simple ones like those used in the test, run on CPU. Our idea is to scale to more complex models and environments to exploit the GPU capacity of the grid. More information: We use mainly PyTorch to train our neural networks. We only use Tensorboard because it is convenient for logging. We might remove that dependency in the future. |
Send message Joined: 22 May 20 Posts: 110 Credit: 115,525,136 RAC: 381 Level ![]() Scientific publications ![]() |
Highly anticipated and overdue. Needless to say, kudos to you and your team for pushing the frontier on the computational abilities of the client software. Looking forward to contribute in the future, hopefully with more than I have at hand right now. A couple of questions though: 1. As the main ML technique used for training the individual agents is neural networks, I wonder about the specifics of the whole setup? What does the learning data set look like? What AF do you use? Any optimisation, regularisation used? 2. Is it mainly about getting this kind of framework to work and then test for its accuracy? How did you determine the model's base parameters as is to get you started? How can you be sure that the initial model setup is getting you anywhere/is optimal? Or do you ultimately want to tune the final model and compare the accuracy of various reinforced learning approaches? 3. Is there a way to gauge the future complexity of those prospective WUs at this stage? Similar runtimes as the current Bandit tasks? 4. What do you want to use the trained networks for? What are you trying to predict? Or rephrased what main use cases/fields of research are currently imagined for the final model? What do you envision to be "problems [so far] unattainable in smaller scale settings"? 5. What is the ultimate goal of this ML-project? Have only one latest gen trained agents group at the end that is the result of the continuous reinforeced learning iterations? Have several and test/benchmark them against each other? Thx! Keep up the great work! |
Send message Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,348,595 RAC: 4,765,598 Level ![]() Scientific publications ![]() |
will you be utilizing the tensor cores present in the nvidia RTX cards? the tensor cores are designed for this kind of workload. ![]() |
![]() Send message Joined: 30 Jul 16 Posts: 4 Credit: 1,555,158,536 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
This is a welcome advance. Looking forward to contributing. ![]() ![]() |
![]() ![]() Send message Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,102,898 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
Thank you very much for this advance. I understand that on this kind of "singular" research only a limited general guidelines can be given, or there is a risk for them not being singular any more... Best wishes. |
Send message Joined: 20 Sep 13 Posts: 16 Credit: 3,433,447 RAC: 0 Level ![]() Scientific publications ![]() |
Wish you sucess. regards _heinz |
Send message Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 960 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
Ian&Steve C. wrote on June 17th: will you be utilizing the tensor cores present in the nvidia RTX cards? the tensor cores are designed for this kind of workload. I am courious what the answer will be |
Send message Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,348,595 RAC: 4,765,598 Level ![]() Scientific publications ![]() |
also, can the team comment on not just GPU "under"utilization. these have NO GPU utilization. when will you start releasing tasks that do more than just CPU calculation? are you aware that only CPU calculation is occurring and nothing happens on the GPU at all? I have never observed these new tasks to use the GPU, ever. even the tasks that takes ~1hr to crunch. it all happens on the single CPU thread allocated for the WU. 0% GPU utilization and no gpugrid processes reported in nvidia-smi ![]() |
![]() Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
I understand this is basic research in ML. However, I wonder which problems it would be used for here. Personally I'm here for the bio-science. If the topic of the new ML research differs significantly and it seems to be successful based on first trials, I'd suggest to set it up as a seperate project. MrS Scanning for our furry friends since Jan 2002 |
Send message Joined: 22 May 20 Posts: 110 Credit: 115,525,136 RAC: 381 Level ![]() Scientific publications ![]() |
This is why I asked what "problems" are currently envisioned to be tackled by the resulting model. But IMO and understanding this is a ML project specifically set up to be trained on biomedical data sets. Thus, I'd argue that the science being done is still bio-related nonetheless. Would highly appreciate a feedback to loads of great questions here in this thread so far. |
![]() ![]() Send message Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 1 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
|
Send message Joined: 2 Jul 16 Posts: 338 Credit: 7,987,341,558 RAC: 197,587 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() |
I noticed some python tasks in my task history. All failed for me and failed so far for everyone else. Has anyone completed any? Examnple: https://www.gpugrid.net/workunit.php?wuid=27100605 |
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 326,008 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
Host 132158 is getting some. The first failed with: File "/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py", line 28, in run sys.stderr.write("Unable to execute '{}'. HINT: are you sure `make` is installed?\n".format(' '.join(cmd))) NameError: name 'cmd' is not defined ---------------------------------------- ERROR: Failed building wheel for atari-py ERROR: Command errored out with exit status 1: command: /var/lib/boinc-client/slots/0/gpugridpy/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py'"'"'; __file__='"'"'/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-k6sefcno/install-record.txt --single-version-externally-managed --compile --install-headers /var/lib/boinc-client/slots/0/gpugridpy/include/python3.8/atari-py cwd: /tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/ Looks like a typo. |
![]() Send message Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 678,713 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() |
Shame the tasks are misconfigured. I ran through a dozen of them on a host with errors. With the scarcity of work, every little bit is appreciated and can be used. We just got put back in good graces with a whitelist at Gridcoin too. |
![]() Send message Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 678,713 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() |
@abouh, could you check your configuration again? The tasks are failing during the build process with cmake. cmake normally isn't installed in Linux and when it is it is not normally installed into the PATH environment. It probably needs to be exported into the userland environment. |
Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level ![]() Scientific publications ![]() |
Hello everyone, sorry for the late reply. we detected the "cmake" error and found a way around it that does not require to install anything. Some jobs already finished successfully last Friday without reporting this error. The error was related to the atari_py, as some users reported. More specifically installing this python package from github https://github.com/openai/atari-py, which allows to use some Atari2600 games as a test bench for reinforcement learning (RL) agents. Sorry for the inconveniences. Even while the AI agents part of the code has been tested and works, every time we need to test our agents in a new environment we need te modify environment initialisation part of the code with the one containing the new environment, in this case atari_py. I just sent another batch of 5 test jobs, 3 already finished the others seem to be working without problems but have not yet finished. http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730763 http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730759 http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730761 http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730760 http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730762 |
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 326,008 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
Multiple different failure modes among the four hosts that have failed (so far) to run workunit 27102466. |
Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level ![]() Scientific publications ![]() |
The error reported in the job with result ID 32730901 is due to a conda environment error detected and solved during previous testing bouts. It is the one that talk about a dependency called "pinocchio" and detects conflicts with it. Seems like the conda misconfiguration persisted in some machines. To solve this error should be enough to click "reset" to reset the App. |
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 326,008 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
OK, I've reset both my Linux hosts. Fortunately I'm on a fast line for the replacement download... |
©2025 Universitat Pompeu Fabra