Experimental Python tasks (beta) - task description

abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58978 - Posted: 4 Jul 2022, 16:51:10 UTC - in response to Message 58975.  
Last modified: 5 Jul 2022, 10:47:37 UTC

The credit system gives 50,000 credits per task. However, completion within a certain amount of time multiplies this value by 1.5, then by 1.25 for a while, and finally by 1.0 indefinitely. That explains why you sometimes see 75,000 and sometimes 62,500 credits.
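
To make the arithmetic concrete, here is a minimal sketch of the rule as described above. Only the base value and multipliers come from this post; the cutoff times and all names are illustrative assumptions, not the project's actual code.

    # Hypothetical sketch of the credit rule described above; the cutoff hours
    # are assumptions, only the base value and multipliers are from the post.
    BASE_CREDIT = 50_000

    def credit_for(turnaround_hours: float) -> float:
        if turnaround_hours <= 24:      # assumed "fast return" window
            return BASE_CREDIT * 1.5    # -> 75,000
        if turnaround_hours <= 48:      # assumed second window
            return BASE_CREDIT * 1.25   # -> 62,500
        return BASE_CREDIT * 1.0        # -> 50,000
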
ID: 58978
Toby Broom

Joined: 11 Dec 08
Posts: 26
Credit: 648,944,294
RAC: 434
Message 58979 - Posted: 6 Jul 2022, 22:59:17 UTC

I had an idea after reading some of the posts about utilisation of resources.

The power users here tend to have high-end hardware on the project, so would it be possible to support our hardware fully? For example, I imagine that if you have 10-24 GB of VRAM, the whole simulation could be loaded into VRAM, giving additional performance to the project.

Additionally, the more modern cards have more ML-focused, hardware-accelerated features, so are those well utilised?
ID: 58979
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 58980 - Posted: 7 Jul 2022, 11:10:44 UTC - in response to Message 58979.  
Last modified: 7 Jul 2022, 11:11:36 UTC

The reason Reinforcement Learning agents do not currently use the full potential of the cards is that the interactions between the AI agent and the simulated environment are performed on the CPU, while the agent's "learning" process is what uses the GPU, intermittently.

There are, however, environments that run only on the GPU. They are becoming more and more common, so I see it as a real possibility that in the future the most popular benchmarks in the field will use only the GPU. Then the jobs will be much more efficient, since pretty much only the GPU will be used. Unfortunately we are not there yet...

I am not sure if I am answering your question; please let me know if I am not.
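
To illustrate the split abouh describes, here is a minimal, hypothetical PyTorch-style sketch (not the project's actual code): the environment steps run on the CPU, and only the periodic learning update touches the GPU, which is why the card sits partly idle.

    import torch

    class ToyEnv:
        """Stand-in CPU environment with a 4-dimensional observation."""
        def reset(self):
            return torch.zeros(4)
        def step(self, action):
            obs = torch.randn(4)          # CPU work: simulating the environment
            reward = float(action == 0)   # arbitrary toy reward
            return obs, reward

    device = "cuda" if torch.cuda.is_available() else "cpu"
    policy = torch.nn.Linear(4, 2).to(device)
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    env = ToyEnv()
    obs = env.reset()
    batch = []
    for step in range(1, 1025):
        with torch.no_grad():
            logits = policy(obs.to(device))        # brief GPU use per step
        action = int(torch.argmax(logits))
        obs, reward = env.step(action)             # CPU-bound rollout
        batch.append(obs)
        if step % 256 == 0:                        # intermittent GPU "learning"
            obs_batch = torch.stack(batch).to(device)
            loss = policy(obs_batch).pow(2).mean() # placeholder loss, not a real RL objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            batch.clear()
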
ID: 58980
Toby Broom

Joined: 11 Dec 08
Posts: 26
Credit: 648,944,294
RAC: 434
Message 58981 - Posted: 7 Jul 2022, 19:40:48 UTC - in response to Message 58980.  

Thanks for the comments. What about using a large quantity of VRAM if available? The latest BOINC finally reports VRAM correctly on NVIDIA cards, so you could tailor the WUs based on VRAM to protect the contributions from users with lower-specification computers.
ID: 58981
FritzB

Joined: 7 Apr 15
Posts: 17
Credit: 2,978,057,945
RAC: 55
Message 58995 - Posted: 10 Jul 2022, 8:22:33 UTC

Sorry for the OT, but some people need admin help and I've seen one being active here :)

Password reset doesn't work, and there was apparently an alternative method some years ago. Maybe this can be done again?

Please have a look in this thread: http://www.gpugrid.net/forum_thread.php?id=2587&nowrap=true#58958

Thanks!
Fritz
ID: 58995
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59002 - Posted: 12 Jul 2022, 7:38:42 UTC - in response to Message 58995.  

Hi Fritz! Apparently the problem is that sending emails from the server no longer works. I will mention the problem to the server admin.


ID: 59002
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59003 - Posted: 15 Jul 2022, 9:26:40 UTC - in response to Message 58995.  

I talked to the server admin and he explained the problem to me in more detail.

The issue comes from the fact that the GPUGrid server uses a public IP from the Universitat Pompeu Fabra, so we have to comply with the data protection and security policies of the university. Among other things, this means that we cannot send emails from our web server.

Therefore, unfortunately, that prevents us from fixing the password recovery problem.
ID: 59003
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59004 - Posted: 15 Jul 2022, 9:46:45 UTC - in response to Message 58981.  

Hello Toby,


For the Python app, do you mean executing a script that automatically detects how much memory the GPU the task has been assigned to has, and then flexibly defining an agent that uses all (or most) of it? In other words, flexibly adapting to the host machine's capacity.

The experiments we are running at the moment require training AI agents in a sequence of jobs (i.e. starting to train an agent in a GPUGrid job, then sending it back to the server to evaluate its capabilities, then sending another job that loads the same agent and continues its training, evaluating again, etc.).

Consequently, current jobs are designed to work with a fixed amount of GPU memory, and we cannot set it too high, since we want a high percentage of hosts to be able to run them.

However, it is true that by doing that we are sacrificing resources on GPUs with larger amounts of memory. You gave me something to think about; there could be situations in which this approach would make sense and would indeed be a more efficient use of resources.
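
For what it's worth, a minimal sketch of the detection part being discussed (hypothetical names and thresholds, using PyTorch's device query rather than anything GPUGrid actually ships): read the assigned GPU's total memory and scale a size parameter accordingly.

    import torch

    def pick_batch_size(default: int = 64) -> int:
        """Scale a size parameter (batch size here, as a stand-in) to the VRAM of the assigned GPU."""
        if not torch.cuda.is_available():
            return default
        props = torch.cuda.get_device_properties(torch.cuda.current_device())
        vram_gb = props.total_memory / 1024**3
        if vram_gb >= 16:      # assumed threshold
            return 512
        if vram_gb >= 8:       # assumed threshold
            return 256
        return default
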
ID: 59004
Toby Broom

Joined: 11 Dec 08
Posts: 26
Credit: 648,944,294
RAC: 434
Message 59006 - Posted: 15 Jul 2022, 16:46:27 UTC - in response to Message 59004.  

BOINC can detect the quantity of GPU memory. It was bugged in older BOINC versions for NVIDIA cards, but in 7.20 it's fixed, so there would be no need to detect it in Python, as it's already in the project database.

A variable job size, yes.

It's more work for you, but I can imagine there could be a performance boost? To keep it simple you could have S, M and L with, say, <4, 4-8 and >8 GB; the jobs for GPUs with more than 8 GB could be larger in general, as only the top-tier GPUs have this much VRAM.

It seems BOINC knows how to allocate tasks to suitable computers. Worst case, you could make it opt-in.
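To make the suggestion concrete, a tiny sketch of the S/M/L split using the thresholds above (purely hypothetical; real server-side logic would live in BOINC's scheduler, not in a Python helper like this):

    def workunit_class(vram_gb: float) -> str:
        """Map reported VRAM to a small / medium / large work-unit class."""
        if vram_gb < 4:
            return "S"
        if vram_gb <= 8:
            return "M"
        return "L"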

ID: 59006
JohnMD

Joined: 4 Dec 10
Posts: 5
Credit: 26,860,106
RAC: 0
Message 59007 - Posted: 15 Jul 2022, 20:20:42 UTC - in response to Message 58981.  

Even video cards with 6 GiB crash with insufficient VRAM.
The app is apparently not aware of available resources.
This ought to be the first priority before sending tasks to the world.
ID: 59007
jjch

Joined: 10 Nov 13
Posts: 101
Credit: 15,773,211,122
RAC: 0
Message 59008 - Posted: 15 Jul 2022, 20:47:00 UTC - in response to Message 59007.  

From what we are finding right now, the 6 GB GPUs should have sufficient VRAM to run the current Python tasks. Refer to this thread, which notes between 2.5 and 3.2 GB being used: https://www.gpugrid.net/forum_thread.php?id=5327

If jobs running on GPUs with 4 GB or more are crashing, then there is a different problem. You would have to look at the logs to see what's going on.

It's more likely they are running out of system memory or swap space, but there are a few that are failing from an unknown cause.

I took a quick look at the jobs of yours that errored, and I found that the MX150 and MX350 GPUs only have 2 GB of VRAM. That is not sufficient to run the Python app.

Unfortunately, I would suggest you use these GPUs for another project that they are more suited to.
ID: 59008
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 318
Message 59039 - Posted: 28 Jul 2022, 9:18:17 UTC
Last modified: 28 Jul 2022, 9:29:40 UTC

New generic error on multiple tasks this morning:

TypeError: create_factory() got an unexpected keyword argument 'recurrent_nets'

Seems to affect the entire batch currently being generated.
ID: 59039
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59040 - Posted: 28 Jul 2022, 9:41:25 UTC - in response to Message 59039.  
Last modified: 28 Jul 2022, 9:42:38 UTC

Thanks for letting us know, Richard. It is a minor error; sorry for the inconvenience, I am fixing it right now. Unfortunately the remaining jobs of the batch will crash, but they will then be replaced with correct ones.
ID: 59040
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 318
Message 59042 - Posted: 28 Jul 2022, 10:45:43 UTC

No worries - these things happen. The machine which alerted me to the problem now has a task 'created 28 Jul 2022 | 10:33:04 UTC' which seems to be running normally.

The earlier tasks will hang around until each of them has gone through 8 separate hosts, before your server will accept that there may have been a bug. But at least they don't waste much time.
ID: 59042
abouh

Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 59043 - Posted: 28 Jul 2022, 13:38:06 UTC - in response to Message 59042.  

Yes, exactly, it has to fail 8 times... The only good part is that the bugged tasks fail at the beginning of the script, so almost no computation is wasted. I have checked, and some of the tasks in the newest batch have already finished successfully.
ID: 59043
robertmiles

Joined: 16 Apr 09
Posts: 503
Credit: 769,991,668
RAC: 0
Message 59071 - Posted: 6 Aug 2022, 19:47:50 UTC

A peculiarity of Python apps for GPU hosts 4.03 (cuda1131):

If BOINC is shut down while such a task is in progress, then restarted, the task will show 2% progress at first, even if it was well past this before the shutdown.

However, the progress may then jump past 98% the next time a checkpoint is written, which suggests the hidden progress is recovered.

Not a definite problem, but you should be aware of it.
ID: 59071
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 318
Message 59076 - Posted: 7 Aug 2022, 14:08:51 UTC

I've been monitoring and playing with the initial runtime estimates for these tasks.

[Chart omitted: initial estimate, <flops>, DCF and APR plotted over the month; the Y-axis has been scaled by various factors of 10 to make the changes legible.]

The initial estimates (750 days to 230 days) are clearly dominated by the DCF (real numbers, unscaled).

The <flops> value - the speed of processing assumed by the server, 707 or 704 GigaFlops - shows a tiny jump midway through the month, which correlates with a machine software update (including a new version of BOINC) and a reboot. That will have triggered a CPU benchmark run.

The DCF (client-controlled) has been falling very, very slowly. It's so far distant from reality that BOINC moves it by an ultra-cautious 1% of the difference at the conclusion of each successful run. The changes in slope come about because of the varying mixture of short-running (early-exit) tasks and full-length tasks.

The APR has been wobbling about, again because of the varying mixture of tasks, but seems to be tracking the real world reasonably well. The values range from 13,000 to nearly 17,000 GigaFlops.
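
For readers unfamiliar with these quantities, here is a rough sketch of the arithmetic being described, following the behaviour outlined in this post (a standard BOINC-style estimate as I understand it; all names are illustrative and no project-specific values are assumed):

    SECONDS_PER_DAY = 86_400

    def estimated_runtime_days(rsc_fpops_est: float, flops: float, dcf: float) -> float:
        """Client runtime estimate: task size / assumed speed, scaled by the DCF."""
        return rsc_fpops_est / flops / SECONDS_PER_DAY * dcf

    def relax_dcf(dcf: float, target: float) -> float:
        """Move an inflated DCF only 1% of the way toward its target per successful task."""
        return dcf + 0.01 * (target - dcf)

With an inflated DCF, the estimate is dominated by that factor rather than by the assumed flops, which is the effect shown in the chart.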

Conclusion:

The server seems to be estimating the speed of the client using some derivative of the reported benchmark for the machine. That's absurd for a GPU-based project: the variation in GPU speeds is far greater than the variation in CPU speeds. It would be far better to use the APR, but with some safeguards and greater regard to the actual numbers involved.

The chart was derived from host 508381, which has a measured CPU speed of 7.256 GigaFlops (roughly one-hundredth of the speed assumed by the server), and all tasks were run on the same GTX 1660 Ti GPU, with a theoretical ('peak') speed of 5,530 GigaFlops. Congratulations to the GPUGrid programmers - you've exceeded three times the speed of light (according to APR)!

More seriously, that suggests that the 'size' setting for these tasks (fpops_est) - the only value the project actually has to supply manually - is set too low. This may have been the point at which the estimates started to go wrong.

One further wrinkle: BOINC servers can't fully allow for varying runtimes and early task exits. Old hands will remember the problems we had with 'dash-9' (overflow) tasks at SETI@home. We overcame that one by adding an 'outlier' pathway to the server code: if the project validator marks the task as an outlier, its runtime is disregarded when tracking APR - that keeps things a lot more stable. Details at https://boinc.berkeley.edu/trac/wiki/ValidationSimple#Runtimeoutliers
ID: 59076
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Message 59077 - Posted: 7 Aug 2022, 16:05:07 UTC - in response to Message 59076.  

Or just use the flops reported by BOINC for the GPU, since that is recorded and communicated to the project, and from my experience (with ACEMD tasks) it does get used in the credit reward for the non-static award scheme. So the project is certainly getting that value and is able to use it.
ID: 59077
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 318
Message 59078 - Posted: 7 Aug 2022, 17:08:51 UTC - in response to Message 59077.  

Except:

1) A machine with two distinct GPUs only reports the peak flops of one of them. (The 'better' card, which is usually - but not always - the faster card).
2) Just as a GPU doesn't run at 10x the speed of the host CPU, it doesn't run realistic work at peak speed, either. That would involve yet another semi-realistic fiddle factor. And Ian will no doubt tell me that fancy modern cards, like Turing and Ampere, run closer to peak speed than earlier generations.

We need to avoid having too many moving parts - too many things to get wrong when the next rotation of researchers takes over.
ID: 59078
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Message 59099 - Posted: 11 Aug 2022, 22:40:09 UTC - in response to Message 59078.  

Personally, I'm a big fan of just standardising the task computational size and assigning static credit, no matter the device used or how long it takes. Just take flops out of the equation completely; that way, faster devices get more credit/RAC based on the rate at which valid tasks are returned.

The only caveat is the need to make all the tasks roughly the same "size" computationally, but that seems easier than all the hoops to jump through to accommodate all the idiosyncrasies of BOINC, various systems, and task differences.
ID: 59099