Message boards :
Number crunching :
1 Work Unit Per GPU
| Author | Message |
|---|---|
|
Send message Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0
|
I would like to suggest a 1-WU-per-GPU policy for this project, to increase its efficient use of resources and speed up results. This post is not about fine-tuning the project to the benefit of any particular user or group of users, although ultimately I believe it will benefit everyone; it is about fine-tuning the project for the project's benefit. Also, while there are many things that could be done server-side to improve efficiency, for the purpose of this discussion I would like you to concentrate on this one only.

As things stand, we have hosts caching WUs for crunching hours in the future while other hosts that can't get any work stand idle. This doesn't make a lot of sense to me. In addition, when a handful of WUs become available, they are often quickly gobbled up by hosts that already have one running, adding to their cache, while idle hosts never get the chance to download them because of the vagaries of the BOINC backoff: you only get a WU if one is available at the moment BOINC asks for it.

All this is akin to having 2,000 workers and 1,500 jobs: instead of giving 1 job each to 1,500 workers and having 500 waiting for work, this project is giving 2 jobs to each of 750 and leaving 1,250 waiting. That doesn't make for a speedy turnaround. (This is only an example of what is taking place; the numbers are not definitive.)

Apart from the unequal distribution of work, the system actually gets worse, because in most cases a completed unit of work generates another WU, supplying more work. It can't do that while it is sitting in a host's cache, so again there is less work available for idle hosts.

There is another problem. As we all know, there are hosts that continually "time out", with the WU resent after 5 days, and other hosts that continually "error out" after long periods of time. Under the present policy, these hosts can hold up the progression of 2 WUs at a time; under a 1-WU-per-GPU policy they would only halt the progression of 1 WU at a time, reducing their impact on the project.

BOINC, as downloaded and installed, has the cache size set by default and will download 2 WUs even if the card on that host is a slow one. If that card got only one WU, it might complete it within 24 hours, speeding up the project and earning a bonus; downloading 2 WUs, it will do neither, and a lot of users will not even be aware the setting can be changed, or of its implications. In addition, there will be plenty of users with a slow card who run multiple projects under BOINC. They may want a large cache on some projects but not on GPUGrid, and they can't have that, because these settings are global and apply to all running projects.

I'm sure you're thinking that GPUGrid can't keep enough work in the queue to keep us all active now, so why should we be concerned with speed and efficiency? Firstly, I would suspect the scientists want their results back as quickly as possible, so they can analyze them and maybe issue more work on the back of the results. There is another consideration too: Stefan has mentioned collaboration with others which, if fruitful, could lead to more work and boost demand on GPUGrid. The scientists at GPUGrid may have cause in the future to release a huge batch of WUs and need the project running at maximum capacity.

In any case, the time to fix a leaky roof is when the sun is shining, not when it begins to rain, and the roof is certainly leaking on this project: leaking resources. I hope I have made my case clearly. |
|
Send message Joined: 21 Mar 16 Posts: 513 Credit: 4,673,458,277 RAC: 0
|
For those who want the TL;DR: set BOINC to 0 days of work and 0 additional days of work. If you crunch 2 WUs at a time per GPU, now is probably not the best time to do that, especially because the WUs have such high GPU utilization now. We just want to have more people crunching the WUs in parallel. |
|
Send message Joined: 21 Mar 16 Posts: 513 Credit: 4,673,458,277 RAC: 0
|
I also agree that the scientists should implement, at least temporarily, a 1-WU-per-GPU policy: only when a WU errors out, finishes, or times out does the host get another. This would dramatically speed up the work flow. |
Retvari Zoltan | Send message Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
You've made it perfectly clear. I agree with you, but this reduction of the maximum workunits per GPU is effective only when there is a shortage of workunits (which is quite frequent lately). So perhaps this setting should be different for every host, based on the turnaround time of that host: every host should start with 1 workunit per GPU, but if its turnaround time is below 48 hours, the maximum is set to 2; when the turnaround time is above 48 hours, the maximum is set back to 1.

The drawbacks of 1 workunit per GPU are:
1. You can't run two workunits on a single GPU simultaneously (hosts configured to do so will have an idle GPU, so they will have to be reconfigured).
2. It will reduce the RAC of every host, and so the overall RAC of GPUGrid, because a host won't receive a new GPUGrid workunit until the finished one is uploaded. Hosts without backup projects will be idle during the upload/download period, while hosts with backup projects will process (at least) one workunit from the backup project between GPUGrid workunits, as the policy forces the host to get work from the backup project(s) until the finished GPUGrid workunit is uploaded. If the backup project has long workunits, or the host has a slow internet connection (mainly hosts outside Europe), this reduction could be significant for those hosts (and for the whole project as well). This might lead some volunteers to disable their backup project(s), or to manage them manually (which is pretty inconvenient).

However, we should try it to know its impact exactly. |
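The adaptive limit proposed above could be sketched as a small piece of scheduler logic. This is purely illustrative: the function name and inputs are hypothetical, the 48-hour threshold is the one from the post, and real BOINC scheduler code looks nothing like this.

```python
# Sketch of the per-host limit proposed above (hypothetical names;
# not actual BOINC scheduler code).

TURNAROUND_LIMIT_HOURS = 48

def max_wus_per_gpu(avg_turnaround_hours, completed_wus):
    """Return how many workunits a host may hold per GPU."""
    if completed_wus == 0:
        # New hosts start conservatively with 1 WU per GPU.
        return 1
    if avg_turnaround_hours < TURNAROUND_LIMIT_HOURS:
        # Fast hosts earn a second WU so the GPU is less likely to
        # sit idle during upload/download.
        return 2
    # Slow-returning hosts drop back to 1 WU per GPU.
    return 1
```

The policy is self-correcting: a host that slows down loses its second slot on its next scheduler request, and a host that speeds up regains it.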
|
Send message Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0
|
I actually think it will be just as effective when there are lots of WUs to be had, because it's likely that one or more types of work will have a higher priority setting than others, and you don't want those cached; you want them completed as fast as possible. Thanks for your support, Retvari and Pappa. |
|
Send message Joined: 20 Apr 15 Posts: 285 Credit: 1,102,216,607 RAC: 0
|
I don't really second that, as reducing the number of WUs per GPU will lead to poor utilization of capable hardware (GTX 980, 980 Ti, 1070 and up) because of a bad CPU/GPU ratio. The 1 WU/GPU thing would only work well for mid-range Maxwell cards and most of the Kepler and Fermi cards, but it will surely reduce the efficiency of the fast machines. Well, I guess it is all about tasks. Let's also consider some other ideas to fill the queue, especially for the short runs. Just my two cents. I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday. |
|
Send message Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
The basic idea is not a bad one from various viewpoints; however, my reservations are mainly because of these facts brought up by Zoltan: ... If the backup project has long workunits, or the host has a slow internet connection (mainly hosts outside Europe), this reduction could be significant for these hosts (and for the whole project also). This might lead some volunteers to disable their backup project(s), or to manage them manually (which is pretty inconvenient). |
|
Send message Joined: 20 Apr 15 Posts: 285 Credit: 1,102,216,607 RAC: 0
|
So perhaps this setting should be different for every host, based on the turnaround time of that host. Every host should start with 1 workunit per GPU, but if the turnaround time is below 48 hours, the maximum is set to 2. When the turnaround time is above 48 hours, the maximum is set to 1. Zoltan, that is a good proposal, although I am not sure how it could technically be implemented, overriding the client config. I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday. |
|
Send message Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
Thanks Betting Slip for the thread :) Very nicely put. Hm, the points made by Zoltan are quite valid, I think. It's a risk if people have backup projects with long WUs. It could make sense if it were as simple as flipping a switch: when there are fewer WUs than GPUs, force 1 WU per GPU; otherwise allow 2. I don't know how feasible it would be to implement such a feature. Now that I think about it, on the short queue some people might even want to run 2 WUs per GPU if they indeed fit, so it would be better if this were enforced on specific queues, not the whole project. So the question becomes quantifying the damage. We could maybe do a short trial and see what the overall effect is. |
|
Send message Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0
|
Thanks Betting Slip for the thread :) Very nicely put. Your comments on the short queue are valid, and I would have put them in my original post had I remembered: don't implement this on the short queue. I was so engrossed in making my case clearly that it slipped my mind. I think a 4- to 6-week trial would give both you and us a very good idea of the impact on the project, which I personally think would be more positive than negative, and most users would adapt their habits if running backup projects. As I said above, I also think it will be positive when there is lots of work. Be brave and give it a go. :-) |
|
Send message Joined: 20 Apr 15 Posts: 285 Credit: 1,102,216,607 RAC: 0
|
May I put in one more remark: the short-run queue is intended for the slower cards, where it doesn't make sense to run 2 concurrent jobs anyway, whereas the fast Maxwell and Pascal cards need 2 tasks to get good utilization, otherwise they might be bottlenecked by the CPU. So if at all, I would apply the 1 job/GPU rule only to the short runs. As a positive side effect, the crunchers with high-end hardware would leave the short runs alone to get more GPU utilization by multi-tasking, which leaves more short runs to the crunchers who really need them. I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday. |
|
Send message Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
Sorry, since I am not very familiar with BOINC: what is considered a host, one machine or one GPU? And which of these options would it be? https://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimits |
|
Send message Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0
|
I would think it would be "Max WUs in progress per GPU". Host = 1 machine. I would ask/PM Richard Hazelgrove or Jacob Klein: https://gpugrid.net/show_user.php?userid=8048 |
|
Send message Joined: 20 Apr 15 Posts: 285 Credit: 1,102,216,607 RAC: 0
|
I am not exactly a BOINC expert either... but can't the BOINC client check the estimated GFLOPS capability of the GPU in use and send it to the server? https://boinc.berkeley.edu/dev/forum_thread.php?id=10716 If less than 2000 GFLOPS, I would send only one short run per GPU; otherwise, grant 1-2 long runs. If that is at all possible to configure. I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday. |
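The speed-based dispatch rule suggested above amounts to a simple branch on reported GPU speed. A sketch, purely for illustration: the 2000 GFLOPS cutoff is the one from the post, and the function name and return shape are invented here, not part of any BOINC API.

```python
# Hypothetical dispatch rule from the post above: gate the queue and
# the per-GPU limit on the GPU's estimated peak GFLOPS.

SHORT_QUEUE_CUTOFF_GFLOPS = 2000

def allowed_work(peak_gflops):
    """Return (queue name, max WUs per GPU) for a reported GPU speed."""
    if peak_gflops < SHORT_QUEUE_CUTOFF_GFLOPS:
        # Slow card: one short run at a time.
        return ("acemdshort", 1)
    # Fast card: up to two long runs, so the GPU stays busy.
    return ("acemdlong", 2)
```

Whether the stock scheduler exposes a hook for this kind of per-device decision is exactly the open question raised in the thread.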
|
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 318
|
I would think it would be " Max WU in progress GPU"

I'm not really an expert on practical server operations (just a volunteer user, like most people here), but I can find my way around well enough to point you to http://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimits as a starting point. You probably need to read through the next section - Job limits (advanced) - as well.

I'm guessing that because you operate with GPU limits already, there's probably a pre-existing config_aux.xml file on your server, which might be a better example to follow than the somewhat garbled one given as a template in the trac wiki. It appears that you can set limits both as totals for the project as a whole, and individually for each application (acemdlong / acemdshort).

As Betting Slip says, a 'host' in BOINC-speak is one computer in its entirety - any number of CPU cores plus any number of GPUs (possibly of multiple technologies). I would guess that '<per_proc/>' shown for the <gpu_limit> in the wiki should be interpreted as 'per GPU' (bearing in mind that some dual-circuit-board GPUs, like the Titan Z, will appear to BOINC as two distinct GPUs), but that's an area where the documentation seems a little hazy. If you can clarify the situation by experimentation, I think I have an account which allows me to edit for clarity. |
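Going by the Job limits section of the trac wiki linked above, a per-app limit might look roughly like the fragment below. This is a sketch only: the element names follow my reading of that wiki page, and anyone trying this should check them against the documentation and the project's existing config_aux.xml first.

```xml
<!-- Sketch of a job-limits fragment, per the BOINC trac wiki's
     Job limits section; verify element names before use. -->
<config>
  <max_jobs_in_progress>
    <app>
      <app_name>acemdlong</app_name>
      <gpu_limit>
        <jobs>1</jobs>
        <per_proc/>  <!-- presumably: limit applies per GPU -->
      </gpu_limit>
    </app>
    <app>
      <app_name>acemdshort</app_name>
      <gpu_limit>
        <jobs>1</jobs>
        <per_proc/>
      </gpu_limit>
    </app>
  </max_jobs_in_progress>
</config>
```

Setting the limit per application, as here, would also allow the short and long queues to be treated differently, as suggested earlier in the thread.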
|
Send message Joined: 21 Mar 16 Posts: 513 Credit: 4,673,458,277 RAC: 0
|
If less than 2000 GFLOPS I would send only one short run per GPU. Otherwise grant 1-2 long runs. IF that is at all possible to configure. That seems pretty in-depth; I'm not sure the configuration is that granular, but it's worth investigating. If not, I think the overarching 1 WU per GPU limit is the best route, at least for an experimental trial period. |
|
Send message Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0
|
Sorry for spelling your name wrong, Richard; I was going to go for the "s" but went with "z" instead. That does explain why I couldn't find you when doing a search. That should have given me a clue, but no... Thanks for helping. |
|
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 318
|
Sorry for spelling your name wrong Richard was going to go for the "s" but went with "z" instead. Does explane why I couldn't find you when doing a search, that should have given me a clue but no... LOL - no offence taken. I'm used to the 'z' spelling when I dictate my surname over the telephone, but I'm always intrigued when somebody who only knows me from the written word plays it back in the alternate form. I guess the human brain vocalises as you read, and remembers the sounds better than the glyphs when the time comes to use it again? I'm sure there's a PhD in some form of cognitive psychology in there for someone... |
|
Send message Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0
|
Will any server policy implemented include a fix for GPUs of multiple architectures running simultaneously? For months, the ACEMD (CUDA 8.0) Pascal app and the (CUDA 6.5) Kepler/Maxwell app have been unable to work together on the same host. |
|
Send message Joined: 21 Mar 16 Posts: 513 Credit: 4,673,458,277 RAC: 0
|
I am running Kepler and Maxwell together, so it must just be that they use different versions of CUDA; I don't expect this ever to be fixed. But I could be wrong. |
©2025 Universitat Pompeu Fabra