Cancelled by Server - Suggestion
Zydor (Joined: 8 Feb 09, Posts: 252, Credit: 1,309,451, RAC: 0)
This is going to be difficult to express without it being misread; the aim behind it is overall "job satisfaction" for those with slower cards, no more, no less. In reading it, place yourself in the situation of the lower-speed cards.

At present the server will reach out and cancel WUs already crunched by another host and no longer needed on that PC - a good thing, no problems with that, it's a win-win scenario. It does have a consequential drawback which, whilst not strong enough to negate the principle, has sufficient weight to merit consideration of a resolution. To quote an extreme: if an 8500/8600 is sent a WU and is matched with a 285/295 at the same time, it's no contest - the 285/295 will finish first every time. If the slower card is not yet running the WU, it will be cancelled, and that's fine. If it is running it, it will be allowed to complete and be given credit; that's also a good thing. In that latter case, however, in the extreme example, it means the slower card never contributes in a meaningful way, as its crunched WU is never needed.

Of course, in real-world terms such a scenario is nigh on impossible on every single occasion. However, also in real-world terms, it does happen on a significant number of occasions. I run a 9800GTX+ and after a recent cancellation had a look at completed WUs to see how many were "beaten to the punch" - there are quite a few, around 20% or so. That number will significantly increase the slower the card. Over time that will become discouraging for those with slower cards, as it dawns on them that much of what they crunch is of no value (and here keep an even keel - I am talking in terms of running against faster cards and being beaten to the finish post, no other inference implied). That there is a valuable and essential place for the slower cards there is no doubt whatsoever, clearly, so let's not go there ......
The recall system does produce the anomaly, however, and it would not surprise me to find many withdrawals from the Project - the overwhelming number of which just "disappear", with no reason or song and dance about it - because people feel there is no point when they get "beaten to the finish" by someone else. That can be very discouraging on a slower card, is happening a lot, and will be increasingly common with cards below a 9800GTX.

Is it possible server-side to test the cards used by the two selected crunchers, to try and ensure comparable cards? It should be easy to do, I don't think it would cost too many extra cycles, and it would create a level playing field with happier crunchers. This is not a race, nor should it ever develop into one; most have the common sense to realise that. By issuing to comparable cards, however, the slower card can contribute in a more meaningful way, with more overall chance of retaining those users and their additional computing power. The world will not come to a halt if this is not adopted, but it will be a happier place for those with lower-spec cards if it is.

Regards Zy
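Zydor's core suggestion - have the server draw both crunchers for a WU from the same speed class - could be sketched roughly as follows. This is purely illustrative: the speed-class names and data shapes are invented, and this is not how the GPUGRID scheduler actually works.

```python
# Hypothetical sketch of matching comparable cards: when a WU needs two
# crunchers, draw both from the same speed class so neither host is
# guaranteed to be "beaten to the punch" by a much faster card.

def pick_pair(hosts_by_class):
    """hosts_by_class maps a speed-class name to a list of idle hosts.

    Return a same-class pair if any class has at least two idle hosts,
    else None (a real scheduler would then fall back to mixed pairing).
    """
    for speed_class, hosts in hosts_by_class.items():
        if len(hosts) >= 2:
            return hosts[0], hosts[1]
    return None
```

For example, with `{"fast": ["GTX 285"], "mid": ["9800GTX+", "9800GT"]}` this would pair the two mid-range cards rather than setting the 285 against a 9800GT.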
X1900AIW (Joined: 12 Sep 08, Posts: 74, Credit: 23,566,124, RAC: 0)
> That will over time become discouraging for those with slower cards, as it dawns on them that much of what they crunch is of no value (and here keep an even keel - I am talking in terms of running against faster cards and being beaten to the finish post, no other inference implied).

I remember some months ago there was a hype about faster cards and more RAC, and a lot of members upgraded. Later on, low-credit-per-hour workunits were cancelled in favour of higher-credit workunits; side effect: those who didn't upgrade got the "worse" ones.

> This is not a race, nor should it ever develop into one, most have the common sense to realise that.

I fear that is the most attractive reason for the top participants, especially those who hide their computer information, not sharing configurations and thereby the knowledge to imitate crunching systems; they are fighting other wars, not for science but for credit ranks. That's o.k. If a project does not meet their claims it would be a big loss of crunching power. No credits, no game. I downclocked and undervolted my cards for a period to reach a better credit-per-watt ratio; indeed overclocking achieves good results, as before. The most important aspect for me is participating in a useful project (given), as well as getting the workunit finished without wasting my time (error diagnostics) and money (power consumption), or risking failure because of long runtimes (other circumstances).

> The world will not come to a halt, if this is not adopted, but it will be a happier place for those with lower spec cards if it is.

Full agreement - any card in the FAQ should be supported in equal measure. Otherwise the FAQ would have to be restricted to high-end hardware; I hope they do not do that. Every owner of a slow card today, supported well, is an owner of a fast card tomorrow. Don't watch the statistics if you get discouraged.
I have fallen back from the top 40 to the top 300 - a change of perspective: it's lucky for the project to have this horsepower, and honouring and communicating each little contribution is part of its mandate. In my opinion they react fast and according to the project's needs. (Greetings to all mods & scientists.)
Zydor (Joined: 8 Feb 09, Posts: 252, Credit: 1,309,451, RAC: 0)
> I fear that is the most attractive reason for the top participants

Absolutely - and it's a good thing; competition spurs on many people, and if you wish to view it in that manner, the above suggestion has even more weight, as it can also encourage "unofficial" competition once there is a level playing field.

> Don't watch statistics if you get discouraged.

I'm not discouraged in the slightest. Having crunched for nine years in various guises on various projects, I'm too long in the tooth to get distracted by such artificial parameters - I don't give a rat's fig about the credits yadda yadda, and it hardly affects me with a 9800GTX+. My whole logic was aimed at making life a better place for those with low-end cards. If you have one and your work is continually "beaten" to the finish line, there will come a point where you say "move on, I'm not contributing, as my efforts are not used".

Don't measure the suggestion in terms of who is "best" or who gets the most "credits" - in truth the vast majority, like me, don't care a fig about that either. The mindset to use with this is to think of the reaction of those crunchers who have low-end cards and want to genuinely contribute. If they get "beaten" each time (and that will be their perception), many will say "what's the point", no matter what esoteric explanation is deployed. The bottom line is that the majority crunch because they hope to be of value to the Project, and that can't happen if they see their efforts usurped on many occasions. We have to have two crunchers per WU most of the time, for good reason, and that's fine; in a level-playing-field scenario there is no issue, as any sensible human being is aware that only one of the "team" of two will be used - it's the nature of the beast.
The whole point of the suggestion is to make the two similar in capability, so that each low-end card user has an equal chance of having their efforts used, and therefore feels part of a team working to the same end. At present that is not the case on a significant number of occasions, when many results get zapped at the finish line by a faster card. Will it be perfect? Clearly not, that's life - however, a very significant improvement can be made by a small change server-side in the scheduler.

Regards Zy
MrS (Joined: 17 Aug 08, Posts: 2705, Credit: 1,311,122,549, RAC: 0)
Hi Zydor, you bring up a valid point and I think you communicate it well, so people should understand what you really mean.

Let me first describe how it works for SETI: there is a huge number of WUs which can all be done in parallel, independent of each other. It doesn't matter when they're returned (within a reasonable time frame). Therefore every contribution from slow cards / CPUs may not be the most energy-efficient, but it does help the project.

For GPU-Grid things are different. I may not be telling you anything new here, but I need this as a basis for further argumentation: it's a simulation in time, which is inherently sequential - timestep n+1 can only be computed after step n. Therefore, if there were only 1 WU, only one GPU could be used at any time. The project can work around this by issuing WUs in parallel. However, the number of WUs which can be put to good use in parallel is not quasi-infinite, as in the case of SETI: GPU-Grid needs results back to analyze them, and adapts accordingly to issue new WUs based on the old results. This is where the problems start, and where it may be of more value to the project to get results back faster than to start even more WUs in parallel.

It seems like the project has reached a state where they have enough GPUs, i.e. enough WUs in parallel, so they try to speed important WUs up by assigning them an initial replication >1. They already have a distinction between reliable and normal hosts; reliable ones are those which return WUs within xx hours with a failure rate of less than y %. So... how should the WUs be distributed?

* example 1: 1000 WUs in parallel, 1000 reliable hosts -> easy ;)
* example 2: 1000 WUs in parallel, 800 reliable hosts, 400 normal hosts -> 800 WUs for the reliable ones, 200 WUs with initial replication 2 for the rest
* example 3: 1000 WUs in parallel, 1000 reliable hosts, 400 normal hosts -> 600 WUs with a reliable host each, and 400 WUs with initial replication 2, each with one reliable host and one normal host
* example 4: 1000 WUs in parallel, 1000 fast hosts, 1000 slow hosts
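The arithmetic behind examples 1-3 can be written out as a small function. The function name and structure are mine, not anything in the actual BOINC server code; it only reproduces the counting in the examples above.

```python
# Illustrative sketch: split a batch of WUs between reliable hosts
# (initial replication 1) and replicated WUs that also involve normal
# hosts, as in MrS's examples.

def plan_distribution(n_wus, n_reliable, n_normal):
    """Return (single_replication_wus, double_replication_wus).

    If there are enough reliable hosts to cover every WU (example 3),
    each double-replication WU gets one reliable and one normal host.
    Otherwise (example 2), the leftover WUs are replicated across pairs
    of normal hosts.
    """
    if n_reliable >= n_wus:
        doubles = min(n_wus, n_normal)      # one reliable + one normal each
        singles = n_wus - doubles
    else:
        singles = n_reliable                # reliable hosts crunch alone
        doubles = min(n_wus - singles, n_normal // 2)
    return singles, doubles
```

Plugging in the numbers from the post: example 2 gives (800, 200) and example 3 gives (600, 400), matching the text.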
Scanning for our furry friends since Jan 2002
Zydor (Joined: 8 Feb 09, Posts: 252, Credit: 1,309,451, RAC: 0)
Guilty as charged on the card designations :) It was a quick remark as illustration; I should have checked the facts a little closer.

The dilemma is well understood; life is not perfect, never will be, and we all go with the flow on the best path available when all is factored in. I understand the drive at Project level to produce a scheduler solution that is the most efficient at producing the best return of WUs - it is, after all, why we are all here. I would only restate the effect of going down a path of "pure" efficiency: the latter can often be fool's gold, as the penalties suffered outweigh the solution enabled. In this case I have no idea of the actual hard facts, as clearly I don't have the whole project's stats. The scenario I painted is plausible, and I have seen similar effects elsewhere, where a drive for efficiency implemented in good faith ends up driving away those "excluded", albeit unintentionally. Competition for crunchers is getting fierce in the BOINC world, and anything that helps tweak reasons to stay with this Project can only be a good thing. As stated above, today's low-end card user is potentially tomorrow's mega-cruncher once they get hooked by it all.

There is a balance in all this which is always difficult to get right all the time. It would be an idea to "test & measure" numbers of lower-end cards over time and see if a trend of increasing non-activity develops; it shouldn't be too hard to frame a daily stats report and log it to map the trend. Meanwhile, if the scheduler can be tweaked to minimise unbalanced card pairing as much as is practically possible within the Project's objectives, that can only be a good thing.

Regards Zy
MrS (Joined: 17 Aug 08, Posts: 2705, Credit: 1,311,122,549, RAC: 0)
Do you have any specific ideas on how this could be implemented? After all, it's not only about "make the small guys feel good" (as important as that is), but also about "let's not waste their effort".

Besides my previous suggestion of different WUs, I could see something else: assume that the reliable hosts are saturated with WUs and there's still work left. In this case one could pair fast but unreliable hosts (with high error rates) with slow but reliable hosts. This would greatly increase the chances that the result of the slow cruncher will be needed.

MrS
Scanning for our furry friends since Jan 2002
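The pairing idea above could be sketched as follows; this is illustrative only (the grouping of hosts into "fast but unreliable" and "slow but reliable" is taken as given, and the function name is invented).

```python
# Illustrative sketch: once reliable hosts are saturated, pair each fast
# but unreliable host with a slow but reliable one on the same WU, so the
# slow host's result is the backup that is actually likely to be needed.

def pair_fast_with_reliable(fast_unreliable, slow_reliable):
    """Zip the two groups into WU partners; unmatched hosts are returned
    as leftovers to crunch unpaired WUs as before."""
    pairs = list(zip(fast_unreliable, slow_reliable))
    n = len(pairs)
    leftovers = fast_unreliable[n:] + slow_reliable[n:]
    return pairs, leftovers
```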
Zydor (Joined: 8 Feb 09, Posts: 252, Credit: 1,309,451, RAC: 0)
That made me think, rofl :) As a starter for ten, building on your idea in the post above, how about three categories, with suggested sub-categories, in the following priority order for allocating WUs:

1. Fast & reliable. One replication, match fast cards only, then allocate in priority order:
   A. Time-critical & content-critical project work
   B. Time-critical & content-critical special one-off or short-duration work

If units & suitable hosts remain:

2. Fast or slow, & reliable. Two replications, match card speed first, any speed second, then allocate in priority order:
   A. Balance of Category 1 units remaining
   B. Critical-content standard project work
   C. Critical-content special one-off or short-duration work

If suitable hosts remain:

3. Fast or slow, reliable or unreliable. Three replications (do not pass remaining Cat 2 units to this):
   A. Routine, non-time-critical, non-content-critical work
      (1) Match card speed
      (2) Any card speed

If units remain - wait for capacity, rinse and repeat.

Within that framework, define each type of WU in a flat table/array with column IDs:

- Time critical
- Standard project work
- Special one-off
- Short duration
- Routine, non-time-critical

From the flat table/array, allocate the types of WU to the overall categories above, depending on the Project priorities - I would see that as an "option" in an application accessible to Project admins to set/tweak the criteria for the next run of units. That way the only "maintenance" to the code, as such, is generating an allocation of WUs to definitions in the overall definition table; that's just a straight input session of a few seconds during each project definition and scoping exercise to enter the next row in the array - the rest should flow.

A bit flaky - I'm no programmer, rofl - but I reckon it points in the right direction, indicating the right priorities for the Project while balancing in the needs of cruncher speeds.
Anyone else out there with an idea of how to balance Project objectives as the primary goal while maximising slower-card usage satisfaction? Let's hear from you :)

Regards Zy
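The category scheme above is essentially a table-driven lookup, and could be rendered as data the scheduler iterates over. All names and strings here are invented for illustration; a real implementation would live in the BOINC server's own data structures.

```python
# Hypothetical rendering of the three-category allocation table as data.
# Each entry records which hosts may take the work and how many
# replications to issue; WU types are listed in priority order.
CATEGORIES = [
    {"hosts": "fast & reliable", "replication": 1,
     "wu_types": ["time+content critical project work",
                  "time+content critical one-off"]},
    {"hosts": "fast or slow, reliable", "replication": 2,
     "wu_types": ["category 1 overflow",
                  "critical-content standard work",
                  "critical-content one-off"]},
    {"hosts": "any", "replication": 3,
     "wu_types": ["routine"]},
]

def allocate(wu_type):
    """Return (category index, replication count) for a WU type,
    or None if the type is unknown."""
    for index, category in enumerate(CATEGORIES):
        if wu_type in category["wu_types"]:
            return index, category["replication"]
    return None
```

The admin-facing "option" Zydor describes would then amount to editing this table between runs, rather than changing scheduler code.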
Paul D. Buck (Joined: 9 Jun 08, Posts: 1050, Credit: 37,321,185, RAC: 0)
As an owner of what I consider one of the slower cards... though many would not consider it so... I would suggest that the distribution of longer vs. shorter tasks may be less balanced than it could be. Again, there are various ways of eating this elephant, but it would bother me little to be given fewer sub-6-hour tasks on the machines with GTX200-class GPUs. Or, to put it another way, I get a lot of tasks that MIGHT be more suitable for slower systems than the KASHIF tasks now flowing through the system.

Without gaming the question - and that would require more information as to the classes of machines - it is hard to know whether changes would make significant differences or not. I would argue that they would. Put it another way: if my i7 got more of the KASHIF tasks while the slower systems got more of the other tasks, our return intervals would converge... but this requires knowing the number of systems of the various classes and the population of the tasks to be processed. And their priority. If the shorter tasks are higher "priority" than the longer ones, that changes things significantly.

Other limitations apply. The "feeder" application that feeds tasks to the scheduler has a very limited size; my recollection is that its default size is 100 tasks. And the usual configuration is FIFO, so that task assignment is "random". You could put more smarts into the scheduler so that it selects more appropriate tasks... or change the system so that there are multiple schedulers with multiple queues and we sign up for the one with the best match...

Hard to know what the best choices are here... Maybe I need to go buy another faster card ...
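The feeder limitation Paul describes can be illustrated with a toy model. The 100-slot size is his recollection of the default, not a verified figure, and the task/host representations here are invented.

```python
# Toy model of the feeder bottleneck: the scheduler only ever sees a
# small FIFO window of the global work queue, so matching tasks to host
# classes can only happen within whatever ~100 tasks are in that window.
from collections import deque

FEEDER_SLOTS = 100  # recollected default size, assumption only

def refill_feeder(feeder, work_queue):
    """Top the feeder window up from the global queue in FIFO order."""
    while len(feeder) < FEEDER_SLOTS and work_queue:
        feeder.append(work_queue.popleft())

def pick_task(feeder, host_class):
    """Try to find a task tagged for this host class inside the window;
    otherwise hand out the head of the FIFO, which from the host's point
    of view is effectively random."""
    for i, task in enumerate(feeder):
        if task.get("class") == host_class:
            return feeder.pop(i)
    return feeder.pop(0) if feeder else None
```

If no suitable task happens to sit in the window, the host simply gets the next task in line - which is exactly why smarter matching would need either a bigger window or multiple queues.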
MrS (Joined: 17 Aug 08, Posts: 2705, Credit: 1,311,122,549, RAC: 0)
The major problem with sophisticated work distribution may be that it requires serious tweaking of the BOINC server software, which would need to be repeated upon every update (if it doesn't make it into the main code base). That's quite some work - I suppose considerably more than they have currently done with the "reliable host" flag. And there's more: as Paul said, the scheduler probably has a comparatively small number of WUs to choose from, and it can't decide or predict which hosts are going to contact it at what point in time. That's not a show-stopper, but I think it leads to the following: the more complicated and diverse you make the WU distribution, the fewer matches you're going to find. It might be better to opt for a simpler, more robust scheme. But then, admittedly, I didn't take the time to think your suggestion through properly..

> Maybe I need to go buy another faster card ...

.. which has absolutely nothing to do with the topic discussed here, does it? ;) Actually, I'm itching to replace the HDD in my notebook.. but I keep telling myself that, although the new one would be quieter, larger, faster and less power-consuming, it absolutely wouldn't change anything. So I'm trying to admire the 320 GB Scorpio Black without actually pulling the trigger :D

MrS
Scanning for our furry friends since Jan 2002
GDF (Joined: 14 Mar 07, Posts: 1958, Credit: 629,356, RAC: 0)
Hi, just to say that we are carefully following this thread. gdf
Zydor (Joined: 8 Feb 09, Posts: 252, Credit: 1,309,451, RAC: 0)
It's a little difficult to know which direction to go from here. I reckon a fair summary of all the above to date is:

- Overall speed of production from the community's overall crunching capacity is the top priority, as long as that is not at the expense of overall capacity (i.e. yes, we need to crunch them quickly, but there is little point achieving that if we lose a chunk out the back door because they feel "unwanted"). A hard balance to achieve, but the overall goal is there.

- The BOINC scheduler does give some issues, in that we are essentially "intervening" between it and the GPUGRID server. Not impossible to resolve if the server responded to the BOINC scheduler request with internal logic to select the WU, and passed it back to the scheduler for "delivery". I suspect there is some heresy there, rofl, but hey, I'm no programmer :)

- No strong yells of "over my dead body", so we could be reasonably close to a done deal, given some collective brainstorming on detail and some pragmatic decisions on what will be an imperfect solution. Let's go for the classic 80/20 rule, get it in, and massage it as time goes on. Perfection first time round is not going to happen; that's not the real world.

If this is gaining traction, maybe the next step is for someone better than me at pseudo-code to attempt a better rendition of my first crack at it above, and post it for collective comment?? A proper pseudo-code exercise can usually tease out good suggestions, as it is readable by us mere mortals.

Regards Zy
Paul D. Buck (Joined: 9 Jun 08, Posts: 1050, Credit: 37,321,185, RAC: 0)
Not knowing the internal goals or the sub-task goals, it is hard to know for sure... But I can easily see that there is going to be a dynamic tension between the speed of service (SoS) and the processing time. What I mean is this: assume that there are three task-length classes and three SoS objectives:

- Short run time - SoS: 1 day or less
- Medium run time - SoS: 2-3 days
- Long run time - SoS: deadline fine

So, in my case I have a spread of cards from GTX280 to 9800GT... run times average 5 to 20 hours on the faster card, and on my slow card 15 to 33 hours or thereabouts. Assuming, for the moment, that we can roughly class the tasks, I looked at the credit claim numbers and at the moment have 3 sets: 3681/3946, 7057/8076, and 4131/4352. Looking at a couple dozen of these tasks gave me their run times.

Now, if we contrast those run times with the SoS, we may find that the shortest-running tasks have a higher desired SoS, while the longer-running tasks may be in the "deadline fine" class. I mean, the question is: are we trying to average the run times, so that my 9800GT card would get the "shorter" tasks, which it completes in about 15 hours, while the GTX280 gets the longer tasks (about 16.5 hours)? If that is the case, then the objective would be to attempt to fit them by estimated run-time class. But the SoS objective may not be that neat and pretty: if the SoS is 1 day or less, then you would want to assign that short-run-time task to the faster machine regardless.

As to the scheduler and the project: um, there is no intervention between it and the GPU-Grid server - they are one and the same. The BOINC client tells the project's scheduler that it wants work, and it issues work out of the available pool. Sadly, this is some pretty bad code and one of the hardest parts to get changed.
The first issue is that the feeder has a limited collection of tasks, and if there is not a good choice in that selection there are two options: issue no work, or issue work that falls outside these new guidelines.

Not trying to be a nay-sayer here... but the first question is "Is there a real problem?" - or are we a solution looking for a problem?

I had better quit; I'm not typing well, and not thinking much better ...
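The tension Paul describes between evening out run times and meeting SoS targets can be made concrete with a toy decision rule. The hour limits and the host estimates are taken from his figures above; the rule itself is my invention, not anything the project does.

```python
# Toy illustration of the SoS tension: prefer the slow host (to even out
# run times across the fleet), but fall back to the fast host when the
# slow one would miss the task class's speed-of-service target.
SOS_LIMIT_HOURS = {"short": 24, "medium": 72, "long": None}  # None = deadline fine

def choose_host(task_class, est_hours_fast, est_hours_slow):
    """Return which host should get the task under this rule."""
    limit = SOS_LIMIT_HOURS[task_class]
    if limit is None or est_hours_slow <= limit:
        return "slow"
    if est_hours_fast <= limit:
        return "fast"
    return None  # neither host can meet the target
```

With Paul's numbers, a "short" task estimated at 33 h on the 9800GT but 5 h on the GTX280 must go to the fast card, while a "deadline fine" task can go to the slow one.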
MrS (Joined: 17 Aug 08, Posts: 2705, Credit: 1,311,122,549, RAC: 0)
> I mean the question is are we trying to average the run times so that my 9800GT card would get the "shorter" tasks which it completes in about 15 hours, while the GTX280 gets the longer tasks (about 16.5 hours)?

To briefly answer your question: I don't think so. Currently the runtimes are rather arbitrarily set by the standard "a 9800GT should do it in ~12h", or at least that's what was used last autumn. So if you see longer and shorter tasks, that's not intentional; they could all be of the same length, as far as the project is concerned. And since each WU features many steps (it was ~800k in former WUs), the runtime could be set almost arbitrarily. Well, the lower limit is the time per step.. ;) And to finish this long story: the quicker the WUs of a given size are returned, the better.

> Not trying to be a nay-sayer here... but the first question is "Is there a real problem?" - or are we a solution looking for a problem?

That's a very valid question. One could see it like this: as more new generations of GPUs are introduced, and if the old ones can still execute the future code, then the problem discussed here will only get worse, as GPU speeds will become even more diverse. The other side: due to the dynamic nature of GPU-Grid (i.e. the time-domain simulation with WUs depending on each other), it will always benefit most from the fastest and newest GPUs. If, at some future point, there are other attractive CUDA projects available for slower cards.. what's the project going to do?

MrS
Scanning for our furry friends since Jan 2002
Steve (Joined: 4 Apr 09, Posts: 450, Credit: 539,316,349, RAC: 0)
Without knowing how it is implemented from the server perspective... if we take a look at how WCG has multiple projects, perhaps that basic concept could be reworked for different classes of WUs (priority, 200-series cards only, tight turnarounds (for compute-error WUs), regular, best suited to low-end cards... you get the idea).

Let's say I have a 295. I am *asked* to sign up for specific WU types on the website (yes, this still leaves me in control, so no Big Brother concerns)... basic information could be provided explaining what each *WU type* is best suited for... only if buffer < 24 hours, etc. Now, for coordinating this through the scheduler... maybe just an extra <tag> on each WU, to see if there is a match to the incoming client request (what the client registered for on the website). I could also select a "give me anything the project needs me to do" type of WU, so that if all the short-turnaround tasks, or the tasks only suited to 200-series cards, have been sent out, then by all means send me a lower-priority WU.

This would reduce the necessarily inefficient process by which WUs are allowed to complete even though the project already has a valid return, while also reducing the implied perception of "my GPU is less useful than yours, so I'm gonna take my ball and go someplace else" :-). Wow... in fact, if I am concerned about having my WU returned quickly (so that someone else's copy does not even start processing), I would turn my buffer way down low, which I believe the project would really appreciate.

Thanks - Steve
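Steve's opt-in matching could look something like the sketch below. The type names and the "any" fallback are invented for illustration; the real mechanism would be a tag carried in the scheduler request, not Python dictionaries.

```python
# Illustrative sketch of opt-in WU types: the work request carries the
# set of WU types the user registered for on the website, and the
# scheduler hands out the first queued WU whose type tag matches.
# Registering for "any" matches everything.

def match_wu(registered_types, queue):
    """Return (and remove from the queue) the first matching WU,
    or None if nothing in the queue suits this host."""
    for wu in queue:
        if "any" in registered_types or wu["type"] in registered_types:
            queue.remove(wu)
            return wu
    return None
```

A host registered only for "regular" work would skip over a queued "priority" WU, while a host registered for "any" takes whatever is at the front - which is exactly the fallback behaviour Steve describes.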
MrS (Joined: 17 Aug 08, Posts: 2705, Credit: 1,311,122,549, RAC: 0)
After thinking about it for some time, I came up with a suggestion which I actually like :)

Problems I'm trying to solve:
- users with slower GPUs may feel their contribution is not worthy, and get beaten by faster cards
- slower GPUs have problems meeting the deadline
- very fast GPUs (e.g. if GT300 is the beast it's rumored to be) may reach crunching times of 1-3 h/WU in a few months, which is not desirable
- the short turnaround times and small cache settings cause trouble for some users

My inspiration: at Rosetta@home you can set how long a WU should run. As far as I understood, they're doing Monte Carlo runs and each WU contains several of them, so it's easy to declare a WU finished after an arbitrary number of completed runs.

Transferring this idea to GPU-Grid: we can't adapt it directly, as our WUs have time steps which depend on each other; they're not independent runs. However, I think the number of steps in each WU can be set to arbitrary numbers, i.e. the project chooses a number of steps which leads to ~12h of computation on an 8800GT. I suggest making this number of steps flexible, and instead setting the crunching time per WU. We introduce a new user preference, "preferred run time". Let's try to keep it simple and sturdy, so we don't allow arbitrary numbers, but instead give 3 options to choose from:
- short: 4-6 h/WU, whatever the server can handle
- standard: 10-12 h (default setting)
- long: ~24 h

How it could look, initial replication 1: host A requests work and the server decides to send WU 1. At this point the runtime of WU 1 is set to the preferred setting of host A; the deadline may be adjusted accordingly. Host A crunches 10 steps within this time. The WU is finished and sent back, and a new one is generated based upon this result.

Advantages:
- apart from the adjustments for this flexible WU generation, nothing changes server-side
- hosts get more freedom: especially slow ones, and hosts which don't crunch 24/7, would benefit from the shorter run times
- users with limited internet access may prefer the longer WUs
- users with limited upload may prefer the longer WUs, if the output file size does not depend on the number of steps (not sure here)
- users who don't have more cores than GPUs could reduce their downtime / overhead by choosing longer WUs
- a short turnaround time on slower cards means better load balancing on the server side: WUs which don't progress as fast get more chances to be sent to fast and reliable hosts (if the server knows which hosts are fast)

Drawbacks:
- none that I can see.. apart from the necessary modifications

How it could look, initial replication >1: host A requests work and the server decides to send WU 2. At this point the runtime of WU 2 is set to the preferred setting of host A, and the deadline may be adjusted accordingly. Now WU 2 gets top priority to be sent out to the other hosts. The next work request comes from host B. He's got the same preferred runtime and also gets WU 2; everything's fine.

Assume host B doesn't turn up, and instead it's host C with a different preferred runtime. Now a compromise has to be made:
1.) Send WU 2 to host C anyway, with the runtime setting of host A. This overrides host C's setting, something the user may not like.
2.) Wait until a host B with the matching runtime turns up. However, if one waits too long, host B will not be able to return WU 2 within the same time frame as host A, and we get essentially the problem Zydor is trying to avoid, just independent of GPU speed.

That's why it's important to keep the number of possible runtimes small. This problem could be avoided if there were only one runtime for everyone.

Assume host B got our WU 2 after 10 minutes. Now host A returns his 10 steps of WU 2 after his preferred runtime. The scheduler could then generate a new WU immediately, based on these results. This would not be very clever, though: host B could be much faster and return 20 steps 10 minutes later.
Some tolerance time would have to be set here: how long does one want to wait for the other hosts to return more steps? So in this case things get a little complicated, but not terrible yet.

Return results immediately: it would be ideal if GPU-Grid hosts returned their results immediately (an old cc_config option), so that the maximum waiting time for the results of our host B could be kept small and the overall WU processing speed could be increased. Actually, it would even be beneficial now if the BOINC client could be told by the server "I want you to report results immediately, but only for my project".

Advantages:
- some result will be available after the preferred runtime, regardless of host speeds (assuming not all of them error out ;)
- the best result could be chosen after a (hopefully) short tolerance time

Drawbacks:
- it gets messy if too many "preferred runtimes" are allowed, depending on the actual WU request rates
- it gets ugly if BOINC waits a large, unknown amount of time before it contacts the scheduler and reports finished results
- credits would differ for each run, depending on the number of steps done, even if "a WU" is run by different hosts. I don't think BOINC allows this. Rosetta can get around it because every "WU" is only ever sent to 1 host; internally, the server collects all results belonging to the same problem, which are distributed among different runs contained in many different WUs.
-> We could also generate a "new WU" for each new computation. If a WU is supposed to be sent to several hosts, we get a branch in the work stream / flow of the WU, which is joined again after the results are collected. Not sure.. is this understandable? It would complicate debugging, though.

How these WUs should be distributed: this is actually independent of what I suggest.
Reliable hosts would still be fine with an initial replication of 1, and not much would change apart from the added flexibility for the user and improved overall balance (similar runtimes and cache settings regardless of GPU speed).

If WUs are sent out to several hosts, the same problem which Zydor initially pointed out appears in a different shape: if you pair slow and fast GPUs and both are successful, the result of the slower GPU won't be used, as it contains less work. However, if the runtimes of the slow cards can be kept in check, it would be "less painful" to pair two slow cards instead of fast-slow. We'd probably still get less work done, but only over a limited, controlled time. It's easier to spread this evenly among all WUs, and thus some speed could be traded for throughput in a controlled manner.

Comments, foul eggs, flowers anyone?

MrS Scanning for our furry friends since Jan 2002 |
ZydorSend message Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]()
|
You been doing that subliminal stuff again :) I don't have the competence re the internal workings of the WU to give a validity opinion on the overall principle you gave - seems sensible to me, and I can see the benefit if the basic underlying predication re splitting by time steps is fundamentally practical.

Whether it's all predicated by GPU Class or Time Step, at some stage, as you pointed out, we inevitably get to the point of matching users, which is where the fast-slow issues come in. I can see that time step is the much better of the two (GPU Class / Time Step) scenarios - given my caveat above - and has benefits beyond mere card matching. Having got to the stage of deciding which principle to follow - GPU Class or Time Steps, or indeed any other principle that may come along - the ultimate gotcha will always raise its head re matching fast-slow. Albeit Time Steps look much better in that regard.

The next bit may sound a little "sledgehammer to crack a nut" - and decidedly non-tech ...... Whichever principle is chosen, only show the cruncher their own WU result, not everyone's, in the WU result page. What we don't know won't hurt us. The Project will have gone through hoops and loops to be as fair and as accommodating as it can possibly be to the slower class cards, encouraging their participation. It's no biggie not to show the matched result. That way the cruncher will not know how many times their result was dumped. Since the Project will have done its best to avoid cards being "useless" because they were not used in the final outturns, such an arrangement could be used with a clear, guilt-free mindset.

I recognise there are benefits in seeing all participants on screen for a WU, however for the cruncher, at the end of the day it's pure aesthetics - it doesn't contribute one way or another to successful WU crunching. Such a screen - or similar - may be needed by admins et al, but that's no issue.

The Results Of The Zargon Jury? Flowers :)

Regards
Zy |
Paul D. BuckSend message Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
My mind is slipping over the surface. But I think the solution is simpler than you imagine ...

The application does time steps, and we know the amount of elapsed time. If we allow the user the discretion, as at Rosetta, they can set a time limit on how long they want to run individual tasks. The task is downloaded... it is run until it has completed the number of iterations that will fill up the amount of time the participant selected. The task is ended at whatever arbitrary time step the task is on when the clock expires. The task is returned, and the next task is issued based on the amount of work done in unit time.

The point here is that I can say run for 6 hours and at the end of that time I get a new task. Let us say that I completed 100 TS; well, the 9800GT in that same 6 hours would have only completed 30-32 TS ...

Obviously this complicates the work generator, credit awarding, etc. But it allows participants greater control over the work size, and those of us that do not like tasks that take more than about 6 hours would probably be happier... |
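The run-until-the-clock-expires idea reduces to a simple loop: iterate time steps until the user's wall-clock budget is spent, then report however many steps finished. A minimal sketch, with per-step times chosen only to illustrate the fast-card/slow-card gap Paul describes:

```python
# Sketch of Rosetta-style time-limited crunching: the card does as many
# whole time steps (TS) as fit in the user's chosen budget, then stops.

def crunch(budget_s: int, s_per_step: int) -> int:
    elapsed, steps = 0, 0
    while elapsed + s_per_step <= budget_s:
        elapsed += s_per_step  # simulate one completed time step
        steps += 1
    return steps  # task ends at whatever arbitrary step the clock hit

# Same 6-hour budget, very different per-step speeds (illustrative numbers):
fast = crunch(budget_s=21600, s_per_step=216)  # ~GTX 285 class card
slow = crunch(budget_s=21600, s_per_step=720)  # ~9800GT class card
print(fast, slow)  # prints "100 30"
```

Every card returns on the same schedule; only the step count varies, which is what shifts the complexity onto the work generator and credit award.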
|
Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() |
Paul, this is the point where I started thinking :) There's one important problem with this approach: WUs issued to several cards. Let's assume a slow card returns 10 steps after 4h, whereas a fast card might return 100 steps after 6h. How long are you going to wait? You could estimate the runtime from the user's preference, but that's not very direct. I'm trying to make things easier to predict and more regular. Not sure how necessary it is, but I really wouldn't want slow cards to "outrun" fast ones just because they set a shorter runtime (and the server decided to use their result for the next WU instead of waiting for the other one).

Zydor, the first paragraph of Germany's constitution says "The dignity of man is untouchable". Earlier I didn't understand this: "why, it's being trodden on all the time!" I think a couple of years ago I finally understood (or started to?). It's a normative clause... and such a strong one that it actually is untouchable. Gives me a shudder every time I really think of it. And that's why we couldn't do what you suggest :) It may be of greater immediate benefit, and maybe it could be "justified" in that the project did everything it can for the slow cards. Yet... that's not enough. I think we owe the participants the honesty to show them what happens with their crunching efforts. Anything else is unthinkable ;)

MrS Scanning for our furry friends since Jan 2002 |
|
Send message Joined: 21 Oct 08 Posts: 144 Credit: 2,973,555 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]()
|
this is the point where I started thinking :)

Maybe I am missing something here, but wouldn't it be fairly straightforward to have the server issue the remaining work as a shorter workunit? That is, assume Paul's machine completes 100 TS in 6 hours and is paired with a 9800GT that only completes 30 TS in that time, with both returning results at about the same time. The server could then reissue a follow-up workunit made up of the 70 TS difference, which could be sent to a third card (say another 9800GT, but with a 13 hour limit). If the third card were another 9800GT with a 6 hour limit, then once it returned, another reduced 40 TS unit could be issued, and so on... in other words, the real new work would not be generated until the full set of TS in the original work was completed by additional cards. This could probably be made more efficient by always issuing the "A" result to a reliable host, with the "B" and beyond work copies going to any host. |
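The follow-up-workunit idea above amounts to a remainder chain: subtract each return from the outstanding time steps and reissue the difference until nothing is left. A rough sketch with the same illustrative numbers (100 TS total, 30 TS per 9800GT run):

```python
# Sketch of reissuing the uncompleted remainder of a WU as ever-shorter
# follow-up workunits until the full set of time steps is covered.

def reissue_chain(total_ts: int, returns: list) -> list:
    """returns: TS completed by each successive host; yields the remainder
    that would be packaged into the next follow-up WU after each return."""
    remainders, remaining = [], total_ts
    for done in returns:
        remaining -= min(done, remaining)  # a host can't over-complete
        remainders.append(remaining)
        if remaining == 0:
            break  # full TS set done: real new work can now be generated
    return remainders

# 9800GT does 30 of 100 -> reissue 70; next 9800GT does 30 -> reissue 40; done.
print(reissue_chain(100, [30, 30, 40]))  # prints "[70, 40, 0]"
```

Each follow-up WU shrinks, so even a chain of slow cards converges on the original work size in a bounded number of rounds.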
Paul D. BuckSend message Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
Um, I knew I was not clear ... I was thinking of GPU-Grid following more in the line of MW, where we have single issues and single returns. When you get into HR then you have to have identical returns or you cannot compare. So, I was thinking more in line with:

..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|

where we have 3 hour "frames", if you will ... The next frame is built on the return from the prior frame. Here, we start with the 9800 and get 30 TS, next a 260 returns 100 to TS 130, an 8800 returns 10 to 140 ... and so on ... so each client runs as best it can, and we accumulate the results on a more regular schedule, but the actual work accomplished becomes highly variable.

The problem is that I do not know how much that biases the science and how it is being used. If they are looking at snapshots at certain specific TS, then the returns are streamed and re-sliced for the science to be done. The advantage of a "flow" system such as this is that the totality of the schedule to hit certain points would become more predictable on average (I would expect): the random assignment of tasks means that slower cards would cause "bumps" in the timing, but overall the odds say that the next machine is just as likely to be faster than the one currently running the task ...

The only reason that I like shorter tasks is that my risk of loss goes down. I know credit does not matter, but the science does. If I am doing one hour tasks, I do put 6 times the load on the server to get tasks, but the output files are 1/6 the size and my risk of losing 5 hours of science goes way down. Again, I am unusual in that I have mostly higher end cards, so my run times are about 6 hours each ... with the single exception of the 9800GT, where the run time is 12-20 hours at the higher end ...

Again, the Rosetta model comes to mind ... it lets ME choose how much time I want to spend on each task.
How much risk *I* want to take with a task failure costing me and the project time and effort. I don't know if this is any clearer than what I said before ... and I am not sure we are converging on a concept yet ... sadly too much excitement with dying tasks causes me to wig out ... and I cannot concentrate well ... |
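The frame flow Paul describes above can be sketched in a few lines: each fixed-length frame is run by whichever card picks it up, and its time steps are added to a running total, so progress is regular in wall-clock time but variable in work. The per-frame numbers are just his example:

```python
# Sketch of the fixed-frame accumulation: one return per 3-hour frame,
# each frame building on the prior frame's cumulative TS position.
from itertools import accumulate

frame_returns = [30, 100, 10]  # 9800 -> GTX 260 -> 8800, one card per frame
positions = list(accumulate(frame_returns))
print(positions)  # prints "[30, 130, 140]" -- TS 30, then 130, then 140
```

The cumulative positions match the progression in the text (30, 130, 140), while the variance in per-frame work is exactly what makes the total accomplished work unpredictable.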
©2025 Universitat Pompeu Fabra