Cancelled by Server - Suggestion

Message boards : Number crunching : Cancelled by Server - Suggestion

Previous · 1 · 2 · 3 · Next

ExtraTerrestrial Apes
Volunteer moderator · Volunteer tester
Message 9902 - Posted: 17 May 2009, 11:47:16 UTC
Last modified: 17 May 2009, 11:48:00 UTC

Paul,

for me you've been clear enough. If you go for a stream with one host per WU, your system is fine and appreciably simpler than what I wrote down. However, what are you going to do if you send a WU to a host which has been notoriously unreliable? Are you going to wait for the error, or do you want to send it out to a second host immediately? Could you provide this capability and come up with something simpler than mine? I don't think it would be beneficial for the project to give up the ability to issue WUs multiple times.
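To make the branching concrete, here is a rough sketch of the decision (purely hypothetical; the function name, the reliability inputs and the thresholds are mine for illustration, not anything the BOINC scheduler actually exposes):

```python
# Hypothetical sketch: issue one copy of a WU, or immediately pair an
# unreliable host with a backup host. Thresholds are illustrative.
def initial_replication(error_rate: float, on_time_rate: float) -> int:
    """Copies of a WU to issue up front for this host."""
    reliable = error_rate < 0.05 and on_time_rate > 0.95
    return 1 if reliable else 2  # 2 = send to a second host immediately
```

The point is only that the server would need some per-host reliability signal to branch on; with one host per WU and no such branch, the ability to reissue is lost.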

Scott,

well.. yes. I think that could work. I could argue that calculating the WU runtime the way you suggest neglects the individual work cache setting, and thus you cannot reliably estimate when a WU will be returned (to adjust the time for the following hosts accordingly). However, I have to admit that this limitation applies to my suggestion in the same way.

Therefore our approaches are similar regarding their result:

- both should get WUs returned around the same time
- both suffer from an intrinsic inaccuracy due to the cache
- in both cases you'd want to wait until all results are in before issuing the next work.. or estimate, based on card speed, which result will likely contain the most steps, or just set some tolerance time
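The cache-induced inaccuracy we both run into can be seen in a toy estimate (illustrative numbers and names; neither scheme knows the host's real cache setting, which is exactly the problem):

```python
def estimated_return_s(wu_flops: float, host_flops_per_s: float,
                       cache_days: float) -> float:
    """Best case: pure crunch time. Worst case: the WU also waits
    behind a full work cache before it even starts."""
    crunch_s = wu_flops / host_flops_per_s
    return crunch_s + cache_days * 86400.0

# Same card, same WU - only the (unknown) cache setting differs:
fast = estimated_return_s(1e15, 5e10, cache_days=0.1)
slow = estimated_return_s(1e15, 5e10, cache_days=4.0)
```

With a multi-day cache the queueing delay dwarfs the crunch time itself, so any server-side runtime estimate carries that intrinsic error.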

I can't help it.. today it's got a slightly sour taste (for me).

MrS

BTW: thanks for the flowers ;)
Scanning for our furry friends since Jan 2002
Paul D. Buck
Message 9913 - Posted: 17 May 2009, 15:42:14 UTC - in response to Message 9902.  

for me you've been clear enough. If you go for a stream with one host per WU, your system is fine and appreciably simpler than what I wrote down. However, what are you going to do if you send a WU to a host which has been notoriously unreliable? Are you going to wait for the error, or do you want to send it out to a second host immediately? Could you provide this capability and come up with something simpler than mine? I don't think it would be beneficial for the project to give up the ability to issue WUs multiple times.

And now you see why systems engineers get the big bucks ...

Here is the real rub in all this: how do you handle situations in the face of unreliability? One simple answer is that you send it out redundantly and match the answers up. The problem now is that we no longer have tasks of one and only one size. So we cannot match them up to do HR (homogeneous redundancy).

One answer is that we keep more control when issuing to unreliable hosts, and pair them up with reliable hosts; but then we also have to self-limit the task on the matching reliable host. So, if we know the unreliable host is going to do 30 TS in the one-hour period, then when we send the task to the reliable host we would know to limit it to that number of TS.
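As a sketch (the 30 TS/hour figure is from the example above; the function name is mine, not anything in the scheduler):

```python
def matched_ts_limit(unreliable_ts_per_hour: float,
                     window_h: float = 1.0) -> int:
    """Cap the reliable host's task at what its unreliable partner
    can complete in the same window, so the two results match up."""
    return int(unreliable_ts_per_hour * window_h)
```

The cost is obvious: the reliable host is deliberately held back to the pace of the slowest member of the pair.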

Which is why I was asking back in the beginning... are we a solution in search of a problem? Only the project types can tell us that.

As I said, the 6-hour tasks are about at the edge of my comfort as far as running tasks goes. I have seen too many hours of work lost for one reason or another, and I hate waste. Which is also why I would question the current length of tasks... in particular because it self-limits the participant pool. Now, if the work is getting done fast enough with the limited pool we currently have, then cool, and adding more low-end machines won't buy much (well, some goodwill, not to be sneered at ...) ...
Scott Brown
Message 9927 - Posted: 17 May 2009, 20:36:41 UTC - in response to Message 9913.  


Which is why I was asking back in the beginning... are we a solution in search of a problem? Only the project types can tell us that.


Agreed...to a point. I think that many of the posters in this thread have been around BOINC for several years, so we have some fairly good ideas about what those problems are...especially you, Paul. On that side of things, we are able to offer real solutions. But you are very much right that without a more detailed understanding of how workunits are constructed at this project, we can at best offer potential solutions that, if we are lucky, might hit upon a real solution...


Paul D. Buck
Message 9930 - Posted: 17 May 2009, 21:23:37 UTC - in response to Message 9927.  


Which is why I was asking back in the beginning... are we a solution in search of a problem? Only the project types can tell us that.


Agreed...to a point. I think that many of the posters in the thread have been around BOINC for several years, so we have some fairly good ideas about what those problems are...especially you Paul. On that side of things, we are able to offer real solutions. But you are very much right that without more detailed understanding of the construction of workunits at this project, we can at best offer potential solutions that, if we are lucky, might hit upon a real solution...

Thank you for the kind words... :)

The difficulty, and it is the only one, is how we can best help the project ... there are some good concepts here, but until GDF or someone else from the project says that this or that initial concept might help, we are about as far along as we can get ...

So, I do concur that we need that input if nothing else to tell us that there is no need ... :)

Though I must admit I would really like to see shorter tasks so that a 9800GT can get them done in less than a full day ...
popandbob
Message 9936 - Posted: 18 May 2009, 7:47:48 UTC - in response to Message 9930.  

Though I must admit I would really like to see shorter tasks so that a 9800GT can get them done in less than a full day ...


If they shortened the tasks it would be a win-win situation.
It would either
A) allow slower GPUs to run GPUGRID,
or
B) allow shorter deadlines and faster turnaround, so work units could be sent out again sooner when a host doesn't reply, reducing the need for replication.

The downside is more server load...

This is a tricky topic. The project needs fast turnaround due to the nature of the work, so allowing slower cards to run GPUGRID would slow things down at times of low work; but at times of high work, the more the better.

Perhaps the best solution would be setting up two projects under one roof (as with SETI multi-beam and Astropulse) and having one for slow GPUs, one for fast. Default settings would have the slow-GPU option selected, and users could select the fast-GPU option for more credits and longer tasks.

Bob
Zydor
Message 9949 - Posted: 18 May 2009, 19:54:15 UTC - in response to Message 9936.  
Last modified: 18 May 2009, 19:54:35 UTC

Perhaps the best solution could be setting up 2 projects under 1 roof (ie. seti multi-beam and astropulse) and have one for slow gpus, one for fast. Default settings would have the slow GPU selected and users could select fast GPU for more credits and longer tasks.


I like the dual-WU idea, if the project dev resource can cope. We would need to be careful re credits: they need to be awarded on the "standard" BOINC formula basis, i.e. an equal number of credits per FLOP, whatever X multiple is used in the BOINC formula. The faster cards will get more over a given time period, which is fine as they do more work. The base calculation rate should, however, be the same for each FLOP donated, else we'll have a two-tier credit war on our hands.
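On the classic cobblestone scale (one cobblestone is defined as 1/200 of a day on a 1 GFLOPS reference machine), "equal credit per FLOP" just means credit is a linear function of FLOPs done, so a card twice as fast earns exactly twice as much over the same period. A rough sketch of that invariant:

```python
GFLOPS_DAY = 1e9 * 86400  # FLOPs done by a 1 GFLOPS card in one day

def cobblestones(flops_done: float) -> float:
    """Classic BOINC scale: 200 credits per GFLOPS-day, i.e. the same
    credit per FLOP regardless of which card did the work."""
    return 200.0 * flops_done / GFLOPS_DAY
```

Any per-WU bonus multiplier breaks this linearity, which is exactly the two-tier situation to be wary of.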

Regards
Zy
Scott Brown
Message 9950 - Posted: 18 May 2009, 19:58:36 UTC - in response to Message 9936.  


Perhaps the best solution could be setting up 2 projects under 1 roof (ie. seti multi-beam and astropulse) and have one for slow gpus, one for fast. Default settings would have the slow GPU selected and users could select fast GPU for more credits and longer tasks.


A more efficient approach would be to follow the PrimeGrid model, where shorter and longer types of work can be selected within the same project... However, GDF has said elsewhere in the forum (sorry, can't find the link just now) that it is not possible to divide up the work this way.


popandbob
Message 9959 - Posted: 19 May 2009, 0:04:41 UTC - in response to Message 9950.  

A more efficient approach would be to follow the PrimeGrid model where shorter and longer types of work can be selected within the same project...However, GDF has said elsewhere in the forum (sorry can't find the link just now) that it is not possible to divide up the work this way.



That's exactly what I was referring to... different sub-projects.

How can it not be separated like that?
There are several different types of work going on right now, so why could it not be set up so that (for example) 79-KASHIF_HIVPR runs on slower cards with longer deadlines and p780000-RAUL on the faster cards with shorter deadlines?

As an added bonus, this could help with issues such as the current problems with G90 GPUs...

We would need to be careful re credits: they need to be awarded on the "standard" BOINC formula basis, i.e. an equal number of credits per FLOP, whatever X multiple is used in the BOINC formula


I was thinking more of a short-deadline bonus for granted credit. The idea is that people would need encouragement to select the longer WUs with shorter deadlines. If we had GTX 295s running work units meant for slow cards, we'd be right back to square one.

Bob
Scott Brown
Message 9966 - Posted: 19 May 2009, 13:17:28 UTC - in response to Message 9959.  


That's exactly what I was referring to... different sub-projects.

How can it not be separated like that?
There are several different types of work going on right now so why could it not be set so that (for example) 79-KASHIF_HIVPR is run on slower cards with longer deadlines and p780000-RAUL on the faster cards with shorter deadlines.

As an added bonus this could help with issues such as the current problems with G90 GPU's...



Many projects have set up separate projects (i.e., different stats, website, servers, etc.) for different sub-projects such as Beta versions or, in the case of Milkyway@home, separate CPU and GPU projects--your original post sounded more like this... sorry if I misread it.

Yes, there are numerous different types of workunits; in my few months here, I have seen no fewer than a dozen workunit types (maybe more). I think this fairly rapid change in workunit types is what prevents the separate sub-project setup.

ExtraTerrestrial Apes
Volunteer moderator · Volunteer tester
Message 10026 - Posted: 21 May 2009, 11:09:53 UTC - in response to Message 9913.  
Last modified: 21 May 2009, 11:10:23 UTC

Paul wrote:
One answer is that we keep more control when issuing to unreliable hosts, and pair them up with reliable hosts; but then we also have to self-limit the task on the matching reliable host. So, if we know the unreliable host is going to do 30 TS in the one-hour period, then when we send the task to the reliable host we would know to limit it to that number of TS.


You cannot know exactly how many TS a card will complete within a given time, unless you set the number of TS to begin with.. which is the current scheme. The problem is that you cannot know what the user is doing with the PC: how much crunching time is lost to gaming, watching videos, moving windows in Aero, or "don't use GPUs while PC is in use"?

That's why I opted for "let's level the playing field: give them the same desired runtime and send the WUs out at the same time". Not sure if there's anything more we could do to keep runtimes under control, except for the suggestion made by Scott (which would be a minor correction if WUs can be sent out at the same time.. otherwise a huge correction).

Which is why I was asking back in the beginning... are we a solution in search of a problem? Only the project types can tell us that.


Yeah, I agree. I think we've reached the point where we've brainstormed enough ideas. So what we'd need to continue is some feedback:
    - what's possible?
    - what's necessary?
    - what's desired?


I guess the project staff have their hands full debugging the recent problems.. but GDF already said they're watching this thread carefully. So let's give our tortured brains a break ;)

Scott wrote:


A more efficient approach would be to follow the PrimeGrid model where shorter and longer types of work can be selected within the same project...However, GDF has said elsewhere in the forum (sorry can't find the link just now) that it is not possible to divide up the work this way.


That sounds straightforward to set up: offer WUs of normal length, and maybe 1/2 and 2 times that length. Give the user a preference in the account settings. Adjust these lengths as needed as faster cards emerge and to balance server load.

The main problem I see here is that it would split up the pool of WUs. If too many WUs are generated for one of the runtimes, those will lag behind, whereas other runtimes may even run dry. If this idea were adopted, it would require changes in the server software to dynamically adjust the runtime / number of steps within one "set of simulations" whenever new WUs are created. Otherwise load balancing would suck.
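One way the server could rebalance dynamically is to split each batch of newly created WUs across the runtime classes in proportion to recent demand. A sketch under that assumption (the function and its inputs are hypothetical; nothing like this exists in the current server code as far as I know):

```python
def split_new_wus(recent_requests, batch):
    """Allocate a batch of new WUs proportionally to recent host
    demand per runtime class, so no class lags behind or runs dry.
    recent_requests: {class_name: request_count}."""
    total = sum(recent_requests.values())
    if total == 0:
        # no demand signal yet: spread the batch evenly
        n = len(recent_requests)
        return {k: batch // n for k in recent_requests}
    return {k: batch * v // total for k, v in recent_requests.items()}
```

Even something this crude would track users drifting between runtime preferences, which is the failure mode described above.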

MrS
Scanning for our furry friends since Jan 2002
popandbob
Message 10044 - Posted: 21 May 2009, 19:17:54 UTC - in response to Message 10026.  

The main problem I see here is that it would split up the pool of WUs. If too many WUs are generated for one of the runtimes, those will lag behind, whereas other runtimes may even run dry. If this idea were adopted, it would require changes in the server software to dynamically adjust the runtime / number of steps within one "set of simulations" whenever new WUs are created. Otherwise load balancing would suck.

MrS


I wouldn't think that much work is needed.
If I understand the current system correctly (each simulation has a task created that runs for x steps; once it is returned, another task is created to continue that simulation for another x steps, and several simulations run in parallel), then each simulation could be given a specific task length. Sure, the simulations would finish at different rates, but does that matter?
As for runtimes running dry: that happens already without this, so what would be the difference? I'm sure there is plenty of work to go around, so more work could be added where needed.
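My (possibly wrong) reading of the current system, in code form: each returned task simply spawns the next chunk of the same simulation, with the chunk size x set per simulation. The names here are mine for illustration:

```python
def next_chunk(total_steps, steps_done, x):
    """Step range (start, end) for the follow-up task of a simulation,
    or None when the simulation is finished."""
    if steps_done >= total_steps:
        return None
    return (steps_done, min(steps_done + x, total_steps))
```

Under this model, giving one simulation a larger x than another changes nothing structurally; the chain of tasks just takes fewer, longer hops.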

Bob
ExtraTerrestrial Apes
Volunteer moderator · Volunteer tester
Message 10045 - Posted: 21 May 2009, 19:52:43 UTC - in response to Message 10044.  

GDF said they use the concept of reliable hosts and a higher initial replication to speed up WUs (or rather, simulations) which are lagging behind the others. Imagine they run a parameter study with many simulations in parallel, one parameter differing between them. In such cases you'd want them all back before you start your analysis of what happened.. which may be needed to decide what the next set of simulations should do.

I have the impression it's important to have load balancing available, so most simulations finish at approximately the same time (give or take a few days, of course).

As for runtimes running dry: that happens already without this, so what would be the difference? I'm sure there is plenty of work to go around, so more work could be added where needed.


Imagine the following: most users choose the long tasks. The project issues most simulations for this "project", or whatever you want to call it. Now, for some reason, users switch to the standard runtime. There wouldn't be enough work issued over there, and no one would finish the simulations associated with the long-WU crew.. which may be needed to generate new work.

Sounds unlikely, you may say. Well.. yes. But if I were a developer I'd want to be sure I could handle load balancing between projects / runtimes. Anything else is asking for trouble at some point.

MrS
Scanning for our furry friends since Jan 2002
Scott Brown
Message 10051 - Posted: 21 May 2009, 21:01:57 UTC - in response to Message 10045.  

Regarding load balancing, there is already an option the user can check to accept other types of work if the preferred work is not available. Make this an always-on option (or perhaps switch it on automatically for reliable hosts), and that might solve the load-balancing issue.


ExtraTerrestrial Apes
Volunteer moderator · Volunteer tester
Message 10053 - Posted: 21 May 2009, 21:48:06 UTC - in response to Message 10051.  

Currently this option is not used at all.. or not that I know of. But it could be used.

MrS
Scanning for our furry friends since Jan 2002
skgiven
Volunteer moderator · Volunteer tester
Message 10277 - Posted: 28 May 2009, 22:04:30 UTC - in response to Message 10053.  

I think the whole concept of task competitiveness is moronic - especially coming from people with scientific backgrounds. It’s just too wasteful.
The amount of lost work must be huge for this project.

I’m sure there are thousands of people who have added this project in BOINC and just gave up after repeatedly getting no credit for work units.
Is there some sensible reason, I am not aware of, to turn people away from the project?

To be perfectly honest, as soon as there is another decent project out there that can utilise CUDA, I will be with it.

I will only be happy if I know my computer is doing worthwhile work – it’s me paying the electric bill, and it’s not cheap to run 150W graphics cards!
Zydor
Message 10278 - Posted: 29 May 2009, 0:07:33 UTC - in response to Message 10277.  

I’m sure there are thousands of people who have added this project in BOINC and just gave up after repeatedly getting no credit for work units.
Is there some sensible reason, I am not aware of, to turn people away from the project?


Under what circumstances is nil credit given when work has been completed? If someone is processing a WU that another host completes while they are mid-crunch, they still get credits, hence I am a bit confused by the statement - can you expand a little on what you mean by that?

Regards
Zy
Paul D. Buck
Message 10282 - Posted: 29 May 2009, 4:41:18 UTC

Tasks cancelled are those that have not been started, so no work is lost. If you have started a task, you can still get full credit for your work. But if someone else has already returned the result, a task you hold but have not yet started is no longer needed, and it can be cancelled.

The available projects that use Nvidia GPUs now include:

SaH, SaH Beta, The Lattice Project, Ramsey, Aqua, and soon we hope MilkyWay...

I know that SaH, SaH Beta, Ramsey, and Aqua are issuing GPU work at this time ...
skgiven
Volunteer moderator · Volunteer tester
Message 10306 - Posted: 29 May 2009, 17:06:39 UTC - in response to Message 10278.  
Last modified: 29 May 2009, 17:07:42 UTC

I have had a number of tasks cancelled/stopped/deleted mid-run. The system did this, not me. I don't know why, but I do know that some were even scrapped after reaching about 85% completion.

I'm not worried about bandwidth here, just about processing for two days only to have the server scrap the calculations. Especially when the task is over 80% done and the deadline is still days away; there is no issue of finishing in time.
Zydor
Message 10314 - Posted: 30 May 2009, 3:34:51 UTC - in response to Message 10306.  

Can you post a link to/identify which ones had that happen?

Regards
Zy
skgiven
Volunteer moderator · Volunteer tester
Message 10327 - Posted: 30 May 2009, 13:28:33 UTC - in response to Message 10314.  

Here is the message:
Task 703993 (workunit 481150)
Sent: 22 May 2009 13:06:15 UTC · Returned: 25 May 2009 16:47:19 UTC
Status: Over · Client error · Compute error
CPU time: 5,460.94 s · Claimed credit: 4,531.91 · Granted credit: --- · None.

Here are the links to this:
http://www.gpugrid.net/result.php?resultid=703993

http://www.gpugrid.net/workunit.php?wuid=481150

The Link Details:
stderr out <core_client_version>6.6.20</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce 8600 GT"
# Clock rate: 1300000 kilohertz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 4
# Number of cores: 32
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
# Using CUDA device 0
# Device 0: "GeForce 8600 GT"
# Clock rate: 1300000 kilohertz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 4
# Number of cores: 32
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
# Using CUDA device 0
# Device 0: "GeForce 8600 GT"
# Clock rate: 1300000 kilohertz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 4
# Number of cores: 32
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.

</stderr_txt>
]]>

My attempt at a solution:
I assumed that the error has nothing to do with the project/task/workunit (as I can do nothing about that, other than post messages).
So I changed the card to an 8800GT and upgraded BOINC, the Nvidia CUDA code, and the video drivers. I also popped in a Phenom II 940 (better instruction set). The system now scores 5.9 in everything, so hopefully this problem will not happen again.

©2025 Universitat Pompeu Fabra