Message boards : Graphics cards (GPUs) : Redundant results
| Author | Message |
|---|---|
|
Send message Joined: 7 Apr 09 Posts: 4 Credit: 1,121,005 RAC: 0
|
How many people are receiving the same workunit, and what exactly does the time limit mean? I've had two units cancelled now after working on them for about 10 hours: 552067 and 542428. Both units would have finished well within the time limit! Is the cancellation an error or project policy? In other words, am I wasting my time here?
|
Zydor Send message Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
I have had one cancelled in the last twenty, but that one was not running. I also note another post on this from two days ago that remains unanswered. WUs should not be cancelled pre-emptively if they are running, and I would be grateful for a response on this: cancelling models that are already running is an abuse of donated free time and resources and should not happen. It's a good idea to cancel redundant WUs that have yet to run. I can understand how they become redundant, and I applaud the existence of such a facility - it's win/win for all concerned. However, it is not acceptable to cancel those already running without pro rata credit for the effort already expended, or at the very least to quietly kill them off on upload. There is a high level of trust involved in crunching: we allow it to run automatically on our machines as a freely donated resource of personal time and effort, trusting it to be reliable, safe and of value. Pre-emptive action in the way this appears to be implemented is an abuse of that trust, and is a matter of important principle. Regards Zy |
|
Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
Mhh, I was not aware that WUs are also canceled if they have already started. This is good for the project, but it's unacceptable for the participant if 0 credits are awarded for x hours of GPU time. Talknuser, could you provide a link to the WUs in question or (temporarily) unhide your computers? MrS Scanning for our furry friends since Jan 2002 |
Paul D. Buck Send message Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
I am *NOT* a member of the project, but there are all kinds of technical reasons for the project to issue work and then cancel tasks that have not been started. The standard cancellation tool in the BOINC system will not cancel tasks that you have started to process, so you should get credit. The point is that, whatever the reason, the project is in fact looking out for all of us by canceling work that is not needed. BOINC has some automated mechanisms, but they don't offer perfect control, so occasionally we will see tasks downloaded that no longer need running. In truth, they HAVE canceled streams of tasks before they were issued too ... so ... If you can't stand it, the alternative is to leave (sadly), because this is the nature of the beast here ... sometimes we get tasks that get "recalled" ... heck, I probably get more of them than just about anyone ... look in my account for computer w2 ... |
|
Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
Paul, the problem here is that he says these canceled WUs had already started and quite some computation time went into them. For project speed it's still better to cancel them if they become redundant, but it's not fair to the user, so this should not happen unless credits are paid via trickles or some similar system. MrS Scanning for our furry friends since Jan 2002 |
Paul D. Buck Send message Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
> Paul, the problem here is that he says these canceled WUs had already started and quite some computation time went into them.

That shows you how badly I am doing today ... I guess I should quit while I am behind ... |
|
Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
Well, I also just made a stupid mistake (another thread) and refuse to recognize that it was actually time for me to go to bed half an hour ago. See you tomorrow ;) MrS Scanning for our furry friends since Jan 2002 |
|
Send message Joined: 7 Apr 09 Posts: 4 Credit: 1,121,005 RAC: 0
|
Unhidden - I only have 2 machines here so far :) The work unit ID shows that no work was done, but that is not the case! Although I was not there to actually watch the cancellation, this smallish rig was well underway with both results at the time I left it... Then, suddenly, they were cancelled/redundant. Not the end of the world, but certainly a waste of time, and definitely a problem for smaller machines if this is a general issue...
|
|
Send message Joined: 4 Apr 09 Posts: 450 Credit: 539,316,349 RAC: 0
|
I don't think this is a general issue; I keep pretty close track of my WUs, as I'm sure many other people do too. I hesitate to say this, but ... is it possible you made a mistake when you looked at the tasks? You did have one error and one complete. Can your card actually crunch two WUs at the same time? The task list will say "In Process" as soon as a task gets sent to you; it does not mean they are all actively being crunched at the same time. Currently you have two tasks that say "In Process", but if you check BOINC Manager I think you will see one that is processing and another that is waiting to run. Steve |
|
Send message Joined: 7 Apr 09 Posts: 4 Credit: 1,121,005 RAC: 0
|
@Snow Crash Like you, I like to keep tabs on my units - at least when I start a project. Once I'm sure things work, I don't care ;) And there's no chance this could be a mistake. The first one was an error for some reason - probably because I was still setting up the Linux box at the time. The second one got cancelled by the server with 10 hours or more completed. The same thing happened to #3. Number 4 was actually allowed to finish and upload in time :) Let's see what happens to #5 and #6 ;) Anyway, this is really not worth wasting time on, as no one from the project seems to bother. I only reported it because cycles were being wasted, and because I was not the first one to have this problem... Have fun out there :)
|
|
Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
I assume the host you're talking about has to be this one. Let's try to dissect what's happening (I'm assuming your machine runs 24/7 and that Linux BOINC doesn't suspend running GPU-Grid tasks):

- The 1st WU received should have started 1st. It supposedly ran for 22.5 h, until it was canceled at 9:48 on the 19th.
- The 2nd WU sent to you supposedly ran for 10 h after the 1st one was canceled. It stopped with an error and at that point lists 9 h of CPU time. Does GPU-Grid occupy an entire core of your Linux machine? If it does, the above looks probable. If not - say it's "only" 30 - 50% of one core - the situation looks different: in that case the 2nd WU could not have accumulated that much CPU time within 10 h and thus must have been running before the 1st WU was canceled. Which would likely mean that the 1st WU had not yet started when it was canceled.
- The next 2 WUs were sent at the same time, so we can't say which one started first. Let's call them S for success and F for fail.
- S may have run from 8:54 on the 20th until 11:16 on the 22nd. That's 50:22 h of wall clock time. It registers a run time of 47:05 h, so under perfect conditions (i.e. WU runtime = wall clock time) there was a maximum possible runtime interval of 3:17 h for F.
- F could in principle have run before or after S. However, after S is impossible because S finished on the 22nd, whereas F was canceled on the 21st. If we assume F ran before S there's another problem: it would have started on the 20th and been aborted on the 21st, so it would have run for far more than the 3:17 h maximum allowed by the minimum runtime of S.

-> If Linux BOINC doesn't suspend running GPU-Grid tasks, it is clear that F could not have been started when it was aborted.

> Anyway, this is really not worth wasting time on, as no one from the project seems to bother. I only reported it because cycles were being wasted, and because I was not the first one to have this problem...

Your report is greatly appreciated! The project can't fix what they don't know about. And the project staff is quite busy, so they usually only reply if they have something worthwhile to say. No reply doesn't mean they're not watching :) MrS Scanning for our furry friends since Jan 2002 |
|
Send message Joined: 7 Apr 09 Posts: 4 Credit: 1,121,005 RAC: 0
|
@ ETA Thanks for your in-depth breakdown, which made me think :) A couple of comments/observations:

* The unit with the error actually ran first, as #1 got stuck in the download queue.
* I've been monitoring the box closely today, and BOINC, for some reason, seems to allocate only 0.05 CPUs to GPUGRID, meaning that this particular box (running 24/7) in practice runs GPUGRID for only about 40% of the time rather than the expected 100% :(

With the above in mind, unit #2 (the erroneous one that ran first) would actually not have stopped running until after about 22.5 hours (as opposed to 9 hours), which is AFTER the two units in question were cancelled - meaning that neither of the cancelled units would have had time to start! So, provided the above observations/assumptions hold for the whole period, nothing was in fact wasted - except your time and mine ;) Sorry for missing that this box was not running full tilt, guys :) The next step is to find out why, but I won't bother you with that :D
|
Zydor Send message Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
The level of CPU use at 0.05 CPUs is a good thing. It indicates a low level of CPU involvement in the GPU application. In GPU crunching the CPU is there to load the GPU with the initial data set (hence the pause when a GPU WU first starts - the data is being loaded by the CPU into the GPU), and it also passes "what to do next" instructions to the GPU. The GPU - in crude terms - is inherently stupid compared to the CPU: it does not have integral instruction sets, it's a pure number cruncher, and it relies on the CPU to drive it and tell it what to do next. The lower the CPU number the better, as it indicates a more efficient GPU app. That in turn frees the CPU to get on with other things, such as crunching a CPU-based application - or letting you get on with the latest PowerPoint presentation, etc. - with minimal to zero lag/disruption. Many BOINC GPU projects have much higher CPU-assist figures; the low number is a pat on the back for the GPU app devs, not a figure of concern or worry. In SETI, for example, the CUDA WU (non-optimised) runs at 0.15 CPUs of assist, while an optimised SETI app runs at 0.04 CPUs. The slightly higher figure of 0.05 CPUs in GPUGRID is an indicator of the complexity of the model being run compared to SETI's. Regards Zy |
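To illustrate the division of labour Zydor describes - the CPU loads the initial data set into the GPU, then just keeps telling it what to do next while the GPU does the arithmetic - here is a minimal CUDA host-side sketch. The kernel `md_step`, the data size and the step count are invented for illustration; this is not GPUGRID's actual code.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: one small piece of number crunching per call.
__global__ void md_step(float *state, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        state[i] = state[i] * 0.999f + 0.001f;   // stand-in for the real physics
}

int main()
{
    const int n = 1 << 20;                        // size of the data set
    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;

    // 1) The CPU loads the initial data set into the GPU
    //    (the pause seen when a GPU WU first starts).
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // 2) The CPU repeatedly tells the GPU "what to do next"; the GPU does
    //    essentially all of the arithmetic, so the CPU share stays small.
    for (int step = 0; step < 1000; ++step)
        md_step<<<(n + 255) / 256, 256>>>(dev, n);

    // 3) Results come back to the CPU only when they are needed.
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return 0;
}
```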
Michael Goetz Send message Joined: 2 Mar 09 Posts: 124 Credit: 124,873,744 RAC: 0
|
> The lower the CPU number the better, as it indicates a more efficient GPU app.

I never really thought about it much, but I suspect that the amount of CPU used will have a lot to do with the speed of the CPU vs. the speed of the GPU, not just the efficiency of the software. Put a monster video card in a computer and you'll see much higher CPU usage than you would with a mediocre video card - the CPU has to work that much harder to keep the GPU running. The same effect applies to a slow CPU vs. a fast CPU. Put an 8000-class video card in an i7 machine and I'm sure you'll see *very* low CPU utilization! Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.
|
Zydor Send message Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
A valid point, and it does, to a degree, have just that effect. Inevitably there is a "floor" below which the CPU-assist number will not go; it will never be zero, as the GPU has no inherent instruction set "intelligence". As GPU applications become more refined they will perform faster, as the GPU app is tweaked both to perform the maths in a more efficient way and to ask for less help from the CPU. GPU crunching is still in its infancy; there is a lot of "sledgehammer to crack a nut" going on inside the beast, and there is huge latent power lurking in there yet to be fully tapped. The MW WU explosion was due to a model written especially for the GPU, not just an "adapted" CPU model. Other factors were clearly involved - double precision/single precision, yaddie yadda - that led to short-term fanboyism over ATI/NVidia cards. In truth, in the long term it will even out in performance/card-vendor terms as GPU apps become more refined and written specifically for a GPU. The low CPU involvement is why lower-powered CPU-based machines can still produce cracking results with a GPU app - the GPU is doing all the work. In those cases there are hardware issues, such as whether the "older" CPU and motherboard can feed the card in terms of data throughput (x16/x8 PCIe lanes, etc.). Regards Zy |
Michael Goetz Send message Joined: 2 Mar 09 Posts: 124 Credit: 124,873,744 RAC: 0
|
> ... as the GPU has no inherent instruction set "intelligence".

Are you SURE about that? I'm pretty sure the GPUs are actually full-blown computer cores. (Probably not x86-ish CPUs; I think they're custom RISC processors.) Granted, it's been about six months since I read through the documentation that comes with the CUDA SDK, but my impression was that the multiprocessors on the Nvidia cards are complete CPUs in and of themselves. Yes, there are vector processors (aka "shaders") on the cards, but each group of 8 shaders is attached to one of these multiprocessors, which have full instruction sets. A GTX 280 or 285, for example, has 30 multiprocessors - essentially it's a (somewhat slow) 30-core CPU. What makes it so powerful is that the arithmetic unit on each of those cores is a vector processing unit that can do 8 calculations in parallel - not to mention that there are 30 of those cores, which in aggregate have a total of 240 shaders. It's possible that I'm misremembering what I read, or perhaps I misunderstood it, but my impression was that the CUDA processors can handle arbitrarily complicated programs all by themselves. The only shortcomings would be if you needed more memory than is available on the video card, or if I/O were required; then you need some coordination with the CPU. The tricky part (and this applies to any vector processing system, not just CUDA) is writing the code in such a way that the parallelism is exploited to its fullest. That is a quite complex topic. (For example, if you have a branch instruction - an IF statement in a high-level language - on a vector processor, what happens if the 8 different shaders/vector processors don't yield the same branch result?) Mike |
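To make that closing question concrete, here is a minimal CUDA sketch of branch divergence; the kernel and data are invented for illustration. When threads in the same group (a warp) take different branches, the hardware runs the two paths one after the other with the non-participating threads masked off, so the code still works but the divergent section costs roughly the time of both paths combined.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread takes one of two branches depending on its index.
// Divergent threads in a warp are serialised: the "if" path runs with the
// odd threads masked off, then the "else" path runs with the even ones off.
__global__ void divergent(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (i % 2 == 0)
        out[i] = i * 2.0f;      // even-numbered threads take this path
    else
        out[i] = i * -1.0f;     // odd-numbered threads take this one
}

int main()
{
    const int n = 16;
    float h_out[n];
    float *d_out = nullptr;

    cudaMalloc(&d_out, n * sizeof(float));
    divergent<<<1, n>>>(d_out, n);                // one block of 16 threads
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("%2d -> %5.1f\n", i, h_out[i]);

    cudaFree(d_out);
    return 0;
}
```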
|
Send message Joined: 1 Feb 09 Posts: 139 Credit: 575,023 RAC: 0
|
I don't agree with you that GPUs are full-blown processors. They are made to do certain tasks, but to do them as fast as possible, and they are not nearly as complex as a CPU. Maybe in time we will see this change, because they could be given some kind of intelligence, but for now they are basically raw data monsters. They calculate some instructions and, because they are split up into 240 smaller units, do it lightning fast. Still, a CPU will tell it what to do and feed it a packet it can work on, then go back to other work until it gets a signal from the GPU that the job is done. So in every way the GPU is just a simple co-processor that can calculate fast. PS: Look at the MythBusters example about GPUs: the CPU robot is made to move in a circle and shoot some paint pellets, then move to the middle to make the eyes and mouth, while the GPU cannon simply shoots many colours at once, making it look like it did more. But you can't compare them at all, because the CPU had to make very complex moves and extras to come to a result, while the GPU cannon just had to shoot all the pellets at once. So in itself the GPU did more work, yes, but with very, very simple instructions |
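A small sketch of the "feed it a packet, go back to other work, wait for the signal" pattern described above, assuming a hypothetical kernel `crunch_packet`: CUDA kernel launches return immediately, so the CPU can record an event and poll it while it does something else.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical "packet" of work handed to the GPU.
__global__ void crunch_packet(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = sqrtf(data[i] * data[i] + 1.0f);
}

int main()
{
    const int n = 1 << 22;
    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemset(dev, 0, n * sizeof(float));

    cudaEvent_t done;
    cudaEventCreate(&done);

    // Hand the packet to the GPU; the launch returns immediately.
    crunch_packet<<<(n + 255) / 256, 256>>>(dev, n);
    cudaEventRecord(done);                     // "signal me when this is finished"

    // The CPU is free to do other work and just checks in occasionally.
    while (cudaEventQuery(done) == cudaErrorNotReady) {
        // ... other CPU work (or another BOINC task) would run here ...
    }
    printf("GPU signals that the packet is done\n");

    cudaEventDestroy(done);
    cudaFree(dev);
    return 0;
}
```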
Michael Goetz Send message Joined: 2 Mar 09 Posts: 124 Credit: 124,873,744 RAC: 0
|
EDIT: post greatly shortened; I'm not going to argue about this. Read for yourself - here's the CUDA SDK documentation: http://www.nvidia.com/object/cuda_develop.html. In particular, you might want to take a look at this document. Mike |
|
Send message Joined: 19 Apr 09 Posts: 1 Credit: 1,053,798 RAC: 0
|
| Task | Work unit | Sent | Reported | Status | CPU time | Claimed credit | Granted credit |
|---|---|---|---|---|---|---|---|
| 569200 | 404560 | 23 Apr 2009 7:23:33 UTC | 24 Apr 2009 8:24:51 UTC | Over - Redundant result - Cancelled by server | 0.00 | --- | --- |
| 569139 | 404534 | 23 Apr 2009 7:24:14 UTC | 24 Apr 2009 18:25:03 UTC | Over - Redundant result - Cancelled by server | 0.00 | --- | --- |
| 554332 | 397382 | 20 Apr 2009 18:53:34 UTC | 21 Apr 2009 19:21:46 UTC | Over - Client error - Aborted by user | 0.00 | 0.00 | --- |
| 549590 | 395293 | 19 Apr 2009 20:44:48 UTC | 21 Apr 2009 18:50:53 UTC | Over - Redundant result - Cancelled by server | 0.00 | --- | --- |
| 549584 | 395292 | 19 Apr 2009 20:44:48 UTC | 21 Apr 2009 18:49:42 UTC | Over - Redundant result - Cancelled by server | 0.00 | --- | --- |
| 549466 | 395240 | 19 Apr 2009 20:45:23 UTC | 20 Apr 2009 17:12:28 UTC | Over - Redundant result - Cancelled by server | 0.00 | --- | --- |

A very strange BOINC project. Working, working, working, without any credit. I give computer time not only for the credits (some projects I'm used to crunching for don't give much per hour), but this project really holds the world record: 0 credits per hour!!! At least, is it a useful project??? I was happy that my GPU could help BOINC projects, but I'm going to leave this project without any regret if there's no way for me to get at least one credit... Is it normal that when you finish your WU but not first (though before the max time, of course) you don't get even one credit??? Please, at least one, so that I get more than 0 credits after hours of hard work ;-) I'm new to this project. Maybe it's a temporary bug. Any help?
|
Paul D. Buck Send message Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
Note the zero compute time ... you lost nothing. |