Message boards : Server and website : The problem with the size of sent jobs.
Good day!

ID: 44363
Look at the server status page to see what types of workunits are currently active.

ID: 44364
Thank you. Perhaps with my hardware it makes no sense to consider GPUGRID at all; it's too much fuss.

ID: 44365
If the GPU can finish the non-GIANNI tasks within the 5-day time period, I think it is still worth running GPUGRID and just keeping an eye on it (if you are inclined to keep an eye on it anyway), making sure it is running a working ADRIA or a GERRARD.

ID: 44367
> Thank you. Perhaps with my hardware it makes no sense to consider GPUGRID at all; it's too much fuss.

Not necessarily. Recently my GTX 660 finished 2 long GIANNI tasks in 2.3 days. More than enough time to return a task before the deadline. https://www.gpugrid.net/result.php?resultid=15244862 https://www.gpugrid.net/result.php?resultid=15258440

ID: 44376
> Thank you. Perhaps with my hardware it makes no sense to consider GPUGRID at all; it's too much fuss.

Now, if these super-long WUs were put in a separate queue that GTX 980/Ti users could opt into, they'd be done much more reliably and efficiently. Those with normal GPUs could then also run the normal long WUs more efficiently, without having to try to avoid the crazy super-long ones. Very easy to implement.

ID: 44377
I agree that it would be nice to have an additional option to receive only the jobs recommended for your hardware.

ID: 44386
> Thank you. Perhaps with my hardware it makes no sense to consider GPUGRID at all; it's too much fuss.

Or they could just change the description of "Long Runs" to "8-27 hours on fastest cards" and "Short Runs" to "3-9 hours on fastest cards". Either way, with new classifications or new categories, none of the tasks are in danger of the 5-day timeout except on the oldest cards, which, according to the "Stats" page, are already phased out, yet can still complete a task, so people continue to run them.

Remember that 5 days is the threshold for a task, not a bonus period for credit. If you can return a task within 5 days, including download, actual running, paused periods, upload, etc., then you don't have any problems; you just have a grudge about the credit. And the longer-running units do offer more credit anyway, though perhaps not completely proportional in all timed runs (the longer the run, the less credit per runtime hour, which is always true anyway), so a long run versus a longer run doesn't matter.

If you can't finish and return a WU within 5 days of real time, then it is time to consider a different project. Until then, it's not a problem; let them run.

ID: 44388
Also, something to consider is that the "time estimated to complete" may count down much faster than real time at some points, because the client averages over different WUs to learn how fast your system runs them. For instance, sometimes after doing several GIANNI WUs it will tell me the next GERRARD will take 1 day 6 hours, but the countdown moves at 2-3 seconds per real second and the task completes in 8-12 hours. Then, if it does a bunch of GIANNIs and ADRIAs in a row and gets another GIANNI, it will say there are 15 hours to go, and the task takes a day and 6 hours, with the countdown advancing 1 second for every 2-3 seconds of real time. The estimated time is not exact; it is an estimate based on the other tasks you have been doing and the expected size of the task.
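For the curious, BOINC's real estimator is more involved (it keeps per-host correction statistics), but the smoothing behaviour described above can be sketched roughly as follows. The smoothing constant and all runtimes below are made-up illustrations, not GPUGRID's actual values:

```python
# Rough sketch of a BOINC-style duration correction factor (DCF).
# The client scales the server's raw estimate by a factor learned from
# how long previous tasks actually took on this host.
# The alpha=0.1 smoothing constant and the runtimes are invented examples.

def update_dcf(dcf, estimated_hours, actual_hours, alpha=0.1):
    """Nudge the correction factor toward the observed actual/estimated ratio."""
    observed_ratio = actual_hours / estimated_hours
    return dcf + alpha * (observed_ratio - dcf)

dcf = 1.0
# Several fast tasks finish well under their 30-hour estimate...
for est, actual in [(30, 10), (30, 9), (30, 11)]:
    dcf = update_dcf(dcf, est, actual)

# ...so the next estimate shrinks, and the on-screen countdown appears to
# run faster than real time, until a genuinely slow task corrects it back.
next_estimate = 30 * dcf
print(round(next_estimate, 1))
```

This is why the countdown overshoots in both directions: the factor always lags behind a change in the mix of task types.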
ID: 44389
> Remember that 5 days is the threshold for a task, not a bonus period for credit. If you can return a task within 5 days, including download, actual running, paused periods, upload, etc., then you don't have any problems; you just have a grudge about the credit.

This would be true if they fix the app in the next version. Right now there's better than a 50/50 chance that a WU will fail if a power glitch occurs, and a 5-day WU is about 5 times more likely to be hit by one than a 1-day WU. In addition, look at the normal failure rate of the super-long WUs right now: it's almost 50%. It doesn't make anyone very happy to lose 5 days' worth of crunching to a bad WU, or to any other cause for that matter.
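To put a rough number on the "5 times more likely" claim: if power glitches arrive independently at some average rate, the chance a task is killed grows almost linearly with runtime for small rates. A sketch, using an invented glitch rate of one outage per 20 days and the roughly 50% per-glitch failure chance mentioned above:

```python
import math

# Probability a workunit is killed before completion, assuming power
# glitches arrive as a Poisson process and each glitch kills the running
# task with probability p_kill. The rate (1 glitch per 20 days) is a
# made-up illustration; p_kill=0.5 echoes the ~50/50 claim above.

def failure_probability(runtime_days, glitches_per_day=1 / 20, p_kill=0.5):
    expected_fatal_glitches = glitches_per_day * runtime_days * p_kill
    return 1 - math.exp(-expected_fatal_glitches)

one_day = failure_probability(1)   # roughly 2.5%
five_day = failure_probability(5)  # roughly 11.8%
print(five_day / one_day)          # close to, but slightly under, 5x
```

So the "5 times" intuition is a good approximation while the per-task risk stays small; it only breaks down for very long tasks or very unreliable power.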
ID: 44400
The GIANNI and ADRIA tasks aside, my point is valid. As Zoltan points out, when a new task batch is put out, the first few results returned are usually failures, not because of the tasks themselves, but because the hosts that get them, fail them, and return them as failed do so at a much faster and higher rate than a normally working host can return a success. So that does indeed make the percentage listings on the Performance page misleading to a point, and more so the newer the task batch. And I suspect that the ADRIA and GIANNI failures have been making the other tasks fail too, by affecting them inside the BOINC software, inside the OS, or through a reboot/crash of the system. Perhaps, if the GIANNI and ADRIA tasks had not been failing as they do, the GERRARD tasks would have a slightly better success ratio right now.

ID: 44426
> Right now there's better than a 50/50 chance that a WU will fail if a power glitch occurs. A 5 day WU is 5 times more likely for this to happen than a 1 day WU.

This problem is more likely to happen on hosts with fast GPUs (GTX 970 and above), so the chance of such a failure is *not* in direct ratio with the runtime: a longer runtime means a slower GPU, so the OS has time to write the contents of the files needed for a restart to the disk. I think the chance of the error is in inverse (non-linear) ratio to the GPU speed.

> In addition look at the normal failure rate of the super long WUs right now. It's almost 50%.

This error rate includes user-aborted workunits too, as such workunits are counted as failures. So the rate is distorted by users (including you) selectively aborting tasks based on their length, error rate, or earned bonus. It's quite awkward that you cite the error rate of these tasks while you are actively increasing it. That said, they should have been put in another queue, but there won't be more than two queues. Brace yourself: workunits get longer every time a new GPU generation is released, and we're facing that right now.
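The distortion described above is easy to quantify: if user aborts are counted as failures, the published rate overstates the genuine error rate. A sketch with invented counts (these are not GPUGRID's real numbers):

```python
# How user-aborted workunits inflate a published failure rate.
# All counts below are invented for illustration only.

completed = 550
genuine_errors = 300
user_aborts = 150  # counted as failures by the stats page

total = completed + genuine_errors + user_aborts
published_rate = (genuine_errors + user_aborts) / total
genuine_rate = genuine_errors / (total - user_aborts)

print(f"published: {published_rate:.0%}, genuine: {genuine_rate:.0%}")
```

With these sample numbers the published figure is noticeably higher than the rate among tasks that were actually allowed to run to an outcome.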
ID: 44429
> Right now there's better than a 50/50 chance that a WU will fail if a power glitch occurs. A 5 day WU is 5 times more likely for this to happen than a 1 day WU.

Yet another thunderstorm last night, and only the 2 WUs on the 650Ti cards survived. All 15 of the 750Ti WUs either crashed or restarted from zero. I'm also finding that if a WU restarts from zero after a power outage, it will almost always either error out later or fail validation if it somehow finishes. If it happens to restart where it left off, it is generally good to go. Therefore it's best to abort the restarts (the from-zero type). I sure hope the admins do something to make this app more fault tolerant.

ID: 44434
> Yet another thunderstorm last night and only the 2 WUs on the 650Ti cards survived. All 15 of the 750Ti WUs either crashed or restarted from zero.

Have you tried turning off write caching for your disks?

1. Windows key + R, type Devmgmt.msc, press ENTER
2. Disk drives -> double-click your BOINC disk
3. Policies tab -> un-check (both) write-caching option(s) -> OK
4. Close Device Manager

ID: 44437
> Yet another thunderstorm last night and only the 2 WUs on the 650Ti cards survived. All 15 of the 750Ti WUs either crashed or restarted from zero.

Zoltan, THANKS MUCH for this. I've now turned off write caching on the BOINC drive on all the machines; hopefully that will help. Losing 15 WUs in an instant was irritating. Usually I lose over 50% in an outage, but 15 out of 17 is ridiculous. Last night we had another huge thunderstorm but luckily no power outage. I can hear another storm in the distance moving toward us right now, so this may get a test soon. Hopefully the admins will improve/fix the app in the next build so things like this won't be necessary. I don't think the weather on this planet is going to get less violent, at least in the near future. :-(

ID: 44442
> I've now turned off write caching on the BOINC drive on all the machines. Hopefully that will help.

It should help. You're the best test subject to tell whether my theory is right. I'm really curious to know.

ID: 44444
> I've now turned off write caching on the BOINC drive on all the machines. Hopefully that will help.

The current storm passed by to the north. I will certainly let you know when this happens again. Sometimes we'll go a long time with no outages, sometimes they're frequent. Usually they last only a few seconds, occasionally longer, but of course the damage is the same either way. In fact it may even be better for the hardware (the drives) to wind all the way down before restarting. Thanks again for posting this workaround; I have my fingers crossed...

ID: 44446
> I've now turned off write caching on the BOINC drive on all the machines. Hopefully that will help.

You *could* flip the switch on a power strip as a very similar test case. Just sayin'. Probably not nice to the PC or disks, but I believe it is exactly the same test case.

ID: 44448
Also trying this on the 2 systems I get errors on: the one with all the errors on the 3 Ti Classies, and the one at the bad power location. Let's see if the error rate goes down on these.

ID: 44449
> I've now turned off write caching on the BOINC drive on all the machines. Hopefully that will help.

Great idea, and wonderful of you to volunteer! Even better, I'd suggest cycling the main house breaker on and off while the family watches a movie or plays online games!

ID: 44451
> You *could* flip the switch on a power strip, as a very similar test case. Just sayin'. Probably not nice to the PC or disks, but I believe it is exactly the same test case.

You are right that it's similar from an IT point of view, but it's not the same from an electrical point of view. A surge in the power grid caused by lightning can cause much more trouble in the couple of milliseconds *before* the emergency power switch (or a fuse) trips; therefore you can't test it by simply switching the power off. However, if my method doesn't protect data integrity from a simple dirty shutdown, it won't protect the system from thunderstorms either.

ID: 44453
> I've now turned off write caching on the BOINC drive on all the machines. Hopefully that will help.

Surely you know that, if I had an expendable PC capable of running GPUGrid tasks, this (flipping the power-strip switch) would be one of my test cases for any proposed fixes that GPUGrid releases. As it is right now, however, I don't have that equipment, and GPUGrid isn't actively fixing the problem.

ID: 44454
> You *could* flip the switch on a power strip, as a very similar test case. Just sayin'. Probably not nice to the PC or disks, but I believe it is exactly the same test case.

Which is why a proper business-grade UPS would, IMO, be a better solution to these local weather-related problems. A lightning surge can take out a great deal more than the current GPUGrid task, up to and including a motherboard or CPU. You can't ask GPUGrid to take care of that eventuality too.

ID: 44455
I used to have UPSes on all the machines, but they're expensive and a pain to maintain, and when there was an outage there'd be loud beeping from everywhere. Even the better ones seem to have a rather high failure rate, and on those the batteries are expensive. When lightning hit the lightning rod on my house, all the Intel-based boxes died even though they had UPSes (all the APC sine-wave type). As I've mentioned before, all the AMD machines survived. Go figure. I switched to high-grade surge protectors and haven't had ANY machine failures since. So there are pluses and minuses, in my experience. When I was a network manager I had beefy UPSes on all the servers (mostly Novell), and that worked pretty well, but mission-critical business apps (and my butt) were at stake. ;-)

ID: 44456
We've drifted OT from task size to build issues and tips, but here's my 2p's worth.

ID: 44546