Message boards : Server and website : 11 day slacker

Stoneageman
Joined: 25 May 09
Posts: 224
Credit: 34,057,224,498
RAC: 231
Message 33563 - Posted: 20 Oct 2013 | 16:53:04 UTC

I haven't been paying attention to my crunchers of late and noticed THIS WU, which had stalled for 11 days (is that a record?).
Flashawk had completed it after 5 days, so should the server have cancelled this task on my machine? I aborted it and the card has worked fine since.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 33565 - Posted: 20 Oct 2013 | 20:30:58 UTC - in response to Message 33563.
Last modified: 20 Oct 2013 | 20:35:10 UTC

Beats my 5-day record (Ubuntu 13.04).

Typically, servers don't cancel work once it's started, though some projects do. Sometimes it's done by mistake, sometimes if there is a bad batch, and some projects sort of do this routinely.

In the past I've suggested that server cancellations be implemented here (so long as credit is given for the work completed). If the task is not needed it's a waste of time, money and effort.

In some instances duplicated work can add to the validity of the methods, but models that 'wiggle randomly' will give different results. Sometimes it's better to get two results, but here it's probably better to re-run the entire batch, after some data analysis.

In situations where work isn't needed it should be halted, credit given and other work sent, but there should also be a cutoff point (a basic failover mechanism). On some projects you used to get the message 'won't complete in time, consider aborting' (not much use if you're not sitting at the system, though). After a certain run time, based on the size of the WU and the speed of the GPU (GFlops), a WU could be stopped and returned. Ideally the app would self-assess for normal progress: if it completed 5% in 600 sec, does it keep progressing at that rate? If not, suspend and resume from the last checkpoint, and if it still doesn't progress normally at the second or third attempt, end the task. A hard cutoff point is also needed. Work gets resent after 5 days, so if a task doesn't finish by day 6 it should be stopped anyway. It certainly makes no sense to allow a task to continue running if it's already been completed on another system and isn't needed. If a run-time cutoff of 3 times the expected completion time were implemented, the WU would stop and get returned after 36 h (based on work normally completing in 12 h).
Just rough ideas, but for the cruncher it's better than losing 3M credits.
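
To make the cutoff rules above a bit more concrete, here is a rough Python sketch. It is only an illustration of the idea, not how the GPUGrid app or the BOINC server actually behaves. All names (`Task`, `estimate_runtime_s`, `progress_is_normal`, `decide`) and the tolerance and retry defaults are assumptions made up for this example; the 3x factor, the 6-day hard cutoff and the 12 h typical run time are the figures from the post.

```python
from dataclasses import dataclass

HOUR = 3600.0
DAY = 24 * HOUR


@dataclass
class Task:
    """Hypothetical view of a running WU; not a real BOINC structure."""
    expected_runtime_s: float          # e.g. 12 h for a typical WU
    elapsed_s: float                   # run time so far
    completed_elsewhere: bool = False  # another host already returned it
    stalled_attempts: int = 0          # checkpoint restarts without normal progress


def estimate_runtime_s(wu_gflop: float, gpu_gflops: float) -> float:
    """Crude estimate: work size (GFLOP) divided by GPU speed (GFLOP/s)."""
    return wu_gflop / gpu_gflops


def progress_is_normal(fraction_delta: float, interval_s: float,
                       baseline_rate: float, tolerance: float = 0.5) -> bool:
    """Compare progress in the last interval with the task's earlier rate.

    baseline_rate is fraction done per second, e.g. 0.05 / 600 if the
    task completed 5% in its first 600 sec. tolerance is an assumed value.
    """
    return fraction_delta >= tolerance * baseline_rate * interval_s


def decide(task: Task, runtime_factor: float = 3.0,
           hard_cutoff_s: float = 6 * DAY,
           max_stalled_attempts: int = 3) -> str:
    """Return what the client should do with the task."""
    if task.completed_elsewhere:
        return "stop: result no longer needed"
    if task.elapsed_s >= hard_cutoff_s:
        return "stop: hard cutoff"
    if task.elapsed_s >= runtime_factor * task.expected_runtime_s:
        return "stop: runtime cutoff"
    if task.stalled_attempts >= max_stalled_attempts:
        return "stop: no progress"
    if task.stalled_attempts > 0:
        return "restart from checkpoint"
    return "continue"
```

With these assumed defaults, a WU expected to take 12 h is stopped once it reaches 36 h, and an 11-day staller like the one in this thread would have hit the 6-day hard cutoff at the latest. Checking `completed_elsewhere` first reflects the point that a task nobody needs should stop regardless of how far along it is.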

Matt hasn't lent his deft hand to troubleshooting issues on Linux yet - Maybe next month...
