Message boards : News : WU: OPM simulations
Joined: 28 Mar 09 · Posts: 490 · Credit: 11,731,645,728 · RAC: 52,725
For future reference, I have a couple of suggestions. Fine, they underestimated the runtimes, but I stand by my statement for the super-long-runs category: the future WU application version should be made less CPU dependent.

As for making the WUs less CPU dependent, there are some things that are easy to move to GPU calculation and some that are difficult; maybe not impossible, but definitely impractical.

Okay, I get this point, and I am not asking for the moon, but something more must be done to avoid or minimize this bottleneck from a programming standpoint. That's all I'm saying.
Joined: 5 Mar 13 · Posts: 348 · Credit: 0 · RAC: 0
@Retvari @MrJo I mentioned the credit algorithm in this post: https://www.gpugrid.net/forum_thread.php?id=4299&nowrap=true#43328

I understand the runtime was underestimated, but given the knowledge we had (projected runtime) it was our best guess. I only sent WUs that were projected to run under 24 hours on a 780. If that ends up being more than 5 days on some GPUs, I am really sorry; we didn't consider that possibility.

The problem with equilibrations is that we cannot split them into multiple steps like the normal simulations, so we just have to push through this batch and then we are done. I expect it to be over by the end of the week. No more simulations have been sent out for a few days now, so only the ones that cancel or fail will be resent automatically.
Joined: 5 Jan 09 · Posts: 670 · Credit: 2,498,095,550 · RAC: 0
Don't worry about that, Stefan; it's a curve ball that you've thrown us, that's all.

"I expect it to be over by the end of the week"

I like your optimism, given the users who hold on to a WU for 5 days and never return it, and those that continually error even after long run times. Surely there must be a way to suspend their ability to get WUs completely until they log in. They seriously impact our ability to complete a batch in good time.
Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
"I like your optimism given the users who hold on to a WU for 5 days and never return and those that continually error even after long run times."

5 days does sound optimistic. Maybe you can ensure resends go to top cards with a low failure rate? It would be useful if the likes of this batch were better controlled; only sent to known good cards. See the RACs of the 'participants' this WU was sent to: https://www.gpugrid.net/workunit.php?wuid=11595345

I suspect that none of these cards could ever return extra-long tasks within 5 days (not enough GFlops), even if properly set up:

NVIDIA GeForce GT 520M (1024MB) driver: 358.87 [too slow and not enough RAM]
NVIDIA GeForce GT 640 (2048MB) driver: 340.52 [too slow, only CUDA 6.0 driver]
NVIDIA Quadro K4000 (3072MB) driver: 340.62 [too slow, only CUDA 6.0 driver]
NVIDIA GeForce GTX 560 Ti (2048MB) driver: 365.10 [best chance to finish, but runs too hot; all tasks fail on this card when it hits ~85C; it needs proper cooling {prioritise temperature}]

I like the idea that WUs don't get sent to systems with high error rates until they log in (and are directed to a recommended settings page).

FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help
Joined: 5 Jan 09 · Posts: 670 · Credit: 2,498,095,550 · RAC: 0
Hi SK,

At least the only thing wasted there was bandwidth, because they errored immediately. The bigger problem (especially, but not only, in a WU drought) is the person who never returns, so the WU sits in limbo for 5 days before it gets resent; then it possibly goes to another machine that holds it for another 5 days while good machines are sat waiting for work.

It's NOT good for users.
It's NOT good for scientists.
It's NOT good for this project.

About time we got the broom out.
Joined: 5 Mar 13 · Posts: 348 · Credit: 0 · RAC: 0
Actually, there is a way to send WUs only to top users: by increasing the job priority over 1000. I thought I had it at max priority, but I just found out that it's only set to 'high', which is 600. I guess most of these problems would have been avoided given a high enough priority.

In any case, we are at 1721 completed / 1582 running, so I think my estimate might roughly hold :P
Joined: 5 Jan 09 · Posts: 670 · Credit: 2,498,095,550 · RAC: 0
Thanks Stefan.

Finally, just an example of a different scenario which is frustrating: https://www.gpugrid.net/workunit.php?wuid=11591312

A user exceeds 5 days, so the WU gets sent to my slowest machine, which completes it in less than 24 hours. However, the original user then sends back a result after 5 days, pipping my machine to the post and making my effort a waste of time. This was of course because the original user downloaded 2 WUs at a time on a 750 Ti, which struggles to complete one WU in 24 hours running 24/7.

On my 980 Ti I recently aborted a unit that was 40% complete for exactly the same reason.
Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 1
"Actually there is a way to send WUs only to top users by increasing the job priority over 1000. I thought I had it at max priority but I just found out that it's only set to 'high' which is 600."

Priority usually does not cause exclusion, only lower probability. Does priority over 1000 really exclude the less reliable hosts from task scheduling when there are no lower-priority tasks in the queue? I think it is true only when there are lower-priority tasks also queued to give to the less reliable hosts. Is it a documented feature?
Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
Hi SK,

Exactly: send a WU to 6 clients in succession that don't run it or return it for 5 days, and after a month the WU still isn't complete. IMO, if client systems report a cache of over 1 day they shouldn't get work from this project.

I highlighted a best-case scenario (where several systems were involved): the WU runs soon after receipt, fails immediately, gets reported quickly and can be sent out again. Even in this situation it went to 4 clients that failed to complete the WU. To send and receive replies from 4 systems took 7 hours.

My second point was that the WU never really had a chance on any of those systems. They weren't capable of running these tasks (not enough RAM, too slow, oldish driver [slower or more buggy], too hot, or just not powerful enough to return an extra-long WU inside 5 days).
Joined: 5 Jan 09 · Posts: 670 · Credit: 2,498,095,550 · RAC: 0
I take your point. In the 7 hours that WU was being bounced from bad client to bad client, a fast card would have completed it. Totally agree. In fact, Retvari could have completed it and taken an hour out for lunch. Haha.
Joined: 18 Apr 14 · Posts: 43 · Credit: 1,192,135,172 · RAC: 0
"I understand the runtime was underestimated"

The principle in the law "In dubio pro reo" should also apply here: when in doubt, more points.

"I only sent WUs that were projected to run under 24 hours on a 780"

Not everyone has a 780. I run several 970s, 770s, 760s, 680s and 950s (I don't use my 750 any longer). A WU should be such that it can be completed by a midrange card within 24 hours. A 680 or a 770 is still a good card. Alternatively, one could run a fundraising campaign so poor crunchers get some 980 Ti's ;-)

Regards, Josef
Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
https://boinc.berkeley.edu/trac/wiki/ProjectOptions

Accelerating retries

The goal of this mechanism is to send timeout-generated retries to hosts that are likely to finish them fast. Here's how it works:

• Hosts are deemed "reliable" (a slight misnomer) if they satisfy turnaround-time and error-rate criteria.
• A job instance is deemed "need-reliable" if its priority is above a threshold.
• The scheduler tries to send need-reliable jobs to reliable hosts. When it does, it reduces the delay bound of the job.
• When job replicas are created in response to errors or timeouts, their priority is raised relative to the job's base priority.

The configurable parameters are:

<reliable_on_priority>X</reliable_on_priority>
Results with priority at least reliable_on_priority are treated as "need-reliable". They'll be sent preferentially to reliable hosts.

<reliable_max_avg_turnaround>secs</reliable_max_avg_turnaround>
Hosts whose average turnaround is at most reliable_max_avg_turnaround and that have at least 10 consecutive valid results are considered 'reliable'. Make sure you set this low enough that a significant fraction (e.g. 25%) of your hosts qualify.

<reliable_reduced_delay_bound>X</reliable_reduced_delay_bound>
When a need-reliable result is sent to a reliable host, multiply the delay bound by reliable_reduced_delay_bound (typically 0.5 or so).

<reliable_priority_on_over>X</reliable_priority_on_over>
<reliable_priority_on_over_except_error>X</reliable_priority_on_over_except_error>
If reliable_priority_on_over is nonzero, increase the priority of duplicate jobs by that amount over the job's base priority. Otherwise, if reliable_priority_on_over_except_error is nonzero, increase the priority of duplicates caused by timeout (not error) by that amount. (Typically only one of these is nonzero, and is equal to reliable_on_priority.)

NOTE: this mechanism can be used to preferentially send ANY job, not just retries, to fast/reliable hosts. To do so, set the workunit's priority to reliable_on_priority or greater.
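For reference, here is a sketch of how the options quoted above might look in a BOINC project's server-side config.xml. The values are illustrative only, not GPUGRID's actual settings:

```xml
<boinc>
  <config>
    <!-- jobs with priority >= 1000 are treated as "need-reliable" -->
    <reliable_on_priority>1000</reliable_on_priority>
    <!-- hosts averaging under 2 days (172800 s) turnaround can qualify as reliable -->
    <reliable_max_avg_turnaround>172800</reliable_max_avg_turnaround>
    <!-- halve the deadline when a need-reliable job goes to a reliable host -->
    <reliable_reduced_delay_bound>0.5</reliable_reduced_delay_bound>
    <!-- bump timeout/error-generated retries over the need-reliable threshold -->
    <reliable_priority_on_over>1000</reliable_priority_on_over>
  </config>
</boinc>
```

With settings like these, a retry created after a 5-day timeout would be raised above the threshold and steered toward fast-turnaround hosts, which is the behaviour being asked for in this thread.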
Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
"I understand the runtime was underestimated"

When there is WU-to-WU variation in runtime, fixing credit based on anticipated runtime is always going to be hit and miss on a task-by-task basis.
Joined: 5 Jan 09 · Posts: 670 · Credit: 2,498,095,550 · RAC: 0
It's not only low-end cards: https://www.gpugrid.net/workunit.php?wuid=11595159

Continual errors, and they sometimes run for a long time before erroring.
Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 1
"It's not only low end cards"

https://www.gpugrid.net/workunit.php?wuid=11594288
Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
https://www.gpugrid.net/workunit.php?wuid=11595159
Card overheating; erroring repeatedly but restarting before eventually failing.

https://www.gpugrid.net/workunit.php?wuid=11594288
287647: NVIDIA GeForce GT 520 (1023MB) driver: 352.63
201720: NVIDIA Tesla K20m (4095MB) driver: 340.29 (CUDA 6.0 - might be an issue)
125384: Error while downloading; also NVIDIA GeForce GT 640 (1024MB) driver: 361.91
321762: looks like a GTX 980 Ti, but it actually tried to run on the 2nd card, a GTX 560 Ti which only had 1GB GDDR5
54461: another 560 Ti with 1GB GDDR5
329196: NVIDIA GeForce GTX 550 Ti (1023MB) driver: 361.42

Looks like 4 out of 6 fails were due to the cards only having 1GB GDDR5; one failed to download but only had 1GB anyway; the other might be due to using an older driver with these WUs.
Joined: 7 Jun 09 · Posts: 24 · Credit: 1,149,643,416 · RAC: 0
Would it make sense for a moderator or staff member to contact these repeat offenders with a PM, asking them if they would consider detaching or trying a less strenuous project? I think they've done this on CPDN, when users find hosts that burn work units over and over.
Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
In theory the server can send messages to hosts following repeated failures, if these are logged:

<msg_to_host/>
If present, check the msg_to_host table on each RPC, and send the client any messages queued for it.

I'm not sure if this appears in Notices (pop-up, if enabled) or just the event log. I've tried contacting some people by PM in the past, but if they don't use the forum they won't see it, unless they have it going through to their emails too (and read those). Basically, not a great response rate, and it would need to be automated IMO.
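As a sketch, the option quoted above is a single flag in the project's server config.xml. Enabling it only makes the scheduler check for queued messages; a project would still need its own script to insert rows into the msg_to_host database table (such a script is an assumption here, not something BOINC provides out of the box):

```xml
<boinc>
  <config>
    <!-- check the msg_to_host table on each scheduler RPC
         and forward any queued messages to that client -->
    <msg_to_host/>
  </config>
</boinc>
```

A project-side cron job could then, for example, queue a warning message for any host whose recent error rate crosses a threshold, which is roughly the automation suggested in this post.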
Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0
"[...] The problem with equilibrations is that we cannot split them into multiple steps like the normal simulations so we just have to push through this batch and then we are done. I expect it to be over by the end of the week. No more simulations are being sent out a few days now so only the ones that cancel or fail will be resent automatically."

Will there be any future OPM batches available, or is this the end of OPM? I've enjoyed crunching the OPM996 (non-fixed credit) WUs. The unpredictable-runtime simulations (variable Natom per batch) are an exciting type of WU; the variable Natom for each task creates an allure of mystery. If viable, OPM would be a choice WU for summertime crunching, given its lower power requirement compared to some other WUs. (An umbrella-type WU would also help contend with the summer heat.)

100,440 Natoms, 16h 36m 46s (59,806 s), 903,900 credits (GTX 980 Ti)

Is 903,900 the most credit ever given to an ACEMD WU?
Joined: 5 Jan 09 · Posts: 670 · Credit: 2,498,095,550 · RAC: 0
Or just deny them new tasks until they log in, as per your previous idea. This would solve the problem for the most part. I just can't understand the project not adopting this approach, as they would like WUs returned ASAP and users don't want to be sat idle while these hosts hold us up.
©2025 Universitat Pompeu Fabra