WU: OPM simulations

Bedrich Hajek

Joined: 28 Mar 09
Posts: 490
Credit: 11,731,645,728
RAC: 52,725
Message 43370 - Posted: 11 May 2016, 22:34:34 UTC - in response to Message 43362.  

For future reference, I have a couple of suggestions:

For WUs running 18+ hours, there should be a separate category: "super long runs". I believe this was mentioned in past posts.

This was previously suggested; however, this issue was just an underestimate of the runtimes by the researchers, and they would normally increase the credit ratio awarded for extra-long tasks. The real issue would be another queue on the server and setting up and maintaining it; the short queue is often empty, never mind an extra-long queue.



Fine, they underestimated the runtimes, but I stand by my statement about a super-long-runs category.



The future WU application version should be made less CPU dependent.
Generally that's the case but ultimately this is a different type of research and it simply requires that some work be performed on the CPU (you can't do it on the GPU).

The WUs are getting longer, GPUs are getting faster, but CPU speed is stagnant. Something has got to give, and with the Pascal cards coming out soon, you will have to put out a new version anyway.
Over the years WU's have remained about the same length overall. Occasionally there are extra-long tasks but such batches are rare.
GPU's are getting faster and more adept, the number of shaders (CUDA cores here) is increasing, and CUDA development continues. The problem isn't just CPU frequency (though an AMD Athlon 64 X2 5000+ is going to struggle a bit): WDDM is an 11%+ bottleneck (increasing with GPU performance), and when the GPUGrid app needs the CPU to perform a calculation it comes down to the PCIe bridge and, perhaps to some extent, the app not being multi-threaded on the CPU.
Note that it's the same ACEMD app (not updated since Jan), just a different batch of tasks (batches vary depending on the research type).

My GPU usage on these latest SDOERR WUs is between 70% and 80%, compared to 85% to 95% for the GERARD BestUmbrella units. I don't think this is reinventing the wheel, just updating it. This does have to be done sooner or later. The Volta cards are coming out in only a few short years.
I'm seeing ~75% GPU usage on my 970's too. On Linux and XP I would expect it to be around 90%. I've increased my memory clock to 3505MHz to reduce the MCU load (~41% @1345MHz [power 85%], ~37% @1253MHz). I've watched the GPU load and it varies continuously with these WU's, as does the power.


As for making the WUs less CPU dependent, there are some things that are easy to move to GPU calculation, and some that are difficult, maybe not impossible, but definitely impractical.

Okay, I get this point, and I am not asking for the moon, but something more must be done to avoid or minimize this bottleneck, from a programming standpoint. That's all I'm saying.


ID: 43370
Stefan
Project administrator
Project developer
Project tester
Project scientist

Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 43372 - Posted: 12 May 2016, 8:13:00 UTC

@Retvari @MrJo I mentioned the credit algorithm in this post https://www.gpugrid.net/forum_thread.php?id=4299&nowrap=true#43328

I understand the runtime was underestimated, but given the knowledge we had (projected runtime) it was our best guess. I only sent WUs that were projected to run under 24 hours on a 780. If that ends up being more than 5 days on some GPUs, I am really sorry; we didn't consider that possibility.

The problem with equilibrations is that we cannot split them into multiple steps like the normal simulations, so we just have to push through this batch and then we are done. I expect it to be over by the end of the week. No more simulations have been sent out for a few days now, so only the ones that cancel or fail will be resent automatically.
ID: 43372
Betting Slip

Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 43373 - Posted: 12 May 2016, 8:49:36 UTC - in response to Message 43372.  
Last modified: 12 May 2016, 8:58:11 UTC

Don't worry about that, Stefan; it's a curve ball that you've thrown us, that's all.

"I expect it to be over by the end of the week"
I like your optimism, given the users who hold on to a WU for 5 days and never return it, and those that continually error even after long run times.
Surely there must be a way to suspend their ability to get WU's completely until they log in.
They seriously impact our ability to complete a batch in good time.
ID: 43373
Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 43374 - Posted: 12 May 2016, 9:07:31 UTC - in response to Message 43373.  
Last modified: 12 May 2016, 9:35:47 UTC

I like your optimism, given the users who hold on to a WU for 5 days and never return it, and those that continually error even after long run times.
Surely there must be a way to suspend their ability to get WU's completely until they log in.
They seriously impact our ability to complete a batch in good time.


5 days does sound optimistic. Maybe you can ensure resends go to top cards with a low failure rate?

It would be useful if the likes of this batch were better controlled and only sent to known good cards - see the RAC's of the 'participants' this WU was sent to:
https://www.gpugrid.net/workunit.php?wuid=11595345

I suspect that none of these cards could ever return extra-long tasks within 5 days (not enough GFLOPS, as the rough numbers after the list show), even if properly set up:
NVIDIA GeForce GT 520M (1024MB) driver: 358.87 [too slow and not enough RAM]
NVIDIA GeForce GT 640 (2048MB) driver: 340.52 [too slow, only cuda6.0 driver]
NVIDIA Quadro K4000 (3072MB) driver: 340.62 [too slow, only cuda6.0 driver]
NVIDIA GeForce GTX 560 Ti (2048MB) driver: 365.10 [best chance to finish but runs too hot, all tasks fail on this card when it hits ~85C, needs to cool it properly {prioritise temperature}]
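
As a very rough illustration (ballpark peak single-precision figures, and assuming runtime scales inversely with GFLOPS, which is optimistic), take Stefan's sizing of under 24 hours on a GTX 780 (~4.0 TFLOPS):

GT 640 (~0.7 TFLOPS): 24 h × 4.0 / 0.7 ≈ 137 h ≈ 5.7 days
GT 520M (~0.14 TFLOPS): 24 h × 4.0 / 0.14 ≈ 690 h ≈ 29 days

Both are already outside the 5-day deadline before WDDM, PCIe and CPU overheads are even counted.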

Like the idea that WU's don't get sent to systems with high error rates until they log in (and are directed to a recommended settings page).
ID: 43374
Betting Slip

Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 43375 - Posted: 12 May 2016, 9:23:38 UTC - in response to Message 43374.  
Last modified: 12 May 2016, 9:24:26 UTC

Hi SK,
At least the only thing that was wasted there was bandwidth, because they errored immediately. The biggest thing (especially, but not only, in a WU drought) is the person who never returns: the WU is in limbo for 5 days before it gets resent, and then it's possible it goes to another machine that holds it for another 5 days while good machines sit waiting for work.

It's NOT good for users
It's NOT good for scientists
It's NOT good for this project

About time we got the broom out.
ID: 43375
Stefan
Project administrator
Project developer
Project tester
Project scientist

Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 43376 - Posted: 12 May 2016, 9:45:35 UTC

Actually there is a way to send WUs only to top users by increasing the job priority over 1000. I thought I had it at max priority but I just found out that it's only set to 'high' which is 600.

I guess most of these problems would have been avoided given a high enough priority.

In any case, we are at 1721 completed and 1582 running, so I think my estimate might roughly hold :P
ID: 43376
Betting Slip

Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 43377 - Posted: 12 May 2016, 10:20:06 UTC - in response to Message 43376.  
Last modified: 12 May 2016, 10:40:53 UTC

Thanks Stefan,

Finally, just an example of a different scenario which is frustrating:

https://www.gpugrid.net/workunit.php?wuid=11591312

The user exceeds 5 days, so the WU gets sent to my slowest machine, which completes it in less than 24 hrs.

However, the original user then sends back the result after 5 days, pipping my machine to the post and making my effort a waste of time.

This was of course because the original user downloaded 2 WUs at a time on a 750 Ti, which struggles to complete one WU in 24 hrs running 24/7.

On my 980ti I recently aborted a unit that was 40% complete for exactly the same reason.
ID: 43377
Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 1
Message 43378 - Posted: 12 May 2016, 10:40:38 UTC - in response to Message 43376.  

Actually there is a way to send WUs only to top users by increasing the job priority over 1000. I thought I had it at max priority but I just found out that it's only set to 'high' which is 600.

I guess most of these problems would have been avoided given a high enough priority.
Priority usually does not cause exclusion, only a lower probability.
Does priority over 1000 really exclude the less reliable hosts from task scheduling when there are no lower-priority tasks in the queue?
I think that is true only when there are lower-priority tasks also queued to give to the less reliable hosts.
Is it a documented feature?
ID: 43378
Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 43379 - Posted: 12 May 2016, 10:44:11 UTC - in response to Message 43375.  
Last modified: 12 May 2016, 10:46:06 UTC

Hi SK,
At least the only thing that was wasted there was bandwidth, because they errored immediately. The biggest thing (especially, but not only, in a WU drought) is the person who never returns: the WU is in limbo for 5 days before it gets resent, and then it's possible it goes to another machine that holds it for another 5 days while good machines sit waiting for work.

It's NOT good for users
It's NOT good for scientists
It's NOT good for this project

About time we got the broom out.


Exactly: send a WU to 6 clients in succession that don't run it or return it for 5 days, and after a month the WU still isn't complete.

IMO, if client systems report a cache of over 1 day they shouldn't get work from this project.

I highlighted a best case scenario (where several systems were involved); the WU runs soon after receipt, fails immediately, gets reported quickly and can be sent out again. Even in this situation it went to 4 clients that failed to complete the WU. To send and receive replies from 4 systems took 7h.

My second point was that the WU never really had a chance on any of those systems. They weren't capable of running these tasks (not enough RAM, too slow, oldish driver [slower or more buggy], too hot, or just not powerful enough to return an extra long WU inside 5days).
ID: 43379
Betting Slip

Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 43380 - Posted: 12 May 2016, 10:59:37 UTC - in response to Message 43379.  
Last modified: 12 May 2016, 11:04:21 UTC



I highlighted a best case scenario (where several systems were involved); the WU runs soon after receipt, fails immediately, gets reported quickly and can be sent out again. Even in this situation it went to 4 clients that failed to complete the WU. To send and receive replies from 4 systems took 7h.

My second point was that the WU never really had a chance on any of those systems. They weren't capable of running these tasks (not enough RAM, too slow, oldish driver [slower or more buggy], too hot, or just not powerful enough to return an extra long WU inside 5days).


I take your point. In the 7 hours that WU was being bounced from bad client to bad client, a fast card would have completed it. Totally agree.

In fact, Retvari could have completed it and taken an hour out for lunch. Haha
ID: 43380
MrJo

Joined: 18 Apr 14
Posts: 43
Credit: 1,192,135,172
RAC: 0
Message 43381 - Posted: 12 May 2016, 11:01:54 UTC - in response to Message 43372.  
Last modified: 12 May 2016, 11:05:56 UTC

I understand the runtime was underestimated

The legal principle "In dubio pro reo" (when in doubt, for the accused) should also apply here: when in doubt, more points.

I only sent WUs that were projected to run under 24 hours on a 780

Not everyone has a 780. I run several 970's, 770's, 760's, 680's and 950's (I don't use my 750 any longer). A WU should be such that it can be completed by a mid-range card within 24 hours. A 680 or a 770 is still a good card. Alternatively, one could run a fundraising campaign so poor crunchers get some 980Ti's ;-)
Regards, Josef

ID: 43381
Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 43382 - Posted: 12 May 2016, 11:02:28 UTC - in response to Message 43379.  
Last modified: 12 May 2016, 11:09:34 UTC

https://boinc.berkeley.edu/trac/wiki/ProjectOptions

Accelerating retries
The goal of this mechanism is to send timeout-generated retries to hosts that are likely to finish them fast. Here's how it works:
• Hosts are deemed "reliable" (a slight misnomer) if they satisfy turnaround time and error rate criteria.
• A job instance is deemed "need-reliable" if its priority is above a threshold.
• The scheduler tries to send need-reliable jobs to reliable hosts. When it does, it reduces the delay bound of the job.
• When job replicas are created in response to errors or timeouts, their priority is raised relative to the job's base priority.

The configurable parameters are:

<reliable_on_priority>X</reliable_on_priority>
Results with priority at least reliable_on_priority are treated as "need-reliable". They'll be sent preferentially to reliable hosts.

<reliable_max_avg_turnaround>secs</reliable_max_avg_turnaround>
Hosts whose average turnaround is at most reliable_max_avg_turnaround and that have at least 10 consecutive valid results are considered 'reliable'. Make sure you set this low enough that a significant fraction (e.g. 25%) of your hosts qualify.

<reliable_reduced_delay_bound>X</reliable_reduced_delay_bound>
When a need-reliable result is sent to a reliable host, multiply the delay bound by reliable_reduced_delay_bound (typically 0.5 or so).

<reliable_priority_on_over>X</reliable_priority_on_over>
<reliable_priority_on_over_except_error>X</reliable_priority_on_over_except_error>
If reliable_priority_on_over is nonzero, increase the priority of duplicate jobs by that amount over the job's base priority. Otherwise, if reliable_priority_on_over_except_error is nonzero, increase the priority of duplicates caused by timeout (not error) by that amount. (Typically only one of these is nonzero, and is equal to reliable_on_priority.)

NOTE: this mechanism can be used to preferentially send ANY job, not just retries, to fast/reliable hosts. To do so, set the workunit's priority to reliable_on_priority or greater.
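
For illustration only (my own sketch with example values, not GPUGrid's actual settings): with Stefan's batch at priority 600, a project config.xml fragment using these options might look like the following, so that the whole batch counts as "need-reliable" and resends get a shorter deadline.

<config>
  <reliable_on_priority>600</reliable_on_priority>                 <!-- jobs at priority >= 600 are "need-reliable" -->
  <reliable_max_avg_turnaround>86400</reliable_max_avg_turnaround> <!-- hosts averaging under 1 day's turnaround qualify -->
  <reliable_reduced_delay_bound>0.5</reliable_reduced_delay_bound> <!-- resends to reliable hosts get half the deadline -->
  <reliable_priority_on_over>600</reliable_priority_on_over>       <!-- timeout retries jump 600 above the base priority -->
</config>

Jobs at priority 600 or above would then go preferentially to hosts with a fast average turnaround and a run of valid results, and timed-out resends would be given a tighter deadline on top of that.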
 

ID: 43382
Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 43383 - Posted: 12 May 2016, 11:13:50 UTC - in response to Message 43381.  

I understand the runtime was underestimated

The legal principle "In dubio pro reo" (when in doubt, for the accused) should also apply here: when in doubt, more points.

I only sent WUs that were projected to run under 24 hours on a 780

Not everyone has a 780. I run several 970's, 770's, 760's, 680's and 950's (I don't use my 750 any longer). A WU should be such that it can be completed by a mid-range card within 24 hours. A 680 or a 770 is still a good card. Alternatively, one could run a fundraising campaign so poor crunchers get some 980Ti's ;-)


When there is variation in WU runtime, fixing credit based on anticipated runtime is always going to be hit and miss on a task-by-task basis.
ID: 43383
Betting Slip

Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 43384 - Posted: 12 May 2016, 11:42:47 UTC

It's not only low end cards

https://www.gpugrid.net/workunit.php?wuid=11595159

Continual errors, and sometimes they run a long time before erroring.
ID: 43384
Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 1
Message 43385 - Posted: 12 May 2016, 13:08:08 UTC - in response to Message 43384.  

It's not only low end cards

https://www.gpugrid.net/workunit.php?wuid=11595159

Continual errors, and sometimes they run a long time before erroring.

https://www.gpugrid.net/workunit.php?wuid=11594288
ID: 43385
Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 43386 - Posted: 12 May 2016, 15:12:36 UTC - in response to Message 43385.  
Last modified: 12 May 2016, 16:59:07 UTC

https://www.gpugrid.net/workunit.php?wuid=11595159
Card overheating, erroring repeatedly but restarting before eventually failing.

https://www.gpugrid.net/workunit.php?wuid=11594288

287647 NVIDIA GeForce GT 520 (1023MB) driver: 352.63
201720 NVIDIA Tesla K20m (4095MB) driver: 340.29 (cuda6.0 - might be an issue)
125384 Error while downloading, also NVIDIA GeForce GT 640 (1024MB) driver: 361.91
321762 looks like a GTX980Ti but actually tried to run on the 2nd card, a GTX560Ti which only had 1GB GDDR5
54461 another 560Ti with 1GB GDDR5
329196 NVIDIA GeForce GTX 550 Ti (1023MB) driver: 361.42

It looks like 4 out of 6 fails were due to the cards only having 1GB GDDR; one failed to download but only had 1GB GDDR anyway, and the other might be due to using an older driver with these WU's.
ID: 43386
cadbane

Joined: 7 Jun 09
Posts: 24
Credit: 1,149,643,416
RAC: 0
Message 43389 - Posted: 12 May 2016, 16:59:19 UTC

Would it make sense for a moderator or staff to contact these repeat offenders with a PM, asking them if they would consider detaching or trying a less strenuous project?

I think they've used this procedure on CPDN when users find hosts that burn through workunits over and over.

ID: 43389
Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 43390 - Posted: 12 May 2016, 17:13:13 UTC - in response to Message 43389.  

In theory the server can send messages to hosts following repeated failures, if these are logged:
<msg_to_host/>
If present, check the msg_to_host table on each RPC, and send the client any messages queued for it.

Not sure if this appears in Notices (pop-up, if enabled) or just the event log.

I've tried contacting some people by PM in the past, but if they don't use the forum they won't see it, unless they have it going through to their emails too (and read those). Basically, not a great response rate, and it would need to be automated IMO.
ID: 43390
eXaPower

Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 43391 - Posted: 12 May 2016, 22:44:16 UTC - in response to Message 43372.  

[...] The problem with equilibrations is that we cannot split them into multiple steps like the normal simulations, so we just have to push through this batch and then we are done. I expect it to be over by the end of the week. No more simulations have been sent out for a few days now, so only the ones that cancel or fail will be resent automatically.

Will there be any future OPM batches available, or is this the end of OPM? I've enjoyed crunching the OPM996 (non-fixed credit) WUs. The unpredictable-runtime simulations (a batch with varying Natoms) are an exciting type of WU; the variable Natom count for each task creates an allure of mystery. If viable, OPM would be a choice WU for summer-time crunching due to its lower power requirement compared to some other WU's. (Umbrella-type WUs would also help contend with the summer heat.)

100,440 Natoms, 16h 36m 46s (59.806s), 903,900 credits (GTX980Ti)

Is 903,900 the most credit ever given to an ACEMD WU?
ID: 43391
Betting Slip

Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 43392 - Posted: 12 May 2016, 22:48:22 UTC - in response to Message 43390.  

Or just deny them new tasks until they log in, as per your previous idea. This would solve the problem for the most part. I just can't understand the project not adopting this approach, as they would like WUs returned ASAP and users don't want to sit idle while these hosts hold us up.
ID: 43392