ACEMD 4

Author	Message
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
Message 58748 - Posted: 27 Apr 2022, 0:00:46 UTC - in response to Message 58747. Last modified: 27 Apr 2022, 0:07:50 UTC
"I don't think much chance at all. We've blown through all those 1000 tasks I think."
Not yet. The task which preempted the ACEMD3 task is: P0_NNPMM_frag_85-RAIMIS_NNPMM-5-10-RND6112_0
In the "5-10" part of the name, the second number (10, blue in the original post) is the total number of tasks in the given sequence.
The first number (5, red in the original post) is the number of the task in the given sequence (starting from 0, so the last one will be 9-10).
The number after the final underscore (0, green in the original post) is the number of resends.
So those 1000 tasks are actually 100 task sequences, each sequence broken into 10 pieces.
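The naming scheme described in Message 58748 can be sketched as a small parser. This is only an illustration based on the task names quoted in this thread: the regular expression, the `TaskName` type and the function names are hypothetical, and the meaning of the `RND` digits is not explained in the post, so they are kept as an opaque string.

```python
import re
from typing import NamedTuple


class TaskName(NamedTuple):
    sequence: str  # e.g. "P0_NNPMM_frag_85-RAIMIS_NNPMM"
    step: int      # position of the task in its sequence, counted from 0
    total: int     # total number of tasks in the sequence
    rnd: str       # digits after "RND"; meaning not explained in the post
    resends: int   # trailing number: how many times the task was resent


# Hypothetical pattern, inferred only from the names quoted in the thread.
TASK_NAME_RE = re.compile(
    r"^(?P<sequence>.+)-(?P<step>\d+)-(?P<total>\d+)"
    r"-RND(?P<rnd>\d+)_(?P<resends>\d+)$"
)


def parse_task_name(name: str) -> TaskName:
    """Split a task name into the fields described in the post."""
    match = TASK_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised task name: {name!r}")
    return TaskName(
        sequence=match["sequence"],
        step=int(match["step"]),
        total=int(match["total"]),
        rnd=match["rnd"],
        resends=int(match["resends"]),
    )


def is_last_step(task: TaskName) -> bool:
    """Steps start at 0, so the last task of a 10-step sequence is step 9."""
    return task.step == task.total - 1


if __name__ == "__main__":
    task = parse_task_name("P0_NNPMM_frag_85-RAIMIS_NNPMM-5-10-RND6112_0")
    print(task)
    print("last step:", is_last_step(task))
```

For the task quoted above this gives step 5 of 10 with 0 resends, which matches the "only halfway done" reading in the next post.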

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 34,713
Message 58749 - Posted: 27 Apr 2022, 1:00:16 UTC - in response to Message 58748.
Thanks for the task enumeration explanation. I thought since it had been ages since I got any constant work from GPUGrid that the REC balance would take forever to balance out my REC of my other projects.
So when I didn't ask for any replacement work and when I manually updated and got none to send, I thought we had blown through all the work already.
I see I will have to put my update script back into action to keep the hosts occupied. Only halfway done by your task I see. Thanks again for the knowledge.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 8,067
Message 58750 - Posted: 27 Apr 2022, 1:09:46 UTC - in response to Message 58749.
it seems like they are only letting out about 100 into the wild at any given time rather than releasing them all at once.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0
Message 58751 - Posted: 27 Apr 2022, 6:18:52 UTC Last modified: 27 Apr 2022, 6:20:50 UTC
One of my machines currently has:
P0_NNPMM_frag_76-RAIMIS_NNPMM-3-10-RND5267_0 sent 27 Apr 2022 | 1:25:23 UTC
P0_NNPMM_frag_85-RAIMIS_NNPMM-7-10-RND6112_0 sent 27 Apr 2022 | 3:14:45 UTC
Not all models seem to be progressing at the same rate (note neither is a resend from another user - both are first run after creation).

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
Message 58756 - Posted: 27 Apr 2022, 23:29:15 UTC - in response to Message 58746. Last modified: 27 Apr 2022, 23:31:04 UTC
"The ACEMD3 could restart from the checkpoint, so it will finish eventually."
It has failed actually. :( It was suspended a couple of times, so I set "no new task" to make it finish within 24 hours, but it didn't help.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0
Message 58757 - Posted: 28 Apr 2022, 11:54:58 UTC
Looks like we're coming to the end of this batch - time to take stock.
For this post, I'm looking at host 132158, running on a GTX 1660 Super. I have one remaining task, about to start, with an initial estimate of 01:31:38 (5,498 seconds) - but that's with DCF hard up against the limit of 100. The raw estimate will be 55 seconds, and the typical actual runtime has been just over 4 hours (14,660 seconds).
We need to get those estimates sorted out - but gently, gently. A sudden large change will make matters even worse.
My raw metrics are:
<flops>181882876215.769470</flops>
<rsc_fpops_est>10000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>10000000000000000.000000</rsc_fpops_bound>
- so speed 181,882,876,216 or 181.88 Gflops. This website has an APR of 198.76, but it stopped updating after 9 completed tasks with v1.03 (I've got 13 showing valid at the moment).
Size/speed gives 54.98 seconds, confirming what BOINC is estimating.
I reckon the size estimate (fpops_est) should be increased by a factor of about 250, to get closer to the target of DCF=1.
BUT DON'T DO IT ALL IN ONE MASSIVE JUMP.
We could probably manage a batch with the estimate 5x the current value, to get DCF away from the upper limit (DCF corrects very, very slowly when it gets this far out of equilibrium, and limits work fetch requests to a nominal 1 second). Then, a batch with a further increase of 10x, and a third batch with another 10x, should do it. But monitor the effects of the changes carefully.
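The arithmetic in Message 58757 can be reproduced with a few lines of Python. This is a rough sketch that assumes the simplified model the post itself uses (raw estimate = size / speed, displayed estimate = raw estimate x DCF); it is not BOINC's actual estimation code, and the constant and function names are made up for illustration. The numbers are the ones quoted in the post.

```python
# Values quoted in the post (host 132158, GTX 1660 Super).
FLOPS = 181_882_876_215.769470        # <flops>
RSC_FPOPS_EST = 10_000_000_000_000.0  # <rsc_fpops_est>
DCF = 100.0                           # duration correction factor, at its limit
ACTUAL_RUNTIME_S = 14_660.0           # typical observed runtime in seconds


def raw_estimate(fpops_est: float, flops: float) -> float:
    """Size divided by speed, in seconds (simplified model from the post)."""
    return fpops_est / flops


def displayed_estimate(fpops_est: float, flops: float, dcf: float) -> float:
    """Raw estimate scaled by DCF, as shown for the waiting task."""
    return raw_estimate(fpops_est, flops) * dcf


def size_correction(actual_runtime_s: float, fpops_est: float, flops: float) -> float:
    """Factor by which fpops_est would need to grow so that DCF could sit near 1."""
    return actual_runtime_s / raw_estimate(fpops_est, flops)


if __name__ == "__main__":
    raw = raw_estimate(RSC_FPOPS_EST, FLOPS)
    shown = displayed_estimate(RSC_FPOPS_EST, FLOPS, DCF)
    factor = size_correction(ACTUAL_RUNTIME_S, RSC_FPOPS_EST, FLOPS)
    print(f"raw estimate:       {raw:.2f} s")     # about 55 seconds
    print(f"displayed estimate: {shown:.0f} s")   # about 5,498 seconds
    print(f"size correction:    {factor:.0f}x")   # same order as "about 250"
```

With these inputs the correction comes out a little above the "about 250" quoted in the post, which is in the same ballpark given that the actual runtime is itself only a typical value.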

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
Message 58758 - Posted: 28 Apr 2022, 14:20:08 UTC Last modified: 28 Apr 2022, 14:20:39 UTC
The ACEMD4 app puts less stress on the GPU than the ACEMD3 app.
ACEMD3 on RTX 3080Ti: 1845MHz 342W
ACEMD4 on RTX 3080Ti: 1920MHz 306W
I made similar observations on RTX 2080Ti, though I didn't record the exact numbers yet.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 8,067
Message 58759 - Posted: 28 Apr 2022, 15:38:46 UTC - in response to Message 58758. Last modified: 28 Apr 2022, 16:17:10 UTC
I have also noticed that these tasks don't fully utilize the GPU (mentioned in an earlier post with usage stats). But I think it’s a CPU bottleneck with these tasks. A faster CPU allows the GPU to work harder.
My 3080Tis run 2010MHz @ ~265W and 70-80% GPU utilization. That’s with an AMD EPYC 7402P @ 3.35GHz.
On another system I have another 3080Ti, but this one is paired with a faster 7443P running at ~3.5-3.6GHz. Here the 3080Ti runs at the same 2010MHz, but has 80-85% GPU utilization and about 275-280W power draw.
If the i3-9300 on your 3080Ti system is running over 4.0 (maybe 4.2?) GHz then your higher power draw (than mine) makes sense and supports that it’s a CPU bottleneck.
The "GAFF2" version of these tasks (which were ACEMD4) that were sent out during the testing phase ran with behavior much more similar to the ACEMD3 behavior. See my post here: https://gpugrid.net/forum_thread.php?id=5305&nowrap=true#58683
So it seems to be how the WU is set up/configured rather than the application itself.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 8,067
Message 58760 - Posted: 28 Apr 2022, 17:45:12 UTC - in response to Message 58759.
running another GAFF2 task now. GPU utilization is higher (~96%) but power use is even lower at about 230W for a 3080Ti (power limit at 300W).
I suspect these GAFF2 tasks have some code/function in them that's causing serialized computations. We saw this exact same behavior with the Einstein tasks (high reported utilization, low power draw) before one of our teammates was able to find and deploy a code fix for the app, after which it went to maxing out the GPU.
I've now got some "QM" tasks in the queue as well. Will report how they behave, if different.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 8,067
Message 58761 - Posted: 28 Apr 2022, 18:46:37 UTC - in response to Message 58760.
QM tasks run in the same manner as the GAFF2 tasks. high ~96% GPU utilization, low GPU power draw, ~1GB VRAM used, ~2% VRAM bus use, high PCIe use.

captainjack Joined: 9 May 13 Posts: 171 Credit: 4,610,796,466 RAC: 21,543
Message 58763 - Posted: 28 Apr 2022, 23:11:25 UTC
ACEMD 4 tasks now seem to be processing ok on my antique GTX 970. I had previously reported that they were all failing. Two GAFF2 tasks completed and validated. 1 QM task underway for 25 minutes and processing normally.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 8,067
Message 58764 - Posted: 29 Apr 2022, 18:54:10 UTC
out of the ~1200 or so that I assume went out, I processed over 400 of them myself. all completed successfully with no errors (excluding ones cancelled or aborted). I "saved" many _7s too.

Erich56 Joined: 1 Jan 15 Posts: 1168 Credit: 12,317,898,501 RAC: 75,187
Message 59919 - Posted: 14 Feb 2023, 19:05:04 UTC
no ACEMD 4 tasks for very long time :-( Is this subproject dead? Please let us know

Erich56 Joined: 1 Jan 15 Posts: 1168 Credit: 12,317,898,501 RAC: 75,187
Message 60232 - Posted: 30 Mar 2023, 5:08:00 UTC - in response to Message 59919.
"no ACEMD 4 tasks for very long time :-( Is this subproject dead? Please let us know"
no reply from the project team for 1 1/2 months now - do we volunteers not deserve any kind of information?

oemuser Joined: 18 Sep 16 Posts: 10 Credit: 1,291,979 RAC: 0
Message 60236 - Posted: 30 Mar 2023, 14:43:25 UTC - in response to Message 60232.
Here you can see the progress of the acemd software: https://software.acellera.com/acemd/releases.html
Maybe that also goes into acemd tasks for gpugrid?