Message boards : News : ACEMD 4
Retvari Zoltan | Joined: 20 Jan 09 | Posts: 2380 | Credit: 16,897,957,044 | RAC: 0
> I don't think much chance at all. We've blown through all those 1000 tasks I think.

Not yet. The task which preempted the ACEMD3 task is:
P0_NNPMM_frag_85-RAIMIS_NNPMM-5-10-RND6112_0
The blue number (the "10") is the total number of tasks in the given sequence.
The red number (the "5") is the index of the task within the sequence (starting from 0, so the last one will be 9-10).
The green number (the "0" after "RND6112_") is the number of resends.
So those 1000 tasks are actually 100 task sequences, each sequence broken into 10 pieces.
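The naming scheme described above can be sketched as a small parser. This is a minimal sketch based only on the post's explanation; the regex, field names, and `parse_task_name` helper are my own assumptions, not project code:

```python
import re

# Assumed layout, per the explanation above:
#   <model name>-<piece>-<total>-RND<seed>_<resend>
TASK_NAME = re.compile(r"-(\d+)-(\d+)-RND(\d+)_(\d+)$")

def parse_task_name(name: str) -> dict:
    """Split a GPUGRID-style task name into its sequence fields."""
    m = TASK_NAME.search(name)
    if m is None:
        raise ValueError(f"unrecognised task name: {name!r}")
    piece, total, seed, resend = (int(g) for g in m.groups())
    return {"piece": piece, "total": total, "seed": seed, "resend": resend}

# The task named in the post: piece 5 of 10, first send (resend 0)
print(parse_task_name("P0_NNPMM_frag_85-RAIMIS_NNPMM-5-10-RND6112_0"))
```

Anchoring the pattern at the end of the string keeps earlier underscores and digits (e.g. `frag_85`) from being misread as the resend counter.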
Joined: 13 Dec 17 | Posts: 1419 | Credit: 9,119,446,190 | RAC: 891
Thanks for the task enumeration explanation. Since it had been ages since I got any constant work from GPUGrid, I thought the REC balance would take forever to balance out against the REC of my other projects. So when I didn't ask for any replacement work, and when I manually updated and got none to send, I thought we had blown through all the work already. I see I will have to put my update script back into action to keep the hosts occupied. Only halfway done, judging by your task. Thanks again for the knowledge.
Joined: 21 Feb 20 | Posts: 1116 | Credit: 40,839,470,595 | RAC: 6,423
It seems like they are only letting about 100 out into the wild at any given time, rather than releasing them all at once.
Joined: 11 Jul 09 | Posts: 1639 | Credit: 10,159,968,649 | RAC: 428
One of my machines currently has:

P0_NNPMM_frag_76-RAIMIS_NNPMM-3-10-RND5267_0 — sent 27 Apr 2022 | 1:25:23 UTC
P0_NNPMM_frag_85-RAIMIS_NNPMM-7-10-RND6112_0 — sent 27 Apr 2022 | 3:14:45 UTC

Not all models seem to be progressing at the same rate (note neither is a resend from another user: both are first runs after creation).
Retvari Zoltan | Joined: 20 Jan 09 | Posts: 2380 | Credit: 16,897,957,044 | RAC: 0
> The ACEMD3 could restart from the checkpoint, so it will finish eventually.

It has failed, actually. :( It was suspended a couple of times, so I set "no new tasks" to make it finish within 24 hours, but that didn't help.
Joined: 11 Jul 09 | Posts: 1639 | Credit: 10,159,968,649 | RAC: 428
Looks like we're coming to the end of this batch - time to take stock.

For this post, I'm looking at host 132158, running on a GTX 1660 Super. I have one remaining task, about to start, with an initial estimate of 01:31:38 (5,498 seconds) - but that's with DCF hard up against the limit of 100. The raw estimate will be 55 seconds, and the typical actual runtime has been just over 4 hours (14,660 seconds). We need to get those estimates sorted out - but gently, gently. A sudden large change will make matters even worse.

My raw metrics are:

    <flops>181882876215.769470</flops>
    <rsc_fpops_est>10000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>10000000000000000.000000</rsc_fpops_bound>

- so speed 181,882,876,216, or 181.88 Gflops. This website shows an APR of 198.76, but it stopped updating after 9 completed tasks with v1.03 (I've got 13 showing valid at the moment). Size divided by speed gives 54.98 seconds, confirming what BOINC is estimating.

I reckon the size estimate (fpops_est) should be increased by a factor of about 250, to get closer to the target of DCF=1. BUT DON'T DO IT ALL IN ONE MASSIVE JUMP. We could probably manage a batch with the estimate 5x the current value, to get DCF away from the upper limit (DCF corrects very, very slowly when it gets this far out of equilibrium, and limits work fetch requests to a nominal 1 second). Then a batch with a further increase of 10x, and a third batch with another 10x, should do it. But monitor the effects of the changes carefully.
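The arithmetic in that post can be checked directly. This is a sketch using only the numbers quoted above; it assumes BOINC's raw runtime estimate is simply `rsc_fpops_est` divided by the host's benchmarked speed:

```python
# Numbers quoted above for host 132158 (GTX 1660 Super)
flops = 181_882_876_215.77   # <flops>: host speed, fp ops per second
fpops_est = 1e13             # <rsc_fpops_est>: claimed task size, fp ops
actual_runtime = 14_660.0    # typical observed runtime, seconds

# BOINC's uncorrected estimate for the task
raw_estimate = fpops_est / flops

# How much fpops_est would need to grow for the raw estimate
# to match reality (i.e. for DCF to settle near 1)
needed_factor = actual_runtime / raw_estimate

print(f"raw estimate:  {raw_estimate:.2f} s")
print(f"needed factor: {needed_factor:.0f}x")
```

This reproduces the ~55-second raw estimate and gives a correction factor of roughly 267, consistent with the "factor of about 250" suggested in the post. The staged 5x, then 10x, then 10x plan totals 500x, a deliberate overshoot; DCF converges much faster when estimates are too long than when they are too short, so it can settle the remainder.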
Retvari Zoltan | Joined: 20 Jan 09 | Posts: 2380 | Credit: 16,897,957,044 | RAC: 0
The ACEMD4 app puts less stress on the GPU than the ACEMD3 app.

ACEMD3 on RTX 3080Ti: 1845 MHz, 342 W
ACEMD4 on RTX 3080Ti: 1920 MHz, 306 W

I made similar observations on an RTX 2080Ti, though I haven't recorded the exact numbers yet.
Joined: 21 Feb 20 | Posts: 1116 | Credit: 40,839,470,595 | RAC: 6,423
I also notice that these tasks don't fully utilize the GPU (mentioned in an earlier post with usage stats), but I think it's a CPU bottleneck with these tasks: a faster CPU allows the GPU to work harder.

My 3080Tis run at 2010 MHz @ ~265 W and 70-80% GPU utilization. That's with an AMD EPYC 7402P @ 3.35 GHz. On another system I have another 3080Ti, but this one is paired with a faster 7443P running at ~3.5-3.6 GHz. Here the 3080Ti runs at the same 2010 MHz, but shows 80-85% GPU utilization and about 275-280 W power draw. If the i3-9300 on your 3080Ti system is running over 4.0 (maybe 4.2?) GHz, then your higher power draw (than mine) makes sense and supports the idea that it's a CPU bottleneck.

The "GAFF2" version of these tasks (which were also ACEMD4) that were sent out during the testing phase behaved much more like ACEMD3; see my post here: https://gpugrid.net/forum_thread.php?id=5305&nowrap=true#58683. So it seems to be how the WU is set up/configured rather than the application itself.
Joined: 21 Feb 20 | Posts: 1116 | Credit: 40,839,470,595 | RAC: 6,423
Running another GAFF2 task now. GPU utilization is higher (~96%), but power use is even lower, at about 230 W for a 3080Ti (power limit at 300 W). I suspect these GAFF2 tasks have some code/function in them that's causing serialized computation. We saw this exact same behavior with the Einstein tasks (high reported utilization, low power draw) before one of our teammates was able to find and deploy a code fix for the app, after which it maxed out the GPU.

I've now got some "QM" tasks in the queue as well. I will report how they behave, if different.
Joined: 21 Feb 20 | Posts: 1116 | Credit: 40,839,470,595 | RAC: 6,423
QM tasks run in the same manner as the GAFF2 tasks: high (~96%) GPU utilization, low GPU power draw, ~1 GB VRAM used, ~2% VRAM bus use, high PCIe use.
Joined: 9 May 13 | Posts: 171 | Credit: 4,594,296,466 | RAC: 171
ACEMD 4 tasks now seem to be processing OK on my antique GTX 970. I had previously reported that they were all failing. Two GAFF2 tasks have completed and validated; one QM task has been underway for 25 minutes and is processing normally.
Joined: 21 Feb 20 | Posts: 1116 | Credit: 40,839,470,595 | RAC: 6,423
Out of the ~1200 or so that I assume went out, I processed over 400 of them myself, all completed successfully with no errors (excluding ones cancelled or aborted). I "saved" many _7s too.
Joined: 1 Jan 15 | Posts: 1166 | Credit: 12,260,898,501 | RAC: 1
No ACEMD 4 tasks for a very long time :-( Is this subproject dead? Please let us know.
Joined: 1 Jan 15 | Posts: 1166 | Credit: 12,260,898,501 | RAC: 1
No ACEMD 4 tasks for a very long time :-( And no reply from the project team for a month and a half now - do we volunteers not deserve any kind of information?
Joined: 18 Sep 16 | Posts: 10 | Credit: 1,291,979 | RAC: 0
Here you can see the progress of the acemd software: https://software.acellera.com/acemd/releases.html Maybe that will also make its way into acemd tasks for GPUGRID?
©2025 Universitat Pompeu Fabra