ACEMD 4

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 58748 - Posted: 27 Apr 2022, 0:00:46 UTC - in response to Message 58747.  
Last modified: 27 Apr 2022, 0:07:50 UTC

"I don't think much chance at all. We've blown through all those 1000 tasks I think."
Not yet.
The task which preempted the ACEMD3 task is:
P0_NNPMM_frag_85-RAIMIS_NNPMM-5-10-RND6112_0
In the "5-10" part of the name, the second number (10, blue in the original post) is the total number of tasks in the given sequence.
The first number (5, red in the original post) is the number of the task in the given sequence (starting from 0, so the last one will be 9-10).
The trailing number after the underscore (0, green in the original post) is the number of resends.
So those 1000 tasks are actually 100 task sequences, each broken into 10 pieces.
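The naming scheme described above can be sketched as a small parser. This is only an illustration based on the one task name quoted in this post: the regular expression, the `TaskName` type and its field names are assumptions, and the meaning of the `RND` number is not explained in the thread, so it is simply carried along.

```python
import re
from typing import NamedTuple


class TaskName(NamedTuple):
    """Pieces of a task name, as explained in the post (field names are hypothetical)."""
    prefix: str   # everything before the sequence numbers
    step: int     # position of this task in its sequence, starting from 0
    total: int    # total number of tasks in the sequence
    rnd: int      # number after "RND"; its meaning is not given in the thread
    resends: int  # trailing number after the underscore


# Assumed layout: <prefix>-<step>-<total>-RND<number>_<resends>
_TASK_RE = re.compile(
    r"^(?P<prefix>.+)-(?P<step>\d+)-(?P<total>\d+)-RND(?P<rnd>\d+)_(?P<resends>\d+)$"
)


def parse_task_name(name: str) -> TaskName:
    """Split a task name into its parts, or raise ValueError if it does not fit."""
    match = _TASK_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised task name: {name!r}")
    return TaskName(
        prefix=match["prefix"],
        step=int(match["step"]),
        total=int(match["total"]),
        rnd=int(match["rnd"]),
        resends=int(match["resends"]),
    )


task = parse_task_name("P0_NNPMM_frag_85-RAIMIS_NNPMM-5-10-RND6112_0")
print(f"piece {task.step} of {task.total} (numbered from 0), resends: {task.resends}")
```

For the task quoted above this reports piece 5 of 10 with 0 resends, which matches the explanation in the post.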
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Message 58749 - Posted: 27 Apr 2022, 1:00:16 UTC - in response to Message 58748.  

Thanks for the task enumeration explanation.

I thought that, since it had been ages since I got any steady work from GPUGrid, its REC would take forever to balance out against the REC of my other projects.

So when I didn't ask for any replacement work and when I manually updated and got none to send, I thought we had blown through all the work already.

I see I will have to put my update script back into action to keep the hosts occupied.

Judging by your task, the batch is only halfway done, I see.

Thanks again for the knowledge.
Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 58750 - Posted: 27 Apr 2022, 1:09:46 UTC - in response to Message 58749.  

it seems like they are only letting out about 100 into the wild at any given time rather than releasing them all at once.
Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Message 58751 - Posted: 27 Apr 2022, 6:18:52 UTC
Last modified: 27 Apr 2022, 6:20:50 UTC

One of my machines currently has:

P0_NNPMM_frag_76-RAIMIS_NNPMM-3-10-RND5267_0 sent 27 Apr 2022 | 1:25:23 UTC
P0_NNPMM_frag_85-RAIMIS_NNPMM-7-10-RND6112_0 sent 27 Apr 2022 | 3:14:45 UTC

Not all models seem to be progressing at the same rate (note neither is a resend from another user - both are first run after creation).
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 58756 - Posted: 27 Apr 2022, 23:29:15 UTC - in response to Message 58746.  
Last modified: 27 Apr 2022, 23:31:04 UTC

"The ACEMD3 could restart from the checkpoint, so it will finish eventually."
It has failed actually. :(
It was suspended a couple of times, so I set "No new tasks" to make it finish within 24 hours, but it didn't help.
Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Message 58757 - Posted: 28 Apr 2022, 11:54:58 UTC

Looks like we're coming to the end of this batch - time to take stock. For this post, I'm looking at host 132158, running on a GTX 1660 Super.

I have one remaining task, about to start, with an initial estimate of 01:31:38 (5,498 seconds) - but that's with DCF (duration correction factor) hard up against the limit of 100. The raw estimate will be 55 seconds, and the typical actual runtime has been just over 4 hours (14,660 seconds).

We need to get those estimates sorted out - but gently, gently. A sudden large change will make matters even worse.

My raw metrics are:

    <flops>181882876215.769470</flops>
    <rsc_fpops_est>10000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>10000000000000000.000000</rsc_fpops_bound>

- so the speed is 181,882,876,216 flops, or 181.88 Gflops. This website shows an APR (average processing rate) of 198.76, but it stopped updating after 9 completed tasks with v1.03 (I've got 13 showing valid at the moment). Size/speed gives 54.98 seconds, confirming what BOINC is estimating.

I reckon the size estimate (fpops_est) should be increased by a factor of about 250, to get closer to the target of DCF=1.

BUT DON'T DO IT ALL IN ONE MASSIVE JUMP.

We could probably manage a batch with the estimate 5x the current value, to get DCF away from the upper limit (DCF corrects very, very slowly when it gets this far out of equilibrium, and limits work fetch requests to a nominal 1 second). Then, a batch with a further increase of 10x, and a third batch with another 10x, should do it. But monitor the effects of the changes carefully.
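The arithmetic in this post can be checked with a few lines of Python. This is a rough sketch using only the figures quoted above; it assumes the displayed estimate is simply size divided by speed, multiplied by DCF, which is what the numbers in the post imply. The variable and function names are made up for illustration.

```python
# Figures quoted in the post
flops = 181_882_876_215.769470          # <flops>: speed in flops
rsc_fpops_est = 10_000_000_000_000.0    # <rsc_fpops_est>: size estimate
dcf = 100.0                             # DCF at its upper limit
actual_runtime_s = 14_660.0             # typical observed runtime in seconds

# Raw estimate: size / speed
raw_estimate_s = rsc_fpops_est / flops

# Estimate shown by the client (assumed here to be raw estimate times DCF)
shown_estimate_s = raw_estimate_s * dcf

# Factor by which the size estimate would need to grow for DCF to settle near 1
correction_factor = actual_runtime_s / raw_estimate_s


def hms(seconds: float) -> str:
    """Format a duration in seconds as hh:mm:ss."""
    hours, rest = divmod(int(round(seconds)), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


print(f"raw estimate:      {raw_estimate_s:.2f} s")
print(f"shown estimate:    {shown_estimate_s:.0f} s ({hms(shown_estimate_s)})")
print(f"correction factor: {correction_factor:.0f}x")
```

The raw estimate comes out at 54.98 seconds and the scaled one at 01:31:38, as stated in the post. The ratio of actual runtime to raw estimate is a little over 260, in line with the "about 250" suggested above.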
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 58758 - Posted: 28 Apr 2022, 14:20:08 UTC
Last modified: 28 Apr 2022, 14:20:39 UTC

The ACEMD4 app puts less stress on the GPU than the ACEMD3 app.
ACEMD3 on RTX 3080Ti: 1845MHz 342W
ACEMD4 on RTX 3080Ti: 1920MHz 306W
I've made similar observations on an RTX 2080Ti, though I haven't recorded the exact numbers yet.
Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 58759 - Posted: 28 Apr 2022, 15:38:46 UTC - in response to Message 58758.  
Last modified: 28 Apr 2022, 16:17:10 UTC

I also notice that these tasks don't fully utilize the GPU (mentioned in an earlier post with usage stats). But I think it’s a CPU bottleneck with these tasks. A faster CPU allows the GPU to work harder.

My 3080Tis run 2010MHz @ ~265W and 70-80% GPU utilization. That’s with an AMD EPYC 7402P @ 3.35GHz

On another system I have another 3080Ti but this one paired with a faster 7443P running at ~3.5-3.6GHz. Here the 3080Ti runs at the same 2010MHz, but has 80-85% GPU utilization and about 275-280W power draw.

If the i3-9300 on your 3080Ti system is running over 4.0 (maybe 4.2?) GHz then your higher power draw (than mine) makes sense and supports the idea that it’s a CPU bottleneck.

The "GAFF2" version of these tasks (which were ACEMD4) that were sent out during the testing phase ran with behavior much more similar to the ACEMD3 behavior. See my post here: https://gpugrid.net/forum_thread.php?id=5305&nowrap=true#58683

so it seems to be down to how the WU is set up/configured rather than the application itself
Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 58760 - Posted: 28 Apr 2022, 17:45:12 UTC - in response to Message 58759.  

running another GAFF2 task now. GPU utilization is higher (~96%) but power use is even lower at about 230W for a 3080Ti (power limit at 300W).

I suspect these GAFF2 tasks have some code/function in them that's causing serialized computations. We saw this exact same behavior with the Einstein tasks (high reported utilization, low power draw) before one of our teammates was able to find and deploy a code fix for the app, after which it went to maxing out the GPU.

I've now got some "QM" tasks in the queue as well. will report how they behave, if different.
Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 58761 - Posted: 28 Apr 2022, 18:46:37 UTC - in response to Message 58760.  

QM tasks run in the same manner as the GAFF2 tasks. high ~96% GPU utilization, low GPU power draw, ~1GB VRAM used, ~2% VRAM bus use, high PCIe use.
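For anyone who wants to log comparable figures on their own host, here is a minimal sketch that polls `nvidia-smi` from Python. It is not how the numbers above were collected. The choice of query fields and the helper names `parse_sample` and `sample_gpus` are assumptions for illustration, and PCIe throughput is not covered.

```python
import subprocess

# Fields requested from nvidia-smi (this selection is an assumption):
# GPU utilization in %, power draw in W, SM clock in MHz, memory used in MiB.
QUERY_FIELDS = ["utilization.gpu", "power.draw", "clocks.sm", "memory.used"]


def parse_sample(line: str) -> dict:
    """Turn one CSV line of nvidia-smi output into a field -> value mapping."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(f"expected {len(QUERY_FIELDS)} values, got {len(values)}")
    sample = {}
    for field, value in zip(QUERY_FIELDS, values):
        try:
            sample[field] = float(value)
        except ValueError:
            # Some fields can be reported as unavailable on some cards
            sample[field] = None
    return sample


def sample_gpus() -> list:
    """Query every GPU once; requires nvidia-smi to be on the PATH."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_sample(line) for line in result.stdout.splitlines() if line.strip()]
```

Calling `sample_gpus()` in a loop while a task runs gives one reading per GPU per call, which is enough to spot the "high utilization, low power draw" pattern described above.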
captainjack
Joined: 9 May 13
Posts: 171
Credit: 4,594,296,466
RAC: 171
Message 58763 - Posted: 28 Apr 2022, 23:11:25 UTC

ACEMD 4 tasks now seem to be processing ok on my antique GTX 970. I had previously reported that they were all failing.
Two GAFF2 tasks completed and validated. 1 QM task underway for 25 minutes and processing normally.
Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 58764 - Posted: 29 Apr 2022, 18:54:10 UTC

out of the ~1200 or so that I assume went out, I processed over 400 of them myself. all completed successfully with no errors (excluding ones cancelled or aborted). I "saved" many _7s too.
Erich56
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 59919 - Posted: 14 Feb 2023, 19:05:04 UTC

no ACEMD 4 tasks for a very long time :-(

Is this subproject dead? Please let us know

Erich56
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 60232 - Posted: 30 Mar 2023, 5:08:00 UTC - in response to Message 59919.  

"no ACEMD 4 tasks for a very long time :-(

Is this subproject dead? Please let us know"

no reply from the project team for 1 1/2 months now - do we volunteers not deserve any kind of information?
oemuser
Joined: 18 Sep 16
Posts: 10
Credit: 1,291,979
RAC: 0
Message 60236 - Posted: 30 Mar 2023, 14:43:25 UTC - in response to Message 60232.  

Here you can see the progress of the ACEMD software:
https://software.acellera.com/acemd/releases.html

Maybe that also goes into the ACEMD tasks for GPUGrid?