ACEMD 4

Message boards : News : ACEMD 4

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 731
Message 58726 - Posted: 26 Apr 2022, 0:18:33 UTC - in response to Message 58725.  

Mine just started. So far, so good.

The workunits I received had a lot of _0..._5 failures before getting to mine, and those failures were all on low-RAM Maxwell or low-RAM Pascal cards like a 950 or 1050.

So maybe the low-RAM cards are the suspect ones.

Looks like even an 8GB 1070 works. My 11GB 1080 Ti should have no issues.

Keith Myers
Message 58727 - Posted: 26 Apr 2022, 3:21:27 UTC

The 1080 Ti can crunch the acemd4 tasks with no issues.
It is 1000 seconds faster than my 2070 Supers or 2080s.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 5,269
Message 58728 - Posted: 26 Apr 2022, 3:34:04 UTC - in response to Message 58727.  

i think the CPU speed plays a pretty significant role in GPU speed on these tasks. so your fast 5950X is helping out a good bit.

my 300W 3080Ti under the 7443P runs roughly 700s faster than an equivalent 300W 3080Ti under a 7402P.

Keith Myers
Message 58729 - Posted: 26 Apr 2022, 4:15:17 UTC

All my 5950X hosts run at the same speed, and the 1080 Ti host is the newest of them. So the CPU speed is not the reason for the speedup.

It must be the 11GB VRAM versus the 8GB VRAM on the 2080s and 2070 Supers.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 58730 - Posted: 26 Apr 2022, 7:08:24 UTC
Last modified: 26 Apr 2022, 7:36:27 UTC

Checkpointing is still not active.

Both my Linux machines crashed hard overnight - I've yet to work out why. But one machine I've restarted had an ACEMD 4 task at about the midway point: it's started again from 1%.

Edit - very odd. The second machine looks completely inert - no POST, no beep, no video output. But BOINC remote monitoring shows all tasks are running, including both GPUs (I got a clue from the SSD activity LED). I'm draining the cache - may call out for some help later.

mmonnin
Joined: 2 Jul 16
Posts: 338
Credit: 7,987,341,558
RAC: 213
Message 58733 - Posted: 26 Apr 2022, 10:47:28 UTC

exceeded elapsed time limit 12932.15 (10000000.00G/773.27G)</message>
exceeded elapsed time limit 7695.31 (10000000.00G/1299.49G)</message>

Why are there time limits? Up them!

Ian&Steve C.
Message 58734 - Posted: 26 Apr 2022, 13:13:48 UTC - in response to Message 58729.  

> Same speed for all my 5950X hosts which the 1080 Ti is the newest. So the cpu speed is not the reason for the speedup.
>
> Must be 11GB VRAM versus 8GB VRAM on the 2080's, 2070 Supers


i wasn't saying there was any "speedup". just that your fast CPU is helping vs how a 1080Ti might perform with a lower end CPU. put the same GPU on a slower CPU and it will slow down to some extent.

Ian&Steve C.
Message 58735 - Posted: 26 Apr 2022, 13:16:36 UTC - in response to Message 58733.  

> exceeded elapsed time limit 12932.15 (10000000.00G/773.27G)</message>
> exceeded elapsed time limit 7695.31 (10000000.00G/1299.49G)</message>
>
> Why are there time limits? Up them!


your hosts are hidden. what are the system specifics? what CPU/RAM specs, etc.

the time limit is set by BOINC, not the project directly. BOINC bases the limits on the speed of the device and the estimated flops set by the project.
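
The two figures in parentheses in the error message look like the inputs to that limit. A minimal sketch, assuming the client simply divides the workunit's floating-point operations bound by the device's estimated speed (the function name is hypothetical, and this is an illustration rather than the actual BOINC client code):

```python
GIGA = 1e9


def max_elapsed_seconds(fpops_bound, device_flops):
    """Assumed rule: time limit = operations bound / estimated device speed."""
    if device_flops <= 0:
        raise ValueError("device_flops must be positive")
    return fpops_bound / device_flops


# Values copied from the two error messages quoted in this thread.
reported = [
    (10000000.00 * GIGA, 773.27 * GIGA, 12932.15),
    (10000000.00 * GIGA, 1299.49 * GIGA, 7695.31),
]

for bound, flops, logged_limit in reported:
    limit = max_elapsed_seconds(bound, flops)
    # The logged speed is rounded to two decimals, so expect a small mismatch.
    print(f"computed {limit:.2f} s, logged {logged_limit:.2f} s")
```

Under this assumption, a host that BOINC rates as faster gets a shorter limit, which would explain why the second host was cut off after about 7695 seconds rather than 12932.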

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 58736 - Posted: 26 Apr 2022, 13:38:45 UTC - in response to Message 58730.  

> Checkpointing is still not active.

I confirm that.

Richard Haselgrove
Message 58737 - Posted: 26 Apr 2022, 13:59:11 UTC - in response to Message 58735.  

>> exceeded elapsed time limit 12932.15 (10000000.00G/773.27G)</message>
>> exceeded elapsed time limit 7695.31 (10000000.00G/1299.49G)</message>
>>
>> Why are there time limits? Up them!
>
> your hosts are hidden. what are the system specifics? what CPU/RAM specs, etc.
>
> the time limit is set by BOINC, not the project directly. BOINC bases the limits on the speed of the device and the estimated flops set by the project.

It's possible that the lack of checkpointing contributed to this problem. ACEMD 4 tells BOINC that it has checkpointed (and I think that's correct - the checkpoint files have been written). So BOINC thinks it's OK to task-switch to another project.

But when the time comes to return to GPUGrid, ACEMD fails to read the checkpoint files, deletes the result file so far, and starts from the beginning. But it retains the memory of elapsed time so far, bringing it much closer to the abort point.
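
The effect described here can be put into numbers. The sketch below is a toy model, assuming every restart discards all progress while the elapsed-time counter keeps running; the function names and the 8100-second full runtime are illustrative assumptions, and the limit is the first value from the error messages quoted above.

```python
def recorded_elapsed_seconds(full_runtime_s, restart_fractions):
    """Timer value at completion if each restart loses progress but keeps the timer.

    restart_fractions lists the progress (0.0 to 1.0) reached before each
    interruption. All of that work is assumed to be repeated from the start.
    """
    for fraction in restart_fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("each restart fraction must be between 0 and 1")
    wasted = sum(fraction * full_runtime_s for fraction in restart_fractions)
    return wasted + full_runtime_s


def would_abort(full_runtime_s, restart_fractions, limit_s):
    """True if the accumulated timer would pass the elapsed-time limit."""
    return recorded_elapsed_seconds(full_runtime_s, restart_fractions) > limit_s


FULL_RUNTIME_S = 8100   # assumed uninterrupted runtime (2.25 h), illustrative only
LIMIT_S = 12932.15      # limit taken from the first quoted error message

print(would_abort(FULL_RUNTIME_S, [], LIMIT_S))     # uninterrupted run
print(would_abort(FULL_RUNTIME_S, [0.6], LIMIT_S))  # one restart at 60% progress
```

With these assumed numbers an uninterrupted run stays under the limit, while a single restart at 60% pushes the timer to 12960 seconds and over it.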

Ian&Steve C.
Message 58738 - Posted: 26 Apr 2022, 14:50:19 UTC - in response to Message 58737.  

if it's retaining the timer when restarting the task from 0 then yes I agree the checkpointing could be the root cause. if it's not checkpointing, the timer should reset to 0.

I still think it's a bit of user config error to allow any task switching for GPU projects. set the task switch interval longer than the estimated run time and the task will run unimpeded, barring any other high priority work (though with the 24hr deadlines these are going right into panic mode and preempting to the front of the line anyway lol).

time and time again, GPUGRID shows that the tasks don't like being interrupted. you have the ACEMD3 tasks that can't be restarted on a different device, and sometimes restarting on the same device gets detected as a "different" device. sometimes you have to work around the project rather than making the project work around you. these tasks are ~1/5th the size (runtime) of the current ACEMD3 batch, I don't think it's too burdensome to just let it run to completion without interruption.

Ian&Steve C.
Message 58739 - Posted: 26 Apr 2022, 14:56:39 UTC

I still don't understand the messages that the tasks won't complete in time lol. it will only download 1 per GPU, then no more.

at inception of the run, BOINC gives an estimated runtime of about 25mins. this is of course too low and they end up running ~2-2.5hrs.

so if BOINC thinks the tasks are only 25 minutes long, and there's a 24hr deadline, what's the logic in saying it can't have another task due to not enough time for completion? even following BOINC's own logic at that point it should realize that it has 23:30 to complete another 0:25 task, no?

can anyone explain what BOINC is doing here?

Richard Haselgrove
Message 58740 - Posted: 26 Apr 2022, 15:53:17 UTC

We mostly concentrate on the estimates calculated by the client, from size, speed, and (at this project) DCF.

But before a task is issued, a similar calculation is made by the server. In 'the good old days' (pre-2010), these were pretty much in lock-step: you can see the working out in Einstein's public server logs. In particular, both client and server included DCF in the estimate.

Einstein's server is pretty old-school: the server here is in a curious time-warp state, somewhere between Einstein and full CreditNew. Without access to the server logs, we can't tell exactly what old features have been stripped out, and what new features have been added in. That makes it very hard to answer your question.
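
As a rough illustration of the client-side estimate built from size, speed, and DCF, here is a minimal sketch. It assumes the estimate is the task's estimated operation count divided by the device speed, scaled by the duration correction factor; the function name and all numbers are hypothetical and chosen only to reproduce a 25-minute estimate like the one discussed in this thread.

```python
def estimated_runtime_seconds(fpops_est, device_flops, dcf=1.0):
    """Assumed estimate: (estimated operations / device speed) * DCF."""
    if device_flops <= 0:
        raise ValueError("device_flops must be positive")
    return fpops_est / device_flops * dcf


FPOPS_EST = 1.5e15     # hypothetical task size set by the project
DEVICE_FLOPS = 1.0e12  # hypothetical speed BOINC assigns to the GPU

initial = estimated_runtime_seconds(FPOPS_EST, DEVICE_FLOPS)
print(f"initial estimate: {initial / 60:.0f} minutes")

# If tasks really take 8100 s, a DCF of 8100 / 1500 = 5.4 would be needed
# before the estimate matches reality.
corrected = estimated_runtime_seconds(FPOPS_EST, DEVICE_FLOPS, dcf=5.4)
print(f"estimate with DCF 5.4: {corrected / 3600:.2f} hours")
```

Whether the server applies the same correction is exactly the open question above, so this only models the client side.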

ServicEnginIC
Joined: 24 Sep 10
Posts: 592
Credit: 11,972,186,510
RAC: 1,187
Message 58741 - Posted: 26 Apr 2022, 16:56:07 UTC - in response to Message 58739.  

> I still don't understand the messages that the tasks wont complete in time lol. it will only download 1 per GPU, then no more.

I find that this warning appears for ACEMD 4 tasks when you try to set a work buffer greater than 1 day.
BOINC Manager probably "thinks": Why should I want to store more than one day of tasks, for tasks with a one-day deadline?
The same happens with ACEMD 3 tasks when setting a work buffer greater than 5 days.

It's in some way tricky...
* Related question
- Short explanation
- Extended explanation

That is, for current ACEMD 4 tasks (24-hour deadline): if you want to get more than one task per GPU, temporarily set your work buffer to a value lower than 1 day, and revert to your custom value once a further task is downloaded.
(Or leave your buffer permanently set to 0.95 ;-)
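
The rule of thumb above can be written down as a one-line check. This models only the behaviour observed in this post, not BOINC's actual work-fetch logic; the function name is hypothetical, and the 5-day deadline for ACEMD 3 is inferred from the buffer threshold mentioned above.

```python
def fetch_blocked_by_buffer(work_buffer_days, deadline_days):
    """Observed behaviour only: extra work is refused once the buffer exceeds the deadline."""
    return work_buffer_days > deadline_days


ACEMD4_DEADLINE_DAYS = 1.0  # 24-hour deadline stated in the post
ACEMD3_DEADLINE_DAYS = 5.0  # assumption, inferred from the 5-day buffer threshold

print(fetch_blocked_by_buffer(1.5, ACEMD4_DEADLINE_DAYS))   # buffer above the deadline
print(fetch_blocked_by_buffer(0.95, ACEMD4_DEADLINE_DAYS))  # the suggested 0.95 setting
print(fetch_blocked_by_buffer(6.0, ACEMD3_DEADLINE_DAYS))   # ACEMD 3 with a 6-day buffer
```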

Ian&Steve C.
Message 58742 - Posted: 26 Apr 2022, 17:11:27 UTC - in response to Message 58741.  
Last modified: 26 Apr 2022, 17:21:00 UTC

interesting observation. I've experienced similar things in the past with other projects with counterintuitive behavior/response to a cache level set "too high".

I'll try that out.

[edit]
Indeed that was it! thanks :)

the extracted package for these tasks is also huge. 7 tasks running (7 GPUs), and ~88GB disk space used by the GPUGRID project lol.

Retvari Zoltan
Message 58743 - Posted: 26 Apr 2022, 18:02:58 UTC
Last modified: 26 Apr 2022, 18:03:49 UTC

The BOINC manager UI shows 42 minutes left, while the work fetch debug shows 2811 minutes:
[work_fetch] --- state for NVIDIA GPU ---
[work_fetch] shortfall 0.00 nidle 0.00 saturated 167688.92 busy 167688.92
That's odd, because the fraction_done_exact isn't set in the app_config.xml
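
To compare the debug line with what the Manager shows, the fields can be pulled out and converted. A small sketch, assuming the work_fetch values are in seconds; the parser is hypothetical and only handles lines shaped like the one quoted above.

```python
import re

LINE = "[work_fetch] shortfall 0.00 nidle 0.00 saturated 167688.92 busy 167688.92"


def parse_work_fetch_state(line):
    """Return the 'name value' pairs from a work_fetch state line as floats."""
    pairs = re.findall(r"(\w+) (-?\d+(?:\.\d+)?)", line)
    return {name: float(value) for name, value in pairs}


state = parse_work_fetch_state(LINE)
saturated_minutes = state["saturated"] / 60  # assumes the value is in seconds

print(state)
print(f"saturated for about {saturated_minutes:.0f} minutes")
```

For the quoted line this gives roughly 2795 minutes, the same order as the 2811 minutes mentioned above and far from the 42 minutes shown in the Manager.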

Ian&Steve C.
Message 58744 - Posted: 26 Apr 2022, 19:31:11 UTC - in response to Message 58737.  


> But when the time comes to return to GPUGrid, ACEMD fails to read the checkpoint files, deletes the result file so far, and starts from the beginning. But it retains the memory of elapsed time so far, bringing it much closer to the abort point.


I had one of these ACEMD4 tasks paused with a short amount of computation completed, at 2% and about 6 minutes.

with the task paused, the task remained in its "extracted" state. upon resuming the task, it restarts from 1%, not 0%. I'm guessing 0-1% is for the file extraction. but indeed the timer stayed where it was at ~6 minutes and continued from there and did not reset to the actual time elapsed for 1%.

Retvari Zoltan
Message 58745 - Posted: 26 Apr 2022, 19:47:36 UTC

Take a look at the tasks on my host.
It's very easy to spot the one which was restarted without the checkpoint.

Retvari Zoltan
Message 58746 - Posted: 26 Apr 2022, 23:09:57 UTC

The host from my previous post has received an ACEMD3 task. It had reached 16.3% when the host received an ACEMD4 task, which took over, as the latter has a much shorter deadline. The ACEMD3 task could restart from the checkpoint, so it will finish eventually. I wonder how many times the ACEMD3 task will be suspended, and how many days will pass until it's completed.

Keith Myers
Message 58747 - Posted: 26 Apr 2022, 23:42:33 UTC
Last modified: 26 Apr 2022, 23:45:24 UTC

I don't think there's much chance of that at all. I think we've blown through all 1000 of those tasks, so there's not much chance your acemd3 task will get pre-empted.

©2025 Universitat Pompeu Fabra