Message boards : News : ACEMD 4
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891

Mine just started. So far, so good. I had a lot of _0..._5 failures before getting to one of mine, and they were all on low-RAM Maxwell or low-RAM Pascal cards like a 1050 or 950. So maybe the low-RAM cards are the suspect ones. It looks like even an 8 GB 1070 works; my 11 GB 1080 Ti should have no issues.

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891

The 1080 Ti can crunch the acemd4 tasks with no issues. It is about 1,000 seconds faster than my 2070 Supers or 2080s.

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423

I think the CPU speed plays a pretty significant role in GPU speed on these tasks, so your fast 5950X is helping out a good bit. My 300 W 3080 Ti under the 7443P runs roughly 700 s faster than an equivalent 300 W 3080 Ti under a 7402P.

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891

Same speed for all my 5950X hosts, of which the one with the 1080 Ti is the newest. So CPU speed is not the reason for the speedup. It must be the 11 GB of VRAM versus 8 GB on the 2080s and 2070 Supers.

Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428

Checkpointing is still not active. Both my Linux machines crashed hard overnight - I've yet to work out why. But one machine I've restarted had an ACEMD 4 at about the midway point: it started again from 1%.

Edit - very odd. The second machine looks completely inert - no POST, no beep, no video output. But BOINC remote monitoring shows all tasks are running, including both GPUs (I got a clue from the SSD activity LED). I'm draining the cache - I may call out for some help later.

Joined: 2 Jul 16 · Posts: 338 · Credit: 7,987,341,558 · RAC: 259

> exceeded elapsed time limit 12932.15 (10000000.00G/773.27G)</message>
> exceeded elapsed time limit 7695.31 (10000000.00G/1299.49G)</message>

Why are there time limits? Up them!

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423

> Same speed for all my 5950X hosts, of which the one with the 1080 Ti is the newest. So CPU speed is not the reason for the speedup.

I wasn't saying there was any "speedup", just that your fast CPU is helping compared with how a 1080 Ti might perform with a lower-end CPU. Put the same GPU on a slower CPU and it will slow down to some extent.

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423

> exceeded elapsed time limit 12932.15 (10000000.00G/773.27G)</message>

Your hosts are hidden. What are the system specifics - CPU, RAM, etc.? The time limit is set by BOINC, not by the project directly: BOINC bases the limit on the speed of the device and the estimated flops set by the project.

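For reference, the two numbers in the parentheses of those messages look like the task's total-flops bound and the client's speed estimate for the device, and dividing one by the other reproduces the limit. A minimal sketch of that arithmetic, with variable names that are mine rather than BOINC's:

```python
# Rough reconstruction of the elapsed-time-limit arithmetic seen in the log.
# Both figures below are taken straight from the quoted error message.

rsc_fpops_bound = 10_000_000e9   # 10000000.00 GFLOP: the task's flops bound
device_flops    = 773.27e9       # 773.27 GFLOPS: the client's speed estimate

limit_seconds = rsc_fpops_bound / device_flops
print(f"elapsed time limit: {limit_seconds:.2f} s")   # ~12932 s (~3.6 h)

# When a task's elapsed time exceeds this limit, the client aborts it with
# "exceeded elapsed time limit". A larger flops bound from the project, or a
# faster speed estimate for the device, raises the limit.
```
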
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

> Checkpointing is still not active.

I confirm that.

Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428

> exceeded elapsed time limit 12932.15 (10000000.00G/773.27G)</message>

It's possible that the lack of checkpointing contributed to this problem. ACEMD 4 tells BOINC that it has checkpointed (and I think that's correct - the checkpoint files have been written), so BOINC thinks it's OK to task-switch to another project. But when the time comes to return to GPUGrid, ACEMD fails to read the checkpoint files, deletes the result so far, and starts from the beginning. Yet it retains the memory of the elapsed time so far, bringing it much closer to the abort point.

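A toy illustration of that failure mode, assuming a hypothetical run length per interruption; this is not ACEMD or BOINC code, just the arithmetic of a clock that keeps accumulating while progress keeps resetting:

```python
# Toy model of the failure mode described above (not actual ACEMD/BOINC code).
# Elapsed time is cumulative across restarts, but a failed checkpoint read
# throws away the computed progress, so every interruption burns headroom
# against the elapsed-time limit without moving the task forward.

limit_s = 12932.15   # elapsed-time limit from the log excerpt above
run_s   = 6000.0     # hypothetical work done before each task switch

elapsed_s = 0.0
for restart in range(1, 4):
    elapsed_s += run_s      # the clock keeps counting up
    progress = 0.01         # checkpoint read fails: back to ~1% each time
    if elapsed_s > limit_s:
        print(f"restart {restart}: aborted with 'exceeded elapsed time limit'")
        break
    print(f"restart {restart}: {elapsed_s:.0f} s elapsed, progress reset to 1%")
```
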
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423

If it's retaining the timer when restarting the task from 0, then yes, I agree the checkpointing could be the root cause; if it's not checkpointing, the timer should reset to 0. I still think it's a bit of a user configuration error to allow any task switching for GPU projects. Set the task-switch interval longer than the estimated run time and the task will run unimpeded, barring any other high-priority work (though with the 24-hour deadlines these are going straight into panic mode and preempting their way to the front of the line anyway, lol). Time and time again, GPUGRID shows that the tasks don't like being interrupted: you have the ACEMD3 tasks that can't be restarted on a different device, and sometimes restarting on the same device gets detected as a "different" device. Sometimes you have to work around the project rather than making the project work around you. These tasks are about 1/5th the runtime of the current ACEMD3 batch, so I don't think it's too burdensome to just let them run to completion without interruption.

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423

I still don't understand the messages that the tasks won't complete in time, lol. It will only download one per GPU, then no more. At the start of the run, BOINC gives an estimated runtime of about 25 minutes. This is of course too low, and they end up running roughly 2-2.5 hours. So if BOINC thinks the tasks are only 25 minutes long and there's a 24-hour deadline, what's the logic in saying it can't have another task because there isn't enough time to complete it? Even following BOINC's own logic, at that point it should realize it has 23:30 left to complete another 0:25 task, no? Can anyone explain what BOINC is doing here?

Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428

We mostly concentrate on the estimates calculated by the client, from size, speed, and (at this project) DCF. But before a task is issued, a similar calculation is made by the server. In 'the good old days' (pre-2010), these were pretty much in lock-step: you can see the working out in Einstein's public server logs. In particular, both client and server included DCF in the estimate. Einstein's server is pretty old-school; the server here is in a curious time-warp state, somewhere between Einstein and full CreditNew. Without access to the server logs, we can't tell exactly which old features have been stripped out and which new features have been added. That makes it very hard to answer your question.

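As a rough illustration of the client-side estimate mentioned above: the classic calculation is the task's estimated flops divided by the device's projected speed, scaled by the project's duration correction factor. The exact code path varies between BOINC versions, so treat this as a sketch with my own variable names:

```python
# Sketch of the classic client-side runtime estimate for a task.
# rsc_fpops_est comes from the workunit, projected_flops is the client's
# speed estimate for the app version, and dcf is the per-project duration
# correction factor that this project still uses.

def estimated_runtime_seconds(rsc_fpops_est, projected_flops, dcf=1.0):
    return rsc_fpops_est / projected_flops * dcf

# Made-up example: a 1,000,000 GFLOP estimate on a device rated at 700 GFLOPS,
# with a DCF that has drifted up to 1.5:
print(estimated_runtime_seconds(1_000_000e9, 700e9, dcf=1.5))  # ~2143 s
```
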
ServicEnginIC · Joined: 24 Sep 10 · Posts: 592 · Credit: 11,972,186,510 · RAC: 1,447

> I still don't understand the messages that the tasks won't complete in time, lol. It will only download one per GPU, then no more.

I find that this warning appears for ACEMD 4 tasks when you try to set a work buffer greater than 1 day. BOINC Manager probably "thinks": why should I want to store more than one day of tasks, for tasks with a one-day deadline? The same happens with ACEMD 3 tasks when setting a work buffer greater than 5 days. It's in some way tricky...

* Related question
* Short explanation
* Extended explanation

That is, for current ACEMD 4 tasks (24-hour deadline): if you want to get more than one task per GPU, set your work buffer provisionally at a value lower than 1, and revert to your custom value once a further task is downloaded. (Or leave your buffer permanently set to 0.95 ;-)

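A toy version of the check being described, under the assumption that BOINC treats the work buffer as the worst-case wait before a newly fetched task can start; this is a simplification of the client's real deadline simulation, with hypothetical names throughout:

```python
# Simplified model of the "won't complete in time" decision described above.
# Assumption: a new task may sit behind up to a work-buffer's worth of queued
# work before it starts, so a buffer at or above the deadline looks like a
# certain deadline miss.

def would_fetch_more(work_buffer_days, deadline_days, est_runtime_days):
    worst_case_finish = work_buffer_days + est_runtime_days
    return worst_case_finish <= deadline_days

print(would_fetch_more(1.5,  1.0, 25 / 1440))   # False -> warning, no new task
print(would_fetch_more(0.95, 1.0, 25 / 1440))   # True  -> a second task arrives
```
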
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423

Interesting observation. I've experienced similar things in the past with other projects showing counterintuitive behavior in response to a cache level set "too high". I'll try that out.

[edit] Indeed, that was it! Thanks :) The extracted package for these tasks is also huge: 7 tasks running (7 GPUs) and ~88 GB of disk space used by the GPUGRID project, lol.

Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

The BOINC Manager UI shows 42 minutes left, while the work fetch debug shows 2811 minutes:

> [work_fetch] --- state for NVIDIA GPU ---
> [work_fetch] shortfall 0.00 nidle 0.00 saturated 167688.92 busy 167688.92

That's odd, because fraction_done_exact isn't set in the app_config.xml.

Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423

I had one of these ACEMD4 tasks paused with a small amount of computation completed - at 2% and about 6 minutes. With the task paused, it remained in its "extracted" state. Upon resuming, the task restarts from 1%, not 0%; I'm guessing 0-1% is for the file extraction. But indeed the timer stayed where it was, at ~6 minutes, and continued from there rather than resetting to the actual time elapsed at 1%.

Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

Take a look at the tasks on my host. It's very easy to spot the one which was restarted without the checkpoint.

Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

The host from my previous post has received an ACEMD3 task. It had reached 16.3% when the host received an ACEMD4 task, which took over, as the latter has a much shorter deadline. The ACEMD3 task could restart from its checkpoint, so it will finish eventually. I wonder how many times the ACEMD3 task will be suspended, and how many days will pass until it's completed.

Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891

I don't think there's much chance at all - we've blown through all of those 1,000 tasks, I think. Not much chance your ACEMD3 task will get pre-empted.
