ACEMD 4

Erich56

Message 58439 - Posted: 5 Mar 2022, 8:32:19 UTC - in response to Message 58438.  

This project needs to get their networking in order.

That's what I have been saying often enough in the past.
ID: 58439
Ian&Steve C.

Message 58440 - Posted: 5 Mar 2022, 15:48:15 UTC

It's also interesting to see that this new ACEMD4 application does not have the same high PCIe bus use as the ACEMD3 app. That should allow faster processing on systems with smaller bus widths (cards on USB risers, cards plugged in via chipsets, older systems with PCIe 2.0 or less, etc.).

It seems fairly bound by memory bandwidth, though. My 3080 Ti is using up to about 80% of the memory bus, which is a bit higher than the ACEMD3 app; fast cards with a smaller bus will be more bound. But I think this is better for speed: reaching back and forth to GPU RAM is a lot faster than reaching back and forth over the PCIe bus to system RAM.

Still curious about the constant unpacking of the compressed file for every task. It wastes 5 minutes per task doing the same thing over and over; if you just left it unpacked, you would save 5 minutes on each subsequent task.
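
For anyone who wants to check the memory-bus observation on their own card, here is a small sketch that polls nvidia-smi (utilization.memory is the percentage of time the memory controller was busy, which is the figure being discussed; the GPU index and the one-second loop are just illustrative):

import subprocess, time

# Query GPU 0's core and memory-controller utilization via nvidia-smi.
QUERY = ["nvidia-smi", "-i", "0",
         "--query-gpu=utilization.gpu,utilization.memory",
         "--format=csv,noheader,nounits"]

for _ in range(10):  # ten one-second samples
    gpu_pct, mem_pct = subprocess.check_output(QUERY, text=True).strip().split(", ")
    print(f"GPU core: {gpu_pct}%   memory controller: {mem_pct}%")
    time.sleep(1)
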
ID: 58440
Richard Haselgrove

Message 58441 - Posted: 5 Mar 2022, 16:33:16 UTC - in response to Message 58440.  

still curious about the constant unpacking of the compressed file for every task. wastes 5 mins for each task doing the same thing over and over. if you just leave it unpacked then you save 5 mins on each subsequent task.

It's BOINC that empties the slot directory when a task has finished uploading, exited and reported. BOINC will check again that the allocated slot is still empty before starting a new task. It won't re-use an old slot if there's anything left behind, whether it's (a) the same project, same application, (b) the same project, different application, or (c) a different project entirely.

The slot directory is also the 'working' directory in operating system terms, and both the operating system and the GPUGrid project use it in that sense. To use a different location for persistent files would require some effort in modifying the Path environment to let GPUGrid run.

Personally, I suspect the "everything, including the kitchen sink" compressed files are perhaps over-specified. The 17,298 items (10.3 GB) I found in there yesterday feel like an 'oh, include it, just in case' solution. When testing is complete and production is about to start, perhaps the project could audit the compressed archives and strip them back to the bare minimum?
ID: 58441
Richard Haselgrove

Message 58442 - Posted: 5 Mar 2022, 17:14:48 UTC

Another problem in P0_NNPMM_2p95_19-RAIMIS_NNPMM-0-20-RND5821_0 (task 32755591):

Exit status 198 (0xc6) EXIT_MEM_LIMIT_EXCEEDED
working set size > client RAM limit: 14361.16MB > 14353.11MB
ID: 58442
Ian&Steve C.

Message 58443 - Posted: 5 Mar 2022, 17:50:37 UTC - in response to Message 58441.  

Unless they build a single binary file for processing, like most other projects do. Then they would just dump the binary into the projects folder and it would get used over and over.
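
Purely to illustrate the "unpack once and reuse" idea - not how the current wrapper works (the task logs further down show it re-running /bin/tar in the slot directory for every task), and the cache path here is hypothetical:

import os
import tarfile

PACKAGE = "x86_64-pc-linux-gnu__cuda1121.tar.gz"
# Hypothetical persistent location under the BOINC project directory,
# two levels up from the slots/N working directory the task runs in.
CACHE = os.path.join("..", "..", "projects", "www.gpugrid.net", "acemd4_env")

if not os.path.isdir(CACHE):
    # First task after an app update: pay the unpacking cost once.
    os.makedirs(CACHE, exist_ok=True)
    with tarfile.open(PACKAGE, "r:gz") as archive:
        archive.extractall(CACHE)

# Later tasks skip the extraction and run the cached binary directly.
print("would run:", os.path.join(CACHE, "bin", "acemd"), "--boinc --device 0")
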
ID: 58443
mmonnin

Message 58444 - Posted: 5 Mar 2022, 18:58:49 UTC - in response to Message 58442.  

Another problem in P0_NNPMM_2p95_19-RAIMIS_NNPMM-0-20-RND5821_0 (task 32755591:

Exit status 198 (0xc6) EXIT_MEM_LIMIT_EXCEEDED
working set size > client RAM limit: 14361.16MB > 14353.11MB


I had one too. What kind of app needs 40GB of memory?
working set size > client RAM limit: 38573.95MB > 38554.31MB

For the next user it was aborted by the project. I've had 3 cancelled by the server as well, 2 while running.

ID: 58444
Richard Haselgrove

Message 58445 - Posted: 5 Mar 2022, 19:22:28 UTC - in response to Message 58444.  

And I had P0_NNPMM_1hpo_19-RAIMIS_NNPMM-1-20-RND4653_0 cancelled as well, same machine. Same sequence, also ran ~50 minutes. Maybe somebody pulled the batch?
ID: 58445
Raimondas

Message 58452 - Posted: 7 Mar 2022, 13:57:10 UTC

Hello everybody,

Thank you for your feedback on the ACEMD 4 app.

Response to the reported issues:

    Long download time of the software package.
    The GPUGRID server is connected to a university network with substantial bandwidth, so it is not a limiting factor. Most likely, your ISPs are throttling the download speed. As mentioned before, the software package has grown due to PyTorch, which by itself takes ~1 GB. I have removed some more unnecessary files, but it is still ~3 GB. The software package is cached by the BOINC client, so it is only downloaded once for each app version and reused.


    Long download time of the input files.
    The WUs have to download ~500 MB of input files. At the moment, I cannot do much about this, but this will be reduced eventually.


    Long decompression time.
    I have changed to a different format (gzip), so now it takes 1-2 min to decompress. As a side note, the ACEMD 3 app does the same, but it uses a built-in ZIP decompressor, which doesn't report that in the log. In the case of the ACEMD 4 app there is an issue that the built-in decompressor doesn't support files >2 GB, so I had to add the decompression as a separate task (see the sketch after this list).


    Excessive memory usage.
    I have fixed a memory leak. Now it should consume a reasonable amount of memory (2-4 GB).
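
As referenced in the decompression point above, here is a minimal sketch of what the separate extraction step amounts to. Python's standard tarfile module stands in for the /bin/tar call that the task logs actually show; the timing print is only there for comparison with the 1-2 minutes quoted:

import tarfile
import time

PACKAGE = "x86_64-pc-linux-gnu__cuda1121.tar.gz"  # the ~3 GB software package

start = time.time()
# Done as its own step because the wrapper's built-in decompressor
# cannot handle archives larger than 2 GB.
with tarfile.open(PACKAGE, "r:gz") as archive:
    archive.extractall(".")  # unpack into the task's working directory
print(f"decompressed in {time.time() - start:.0f} s")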


If I missed something important, feel free to remind me.

Happy computing,

Raimondas

ID: 58452
Ian&Steve C.

Message 58453 - Posted: 7 Mar 2022, 15:20:23 UTC - in response to Message 58452.  
Last modified: 7 Mar 2022, 15:25:58 UTC

Thanks for giving some attention to the package decompression :). Much faster now.

Another comment I have is regarding the credit reward for the ACEMD4 tasks. They seem to be set to a static 1,500 credits. It might make sense to implement the same credit reward model as the ACEMD3 tasks (higher credit, and actually scaled to difficulty/runtime).
ID: 58453
Richard Haselgrove

Message 58454 - Posted: 7 Mar 2022, 16:59:33 UTC - in response to Message 58452.  

Thank you for your responses and explanations. Two further points - one an amplification, and the other something different.

    Long download time of the software package.
    The GPUGRID server is connected to a university network with a substantial bandwidth, so it is not a limiting factor. Most likely, your ISPs are throttling the download speed. As mentioned before, the software package has increased due to PyTorch, which by itself takes ~ 1 GB. I have removed some more unnecessary files, but still it is ~3 GB. The software package is cached by the Boinc client, so it is only downloaded once for each app version and reused.


I'm on the end of a pretty stable 70 megabit ISP download connection - I don't think they are throttling it, either. I think the answer probably lies in the intermediate routing - the switching centres and interconnectors that the signals pass through between the University and our ISPs. Sometimes they work well, sometimes they drop packets, and sometimes they slow to a crawl. I don't think that's something that can be fixed at the research lab level, but it might be worth mentioning it upstream to the University's technical support team - they might be able to monitor it.

On another tack - and this applies to all the GPUGrid researchers as a group - it would help the work proceed more smoothly if you could find a way of paying more careful attention to the meta-data which BOINC passes downstream to our computers with each task.

The key value is the estimated size, the <rsc_fpops_est>, of each task. At the moment, I have various machines working on:

AbNTest_micro tasks for ADRIA, which run for over a day
AbNTest_counts tasks for ADRIA, which run for about an hour
Today's NNPMM task from your good self, which looks set to run for about 8 hours.

Earlier test runs only lasted a few minutes, but all seem to be given the same project standard <rsc_fpops_est> setting of 5,000,000,000,000,000.

The BOINC client uses the fpops estimate, plus its own measurement of elapsed time, to keep track of the effective speed of our machines, and thus the anticipated runtime of newly downloaded tasks.

It's tedious, I know, but if the task size estimate isn't routinely adjusted to take account of the testing being undertaken, anticipated runtimes can get seriously distorted. In the worst case, a succession of short tests (if not described accurately) can make our BOINC clients think that our machines have become suddenly many times faster, and can even cause 'normal' tasks to be aborted for taking too long. Experienced volunteers can anticipate and work through these problems, but the project's main work will proceed more smoothly if they don't arise in the first place.
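
To make the distortion concrete, here is a toy calculation using the project-standard estimate quoted above. The 10-minute test run is a made-up example, and the real client also applies a duration correction factor and smooths over several tasks rather than reacting to a single result, but the mechanism is the same:

FPOPS_EST = 5_000_000_000_000_000  # the common <rsc_fpops_est> quoted above

# A short test task finishes in 10 minutes but carries the same estimate,
# so the client infers an "effective speed" of fpops_est / elapsed ...
test_elapsed = 600  # seconds, illustrative
apparent_speed = FPOPS_EST / test_elapsed  # ~8.3e12 flops

# ... and then predicts that the next task of the same nominal size will
# also take ~10 minutes, even if it is really an 8-hour production run.
predicted = FPOPS_EST / apparent_speed
print(f"apparent speed: {apparent_speed:.1e} flops")
print(f"predicted runtime for the next task: {predicted / 60:.0f} min")
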
ID: 58454
Ian&Steve C.

Message 58455 - Posted: 7 Mar 2022, 17:35:58 UTC - in response to Message 58454.  

Yeah, I agree about the estimated flops size. It throws things all out of whack. My current task, which will run for about 3 hours, started with an estimated runtime of 10 days, lol.
ID: 58455
Sven

Message 58457 - Posted: 7 Mar 2022, 21:12:06 UTC

Hello everyone,

After restarting the boinc-client.service, the progress indication reset from about 35% to just 1% and is working up from there (ACEMD 4 simulations for GPUs 1.03). Rather surprising that the task did not manage to save its progress and resume from there, as other applications do.

Any explanation?

Best
Sven
ID: 58457
mmonnin

Message 58459 - Posted: 7 Mar 2022, 22:42:11 UTC - in response to Message 58454.  
Last modified: 7 Mar 2022, 22:44:01 UTC

Thank you for your responses and explanations. Two further points - one an amplification, and the other something different.

    Long download time of the software package.
    The GPUGRID server is connected to a university network with a substantial bandwidth, so it is not a limiting factor. Most likely, your ISPs are throttling the download speed. As mentioned before, the software package has increased due to PyTorch, which by itself takes ~ 1 GB. I have removed some more unnecessary files, but still it is ~3 GB. The software package is cached by the Boinc client, so it is only downloaded once for each app version and reused.


I'm on the end of a pretty stable 70 megabit ISP download connection - I don't think they are throttling it, either. I think the answer probably lies in the intermediate routing - the switching centres and interconnectors that the signals pass through between the University and our ISPs. Sometimes they work well, sometimes they drop packets, and sometimes they slow to a crawl. I don't think that's something that can be fixed at the research lab level, but it might be worth mentioning it upstream to the University's technical support team - they might be able to monitor it.


I agree, it's not on our end either. As mentioned somewhere, a pause and resume of networking in the client can speed up the download. I did this on the 3 GB download.

The site is often slow and will time out. Once it times out, it will reload on a refresh. I'd put more weight on a DDoS-type restriction somewhere: too many requests and the speed drops or is cut off.

The app rename with v3 vs v4 in the name makes things easier.
ID: 58459
Richard Haselgrove

Message 58462 - Posted: 8 Mar 2022, 10:44:15 UTC

Another 3+ GB download, for ACEMD4 v1.03

This one is coming down at 6MB/sec (so far - the average speed figure is still stabilising while the data download shares the connection), but it looks on target to finish within 5 minutes. Not a problem by itself, but users with slow connections might choose to delay requesting new work until the surge is over.

08/03/2022 10:28:57 | GPUGRID | Started download of x86_64-pc-linux-gnu__cuda1121.tar.gz.0f9ebf1ac84d8d1f5ae2c260dc903be9
08/03/2022 10:36:55 | GPUGRID | Finished download of x86_64-pc-linux-gnu__cuda1121.tar.gz.0f9ebf1ac84d8d1f5ae2c260dc903be9

OK, 7 minutes 58 seconds. And with a six-day estimate and a one-day deadline, the previous task was booted aside and the new task started immediately. That's called EDF (Earliest Deadline First), or Panic Mode On!
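
For the record, the back-of-the-envelope arithmetic on that download (the package size is the approximate "3+ GB" figure from above):

size_mb = 3.2 * 1024  # "3+ GB" package, approximate
rate_mb_s = 6         # speed observed early in the download

minutes = size_mb / rate_mb_s / 60
print(f"~{minutes:.0f} minutes at a steady {rate_mb_s} MB/s")
# ~9 minutes - roughly consistent with the 7 min 58 s actually logged,
# once the average rate crept a little above 6 MB/s.
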
ID: 58462
ServicEnginIC

Message 58482 - Posted: 10 Mar 2022, 11:33:48 UTC - in response to Message 58453.  

another comment I have is regarding the credit reward for the ACEMD4 tasks. they seem to be set to a static 1500cred. it might make sense to implement the same credit reward model from the ACEMD3 tasks (higher credit and actually scaled to difficulty/runtime).

I do agree that the credits granted for ACEMD4 tasks are very undervalued, compared to the same processing times for tasks from other projects, or even ACEMD3 tasks from this same project.
For example, on Host #186626 (credits per hour are worked out in the sketch below):
PrimeGrid Genefer 18 tasks: ~1250 seconds processing time --> 1750 credits
GPUGrid ACEMD3 ADRIA KIXCMYB tasks: ~82000 seconds processing time --> 540000 credits (+50% bonus, base credits: 360000)
GPUGrid ACEMD3 ADRIA e100 tasks: ~3300 seconds processing time --> 27000 credits
GPUGrid ACEMD4 RAIMIS tasks: ~25600 seconds processing time --> 1500 credits (?)
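
Putting those examples on a common credits-per-hour scale (runtimes and credits exactly as listed above) makes the gap obvious:

tasks = {
    "PrimeGrid Genefer 18":        (1_250, 1_750),
    "ACEMD3 ADRIA KIXCMYB (+50%)": (82_000, 540_000),
    "ACEMD3 ADRIA e100":           (3_300, 27_000),
    "ACEMD4 RAIMIS":               (25_600, 1_500),
}

for name, (seconds, credits) in tasks.items():
    per_hour = credits / (seconds / 3600)
    print(f"{name:30s} {per_hour:>10,.0f} credits/hour")
# Roughly 5,000 / 24,000 / 29,000 credits per hour for the first three,
# against about 200 credits per hour for the ACEMD4 RAIMIS tasks.
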
ID: 58482
Raimondas

Message 58623 - Posted: 12 Apr 2022, 12:17:05 UTC

Hello everybody,

A quick update on the ACEMD 4 app:

    Credits
    I have re-calibrated the estimation of credits. Now the granted credits will be more in line with ACEMD 3, i.e. 1 hour of NVIDIA RTX 2080 Ti calculation is valued at 60,000 credits, not including the additional bonuses (see the sketch below this list).


    Estimated flops
    Currently there is no automated mechanism to set the flops; the app maintainers just set some arbitrary numbers. I have decreased the flops estimate by two orders of magnitude. Hopefully, this is more in line with the actual work.
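
As mentioned above, here is a minimal sketch of the stated reference point for the new credit calibration. How other cards are normalized to an RTX 2080 Ti is not spelled out here, so the speed factor below is purely an assumed parameter:

CREDITS_PER_2080TI_HOUR = 60_000  # the re-calibrated rate stated above

def estimated_base_credit(runtime_hours, speed_vs_2080ti=1.0):
    """Base credit before bonuses. speed_vs_2080ti is an assumed scaling
    factor for cards faster or slower than the reference RTX 2080 Ti."""
    return runtime_hours * speed_vs_2080ti * CREDITS_PER_2080TI_HOUR

print(estimated_base_credit(2.5))  # a 2.5-hour run on a 2080 Ti -> 150000.0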



What happens next? Today I have sent several test WUs. If no issues are discovered, I will send ~1300 production WUs.

Happy computing,

Raimondas

ID: 58623
Keith Myers

Message 58626 - Posted: 12 Apr 2022, 17:05:11 UTC
Last modified: 12 Apr 2022, 17:07:00 UTC

Still having issues with the acemd4 application.

Still getting the "time exceeded 1800 seconds" errors.

Can't download the python/anaconda environment in time.

<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 1803.39 (1000000.00G/554.51G)</message>
<stderr_txt>
08:51:16 (3808596): wrapper (7.7.26016): starting
08:51:16 (3808596): wrapper (7.7.26016): starting
08:51:16 (3808596): wrapper: running /bin/tar (xf x86_64-pc-linux-gnu__cuda1121.tar.gz)
08:52:11 (3808596): /bin/tar exited; CPU time 52.866473
08:52:11 (3808596): wrapper: running bin/acemd (--boinc --device 0)

</stderr_txt>
]]>
ID: 58626
Ian&Steve C.

Message 58627 - Posted: 12 Apr 2022, 17:17:35 UTC - in response to Message 58626.  

Still having issues with the acemd4 application.

Time exceeded 1800 seconds errors still.

Can't download the python/anaconda environment in time.

<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 1803.39 (1000000.00G/554.51G)</message>
<stderr_txt>
08:51:16 (3808596): wrapper (7.7.26016): starting
08:51:16 (3808596): wrapper (7.7.26016): starting
08:51:16 (3808596): wrapper: running /bin/tar (xf x86_64-pc-linux-gnu__cuda1121.tar.gz)
08:52:11 (3808596): /bin/tar exited; CPU time 52.866473
08:52:11 (3808596): wrapper: running bin/acemd (--boinc --device 0)

</stderr_txt>
]]>

Keith, I saw this too in your tasks list. I'm betting that the reduction in estimated flops caused this to happen. Reducing the flops estimate makes BOINC think the task will take less time to complete, and sets the timeout limit lower.
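
The number in the error message bears this out: it is simply <rsc_fpops_bound> divided by the speed BOINC credits the device with, both of which appear in parentheses in the log line above:

rsc_fpops_bound = 1_000_000.00e9  # "1000000.00G" from the error message
device_flops = 554.51e9           # "554.51G", the speed assigned to this GPU

limit_seconds = rsc_fpops_bound / device_flops
print(f"elapsed-time limit: {limit_seconds:.2f} s")  # 1803.39 s, as logged
# Reduce the fpops estimate (and with it the bound) by two orders of
# magnitude and this limit shrinks by the same factor, which is why
# tasks are now running into it.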

ID: 58627
Keith Myers

Message 58628 - Posted: 12 Apr 2022, 18:13:55 UTC

You are probably correct, Ian. I fixated on the 1800-second error since those were the errors I saw with the Python tasks.

But since this is acemd4, there's no Python involved here, I believe.

The reduction in estimated gflops was likely the culprit.
ID: 58628
Richard Haselgrove

Message 58629 - Posted: 12 Apr 2022, 19:18:35 UTC
Last modified: 12 Apr 2022, 19:29:51 UTC

Got an ACEMD4 task on Linux: T0_NNPMM_frag_00-RAIMIS_NNPMM-1-3-RND2497_5. The five previous attempts have all timed out, at between 2,400 and 5,000 seconds.

My metrics are
<flops>181962433195.469788</flops>
<rsc_fpops_est>1000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>1000000000000000.000000</rsc_fpops_bound>
<duration_correction_factor>12.396821</duration_correction_factor>

size / speed gives 5.5 seconds uncorrected estimate. With DCF, that becomes 68 seconds, and that's what's displayed in BOINC Manager.

'bound' is 1000 x 'est', so it will time out in 5,500 seconds (if DCF is ignored, as I suspect it is). I'll bump them both by 1000 x, and see how it fares while I'm out.
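
Working those metrics through (same numbers as above) reproduces the figures in this post:

flops = 181_962_433_195.47  # <flops> assigned to this app version
fpops_est = 1e12            # <rsc_fpops_est>
fpops_bound = 1e15          # <rsc_fpops_bound>
dcf = 12.396821             # <duration_correction_factor>

print(f"uncorrected estimate: {fpops_est / flops:.1f} s")            # ~5.5 s
print(f"displayed estimate:   {fpops_est / flops * dcf:.0f} s")      # ~68 s
print(f"time-out if DCF is ignored: {fpops_bound / flops:,.0f} s")   # ~5,500 s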

Edit - with a new estimate of 18.5 hours, and a 24-hour deadline, it's gone straight into panic mode. Should get an idea how it's doing before I go to bed.
ID: 58629