Message boards : News : ACEMD 4
Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
This project needs to get its networking in order. That's what I've been saying often enough in the past.
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
It's also interesting to see that this new ACEMD4 application does not have the same high PCIe bus use as the ACEMD3 app. That should allow faster processing on systems with smaller bus widths (cards on USB risers, cards plugged in via chipsets, older systems with PCIe 2.0 or less, etc.).

It seems fairly bound on memory bandwidth, though. My 3080 Ti is using up to about 80% of the memory bus, which is a bit higher than with the ACEMD3 app, so fast cards with a smaller memory bus will be more constrained. But this is better for speed, I think: reaching back and forth to GPU RAM is a lot faster than reaching back and forth over the PCIe bus to system RAM.

Still curious about the constant unpacking of the compressed file for every task. It wastes 5 minutes per task doing the same thing over and over; if you just left it unpacked, you'd save 5 minutes on each subsequent task.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
> still curious about the constant unpacking of the compressed file for every task. wastes 5 mins for each task doing the same thing over and over. if you just leave it unpacked then you save 5 mins on each subsequent task.

It's BOINC that empties the slot directory when a task has finished uploading, exited and reported. BOINC will check again that the allocated slot is still empty before starting a new task. It won't re-use an old slot if there's anything left behind, whether it's (a) the same project, same application, (b) the same project, different application, or (c) a different project entirely.

The slot directory is also the 'working' directory in operating system terms, and both the operating system and the GPUGrid project use it in that sense. To use a different location for persistent files would require some effort in modifying the PATH environment variable to let GPUGrid run.

Personally, I suspect the "everything, including the kitchen sink" compressed files are over-specified. The 17,298 items (10.3 GB) I found in there yesterday feel like an 'oh, include it, just in case' solution. When testing is complete and production is about to start, perhaps the project could audit the compressed archives and strip them back to the bare minimum?
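For anyone who wants to check figures like that on their own machine, a minimal sketch using Python's tarfile module. The archive name is taken from a download log later in this thread and may differ on your host; this is just an illustration, not how BOINC itself handles the package:

```python
import tarfile

# Count the members of the app package and sum their uncompressed sizes,
# to check figures like the "17,298 items, 10.3 GB" quoted above.
def audit_archive(path: str) -> None:
    count = 0
    total_bytes = 0
    with tarfile.open(path, mode="r:gz") as tar:
        for member in tar:  # iterating yields TarInfo objects
            count += 1
            total_bytes += member.size
    print(f"{count} items, {total_bytes / 1e9:.1f} GB uncompressed")

audit_archive("x86_64-pc-linux-gnu__cuda1121.tar.gz")
```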
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
Another problem in P0_NNPMM_2p95_19-RAIMIS_NNPMM-0-20-RND5821_0 (task 32755591):

Exit status 198 (0xc6) EXIT_MEM_LIMIT_EXCEEDED
working set size > client RAM limit: 14361.16MB > 14353.11MB
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Unless they build a single binary file for processing, like most other projects do. Then they'd just dump the binary into the project folder and it would get used over and over.
Joined: 2 Jul 16 · Posts: 338 · Credit: 7,987,341,558 · RAC: 259
> Another problem in P0_NNPMM_2p95_19-RAIMIS_NNPMM-0-20-RND5821_0 (task 32755591)

I had one too. What kind of app needs 40 GB of memory?

working set size > client RAM limit: 38573.95MB > 38554.31MB

For the next user it was aborted by the project. I've had 3 canceled by the server as well, 2 while running.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
And I had P0_NNPMM_1hpo_19-RAIMIS_NNPMM-1-20-RND4653_0 cancelled as well, same machine. Same sequence, also ran ~50 minutes. Maybe somebody pulled the batch?
Joined: 26 Mar 18 · Posts: 7 · Credit: 0 · RAC: 0
Hello everybody,

Thank you for your feedback on the ACEMD 4 app. Responses to the reported issues:
1. The GPUGRID server is connected to a university network with substantial bandwidth, so it is not the limiting factor. Most likely, your ISPs are throttling the download speed. As mentioned before, the software package has grown due to PyTorch, which by itself takes ~1 GB. I have removed some more unnecessary files, but it is still ~3 GB. The software package is cached by the BOINC client, so it is only downloaded once for each app version and then reused.

2. The WUs have to download ~500 MB of input files. At the moment, I cannot do much about this, but it will be reduced eventually.

3. I have changed to a different format (gzip), so now it takes 1-2 minutes to decompress (a sketch of the equivalent extraction step follows this list). As a side note, the ACEMD 3 app does the same, but it uses a built-in ZIP decompressor, which doesn't report that in the log. In the case of the ACEMD 4 app, the built-in decompressor doesn't support files >2 GB, so I had to add the decompression as a separate task.

4. I have fixed a memory leak. Now it should consume a reasonable amount of memory (2-4 GB).
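For reference on point 3, a minimal sketch of what that separate decompression step amounts to, in Python's tarfile (an illustration only: the wrapper actually invokes /bin/tar, as the logs further down in this thread show):

```python
import tarfile

# Stream-extract the gzip'd tar package. A sequential tar stream has no
# 2 GB member limit, unlike a non-Zip64 ZIP decompressor.
def unpack_package(archive: str, dest: str = ".") -> None:
    with tarfile.open(archive, mode="r:gz") as tar:
        tar.extractall(path=dest)

unpack_package("x86_64-pc-linux-gnu__cuda1121.tar.gz")
```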
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Thanks for giving some attention to the package decompression :). It's much faster now.

Another comment I have is regarding the credit reward for the ACEMD4 tasks. They seem to be set to a static 1,500 credits. It might make sense to implement the same credit reward model as the ACEMD3 tasks (higher credit, and actually scaled to difficulty/runtime).
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
Thank you for your responses and explanations. Two further points - one an amplification, and the other something different.
I'm on the end of a pretty stable 70 megabit ISP download connection, and I don't think they are throttling it either. I think the answer probably lies in the intermediate routing: the switching centres and interconnectors that the signals pass through between the University and our ISPs. Sometimes they work well, sometimes they drop packets, and sometimes they slow to a crawl. I don't think that's something that can be fixed at the research lab level, but it might be worth mentioning upstream to the University's technical support team; they might be able to monitor it.

On another tack, and this applies to all the GPUGrid researchers as a group: it would help the work proceed more smoothly if you could find a way of paying more careful attention to the metadata which BOINC passes downstream to our computers with each task. The key value is the estimated size, the <rsc_fpops_est>, of each task. At the moment, I have various machines working on:

- AbNTest_micro tasks for ADRIA, which run for over a day
- AbNTest_counts tasks for ADRIA, which run for about an hour
- Today's NNPMM task from your good self, which looks set to run for about 8 hours

Earlier test runs only lasted a few minutes, but all seem to be given the same project-standard <rsc_fpops_est> setting of 5,000,000,000,000,000.

The BOINC client uses the fpops estimate, plus its own measurement of elapsed time, to keep track of the effective speed of our machines, and thus the anticipated runtime of newly downloaded tasks. It's tedious, I know, but if the task size estimate isn't routinely adjusted to take account of the testing being undertaken, anticipated runtimes can get seriously distorted. In the worst case, a succession of short tests (if not described accurately) can make our BOINC clients think that our machines have suddenly become many times faster, and can even cause 'normal' tasks to be aborted for taking too long.

Experienced volunteers can anticipate and work through these problems, but the project's main work will proceed more smoothly if they don't arise in the first place.
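To make the mechanism concrete, a simplified sketch of the estimate arithmetic (an illustration only: the real client also folds in its duration correction factor and other per-project state):

```python
# First-order version of how the BOINC client turns <rsc_fpops_est>
# into an anticipated runtime for a newly downloaded task.
def estimated_runtime_s(rsc_fpops_est: float, effective_flops: float) -> float:
    return rsc_fpops_est / effective_flops

# All three task types above carry the same 5e15 fpops default, so on a
# host rated at, say, 5e11 effective flops each one is estimated at
# ~10,000 s, whether it actually runs for an hour or for over a day.
print(estimated_runtime_s(5e15, 5e11))  # 10000.0 seconds
```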
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Yeah, I agree about the estimated flops size. It throws things all out of whack. My current task, which will run for about 3 hours, started with an estimated runtime of 10 days, lol.
Joined: 26 Nov 20 · Posts: 1 · Credit: 1,333,655,578 · RAC: 43
Hello everyone,

After restarting the boinc-client.service, the progress indication reset from about 35% to just 1% and is working from there (ACEMD 4 simulations for GPUs 1.03). Rather surprising that the task did not manage to save results and resume from there, as other applications do. Any explanation?

Best,
Sven
Joined: 2 Jul 16 · Posts: 338 · Credit: 7,987,341,558 · RAC: 259
> Thank you for your responses and explanations. Two further points - one an amplification, and the other something different.

I agree, it's not on our end either. As mentioned somewhere, a pause and resume of networking in the client can speed up the download. I did this on the 3 GB download.

The site is often slow and will time out. Once it has timed out, it will reload on a refresh. I'd put more weight on a DDoS-type restriction somewhere: too many requests and the speed drops or is cut off.

The app rename with v3 vs v4 in the name makes things easier.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
Another 3+ GB download, for ACEMD4 v1.03.

This one is coming down at 6 MB/sec (so far; the average speed figure is still stabilising while the data download shares the connection), but it looks on target to finish within 5 minutes. Not a problem by itself, but users with slow connections might choose to delay requesting new work until the surge is over.

08/03/2022 10:28:57 | GPUGRID | Started download of x86_64-pc-linux-gnu__cuda1121.tar.gz.0f9ebf1ac84d8d1f5ae2c260dc903be9
08/03/2022 10:36:55 | GPUGRID | Finished download of x86_64-pc-linux-gnu__cuda1121.tar.gz.0f9ebf1ac84d8d1f5ae2c260dc903be9

OK, 7 minutes 58 seconds. And with a six-day estimate and a one-day deadline, the previous task was booted aside and the new task started immediately. That's called EDF (Earliest Deadline First), or Panic Mode On!
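A quick sanity check on that download timing, with the approximate figures quoted above:

```python
# ~3 GB at a steady ~6 MB/s:
size_bytes = 3.0e9
rate_bytes_per_s = 6.0e6
print(size_bytes / rate_bytes_per_s / 60)  # ~8.3 minutes, close to the 7:58 logged
```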
ServicEnginIC
Joined: 24 Sep 10 · Posts: 592 · Credit: 11,972,186,510 · RAC: 1,447
> another comment I have is regarding the credit reward for the ACEMD4 tasks. they seem to be set to a static 1500cred. it might make sense to implement the same credit reward model from the ACEMD3 tasks (higher credit and actually scaled to difficulty/runtime).

I do agree that credits granted for ACEMD4 tasks are very undervalued, compared to the same processing times for tasks from other projects, or even ACEMD3 tasks from this same project. For example, on Host #186626 (converted to a common per-hour scale in the sketch below):

- PrimeGrid Genefer 18 tasks: ~1,250 seconds processing time --> 1,750 credits
- GPUGrid ACEMD3 ADRIA KIXCMYB tasks: ~82,000 seconds processing time --> 540,000 credits (+50% bonus; base credits: 360,000)
- GPUGrid ACEMD3 ADRIA e100 tasks: ~3,300 seconds processing time --> 27,000 credits
- GPUGrid ACEMD4 RAIMIS tasks: ~25,600 seconds processing time --> 1,500 credits (?)
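A back-of-the-envelope conversion of those figures to credits per hour, so they can be compared directly:

```python
# (runtime in seconds, credits granted) for each example quoted above.
examples = {
    "PrimeGrid Genefer 18":         (1250,  1750),
    "GPUGrid ACEMD3 ADRIA KIXCMYB": (82000, 540000),
    "GPUGrid ACEMD3 ADRIA e100":    (3300,  27000),
    "GPUGrid ACEMD4 RAIMIS":        (25600, 1500),
}
for name, (seconds, credits) in examples.items():
    print(f"{name}: {credits / (seconds / 3600):,.0f} credits/hour")
# Roughly 5,000 / 24,000 / 29,000 credits per hour for the first three,
# versus ~200 credits per hour for the ACEMD4 RAIMIS tasks.
```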
Joined: 26 Mar 18 · Posts: 7 · Credit: 0 · RAC: 0
Hello everybody,

A quick update on the ACEMD 4 app:

1. I have re-calibrated the estimation of credits. Now the granted credits will be more in line with ACEMD 3, i.e. 1 hour of NVIDIA RTX 2080 Ti computation is valued at 60,000 credits, not including the additional bonuses.

2. Currently there is no automated mechanism to set the flops; the app maintainers just set some arbitrary numbers. I have decreased the flops estimate by two orders of magnitude. Hopefully, this is more in line with the actual work.
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
Still having issues with the acemd4 application. Time-exceeded (1800 seconds) errors still; it can't download the python/anaconda environment in time.

<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 1803.39 (1000000.00G/554.51G)</message>
<stderr_txt>
08:51:16 (3808596): wrapper (7.7.26016): starting
08:51:16 (3808596): wrapper (7.7.26016): starting
08:51:16 (3808596): wrapper: running /bin/tar (xf x86_64-pc-linux-gnu__cuda1121.tar.gz)
08:52:11 (3808596): /bin/tar exited; CPU time 52.866473
08:52:11 (3808596): wrapper: running bin/acemd (--boinc --device 0)
</stderr_txt>
]]>
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
> Still having issues with the acemd4 application.

Keith, I saw this too in your tasks list. I'm betting the reduction in estimated flops caused this: reducing the flops estimate makes BOINC think a task will take less time to complete, and it sets the timeout limit correspondingly lower.
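The numbers in the log above bear this out; the elapsed-time limit is just the fpops bound divided by the host's estimated speed:

```python
# Reconstructing the 1803.39 s limit from the log line
# "exceeded elapsed time limit 1803.39 (1000000.00G/554.51G)":
rsc_fpops_bound = 1_000_000.00e9   # "1000000.00G" fpops, from the log
host_speed_flops = 554.51e9        # "554.51G" flops, from the log
print(rsc_fpops_bound / host_speed_flops)  # ~1803.39 seconds
```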
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
You are probably correct, Ian. I fixated on the 1800-seconds error since those were the errors I saw with the Python tasks. But since this is acemd4, there's no Python involved here, I believe. The reduction in estimated GFLOPS was likely the culprit.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
Got an ACEMD4 task on Linux: T0_NNPMM_frag_00-RAIMIS_NNPMM-1-3-RND2497_5. The five previous attempts have all timed out, at between 2,400 seconds and 5,000 seconds. My metrics are:

<flops>181962433195.469788</flops>
<rsc_fpops_est>1000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>1000000000000000.000000</rsc_fpops_bound>
<duration_correction_factor>12.396821</duration_correction_factor>

Size / speed gives a 5.5-second uncorrected estimate. With DCF, that becomes 68 seconds, and that's what's displayed in BOINC Manager. 'bound' is 1000 x 'est', so it will time out in 5,500 seconds (if DCF is ignored, as I suspect it is). I'll bump them both by 1000x and see how it fares while I'm out.

Edit: with a new estimate of 18.5 hours and a 24-hour deadline, it's gone straight into panic mode. Should get an idea of how it's doing before I go to bed.
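Those figures can be reproduced directly from the metrics quoted above (assuming, as suspected, that the bound is applied without DCF):

```python
flops = 181_962_433_195.469788   # host speed estimate, from <flops>
rsc_fpops_est = 1e12             # from <rsc_fpops_est>
rsc_fpops_bound = 1e15           # from <rsc_fpops_bound>
dcf = 12.396821                  # from <duration_correction_factor>

print(rsc_fpops_est / flops)        # ~5.5 s, the uncorrected estimate
print(rsc_fpops_est / flops * dcf)  # ~68 s, as shown in BOINC Manager
print(rsc_fpops_bound / flops)      # ~5,500 s, the timeout if DCF is ignored
```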