ACEMD 4

Erich56

Message 58439 - Posted: 5 Mar 2022, 8:32:19 UTC - in response to Message 58438.  

This project needs to get their networking in order.

That's what I have been saying often enough in the past.
ID: 58439
Ian&Steve C.

Message 58440 - Posted: 5 Mar 2022, 15:48:15 UTC

It's also interesting to see that this new ACEMD4 application does not have the same high PCIe bus use as the ACEMD3 app. That should allow faster processing on systems with smaller bus widths (cards on USB risers, cards plugged in via chipsets, older systems with PCIe 2.0 or less, etc.).

It seems fairly bound by memory bandwidth, though. My 3080 Ti is using up to about 80% of the memory bus, which is a bit higher than the ACEMD3 app; fast cards with a smaller bus will be more bound. But I think this is better for speed: reaching back and forth to GPU RAM is a lot faster than reaching back and forth over the PCIe bus to system RAM.

Still curious about the constant unpacking of the compressed file for every task. It wastes 5 minutes per task doing the same thing over and over; if you just left it unpacked, you would save 5 minutes on each subsequent task.
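
For anyone who wants to check the memory-bus observation on their own card, here is a small sketch that polls nvidia-smi (utilization.memory is the percentage of time the memory controller was busy, which is the figure being discussed; the GPU index and the one-second loop are just illustrative):

import subprocess, time

# Query GPU 0's core and memory-controller utilization via nvidia-smi.
QUERY = ["nvidia-smi", "-i", "0",
         "--query-gpu=utilization.gpu,utilization.memory",
         "--format=csv,noheader,nounits"]

for _ in range(10):  # ten one-second samples
    gpu_pct, mem_pct = subprocess.check_output(QUERY, text=True).strip().split(", ")
    print(f"GPU core: {gpu_pct}%   memory controller: {mem_pct}%")
    time.sleep(1)
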
ID: 58440
Richard Haselgrove

Message 58441 - Posted: 5 Mar 2022, 16:33:16 UTC - in response to Message 58440.  

still curious about the constant unpacking of the compressed file for every task. wastes 5 mins for each task doing the same thing over and over. if you just leave it unpacked then you save 5 mins on each subsequent task.

It's BOINC that empties the slot directory when a task has finished uploading, exited and reported. BOINC will check again that the allocated slot is still empty before starting a new task. It won't re-use an old slot if there's anything left behind, whether it's (a) the same project, same application, (b) the same project, different application, or (c) a different project entirely.

The slot directory is also the 'working' directory in operating system terms, and both the operating system and the GPUGrid project use it in that sense. To use a different location for persistent files would require some effort in modifying the Path environment to let GPUGrid run.

Personally, I suspect the "everything, including the kitchen sink" compressed files are perhaps over-specified. The 17,298 items (10.3 GB) I found in there yesterday feel like an 'oh, include it, just in case' solution. When testing is complete and production is about to start, perhaps the project could audit the compressed archives and strip them back to the bare minimum?
ID: 58441
Richard Haselgrove

Message 58442 - Posted: 5 Mar 2022, 17:14:48 UTC

Another problem in P0_NNPMM_2p95_19-RAIMIS_NNPMM-0-20-RND5821_0 (task 32755591):

Exit status 198 (0xc6) EXIT_MEM_LIMIT_EXCEEDED
working set size > client RAM limit: 14361.16MB > 14353.11MB
ID: 58442
Ian&Steve C.

Message 58443 - Posted: 5 Mar 2022, 17:50:37 UTC - in response to Message 58441.  

Unless they build a single binary file for processing, like most other projects do. Then they would just dump the binary into the projects folder and it would get used over and over.
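
Purely to illustrate the "unpack once and reuse" idea - not how the current wrapper works (the task logs further down show it re-running /bin/tar in the slot directory for every task), and the cache path here is hypothetical:

import os
import tarfile

PACKAGE = "x86_64-pc-linux-gnu__cuda1121.tar.gz"
# Hypothetical persistent location under the BOINC project directory,
# two levels up from the slots/N working directory the task runs in.
CACHE = os.path.join("..", "..", "projects", "www.gpugrid.net", "acemd4_env")

if not os.path.isdir(CACHE):
    # First task after an app update: pay the unpacking cost once.
    os.makedirs(CACHE, exist_ok=True)
    with tarfile.open(PACKAGE, "r:gz") as archive:
        archive.extractall(CACHE)

# Later tasks skip the extraction and run the cached binary directly.
print("would run:", os.path.join(CACHE, "bin", "acemd"), "--boinc --device 0")
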
ID: 58443
mmonnin

Message 58444 - Posted: 5 Mar 2022, 18:58:49 UTC - in response to Message 58442.  

Another problem in P0_NNPMM_2p95_19-RAIMIS_NNPMM-0-20-RND5821_0 (task 32755591:

Exit status 198 (0xc6) EXIT_MEM_LIMIT_EXCEEDED
working set size > client RAM limit: 14361.16MB > 14353.11MB


I had one too. What kind of app needs 40GB of memory?
working set size > client RAM limit: 38573.95MB > 38554.31MB

For the next user it was aborted by the project. I've had 3 cancelled by the server as well, 2 while running.

ID: 58444
Richard Haselgrove

Message 58445 - Posted: 5 Mar 2022, 19:22:28 UTC - in response to Message 58444.  

And I had P0_NNPMM_1hpo_19-RAIMIS_NNPMM-1-20-RND4653_0 cancelled as well, same machine. Same sequence, also ran ~50 minutes. Maybe somebody pulled the batch?
ID: 58445
Raimondas

Message 58452 - Posted: 7 Mar 2022, 13:57:10 UTC

Hello everybody,

Thank you for your feedback on the ACEMD 4 app.

Response to the reported issues:

    Long download time of the software package.
    The GPUGRID server is connected to a university network with substantial bandwidth, so it is not a limiting factor. Most likely, your ISPs are throttling the download speed. As mentioned before, the software package has grown due to PyTorch, which by itself takes ~1 GB. I have removed some more unnecessary files, but it is still ~3 GB. The software package is cached by the BOINC client, so it is only downloaded once for each app version and reused.


    Long download time of the input files.
    The WUs have to download ~500 MB of input files. At the moment, I cannot do much about this, but this will be reduced eventually.


    Long decompression time.
    I have changed to a different format (gzip), so now it takes 1-2 min to decompress. As a side note, the ACEMD 3 app does the same, but it uses a built-in ZIP decompressor, which doesn't report that in the log. In the case of the ACEMD 4 app there is an issue that the built-in decompressor doesn't support files >2 GB, so I had to add the decompression as a separate task (see the sketch after this list).


    Excessive memory usage.
    I have fixed a memory leak. Now it should consume a reasonable amount of memory (2-4 GB).
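
As referenced in the decompression point above, here is a minimal sketch of what the separate extraction step amounts to. Python's standard tarfile module stands in for the /bin/tar call that the task logs actually show; the timing print is only there for comparison with the 1-2 minutes quoted:

import tarfile
import time

PACKAGE = "x86_64-pc-linux-gnu__cuda1121.tar.gz"  # the ~3 GB software package

start = time.time()
# Done as its own step because the wrapper's built-in decompressor
# cannot handle archives larger than 2 GB.
with tarfile.open(PACKAGE, "r:gz") as archive:
    archive.extractall(".")  # unpack into the task's working directory
print(f"decompressed in {time.time() - start:.0f} s")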


If I missed something important, feel free to remind me.

Happy computing,

Raimondas

ID: 58452
Ian&Steve C.

Message 58453 - Posted: 7 Mar 2022, 15:20:23 UTC - in response to Message 58452.  
Last modified: 7 Mar 2022, 15:25:58 UTC

Thanks for giving some attention to the package decompression :). Much faster now.

Another comment I have is regarding the credit reward for the ACEMD4 tasks. They seem to be set to a static 1,500 credits. It might make sense to implement the same credit reward model as the ACEMD3 tasks (higher credit, and actually scaled to difficulty/runtime).
ID: 58453
Richard Haselgrove

Message 58454 - Posted: 7 Mar 2022, 16:59:33 UTC - in response to Message 58452.  

Thank you for your responses and explanations. Two further points - one an amplification, and the other something different.

    Long download time of the software package.
    The GPUGRID server is connected to a university network with a substantial bandwidth, so it is not a limiting factor. Most likely, your ISPs are throttling the download speed. As mentioned before, the software package has increased due to PyTorch, which by itself takes ~ 1 GB. I have removed some more unnecessary files, but still it is ~3 GB. The software package is cached by the Boinc client, so it is only downloaded once for each app version and reused.


I'm on the end of a pretty stable 70 megabit ISP download connection - I don't think they are throttling it, either. I think the answer probably lies in the intermediate routing - the switching centres and interconnectors that the signals pass through between the University and our ISPs. Sometimes they work well, sometimes they drop packets, and sometimes they slow to a crawl. I don't think that's something that can be fixed at the research lab level, but it might be worth mentioning it upstream to the University's technical support team - they might be able to monitor it.

On another tack - and this applies to all the GPUGrid researchers as a group - it would help the work proceed more smoothly if you could find a way of paying more careful attention to the meta-data which BOINC passes downstream to our computers with each task.

The key value is the estimated size, the <rsc_fpops_est>, of each task. At the moment, I have various machines working on:

AbNTest_micro tasks for ADRIA, which run for over a day
AbNTest_counts tasks for ADRIA, which run for about an hour
Today's NNPMM task from your good self, which looks set to run for about 8 hours.

Earlier test runs only lasted a few minutes, but all seem to be given the same project standard <rsc_fpops_est> setting of 5,000,000,000,000,000.

The BOINC client uses the fpops estimate, plus its own measurement of elapsed time, to keep track of the effective speed of our machines, and thus the anticipated runtime of newly downloaded tasks.

It's tedious, I know, but if the task size estimate isn't routinely adjusted to take account of the testing being undertaken, anticipated runtimes can get seriously distorted. In the worst case, a succession of short tests (if not described accurately) can make our BOINC clients think that our machines have become suddenly many times faster, and can even cause 'normal' tasks to be aborted for taking too long. Experienced volunteers can anticipate and work through these problems, but the project's main work will proceed more smoothly if they don't arise in the first place.
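
To make the distortion concrete, here is a toy calculation using the project-standard estimate quoted above. The 10-minute test run is a made-up example, and the real client also applies a duration correction factor and smooths over several tasks rather than reacting to a single result, but the mechanism is the same:

FPOPS_EST = 5_000_000_000_000_000  # the common <rsc_fpops_est> quoted above

# A short test task finishes in 10 minutes but carries the same estimate,
# so the client infers an "effective speed" of fpops_est / elapsed ...
test_elapsed = 600  # seconds, illustrative
apparent_speed = FPOPS_EST / test_elapsed  # ~8.3e12 flops

# ... and then predicts that the next task of the same nominal size will
# also take ~10 minutes, even if it is really an 8-hour production run.
predicted = FPOPS_EST / apparent_speed
print(f"apparent speed: {apparent_speed:.1e} flops")
print(f"predicted runtime for the next task: {predicted / 60:.0f} min")
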
ID: 58454
Ian&Steve C.

Message 58455 - Posted: 7 Mar 2022, 17:35:58 UTC - in response to Message 58454.  

Yeah, I agree about the estimated flops size. It throws things all out of whack. My current task, which will run for about 3 hours, started with an estimated runtime of 10 days, lol.
ID: 58455
Sven

Message 58457 - Posted: 7 Mar 2022, 21:12:06 UTC

Hello everyone,

After restarting the boinc-client.service, the progress indication reset from about 35% to just 1% and is working up from there (ACEMD 4 simulations for GPUs 1.03). Rather surprising that the task did not manage to save its progress and resume from there, as other applications do.

Any explanation?

Best
Sven
ID: 58457
mmonnin

Message 58459 - Posted: 7 Mar 2022, 22:42:11 UTC - in response to Message 58454.  
Last modified: 7 Mar 2022, 22:44:01 UTC

Thank you for your responses and explanations. Two further points - one an amplification, and the other something different.

    Long download time of the software package.
    The GPUGRID server is connected to a university network with a substantial bandwidth, so it is not a limiting factor. Most likely, your ISPs are throttling the download speed. As mentioned before, the software package has increased due to PyTorch, which by itself takes ~ 1 GB. I have removed some more unnecessary files, but still it is ~3 GB. The software package is cached by the Boinc client, so it is only downloaded once for each app version and reused.


I'm on the end of a pretty stable 70 megabit ISP download connection - I don't think they are throttling it, either. I think the answer probably lies in the intermediate routing - the switching centres and interconnectors that the signals pass through between the University and our ISPs. Sometimes they work well, sometimes they drop packets, and sometimes they slow to a crawl. I don't think that's something that can be fixed at the research lab level, but it might be worth mentioning it upstream to the University's technical support team - they might be able to monitor it.


I agree, it's not on our end either. As mentioned somewhere, a pause and resume of networking in the client can speed up the download. I did this on the 3 GB download.

The site is often slow and will time out. Once it times out, it will reload on a refresh. I'd put more weight on a DDoS-type restriction somewhere: too many requests and the speed drops or is cut off.

The app rename with v3 vs v4 in the name makes things easier.
ID: 58459
Richard Haselgrove

Message 58462 - Posted: 8 Mar 2022, 10:44:15 UTC

Another 3+ GB download, for ACEMD4 v1.03

This one is coming down at 6MB/sec (so far - the average speed figure is still stabilising while the data download shares the connection), but it looks on target to finish within 5 minutes. Not a problem by itself, but users with slow connections might choose to delay requesting new work until the surge is over.

08/03/2022 10:28:57 | GPUGRID | Started download of x86_64-pc-linux-gnu__cuda1121.tar.gz.0f9ebf1ac84d8d1f5ae2c260dc903be9
08/03/2022 10:36:55 | GPUGRID | Finished download of x86_64-pc-linux-gnu__cuda1121.tar.gz.0f9ebf1ac84d8d1f5ae2c260dc903be9

OK, 7 minutes 58 seconds. And with a six-day estimate and a one-day deadline, the previous task was booted aside and the new task started immediately. That's called EDF (Earliest Deadline First), or Panic Mode On!
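
For the record, the back-of-the-envelope arithmetic on that download (the package size is the approximate "3+ GB" figure from above):

size_mb = 3.2 * 1024  # "3+ GB" package, approximate
rate_mb_s = 6         # speed observed early in the download

minutes = size_mb / rate_mb_s / 60
print(f"~{minutes:.0f} minutes at a steady {rate_mb_s} MB/s")
# ~9 minutes - roughly consistent with the 7 min 58 s actually logged,
# once the average rate crept a little above 6 MB/s.
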
ID: 58462
ServicEnginIC

Message 58482 - Posted: 10 Mar 2022, 11:33:48 UTC - in response to Message 58453.  

another comment I have is regarding the credit reward for the ACEMD4 tasks. they seem to be set to a static 1500cred. it might make sense to implement the same credit reward model from the ACEMD3 tasks (higher credit and actually scaled to difficulty/runtime).

I do agree that the credits granted for ACEMD4 tasks are very undervalued, compared to the same processing times for tasks from other projects, or even ACEMD3 tasks from this same project.
For example, on Host #186626 (credits per hour are worked out in the sketch below):
PrimeGrid Genefer 18 tasks: ~1250 seconds processing time --> 1750 credits
GPUGrid ACEMD3 ADRIA KIXCMYB tasks: ~82000 seconds processing time --> 540000 credits (+50% bonus, base credits: 360000)
GPUGrid ACEMD3 ADRIA e100 tasks: ~3300 seconds processing time --> 27000 credits
GPUGrid ACEMD4 RAIMIS tasks: ~25600 seconds processing time --> 1500 credits (?)
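
Putting those examples on a common credits-per-hour scale (runtimes and credits exactly as listed above) makes the gap obvious:

tasks = {
    "PrimeGrid Genefer 18":        (1_250, 1_750),
    "ACEMD3 ADRIA KIXCMYB (+50%)": (82_000, 540_000),
    "ACEMD3 ADRIA e100":           (3_300, 27_000),
    "ACEMD4 RAIMIS":               (25_600, 1_500),
}

for name, (seconds, credits) in tasks.items():
    per_hour = credits / (seconds / 3600)
    print(f"{name:30s} {per_hour:>10,.0f} credits/hour")
# Roughly 5,000 / 24,000 / 29,000 credits per hour for the first three,
# against about 200 credits per hour for the ACEMD4 RAIMIS tasks.
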
ID: 58482
Raimondas

Message 58623 - Posted: 12 Apr 2022, 12:17:05 UTC

Hello everybody,

A quick update on the ACEMD 4 app:

    Credits
    I have re-calibrated the estimation of credits. Now the granted credits will be more in line with ACEMD 3, i.e. 1 hour of NVIDIA RTX 2080 Ti calculation is valued at 60,000 credits, not including the additional bonuses (see the sketch below this list).


    Estimated flops
    Currently there is no automated mechanism to set the flops; the app maintainers just set some arbitrary numbers. I have decreased the flops estimate by two orders of magnitude. Hopefully, this is more in line with the actual work.
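
As mentioned above, here is a minimal sketch of the stated reference point for the new credit calibration. How other cards are normalized to an RTX 2080 Ti is not spelled out here, so the speed factor below is purely an assumed parameter:

CREDITS_PER_2080TI_HOUR = 60_000  # the re-calibrated rate stated above

def estimated_base_credit(runtime_hours, speed_vs_2080ti=1.0):
    """Base credit before bonuses. speed_vs_2080ti is an assumed scaling
    factor for cards faster or slower than the reference RTX 2080 Ti."""
    return runtime_hours * speed_vs_2080ti * CREDITS_PER_2080TI_HOUR

print(estimated_base_credit(2.5))  # a 2.5-hour run on a 2080 Ti -> 150000.0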



What happens next? Today I have sent several test WUs. If no issues are discovered, I will send ~1300 production WUs.

Happy computing,

Raimondas

ID: 58623
Keith Myers

Message 58626 - Posted: 12 Apr 2022, 17:05:11 UTC
Last modified: 12 Apr 2022, 17:07:00 UTC

Still having issues with the acemd4 application.

Still getting the "time exceeded 1800 seconds" errors.

Can't download the python/anaconda environment in time.

<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 1803.39 (1000000.00G/554.51G)</message>
<stderr_txt>
08:51:16 (3808596): wrapper (7.7.26016): starting
08:51:16 (3808596): wrapper (7.7.26016): starting
08:51:16 (3808596): wrapper: running /bin/tar (xf x86_64-pc-linux-gnu__cuda1121.tar.gz)
08:52:11 (3808596): /bin/tar exited; CPU time 52.866473
08:52:11 (3808596): wrapper: running bin/acemd (--boinc --device 0)

</stderr_txt>
]]>
ID: 58626
Ian&Steve C.

Message 58627 - Posted: 12 Apr 2022, 17:17:35 UTC - in response to Message 58626.  

Still having issues with the acemd4 application.

Time exceeded 1800 seconds errors still.

Can't download the python/anaconda environment in time.

<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 1803.39 (1000000.00G/554.51G)</message>
<stderr_txt>
08:51:16 (3808596): wrapper (7.7.26016): starting
08:51:16 (3808596): wrapper (7.7.26016): starting
08:51:16 (3808596): wrapper: running /bin/tar (xf x86_64-pc-linux-gnu__cuda1121.tar.gz)
08:52:11 (3808596): /bin/tar exited; CPU time 52.866473
08:52:11 (3808596): wrapper: running bin/acemd (--boinc --device 0)

</stderr_txt>
]]>

Keith, I saw this too in your tasks list. I'm betting that the reduction in estimated flops caused this to happen. Reducing the flops estimate makes BOINC think the task will take less time to complete, and sets the timeout limit lower.
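
The number in the error message bears this out: it is simply <rsc_fpops_bound> divided by the speed BOINC credits the device with, both of which appear in parentheses in the log line above:

rsc_fpops_bound = 1_000_000.00e9  # "1000000.00G" from the error message
device_flops = 554.51e9           # "554.51G", the speed assigned to this GPU

limit_seconds = rsc_fpops_bound / device_flops
print(f"elapsed-time limit: {limit_seconds:.2f} s")  # 1803.39 s, as logged
# Reduce the fpops estimate (and with it the bound) by two orders of
# magnitude and this limit shrinks by the same factor, which is why
# tasks are now running into it.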

ID: 58627
Keith Myers

Message 58628 - Posted: 12 Apr 2022, 18:13:55 UTC

You are probably correct, Ian. I fixated on the 1800-second error since those were the errors I saw with the Python tasks.

But since this is acemd4, there's no Python involved here, I believe.

The reduction in estimated gflops was likely the culprit.
ID: 58628
Richard Haselgrove

Message 58629 - Posted: 12 Apr 2022, 19:18:35 UTC
Last modified: 12 Apr 2022, 19:29:51 UTC

Got an ACEMD4 task on Linux: T0_NNPMM_frag_00-RAIMIS_NNPMM-1-3-RND2497_5. The five previous attempts have all timed out, at between 2,400 and 5,000 seconds.

My metrics are
<flops>181962433195.469788</flops>
<rsc_fpops_est>1000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>1000000000000000.000000</rsc_fpops_bound>
<duration_correction_factor>12.396821</duration_correction_factor>

size / speed gives 5.5 seconds uncorrected estimate. With DCF, that becomes 68 seconds, and that's what's displayed in BOINC Manager.

'bound' is 1000 x 'est', so it will time out in 5,500 seconds (if DCF is ignored, as I suspect it is). I'll bump them both by 1000 x, and see how it fares while I'm out.
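
Working those metrics through (same numbers as above) reproduces the figures in this post:

flops = 181_962_433_195.47  # <flops> assigned to this app version
fpops_est = 1e12            # <rsc_fpops_est>
fpops_bound = 1e15          # <rsc_fpops_bound>
dcf = 12.396821             # <duration_correction_factor>

print(f"uncorrected estimate: {fpops_est / flops:.1f} s")            # ~5.5 s
print(f"displayed estimate:   {fpops_est / flops * dcf:.0f} s")      # ~68 s
print(f"time-out if DCF is ignored: {fpops_bound / flops:,.0f} s")   # ~5,500 s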

Edit - with a new estimate of 18.5 hours, and a 24-hour deadline, it's gone straight into panic mode. Should get an idea how it's doing before I go to bed.
ID: 58629