Message boards : News : ACEMD 4
Hello everybody,

What is the difference between ACEMD 3 and ACEMD 4? A key new feature is the integration of machine learning into molecular simulations. Specifically, we are implementing a new method called NNP/MM.

What is NNP/MM? NNP/MM is a hybrid simulation method combining neural network potentials (NNP) and molecular mechanics (MM). An NNP can model molecular interactions more accurately than the conventional force fields used in MM, but it is still not as fast as MM. Thus, only the important part of a molecular system is simulated with the NNP, while the rest is simulated with MM. You can read more in a pre-print of the NNP/MM article: https://arxiv.org/abs/2201.08110

How much more accurate is NNP? You can read a pre-print of the TorchMD-NET article: https://arxiv.org/abs/2202.02541

What are the software/hardware requirements for ACEMD 4? Pretty much the same as for ACEMD 3. The only significant change is the size of the software stack: ACEMD 3 and all its dependencies need just 1 GB, while for ACEMD 4 that has increased to 3 GB, notably due to PyTorch (https://pytorch.org). Also, at the moment, only a Linux version for CUDA >= 11.2 is available.

When will ACEMD 3 be replaced by ACEMD 4? Within a few months, we will release ACEMD 4 officially and ACEMD 3 will be deprecated. For a while, the apps will coexist and you will receive WUs for both of them.

What will happen next? We have already sent several WUs to test the deployment of ACEMD 4 and will continue this week. Let us know if you notice any irregularities. Next week, we are aiming to start sending production WUs.
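To give a rough mental model of how the partitioning works, here is a minimal PyTorch-style sketch. The nnp_model and mm_energy callables are hypothetical names for illustration only, not the actual ACEMD internals:

import torch

def hybrid_energy(positions, nnp_atoms, nnp_model, mm_energy):
    """Sketch of an NNP/MM hybrid potential (illustrative only)."""
    # MM force field covers the whole system, minus the internal
    # interactions of the NNP-treated region (e.g. a ligand).
    e_mm = mm_energy(positions, exclude=nnp_atoms)
    # The NNP evaluates only the small, chemically interesting region.
    e_nnp = nnp_model(positions[nnp_atoms])
    # Total potential energy; forces then follow from autograd, e.g.
    # forces = -torch.autograd.grad(energy, positions)[0]
    return e_mm + e_nnp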
| |
ID: 58395 | Rating: 0 | rate: / Reply Quote | |
Thanks for the explanations! I'll be reading those preprints soon. | |
ID: 58396 | Rating: 0 | rate: / Reply Quote | |
Thanks! I have added the ACEMD 4 option. Why are there 2 Use CPU boxes? Where do you see these boxes? | |
ID: 58397 | Rating: 0 | rate: / Reply Quote | |
Thanks! I have added the ACEMD 4 option. Click on Edit Preferences and there's one at the top and one at the bottom. | |
ID: 58398 | Rating: 0 | rate: / Reply Quote | |
Thanks! I have added the ACEMD 4 option. Not seeing this on my account. | |
ID: 58399 | Rating: 0 | rate: / Reply Quote | |
Resource share This is what my default location looks like. The CPU is at the top and bottom. | |
ID: 58400 | Rating: 0 | rate: / Reply Quote | |
Resource share OK, after refreshing the browser, I am seeing use cpu in both places also. | |
ID: 58401 | Rating: 0 | rate: / Reply Quote | |
Thanks! I have added the ACEMD 4 option. In Project Preferences | |
ID: 58402 | Rating: 0 | rate: / Reply Quote | |
Next week, we are aiming to start sending production WUs. Can you say at what quantities? | |
ID: 58403 | Rating: 0 | rate: / Reply Quote | |
Just tried to run one of the new v1.01 tasks. Failed with this error message: 13:16:51 (224808): wrapper: running bin/conda-unpack () Another is downloading on my second Linux machine as I type: I'll try to watch it running. Edit - second task failed the same way. Trying to work out what those unprintable (?punctuation?) characters represent. Edit2 - no, all mine are failing, and I can't work out those character codes (not a format I recognise). ˜/153 feels like a start-stop pair (open quote/close quote?), but I can't get further than that. | |
ID: 58404 | Rating: 0 | rate: / Reply Quote | |
About 400 WUs | |
ID: 58405 | Rating: 0 | rate: / Reply Quote | |
Just tried to run one of the new v1.01 tasks. Failed with this error message: Don't worry, I'll get rid of that conda-unpack. | |
ID: 58406 | Rating: 0 | rate: / Reply Quote | |
Don't worry, I'll get rid of that conda-unpack. No probs. I've just failed another one, so I'll set 'No new Tasks' until you're ready. Can you let us know when you're set for the next test, please? | |
ID: 58407 | Rating: 0 | rate: / Reply Quote | |
One out of nine has passed so far. Here's the latest failure. I can't help with what code 195 means, just sharing.
Stderr output
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message> process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
04:49:47 (986503): wrapper (7.7.26016): starting
04:49:47 (986503): wrapper (7.7.26016): starting
04:49:47 (986503): wrapper: running /bin/tar (xf x86_64-pc-linux-gnu__cuda1121.tar.bz2)
04:56:02 (986503): /bin/tar exited; CPU time 369.002826
04:56:02 (986503): wrapper: running bin/conda-unpack ()
/usr/bin/env: ‘python’: No such file or directory
04:56:03 (986503): bin/conda-unpack exited; CPU time 0.001065
04:56:03 (986503): app exit status: 0x7f
04:56:03 (986503): called boinc_finish(195)
</stderr_txt>
]]> | |
ID: 58408 | Rating: 0 | rate: / Reply Quote | |
Your computers are hidden, so I can't see the details of the task which succeeded. Was it the most recent (implying that Raimondas has fixed conda-unpack already), or an old v100 left over from the previous (working) run? | |
ID: 58409 | Rating: 0 | rate: / Reply Quote | |
Raimondas wrote: About 400 WUs Linux only, or Windows also? | |
ID: 58410 | Rating: 0 | rate: / Reply Quote | |
He said earlier it would be a Linux only application. For now. | |
ID: 58413 | Rating: 0 | rate: / Reply Quote | |
Raimondas has issued a new download - 2.86 gigabytes this time. It's downloading (for now) at around 7.5 megabytes/sec, which is great. But I fear that speeds may drop if too many people try to download it at once (as some people reported yesterday). | |
ID: 58414 | Rating: 0 | rate: / Reply Quote | |
Mine will take almost 11 hours to download. It might finish in time on a GTX 1070. | |
ID: 58415 | Rating: 0 | rate: / Reply Quote | |
T13_7-RAIMIS_TEST-0-2-RND0280_1 did a bad thing. It took over from an acemd3 WU that was 88% done. It should wait its turn. | |
ID: 58416 | Rating: 0 | rate: / Reply Quote | |
My task took 9.5 hours to download, and ran for 8 minutes. Of those 8 minutes, the first several minutes showed zero load on the GPU. I assume it was unpacking the task then. So it really ran for only about 5 minutes on the GPU. | |
ID: 58418 | Rating: 0 | rate: / Reply Quote | |
My task took 9.5 hours to download, and ran for 8 minutes. Of those 8 minutes, the first several minutes showed zero load on the GPU. I assume it was unpacking the task then. So it really ran for only about 5 minutes on the GPU. That is pretty much what I got, though I did not pay attention to the load. It was much ado about nothing. When they get more and longer ones, I will try again. | |
ID: 58419 | Rating: 0 | rate: / Reply Quote | |
My hosts received WUs, but they error out after 5 minutes:
<core_client_version>7.16.6</core_client_version>
I noticed that the only ones failing are v1.02. v2.19 ones validate (well, one did so far, the rest are still running). I'm a bit confused, so is v1 ACEMD 3 and v2 ACEMD 4? Or are both different versions of ACEMD 4? | |
ID: 58420 | Rating: 0 | rate: / Reply Quote | |
I noticed that the only ones failing are v1.02. v2.19 ones validate (well, one did so far, the rest are still running). I'm a bit confused, so is v1 ACEMD 3 and v2 ACEMD 4? Or are both different versions of ACEMD 4? It's confusing the way they switched to a long-winded name and stopped labeling them acemd3 (v2.19) and acemd4 (v1.0). | |
ID: 58421 | Rating: 0 | rate: / Reply Quote | |
You've also had errors like the one I've just encountered: EXIT_DISK_LIMIT_EXCEEDED (task 32749572). It's possible that the first error didn't clean up properly behind itself, and caused the combined project total to exceed that 10,000,000,000 byte limit - though that seems unlikely. All hell is breaking loose at GPUGrid today, with ADRIA acemd3 tasks completing in under an hour - we'll just have to wait while things get sorted out one by one, and the dust settles!
Edit - the slot directory where my next task is running contains 17,298 items, totalling 10.3 GB (10.3 GB on disk) - above the limit, although BOINC hasn't noticed yet.
Edit 2 - it has now. Task 32750112 | |
ID: 58422 | Rating: 0 | rate: / Reply Quote | |
Could it have anything to do with your running Borg BOINC? Resistance is futile :-) | |
ID: 58423 | Rating: 0 | rate: / Reply Quote | |
ah, so the failing tasks are actually acemd3? | |
ID: 58424 | Rating: 0 | rate: / Reply Quote | |
We currently have: | |
ID: 58425 | Rating: 0 | rate: / Reply Quote | |
I'm not sure what that disk limit error is about. The limit in question is set at the workunit level, and applies to the amount copied to the working ('slot') directory, plus any data generated during the run. The BOINC limits are applied to the sum total of all files, in all directories, under the control of BOINC.
To prove the point, I caught task 32750377 and suspended it before it started running. Then, I shut down the BOINC client, and edited client_state.xml to double the workunit limit. It ran to completion, and was validated.
I'll do another one to check, but I can't be sitting here manually editing BOINC files every five minutes - this needs catching at source, and quickly. | |
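For anyone wanting to see what that edit involves: each task's <workunit> block in client_state.xml carries the resource bounds. A sketch of the relevant element is below - the task name is invented, and the values are the ones discussed in this thread; stop the BOINC client before editing:

<workunit>
    <name>T13_7-RAIMIS_TEST-0-2-RND0000_0</name>   <!-- name illustrative -->
    <app_name>acemd4</app_name>
    <rsc_fpops_est>10000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>10000000000000000.000000</rsc_fpops_bound>
    <rsc_disk_bound>10000000000.000000</rsc_disk_bound>   <!-- the 10,000,000,000 byte limit; doubling it means 20000000000.000000 -->
</workunit>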
ID: 58426 | Rating: 0 | rate: / Reply Quote | |
And the next one - workunit 27113095, created 15:43:59 UTC - already has the fix in place. Kudos to whoever was watching our conversation here. | |
ID: 58427 | Rating: 0 | rate: / Reply Quote | |
And the next one - workunit 27113095, created 15:43:59 UTC - already has the fix in place. Kudos to whoever was watching our conversation here. Did the fix require the whole app to be re-downloaded? Please say no... ____________ Reno, NV Team: SETI.USA | |
ID: 58428 | Rating: 0 | rate: / Reply Quote | |
Thanks for the explanation, RH. :) | |
ID: 58429 | Rating: 0 | rate: / Reply Quote | |
Did the fix require the whole app to be re-downloaded? Please say no... No! | |
ID: 58430 | Rating: 0 | rate: / Reply Quote | |
i just downloaded the 2.8GB tar package and it only took a few minutes at a transfer rate of roughly 15 MB/s. | |
ID: 58431 | Rating: 0 | rate: / Reply Quote | |
Up to 3GB file. 3hr45min to get to 66% download. | |
ID: 58432 | Rating: 0 | rate: / Reply Quote | |
The 3GB got downloaded in a few minutes and the unit looks to be working fine.
Stderr output
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<stderr_txt>
17:49:50 (2974532): wrapper (7.7.26016): starting
17:49:50 (2974532): wrapper (7.7.26016): starting
17:49:50 (2974532): wrapper: running /bin/tar (xf x86_64-pc-linux-gnu__cuda1121.tar.bz2)
17:58:25 (2974532): /bin/tar exited; CPU time 509.278471
17:58:25 (2974532): wrapper: running bin/acemd (--boinc --device 0)
21:34:24 (2974532): bin/acemd exited; CPU time 12867.200624
21:34:24 (2974532): called boinc_finish(0)
</stderr_txt>
]]>
run.log
#
# ACEMD version 4.0.0rc6
#
# Copyright (C) 2017-2022 Acellera (www.acellera.com)
#
# When publishing, please cite:
# ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale
# M. J. Harvey, G. Giupponi and G. De Fabritiis,
# J Chem. Theory. Comput. 2009 5(6), pp1632-1639
# DOI: 10.1021/ct9000685
#
# Arguments:
# input: input
# platform:
# device: 0
# ncpus:
# precision: mixed
#
# ACEMD is running in Boinc mode!
#
# Read input file: input
# Parse input file
$
$# Forcefield configuration
$
$ parmFile structure.prmtop
$ nnpFile model.json
$
$# Initial State
$
$ coordinates structure.pdb
$ binCoordinates input.coor
$ binVelocities input.vel
$ extendedSystem input.xsc
$# temperature 298.15 # Explicit velocity field provided
$
$# Output
$
$ trajectoryPeriod 25000
$ trajectoryFile output.xtc
$
$# Electrostatics
$
$ PME on
$ cutoff 9.00 # A
$ switching on
$ switchDistance 7.50 # A
$ implicitSolvent off
$
$# Temperature Control
$
$ thermostat on
$ thermostatTemperature 310.00 # K
$ thermostatDamping 0.10 # /ps
$
$# Pressure Control
$
$ barostat off
$ barostatPressure 1.0000 # bar
$ barostatAnisotropic off
$ barostatConstRatio off
$ barostatConstXY off
$
$# Integration
$
$ timeStep 2.00 # fs
$ slowPeriod 1
$
$# External forces
$
$
$# Restraints
$
$
$# Run Configuration
$
$ restart off
$ run 500000
# Parse force field and topology files
# Force field: AMBER
# PRMTOP file: structure.prmtop
#
# Force field parameters
# Number of atom parameters: 12
# Number of bond parameters: 14
# Number of angle parameters: 22
# Number of dihedral parameters: 20
# Number of improper parameters: 0
# Number of CMAP parameters: 0
#
# System topology
# Number of atoms: 5058
# Number of bonds: 5062
# Number of angles: 136
# Number of dihedrals: 240
# Number of impropers: 0
# Number of CMAPs: 0
#
# Initializing engine
# Version: 7.7
# Plugin directory: /var/lib/boinc-client/slots/3/lib/acemd3
# Loaded plugins
# CPU
# PME
# CUDA
# CudaCompiler
# WARNING: there is no library for "OpenCL" plugin
# PlumedCUDA
# WARNING: there is no library for "PlumedOpenCL" plugin
# PlumedReference
# TorchReference
# TorchCUDA
# WARNING: there is no library for "TorchOpenCL" plugin
# Available platforms
# CPU
# CUDA
#
# Bonded interactions
# Harmonic bond interactions
# Number of terms: 5062
# Harmonic angle interactions
# Number of terms: 136
# Urey-Bradley interactions
# Number of terms: 0
# Number of skipped terms (zero force constant): 136
# NOTE: Urey-Bradley interations skipped
# Proper dihedral interations
# Number of terms: 224
# Number of skipped terms (zero force constants): 16
# Improper dihedral interations
# Number of terms: 0
# NOTE: improper dihedral interations skipped
# CMAP interactions
# Number of terms: 0
# NOTE: CMAP interations skipped
#
# Non-bonded interactions
# Number of exclusions: 5391
# Lennard-Jones terms
# Cutoff distance: 9.000 A
# Switching distance: 7.500 A
# Coulombic (PME) term
# Ewald tolerance: 0.000500
# No NBFIX
# No implicit solvent
#
# NNP
# Configuration file: model.json
# Model type: torch
# Model file: model.nnp
# Number of atoms: 75
#
# Constraining hydrogen (X-H) bonds
# Number of constrained bonds: 3356
# Making water molecules rigid
# Number of water molecules: 1661
# Number of constraints: 5017
#
# Reading box sizes from input.xsc
#
# Creating simulation system
# Number of particles: 5058
# Number of degrees of freedom 10154
# Periodic box size: 37.314 37.226 37.280 A
#
# Integrator
# Type: velocity Verlet
# Step size: 2.00 fs
# Constraint tolerance: 1.0e-06
#
# Thermostat
# Type: Langevin
# Target temperature: 310.00 K
# Friction coefficient: 0.10 ps^-1
#
# Setting up platform: CUDA
# Interactions: 1 2 4 7 14 12
# Platform properties:
# DeviceIndex: 0
# DeviceName: NVIDIA GeForce RTX 3070
# UseBlockingSync: false
# Precision: mixed
# UseCpuPme: false
# CudaCompiler: /usr/local/cuda/bin/nvcc
# TempDirectory: /tmp
# CudaHostCompiler:
# DisablePmeStream: false
# DeterministicForces: false
#
# Set initial positions from an input file
#
# Initial velocities
# File: input.vel
#
# Optimize platform for MD
# Number of constraints: 5017
# Harmonic bond interations
# Initial number of terms: 5062
# Optimized number of terms: 45
# Remaining interactions: 2 4 7 14 12 1
#
# Running simulation
# Current step: 0
# Number of steps: 500000
#
# Trajectory output
# Positions: output.xtc
# Period: 25000
# Wrapping: off
#
# Log, trajectory, and restart files are written every 50.000 ps (25000 steps)
# Step Time Bond Angle Urey-Bradley Dihedral Improper CMAP Non-bonded Implicit External Potential Kinetic Total Temperature Volume
# [ps] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [K] [A^3]
25000 50.00 7.8698 22.5692 0.0000 62.2167 0.0000 0.0000 -15734.1899 0.0000 -1379454.7644 -1395096.2985 3140.7230 -1391955.5755 311.303 51783.76
# Speed: average 6.63 ns/day, current 6.63 ns/day
# Progress: 5.0, remaining time: 3:26:16, ETA: Fri Mar 4 21:35:50 2022
50000 100.00 6.7940 26.1562 0.0000 56.6657 0.0000 0.0000 -15592.6529 0.0000 -1379450.5644 -1394953.6014 3101.0705 -1391852.5309 307.372 51783.76
# Speed: average 6.66 ns/day, current 6.68 ns/day
# Progress: 10.0, remaining time: 3:14:43, ETA: Fri Mar 4 21:35:03 2022
75000 150.00 4.6298 22.0773 0.0000 58.5973 0.0000 0.0000 -15798.4139 0.0000 -1379459.1078 -1395172.2173 3143.0430 -1392029.1743 311.533 51783.76
# Speed: average 6.66 ns/day, current 6.68 ns/day
# Progress: 15.0, remaining time: 3:03:42, ETA: Fri Mar 4 21:34:50 2022
100000 200.00 7.8170 26.8665 0.0000 59.5203 0.0000 0.0000 -15618.1979 0.0000 -1379453.2405 -1394977.2346 3138.9295 -1391838.3051 311.125 51783.76
# Speed: average 6.67 ns/day, current 6.68 ns/day
# Progress: 20.0, remaining time: 2:52:48, ETA: Fri Mar 4 21:34:42 2022
125000 250.00 8.7645 24.8827 0.0000 59.5112 0.0000 0.0000 -15731.4732 0.0000 -1379450.9954 -1395089.3103 3081.5431 -1392007.7672 305.437 51783.76
# Speed: average 6.67 ns/day, current 6.68 ns/day
# Progress: 25.0, remaining time: 2:41:56, ETA: Fri Mar 4 21:34:37 2022 | |
ID: 58433 | Rating: 0 | rate: / Reply Quote | |
is it really necessary to spend 4-5 mins every task to extract the same 3GB file? seems unnecessary. if it's not downloading a new file every time, then why extract the same file over and over? why not just leave it extracted? | |
ID: 58434 | Rating: 0 | rate: / Reply Quote | |
Up to 3GB file. 3hr45min to get to 66% download.
5hr20min to download.
10+ min to start processing and already at 50% complete.
Task completed in just under 12 min. Less than 2 minutes of processing on a 3070Ti at around 55% load. | |
ID: 58436 | Rating: 0 | rate: / Reply Quote | |
Since the original task, I have received several more, with no giant file to download again. Then I got a task for the same machine, and it is downloading another beast of a file, veeeery slowly. 4.5 hours so far, and only 29% complete. BOINCtasks says it's 9.71 KBps. Ouch. | |
ID: 58437 | Rating: 0 | rate: / Reply Quote | |
Since the original task, I have received several more, with no giant file to download again. Then I got a task for the same machine, and it is downloading another beast of a file, veeeery slowly. 4.5 hours so far, and only 29% complete. BOINCtasks says it's 9.71 KBps. Ouch.
A follow-up: the d/l seems to stall eventually. But going into BOINC and turning networking off/on seems to restart the d/l at a reasonable pace, and it gets done in a matter of minutes instead of hours. This project needs to get its networking in order.
____________ Reno, NV Team: SETI.USA | |
ID: 58438 | Rating: 0 | rate: / Reply Quote | |
This project need to get their networking in order. that's what I have been saying often enough in the past | |
ID: 58439 | Rating: 0 | rate: / Reply Quote | |
it's also interesting to see that this new ACEMD4 application does not have the same high PCIe bus use as the ACEMD3 app. should allow faster processing on systems that might be on smaller bus widths (cards on USB risers, cards plugged in via chipsets, older systems with PCIe 2.0 or less, etc) | |
ID: 58440 | Rating: 0 | rate: / Reply Quote | |
still curious about the constant unpacking of the compressed file for every task. wastes 5 mins for each task doing the same thing over and over. if you just leave it unpacked then you save 5 mins on each subsequent task.
It's BOINC that empties the slot directory when a task has finished uploading, exited and reported. BOINC will check again that the allocated slot is still empty before starting a new task. It won't re-use an old slot if there's anything left behind, whether it's (a) the same project, same application, (b) the same project, a different application, or (c) a different project entirely.
The slot directory is also the 'working' directory in operating system terms, and both the operating system and the GPUGrid project use it in that sense. To use a different location for persistent files would require some effort in modifying the Path environment to let GPUGrid run.
Personally, I suspect the "everything, including the kitchen sink" compressed files are perhaps over-specified. The 17,298 items, 10.3 GB I found in there yesterday feels like an 'oh, include it, just in case' solution. When testing is complete and production is about to start, perhaps the project could audit the compressed archives and strip them back to the bare minimum? | |
ID: 58441 | Rating: 0 | rate: / Reply Quote | |
Another problem in P0_NNPMM_2p95_19-RAIMIS_NNPMM-0-20-RND5821_0 (task 32755591): | |
ID: 58442 | Rating: 0 | rate: / Reply Quote | |
unless they build a single binary file for processing, like most other projects do. then they just dump the binary into the projects folder and it gets used over and over. | |
ID: 58443 | Rating: 0 | rate: / Reply Quote | |
Another problem in P0_NNPMM_2p95_19-RAIMIS_NNPMM-0-20-RND5821_0 (task 32755591):
I had one too. What kind of app needs 40GB of memory?
working set size > client RAM limit: 38573.95MB > 38554.31MB</message>
On the next user it was aborted by the project. I've had 3 canceled by the server as well, 2 while running. | |
ID: 58444 | Rating: 0 | rate: / Reply Quote | |
And I had P0_NNPMM_1hpo_19-RAIMIS_NNPMM-1-20-RND4653_0 cancelled as well, same machine. Same sequence, also ran ~50 minutes. Maybe somebody pulled the batch? | |
ID: 58445 | Rating: 0 | rate: / Reply Quote | |
Hello everybody,
The GPUGRID server is connected to a university network with substantial bandwidth, so that is not the limiting factor. Most likely, your ISPs are throttling the download speed. As mentioned before, the software package has grown due to PyTorch, which by itself takes ~1 GB. I have removed some more unnecessary files, but it is still ~3 GB. The software package is cached by the BOINC client, so it is only downloaded once for each app version and then reused.
The WUs have to download ~500 MB of input files. At the moment, I cannot do much about this, but this will be reduced eventually.
I have changed to a different format (gzip), so now it takes 1-2 min to decompress. As a side note, the ACEMD 3 app does the same, but it uses a built-in ZIP decompressor, which doesn't report that in the log. In the case of the ACEMD 4 app there is an issue that the built-in decompressor doesn't support files >2 GB, so I had to add the decompression as a separate task.
I have fixed a memory leak. Now it should consume a reasonable amount of memory (2-4 GB).
| |
ID: 58452 | Rating: 0 | rate: / Reply Quote | |
thanks for giving some attention to the package decompression :). much faster now. | |
ID: 58453 | Rating: 0 | rate: / Reply Quote | |
Thank you for your responses and explanations. Two further points - one an amplification, and the other something different.
I'm on the end of a pretty stable 70 megabit ISP download connection - I don't think they are throttling it, either. I think the answer probably lies in the intermediate routing - the switching centres and interconnectors that the signals pass through between the University and our ISPs. Sometimes they work well, sometimes they drop packets, and sometimes they slow to a crawl. I don't think that's something that can be fixed at the research lab level, but it might be worth mentioning it upstream to the University's technical support team - they might be able to monitor it.
On another tack - and this applies to all the GPUGrid researchers as a group - it would help the work proceed more smoothly if you could find a way of paying more careful attention to the meta-data which BOINC passes downstream to our computers with each task. The key value is the estimated size, the <rsc_fpops_est>, of each task. At the moment, I have various machines working on:
AbNTest_micro tasks for ADRIA, which run for over a day
AbNTest_counts tasks for ADRIA, which run for about an hour
Today's NNPMM task from your good self, which looks set to run for about 8 hours.
Earlier test runs only lasted a few minutes, but all seem to be given the same project standard <rsc_fpops_est> setting of 5,000,000,000,000,000.
The BOINC client uses the fpops estimate, plus its own measurement of elapsed time, to keep track of the effective speed of our machines, and thus the anticipated runtime of newly downloaded tasks. It's tedious, I know, but if the task size estimate isn't routinely adjusted to take account of the testing being undertaken, anticipated runtimes can get seriously distorted. In the worst case, a succession of short tests (if not described accurately) can make our BOINC clients think that our machines have become suddenly many times faster, and can even cause 'normal' tasks to be aborted for taking too long. Experienced volunteers can anticipate and work through these problems, but the project's main work will proceed more smoothly if they don't arise in the first place. | |
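To make the arithmetic concrete, here is a rough sketch of what the client does with these numbers (simplified - the real client also applies a duration correction factor; the figures are taken from messages in this thread):

# Simplified model of the BOINC client's runtime estimate.
rsc_fpops_est = 5e15      # the project-standard task size quoted above
flops = 181.88e9          # one host's measured app speed (GTX 1660 Super, quoted later)

estimated_runtime = rsc_fpops_est / flops   # ~27,500 s, i.e. about 7.6 hours

# The abort threshold uses <rsc_fpops_bound> the same way: the later
# "exceeded elapsed time limit 7231.33 (1000000.00G/138.29G)" message
# is literally 1e15 / 138.29e9 = 7231 seconds.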
ID: 58454 | Rating: 0 | rate: / Reply Quote | |
yeah I agree about the est flops size. it throws things all out of whack. my task now, which will run for about 3hrs, started with an estimated runtime of 10 days lol. | |
ID: 58455 | Rating: 0 | rate: / Reply Quote | |
Hello everyone, | |
ID: 58457 | Rating: 0 | rate: / Reply Quote | |
Thank you for your responses and explanations. Two further points - one an amplification, and the other something different.
I agree, it's not on our end either. As mentioned somewhere, a pause and resume of networking in the client can speed up the download. I did this on the 3GB download. The site is often slow and will time out; once it times out, it will reload on a refresh. I'd put more weight on a DDoS type of restriction somewhere - too many requests and the speed drops or is cut off. The app rename with v3 vs v4 in the name makes things easier. | |
ID: 58459 | Rating: 0 | rate: / Reply Quote | |
Another 3+ GB download, for ACEMD4 v1.03:
08/03/2022 10:28:57 | GPUGRID | Started download of x86_64-pc-linux-gnu__cuda1121.tar.gz.0f9ebf1ac84d8d1f5ae2c260dc903be9
08/03/2022 10:36:55 | GPUGRID | Finished download of x86_64-pc-linux-gnu__cuda1121.tar.gz.0f9ebf1ac84d8d1f5ae2c260dc903be9
OK, 7 minutes 58 seconds. And with a six day estimate and a one day deadline, the previous task was booted aside and the new task started immediately. That's called EDF (Earliest Deadline First), or Panic Mode On! | |
ID: 58462 | Rating: 0 | rate: / Reply Quote | |
another comment I have is regarding the credit reward for the ACEMD4 tasks. they seem to be set to a static 1500 credits. it might make sense to implement the same credit reward model from the ACEMD3 tasks (higher credit and actually scaled to difficulty/runtime).
I do agree that credits granted for ACEMD4 tasks are very undervalued, compared to the same processing times for tasks from other projects, or even ACEMD3 tasks from this same project. For example, Host #186626:
PrimeGrid Genefer 18 tasks: ~1250 seconds processing time --> 1750 credits
Gpugrid ACEMD3 ADRIA KIXCMYB tasks: ~82000 seconds processing time --> 540000 credits (+50% bonus, base credits: 360000)
Gpugrid ACEMD3 ADRIA e100 tasks: ~3300 seconds processing time --> 27000 credits
Gpugrid ACEMD4 RAIMIS tasks: ~25600 seconds processing time --> 1500 credits (?) | |
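Normalising those figures to credits per hour makes the imbalance explicit; a quick sketch using only the numbers quoted above:

tasks = {
    "PrimeGrid Genefer 18":  (1250, 1750),
    "ACEMD3 ADRIA KIXCMYB":  (82000, 540000),
    "ACEMD3 ADRIA e100":     (3300, 27000),
    "ACEMD4 RAIMIS":         (25600, 1500),
}
for name, (seconds, credits) in tasks.items():
    print(f"{name:22s} {credits / (seconds / 3600):8.0f} credits/hour")
# ACEMD4 works out to ~210 credits/hour, versus ~23,700 for the
# ACEMD3 KIXCMYB batch and ~29,500 for the e100 batch.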
ID: 58482 | Rating: 0 | rate: / Reply Quote | |
Hello everybody,
I have re-calibrated the estimation of credits. Now the granted credits will be more in line with ACEMD 3, i.e. 1 hour of NVIDIA RTX 2080 Ti calculation is valued at 60 000 credits, not including the additional bonuses.
Currently there is no automated mechanism to set the flops; the app maintainers just set some arbitrary numbers. I have decreased the flops estimate by two orders of magnitude. Hopefully, this is more in line with the actual work.
| |
ID: 58623 | Rating: 0 | rate: / Reply Quote | |
Still having issues with the acemd4 application. | |
ID: 58626 | Rating: 0 | rate: / Reply Quote | |
Still having issues with the acemd4 application. Keith I saw this too in your tasks list. I'm betting that the reduction in estimated flops caused this to happen. reducing flops makes BOINC think it will take less time to complete and sets the limit for timeout lower. ____________ | |
ID: 58627 | Rating: 0 | rate: / Reply Quote | |
You are probably correct Ian. I fixated on the 1800 seconds error since that was the errors I saw with the python tasks. | |
ID: 58628 | Rating: 0 | rate: / Reply Quote | |
Got an ACEMD4 task on Linux: T0_NNPMM_frag_00-RAIMIS_NNPMM-1-3-RND2497_5. The five previous attempts have all timed out, in between 2,400 seconds and 5,000 seconds. | |
ID: 58629 | Rating: 0 | rate: / Reply Quote | |
It's reached 50% in 2 hours 7 minutes, so this task is heading for about four and a quarter hours on my GTX 1660 Super. That would have failed without manual intervention. | |
ID: 58632 | Rating: 0 | rate: / Reply Quote | |
File size too big for both users on upload. | |
ID: 58633 | Rating: 0 | rate: / Reply Quote | |
Thu 14 Apr 2022 09:27:26 AM CDT | GPUGRID | Aborting task T1_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND6653_1: exceeded elapsed time limit 7231.33 (1000000.00G/138.29G) | |
ID: 58645 | Rating: 0 | rate: / Reply Quote | |
Still getting elapsed time limit errors. Looks like the estimated GFLOPS was changed but still not enough. | |
ID: 58646 | Rating: 0 | rate: / Reply Quote | |
bombed out after 25mins and 20% completion on a 3080Ti | |
ID: 58648 | Rating: 0 | rate: / Reply Quote | |
Just got T1_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND6653_5. | |
ID: 58649 | Rating: 0 | rate: / Reply Quote | |
got another new task. | |
ID: 58679 | Rating: 0 | rate: / Reply Quote | |
called it. ran for 2hrs18mins and errored right after completion.
T2_NNPMM_frag_01-RAIMIS_NNPMM-0-2-RND4664_0
upload failure: <file_xfer_error>
from the event log:
Tue 19 Apr 2022 02:13:28 PM EDT | GPUGRID | File size: 20443520.000000 bytes. Limit: 10000000.000000 bytes
it's important for the devs to see that these error out rather than fiddling with things on my end to ensure I get credit. otherwise they may be under the impression that it's not a problem.
____________ | |
ID: 58680 | Rating: 0 | rate: / Reply Quote | |
it's important for the devs to see that these error out rather than fiddling with things on my end to ensure I get credit. otherwise they may be under the impression that it's not a problem.
Agreed. But in that case, it's also helpful to post the local information from the event log that the devs can't easily see - like captainjack's note
File size: 137187308.000000 bytes. Limit: 10000000.000000 bytes
That gives them the magnitude of the required correction, as well as its location. My (single) patched run did indeed complete successfully after surgery, so the file size should be the last correction needed. | |
ID: 58681 | Rating: 0 | rate: / Reply Quote | |
i just edited with that info from the log. | |
ID: 58682 | Rating: 0 | rate: / Reply Quote | |
Two more, half the run time, and 10x the file size for the _4 output file. both run on 3080Tis again.
Tue 19 Apr 2022 07:54:14 PM EDT | GPUGRID | File size: 213191804.000000 bytes. Limit: 10000000.000000 bytes
T2_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND5192_2
Tue 19 Apr 2022 07:53:26 PM EDT | GPUGRID | File size: 213539276.000000 bytes. Limit: 10000000.000000 bytes
____________ | |
ID: 58683 | Rating: 0 | rate: / Reply Quote | |
Same thing here. Ran 2 1/2 hours to completion and then failed on too large an upload file. | |
ID: 58684 | Rating: 0 | rate: / Reply Quote | |
Woke up this morning to find two unstarted ACEMD4 tasks awaiting my attention (and downloaded a third since). | |
ID: 58685 | Rating: 0 | rate: / Reply Quote | |
the estimated runtimes of my tasks were very close. at inception they started right at around 2hrs, and that's how long it took. that was with no adjustments. so the estimated flops seems correct. | |
ID: 58686 | Rating: 0 | rate: / Reply Quote | |
The other thing they still have to sort out is checkpointing. I've just come home to find that BOINC had downloaded and started a new ACEMD4 task - for some reason, it pre-empted the two Einstein tasks running on the GPU I dedicate to GPUGrid. That must have been EDF kicking in, but with a six hour cache and a 24 hour deadline, it shouldn't have been needed. | |
ID: 58687 | Rating: 0 | rate: / Reply Quote | |
Another 3 hours wasted because of too large an upload file. | |
ID: 58688 | Rating: 0 | rate: / Reply Quote | |
I've fixed the file size problem, partly to stop them recirculating to other users, but also to get some real science done, if possible.
It is highly likely that the one (and only) entry under "successful users in last 24h" for ACEMD 4 tasks on the current Server Status page is you ;-) | |
ID: 58689 | Rating: 0 | rate: / Reply Quote | |
I've fixed the file size problem, partly to stop them recirculating to other users, but also to get some real science done, if possible.
It might be one user, but it's four tasks and counting so far:
Host 132158
Host 508381
I'll try and see off this run of timewasters, even if I have to do it all myself! | |
ID: 58690 | Rating: 0 | rate: / Reply Quote | |
That implies a longer runtime than this morning's group, so the output size may be larger as well. Turned out not to be a problem - the file size was 20.4 MB, despite running nearly twice as long. I can't see anything about the filename which would reliably distinguish between "quick run, large file" and "slow run, small file" tasks. | |
ID: 58691 | Rating: 0 | rate: / Reply Quote | |
That implies a longer runtime than this morning's group, so the output size may be larger as well.
T2_GAFF2_frag_00-RAIMIS_NNPMM = short run, large file size
T2_NNPMM_frag_01-RAIMIS_NNPMM = longer run, smaller (but still too big) file size
I processed several of both types.
____________ | |
ID: 58692 | Rating: 0 | rate: / Reply Quote | |
Yup, that works. | |
ID: 58693 | Rating: 0 | rate: / Reply Quote | |
Got another Python. I will let it run its course this time. | |
ID: 58694 | Rating: 0 | rate: / Reply Quote | |
a handful of new ACEMD4 tasks went out today. I got one. | |
ID: 58700 | Rating: 0 | rate: / Reply Quote | |
Same here. Got two tasks today that completed and validated successfully. | |
ID: 58704 | Rating: 0 | rate: / Reply Quote | |
I also got an ACEMD 4 task yesterday:
exceeded elapsed time limit 16199.05 (10000000.00G/617.32G)
Maybe some fine tuning of the WU configuration parameters is still required on the project side. | |
ID: 58705 | Rating: 0 | rate: / Reply Quote | |
Yes, bummer for missing the expected compute time by less than 90 seconds. | |
ID: 58706 | Rating: 0 | rate: / Reply Quote | |
my 3080Ti completed one no problem. I think something was wrong with that person's machine and it was hung up, or there was something slowing down the computation. My 3080Ti completed it in half the time. | |
ID: 58707 | Rating: 0 | rate: / Reply Quote | |
Hello everybody,
I have tried to tune the flops, but in the end the final values are very similar to the original ones. It seems many volunteers are adjusting the factors by themselves, so fixing it for some ruins it for others.
I have updated the limits of the file sizes to match the scale of the new WUs.
| |
ID: 58713 | Rating: 0 | rate: / Reply Quote | |
I doubt that there's much large-scale tuning of flops (the speed measure) by general users out in the community. The few who post here regularly might well have tweaked it, of course. | |
ID: 58714 | Rating: 0 | rate: / Reply Quote | |
I'm loaded up as much as possible for now. one on each GPU. short 1-day deadlines (can't remember if they were always that short) | |
ID: 58715 | Rating: 0 | rate: / Reply Quote | |
While I was typing the previous post, the server sent my host 132158 task 32888461.
<flops>181882964117.026398</flops>
(181 Gflops), which is actually quite close to the running APR of 198 Gflops for the nine tasks it's completed so far - not enough to be considered definitive yet. The task size is
<rsc_fpops_est>10000000000000.000000</rsc_fpops_est>
so from size / speed, the raw runtime estimate is 55 seconds (limit 15.25 hours). That should be good enough for now. Card is a GTX 1660 Super. | |
ID: 58716 | Rating: 0 | rate: / Reply Quote | |
Received 25 tasks on Linux. Ran about 45 - 60 seconds. All error. Error message:
process exited with code 195 (0xc3, -61)
Received 0 tasks on Windows. | |
ID: 58717 | Rating: 0 | rate: / Reply Quote | |
Received 25 tasks on Linux. Ran about 45 - 60 seconds. All error. Error message: process exited with code 195 (0xc3, -61)
I received several of your resends; they are processing fine on my system. but that brings up a talking point. I see many older GPUs having errors on these. I have 2080Tis, 3070Tis, and 3080Tis and they have all processed successfully. I see one user with a GTX 650, and he errors with an architecture error, so the app was obviously built without Kepler support. I see several other users with <2GB VRAM that errored out; I assume those cases might be due to too little memory. Then there are cases like yours, where the card should be supported and has enough memory, but errors anyway for some reason. has anyone had successful ACEMD4 runs on Maxwell or Pascal cards?
Received 0 tasks on Windows.
that's because there is no Windows app. these ACEMD4 tasks only have a Linux application for now.
____________ | |
ID: 58718 | Rating: 0 | rate: / Reply Quote | |
captainjack is running a GTX 970 under Linux, and the acemd child is failing with
app exit status: 0x87
I don't recognise that one, but it should be in somebody's documentation. | |
ID: 58719 | Rating: 0 | rate: / Reply Quote | |
I just allowed tasks on my 1080Ti to see if it will run on Pascal. | |
ID: 58720 | Rating: 0 | rate: / Reply Quote | |
captainjack is running a GTX 970 under Linux, and the acemd child is failing with
yes I know. but he also has a windows system. I was letting him know the reason his windows system didn't get any: an app for Windows does not exist at this time.
check this WU. it's one that he (and several other Maxwell card systems) failed with the same 0x87 code. another Maxwell Quadro M2000 failed as well with 0x7.
https://gpugrid.net/workunit.php?wuid=27220282
my system finally completed it without issue. makes me wonder if the app works on Maxwell or Pascal cards at all. if it turns out that these tasks don't run on Maxwell/Pascal, the project might need to do additional filtering in their scheduler by compute capability to exclude incompatible systems.
____________ | |
ID: 58721 | Rating: 0 | rate: / Reply Quote | |
that'll be a good test. thanks. ____________ | |
ID: 58722 | Rating: 0 | rate: / Reply Quote | |
It will be a while before I can report success or failure. Have to download the 3.5GB application file still before starting the task. | |
ID: 58723 | Rating: 0 | rate: / Reply Quote | |
haha, a few of my systems did the same earlier. ____________ | |
ID: 58724 | Rating: 0 | rate: / Reply Quote | |
Pascal works: https://gpugrid.net/result.php?resultid=32888358 | |
ID: 58725 | Rating: 0 | rate: / Reply Quote | |
Mine just started. So far, so good. | |
ID: 58726 | Rating: 0 | rate: / Reply Quote | |
The 1080 Ti can crunch the acemd4 tasks with no issues. | |
ID: 58727 | Rating: 0 | rate: / Reply Quote | |
i think the CPU speed plays a pretty significant role in GPU speed on these tasks. so your fast 5950X is helping out a good bit. | |
ID: 58728 | Rating: 0 | rate: / Reply Quote | |
Same speed on all my 5950X hosts, of which the 1080 Ti host is the newest. So the cpu speed is not the reason for the speedup. | |
ID: 58729 | Rating: 0 | rate: / Reply Quote | |
Checkpointing is still not active. | |
ID: 58730 | Rating: 0 | rate: / Reply Quote | |
exceeded elapsed time limit 12932.15 (10000000.00G/773.27G)</message> | |
ID: 58733 | Rating: 0 | rate: / Reply Quote | |
Same speed for all my 5950X hosts which the 1080 Ti is the newest. So the cpu speed is not the reason for the speedup. i wasn't saying there was any "speedup". just that your fast CPU is helping vs how a 1080Ti might perform with a lower end CPU. put the same GPU on a slower CPU and it will slow down to some extent. ____________ | |
ID: 58734 | Rating: 0 | rate: / Reply Quote | |
exceeded elapsed time limit 12932.15 (10000000.00G/773.27G)</message> your hosts are hidden. what are the system specifics? what CPU/RAM specs, etc. the time limit is set by BOINC, not the project directly. BOINC bases the limits on the speed of the device and the estimated flops set by the project. ____________ | |
ID: 58735 | Rating: 0 | rate: / Reply Quote | |
Checkpointing is still not active. I confirm that. | |
ID: 58736 | Rating: 0 | rate: / Reply Quote | |
exceeded elapsed time limit 12932.15 (10000000.00G/773.27G)</message>
It's possible that the lack of checkpointing contributed to this problem. ACEMD 4 tells BOINC that it has checkpointed (and I think it's correct - the checkpoints have been written). So BOINC thinks it's OK to task-switch to another project. But when the time comes to return to GPUGrid, ACEMD fails to read the checkpoint files, deletes the result file so far, and starts from the beginning. But it retains the memory of elapsed time so far, bringing it much closer to the abort point. | |
ID: 58737 | Rating: 0 | rate: / Reply Quote | |
if it's retaining the timer when restarting the task from 0 then yes I agree the checkpointing could be the root cause. if it's not checkpointing, the timer should reset to 0. | |
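A toy sketch of that failure mode, with invented numbers, just to show why a single failed resume is enough to trip the limit:

# All numbers below are invented for illustration.
time_limit = 12932          # seconds: rsc_fpops_bound / flops, as in the message above
first_run = 7000            # seconds of progress before the task was pre-empted
full_run = 11000            # seconds a clean start-to-finish run would need

# Resume fails to read the checkpoint, so computation restarts from step 0,
# but the elapsed-time counter carries on from where it left off:
total_elapsed = first_run + full_run
print(total_elapsed > time_limit)   # True -> "exceeded elapsed time limit"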
ID: 58738 | Rating: 0 | rate: / Reply Quote | |
I still don't understand the messages that the tasks won't complete in time lol. it will only download 1 per GPU, then no more. | |
ID: 58739 | Rating: 0 | rate: / Reply Quote | |
We mostly concentrate on the estimates calculated by the client, from size, speed, and (at this project) DCF. | |
ID: 58740 | Rating: 0 | rate: / Reply Quote | |
I still don't understand the messages that the tasks won't complete in time lol. it will only download 1 per GPU, then no more.
I find that this warning appears for ACEMD 4 tasks when you try to set a work buffer greater than 1 day. BOINC Manager probably "thinks": why should I want to store more than one day of tasks, for tasks with a one day deadline? The same happens with ACEMD 3 tasks when setting a work buffer greater than 5 days. It's in some way tricky...
* Related question
* Short explanation
* Extended explanation
That is, for current ACEMD 4 tasks (24 hours deadline): If you want to get more than one task per GPU, set your work buffer provisionally at a value lower than 1, and revert to your custom value once a further task is downloaded. (Or leave your buffer permanently set to 0.95 ;-) | |
ID: 58741 | Rating: 0 | rate: / Reply Quote | |
interesting observation. I've experienced similar things in the past with other projects with counterintuitive behavior/response to a cache level set "too high". | |
ID: 58742 | Rating: 0 | rate: / Reply Quote | |
The BOINC manager UI shows 42 minutes left, while the work fetch debug shows 2811 minutes:
[work_fetch] --- state for NVIDIA GPU ---
[work_fetch] shortfall 0.00 nidle 0.00 saturated 167688.92 busy 167688.92
That's odd, because the fraction_done_exact isn't set in the app_config.xml | |
ID: 58743 | Rating: 0 | rate: / Reply Quote | |
I had one of these ACEMD4 tasks paused with a short amount of computation completed. at 2% and about 6minutes. with the task paused, the task remained in its "extracted" state. upon resuming the task, it restarts from 1%, not 0%. I'm guessing 0-1% is for the file extraction. but indeed the timer stayed where it was at ~6minutes and continued from there and did not reset to the actual time elapsed for 1%. ____________ | |
ID: 58744 | Rating: 0 | rate: / Reply Quote | |
Take a look at the tasks on my host. | |
ID: 58745 | Rating: 0 | rate: / Reply Quote | |
The host from my previous post has received an ACEMD3 task. It had reached 16.3% when the host received an ACEMD4 task, which took over, as the latter has a much shorter deadline. The ACEMD3 could restart from the checkpoint, so it will finish eventually. I wonder how many times the ACEMD3 task will be suspended, and how many days will pass until it's completed. | |
ID: 58746 | Rating: 0 | rate: / Reply Quote | |
I don't think there's much chance at all. We've blown through all those 1000 tasks, I think. Not much chance your acemd3 task will get pre-empted. | |
ID: 58747 | Rating: 0 | rate: / Reply Quote | |
I don't think much chance at all. We've blown through all those 1000 tasks I think.
Not yet. The task which preempted the ACEMD3 task is:
P0_NNPMM_frag_85-RAIMIS_NNPMM-5-10-RND6112_0
In that name, the 10 is the total number of tasks in the given sequence; the 5 is the number of this task within the sequence (starting from 0, so the last one will be 9-10); and the trailing _0 is the number of resends.
So those 1000 tasks are actually 100 task sequences, each sequence broken into 10 pieces. | |
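That naming convention is regular enough to parse mechanically; a small sketch - the pattern is inferred from the examples in this thread, not from any project documentation:

import re

name = "P0_NNPMM_frag_85-RAIMIS_NNPMM-5-10-RND6112_0"
m = re.match(r".+-(\d+)-(\d+)-RND\d+_(\d+)$", name)
piece, total, resend = (int(g) for g in m.groups())
print(f"piece {piece} of {total}, resend #{resend}")
# -> piece 5 of 10, resend #0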
ID: 58748 | Rating: 0 | rate: / Reply Quote | |
Thanks for the task enumeration explanation. | |
ID: 58749 | Rating: 0 | rate: / Reply Quote | |
it seems like they are only letting out about 100 into the wild at any given time rather than releasing them all at once. | |
ID: 58750 | Rating: 0 | rate: / Reply Quote | |
One of my machines currently has:
P0_NNPMM_frag_76-RAIMIS_NNPMM-3-10-RND5267_0 sent 27 Apr 2022 | 1:25:23 UTC
P0_NNPMM_frag_85-RAIMIS_NNPMM-7-10-RND6112_0 sent 27 Apr 2022 | 3:14:45 UTC
Not all models seem to be progressing at the same rate (note neither is a resend from another user - both are first run after creation). | |
ID: 58751 | Rating: 0 | rate: / Reply Quote | |
The ACEMD3 could restart from the checkpoint, so it will finish eventually.It has failed actually. :( It was suspended a couple of times, so I set "no new task" to make it finish within 24 hours, but it didn't help. | |
ID: 58756 | Rating: 0 | rate: / Reply Quote | |
Looks like we're coming to the end of this batch - time to take stock. For this post, I'm looking at host 132158, running on a GTX 1660 Super.
<flops>181882876215.769470</flops>
<rsc_fpops_est>10000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>10000000000000000.000000</rsc_fpops_bound>
- so speed 181,882,876,216 or 181.88 Gflops. This website has an APR of 198.76, but it stopped updating after 9 completed tasks with v1.03 (I've got 13 showing valid at the moment). Size/speed gives 54.98, confirming what BOINC is estimating.
I reckon the size estimate (fpops_est) should be increased by a factor of about 250, to get closer to the target of DCF=1. BUT DON'T DO IT ALL IN ONE MASSIVE JUMP. We could probably manage a batch with the estimate 5x the current value, to get DCF away from the upper limit (DCF corrects very, very, slowly when it gets this far out of equilibrium, and limits work fetch requests to a nominal 1 second). Then, a batch with a further increase of 10x, and a third batch with another 10x, should do it. But monitor the effects of the changes carefully. | |
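Working that factor of ~250 through with the numbers above (a back-of-envelope sketch; the implied runtime is derived from the post, not measured independently):

flops = 181.88e9            # host speed reported by the scheduler
rsc_fpops_est = 1e13        # current per-task size estimate
raw_estimate = rsc_fpops_est / flops            # ~55 s, as computed above

implied_runtime = raw_estimate * 250            # ~13,750 s, about 3.8 hours of real runtime
corrected_fpops_est = flops * implied_runtime   # ~2.5e15, the task size that would give DCF=1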
ID: 58757 | Rating: 0 | rate: / Reply Quote | |
The ACEMD4 app puts less stress on the GPU than the ACEMD3 app. | |
ID: 58758 | Rating: 0 | rate: / Reply Quote | |
I also notice that these tasks don't fully utilize the GPU (mentioned in an earlier post with usage stats). But I think it's a CPU bottleneck with these tasks. A faster CPU allows the GPU to work harder. | |
ID: 58759 | Rating: 0 | rate: / Reply Quote | |
running another GAFF2 task now. GPU utilization is higher (~96%) but power use is even lower at about 230W for a 3080Ti (power limit at 300W). | |
ID: 58760 | Rating: 0 | rate: / Reply Quote | |
QM tasks run in the same manner as the GAFF2 tasks. high ~96% GPU utilization, low GPU power draw, ~1GB VRAM used, ~2% VRAM bus use, high PCIe use. | |
ID: 58761 | Rating: 0 | rate: / Reply Quote | |
ACEMD 4 tasks now seem to be processing ok on my antique GTX 970. I had previously reported that they were all failing. | |
ID: 58763 | Rating: 0 | rate: / Reply Quote | |
out of the ~1200 or so that I assume went out, I processed over 400 of them myself. all completed successfully with no errors (excluding ones cancelled or aborted). I "saved" many _7s too. | |
ID: 58764 | Rating: 0 | rate: / Reply Quote | |
no ACEMD 4 tasks for very long time :-( | |
ID: 59919 | Rating: 0 | rate: / Reply Quote | |
no ACEMD 4 tasks for very long time :-( no reply from the project team for 1 1/2 months now - do we volunteers not deserve any kind of information? | |
ID: 60232 | Rating: 0 | rate: / Reply Quote | |
Here you can see the progress of acemd software | |
ID: 60236 | Rating: 0 | rate: / Reply Quote | |
Message boards : News : ACEMD 4