Message boards : News : ATM
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 326,008
> At least in the runs I sent recently we are expecting 341 samples.

Thanks, that's helpful. I've reached sample 266, so I'll be able to predict when it's likely to finish. But I think you need to reconsider some design decisions. The current task properties (from BOINC Manager) are:

[screenshot of task properties]

- This task will take over 24 hours to run on my GTX 1660 Ti - that's long, even by GPUGrid standards.
- BOINC doesn't think it has checkpointed since the beginning, even though checkpoints are listed at the end of each sample in the job.log.
- BOINC Manager shows the fraction done as 75.000% - and has displayed that figure, unchanging, since a few minutes into the run.
- I'm not seeing any sign of an output file (or I haven't found it yet!), although one is specified in the `<result>` XML:

```xml
<file_ref>
    <file_name>T_QUICO_Tyk2_new_2_ejm_47_ejm_55_4-QUICO_TEST_ATM-0-1-RND8906_2_0</file_name>
    <open_name>output.tar.bz2</open_name>
    <copy_file/>
</file_ref>
```

More when it finishes.
Joined: 28 Feb 23 · Posts: 35 · Credit: 0 · RAC: 0
> At least in the runs I sent recently we are expecting 341 samples.

That's good to know, thanks. Next time I'll prepare them so they run for shorter amounts of time and finish over the next submissions. Is there an approximate time you suggest per task?

> I'm not seeing any sign of an output file (or I haven't found it yet!), although it's specified in the `<result>` XML

Can you see a cntxt_0 folder, or several r0-r21 folders? These should be some of the outputs that the run generates, and also the ones I'm getting from the successful runs.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 326,008
> Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and also the ones I'm getting from the successful runs.

Yes, I have all of those, and they're filling up nicely. I want to catch the final upload archive and check it for size.
Joined: 28 Feb 23 · Posts: 35 · Credit: 0 · RAC: 0
> Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and also the ones I'm getting from the successful runs.

Ah, I see. From what I've seen, the final upload archive has been around 500 MB for these runs. Taking into account what was mentioned about file sizes at the beginning of the thread, I'll tweak some parameters to avoid heavier files.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,348,595 · RAC: 4,765,598
You should also add weights to the `<tasks>` elements in the jobs.xml file that's being used, as well as adding some kind of progress reporting for the main script. Jumping to 75% at the start and staying there for 12-24 hours until it jumps to 100% at the end is counterintuitive for most users and causes confusion about whether the task is doing anything or not.
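For reference, a minimal sketch of what that might look like in the wrapper's job file. The task names and weights here are assumptions, not the project's actual configuration: the BOINC wrapper uses each `<task>`'s `<weight>` to apportion the progress bar across stages, and can read live progress from a file named by `<fraction_done_filename>`.

```xml
<job_desc>
    <!-- hypothetical two-stage job: a short setup step and the long MD run -->
    <task>
        <application>setup.sh</application>
        <weight>1</weight>
    </task>
    <task>
        <application>run.sh</application>
        <weight>99</weight>
        <!-- the script writes a 0.0-1.0 value here as samples complete -->
        <fraction_done_filename>progress.txt</fraction_done_filename>
    </task>
</job_desc>
```

With weights like these, the progress bar would sit near 1% during setup instead of jumping straight to 75%.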
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 2
> Next time I'll prepare them so they run for shorter amounts of time and finish over next submissions. Is there an approximate time you suggest per task?

The sweet spot would be 0.5 to 4 hours; above 8 hours it starts to drag. Some climate projects take over a week to run. It really depends on your needs - we're here to serve :-) A quicker turnaround time while you're tweaking your project would be to your benefit. It would also help you to create your own BOINC account and run your WUs the same way we do. Get in the trenches with us and see what we see.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 326,008
Well, here it is:

[screenshot of the final upload file's properties]

BOINC sees that as 500.28 MB (Linux counts in 1000s, BOINC counts in 1024s) - wish me luck!

Edit - phew, it got through. But that's very, very close to the old limit. Task 33344733
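The two figures differ only because of the unit base. As a quick sanity check (the byte count below is a hypothetical value chosen to land near the displayed size):

```python
# One file size, two conventions: decimal MB (10^6 bytes, used by most
# Linux tools) vs binary MB/MiB (2^20 bytes, used by BOINC's display).
size_bytes = 524_580_000  # hypothetical, close to this upload's size

mb_decimal = size_bytes / 10**6   # what Linux-style tools report
mb_binary  = size_bytes / 2**20   # what BOINC reports

print(f"{mb_decimal:.2f} MB decimal")  # 524.58 MB decimal
print(f"{mb_binary:.2f} MB binary")    # 500.28 MB binary
```

So a file BOINC shows as 500.28 MB is already about 524.58 MB by decimal count - worth remembering when sizing uploads against a server-side limit.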
Joined: 28 Feb 23 · Posts: 35 · Credit: 0 · RAC: 0
> Next time I'll prepare them so they run for shorter amounts of time and finish over next submissions. Is there an approximate time you suggest per task?

Once the Windows version is live, my personal set-up will join the cause and I'll have more feedback :)

> Well, here it is:

Thanks for the insight. I'll make it save frames less frequently in order to avoid bigger file sizes.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,348,595 · RAC: 4,765,598
Nothing but errors from the current ATM batch. run.sh is missing, or misnamed/misreferenced.
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 2
I vaguely recall GG had a rule that a computer can only DL 200 WUs a day. If it's still in place, that would be absurd, since the overriding rule is that a computer can only hold 2 WUs at a time. At the rate ATM WUs are failing I could hit that limit, so I halted GG DLs. Please delete all your WUs until you fix the bug.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 326,008
Today's tasks are running OK - the run.sh script problem has been cured. I'm running one that the previous user aborted before it even started - no need for that any more (WU 27426736). |
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,348,595 · RAC: 4,765,598
I wouldn't say "cured", but the newer tasks seem to be fine. I'm still getting a good number of resends with the same problem; I guess they'll make their way through the meat grinder before defaulting out. Example: http://www.gpugrid.net/result.php?resultid=33357435
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 326,008
My point was: if you get one of these, let it run - it may be going to produce useful science. If it's one of the faulty ones, you waste about 20 seconds, and move on. |
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 2
Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously. Sadly GG chokes off work at 2 WUs per computer so that's presently impossible. |
Joined: 28 Feb 23 · Posts: 35 · Credit: 0 · RAC: 0
Sorry about the missing run.sh issue of the past few days - it slipped past me. There were also a few re-send tests that crashed, but it should be fixed now. Is there a way I could delete the failed/crashed files from the server?

We're also trying to find alternatives to avoid the file-size issue; I hope we can find a nice solution in the next few days. Do the last few runs take less time, and are they less of a drag to run? I'm trying to find the sweet spot for everyone/most of us. Thanks everyone!
Joined: 28 Feb 23 · Posts: 35 · Credit: 0 · RAC: 0
> Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously.

How low is it? It really shouldn't be the case, at least taking into account the tests we performed internally.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 326,008
My host 508381 (GTX 1660 Ti) has finished a couple overnight, in about 9 hours each. The last one finished just as I was reading your message, and I saw the upload size - 114 MB. Another failed with 'Energy is NaN', but that's a separate question. The size and time figures are comfortable for me, but others will post their own views.

It would be helpful to work on the intermediate progress reports and checkpointing - at the moment, neither is reported to BOINC. This host (Linux Mint 20.3) spends the entire run reporting 75% progress; my other machine (Linux Mint 21.1) is stuck at 3%. Both run exactly the same build of BOINC.
Joined: 21 Feb 20 · Posts: 1114 · Credit: 40,838,348,595 · RAC: 4,765,598
My observations show the GPU switching between periods of high utilization (~96-98%) and periods of idle (0%), about every minute or two.

I think the current size of the ATM tasks is pretty good: about 4 hours on a 3080 Ti and about 5 hours on a 2080 Ti. I'll second Richard's comment that you should put some effort into checkpointing and into fixing the completion reporting (add weights to the job.xml file).
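To second that: if the wrapper's job file names a fraction-done file, the main loop only needs a few lines to keep BOINC's progress honest. A minimal sketch - the file name, sample count, and wrapper setup are all assumptions here:

```python
import os

TOTAL_SAMPLES = 341  # per Quico, for the recent runs

def report_progress(samples_done, path="progress.txt"):
    """Write the fraction done (0.0-1.0) where the BOINC wrapper can read it."""
    frac = min(samples_done / TOTAL_SAMPLES, 1.0)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(f"{frac:.4f}\n")
    os.replace(tmp, path)  # atomic rename, so the wrapper never sees a partial write

# called once per completed sample in the main loop:
# report_progress(sample_index)
```

Even this much would replace the fixed 75% with a bar that creeps up once per sample.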
Joined: 12 Jul 17 · Posts: 404 · Credit: 17,408,899,587 · RAC: 2
> Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously.

> How low is it? It really shouldn't be the case at least taking into account the tests we performed internally.

GPUgrid is set to only DL 2 WUs per computer. It used to be higher, but since ACEMD WUs take around 12 hours and have approximately 50% GPU utilization, a normal BOINC client couldn't really make efficient use of more than 2. The history of setting the limit may have had something to do with DDoS attacks and throttling server access as a defense. But Python WUs with very low GPU utilization, and ATM with about 25% utilization, could run more. I believe it's possible for the work server to decide how many WUs of a given kind to send based on the client's hardware. Some use a custom BOINC client that tricks the server into thinking their computer is more than one computer. I suspect 1080s & 2080s could run 3, and 3080s could run 4, ATM WUs. It would be nice to give it a try.

Checkpointing should be high on your to-do list, followed closely by progress reporting. File size is not an issue on the client side, since you DL files over a GB; but increasing the limit on your server side would make that problem vanish. Run times have shortened and run fine - maybe a little shorter would be nice, but that's not a priority.
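On the client side, packing several tasks onto one GPU is normally done with an app_config.xml in the project directory - though with the 2-WU server limit this only helps if enough tasks are on hand. A sketch, assuming the app name is `ATM` (check client_state.xml for the real name):

```xml
<app_config>
    <app>
        <name>ATM</name>  <!-- assumed; must match the app name the client knows -->
        <gpu_versions>
            <gpu_usage>0.33</gpu_usage>  <!-- three tasks share one GPU -->
            <cpu_usage>1.0</cpu_usage>
        </gpu_versions>
    </app>
</app_config>
```

After saving it, "Read config files" in BOINC Manager picks up the change without a client restart.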
Joined: 17 Mar 14 · Posts: 4 · Credit: 77,427,636 · RAC: 0
I noticed "Free energy calculations of protein ligand binding" in WUProp. For example, today's time is 0.03 hours. I checked, and I have 68 of these with minimal total time - they all end with "Error while computing". I looked at a recent work unit, 27429650 T_CDK2_new_2_edit_1oiu_26_T2_2A_1-QUICO_TEST_ATM-0-1-RND4575_0. The log has this:

```
+ python -m pip install git+https://github.com/raimis/AToM-OpenMM.git@5d7eac55295e8c6e777505c3ca7c998f1c85987d
  Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /t/boinclib/boinc-client/slots/8/tmp/pip-req-build-3qm67lb1
  Running command git rev-parse -q --verify 'sha^5d7eac55295e8c6e777505c3ca7c998f1c85987d'
  Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git 5d7eac55295e8c6e777505c3ca7c998f1c85987d
  Running command git checkout -q 5d7eac55295e8c6e777505c3ca7c998f1c85987d
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: -4
```

I'm running Linux Mint 19 (a bit out of date); git is version 2.17.1, /usr/bin/python is Python 2.7.17 and /usr/bin/python3 is Python 3.6.9 - this was common until recently.

```
uname -a
Linux berfon 5.4.0-104-generic #118~18.04.1-Ubuntu SMP Thu Mar 3 13:53:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
```

My machine has a GTX 950, so CUDA tasks are OK. Is it having an issue writing to /t/boinclib/boinc-client/slots/8/tmp?

```
sudo ls -ld /t/boinclib/boinc-client/slots/8/
drwxrwx--x 2 boinc boinc 4096 Mar 15 10:24 /t/boinclib/boinc-client/slots/8/
```

So it doesn't look like a permissions issue, and the disk drive this is on has over 1 TB free. It looks to me like git failed, and that this is what is happening on all the work units. My machine runs "New version of ACEMD" routinely, and my GPUGrid preferences allow everything. I'm not sure which category this app is in, but it must be one of the beta apps. I hope this helps.
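One hedged observation on that log: pip's "exit code: -4" means the setup.py child process was killed by signal 4, which on Linux is SIGILL (illegal instruction). That often points to a binary built for CPU features (e.g. newer SIMD instructions) the processor lacks, rather than a git or permissions problem:

```python
import signal

# pip reports a negative exit code when the child process died from a
# signal; -4 therefore means signal number 4.
exit_code = -4  # as in the log above
print(signal.Signals(-exit_code).name)  # SIGILL
```

If that reading is right, the failure would recur on this CPU regardless of disk space or directory permissions.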
©2025 Universitat Pompeu Fabra