ATM

Author	Message
Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 1 Level Scientific publications	Message 60806 - Posted: 31 Oct 2023, 23:39:26 UTC - in response to Message 60804. i see that the host is linking up to your "work" venue/location. verify that the settings for the work venue allow beta/test applications. ID: 60806 · Rating: 0 · rate: / Reply Quote

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 595 Credit: 13,083,686,510 RAC: 1,351,209 Level Scientific publications	Message 60807 - Posted: 31 Oct 2023, 23:42:10 UTC - in response to Message 60801. Your system does not seem to be asking for ATMbeta tasks. Try taking a look at message #60725. ID: 60807 · Rating: 0 · rate: / Reply Quote

goldfinch Send message Joined: 5 May 19 Posts: 36 Credit: 736,808,218 RAC: 28,150 Level Scientific publications	Message 60808 - Posted: 1 Nov 2023, 1:07:54 UTC - in response to Message 60806. This 'work' profile is coming from WCG and doesn't have anything about beta-testing tasks. Besides, it didn't cause any issues in the past. Where else should i look? The logs show that it's GPUGrid that doesn't return me new tasks:
1/11/2023 12:02:55 PM | GPUGRID | No tasks are available for ATM: Free energy calculations of protein-ligand binding
That is, to me it appears that my BOINC client sends requests to the server but receives nothing. I enabled debugging logs to BOINC, this is the output:
1/11/2023 12:05:39 PM | | [work_fetch] ------- start work fetch state -------
1/11/2023 12:05:39 PM | | [work_fetch] target work buffer: 86400.00 + 86400.00 sec
1/11/2023 12:05:39 PM | | [work_fetch] --- project states ---
1/11/2023 12:05:39 PM | GPUGRID | [work_fetch] REC 0.000 prio -0.000 can request work
1/11/2023 12:05:39 PM | | [work_fetch] --- state for CPU ---
1/11/2023 12:05:39 PM | | [work_fetch] shortfall 844411.53 nidle 0.00 saturated 2273.24 busy 0.00
1/11/2023 12:05:39 PM | GPUGRID | [work_fetch] share 0.000 no applications
1/11/2023 12:05:39 PM | | [work_fetch] --- state for NVIDIA GPU ---
1/11/2023 12:05:39 PM | | [work_fetch] shortfall 172800.00 nidle 1.00 saturated 0.00 busy 0.00
1/11/2023 12:05:39 PM | GPUGRID | [work_fetch] share 0.000 project is backed off (resource backoff: 502.21, inc 600.00)
1/11/2023 12:05:39 PM | | [work_fetch] --- state for Intel GPU ---
1/11/2023 12:05:39 PM | | [work_fetch] shortfall 172800.00 nidle 1.00 saturated 0.00 busy 0.00
1/11/2023 12:05:39 PM | GPUGRID | [work_fetch] share 0.000 project is backed off (resource backoff: 255.48, inc 600.00)
1/11/2023 12:05:39 PM | | [work_fetch] ------- end work fetch state -------
1/11/2023 12:05:39 PM | GPUGRID | choose_project: scanning
1/11/2023 12:05:39 PM | GPUGRID | can't fetch CPU: no applications
1/11/2023 12:05:39 PM | GPUGRID | can't fetch NVIDIA GPU: project is backed off
1/11/2023 12:05:39 PM | GPUGRID | can't fetch Intel GPU: project is backed off
Though i don't know what that means ID: 60808 · Rating: 0 · rate: / Reply Quote

goldfinch Send message Joined: 5 May 19 Posts: 36 Credit: 736,808,218 RAC: 28,150 Level Scientific publications	Message 60809 - Posted: 1 Nov 2023, 1:52:58 UTC - in response to Message 60807.
Your system does not seem to be asking for ATMbeta tasks. Try taking a look at message #60725.
Thanks for the response. However, I checked earlier and confirmed that beta tasks were enabled: see my post. I would appreciate any help in resolving this. ID: 60809 · Rating: 0 · rate: / Reply Quote

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 595 Credit: 13,083,686,510 RAC: 1,351,209 Level Scientific publications	Message 60810 - Posted: 1 Nov 2023, 7:42:47 UTC - in response to Message 60808. Last modified: 1 Nov 2023, 7:49:31 UTC When properly configured, the message should say something like this:
1/11/2023 12:02:55 PM | GPUGRID | No tasks are available for ATMbeta: Free energy calculations of protein-ligand binding
Try following message #60725 and using the specific links it contains to GPUGRID Project Preferences and GPUGRID Hosts. Edit the preferences as stated for the "Home" venue (for example), then change your host to the "Home" location. This should work. ID: 60810 · Rating: 0 · rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 1 Level Scientific publications	Message 60811 - Posted: 1 Nov 2023, 11:59:31 UTC - in response to Message 60810. Last modified: 1 Nov 2023, 11:59:52 UTC
When properly configured, the message should say something like this:
1/11/2023 12:02:55 PM | GPUGRID | No tasks are available for ATMbeta: Free energy calculations of protein-ligand binding
Try following message #60725 and using the specific links it contains to GPUGRID Project Preferences and GPUGRID Hosts. Edit the preferences as stated for the "Home" venue (for example), then change your host to the "Home" location. This should work.
this is exactly what I meant about checking the venue settings. many people do not know of the different venues or how they are set. the WCG supplied compute web preferences are not the same thing as the project-specific host venue preferences. you can make different selections for different venues to give folks the ability to have different computers crunch different things within the same project. if your 'home' or 'default' (blank) preferences are allowing ATMbeta, but the host is set to the work venue which does not allow ATMbeta, then you won't get them. you need to be mindful of what venue the host is set to, and what the specific settings for that venue are.
goldfinch, go here: https://gpugrid.net/prefs.php?subset=project and you will see that there are 4 different venues to choose from (default/home/school/work). make sure you are setting the preferences to allow ATMbeta and test apps for the venue your host is actually set to. you can see what venue it's set to here: https://gpugrid.net/hosts_user.php under the location column. (blank = default) ID: 60811 · Rating: 0 · rate: / Reply Quote

goldfinch Send message Joined: 5 May 19 Posts: 36 Credit: 736,808,218 RAC: 28,150 Level Scientific publications	Message 60812 - Posted: 1 Nov 2023, 12:15:04 UTC - in response to Message 60811. Thank you @Ian&Steve C. and @ServicEnginIC, i didn't realise that i was checking default profile, while my venue profile was work, and the latter didn't have Test tasks checkbox ticked. Tons of appreciation for your patience! Thank you very much! ID: 60812 · Rating: 0 · rate: / Reply Quote

[BAT] Svennemans Send message Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0 Level Scientific publications	Message 60813 - Posted: 1 Nov 2023, 12:54:17 UTC - in response to Message 60795.
there's no need to reanalyze all this. it's been done ad nauseam many months ago and discussed over and over again. all of the behavior is known at this point. tasks in the "0" group will all process and count the progress naturally. tasks in "1+" groups will all jump to 100% after the extraction phase. but will complete successfully in the normal time if you leave it alone.
Well, I did reanalyze it. Because I'm stubborn like that. But mainly because I failed to notice the 400+ posts hidden by default in this thread. :-D
The good news: I came to the same conclusion as Richard Haselgrove's post:
progress = float(isample - last_sample)/float(num_samples - last_sample)
should fix it, but even better would be:
progress = float(isample - last_sample + 1)/float(num_samples - last_sample + 1)
since that would make the denominator equal to the number of samples in the batch (fractions of 1/70 instead of the current 1/69) and would let the count go from 1->70 instead of 0->69.
The bad news: none of the GitHub repos that contain the above code and are retrieved on WU start have any branch or issue aiming to fix the progress issue. So I'll test my code fix locally once more. The first test seemed to work fine, but the WU quickly terminated on the NaN issue. Then I'll raise an issue or a pull request on the appropriate GitHub repo to try and get it fixed there. Fingers crossed... ID: 60813 · Rating: 0 · rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 1 Level Scientific publications	Message 60814 - Posted: 1 Nov 2023, 13:40:51 UTC - in response to Message 60813. refresh my memory; isample = current sample? last_sample = previous sample? num_samples = MAX_SAMPLES? if so, i don't think that code will work. isample - last_sample will always = 1 max_samples being "+70" for ALL units (0-10, 1-10, 2-10, etc) means that it's a relative value, not absolute. I think this is the crux of the issue. because the sample range for a 1-10 unit for example is actually [71,140] and for a 2-10 unit would be [141, 210] and so on, but the denominator is likely always using 70. so anything past the 0-10 unit's [1-70] range is a value >1 and is represented in boinc by just maxing out to 100% straight away. ID: 60814 · Rating: 0 · rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 1 Level Scientific publications	Message 60815 - Posted: 1 Nov 2023, 13:53:12 UTC - in response to Message 60814. I think it's this section that has some problem, but I haven't fully digested exactly what it's doing yet. starting from line 102 here: https://github.com/Gallicchio-Lab/AToM-OpenMM/blob/master/sync/atm.py
last_sample = self.replicas[0].get_cycle()
num_samples = self.config['MAX_SAMPLES']
if num_samples.startswith("+"):
    num_extra_samples = int(num_samples[1:])
    num_samples = num_extra_samples + last_sample - 1
    self.logger.info(f"Additional number of samples: {num_extra_samples}")
ID: 60815 · Rating: 0 · rate: / Reply Quote

[BAT] Svennemans Send message Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0 Level Scientific publications	Message 60818 - Posted: 1 Nov 2023, 15:06:58 UTC - in response to Message 60815.
I think it's this section that has some problem, but I haven't fully digested exactly what it's doing yet. starting from line 102 here: https://github.com/Gallicchio-Lab/AToM-OpenMM/blob/master/sync/atm.py
last_sample = self.replicas[0].get_cycle()
num_samples = self.config['MAX_SAMPLES']
if num_samples.startswith("+"):
    num_extra_samples = int(num_samples[1:])
    num_samples = num_extra_samples + last_sample - 1
    self.logger.info(f"Additional number of samples: {num_extra_samples}")
For normal units (0-whatever), MAX_SAMPLES will be "70" and last_sample = 1. In that case num_samples will be 70, with isample iterating from 1->70 (inclusive). So my formula's denominator will be num_samples - last_sample + 1, or 70-1+1=70. The numerator (isample - last_sample + 1) goes from 1-1+1=1 until 70-1+1=70. So it works for regular units.
For additional units (>0-whatever), MAX_SAMPLES will be "+70" and last_sample will be 71 or 141 or... In that case the 'if num_samples.startswith("+")' clause will be triggered. num_extra_samples will be 70 and num_samples = num_extra_samples + last_sample - 1, giving 70 + 71 - 1 = 140 (or 210, or...). isample will iterate from 71->140 or 141->210 or... The denominator will be 140 - 71 + 1 = 70 or 210 - 141 + 1 = 70 or... The numerator will be 71-71+1=1 until 140-71+1=70, or 141-141+1=1 until 210-141+1=70, or...
So in both cases, whatever MAX_SAMPLES may be and whatever the first sample number may be, the progress will go from 1/MAX_SAMPLES to MAX_SAMPLES/MAX_SAMPLES and you will get a nice representative percentage. Except of course for the 0.199% added in the beginning for the unpack tasks... ID: 60818 · Rating: 0 · rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 1 Level Scientific publications	Message 60819 - Posted: 1 Nov 2023, 15:28:10 UTC - in response to Message 60818. did you follow the exact logic of the code all the way through? or are you making some assumptions about how you think it works?
for starters, ALL tasks ("0", or "1+") get +70 in the MAX_SAMPLES, check your async.cntl for confirmation. it's relative all the time. just that the starting point for the 0 is 0, and the starting point for 1+ is whatever the end of the previous segment was. so all tasks follow the same code path in that respect.
second, where do you get that last_sample=1? the code says last_sample = self.replicas[0].get_cycle(), but I haven't worked through the code yet to see what that actually evaluates to. can you elaborate with specific code paths to where "self.replicas[0].get_cycle()" = 1?
similarly with num_extra_samples, how does this equal 70? it needs to be expanded from num_extra_samples = int(num_samples[1:]). not sure how the unbounded num_samples[1:] ends up being 70 in this case. i still think a problem lies in this section. ID: 60819 · Rating: 0 · rate: / Reply Quote

[BAT] Svennemans Send message Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0 Level Scientific publications	Message 60820 - Posted: 1 Nov 2023, 16:26:25 UTC - in response to Message 60819. Last modified: 1 Nov 2023, 16:27:17 UTC
did you follow the exact logic of the code all the way through? or are you making some assumptions about how you think it works? for starters, ALL tasks ("0", or "1+") get +70 in the MAX_SAMPLES, check your async.cntl for confirmation. it's relative all the time. just that the starting point for the 0 is 0, and the starting point for 1+ is whatever the end of the previous segment was. so all tasks follow the same code path in that respect.
Some comments added to follow along:
num_samples = self.config['MAX_SAMPLES']  # HERE, num_samples is a string type!
if num_samples.startswith("+"):
    num_extra_samples = int(num_samples[1:])  # [1:] skips the first character of the string, so "+70" becomes "70"; int() then casts it to an integer
    num_samples = num_extra_samples + last_sample - 1  # HERE, num_samples becomes an integer type
    self.logger.info(f"Additional number of samples: {num_extra_samples}")
else:
    num_samples = int(num_samples)  # HERE, num_samples becomes an integer
    self.logger.info(f"Target number of samples: {num_samples}")
It doesn't matter if it's always "+70": for "0" tasks, last_sample will be 1, so num_samples = num_extra_samples + last_sample - 1 = 70 + 1 - 1 = 70.
second, where do you get that last_sample=1? the code says last_sample = self.replicas[0].get_cycle(), but I haven't worked through the code yet to see what that actually evaluates to. can you elaborate with specific code paths to where "self.replicas[0].get_cycle()" = 1?
The get_cycle() calculation is buried deep somewhere in the openMM libraries, and I haven't managed to find exactly where, so I can't prove it to you from the code. Empirically (run.log), however, the "0" units start from cycle 1 and the "1" units start from cycle 71. Also, it doesn't really matter what last_sample is. Let's say it's X. Then:
num_samples = num_extra_samples + last_sample - 1 = 70 + X - 1
for isample in range(last_sample, num_samples + 1): => isample goes from X until X + 70 - 1 (the end value of 'range' is excluded!)
numerator = isample - last_sample + 1 => the numerator goes from X - X + 1 = 1 until X + 70 - 1 - X + 1 = 70
denominator = num_samples - last_sample + 1 = 70 + X - 1 - X + 1 = 70
So again progress will go from 1/70 to 70/70. Replace 70 everywhere by an arbitrary value of MAX_SAMPLES and you will see that it doesn't matter what value is in there; it will work as expected.
similarly with num_extra_samples, how does this equal 70? it needs to be expanded from num_extra_samples = int(num_samples[1:]). not sure how the unbounded num_samples[1:] ends up being 70 in this case. i still think a problem lies in this section.
Simple Python string operation. Python is very clever and flexible with types. See my added comments in the first code snippet.
num_samples = self.config['MAX_SAMPLES']  # HERE, num_samples is a string type!
So it's a string here, but a string is also a sequence of characters, where num_samples[0] will be a '+' in most cases. If it is a plus, then the if-clause will trigger.
num_extra_samples = int(num_samples[1:])  # [1:] skips the first character of the string, so "+70" becomes "70"; int() then casts it to an integer
Since [0] is a '+', [1:] will be "70": if the end of the range is left empty, Python 'knows' how long the string is and returns everything up to the end of the string, but not beyond. So not 'unbounded', but 'implicitly bounded'. The int() around it converts the "70" string into the integer 70. Assigning it to num_extra_samples makes that variable an 'int' (Python magic again).
And if it wasn't a '+' because MAX_SAMPLES was "70", then the 'else' clause will trigger:
num_samples = int(num_samples)
This simply turns the string num_samples = "70" into the int num_samples = 70. Plug that into the example above and see that, once again, the progress counter works as it should. ID: 60820 · Rating: 0 · rate: / Reply Quote
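For readers following along, the parsing logic walked through above can be sketched as a standalone function. This is only an illustration under stated assumptions, not the project's code: the names `resolve_num_samples` and `sample_indices` are hypothetical, and `last_sample` is passed in directly because the thread did not locate where `get_cycle()` is computed.

```python
def resolve_num_samples(max_samples: str, last_sample: int) -> int:
    """Return the index of the final sample for this run.

    Hypothetical helper mirroring the snippet quoted in the thread:
    a leading "+" means "this many additional samples"; otherwise the
    string is taken as the absolute target.
    """
    if max_samples.startswith("+"):
        num_extra_samples = int(max_samples[1:])  # "+70" -> "70" -> 70
        return num_extra_samples + last_sample - 1
    return int(max_samples)


def sample_indices(max_samples: str, last_sample: int) -> range:
    """Sample indices the loop would visit (num_samples itself is included)."""
    num_samples = resolve_num_samples(max_samples, last_sample)
    return range(last_sample, num_samples + 1)
```

Whatever value `last_sample` has, `sample_indices("+70", last_sample)` contains exactly 70 entries, which is the point made in the post above.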

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 1 Level Scientific publications	Message 60821 - Posted: 1 Nov 2023, 17:04:05 UTC - in response to Message 60820. OK I'm following better (forgot that [1:] was omitting the first character and was thinking it was starting from one with the 0/1 mixup in what is "first"). I also couldn't find the get_cycle() routine. the way these tasks (scripts) are setup is very convoluted with how all the pieces of code get pulled in. importing bits of code from all over the place while the actual execution script is only a few lines long lol. really have to go down a rabbit hole to see what's happening. so it seems the original progress calculation of progress = float(isample)/float(num_samples - last_sample) ends up evaluating to a negative number since the num_samples will be 70 and the last sample will be >70. BOINC must be freaking out not knowing what to do with a negative number and calling it 100%. ID: 60821 · Rating: 0 · rate: / Reply Quote

[BAT] Svennemans Send message Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0 Level Scientific publications	Message 60822 - Posted: 1 Nov 2023, 17:14:57 UTC - in response to Message 60821. Last modified: 1 Nov 2023, 17:18:04 UTC OK I'm following better (forgot that [1:] was omitting the first character and was thinking it was starting from one with the 0/1 mixup in what is "first"). I also couldn't find the get_cycle() routine. the way these tasks (scripts) are setup is very convoluted with how all the pieces of code get pulled in. importing bits of code from all over the place while the actual execution script is only a few lines long lol. really have to go down a rabbit hole to see what's happening. so it seems the original progress calculation of progress = float(isample)/float(num_samples - last_sample) ends up evaluating to a negative number since the num_samples will be 70 and the last sample will be >70. BOINC must be freaking out not knowing what to do with a negative number and calling it 100%. Not negative no, but for "1+" units it will go beyond 1, and also (minor issue) increment in fractions of 1/69 instead of 1/70. Remember that num_samples = the max_samples parameter PLUS the last_sample! isample will go from 71-140 (or 141-210 etc) num_samples will be 140 or 210 or... last_sample will be 71 or 141 or... (but remember it doesn't really matter) so numerator 71=>140, denominator = 140 - 71 = 69 (or 210 - 141 = 69 or...) progress going from 71/69 until 140/69. Both > 1 so progress immediately jumps to 100%. ID: 60822 · Rating: 0 · rate: / Reply Quote
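The two formulas discussed in this exchange can be compared numerically. This is a minimal sketch, assuming the sample ranges reported in the thread (1 to 70 for a "0" unit, 71 to 140 for a "1" unit); the function names are made up for illustration, and how BOINC treats a fraction above 1 is not modelled here.

```python
def progress_original(isample: int, last_sample: int, num_samples: int) -> float:
    """Formula as quoted in the thread: the numerator is the absolute sample index."""
    return float(isample) / float(num_samples - last_sample)


def progress_fixed(isample: int, last_sample: int, num_samples: int) -> float:
    """Proposed fix: count samples relative to the start of the current batch."""
    return float(isample - last_sample + 1) / float(num_samples - last_sample + 1)


if __name__ == "__main__":
    # (last_sample, num_samples) for a "0" unit and a "1" unit, each with +70 samples
    for last_sample, num_samples in [(1, 70), (71, 140)]:
        first, final = last_sample, num_samples
        print(
            f"samples {first}-{final}: "
            f"original {progress_original(first, last_sample, num_samples):.3f}"
            f" -> {progress_original(final, last_sample, num_samples):.3f}, "
            f"fixed {progress_fixed(first, last_sample, num_samples):.3f}"
            f" -> {progress_fixed(final, last_sample, num_samples):.3f}"
        )
```

For the "1" unit the original formula already exceeds 1 at the first sample (71/69), while the fixed formula runs from 1/70 to 70/70 for both units.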

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 1 Level Scientific publications	Message 60823 - Posted: 1 Nov 2023, 18:05:48 UTC - in response to Message 60822. you're right I missed that bit. thanks. I think a pull request to the original code would be good if you're able to do that. will probably be the only way to get it to the coder's attention since there is a bit of a game of telephone between the users and the person who has to ultimately make the change. the other option would be to insert your own version of the atm.py code, and splice in some command(s) to replace it right after it downloads the original copy. which could be done by modifying the wrapper config file (job.xml.[...]) with the additional command to swap in a new version of run.sh. in the new run.sh needs to be a new version of rbfe_explicit_sync.py. and finally in rbfe_explicit_sync.py can be your command to copy in the new version of atm.py. very convoluted, but should work to automate the changes for you locally on newly downloaded tasks without having to stop them (which will fail the task) or trying to intercept things on the fly. ID: 60823 · Rating: 0 · rate: / Reply Quote

[BAT] Svennemans Send message Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0 Level Scientific publications	Message 60824 - Posted: 1 Nov 2023, 18:28:49 UTC - in response to Message 60823.
you're right I missed that bit. thanks. I think a pull request to the original code would be good if you're able to do that. will probably be the only way to get it to the coder's attention since there is a bit of a game of telephone between the users and the person who has to ultimately make the change. the other option would be to insert your own version of the atm.py code, and splice in some command(s) to replace it right after it downloads the original copy. which could be done by modifying the wrapper config file (job.xml.[...]) with the additional command to swap in a new version of run.sh. in the new run.sh needs to be a new version of rbfe_explicit_sync.py. and finally in rbfe_explicit_sync.py can be your command to copy in the new version of atm.py. very convoluted, but should work to automate the changes for you locally on newly downloaded tasks without having to stop them (which will fail the task) or trying to intercept things on the fly.
That's basically how I'm testing it on my machine, but a real fix would also imply somebody doing that on the server side: adding the new atm.py and run.sh to the 'program package'. If you do it locally, you also need to edit client_state.xml while BOINC is stopped, to bypass the code signing mechanism by inserting the correct md5sum and byte size. FYI: run.sh is part of the server-generated input files for each WU.
I'm not a programmer (or not anymore), so I have no real git skills, but I did post an issue on the relevant GitHub repo. If that's not picked up I'll try the pull request. ID: 60824 · Rating: 0 · rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 1 Level Scientific publications	Message 60825 - Posted: 1 Nov 2023, 18:59:49 UTC - in response to Message 60824. you can set <dont_check_file_sizes> in cc_config.xml and change anything you want in BOINC :) the only "BOINC" files you need to modify are the job.xml and you set it up to copy in the new run.sh after it extracts the input files (overwriting the original). only their tar file is checked, not the extracted contents. ID: 60825 · Rating: 0 · rate: / Reply Quote

[BAT] Svennemans Send message Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0 Level Scientific publications	Message 60826 - Posted: 1 Nov 2023, 20:37:42 UTC - in response to Message 60825.
you can set <dont_check_file_sizes> in cc_config.xml and change anything you want in BOINC :) the only "BOINC" files you need to modify are the job.xml and you set it up to copy in the new run.sh after it extracts the input files (overwriting the original). only their tar file is checked, not the extracted contents.
Could be, although I'm reading this entry as "BOINC will check the integrity of this file (job.xml) to avoid tampering":
<file>
    <name>job.xml.789bd8d206da56434f30083d18653299</name>
    <nbytes>828.000000</nbytes>
    <max_nbytes>0.000000</max_nbytes>
    <status>1</status>
    <signature_required/>
    <file_signature>
4b7b99c3260c591fe387d31d63158d0061c1b2fb5ef74395eada7cbb13c67b80
...etcetera...
0e12d16e50df943339987857aa157b863ad1dcbb8712cd0e21c1968fc7ca561a
.
    </file_signature>
But it's a moot point, isn't it? It would potentially fix the issue for me or for anyone willing to put in the tweaking effort but not for the general user. I'll give it a try though. ;-) ID: 60826 · Rating: 0 · rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 1 Level Scientific publications	Message 60827 - Posted: 1 Nov 2023, 20:50:19 UTC - in response to Message 60826. i did the same thing on PythonGPU earlier this year. worked fine. ID: 60827 · Rating: 0 · rate: / Reply Quote