Message boards :
News :
ATM
| Author | Message |
|---|---|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
i see that the host is linking up to your "work" venue/location. verify that the settings for the work venue allow beta/test applications.
|
ServicEnginIC Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,075
|
Your system seems not to be asking for ATMbeta tasks. Try taking a look at message #60725. |
|
Joined: 5 May 19 Posts: 36 Credit: 711,308,218 RAC: 45
|
This 'work' profile comes from WCG and doesn't have anything about beta-testing tasks. Besides, it didn't cause any issues in the past. Where else should i look? The logs show that it's GPUGrid that doesn't return any new tasks:
1/11/2023 12:02:55 PM | GPUGRID | No tasks are available for ATM: Free energy calculations of protein-ligand binding
That is, to me it appears that my BOINC client sends requests to the server but receives nothing. I enabled debug logging in BOINC, and this is the output:
1/11/2023 12:05:39 PM | | [work_fetch] ------- start work fetch state -------
Though i don't know what that means. |
|
Joined: 5 May 19 Posts: 36 Credit: 711,308,218 RAC: 45
|
Your system seems not to be asking for ATMbeta tasks.
Thanks for the response. However, I checked earlier and confirmed that beta tasks were enabled: see my post. I will appreciate any help in resolving this. |
ServicEnginIC Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,075
|
When properly configured, the message should say something like this:
1/11/2023 12:02:55 PM | GPUGRID | No tasks are available for ATMbeta: Free energy calculations of protein-ligand binding
Try following message #60725 and using the specific links it contains to GPUGRID Project Preferences and GPUGRID Hosts Edit Preferences, as stated for the "Home" venue (for example), then change your host to the "Home" location. This should work. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
When properly configured, the message should say something like this:
this is exactly what I meant about checking the venue settings. many people do not know of the different venues or how they are set. the WCG-supplied compute web preferences are not the same thing as the project-specific host venue preferences. you can make different selections for different venues to give folks the ability to have different computers crunch different things within the same project.
if your 'home' or 'default' (blank) preferences allow ATMbeta, but the host is set to the work venue, which does not allow ATMbeta, then you won't get them. you need to be mindful of what venue the host is set to, and what the specific settings for that venue are.
goldfinch, go here: https://gpugrid.net/prefs.php?subset=project and you will see that there are 4 different venues to choose from (default/home/school/work). make sure you are setting the preferences to allow ATMbeta and test apps for the venue corresponding to your actual selected venue. you can see what venue it's set to here: https://gpugrid.net/hosts_user.php under the location column. (blank = default)
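The venue logic described above can be pictured with a small sketch. This is only a toy model: the dictionary, its keys, and the function name are hypothetical and do not reflect how the GPUGRID server stores or evaluates preferences. It just shows why ticking the boxes on the default venue has no effect on a host assigned to the work venue.

```python
# Hypothetical per-venue preferences; names and values are illustrative only
# and are not taken from the GPUGRID server.
VENUE_PREFS = {
    "": {"ATMbeta": True, "test_apps": True},          # default (blank) venue
    "home": {"ATMbeta": True, "test_apps": True},
    "school": {"ATMbeta": False, "test_apps": False},
    "work": {"ATMbeta": False, "test_apps": False},
}


def host_gets_app(host_venue: str, app: str) -> bool:
    """Only the preferences of the host's own venue are consulted."""
    prefs = VENUE_PREFS[host_venue]
    return bool(prefs.get(app, False) and prefs["test_apps"])


# A host on the default venue versus a host on the work venue.
print("default venue:", host_gets_app("", "ATMbeta"))
print("work venue:", host_gets_app("work", "ATMbeta"))
```

In this model the fix is either to enable the options on the work venue or to move the host to a venue that already allows them.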
|
|
Joined: 5 May 19 Posts: 36 Credit: 711,308,218 RAC: 45
|
Thank you @Ian&Steve C. and @ServicEnginIC, i didn't realise that i was checking *default* profile, while my *venue* profile was *work*, and the latter didn't have Test tasks checkbox ticked. Tons of appreciation for your patience! Thank you very much! |
|
Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0
|
there's no need to reanalyze all this. it's been done ad nauseam many months ago and discussed over and over again. all of the behavior is known at this point.
Well, I did reanalyze it. Because I'm stubborn like that. But mainly because I failed to notice the 400+ posts hidden by default in this thread. :-D
The good news: I came to the same conclusion as Richard Haselgrove's post:
progress = float(isample - last_sample)/float(num_samples - last_sample)
should fix it, but even better would be:
progress = float(isample - last_sample + 1)/float(num_samples - last_sample + 1)
since that would make the denominator equal to the number of samples in the batch (fractions of 1/70 instead of the current 1/69) and would let the count go from 1->70 instead of 0->69.
The bad news: none of the GitHub repos that contain the above code and are retrieved on WU start has any branch or issue aiming to fix the progress issue. So I'll test my code fix locally once more. The first test seemed to work fine, but the WU terminated on the NaN issue quickly. Then I'll raise an issue or a pull request on the appropriate GitHub repo to try and get it fixed there. Fingers crossed... |
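To compare the two proposed formulas side by side, here is a minimal sketch. The function names are made up for illustration, and the batch bounds (samples 1 to 70) are an assumption based on the values mentioned in the post, not something read from the project code.

```python
def progress_zero_based(isample: int, last_sample: int, num_samples: int) -> float:
    """First proposed formula: the count effectively runs 0..69 over 69 steps."""
    return float(isample - last_sample) / float(num_samples - last_sample)


def progress_one_based(isample: int, last_sample: int, num_samples: int) -> float:
    """Second proposed formula: the count runs 1..70 over 70 steps."""
    return float(isample - last_sample + 1) / float(num_samples - last_sample + 1)


# Assumed first batch: samples 1..70 inclusive.
last_sample, num_samples = 1, 70
for isample in (last_sample, num_samples):
    print(
        isample,
        progress_zero_based(isample, last_sample, num_samples),
        progress_one_based(isample, last_sample, num_samples),
    )
```

Both variants end at exactly 1.0 on the last sample; they differ in whether the first sample reports 0 or 1/70.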
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
refresh my memory; isample = current sample? last_sample = previous sample? num_samples = MAX_SAMPLES? if so, i don't think that code will work. isample - last_sample will always = 1 max_samples being "+70" for ALL units (0-10, 1-10, 2-10, etc) means that it's a relative value, not absolute. I think this is the crux of the issue. because the sample range for a 1-10 unit for example is actually [71,140] and for a 2-10 unit would be [141, 210] and so on, but the denominator is likely always using 70. so anything past the 0-10 unit's [1-70] range is a value >1 and is represented in boinc by just maxing out to 100% straight away.
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
I think it's this section that has some problem, but I haven't fully digested exactly what it's doing yet. starting from line 102 here: https://github.com/Gallicchio-Lab/AToM-OpenMM/blob/master/sync/atm.py
last_sample = self.replicas[0].get_cycle()
|
|
Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0
|
I think it's this section that has some problem, but I haven't fully digested exactly what it's doing yet.
For normal units (0-whatever), MAX_SAMPLES will be "70" and last_sample = 1. In that case num_samples will be 70, with isample iterating from 1->70 (inclusive). So my formula's denominator will be num_samples - last_sample + 1, or 70-1+1=70. The numerator (isample - last_sample + 1) goes from 1-1+1=1 until 70-1+1=70. So it works for regular units.
For additional units (>0-whatever), MAX_SAMPLES will be "+70" and last_sample will be 71 or 141 or... In that case the 'if num_samples.startswith("+")' clause will be triggered. num_extra_samples will be 70 and num_samples = num_extra_samples + last_sample - 1, giving 70 + 71 - 1 = 140 (or 210, or...). isample will iterate from 71->140 or 141->210 or... The denominator will be 140 - 71 + 1 = 70 or 210 - 141 + 1 = 70 or... The numerator will go from 71-71+1=1 until 140-71+1=70, or 141-141+1=1 until 210-141+1=70, or...
So in both cases, whatever the NUM_SAMPLES may be and whatever the first sample number may be, the progress will go from 1/NUM_SAMPLES to NUM_SAMPLES/NUM_SAMPLES and you will get a nice representative percentage. Except of course for the 0.199% added in the beginning for the unpack tasks... |
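The arithmetic in this post can be replayed for several batches at once. The sketch below is an illustration only: `batch_progress` is a hypothetical helper, and the start values 1, 71 and 141 are the ones quoted in the post rather than values read from a real work unit.

```python
def batch_progress(last_sample: int, num_extra_samples: int) -> list[float]:
    """Progress values for one batch using the proposed one-based formula."""
    # Same bookkeeping as described in the post for a "+N" MAX_SAMPLES value.
    num_samples = num_extra_samples + last_sample - 1
    return [
        (isample - last_sample + 1) / (num_samples - last_sample + 1)
        for isample in range(last_sample, num_samples + 1)
    ]


# Start values quoted in the thread for the "0", "1" and "2" units.
for start in (1, 71, 141):
    values = batch_progress(start, 70)
    print(f"start={start}: {len(values)} steps, first={values[0]:.4f}, last={values[-1]:.4f}")
```

Each batch yields the same number of steps and the same first and last values, which is the point being made: the starting sample cancels out.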
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
did you follow the exact logic of the code all the way through? or are you making some assumptions about how you think it works?
for starters, ALL tasks ("0", or "1+") get +70 in the MAX_SAMPLES, check your async.cntl for confirmation. it's relative all the time. just that the starting point for the 0 is 0, and the starting point for 1+ is whatever the end of the previous segment was. so all tasks follow the same code path in that respect.
second, where do you get that last_sample=1? the code says last_sample = self.replicas[0].get_cycle(), but i haven't worked through the code yet to see what that actually evaluates to. can you elaborate with specific code paths to where "self.replicas[0].get_cycle()" = 1?
similarly with num_extra_samples, how does this equal 70? it needs to be expanded from num_extra_samples = int(num_samples[1:]). not sure how the unbounded num_samples[1:] ends up being 70 in this case.
i still think a problem lies in this section.
|
|
Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0
|
did you follow the exact logic of the code all the way through? or are you making some assumptions about how you think it works?
some comments added to follow along:
num_samples = self.config['MAX_SAMPLES']  # HERE, num_samples is a string type!
if num_samples.startswith("+"):
    num_extra_samples = int(num_samples[1:])  # [1:] skips the first character of the string, so "+70" becomes "70"; int() then converts it to an integer
    num_samples = num_extra_samples + last_sample - 1  # HERE, num_samples becomes an integer type.
    self.logger.info(f"Additional number of samples: {num_extra_samples}")
else:
    num_samples = int(num_samples)  # HERE, num_samples becomes an integer
    self.logger.info(f"Target number of samples: {num_samples}")
Doesn't matter if it's always "+70": for "0" tasks, last_sample will be 1, so num_samples = num_extra_samples + last_sample - 1 = 70 + 1 - 1 = 70.
second, where do you get that last_sample=1? the code says last_sample = self.replicas[0].get_cycle(), but havent worked through the code yet to see what that actually evaluates to. can you elaborate with specific code paths to where "self.replicas[0].get_cycle()" = 1?
The get_cycle() calculation is buried deep somewhere in the openMM libraries, and I haven't managed to find it, so I can't prove it to you. However, empirically (run.log), the "0" units start from cycle 1 and the "1" units start from cycle 71. Also, it doesn't really matter what last_sample is. Let's say it's X. Then:
num_samples = num_extra_samples + last_sample - 1 = 70 + X - 1
for isample in range(last_sample, num_samples + 1): => isample goes from X until X + 70 - 1 (the last integer of 'range' is excluded!)
numerator = isample - last_sample + 1 => the numerator goes from X - X + 1 = 1 until X + 70 - 1 - X + 1 = 70
denominator = num_samples - last_sample + 1 = 70 + X - 1 - X + 1 = 70
So again progress will go from 1/70 to 70/70. Replace 70 everywhere by an arbitrary value of MAX_SAMPLES and you will again see that it doesn't matter what value is in there; it will work as expected.
similarly with num_extra_samples, how does this equal to 70? it needs to be expanded from num_extra_samples = int(num_samples[1:]). not sure how the unbounded num_samples[1:] ends up being 70 in this case.
Simple Python string operation. Python is very flexible with types. See my added comments in the first code snippet.
num_samples = self.config['MAX_SAMPLES']  # HERE, num_samples is a string type!
So it's a string here, and a string can be indexed like an array of characters, where num_samples[0] will be a '+' in most cases. If it is a plus, then the if-clause will trigger.
num_extra_samples = int(num_samples[1:])  # [1:] skips the first character of the string, so "+70" becomes "70"; int() then converts it to an integer
Since the [0] is a '+', the [1:] will be "70": if the end of the range is left empty, Python 'knows' how long the string is and returns everything up to the end of the string but not beyond. So not 'unbounded', but 'implicitly bounded'. The int() around it converts the "70" string into the integer 70. Assigning it to num_extra_samples makes that variable an int (Python magic again).
And if it wasn't a '+' because MAX_SAMPLES was "70", then the 'else' clause will trigger:
num_samples = int(num_samples)
This simply rebinds num_samples from the string "70" to the int 70. Plug that into the example above and see that, once again, the progress counter works as it should. |
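The string handling explained here is easy to try in isolation. The sketch below pulls the branch into a standalone function; `resolve_num_samples` is a hypothetical name used for illustration, and the logging calls of the quoted code are left out.

```python
def resolve_num_samples(max_samples: str, last_sample: int) -> int:
    """Turn a MAX_SAMPLES string into an absolute final sample number."""
    if max_samples.startswith("+"):
        # "+70"[1:] is "70": an open-ended slice runs to the end of the string.
        num_extra_samples = int(max_samples[1:])
        return num_extra_samples + last_sample - 1
    # No leading "+": the value is already an absolute target.
    return int(max_samples)


print("+70"[1:])                       # slice without the leading '+'
print(resolve_num_samples("+70", 1))   # relative value, first batch
print(resolve_num_samples("+70", 71))  # relative value, continuation batch
print(resolve_num_samples("70", 1))    # absolute value
```

Note that the variable changes type along the way only because Python names are rebound to new objects; nothing is converted in place.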
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
OK I'm following better (forgot that [1:] was omitting the first character and was thinking it was starting from one with the 0/1 mixup in what is "first"). I also couldn't find the get_cycle() routine. the way these tasks (scripts) are set up is very convoluted with how all the pieces of code get pulled in. importing bits of code from all over the place while the actual execution script is only a few lines long lol. really have to go down a rabbit hole to see what's happening.
so it seems the original progress calculation of
progress = float(isample)/float(num_samples - last_sample)
ends up evaluating to a negative number since the num_samples will be 70 and the last sample will be >70. BOINC must be freaking out not knowing what to do with a negative number and calling it 100%.
|
|
Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0
|
OK I'm following better (forgot that [1:] was omitting the first character and was thinking it was starting from one with the 0/1 mixup in what is "first"). I also couldn't find the get_cycle() routine. the way these tasks (scripts) are setup is very convoluted with how all the pieces of code get pulled in. importing bits of code from all over the place while the actual execution script is only a few lines long lol. really have to go down a rabbit hole to see what's happening.
Not negative, no, but for "1+" units it will go beyond 1, and also (minor issue) increment in fractions of 1/69 instead of 1/70. Remember that num_samples = the max_samples parameter PLUS the last_sample (minus 1)!
isample will go from 71-140 (or 141-210 etc)
num_samples will be 140 or 210 or...
last_sample will be 71 or 141 or... (but remember it doesn't really matter)
So the numerator goes 71=>140 and the denominator = 140 - 71 = 69 (or 210 - 141 = 69 or...), with progress going from 71/69 until 140/69. Both are > 1, so progress immediately jumps to 100%. |
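The overshoot described here can be reproduced with the original formula quoted earlier in the thread. In this sketch `as_percent` is a hypothetical stand-in for whatever turns the fraction into the displayed percentage; clamping to the 0-100 range is an assumption made to match the observed jump to 100%, not documented client behavior.

```python
def original_progress(isample: int, last_sample: int, num_samples: int) -> float:
    """Original calculation as quoted in the thread: isample is not offset."""
    return float(isample) / float(num_samples - last_sample)


def as_percent(fraction: float) -> float:
    """Assumed display behavior: clamp the fraction to 0..1, then scale."""
    return 100.0 * min(max(fraction, 0.0), 1.0)


# Continuation unit with the values from the post: samples 71..140.
last_sample, num_samples = 71, 140
for isample in (71, 140):
    fraction = original_progress(isample, last_sample, num_samples)
    print(isample, round(fraction, 3), as_percent(fraction))
```

Even the very first sample of a continuation unit already gives a fraction above 1, so under the clamping assumption the display would sit at 100% for the whole run.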
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
you're right I missed that bit. thanks.
I think a pull request to the original code would be good if you're able to do that. it will probably be the only way to get it to the coder's attention since there is a bit of a game of telephone between the users and the person who has to ultimately make the change.
the other option would be to insert your own version of the atm.py code, and splice in some command(s) to replace it right after it downloads the original copy. this could be done by modifying the wrapper config file (job.xml.[...]) with an additional command to swap in a new version of run.sh. the new run.sh needs to bring in a new version of rbfe_explicit_sync.py, and finally the new rbfe_explicit_sync.py can contain your command to copy in the new version of atm.py. very convoluted, but it should work to automate the changes for you locally on newly downloaded tasks without having to stop them (which will fail the task) or trying to intercept things on the fly.
|
|
Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0
|
you're right I missed that bit. thanks.
That's basically how I'm testing it on my machine, but that would also imply somebody does that on the server side - adding the new atm.py and run.sh to the 'program package'. If you do it locally, you also need to edit boinc_state.xml while BOINC is stopped to bypass the code-signing mechanism by inserting the correct md5sum and byte size. FYI - run.sh is part of the server-generated input files for each WU.
I'm not a programmer (or not anymore), so no real git skills, but I did post an issue on the relevant GitHub repo. If that's not picked up, I'll try the pull request. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
you can set <dont_check_file_sizes> in cc_config.xml and change anything you want in BOINC :) the only "BOINC" files you need to modify are the job.xml and you set it up to copy in the new run.sh after it extracts the input files (overwriting the original). only their tar file is checked, not the extracted contents.
|
|
Joined: 27 May 21 Posts: 54 Credit: 1,004,151,720 RAC: 0
|
you can set <dont_check_file_sizes> in cc_config.xml and change anything you want in BOINC :) the only "BOINC" files you need to modify are the job.xml and you set it up to copy in the new run.sh after it extracts the input files (overwriting the original). only their tar file is checked, not the extracted contents.
Could be, although I'm reading this entry as "BOINC will check the integrity of this file (job.xml) to avoid tampering":
<file>
<name>job.xml.789bd8d206da56434f30083d18653299</name>
<nbytes>828.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<status>1</status>
<signature_required/>
<file_signature>
4b7b99c3260c591fe387d31d63158d0061c1b2fb5ef74395eada7cbb13c67b80
...etcetera...
0e12d16e50df943339987857aa157b863ad1dcbb8712cd0e21c1968fc7ca561a
.
</file_signature>
But it's a moot point, isn't it? It would potentially fix the issue for me or for anyone willing to put in the tweaking effort, but not for the general user. I'll give it a try though. ;-) |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
i did the same thing on PythonGPU earlier this year. worked fine.
|
©2025 Universitat Pompeu Fabra