ATM

goldfinch

Send message
Joined: 5 May 19
Posts: 36
Credit: 711,308,218
RAC: 45
Level
Lys
Scientific publications
wat
Message 60828 - Posted: 2 Nov 2023, 9:25:02 UTC

I had this task running for 3+ hours, and the laptop hung. I switched it off and back on and restarted BOINC, and the task failed. I saw a post about tasks failing on restarts, but that task's output is different from mine. I have a few questions:
    - can tasks be restarted?
    - if not, what's the purpose of the checkpoints?
    - what was wrong with my task after the restart, and where/how can I get more detailed info, if needed? The BOINC log shows only the current session, in which the task failed, not the previous one, but I need the logs from the time of the hang...

As for the BOINC logs, here they are:

2/11/2023 7:22:35 PM | GPUGRID | [task_debug] task is running in processor group 0
2/11/2023 7:22:35 PM | GPUGRID | [task] task_state=EXECUTING for TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0 from start
2/11/2023 7:23:34 PM | GPUGRID | [task] Process for TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0 exited, exit code 195, task state 1
2/11/2023 7:23:34 PM | GPUGRID | [task] task_state=EXITED for TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0 from handle_exited_app
2/11/2023 7:23:34 PM | GPUGRID | [task] result state=COMPUTE_ERROR for TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0 from CS::report_result_error
2/11/2023 7:23:34 PM | GPUGRID | [task] Process for TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0 exited
2/11/2023 7:23:34 PM | GPUGRID | [task] exit code 195 (0xc3): The operating system cannot run %1. (0xc3)
2/11/2023 7:23:36 PM | GPUGRID | Computation for task TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0 finished
2/11/2023 7:23:36 PM | GPUGRID | Output file TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0_0 for task TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0 absent
2/11/2023 7:23:36 PM | GPUGRID | [task] result state=COMPUTE_ERROR for TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0 from CS::app_finished
Should I activate other debug logs for such cases? If yes, which ones?

Thanks for your help.
ID: 60828 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Ian&Steve C.

Send message
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Level
Trp
Scientific publications
wat
Message 60829 - Posted: 2 Nov 2023, 10:29:34 UTC - in response to Message 60828.  

At this time, tasks cannot be restarted at all. This is because checkpointing is broken in some way that the devs haven’t figured out. Checkpointing is there because it’s *supposed* to work. It just doesn’t right now. Restarting for any reason will cause the task to fail.

Not sure why your task hung. Could be because it’s a laptop and overheated. Or could be because it’s Windows. The Windows application has a lot more problems than the Linux application.
[BAT] Svennemans

Send message
Joined: 27 May 21
Posts: 54
Credit: 1,004,151,720
RAC: 0
Level
Met
Scientific publications
wat
Message 60830 - Posted: 2 Nov 2023, 11:06:07 UTC - in response to Message 60827.  

i did the same thing on PythonGPU earlier this year. worked fine.



Well I tried, didn't work until I killed BOINC and edited the (new) filesize into client_state.xml. Even though the logfile clearly showed 'don't check filesizes' enabled, it failed due to job.xml size mismatch.

Either a bug in the latest version, or some setting overriding it, like this one?

<signature_required/>


anyway, with the client_state size edit it does work.

made these changes:

    <task>
        <application>C:/Windows/system32/cmd.exe</application>
        <command_line>/c copy ..\..\newrun.bat run.bat</command_line>
        <weight>1</weight>
    </task>
    <task>
        <application>C:/Windows/system32/cmd.exe</application>
        <command_line>/c call run.bat *.cntl</command_line>
        <setenv>CUDA_DEVICE=$GPU_DEVICE_NUM</setenv>
        <stdout_filename>run.log</stdout_filename>
        <weight>1000</weight>
        <fraction_done_filename>progress</fraction_done_filename>
    </task>


The '*.cntl' argument added to the run.bat command line is needed because the input config file xxxxx.cntl is actually hardcoded into run.bat for each WU, as run.bat is part of the WU file.

So changes done to run.bat:
at the beginning:
set PARM1=%1
echo %PARM1%
for %%A in (%PARM1%) do (set "CONFIG_FILE=%%A")
echo %CONFIG_FILE%


...and deleted the original line setting the CONFIG_FILE variable

This will load the (alphabetically last) config file - hoping they never include more than one... ;-)
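
To make the "alphabetically last" behaviour concrete, here is a minimal Python sketch of the same selection logic. It is only an illustration, not part of the actual run.bat: the function names are made up, and it assumes the batch `for` loop visits the matching files in case-insensitive alphabetical order, which may not hold on every file system.

```python
from pathlib import Path


def pick_last_config(names):
    """Mimic the batch loop: every match overwrites CONFIG_FILE, so the last one wins.

    `names` is a list of file names; only those ending in .cntl are considered.
    Assumption: the loop visits files in case-insensitive alphabetical order.
    """
    configs = sorted((n for n in names if n.lower().endswith(".cntl")), key=str.lower)
    if not configs:
        raise FileNotFoundError("no .cntl config file found")
    if len(configs) > 1:
        print(f"warning: {len(configs)} config files found, using {configs[-1]}")
    return configs[-1]


def pick_config_in_dir(slot_dir):
    """Convenience wrapper (hypothetical) that lists a slot directory on disk."""
    return pick_last_config([p.name for p in Path(slot_dir).iterdir() if p.is_file()])


print(pick_last_config(["run.bat", "TYK2_m08_m01_asyncre.cntl"]))
```

The warning branch is the case hoped against above: with more than one .cntl file the choice becomes order-dependent.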

replace atm.py:
@echo Replace atm.py
copy ..\..\projects\www.gpugrid.net\atm_correct_progress.py Lib\site-packages\sync
rename Lib\site-packages\sync\atm.py atm.py.orig
rename Lib\site-packages\sync\atm_correct_progress.py atm.py

@echo Run AToM
python.exe Scripts\rbfe_explicit_sync.py %CONFIG_FILE% || goto EX22


and some exit handling to preserve relevant output files

set LEVEL=%ERRORLEVEL%

:EXIT
copy run.log ..\..\projects\www.gpugrid.net\
copy stderr.txt ..\..\projects\www.gpugrid.net\
copy progress ..\..\projects\www.gpugrid.net\

exit %LEVEL%

:EX14
set LEVEL=14
goto EXIT

:EX22
set LEVEL=22
goto EXIT


And now it's working fine with a 3-10 job (samples 211-280) and correct progress - no manual intervention.
Thanks for the tip!
Ian&Steve C.
Message 60831 - Posted: 2 Nov 2023, 11:35:03 UTC - in response to Message 60830.  

Glad to see it’s working. A good amount of work for just a quality of life change though. And the changes for Windows seem a bit more involved than they otherwise would be on Linux. Maybe the Windows vs Linux client is why you couldn’t get it working initially? I assume you stopped and restarted BOINC after the change to cc_config, and didn't just re-read the config file. BOINC has to be restarted.

Either way. Cool that it works. Hopefully that can be pushed up to the devs through GitHub.
[BAT] Svennemans
Message 60832 - Posted: 2 Nov 2023, 11:40:29 UTC - in response to Message 60831.  

Glad to see it’s working. A good amount of work for just a quality of life change though.


Yeah, and it didn't even really bother me in the first place. :-D I just like a good puzzle.

No response on the GitHub issue yet, though....
goldfinch
Message 60838 - Posted: 3 Nov 2023, 8:55:04 UTC - in response to Message 60830.  

Dear [BAT] Svennemans, can you please consolidate your fix in 1 post? I got what you did with the config files, but I couldn't follow what you changed in the Python files and where you put your new run.bat, because I'm not familiar with the project structure in BOINC and, particularly, with ATM projects. I also couldn't find the name and location of the modified config file. I think not only I but others, too, will appreciate a guide in the form of steps, especially considering that your system is Windows like mine... E.g.,

    * go to %programdata%\BOINC\slots\0\Lib\site-packages\sync
    * change atm.py as per such and such post
    * go to ...
    * change so and so as per such and such post
    * create new run.bat and place it ...
    * ...


Again, if you have time for that, your guide will be much appreciated. I would love to test your solution. (I have basic Python and cmd/powershell scripting skills - not enough to figure out things for myself, but enough to apply a fix.)
Thank you.

[BAT] Svennemans
Message 60839 - Posted: 3 Nov 2023, 14:39:44 UTC - in response to Message 60838.  

Sure I can, goldfinch.

Buckle up, because as Ian&Steve C. said:
A good amount of work for just a quality of life change


Here goes:

Procedure for Windows!

    * Get an ATMbeta running task and set GPUGRID to 'no new tasks' because you'll need to stop BOINC later which would crash any running ATM task.
    If you do have queued up ATMbeta tasks, suspend them before they start.
    * Figure out in which slot directory within %programdata%\BOINC\slots\ the ATM task is running - let's call that <atmslot>
    * Determine where you'll put your locally changed files. I chose to put them in the project directory %programdata%\BOINC\projects\www.gpugrid.net\ but that's up to personal choice. You will need the path to this directory for the script changes, either absolute (e.g. C:\ProgramData\BOINC\projects\www.gpugrid.net\) or relative from the perspective of the <atmslot> directory (e.g. ..\..\projects\www.gpugrid.net)
    Let's call this the <localcopy> directory
    * Copy <atmslot>\run.bat to <localcopy>\newrun.bat
    Copy <atmslot>\Lib\site-packages\sync\atm.py to <localcopy>\atm_correct_progress.py
    Copy %programdata%\BOINC\projects\www.gpugrid.net\job.xml.<a-very-long-number> to <localcopy>\newjob.xml
    Caution: When you're running other GPUGRID applications besides ATMbeta, there may be more than one job.xml.<very-long-number> present! To find the correct one, open <atmslot>\job.xml and check which one to use in all subsequent steps of this procedure.
    <atmslot>\job.xml:
    <soft_link>../../projects/www.gpugrid.net/job.xml.789bd8d206da56434f30083d18653299</soft_link>

    * Edit <localcopy>\atm_correct_progress.py
    Find the line (approx #139) that says
    # Report progress on GPUGRID
                       progress = float(isample)/float(num_samples - last_sample)

    and change it to this
    # Report progress on GPUGRID
                        progress = float(isample - last_sample + 1)/float(num_samples - last_sample + 1)

    ATTENTION to readers that are not Python-savvy!! Be very careful not to touch/change any of the whitespace leading those lines, as Python treats that as relevant. Safest to not copy/paste the above lines, just manually make the changes!
    Save&close
    * Edit <localcopy>\newrun.bat

      * Add the following lines to the beginning of the file:
      set PARM1=%1
      echo %PARM1%
      for %%A in (%PARM1%) do (set "CONFIG_FILE=%%A")
      echo %CONFIG_FILE%

      * Look for this line
      python.exe -m pip install %REPO_URL% || exit 14

      And change it to this
      python.exe -m pip install %REPO_URL% || goto EX14

      This is not strictly necessary, but together with the errorhandling in 6.5 (the lines added to the end of the file) it will preserve a copy of the logfiles when something goes wrong, to facilitate debugging.
      * Look for these lines
      @echo Extract restart
      tar.exe xjvf restart.tar.bz2 || true

      And insert code to copy the local version of atm.py
      @echo Extract restart
      tar.exe xjvf restart.tar.bz2 || true
      
      @echo Replace atm.py
      copy ..\..\projects\www.gpugrid.net\atm_correct_progress.py Lib\site-packages\sync
      rename Lib\site-packages\sync\atm.py atm.py.orig
      rename Lib\site-packages\sync\atm_correct_progress.py atm.py
      

      NOTE: I gave the example where <localcopy> = '..\..\projects\www.gpugrid.net\' which would work for anyone - but remember to change it if you put your <localcopy> elsewhere.
      NOTE2: I'm aware I could have directly copied over atm.py but this gives me a visual confirmation when inspecting <atmslot>\Lib\site-packages\sync that my script is working fine.
      * Look for these lines
      @echo Run AToM
      set CONFIG_FILE=TYK2_m08_m01_asyncre.cntl
      python.exe Scripts\rbfe_explicit_sync.py %CONFIG_FILE% || exit 22

      Your 'set' statement will be different, because the config file 'xxxxxx.cntl' is different for each task. Don't worry about it.
      Delete the 'set' statement and make following changes to the 'python' statement
      @echo Run AToM
      python.exe Scripts\rbfe_explicit_sync.py %CONFIG_FILE% || goto EX22

      Note again the optional but useful change for errorhandling purposes
      * Add the following lines to the end of the file
      set LEVEL=%ERRORLEVEL%
      
      :EXIT
      copy Lib\site-packages\sync\atm.py ..\..\projects\www.gpugrid.net\
      copy run.bat ..\..\projects\www.gpugrid.net\
      copy progress ..\..\projects\www.gpugrid.net\
      copy stderr.txt ..\..\projects\www.gpugrid.net\
      copy run.log ..\..\projects\www.gpugrid.net\
      
      exit %LEVEL%
      
      :EX14
      set LEVEL=14
      goto EXIT
      
      :EX22
      set LEVEL=22
      goto EXIT


      This allows you to inspect logfiles if the ATM task should fail unexpectedly and allows you to check if corrected versions of run.bat and atm.py were indeed copied into <atmslot>
      * Save & Close


    * Edit <localcopy>\newjob.xml


      * Look for the following <task> entry
          <task>
              <application>Library/usr/bin/tar.exe</application>
              <command_line>xjvf input.tar.bz2</command_line>
              <setenv>PATH=$PWD/Library/usr/bin</setenv>
              <weight>1</weight>
          </task>

      and insert a new <task> behind the above
          <task>
              <application>Library/usr/bin/tar.exe</application>
              <command_line>xjvf input.tar.bz2</command_line>
              <setenv>PATH=$PWD/Library/usr/bin</setenv>
              <weight>1</weight>
          </task>
          <task>
              <application>C:/Windows/system32/cmd.exe</application>
              <command_line>/c copy ..\..\projects\www.gpugrid.net\newrun.bat run.bat</command_line>
              <weight>1</weight>
          </task>

      NOTE: I gave the example where <localcopy> = '..\..\projects\www.gpugrid.net\' which would work for anyone - but remember to change it if you put your <localcopy> elsewhere.

      * In the next <task> entry, look for the following line
              <command_line>/c call run.bat</command_line>
      

      and change it to this
      <command_line>/c call run.bat *.cntl</command_line>
      

      * Save & close
      * Open a command prompt window and 'CD' to your <localcopy> directory. Type 'DIR newjob.xml'
      02/11/2023  04:29             1.043 newjob.xml
                     1 File(s)          1.043 bytes
                     0 Dir(s)  2.334.709.661.696 bytes free

      Write down the number of bytes for newjob.xml. It is 1043 in my case, but it may be different for you. Don't write down the '.'.
      NOTE: You could also use <right-click>=>properties on the file in Explorer, but not everyone's OS language is English - and then you also need to remember to use the 'size' and NOT 'size on disk'


    * Now wait for all running ATM tasks to finish. When they do - and remember from the beginning you should have no new (un-suspended) ATM jobs queued up for starting - shut down BOINC.
    Make sure the BOINC client is stopped, not just the manager. Best way to do this using BOINC manager is selecting menu "File"->"Exit BOINC", make sure "Stop running tasks when exiting the BOINC manager" is selected and press "OK". Using task manager, verify "boinc.exe" is not running.
    * Navigate to %programdata%\BOINC directory
    * Edit 'cc_config.xml'
    Look for the following line

    <dont_check_file_sizes>0</dont_check_file_sizes>

    And change it to
    <dont_check_file_sizes>1</dont_check_file_sizes>

    If the line wasn't there, just add it somewhere in the <options> section.
    Save & Close
    NOTE: this alone should do the trick theoretically, but it didn't work for me. Your mileage may vary. It doesn't hurt in any case and you're free to try and skip the next section editing 'client_state.xml' but for me, editing 'client_state.xml' was necessary...
    * Edit 'client_state.xml'

      * Search for the following section
      <project>
          <master_url>https://www.gpugrid.net/</master_url>
          <project_name>GPUGRID</project_name>

      * Within this <project> section, look for the following sub-section
      <file>
          <name>job.xml.789bd8d206da56434f30083d18653299</name>
          <nbytes>1018.000000</nbytes>

      NOTE: You may have a different <very-long-number> after 'job.xml' and a different number in <nbytes> - don't worry about it.
      However, Caution: if you have multiple versions of job.xml, see Step 4 of this procedure (the copy step with the caution about <atmslot>\job.xml) to figure out the correct one.
      * Change the <nbytes> number to the byte-size of newjob.xml you noted down.
      from
      <nbytes>1018.000000</nbytes>

      to
      <nbytes>1043.000000</nbytes>

      But do use your numbers, not mine...
      * Save & Close


    * Navigate back to %programdata%\BOINC\projects\www.gpugrid.net
    Copy newjob.xml to job.xml.<very-long-number>
    Use your <very-long-number>, not mine!
    * Restart BOINC. Unsuspend queued up ATM tasks if any and "Allow new tasks" for GPUGRID.
    * Cross your fingers and see what happens. If you did everything right, progress will be correct from now on. If tasks fail immediately, you may need to check the steps from the beginning to see where it failed.
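
To see why the one-line change in atm_correct_progress.py matters, here is a small sketch that evaluates the old and the new progress formula side by side. The sample numbers are assumptions for illustration: a task covering samples 211-280 (as in the 3-10 job mentioned earlier), with last_sample taken as the first sample of the task and num_samples as the last.

```python
def progress_old(isample, num_samples, last_sample):
    """Original formula: divides the absolute sample number by the task length."""
    return float(isample) / float(num_samples - last_sample)


def progress_new(isample, num_samples, last_sample):
    """Corrected formula: counts samples done within this task only."""
    return float(isample - last_sample + 1) / float(num_samples - last_sample + 1)


# Assumed values for a task that runs samples 211..280
last_sample, num_samples = 211, 280

for isample in (211, 245, 280):
    old = progress_old(isample, num_samples, last_sample)
    new = progress_new(isample, num_samples, last_sample)
    print(f"sample {isample}: old={old:.3f} new={new:.3f}")
```

Under these assumptions the old formula is already above 1.0 (more than 100%) at the first sample of a later task, while the new one runs from 1/70 to exactly 1.0.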



Procedure for Linux
Probably very similar than above, except


    * Edit run.sh instead of run.bat
    * Replace all Windows commands by equivalent Linux shell commands
    * Replace all Windows path separators '\' by Linux path separators '/'


I don't have a Linux machine with GPU, so can't verify, but I'm sure some Linux-savvy user can post a full Linux version.

Remember:


    * The GPUGRID project may publish an updated job.xml version at any time! If that happens, you'll have to once again copy this version to 'newjob.xml' and redo everything starting from 'Edit newjob.xml'
    * The run.bat/run.sh files are part of the files downloaded from GPUGRID for every new ATM task. There again, GPUGRID may release a new version at any time. You'll have to redo the appropriate changes to a fresh copy of 'newrun.bat' when that happens.
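
Since these steps have to be redone whenever job.xml changes, the two lookups (which job.xml.<very-long-number> the slot points to, and what to put in <nbytes>) can be sketched in Python. This is only a helper idea, not something BOINC provides: the function names are made up, and the <nbytes> format with six decimals is simply copied from the example above.

```python
import os
import re


def soft_link_target(slot_job_xml_text):
    """Return the path inside <soft_link>...</soft_link>, or None if absent."""
    match = re.search(r"<soft_link>(.*?)</soft_link>", slot_job_xml_text)
    return match.group(1).strip() if match else None


def format_nbytes(size_in_bytes):
    """Format a byte count like the <nbytes> entry shown in client_state.xml above."""
    return f"{size_in_bytes:.6f}"


def nbytes_for_file(path):
    """Actual size of a file (not 'size on disk'), formatted for <nbytes>."""
    return format_nbytes(os.path.getsize(path))


# Content of <atmslot>\job.xml as shown in the procedure
slot_job_xml = (
    "<soft_link>../../projects/www.gpugrid.net/"
    "job.xml.789bd8d206da56434f30083d18653299</soft_link>"
)
print(os.path.basename(soft_link_target(slot_job_xml)))
print(format_nbytes(1043))
```

Calling nbytes_for_file on your own newjob.xml gives the value to paste, which avoids the thousands-separator confusion in the DIR output.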



Have fun! :-)

goldfinch
Message 60842 - Posted: 4 Nov 2023, 1:39:11 UTC - in response to Message 60839.  

Well... breath-taking! Thank you so much! It works! I checked both run.bat and atm.py after the BOINC restart - they were the updated versions. However, this victory came with tears... While I was editing the new version of run.bat, my laptop hung, with an ATM task almost 100% complete... Maybe overheating - it's a laptop, after all. I wish checkpoints could be fixed as easily as the progress indicator!

Question: why did you copy and then rename the atm.py file instead of copying to a new name straight away? As for how to determine the correct slot, here are my 2 cents (works with BOINC Manager; I don't know much about headless BOINC):

    * Select a running ATMBeta task
    * Click 'Property'
    * Check the 'Directory' property - it shows the task's slot



Again, thanks for the instructions. Easy to follow, easy to implement. You even took care of people who don't know Python! So gracious of you... Thanks a lot!

[BAT] Svennemans
Message 60843 - Posted: 4 Nov 2023, 12:41:44 UTC - in response to Message 60842.  
Last modified: 4 Nov 2023, 12:44:40 UTC

Well... breath-taking! Thank you so much! It works! I checked both run.bat and atm.py after the BOINC restart - they were the updated versions. However, this victory came with tears... While I was editing the new version of run.bat, my laptop hung, with an ATM task almost 100% complete... Maybe overheating - it's a laptop, after all. I wish checkpoints could be fixed as easily as the progress indicator!


Yeah, those checkpoints would indeed be great. I did see on the dev's Github page an issue for preemption and checkpointing and a comment they're trying to fix it, so fingers crossed...


Question: why did you copy and then rename the atm.py file instead of copying to a new name straight away?


See Note 2 in section 6.3. :-D
Long explanation: I had a couple of faults while debugging my procedure. BOINC then cleans out the slot directory faster than I could edit/validate the content of any files so I did it this way. I could quickly navigate into the Lib\site-packages\sync directory as the code was being unpacked and see if atm.py.orig popped up - or not. That was before I thought to include some errorhandling in run.bat.

As for how to determine the correct slot, here are my 2 cents (works with BOINC Manager; I don't know much about headless BOINC):

    * Select a running ATMBeta task
    * Click 'Property'
    * Check the 'Directory' property - it shows the task's slot



That's very true, and I didn't even think about that. I just quickly browsed through the few slot directories I had on my system to find the correct one. But your explanation is useful for anyone who doesn't know how to recognize the correct slot content on sight. I'll edit my post to include it.
<EDIT>: seems I can no longer edit my previous post. Oh well, they'll figure that one out I'm sure. ;-)

For headless BOINC, you'd need to look into client_state.xml for an <active_task> section of gpugrid:
<active_task>
    <project_master_url>https://www.gpugrid.net/</project_master_url>
    <result_name>Tyk2_jmc_28_jmc_27_1_RE-QUICO_ATM_Sch_GAFF2-3-10-RND3439_0</result_name>
    <active_task_state>1</active_task_state>
    <app_version_num>109</app_version_num>
    <slot>1</slot>
...
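
For the headless case, the lookup can be scripted. Below is a minimal sketch that pulls the slot number of every GPUGRID task out of client_state.xml text. Assumptions: the file parses as regular XML, and the wrapping <client_state> and <active_task_set> elements in the sample are my guess at the surrounding structure; only the <active_task> content comes from the snippet above. The function name is made up.

```python
import xml.etree.ElementTree as ET


def gpugrid_slots(client_state_xml, project_url="https://www.gpugrid.net/"):
    """Return (result_name, slot) pairs for active tasks of the given project."""
    root = ET.fromstring(client_state_xml)
    slots = []
    # iter() finds <active_task> at any depth, so the exact nesting does not matter
    for task in root.iter("active_task"):
        url = (task.findtext("project_master_url") or "").strip()
        slot = task.findtext("slot")
        if url == project_url and slot is not None:
            name = (task.findtext("result_name") or "").strip()
            slots.append((name, int(slot)))
    return slots


# Sample built from the snippet above; the two outer elements are assumed
SAMPLE_STATE = """<client_state>
<active_task_set>
<active_task>
    <project_master_url>https://www.gpugrid.net/</project_master_url>
    <result_name>Tyk2_jmc_28_jmc_27_1_RE-QUICO_ATM_Sch_GAFF2-3-10-RND3439_0</result_name>
    <active_task_state>1</active_task_state>
    <app_version_num>109</app_version_num>
    <slot>1</slot>
</active_task>
</active_task_set>
</client_state>"""

for name, slot in gpugrid_slots(SAMPLE_STATE):
    print(f"slots/{slot}: {name}")
```

For a real file, read its content first and pass the text in; the slot number then maps to the slots\<number> directory.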



Again, thanks for the instructions. Easy to follow, easy to implement. You even took care of people who don't know Python! So gracious of you... Thanks a lot!


You're welcome, I'm happy it's useful to you - and maybe others.
goldfinch
Message 60846 - Posted: 5 Nov 2023, 3:10:30 UTC - in response to Message 60843.  

I enabled checkpoints logging and discovered that checkpointing occurred right before uploading the results. That is, during computation checkpointing doesn't seem to occur (at least, according to debug logs).

As for this,

Question: why did you copy and then rename the atm.py file instead of copying to a new name straight away?


See Note 2 in section 6.3. :-D
Long explanation: I had a couple of faults while debugging my procedure. BOINC then cleans out the slot directory faster than I could edit/validate the content of any files so I did it this way. I could quickly navigate into the Lib\site-packages\sync directory as the code was being unpacked and see if atm.py.orig popped up - or not. That was before I thought to include some errorhandling in run.bat.

That's not what i meant. My question was about why use this:
@echo Replace atm.py
copy ..\..\projects\www.gpugrid.net\atm_correct_progress.py Lib\site-packages\sync
rename Lib\site-packages\sync\atm.py atm.py.orig
rename Lib\site-packages\sync\atm_correct_progress.py atm.py

instead of this:
@echo Replace atm.py
rename Lib\site-packages\sync\atm.py atm.py.orig
copy ..\..\projects\www.gpugrid.net\atm_correct_progress.py Lib\site-packages\sync\atm.py

especially, seeing a similar approach in the config file modification:
<command_line>/c copy ..\..\projects\www.gpugrid.net\newrun.bat run.bat</command_line>
.
In the second approach there are only 2 commands instead of 3, and result will be the same. If BOINC clears the slot directory, i don't see how using copy and subsequent rename can help, or how it will be affected differently compared to renaming the original and copying a corrected Python file with the correct name. Maybe, i'm missing something - after all, you debugged it, not i (:
Thanks again. Next thing to solve is cooling the laptop. Any useful scripts for that? (joking:)
goldfinch
Message 60847 - Posted: 5 Nov 2023, 8:06:04 UTC - in response to Message 60843.  

Actually, your fix is more than simply improving quality of life. Because the progress is correctly displayed now, BOINC doesn't download next task immediately, but waits for some time. In my case, it downloaded the next task at ~95% of the current task, which is better because the task doesn't spend too much time in the queue. So, the fix also implicitly improves task management.
[BAT] Svennemans
Message 60850 - Posted: 5 Nov 2023, 12:23:59 UTC - in response to Message 60846.  

I enabled checkpoints logging and discovered that checkpointing occurred right before uploading the results. That is, during computation checkpointing doesn't seem to occur (at least, according to debug logs).


Interesting. I do see that the Python code creates a simulation state checkpoint after each of the 70 samples.

2023-11-05 06:37:16 - INFO     - sync_re                        - Finished: sample 288, replica 21 (duration: 13.26600000000326 s)
2023-11-05 06:37:16 - INFO     - sync_re                        - Started: exchange replicas
2023-11-05 06:37:16 - INFO     - sync_re                        - Replica 15: 18 --> 17
2023-11-05 06:37:16 - INFO     - sync_re                        - Replica 21: 17 --> 18
2023-11-05 06:37:16 - INFO     - sync_re                        - Finished: exchange replicas (duration: 0.031000000017229468 s)
2023-11-05 06:37:16 - INFO     - sync_re                        - Started: update replicas
2023-11-05 06:37:30 - INFO     - sync_re                        - Finished: update replicas (duration: 14.046999999962281 s)
2023-11-05 06:37:30 - INFO     - sync_re                        - Started: write replicas samples and trajectories
2023-11-05 06:37:30 - INFO     - sync_re                        - Finished: write replicas samples and trajectories (duration: 0.0 s)
2023-11-05 06:37:30 - INFO     - sync_re                        - Started: checkpointing
2023-11-05 06:38:45 - INFO     - sync_re                        - Finished: checkpointing (duration: 74.75 s)
2023-11-05 06:38:45 - INFO     - sync_re                        - Finished: sample 288 (duration: 372.85899999999674 s)
2023-11-05 06:38:45 - INFO     - sync_re                        - Started: sample 289


So the potential should be there to have more granular checkpoints.
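
As a side note, the log above also shows how much time the per-sample checkpoint costs. Here is a rough sketch that extracts the checkpoint and sample durations from run.log text and computes the share. The regular expressions are written against the few lines quoted above, so they are an assumption about the log format in general, and the function name is made up.

```python
import re

# Patterns matched against the log lines quoted above
CHECKPOINT_RE = re.compile(r"Finished: checkpointing \(duration: ([0-9.]+) s\)")
# "sample 288 (" only: the per-replica lines "sample 288, replica 21 (" do not match
SAMPLE_RE = re.compile(r"Finished: sample (\d+) \(duration: ([0-9.]+) s\)")


def checkpoint_share(log_text):
    """Fraction of total sample time spent in checkpointing, or None if no samples."""
    checkpoint_time = sum(float(d) for d in CHECKPOINT_RE.findall(log_text))
    sample_time = sum(float(d) for _, d in SAMPLE_RE.findall(log_text))
    return checkpoint_time / sample_time if sample_time else None


LOG_EXCERPT = """\
2023-11-05 06:37:16 - INFO - sync_re - Finished: sample 288, replica 21 (duration: 13.26600000000326 s)
2023-11-05 06:37:30 - INFO - sync_re - Started: checkpointing
2023-11-05 06:38:45 - INFO - sync_re - Finished: checkpointing (duration: 74.75 s)
2023-11-05 06:38:45 - INFO - sync_re - Finished: sample 288 (duration: 372.85899999999674 s)
"""

share = checkpoint_share(LOG_EXCERPT)
print(f"checkpointing share of sample time: {share:.1%}")
```

For this one sample that is 74.75 s out of roughly 373 s, so about a fifth of the time already goes into writing checkpoints that a restart currently can't use.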

In the second approach there are only 2 commands instead of 3, and result will be the same. If BOINC clears the slot directory, i don't see how using copy and subsequent rename can help, or how it will be affected differently compared to renaming the original and copying a corrected Python file with the correct name. Maybe, i'm missing something - after all, you debugged it, not i (:


The objective was not to prevent cleaning the slot directory, but to hopefully be able to see the atm.py.orig file pop into existence in the second before BOINC cleans the slot. Granted, I could have done that with one less command. I stand duly chastised for my reckless waste of processing cycles. ;-)

Thanks again. Next thing to solve is cooling the laptop. Any useful scripts for that? (joking:)


Yup: Try this - worked for me. :-)
Ian&Steve C.
Message 60854 - Posted: 7 Nov 2023, 16:50:56 UTC

Another idiosyncrasy that has been discussed less often, though I knew in the back of my mind that this was the case: since these tasks download some packages at runtime, you have to maintain internet connectivity for the tasks to run. I had a small issue with one host where it couldn't access the internet due to a network adapter issue, and tasks started to fail one by one (only in the setup phase, I think; tasks that have already downloaded what they need will run fine).

I think this kind of goes against the long-running BOINC methodology where downloaded tasks are fully self-contained and can be run offline. I am aware that some folks might have limited network access or time-of-use billing schemes that make it sensible to limit network activity to certain times. This may be another mechanism causing errors for some folks.

I don't often have network connectivity issues, but I do think it would be in the interest of the users for the devs to rework this so that tasks are fully self contained. I just don't think that losing network access should be a source of errors, even if it doesn't waste much computational time.

I'm aware that this can make things harder for the devs, and I 100% understand why they do it the way they do (downloading directly from github allows them to make changes basically real-time without changing anything on the BOINC side of things). I'm not really demanding a change here, I can deal with it either way, but it would be nice.
[BAT] Svennemans
Message 60857 - Posted: 7 Nov 2023, 23:02:06 UTC - in response to Message 60854.  

Yeah, I follow your logic. I can only assume they have a good reason for it.
On the plus side, it allows us to find and inspect the code when issues pop up.

And on that note, since I didn't get a response on my issue on Github regarding the progress indicator, I've taken a crash-self-course in git and created a pull request for the code fix.

Fingers crossed...
Ian&Steve C.
Message 60858 - Posted: 7 Nov 2023, 23:19:58 UTC - in response to Message 60857.  

nice, a PR should at least get someone's attention lol
[BAT] Svennemans
Message 60880 - Posted: 17 Nov 2023, 14:58:31 UTC - in response to Message 60858.  

nice, a PR should at least get someone's attention lol


And so it finally did. :-)

Pull request was accepted and merged into the original AToM-OpenMM/Master repo. All that's left now is for it to be merged into the proper repo that is retrieved at the execution of any WU, and progress % will be fixed.
Which will obviously only be useful if ATMbeta task generation starts up again...

Regarding one of your earlier questions:
second, where do you get that last_sample=1? the code says last_sample = self.replicas[0].get_cycle(), but havent worked through the code yet to see what that actually evaluates to. can you elaborate with specific code paths to where "self.replicas[0].get_cycle()" = 1?

I still haven't found the actual code position where it happens, but you can check that the starting cycle # is actually just read from the worker replica input file <taskname>.xml
<Parameters ATMAcore=".0625" ATMAlpha="0" ATMDirection="1" ATMLambda1=".5" ATMLambda2=".5" ATMU0="0" ATMUbcore="2092" ATMUmax="4184" ATMW0="0" BiasEnergy="0" MonteCarloPressure="1" MonteCarloTemperature="300" REAlchemicalIntermediate="0" RECycle="0" REMDSteps="0" REPertEnergy="0" REPotEnergy="0" REStateId="0" RETemperature="0"/>

This is copied into the r0-rxx replica dirs as checkpoint/restart files. The RECycle parameter will be 0, or 70, or whatever at the start, and is then incremented at each new cycle/checkpoint.
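
As a rough illustration of where that starting cycle comes from, the RECycle attribute can be read straight out of such a file. This is only a sketch: the read_recycle helper and the wrapping <State> element are my own assumptions for the example, not code taken from AToM-OpenMM.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for a replica file; the <State> wrapper is assumed.
SAMPLE = '<State><Parameters ATMDirection="1" RECycle="70" REStateId="0"/></State>'


def read_recycle(xml_text):
    """Return RECycle from the first <Parameters> element, or None if absent."""
    root = ET.fromstring(xml_text)
    # iter() also visits the root itself, so a bare <Parameters/> works too.
    for params in root.iter("Parameters"):
        value = params.get("RECycle")
        if value is not None:
            return int(float(value))  # tolerate values written as "70.0"
    return None


print(read_recycle(SAMPLE))
```

Running the same check on the files in the replica dirs before and after a restart should show whether the cycle counter survived.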

Checkpointing/restarting doesn't work because, between the BOINC wrapper and the actual working (Python) program, there is that run.bat/run.sh command shell process acting as a sort of in-between wrapper. It doesn't properly forward communication between the BOINC client/wrapper and the Python program, which leads to all sorts of mayhem and prevents the Python program from exiting gracefully and/or restarting with its built-in checkpoint/restart functionality.
Specifically, a restart re-runs run.bat/run.sh in its entirety, overwriting part but not all of the existing working files. That leaves the Python program with inconsistent input data at restart, which leads to a crash.

I'm taking a quick look to see if I can figure out some workaround, but the true fix would be running the Python program directly from the BOINC wrapper instead of using that .bat/.sh file in between. That would also imply, as you said before, having the AToM-OpenMM code downloaded as part of the project files instead of retrieved for every new WU.
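
One direction a workaround could take is making the setup step restart-safe, so that a second start leaves the existing working files alone. The sketch below is purely hypothetical: stage_inputs and the setup_done marker file are names I made up, and this is not what run.bat/run.sh currently does.

```python
import shutil
from pathlib import Path


def stage_inputs(src_dir, work_dir, marker_name="setup_done"):
    """Copy input files into work_dir only on the first start.

    Returns True if files were staged, False if an earlier run was detected
    and the existing (possibly checkpointed) files were left untouched.
    """
    src, work = Path(src_dir), Path(work_dir)
    marker = work / marker_name
    if marker.exists():
        return False  # restart: keep whatever the previous run wrote
    work.mkdir(parents=True, exist_ok=True)
    for item in src.iterdir():
        if item.is_file():
            shutil.copy2(item, work / item.name)
    marker.touch()  # record that setup completed once
    return True
```

The marker is only written after all copies succeed, so a setup that was interrupted halfway is simply redone on the next start.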
ID: 60880 · Rating: 0
Ian&Steve C.

Send message
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Level
Trp
Scientific publications
wat
Message 60881 - Posted: 17 Nov 2023, 16:15:53 UTC - in response to Message 60880.  

that's great that someone finally noticed the PR and acted on it. maybe you need to drop a comment or something about merging it into the GPUGRID repo? ATM task generation IS ongoing right now. there appears to be a single small batch (~250 tasks) running right now. new tasks are generated when the previous segment is received. so the number of tasks in progress stays around 250, but the RTS shows 0 most of the time since so many hosts are asking for work.

we're about halfway through this run it seems. most of the tasks I'm getting now are 4-10s or 5-10s or 6-10s. they will replicate until 9-10 just like before.
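
for anyone who wants to tally segments from their own task lists, here is a small sketch. it assumes the names end in <segment>-<total>-RND<number> with an optional _<n> suffix, which is just my reading of the names posted in this thread; parse_segment is a made-up helper.

```python
import re

# Assumed layout of the tail of a name: -<segment>-<total>-RND<number>[_<n>]
SEGMENT_RE = re.compile(r"-(\d+)-(\d+)-RND(\d+)(?:_(\d+))?$")


def parse_segment(name):
    """Return (segment, total) parsed from a task name, or None if no match."""
    match = SEGMENT_RE.search(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


name = "TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0"
print(parse_segment(name))
```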
ID: 60881 · Rating: 0
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 318
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60882 - Posted: 27 Nov 2023, 16:58:25 UTC

A new batch was launched this afternoon (27 November 2023).

There's a possible systemic deployment error:

File "/hdd/boinc-client/slots/1/bin/rbfe_explicit_sync.py", line 2, in <module>
from sync.atm import openmm_job_AmberRBFE
ImportError: cannot import name 'openmm_job_AmberRBFE' from 'sync.atm' (/hdd/boinc-client/slots/1/lib/python3.9/site-packages/sync/atm.py)

syk_m22_m32_3-QUICO_ATM_Sch_ANI-0-10-RND4131
syk_m07_m35_5-QUICO_ATM_Sch_ANI-0-10-RND4539
syk_m17_m25_4-QUICO_ATM_Sch_ANI-0-10-RND6268
syk_m43_m15_2-QUICO_ATM_Sch_ANI-0-10-RND5566
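
To check for this kind of mismatch on a host, one can test whether the installed module actually exposes the name being imported. This is a minimal sketch: module_exports is a hypothetical helper, and the demonstration uses the standard-library json module because sync.atm only exists inside the task's own Python environment.

```python
import importlib


def module_exports(module_name, attr):
    """Return True if module_name imports cleanly and defines attr."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


print(module_exports("json", "loads"))
print(module_exports("json", "openmm_job_AmberRBFE"))
```

Run with the Python interpreter from the slot directory, module_exports("sync.atm", "openmm_job_AmberRBFE") would be the equivalent check for the failing import.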
ID: 60882 · Rating: 0
Ian&Steve C.

Send message
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Level
Trp
Scientific publications
wat
Message 60883 - Posted: 27 Nov 2023, 17:51:23 UTC - in response to Message 60882.  

can confirm. I have like 80+ errors from these.

at least they error after like 30s and don't waste much time.
ID: 60883 · Rating: 0
Erich56

Send message
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60884 - Posted: 28 Nov 2023, 17:24:12 UTC - in response to Message 60882.  

A new batch was launched this afternoon (27 November 2023).

There's a possible systemic deployment error:

File "/hdd/boinc-client/slots/1/bin/rbfe_explicit_sync.py", line 2, in <module>
from sync.atm import openmm_job_AmberRBFE
...

Here, too, all tasks with name "syk..." are failing after about 1 minute :-(
ID: 60884 · Rating: 0

©2025 Universitat Pompeu Fabra