ATM

Message boards : News : ATM

kotenok2000
Message 60151 - Posted: 24 Mar 2023, 2:54:44 UTC - in response to Message 60150.  
Last modified: 24 Mar 2023, 2:55:02 UTC

Why not

<app>
    <name>PythonGPU</name>
    <max_concurrent>1</max_concurrent>
    <fraction_done_exact/>
    <gpu_versions>
        <gpu_usage>1</gpu_usage>
        <cpu_usage>4</cpu_usage>
    </gpu_versions>
</app>

?

Bedrich Hajek
Message 60152 - Posted: 24 Mar 2023, 9:48:54 UTC

kotenok2000
Message 60153 - Posted: 24 Mar 2023, 11:47:30 UTC - in response to Message 60152.  
Last modified: 24 Mar 2023, 12:05:55 UTC

Ian&Steve C.
Message 60154 - Posted: 24 Mar 2023, 12:16:41 UTC
Last modified: 24 Mar 2023, 12:26:33 UTC

Progress reporting is still not working.

Instead of halting at 75%, progress now halts at 0.19%. The weights help prevent the task from jumping to 75%, but something is still missing.

Python tasks jump to about 1% after the extraction phase because of the weights, and then creep up slowly as the task progresses: 2%, 3%, 4%, and so on until they hit 100% in a natural, linear way. The ATM tasks do not do this at all. They sit at 0.19% for hours and hours with no indication of when they will complete. Is it 4 hours? Is it 20 hours? There's no feedback to the user. When a task is done, it just jumps to 100% without warning.

That makes it very difficult to tell whether a task is stuck or working.

-Edit-

The "BACE" tasks do seem to be reporting progress now, but the earlier tasks from yesterday ("T_p38") do not.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Message 60155 - Posted: 24 Mar 2023, 13:37:41 UTC - in response to Message 60154.  

Progress reporting is still not working.

Instead of halting at 75%, progress now halts at 0.19%. The weights help prevent the task from jumping to 75%, but something is still missing.

Python tasks jump to about 1% after the extraction phase because of the weights, and then creep up slowly as the task progresses: 2%, 3%, 4%, and so on until they hit 100% in a natural, linear way. The ATM tasks do not do this at all. They sit at 0.19% for hours and hours with no indication of when they will complete. Is it 4 hours? Is it 20 hours? There's no feedback to the user. When a task is done, it just jumps to 100% without warning.

That makes it very difficult to tell whether a task is stuck or working.

-Edit-

The "BACE" tasks do seem to be reporting progress now, but the earlier tasks from yesterday ("T_p38") do not.


The T_p38 tasks were sent before the update, so it makes sense that they don't report progress yet. Is the progress reporting for the BACE runs good, or does it still get stuck?

Ian&Steve C.
Message 60156 - Posted: 24 Mar 2023, 13:50:20 UTC - in response to Message 60155.  

Yes, BACE looks good.

But something is wrong with CDK2_new. It jumped to 100% but is still running.

Emilio Gallicchio
Message 60157 - Posted: 24 Mar 2023, 13:59:50 UTC - in response to Message 60140.  

Hello Quico and everyone. Thank you for trying AToM-OpenMM on GPUGRID.

I am unsure if it is relevant to this issue, but AToM implements full checkpointing. Each replica's status is stored in a .xml file in the replica directory. We usually checkpoint every 10 mins, but this interval can be changed in the control file with the CHECKPOINT_TIME parameter (in seconds). Checkpointing is also triggered by SIGTERM or SIGINT signals sent to the main AToM process.

Launching the AToM job from the same folder reads the checkpoints and should restart the simulation as if it had kept running.
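
The checkpointing behaviour described above (a timed interval plus a checkpoint triggered by SIGTERM or SIGINT) can be sketched roughly as follows. This is a minimal illustration, not AToM's actual code: the `CheckpointingLoop` class, its method names, and the injectable `clock` are all hypothetical, and `write_checkpoint` is a placeholder for whatever saves each replica's state. Only the `CHECKPOINT_TIME` name and the two signals come from the post.

```python
import signal
import time

CHECKPOINT_TIME = 600  # seconds; the 10-minute default mentioned above


class CheckpointingLoop:
    """Hypothetical stand-in for a simulation loop (not AToM's code)."""

    def __init__(self, checkpoint_time=CHECKPOINT_TIME, clock=time.monotonic):
        self.checkpoint_time = checkpoint_time
        self.clock = clock
        self.last_checkpoint = clock()
        self.stop_requested = False
        self.checkpoints_written = 0

    def install_signal_handlers(self):
        # The handler only records the request; the loop checkpoints
        # later, at a safe point between samples.
        signal.signal(signal.SIGTERM, self._on_signal)
        signal.signal(signal.SIGINT, self._on_signal)

    def _on_signal(self, signum, frame):
        self.stop_requested = True

    def write_checkpoint(self):
        # Placeholder: a real implementation would save replica state here.
        self.checkpoints_written += 1
        self.last_checkpoint = self.clock()

    def after_sample(self):
        """Call once per finished sample; returns False when the loop should exit."""
        due = self.clock() - self.last_checkpoint >= self.checkpoint_time
        if due or self.stop_requested:
            self.write_checkpoint()
        return not self.stop_requested
```

Deferring the write to `after_sample` keeps the signal handler trivial, so a checkpoint is never started in the middle of a sample.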

bibi
Message 60158 - Posted: 24 Mar 2023, 14:08:25 UTC
Last modified: 24 Mar 2023, 14:13:21 UTC

The Python task must tell the BOINC client how many ticks there are to calculate (MAX_SAMPLES = 341 from *_asyncre.cntl, times 22 replicas) and signal the end of each tick.

In addition, the elapsed time starts counting again from 0 after each restart. I don't know what the current situation is.

If the progress indicator is now OK, forget my reply.
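
As a rough illustration of the tick arithmetic in this post: with 341 samples and 22 replicas there would be 341 × 22 = 7502 ticks in total, and the fraction done is simply completed ticks over that total. The function names below are made up for the example, and treating one tick as one sample of one replica is an assumption based on the wording above.

```python
MAX_SAMPLES = 341   # value quoted from *_asyncre.cntl
NUM_REPLICAS = 22   # replica count quoted in the post


def total_ticks(max_samples=MAX_SAMPLES, replicas=NUM_REPLICAS):
    """Total number of ticks, assuming one tick per sample per replica."""
    return max_samples * replicas


def fraction_done(ticks_completed, total):
    """Completed fraction, clamped to the range 0.0 to 1.0."""
    if total <= 0:
        raise ValueError("total must be positive")
    return min(1.0, max(0.0, ticks_completed / total))


if __name__ == "__main__":
    total = total_ticks()
    print(total, fraction_done(3751, total))
```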

Richard Haselgrove
Message 60159 - Posted: 24 Mar 2023, 14:09:29 UTC
Last modified: 24 Mar 2023, 14:18:59 UTC

The ATM tasks also record that a task has checkpointed in the job.log file in the slot directory (or did so, a few debug iterations ago - see message 60046).

That file can be viewed while a task is running, but not after it's finished. It's written (I think) by the science app, but messages are passed to BOINC by the wrapper: that's probably where the problem is.

Edit: OK, I've downloaded a BACE task (resend _4) and a T_PTP1B_new task (resend _3). I'll watch them when the current pair of Abouh tasks have finished.

Emilio Gallicchio
Message 60160 - Posted: 24 Mar 2023, 15:45:51 UTC - in response to Message 60158.  

The GPUGRID version of AToM:

https://github.com/Gallicchio-Lab/AToM-OpenMM/blob/master/sync/atm.py

has this:

    # Report progress on GPUGRID
    progress = float(isample)/float(num_samples - last_sample)
    open("progress", "w").write(str(progress))

which checks out as far as I can tell. last_sample is retrieved from checkpoints upon restart, so the progress % should be tracked correctly across restarts.
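
One way to sanity-check the formula is to plug in numbers under both possible meanings of `isample`. The thread does not say whether `isample` counts samples since the last restart or samples since the start of the run, so both cases below are assumptions, and the numbers (341 samples, a restart at sample 200) are chosen only for illustration. `progress_of_whole_run` is a hypothetical alternative, not something taken from atm.py.

```python
def progress_as_posted(isample, num_samples, last_sample):
    """The expression quoted above, unchanged."""
    return float(isample) / float(num_samples - last_sample)


def progress_of_whole_run(isample, num_samples):
    """Hypothetical alternative: fraction of the whole run, capped at 1.0.

    Assumes isample is the absolute sample index.
    """
    return min(1.0, float(isample) / float(num_samples))


if __name__ == "__main__":
    num_samples, last_sample = 341, 200
    # Case A: isample counts from 0 after the restart (0..141).
    print(progress_as_posted(70, num_samples, last_sample))   # about half of the remaining work
    # Case B: isample is absolute and continues from 200.
    print(progress_as_posted(250, num_samples, last_sample))  # exceeds 1.0
    print(progress_of_whole_run(250, num_samples))
```

Under case A the value restarts from zero after each restart. Under case B it can exceed 1.0, which a client would presumably display as 100% while the task is still running.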

Richard Haselgrove
Message 60161 - Posted: 24 Mar 2023, 15:46:40 UTC

OK, the BACE task is running, and after 7 minutes or so, I see:

2023-03-24 15:40:33 - INFO     - sync_re                        - Started: checkpointing
2023-03-24 15:40:49 - INFO     - sync_re                        - Finished: checkpointing (duration: 15.699278543004766 s)
2023-03-24 15:40:49 - INFO     - sync_re                        - Finished: sample 1 (duration: 303.5407383099664 s)

in the run.log file. So checkpointing is happening, but just not being reported through to BOINC.

Progress is 3.582% after eleven minutes.
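
For anyone who wants to follow this from the log rather than from BOINC, a small script can pull the sample and checkpoint durations out of run.log. This is a sketch that assumes the log lines always look like the three quoted above; the regular expressions and the `summarize` helper are written for this example only.

```python
import re

# Patterns matched against the message part of each log line.
SAMPLE_RE = re.compile(r"Finished: sample (\d+) \(duration: ([\d.]+) s\)")
CHECKPOINT_RE = re.compile(r"Finished: checkpointing \(duration: ([\d.]+) s\)")


def summarize(lines):
    """Return ({sample_number: seconds}, [checkpoint_seconds, ...])."""
    samples = {}
    checkpoints = []
    for line in lines:
        match = SAMPLE_RE.search(line)
        if match:
            samples[int(match.group(1))] = float(match.group(2))
            continue
        match = CHECKPOINT_RE.search(line)
        if match:
            checkpoints.append(float(match.group(1)))
    return samples, checkpoints


LOG = """\
2023-03-24 15:40:33 - INFO     - sync_re                        - Started: checkpointing
2023-03-24 15:40:49 - INFO     - sync_re                        - Finished: checkpointing (duration: 15.699278543004766 s)
2023-03-24 15:40:49 - INFO     - sync_re                        - Finished: sample 1 (duration: 303.5407383099664 s)
"""

if __name__ == "__main__":
    samples, checkpoints = summarize(LOG.splitlines())
    print(samples, checkpoints)
```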

Emilio Gallicchio
Message 60162 - Posted: 24 Mar 2023, 16:04:08 UTC - in response to Message 60157.  

Actually, it is unclear if AToM's GPUGRID version checkpoints after catching termination signals. I'll ask Raimondas. Termination without checkpointing is usually okay, but progress since the checkpoint would be lost, and the number of samples recorded in the checkpoint file would not reflect the actual number of samples recorded.

Does anyone know if BOINC sends specific signals to terminate an app? Would the app pass the signal on to the main AToM Python process?

Richard Haselgrove
Message 60163 - Posted: 24 Mar 2023, 16:20:44 UTC - in response to Message 60162.  

The app seems to be both checkpointing and updating progress at the end of each sample. That will make re-alignment after a pause easier, but there's always some over-run, and some data is lost on restart. It's up to the application itself to record the data point it has reached, which will then be used for the restart, as an integral part of the checkpointing process.

I can't answer immediately on the termination question, but it's all open-source and I can look through it. In this case, it's more complicated, because BOINC will talk to the wrapper, and the wrapper will talk to the science app.

But the basic idea is that BOINC will send a request to terminate over the API, and wait for the application to close itself down as it sees fit. Actual signals will only be used to force termination in the case of an unconditional quit, such as an operating system closedown.

Aurum
Message 60164 - Posted: 24 Mar 2023, 16:20:50 UTC
Last modified: 24 Mar 2023, 16:20:58 UTC

Seriously? Only 14 tasks a day?

GPUGRID	3/24/2023 9:17:44 AM	This computer has finished a daily quota of 14 tasks

Richard Haselgrove
Message 60165 - Posted: 24 Mar 2023, 16:42:27 UTC - in response to Message 60164.  

Seriously? Only 14 tasks a day?

The quota adjusts dynamically - it goes up if you report successful tasks, and goes down if you report errors.

Richard Haselgrove
Message 60166 - Posted: 24 Mar 2023, 16:53:12 UTC

The T_PTP1B_new task, on the other hand, is not reporting progress, even though it's logging checkpoints in run.log.

A file is maintained in the slot folder, called 'boinc_task_state.xml' (it's probably written by the wrapper, though I'm not certain of that).

The current contents are:

<active_task>
    <project_master_url>https://www.gpugrid.net/</project_master_url>
    <result_name>T_PTP1B_new_23484_23482_T3_2A_1-QUICO_TEST_ATM-0-1-RND3714_3</result_name>
    <checkpoint_cpu_time>10.942300</checkpoint_cpu_time>
    <checkpoint_elapsed_time>30.176729</checkpoint_elapsed_time>
    <fraction_done>0.001996</fraction_done>
    <peak_working_set_size>8318976</peak_working_set_size>
    <peak_swap_size>16592896</peak_swap_size>
    <peak_disk_usage>1318196036</peak_disk_usage>
</active_task>

The <fraction_done> value is reported as the 'progress %' figure - this one is shown as 0.199% by BOINC Manager (which truncates) and 0.200% by other tools (which round).

This task has been running for 43 minutes, and boinc_task_state.xml hasn't been re-written since the first minute.
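
The truncate-versus-round difference is easy to reproduce. The sketch below reads `<fraction_done>` from a trimmed copy of the XML above and formats it both ways. It only mimics the arithmetic, and is not taken from BOINC Manager or any other tool; the helper names are invented for the example.

```python
import math
import xml.etree.ElementTree as ET

# Trimmed copy of the boinc_task_state.xml contents quoted above.
STATE_XML = """<active_task>
    <project_master_url>https://www.gpugrid.net/</project_master_url>
    <fraction_done>0.001996</fraction_done>
</active_task>"""


def read_fraction_done(xml_text):
    """Extract <fraction_done> as a float."""
    return float(ET.fromstring(xml_text).findtext("fraction_done"))


def percent_truncated(fraction, places=3):
    """Percentage with extra digits cut off rather than rounded."""
    scale = 10 ** places
    return math.floor(fraction * 100 * scale) / scale


def percent_rounded(fraction, places=3):
    """Percentage rounded to the given number of decimal places."""
    return round(fraction * 100, places)


if __name__ == "__main__":
    fd = read_fraction_done(STATE_XML)
    print(f"{percent_truncated(fd):.3f}%  {percent_rounded(fd):.3f}%")
```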

KAMasud
Message 60167 - Posted: 24 Mar 2023, 20:30:16 UTC


Task 27438680 completed and validated, while the following task failed after a restart:
task 27438865

Richard Haselgrove
Message 60169 - Posted: 24 Mar 2023, 20:49:28 UTC

My BACE task 33378091 finished successfully after 5 hours, under Linux Mint 21.1 with a GTX 1660 Super.

Four previous attempts failed, two of them under Windows with a 0xc0000135 error in Python.exe - that's a missing DLL.

KAMasud
Message 60170 - Posted: 24 Mar 2023, 21:46:07 UTC

Task 27438853
Completed and validated. Short one though.

©2025 Universitat Pompeu Fabra