ATM: Free Energy Calculations new application

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 59751 - Posted: 18 Jan 2023, 19:50:34 UTC
Last modified: 18 Jan 2023, 20:30:40 UTC

Just starting the thread for discussion of this new application. ATM = AToM.

after a little snafu with the first batch (incorrect config files), the latest batch seems to run on my system. no idea for runtime yet or if it will finish successfully.

This is another Python-based application. the package ships with the python environment similar to how the PythonGPU Reinforcement Learning (RL) app does.

Test Bench:
Xeon E5-2697Av4 (16c/32t)
64GB DDR4-2400 RDIMM (ECC)
RTX 3060 12GB
Ubuntu 22.04.1


So far observed behavior:
-uses ~97% of the GPU core, ~45% GPU memory bus, ~0-1% PCIe bus, close to full power use.
-about 400-500MB VRAM used (low, like acemd3)
-does not like to be paused and resumed, or BOINC stopped and restarted. it causes the task to fail

unknown total runtime expectation since the one task I had failed when I restarted BOINC lol.

Ian&Steve C.
Message 59752 - Posted: 18 Jan 2023, 20:36:00 UTC - in response to Message 59751.  

about the restart failure. looks like it fails trying to create a directory that already exists.

mkdir: cannot create directory 'atm_tmp': File exists


needs some work to allow for that.
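
One way a setup step can tolerate a restart is to create the scratch directory idempotently. This is only a sketch, not the project's actual script: the name atm_tmp is taken from the error message above, and ensure_workdir is a hypothetical helper. In a shell script, `mkdir -p atm_tmp` has the same effect.

```python
import os
import tempfile


def ensure_workdir(path):
    """Create the scratch directory if it is missing; do nothing if it exists."""
    # exist_ok=True suppresses the "File exists" error on a second run.
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    # Demo in a throwaway location so nothing real is touched.
    workdir = os.path.join(tempfile.mkdtemp(), "atm_tmp")
    ensure_workdir(workdir)   # first start
    ensure_workdir(workdir)   # simulated restart: no error this time
    print(os.path.isdir(workdir))
```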

Ian&Steve C.
Message 59754 - Posted: 18 Jan 2023, 20:55:38 UTC

another quality of life improvement should be adding a <weight> line to the main task in the job.xml file. right now with 2 tasks in the file, and no weights defined, I'm guessing it splits it 50/50 and it thinks the task is 50% done once the extraction phase is complete.
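
To illustrate the guess above: a rough sketch of weight-proportional progress, assuming each task in the job file contributes in proportion to its weight and that a missing weight counts as 1. The function name and the 1/99 split are made up for the example; this is not the wrapper's actual code.

```python
def fraction_done(weights, completed, current_fraction=0.0):
    """Overall progress when `completed` tasks have finished and the next
    task is `current_fraction` (0..1) of the way through."""
    total = sum(weights)
    done = sum(weights[:completed])
    if completed < len(weights):
        done += weights[completed] * current_fraction
    return done / total


# Two tasks, no weights defined (assumed default of 1 each):
print(fraction_done([1, 1], completed=1))    # 0.5
# Hypothetical weights so the extraction step only counts for 1%:
print(fraction_done([1, 99], completed=1))   # 0.01
```

With equal weights the bar reads 50% as soon as the extraction step ends, which matches what I'm seeing. A small weight on the extraction step would keep the bar near zero until the real work starts.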

Ian&Steve C.
Message 59755 - Posted: 18 Jan 2023, 21:33:51 UTC
Last modified: 18 Jan 2023, 21:34:03 UTC

task ran to completion in about an hour, but hit an error on upload and threw it all away because the output file was too big.

upload failure: <file_xfer_error>
  <file_name>T11_4-RAIMIS_TEST_ATM-0-1-RND7054_2_0</file_name>
  <error_code>-131 (file size too big)</error_code>
</file_xfer_error>


what a waste.

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Message 59771 - Posted: 19 Jan 2023, 18:06:13 UTC

Have had over a dozen quick-failing ATM tasks.

The wrapper does not have a correctly named tar file or something.

02:56:29 (1242346): wrapper: running /bin/tar (xf input.tar.bz2)
/bin/tar: This does not look like a tar archive
bzip2: (stdin) is not a bzip2 file.
/bin/tar: Child returned status 2
/bin/tar: Error is not recoverable: exiting now
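
A quick way to check whether a downloaded input file really is bzip2 data is to look at its first bytes. This is a diagnostic sketch only, with a hypothetical helper name; it says nothing about how the file ended up wrong on the server side.

```python
import bz2
import os
import tempfile

BZIP2_MAGIC = b"BZh"  # every bzip2 stream starts with these three bytes


def looks_like_bzip2(path):
    """Return True if the file begins with the bzip2 magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(3) == BZIP2_MAGIC


if __name__ == "__main__":
    tmp = tempfile.mkdtemp()
    good = os.path.join(tmp, "good.tar.bz2")
    bad = os.path.join(tmp, "bad.tar.bz2")
    with open(good, "wb") as fh:
        fh.write(bz2.compress(b"payload"))
    with open(bad, "wb") as fh:
        fh.write(b"<html>not an archive</html>")
    print(looks_like_bzip2(good), looks_like_bzip2(bad))
```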

Ian&Steve C.
Message 59792 - Posted: 24 Jan 2023, 16:01:40 UTC - in response to Message 59755.  

looks like the small batch of tasks that went out today is better set up. ran for about an hour and completed successfully, without the file size issue at the end.

great :)

still would like a little more background info on these tasks, what they are doing, and the goal of the research.

FritzB
Joined: 7 Apr 15
Posts: 17
Credit: 2,978,057,945
RAC: 73
Message 59899 - Posted: 10 Feb 2023, 22:25:09 UTC
Last modified: 10 Feb 2023, 22:25:52 UTC

This one https://www.gpugrid.net/workunit.php?wuid=27399736 has been running for about 11 hours and has been stuck at 66.666% for at least 4 hours now. There is almost no load on the GPU, just a few percent (3-5%) once in a while, but there is constantly some load on the memory controller (10-30%). Hope it will finish some day :)

Ian&Steve C.
Message 59936 - Posted: 16 Feb 2023, 16:24:10 UTC

still no official communication from the project about these tasks.

the recent batches have been very hit or miss and behave very differently from what I described in my initial post.

"TL2" tasks ran for hours and hours with little to no GPU or CPU use. I aborted them and moved on.

"TL3" tasks yesterday also had little to no GPU or CPU use, but did complete in about 30 mins.

"TL4" tasks today seem like a repeat of TL2. no GPU use, runs for hours with no progress.

also weights need to be defined in the job.xml file so the tasks don't jump to 75% after a few seconds and then sit there for hours doing nothing.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Message 59937 - Posted: 16 Feb 2023, 17:01:14 UTC

Just been sent a TL4 from WU 27405970. I see you've aborted two previous tasks from the same WU, Ian, on two different machines. Did you get any CPU usage figures from previous runs? I think I'll start it up with the GTX 1660 plus one core, but I'll probably abort it myself if it doesn't show much response.

Ian&Steve C.
Message 59938 - Posted: 16 Feb 2023, 17:04:18 UTC - in response to Message 59937.  

they spin up multiple processes like the Python tasks do. but I didn't catch them at the very beginning to see if they spike in use or anything like that.

once they get going, they basically sit idle as far as the GPU and CPU go. little to no use at all. i just killed them rather than letting them sit there for hours occupying my GPU.

Richard Haselgrove
Message 59939 - Posted: 16 Feb 2023, 17:30:46 UTC

OK, I've set 3 CPUs for continuity from the current Python task, and I've put weights of 1-1-1-97 in the job file so I can see what's happening.

My normal remote monitoring console shows the current average CPU usage, and I've put nvidia-smi on a five second loop. If either of those drops to zero, I'll abort it.

Chocks away!
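
For anyone wanting to script the same kind of watch, here is a rough sketch of polling nvidia-smi every five seconds and flagging a GPU that stays idle. The helper names, the threshold and the number of idle polls are assumptions for illustration; only the nvidia-smi query flags are standard.

```python
import subprocess
import time

# Standard nvidia-smi query: one utilization figure (in percent) per GPU.
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"]


def parse_utilization(output):
    """Turn nvidia-smi output such as '97\n' into a list of integers."""
    return [int(line) for line in output.splitlines() if line.strip()]


def is_idle(samples, threshold=0):
    """True if at least one GPU was reported and all are at or below threshold."""
    return bool(samples) and all(u <= threshold for u in samples)


def watch(interval=5.0, idle_polls=12):
    """Poll until the GPU has been idle for `idle_polls` consecutive samples."""
    idle = 0
    while idle < idle_polls:
        out = subprocess.run(QUERY, capture_output=True, text=True,
                             check=True).stdout
        idle = idle + 1 if is_idle(parse_utilization(out)) else 0
        time.sleep(interval)
    print("GPU has been idle for a while; consider aborting the task.")
```

Calling watch() needs an NVIDIA driver installed. The two small helper functions can be tried without one.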

Richard Haselgrove
Message 59941 - Posted: 16 Feb 2023, 18:15:43 UTC

I see what you mean. Nearly half an hour in, CPU usage is showing around 25% of a single core, and GPU usage spiked once, to 41%, after about a quarter of an hour. It's one way of saving electricity, but I'd rather be doing something useful. Aborting.

Aurum
Joined: 12 Jul 17
Posts: 404
Credit: 17,408,899,587
RAC: 0
Message 59961 - Posted: 22 Feb 2023, 17:28:34 UTC

1.13 ATM running fine for me.
Keep aborting them and I'll run them for you.

zombie67 [MM]
Joined: 16 Jul 07
Posts: 209
Credit: 5,496,860,456
RAC: 12,111
Message 59964 - Posted: 23 Feb 2023, 3:08:08 UTC

FWIW, the first task I received completed successfully.

http://www.gpugrid.net/workunit.php?wuid=27410175
Reno, NV
Team: SETI.USA

FritzB
Message 59967 - Posted: 23 Feb 2023, 8:18:47 UTC - in response to Message 59964.  
Last modified: 23 Feb 2023, 8:19:26 UTC

I've also finished one:
https://www.gpugrid.net/workunit.php?wuid=27410166

We're both using Linux Mint. It seems to crash on Win 10 machines (computer #600532 is mine, too).

zombie67 [MM]
Message 59968 - Posted: 23 Feb 2023, 15:33:06 UTC

Overnight, I had 4 of these tasks cancelled by the server.

KAMasud
Joined: 27 Jul 11
Posts: 138
Credit: 539,953,398
RAC: 0
Message 59969 - Posted: 23 Feb 2023, 16:50:20 UTC - in response to Message 59961.  

1.13 ATM running fine for me.
Keep aborting them and I'll run them for you.

_______________

Same here. I quite enjoy completing these WUs. There should be a way to analyse why these WUs fail on certain machines but not on others. We are mostly running the same hardware and OS. It would be fun to see the results.

Ian&Steve C.
Message 59970 - Posted: 23 Feb 2023, 17:27:24 UTC
Last modified: 23 Feb 2023, 17:29:36 UTC

keep in mind these are beta tasks, and the batch being sent NOW is not necessarily the same as the batch sent last week and won't be the same as whatever is sent sometime in the future, until they get all the bugs worked out. the tasks last week basically ran with no perceived use of the GPU or CPU, so what were they doing? who knows. no official word from the project about these tasks at all. I wasn't willing to let the GPU/CPU be occupied for hours on end with the task spinning its wheels when they could be doing something more useful.

maybe this current batch has been tweaked from last week and that's why they are working OK. for those that have completed this latest batch, did they have any meaningful use of the GPU or CPU? it also seems this batch was released with a new Windows application (they were Linux only before) for testing.

KAMasud
Message 59971 - Posted: 24 Feb 2023, 5:28:26 UTC - in response to Message 59970.  

keep in mind these are beta tasks, and the batch being sent NOW is not necessarily the same as the batch sent last week and won't be the same as whatever is sent sometime in the future, until they get all the bugs worked out. the tasks last week basically ran with no perceived use of the GPU or CPU, so what were they doing? who knows. no official word from the project about these tasks at all. I wasn't willing to let the GPU/CPU be occupied for hours on end with the task spinning its wheels when they could be doing something more useful.

maybe this current batch has been tweaked from last week and that's why they are working OK. for those that have completed this latest batch, did they have any meaningful use of the GPU or CPU? it also seems this batch was released with a new Windows application (they were Linux only before) for testing.

_______________________

Well, most of us know that Abouh reads every word written on these threads and, without much song and dance, makes changes. He is the only admin on all the projects who diligently attends. Maybe, quite possibly. No arguments with your tweaking statement.

Erich56
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 59972 - Posted: 24 Feb 2023, 8:36:54 UTC - in response to Message 59971.  

Well, most of us know that Abouh reads every word written on these threads and, without much song and dance, makes changes. He is the only admin on all the projects who diligently attends. Maybe, quite possibly. No arguments with your tweaking statement.

well, Abouh is the only one from the project team who actively communicates with us volunteers - which is great.
All the others obviously don't care, and it has been like this for years, unfortunately.
For example: 9 days ago I asked in the ACEMD 4 thread when new ACEMD 4 tasks will be around, or whether this subproject is dead.
No reply so far, even though a reply could be very simple, no longer than a single line :-(

You know what I want to say ... it's kind of disappointing at times :-(

©2025 Universitat Pompeu Fabra