ATM: Free Energy Calculations new application

Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1117 · Credit: 40,876,970,595 · RAC: 0
Message 59751 · Posted: 18 Jan 2023, 19:50:34 UTC · Last modified: 18 Jan 2023, 20:30:40 UTC

Just starting the thread for discussion of this new application. ATM = AToM.

After a little snafu with the first batch (incorrect config files), the latest batch seems to run on my system. No idea of the runtime yet, or whether it will finish successfully.

This is another Python-based application. The package ships with its own Python environment, similar to how the PythonGPU Reinforcement Learning (RL) app does.

Test bench:
Xeon E5-2697Av4 (16c/32t)
64GB DDR4-2400 RDIMM (ECC)
RTX 3060 12GB
Ubuntu 22.04.1

Behavior observed so far:
-uses ~97% of the GPU core, ~45% of the GPU memory bus, ~0-1% of the PCIe bus, and close to full power.
-about 400-500MB of VRAM used (low, like acemd3).
-does not like to be paused and resumed, or to have BOINC stopped and restarted. Either one causes the task to fail.

Total runtime expectation is unknown, since the one task I had failed when I restarted BOINC lol.

Ian&Steve C. · Message 59752 · Posted: 18 Jan 2023, 20:36:00 UTC · in response to Message 59751

About the restart failure: it looks like it fails trying to create a directory that already exists.

mkdir: cannot create directory 'atm_tmp': File exists

The app needs some work to allow for that.
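
The error above is what a plain mkdir does when the target directory is already there. A minimal sketch of a restart-safe alternative, assuming the setup step could be done from Python. The function name `ensure_scratch_dir` is hypothetical and this is not the project's actual code; only the `atm_tmp` name comes from the log.

```python
import os
import tempfile


def ensure_scratch_dir(path: str) -> str:
    """Create `path` if it is missing; do nothing if it already exists."""
    # exist_ok=True makes the call safe to repeat after a restart,
    # unlike a bare mkdir, which fails with "File exists".
    os.makedirs(path, exist_ok=True)
    return path


# Demo in a throwaway location so nothing real is touched.
base = tempfile.mkdtemp()
scratch = os.path.join(base, "atm_tmp")
ensure_scratch_dir(scratch)  # first start: directory is created
ensure_scratch_dir(scratch)  # simulated restart: no error
print(os.path.isdir(scratch))
```

In a shell script, `mkdir -p atm_tmp` gives the same "no error if it exists" behavior.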

Ian&Steve C. · Message 59754 · Posted: 18 Jan 2023, 20:55:38 UTC

Another quality-of-life improvement would be adding a <weight> line to the main task in the job.xml file. Right now, with 2 tasks in the file and no weights defined, I'm guessing it splits them 50/50, so it thinks the task is 50% done once the extraction phase is complete.
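
To make the arithmetic concrete, here is a sketch of how a weighted progress figure could be computed. It assumes overall progress is simply completed weight divided by total weight, which is a reading of the behavior described in the post and not taken from the wrapper source. The function name and the example weight of 99 are illustrative only.

```python
def fraction_done(weights, tasks_completed, current_task_fraction=0.0):
    """Overall progress as (completed weight) / (total weight)."""
    total = sum(weights)
    done = sum(weights[:tasks_completed])
    # Add the partial progress of the task that is currently running.
    if tasks_completed < len(weights):
        done += weights[tasks_completed] * current_task_fraction
    return done / total


# Two tasks, equal weights: extraction finished, main task not started.
print(fraction_done([1, 1], 1))   # 0.5
# Hypothetical weights that make the main task dominate the estimate.
print(fraction_done([1, 99], 1))  # 0.01
```

With a heavier weight on the main task, the progress bar would stay near zero after extraction instead of jumping to 50%.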

Ian&Steve C. · Message 59755 · Posted: 18 Jan 2023, 21:33:51 UTC · Last modified: 18 Jan 2023, 21:34:03 UTC

The task ran to completion in about an hour, but then hit an error and threw it all away because the file size is too big. Upload failure:

<file_xfer_error>
<file_name>T11_4-RAIMIS_TEST_ATM-0-1-RND7054_2_0</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>

What a waste.

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 0
Message 59771 · Posted: 19 Jan 2023, 18:06:13 UTC

I have had over a dozen quick-failing ATM tasks. The wrapper does not have a correctly named tar file or something.

02:56:29 (1242346): wrapper: running /bin/tar (xf input.tar.bz2)
/bin/tar: This does not look like a tar archive
bzip2: (stdin) is not a bzip2 file.
/bin/tar: Child returned status 2
/bin/tar: Error is not recoverable: exiting now
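
The tar output above says the file named input.tar.bz2 is not bzip2 data. One quick way to check a file like that is to look at its first bytes: a bzip2 stream starts with the ASCII signature "BZh". A small sketch follows; the helper name and the demo files are made up for illustration and say nothing about what the bad download actually contained.

```python
import bz2
import os
import tempfile


def looks_like_bzip2(path: str) -> bool:
    """Return True if the file starts with the bzip2 signature 'BZh'."""
    with open(path, "rb") as f:
        return f.read(3) == b"BZh"


# Demo with two throwaway files.
tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "good.tar.bz2")
bad = os.path.join(tmp, "bad.tar.bz2")
with open(good, "wb") as f:
    f.write(bz2.compress(b"payload"))  # real bzip2 data
with open(bad, "wb") as f:
    f.write(b"not an archive")         # stand-in for a bad download

print(looks_like_bzip2(good))  # True
print(looks_like_bzip2(bad))   # False
```

This only checks the signature, not that the archive inside is complete or valid.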

Ian&Steve C. · Message 59792 · Posted: 24 Jan 2023, 16:01:40 UTC · in response to Message 59755

Looks like the small batch of tasks that went out today is better set up. Mine ran for about an hour and completed successfully, without the file size issue at the end. Great :)

I'd still like a little more background info on these tasks: what they are doing, and the goal of the research.

FritzB · Joined: 7 Apr 15 · Posts: 17 · Credit: 3,095,057,945 · RAC: 1,964
Message 59899 · Posted: 10 Feb 2023, 22:25:09 UTC · Last modified: 10 Feb 2023, 22:25:52 UTC

This one https://www.gpugrid.net/workunit.php?wuid=27399736 has been running for about 11 hours and has been stuck at 66.666% for at least 4 hours now. There is almost no load on the GPU, just a few percent (3-5) once in a while, but there is constantly some load on the memory controller (10-30). Hope it will finish some day :)

Ian&Steve C. · Message 59936 · Posted: 16 Feb 2023, 16:24:10 UTC

Still no official communication from the project about these tasks. The recent batches have been very hit or miss and behave very differently from what I described in my initial post.

"TL2" tasks ran for hours and hours with little to no GPU or CPU use. I aborted them and moved on.
"TL3" tasks yesterday also had little to no GPU or CPU use, but did complete in about 30 mins.
"TL4" tasks today seem like a repeat of TL2: no GPU use, and they run for hours with no progress.

Also, weights need to be defined in the jobs.xml file so the tasks don't jump to 75% after a few seconds and then sit there for hours doing nothing.

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 59937 · Posted: 16 Feb 2023, 17:01:14 UTC

Just been sent a TL4 from WU 27405970. I see you've aborted two previous tasks from the same WU, Ian, on two different machines. Did you get any CPU usage figures from previous runs? I think I'll start it up with the GTX 1660 plus one core, but I'll probably abort it myself if it doesn't show much response.

Ian&Steve C. · Message 59938 · Posted: 16 Feb 2023, 17:04:18 UTC · in response to Message 59937

They spin up multiple processes like the Python tasks do, but I didn't catch them at the very beginning to see if they spike in use or anything like that. Once they get going, they basically sit idle as far as the GPU and CPU go: little to no use at all. I just killed them rather than letting them sit there for hours occupying my GPU.

Richard Haselgrove · Message 59939 · Posted: 16 Feb 2023, 17:30:46 UTC

OK, I've set 3 CPUs for continuity from the current Python task, and I've put weights of 1-1-1-97 in the job file so I can see what's happening. My normal remote monitoring console shows the current average CPU usage, and I've put nvidia-smi on a five second loop. If either of those drops to zero, I'll abort it. Chocks away!
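
A five-second nvidia-smi loop like the one mentioned above could be sketched roughly as follows. This assumes `nvidia-smi` is on the PATH; the function names and the default sample count are illustrative and not what was actually run.

```python
import subprocess
import time


def parse_utilization(output: str) -> list[int]:
    """Parse one integer GPU-utilization percentage per output line."""
    values = []
    for line in output.splitlines():
        text = line.strip()
        # Skip blank lines and non-numeric readings such as "[N/A]".
        if text.isdigit():
            values.append(int(text))
    return values


def read_gpu_utilization() -> list[int]:
    """Ask nvidia-smi for the current utilization of every GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_utilization(result.stdout)


def watch(interval_s: float = 5.0, samples: int = 12) -> None:
    """Print a timestamped utilization reading every `interval_s` seconds."""
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), read_gpu_utilization())
        time.sleep(interval_s)


# On a machine with an NVIDIA driver installed, start the loop with:
# watch()
```

For a quick look without any script, `nvidia-smi -l 5` repeats the standard report every five seconds.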

Richard Haselgrove · Message 59941 · Posted: 16 Feb 2023, 18:15:43 UTC

I see what you mean. Nearly half an hour in, CPU usage is showing around 25% of a single core, and GPU usage spiked once, to 41%, after about a quarter of an hour. It's one way of saving electricity, but I'd rather be doing something useful. Aborting.

Aurum · Joined: 12 Jul 17 · Posts: 404 · Credit: 17,412,649,587 · RAC: 95
Message 59961 · Posted: 22 Feb 2023, 17:28:34 UTC

1.13 ATM running fine for me. Keep aborting them and I'll run them for you.

zombie67 [MM] · Joined: 16 Jul 07 · Posts: 209 · Credit: 6,054,860,456 · RAC: 15,024
Message 59964 · Posted: 23 Feb 2023, 3:08:08 UTC

FWIW, the first task I received completed successfully. http://www.gpugrid.net/workunit.php?wuid=27410175

Reno, NV
Team: SETI.USA

FritzB · Message 59967 · Posted: 23 Feb 2023, 8:18:47 UTC · in response to Message 59964 · Last modified: 23 Feb 2023, 8:19:26 UTC

I've also finished one: https://www.gpugrid.net/workunit.php?wuid=27410166

We're both using Linux Mint. It seems to crash on Win 10 machines (computer #600532 is mine, too).

zombie67 [MM] · Message 59968 · Posted: 23 Feb 2023, 15:33:06 UTC

Overnight, I had 4 of these tasks cancelled by the server.

KAMasud · Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0
Message 59969 · Posted: 23 Feb 2023, 16:50:20 UTC · in response to Message 59961

Same here. I quite enjoy completing these WUs. There should be a way to analyse these WUs to find out why the problems occur on certain machines. We are mostly running the same hardware and OS. It would be fun to see the results.

Ian&Steve C. · Message 59970 · Posted: 23 Feb 2023, 17:27:24 UTC · Last modified: 23 Feb 2023, 17:29:36 UTC

Keep in mind these are beta tasks, and the batch being sent NOW is not necessarily the same as the batch sent last week, and won't be the same as whatever is sent sometime in the future, until they get all the bugs worked out.

The tasks last week basically ran with no perceived use of the GPU or CPU, so what were they doing? Who knows. No official word from the project about these tasks at all. I wasn't willing to let the GPU/CPU be occupied for hours on end with the task spinning its wheels when they could be doing something more useful.

Maybe this current batch has been tweaked since last week and that's why they are working OK. For those who have completed tasks from this latest batch: did they make any meaningful use of the GPU or CPU?

It also seems this batch was released with a new Windows application (they were Linux-only before) for testing.

KAMasud · Message 59971 · Posted: 24 Feb 2023, 5:28:26 UTC · in response to Message 59970

Well, most of us know that Abouh reads every word written on these threads and, without much song and dance, makes changes. He is the only admin on all the projects who diligently attends. Maybe, quite possibly. No arguments with your tweaking statement.

Erich56 · Joined: 1 Jan 15 · Posts: 1171 · Credit: 12,662,148,501 · RAC: 10,668
Message 59972 · Posted: 24 Feb 2023, 8:36:54 UTC · in response to Message 59971

Well, Abouh is the only one from the project team who actively communicates with us volunteers, which is great. All the others obviously don't care, and it has been like this for years, unfortunately.

For example: 9 days ago I asked in the ACEMD 4 thread when new ACEMD 4 tasks will be around, or whether this subproject is dead. No reply so far, even though a reply could be very simple, no longer than a single line :-(

You know what I want to say ... it's kind of disappointing at times :-(