ATM: Free Energy Calculations new application
| Author | Message |
|---|---|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Just starting the thread for discussion of this new application. ATM = AToM.

After a little snafu with the first batch (incorrect config files), the latest batch seems to run on my system. No idea about runtime yet, or whether it will finish successfully.

This is another Python-based application. The package ships with its own Python environment, similar to how the PythonGPU Reinforcement Learning (RL) app does.

Test bench: Xeon E5-2697Av4 (16c/32t), 64GB DDR4-2400 RDIMM (ECC), RTX 3060 12GB, Ubuntu 22.04.1

Observed behavior so far:
- uses ~97% of the GPU core, ~45% of the GPU memory bus, ~0-1% of the PCIe bus, and close to full power
- about 400-500MB VRAM used (low, like acemd3)
- does not like to be paused and resumed, or BOINC stopped and restarted; either one causes the task to fail

Total runtime is still unknown, since the one task I had failed when I restarted BOINC lol.
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
About the restart failure: it looks like it fails trying to create a directory that already exists.

mkdir: cannot create directory 'atm_tmp': File exists

Needs some work to allow for that.
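One way to make a restart tolerate the leftover directory is to create it idempotently. The sketch below is only an illustration in Python; the thread does not show how the application actually creates `atm_tmp` (the error text comes from the shell `mkdir` command), so the helper name `ensure_work_dir` and the temporary location are assumptions made for the example.

```python
import os
import tempfile


def ensure_work_dir(path: str) -> str:
    """Create the directory if it is missing; reuse it if it already exists."""
    # exist_ok=True suppresses the "File exists" error on a second call.
    os.makedirs(path, exist_ok=True)
    return path


# Demonstrate in a throwaway location so nothing real is touched.
with tempfile.TemporaryDirectory() as base:
    work = os.path.join(base, "atm_tmp")
    ensure_work_dir(work)  # first start: directory is created
    ensure_work_dir(work)  # restart: no error, directory is reused
    print(os.path.isdir(work))
```

In a shell script, `mkdir -p atm_tmp` behaves the same way. Whether reusing the old directory contents is safe for a resumed task is a separate question that only the project can answer.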
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Another quality-of-life improvement would be adding a <weight> line to the main task in the job.xml file. Right now, with 2 tasks in the file and no weights defined, I'm guessing it splits progress 50/50, so it thinks the task is 50% done once the extraction phase is complete.
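To make the effect of weights concrete, here is a rough sketch of the arithmetic. It assumes the wrapper reports progress as the weighted share of finished tasks, with a default weight of 1, and it ignores progress inside the running task. The job file below is hypothetical: `/bin/tar` and `xf input.tar.bz2` appear in the wrapper output quoted in this thread, but `run.sh` and the 1/99 split are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical job file: a short extraction step followed by the main task.
JOB_XML = """
<job_desc>
    <task>
        <application>/bin/tar</application>
        <command_line>xf input.tar.bz2</command_line>
        <weight>1</weight>
    </task>
    <task>
        <application>run.sh</application>
        <weight>99</weight>
    </task>
</job_desc>
"""


def task_weights(job_xml: str) -> list:
    """Return one weight per <task>, using 1.0 when <weight> is absent."""
    root = ET.fromstring(job_xml)
    return [float(t.findtext("weight", default="1")) for t in root.findall("task")]


def fraction_after(weights, completed: int) -> float:
    """Progress shown once the first `completed` tasks have finished."""
    return sum(weights[:completed]) / sum(weights)


print(f"2 tasks, no weights, 1 done:  {fraction_after([1, 1], 1):.1%}")
print(f"3 tasks, no weights, 2 done:  {fraction_after([1, 1, 1], 2):.3%}")
print(f"4 tasks, no weights, 3 done:  {fraction_after([1, 1, 1, 1], 3):.1%}")
print(f"4 tasks, 1-1-1-97,   3 done:  {fraction_after([1, 1, 1, 97], 3):.1%}")
print(f"job file above,      1 done:  {fraction_after(task_weights(JOB_XML), 1):.1%}")
```

Under these assumptions, unweighted job files with two, three, and four tasks would stall at 50%, 66.666%, and 75%, which matches the figures reported elsewhere in this thread, while a small weight on the setup steps keeps the bar near zero until the main task starts.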
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
The task ran to completion in about an hour, but then hit an error and threw it all away because the file size is too big. Upload failure:

<file_xfer_error>
  <file_name>T11_4-RAIMIS_TEST_ATM-0-1-RND7054_2_0</file_name>
  <error_code>-131 (file size too big)</error_code>
</file_xfer_error>

What a waste.
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
I have over a dozen quick-failing ATM tasks. The wrapper does not have a correctly named tar file or something.

02:56:29 (1242346): wrapper: running /bin/tar (xf input.tar.bz2)
/bin/tar: This does not look like a tar archive
bzip2: (stdin) is not a bzip2 file.
/bin/tar: Child returned status 2
/bin/tar: Error is not recoverable: exiting now
|
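The log shows both tar and bzip2 rejecting `input.tar.bz2`, which suggests the file's contents are not a bzip2-compressed tar archive at all. A quick local check along those lines could look like the sketch below. It is only an illustration: the helper names are invented, and the thread does not say what the bad input files actually contained.

```python
import io
import tarfile

BZIP2_MAGIC = b"BZh"  # every bzip2 stream starts with these three bytes


def looks_like_bzip2(data: bytes) -> bool:
    """Cheap check on the first bytes only."""
    return data[:3] == BZIP2_MAGIC


def is_readable_tar_bz2(data: bytes) -> bool:
    """True if the bytes open as a bzip2-compressed tar archive."""
    if not looks_like_bzip2(data):
        return False
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:bz2") as tar:
            tar.getmembers()  # forces a read of every member header
        return True
    except (tarfile.TarError, OSError, EOFError):
        return False


def make_tar_bz2(name: str, payload: bytes) -> bytes:
    """Build a tiny valid archive in memory for the demonstration."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


good = make_tar_bz2("config.txt", b"example")
bad = b"this is not an archive"
print(is_readable_tar_bz2(good), is_readable_tar_bz2(bad))
```

To check a real download, read the file with `open(path, "rb").read()` and pass the bytes to `is_readable_tar_bz2`.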
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Looks like the small batch of tasks that went out today is better set up. They ran for about an hour and completed successfully, without the file size issue at the end. Great :) I'd still like a little more background info on these tasks: what they are doing, and the goal of the research.
|
|
Joined: 7 Apr 15 Posts: 17 Credit: 2,978,057,945 RAC: 73
|
This one https://www.gpugrid.net/workunit.php?wuid=27399736 has been running for about 11 hours and has been stuck at 66.666% for at least 4 hours now. There is almost no load on the GPU, just a few percent (3-5) once in a while, but there is constantly some load on the memory controller (10-30). Hope it will finish some day :) |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Still no official communication from the project about these tasks. The recent batches have been very hit or miss and behave quite differently from what I described in my initial post.

- "TL2" tasks ran for hours and hours with little to no GPU or CPU use. I aborted them and moved on.
- "TL3" tasks yesterday also had little to no GPU or CPU use, but did complete in about 30 mins.
- "TL4" tasks today seem like a repeat of TL2: no GPU use, and they run for hours with no progress.

Also, weights need to be defined in the job.xml file so the tasks don't jump to 75% after a few seconds and then sit there for hours doing nothing.
|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
Just been sent a TL4 from WU 27405970. I see you've aborted two previous tasks from the same WU, Ian, on two different machines. Did you get any CPU usage figures from previous runs? I think I'll start it up with the GTX 1660 plus one core, but I'll probably abort it myself if it doesn't show much response. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
They spin up multiple processes like the Python tasks do, but I didn't catch them at the very beginning to see if they spike in use or anything like that. Once they get going, they basically sit idle as far as the GPU and CPU go, with little to no use at all. I just killed them rather than letting them sit there for hours occupying my GPU.
|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
OK, I've set 3 CPUs for continuity from the current Python task, and I've put weights of 1-1-1-97 in the job file so I can see what's happening. My normal remote monitoring console shows the current average CPU usage, and I've put nvidia-smi on a five second loop. If either of those drops to zero, I'll abort it. Chocks away! |
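The "nvidia-smi on a five second loop" idea can be scripted so the stall check is automatic. The following is a minimal sketch, not the setup used in the post: the function names, the 1% threshold, and the 12-sample window (one minute at five-second intervals) are all assumptions, and it expects `nvidia-smi` to print a plain integer per GPU.

```python
import subprocess
import time

# Ask nvidia-smi for GPU utilization only, as bare numbers (one line per GPU).
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]


def parse_utilization(output: str) -> list:
    """Parse one integer percentage per non-empty line of output."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def is_stalled(samples, threshold: int = 1, window: int = 12) -> bool:
    """True when the last `window` samples are all below `threshold` percent."""
    if len(samples) < window:
        return False
    return all(s < threshold for s in samples[-window:])


def watch(interval: float = 5.0, window: int = 12) -> None:
    """Poll the busiest GPU until it has looked idle for a full window."""
    samples = []
    while True:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        values = parse_utilization(result.stdout)
        if values:
            samples.append(max(values))
        if is_stalled(samples, window=window):
            print(f"GPU idle for {interval * window:.0f} s; consider aborting the task")
            return
        time.sleep(interval)
```

Calling `watch()` on a machine with an NVIDIA driver installed starts the loop. A single spike, such as a one-off 41% reading, resets the idle window, so a longer window or a higher threshold may suit tasks that only touch the GPU occasionally.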
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
I see what you mean. Nearly half an hour in, CPU usage is showing around 25% of a single core, and GPU usage spiked once, to 41%, after about a quarter of an hour. It's one way of saving electricity, but I'd rather be doing something useful. Aborting. |
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
|
1.13 ATM running fine for me. Keep aborting them and I'll run them for you. |
|
Joined: 16 Jul 07 Posts: 209 Credit: 5,496,860,456 RAC: 12,111
|
FWIW, the first task I received completed successfully. http://www.gpugrid.net/workunit.php?wuid=27410175 Reno, NV Team: SETI.USA |
|
Joined: 7 Apr 15 Posts: 17 Credit: 2,978,057,945 RAC: 73
|
I've also finished one: https://www.gpugrid.net/workunit.php?wuid=27410166 We're both using Linux Mint. It seems to crash on Win 10 machines (computer #600532 is mine, too). |
|
Joined: 16 Jul 07 Posts: 209 Credit: 5,496,860,456 RAC: 12,111
|
|
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
1.13 ATM running fine for me.
_______________
Same here. I quite enjoy completing these WUs. There should be a way to analyse why these WUs fail on certain machines, since we are mostly running the same hardware and OS. It would be fun to see the results. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 6,423
|
Keep in mind these are beta tasks, and the batch being sent NOW is not necessarily the same as the batch sent last week, and won't be the same as whatever is sent sometime in the future, until they get all the bugs worked out.

The tasks last week basically ran with no perceived use of the GPU or CPU, so what were they doing? Who knows. There has been no official word from the project about these tasks at all. I wasn't willing to let the GPU/CPU be occupied for hours on end with the task spinning its wheels when they could be doing something more useful.

Maybe this current batch has been tweaked since last week and that's why it is working OK. For those who have completed this latest batch: did the tasks make any meaningful use of the GPU or CPU?

It also seems this batch was released with a new Windows application (they were Linux only before) for testing.
|
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
Keep in mind these are beta tasks, and the batch being sent NOW is not necessarily the same as the batch sent last week, and won't be the same as whatever is sent sometime in the future, until they get all the bugs worked out. The tasks last week basically ran with no perceived use of the GPU or CPU, so what were they doing? Who knows. There has been no official word from the project about these tasks at all. I wasn't willing to let the GPU/CPU be occupied for hours on end with the task spinning its wheels when they could be doing something more useful.
_______________________
Well, most of us know that Abouh reads every word written on these threads and, without much song and dance, makes changes. He is the only admin on all the projects who diligently attends. Maybe, quite possibly. No arguments with your tweaking statement. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Well, most of us know that Abouh reads every word written on these threads and, without much song and dance, makes changes. He is the only admin on all the projects who diligently attends. Maybe, quite possibly. No arguments with your tweaking statement.
_______________
Well, Abouh is the only one from the project team who actively communicates with us volunteers, which is great. All the others obviously don't care, and unfortunately it has been like this for years. For example: 9 days ago I asked in the ACEMD 4 thread when new ACEMD 4 tasks will be around, or whether this subproject is dead. No reply so far, even though a reply could be very simple, no longer than a single line :-( You know what I want to say ... it's kind of disappointing at times :-( |
©2025 Universitat Pompeu Fabra