Message boards : Number crunching : New version of ACEMD 2.17 on multi GPU hosts
**ServicEnginIC** · Joined: 24 Sep 10 · Posts: 592 · Credit: 11,972,186,510 · RAC: 1,447
A new version of ACEMD, 2.17, was introduced on September 2nd, 2021. This was announced by GDF at message #57257. Current and previous program versions can be consulted at the Gpugrid apps page.

This new version presents a peculiar behavior on multi-GPU hosts, not seen on previous app versions, and first mentioned by Ian&Steve C. at message #57261.

For a deeper study, I happened to have two tasks of this new app version 2.17 running simultaneously at this twin GTX 1650 GPU system. Device #0 at this host is an ASUS ROG-STRIX-GTX1650-O4G-GAMING graphics card. I took a screenshot of the Boinc Manager readings. As seen in this image, Boinc Manager "thinks" that the first task is running at device #0, and the second task at device #1. However, looking at the Psensor monitoring utility, while device #0 is running at 100% activity, 77% memory usage and 61% PCIe usage, device #1 is completely cold and inactive. This led me to think that both tasks were in fact running at the same device #0, with its resources split between these two tasks. And it was confirmed when the results came out: both task #27078190 and task #27078191 are shown to have run at device #0.

Then I happened to catch one more task for the same system, which was executed exclusively. The result for this task #27078213 can be seen here. While tasks #27078190 and #27078191, which ran concurrently, took 32451 and 32473 seconds respectively, task #27078213, which ran exclusively, took 15328 seconds. That is, well under half the execution time of the tasks executed concurrently.

For additional confirmation: at another of my hosts, this triple GTX 1650 GPU system, these three tasks were executed simultaneously. Device #0 at this system is the same ASUS ROG-STRIX-GTX1650-O4G-GAMING graphics card as in the previously mentioned twin GPU system. While they were executing, Boinc Manager showed them running at devices #0, #1 and #2. But, as can be seen in this Psensor image, only device #0 was performing at full rate, while devices #1 and #2 were actually idle. After finishing, the execution times for these three tasks were 46512, 46677 and 46719 seconds. If we take 15328 seconds as the run time for a single task executed exclusively, the above times are about 3x longer. And the other two GPUs, #1 and #2, stayed unused in the meantime, for both Gpugrid and other GPU projects.

In massive multi-GPU hosts, like maybe this impressive Ian&Steve C. 8x GPU one, the potential performance drop is multiplied by N. And maybe even device #0's resources can end up overwhelmed when attempting to simultaneously execute such a number of tasks...
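For reference, the per-device activity that Psensor shows can also be confirmed from the command line. A minimal sketch, assuming an NVIDIA driver with nvidia-smi available (the query fields are standard, though exact output formatting varies by driver version):

```bash
# Report per-GPU utilization; a device that is "cold and inactive"
# will show 0 % here while the busy one shows ~100 %
nvidia-smi --query-gpu=index,name,utilization.gpu,utilization.memory,temperature.gpu \
           --format=csv
```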
**Richard Haselgrove** · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
BOINC has had two different ways of signalling to a science app which device number it should utilise:

- OLD (deprecated): pass the device number on the command line.
- NEW (current): pass the device number in an XML file, init_data.xml, in the slot directory.

The situation is complicated here by the use of the wrapper app in the calling chain. Your screenshot (and every task I've looked at so far) shows that the wrapper is calling the acemd3 application, and it looks like it's using the command line convention. That shouldn't be happening.

I've still failed to catch any of these new tasks. I need to maintain my Linux hosts soon, so I'll look for a newer driver while I'm at it. But if anyone catches a live task and can examine init_data.xml, we might get some extra clues.

Edit - look for lines like

```xml
<gpu_type>NVIDIA</gpu_type>
<gpu_device_num>1</gpu_device_num>
<gpu_opencl_dev_index>1</gpu_opencl_dev_index>
```

around 40 lines into the file.
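On a live task, those lines can be pulled from every active slot in one pass. A minimal sketch, assuming the default Linux BOINC data directory /var/lib/boinc-client (adjust the path for other installs; reading the slots may require membership of the boinc group or sudo):

```bash
# Print the GPU assignment BOINC wrote into each running slot
grep -H -E '<gpu_(type|device_num|opencl_dev_index)>' \
    /var/lib/boinc-client/slots/*/init_data.xml
```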
**Ian&Steve C.** · Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
The fact that BOINC “thinks” it’s running on the wrong device exposes two issues.

1. BOINC assigned it to the device it intended, but that was not properly communicated to the application. Maybe a disconnect through the wrapper process, or possibly it’s hard-coded for device 0 as an artifact from dev testing.

2. This also shows that BOINC doesn’t do any verification of which device a process is ACTUALLY running on. It just trusts that the message was received. If it did verification, it would show it running on the wrong device.

It goes further than simply checking Psensor for device activity. If you run nvidia-smi, you can see that acemd is actually assigned to and running on device 0, leaving the idle device with no process running.

Since I only had one GPUGRID task running, and GPUGRID has a 100 resource share with Einstein at 0 RS, when I rebooted BOINC the GPUGRID process restarted on device 0 and BOINC showed/assigned it at 0 as well. This was likely a “when the stars align” scenario. If I had more GPUGRID tasks, I would undoubtedly have continued to experience the issue.

Regarding the method of passing the arguments, it should depend on what server version is being used (Einstein, with its old server version, has no problem with this method). It does appear to be some kind of command line argument, but at which step of the process (pre- or post-wrapper) it is referred to by the stderr file is not clear. The command is being sent as --boinc input --device 0. This is the exact same command structure as used with no issue on previous apps/tasks.
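To make that nvidia-smi check explicit, the compute processes can be queried per GPU and the reported GPU UUIDs mapped back to device indices. A sketch (these query options are standard nvidia-smi, but exact output varies by driver):

```bash
# List which physical GPU each compute process is actually attached to
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name --format=csv
# Map the UUIDs back to device index numbers
nvidia-smi -L
```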
**ServicEnginIC** · Joined: 24 Sep 10 · Posts: 592 · Credit: 11,972,186,510 · RAC: 1,447
Thank you both for sharing your knowledge once more. I'm prioritizing my twin GPU host for requesting new tasks (although at the cost of penalizing my other hosts' chances). I'll consider your tips if I get some (currently scarce) Gpugrid tasks.

Currently this system is running two Genefer 21 Primegrid tasks, also very demanding of GPU power.

Device #0 task properties
Device #1 task properties

I found their assignments, as described by Richard Haselgrove, in their init_data.xml files at the directories /var/lib/boinc-client/slots/4 and /var/lib/boinc-client/slots/5 respectively. In this situation, the nvidia-smi command returns as follows:
**Richard Haselgrove** · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
> Regarding the method of passing the arguments, it should depend on what server version is being used (Einstein, with its old server version, has no problem with this method). It does appear to be some kind of command line argument, but at which step of the process (pre- or post-wrapper) it is referred to by the stderr file is not clear. The command is being sent as --boinc input --device 0. This is the exact same command structure as used with no issue on previous apps/tasks.

We can discount the server theory - the snippet I posted from init_data.xml came from an Einstein task. What it actually depends on is the version of the BOINC API library linked into the science application at compile time. They used API_VERSION_7.15.0 for the autodock v7.17 app. I forget exactly when the transition took place, but it was much, much longer ago than that.

The mechanism is very crude: search the application with a hex editor for api_version, and the numeric value follows that string; the number in the previous paragraph is a direct paste using that method.

Edit - the API version should also appear in the <app_version> tag in client_state.xml. If it's missing, the client is likely to revert to using the old, deprecated, calling syntax.
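The same search can be done without a hex editor. A minimal sketch, assuming the acemd3 binary has been unpacked into the slot directory as bin/acemd3 (run from inside the slot; the paths are assumptions):

```bash
# Look for an embedded BOINC API version string in the science app binary
strings bin/acemd3 | grep -i api_version

# Cross-check the <api_version> the client recorded for each app version
grep api_version /var/lib/boinc-client/client_state.xml
```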
**ServicEnginIC** · Joined: 24 Sep 10 · Posts: 592 · Credit: 11,972,186,510 · RAC: 1,447
Today when returning home I had a surprise waiting: two new-version ACEMD 2.17 long tasks running concurrently at the mentioned twin GPU host.

First of all, I've ceased requesting new tasks from all my hosts, to increase the chance of any newly generated tasks being caught by other users. Then, I've checked that the behavior mentioned at the beginning of this thread reproduces with these two tasks also.

Task #32640189
Task #32640190
Boinc Manager reading
Device #0 task properties (as shown by Boinc Manager)
Device #1 task properties (as shown by Boinc Manager)

init_data.xml slots/5 (Device #1 task): <gpu_type>NVIDIA</gpu_type>
init_data.xml slots/6 (Device #0 task): <gpu_type>NVIDIA</gpu_type>

In this situation, the nvidia-smi command returns as follows:

That is, although the assignments in the init_data.xml files look as expected, and Boinc Manager shows that both devices #0 and #1 are running their own tasks, really both tasks are running at the same device #0, and device #1 is idle.

I've copied the following three whole folders to a safe location for reference:

/var/lib/boinc-client/slots/5
/var/lib/boinc-client/slots/6
/var/lib/boinc-client/projects/www.gpugrid.net

I can examine them in search of any other clue, but I need further advice for this...
**Richard Haselgrove** · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
I can think of three things I'd really like to see:

1) In the slot directories: is there an init_data.xml file, and does the content match the <device_number> reported by BOINC for that slot?

2) What command line was used to launch the ACEMD3 process? I'd use Process Explorer to see that in Windows.

3) Does the ACEMD3 v2.17 binary have an API_VERSION text string embedded within it?
**ServicEnginIC** · Joined: 24 Sep 10 · Posts: 592 · Credit: 11,972,186,510 · RAC: 1,447
> 1) In the slot directories: is there an init_data.xml file, and does the content match the <device_number> reported by BOINC for that slot?

Yes, they did exist, and they matched the devices reported by BOINC Manager. But the task reported by BOINC Manager as running at device 1 differs from the physical device where it actually ran, as reported by the wrapper (and the nvidia-smi command). I'll upload images for future reference, given that tasks will vanish from the Gpugrid database shortly.

- Slots/6: Task e1s6_I6-ADRIA_test_acemd3_newapp_KIXCMYB-0-2-RND9347_2. Reported by BOINC Manager as running at device 0, the same where it actually ran.
- Slots/5: Task e1s1_I2-ADRIA_test_acemd3_newapp_KIXCMYB-1-2-RND8042_0. Reported by BOINC Manager as running at device 1, but it actually ran at device 0.

As shown by the nvidia-smi command, device 1 was actually idle (P8 state, 1% overall activity) while the two mentioned tasks were running concurrently at device 0.

I (barely) rescued the last previous version 2.12 task that ran at a device other than 0 on any of my hosts.

- v2.12 Task 1_3-CRYPTICSCOUT_pocket_discovery_6cacc905_1fa2_4ed0_98e6_1139c20e13df-0-2-RND1223_2. It ran at device 2 at this triple GTX 1650 GPU host.
- v2.17 Task e1s5_I6-ADRIA_test_acemd3_newapp_KIXCMYB-0-2-RND2315_0. It ran at device 0 on the same host.

There are subtle differences between them:

- The wrapper version is different: the previous v2.12 wrapper version was 7.7.26016; the current v2.17 wrapper version is 7.5.25014 (older?).
- At the previous v2.12 version, the wrapper was running the acemd3 process directly. At the current v2.17 version, a preliminary decompressing stage seems to be executed, then the bin/acemd3 process is run.
- v2.12 wrapper device assignment: (--boinc input --device 2); v2.17 wrapper device assignment: (--boinc --device 0).
- Any other difference that I can't see?
**Richard Haselgrove** · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
Thanks. I finally got my own task(s), so I can answer my own other questions.

2) I used htop to view the actual command lines. There are a huge number of acemd3 threads - at least 18 (??!!) - but the command line is uniform:

bin/acemd3 --boinc --device 0

(which was right for this particular task - BOINC had assigned it to device 0)

3) There's no API_VERSION string in the acemd3 binary, so it doesn't have the ability to read the init_data.xml file natively. But it does have strings for the parameters --boinc and --device.

'--boinc' doesn't appear in the wrapper documentation, but --device does. It appears to be correctly specified in the job.xml file as $GPU_DEVICE_NUM, per the documentation.

My task was 'Created 9 Sep 2021 | 6:49:00 UTC', and correctly specifies wrapper version 26014 - the latest available precompiled from BOINC. 25014 may have been a typo, since corrected. 26016 was precompiled for Windows only.

Now I just have to wait for a new task, and force it to run on device 1 to see what happens then.

[afterthought: the precompiled wrapper 26014 is very old - April 2015. I do hope they're recompiling from source]
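Where htop isn't to hand, the live command line can also be read straight from /proc. A minimal sketch (assumes a single running acemd3 process; pgrep and /proc are standard on Linux):

```bash
# Print the exact arguments the wrapper passed to acemd3
# (/proc cmdline entries are NUL-separated, hence the tr)
cat /proc/$(pgrep -f 'bin/acemd3' | head -n1)/cmdline | tr '\0' ' '; echo
```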
**Richard Haselgrove** · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
> At previous v2.12 version, wrapper was running directly the acemd3 process. At current v2.17 version, a preliminary decompressing stage seems to be executed, then bin/acemd3 process is run.

The decompressing stage produces two huge folders in the slot directory: bin and lib. The conda-pack.tar archive alone is 1 GB, and decompressing it, for each task, expands it to 1.8 GB. Watch out for your storage quotas and SSD write limits!

But at least the libboost 1.74 filesystem library is in that lib folder, so hopefully that'll be an end to that class of error.
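A quick way to watch for that growth is to check the slot directory sizes while tasks are running. A sketch, again assuming the default Linux BOINC data directory:

```bash
# Show how much disk each active slot is currently consuming
du -sh /var/lib/boinc-client/slots/*
```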
**Ian&Steve C.** · Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
There seems to be a difference in the command used for the device assignment compared with the past (I missed this before).

Old tasks/apps used this: --boinc input --device n

The 2.17 app uses this: --boinc --device 0

It feels like it's hard-coded to 0 and not using what BOINC assigns.
**Richard Haselgrove** · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
The command line for acemd3 is set by the wrapper. I can't see a meaningful use for the word 'input' on that line. Keywords are set with '--' leading, values without. So the old syntax might have been:

--boinc [filename] --device [integer]

It feels odd, but it's possible - there is a file in the workunit bundle which is given the logical name 'input' at runtime. In which case, why is it missing from the 217 app template, and what is the effect likely to be? (noting that the 217 apps are actually running with the input files supplied)

Just possibly, the sequence might be: detected keyword --boinc. That might show up if I can make a task run standalone in a terminal window - though that'll be mightily fiddly with all these weird filenames around.
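The command line the wrapper is told to build can be checked in its job description file. A sketch, assuming the job file sits in the project directory under a job*.xml name (the exact filename is an assumption and varies per app version):

```bash
# Inspect the command line template the wrapper uses for acemd3;
# per the BOINC wrapper docs, $GPU_DEVICE_NUM is substituted at runtime
grep -B2 -A2 'command_line' /var/lib/boinc-client/projects/www.gpugrid.net/job*.xml
```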
**Ian&Steve C.** · Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Just noting the obvious difference between what worked in the past vs what doesn't work now. It worked before, but not now. What's different? The lack of "input" in the command.
**Richard Haselgrove** · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
And I'm trying to think like a computer, and trying to work out why the observed difference might cause the observed effect. |
**Ian&Steve C.** · Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
That's based on the assumption that the string "input" relates to a filename. It could rather be a preprogrammed specific command, telling the app/wrapper to perform some function. Without full knowledge of the code, it's just speculation at this point.
**Ian&Steve C.** · Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
API_VERSION does appear in the wrapper binary too. You can see this by converting "API_VERSION" to a hex string and searching the binary with hexedit. That makes sense, since I was under the impression that the whole point of the wrapper was to be a middle man between BOINC and the science app, allowing the project some greater flexibility in how it packages and delivers its apps.

The older 2.12 apps have wrapper API_VERSION_7.7.0; the new 2.17 app has wrapper API_VERSION_7.5.0.
**Richard Haselgrove** · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
Yes, I'd expect that. The whole point of the wrapper is to be 'boinc-aware': it will need to read init_data.xml, and export the readings (like the device number) into an external format that the new version of acemd3 can act on. Or not, as the case may be.
**ServicEnginIC** · Joined: 24 Sep 10 · Posts: 592 · Credit: 11,972,186,510 · RAC: 1,447
Finally, this problem has been corrected by the deployment of the new version of ACEMD, 2.18, on Sep 14 2021 | 11:44:39 UTC.

The wrapper version packed with this new app is the same one that already shipped with the previous version of ACEMD, 2.12: 7.7.26016. It seems that this has been the solution.

To the Gpugrid project developers, once more: well done!!!
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
Have you actually received the new app with some new work? No luck here for the past few days. |
**ServicEnginIC** · Joined: 24 Sep 10 · Posts: 592 · Credit: 11,972,186,510 · RAC: 1,447
> Have you actually received the new app with some new work?

Yes, I did.

e1s10_I18-ADRIA_test_acemd3_devicetest_KIXCMYB-0-2-RND3507_1 (link to Gpugrid webpage)
e1s10_I18-ADRIA_test_acemd3_devicetest_KIXCMYB-0-2-RND3507_1 (image)

This task was actually processed at device 1 on my twin GTX 1650 GPU host. I had previously reset the Gpugrid project in BOINC Manager on that system.