More Acemd3 tests

Author	Message
ServicEnginIC Send message Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,447 Level Scientific publications	Message 52679 - Posted: 20 Sep 2019, 15:30:36 UTC - in response to Message 52674. There is a loss in performance of 16% due to x1 but on the other hand, Windows with 1070Ti and a full x16 is slightly slower than the 1660Ti hanging on a 1x riser on Ubuntu! Both of my systems have swan_sync enabled and both run CUDA 10.0. Not sure about the other user. As seen in the table at the following link, a GTX1660TI with SWAN_SYNC enabled demands 33% of PCIE X16 bandwidth in my system. https://www.gpugrid.net/forum_thread.php?id=4987&nowrap=true#52633 ID: 52679 · Rating: 0 · rate: / Reply Quote

Aurum Send message Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0 Level Scientific publications	Message 52680 - Posted: 20 Sep 2019, 18:34:19 UTC - in response to Message 52678. Sorry, my bad. I'm talking about 1080 Ti's and you're running 2080 Ti's. No, acemd does not work for Turing GPUs. I get confused as my single 2080 Ti is on a Linux computer. ID: 52680 · Rating: 0 · rate: / Reply Quote

rod4x4 Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0 Level Scientific publications	Message 52684 - Posted: 21 Sep 2019, 1:14:58 UTC Received a TEST work unit a43-TONI_TESTDHFR207c-23-30-RND4156_0 on a Win10 Host with GTX1060 GPU. Applied the following test: Let the work unit run for 11 minutes 13 seconds; suspended it for 1 minute 20 seconds (approx.); resumed the work unit. Results: Work unit had a computational error several seconds after resuming. Observations: Work unit predicted a run time of 36 minutes. This is an improvement on work unit a89-TONI_TESTDHFR206b-23-30-RND6008_0, which had a run time of 66 minutes. The speed issue seems to be improved. ACEMD3 task and Wrapper task disappeared from Task Manager after suspending the task. After resumption / failure, the run time reverted to 2 minutes 12 seconds. STDerr Output time line reflects the full run time of 11 minutes, but Run Time summary only reflects 2 minutes 12 seconds. nvidia-smi reported 78% GPU utilization, which is in line with CUDA80 tasks on this host. nvidia-smi reported similar power usage to CUDA80 tasks on this host. Link to work unit here: http://gpugrid.net/result.php?resultid=21396885 ID: 52684 · Rating: 0 · rate: / Reply Quote

rod4x4 Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0 Level Scientific publications	Message 52686 - Posted: 21 Sep 2019, 2:07:53 UTC Received another TEST work unit a6-TONI_TESTDHFR207-2-3-RND1704. Same testing method as last post, this time allowed the work unit to run 40 minutes 37 seconds before suspending (54% complete). Task failed after resuming 1 minute later. The run time may not have improved as indicated in last post. After 40 minutes 37 seconds the task was 54% completed. So Windows 10 tasks still seem to have a speed issue compared to Linux ACEMD3 tasks. Additionally, I did notice the ACEMD3 task and Wrapper task reappear in Task Manager for a few seconds before the task failed. All other observations consistent with last post. Failed task here: http://gpugrid.net/result.php?resultid=21397083 ID: 52686 · Rating: 0 · rate: / Reply Quote
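Editor's note: the figures in the post above (54% complete after 40 minutes 37 seconds) can be turned into a rough total run time for comparison with the 66 minutes of the earlier work unit. The sketch below assumes progress is linear in time, which the post does not state; the function name is made up for illustration.

```python
def projected_total_minutes(elapsed_min, elapsed_sec, fraction_done):
    """Extrapolate total run time, assuming progress is linear in time."""
    elapsed_seconds = elapsed_min * 60 + elapsed_sec
    return elapsed_seconds / fraction_done / 60


# Figures reported in the post: 40 min 37 s elapsed at 54% complete.
estimate = projected_total_minutes(40, 37, 0.54)
print(f"Projected total run time: {estimate:.1f} minutes")
```

Under that assumption the task would have needed roughly 75 minutes, so no faster than the 66-minute work unit mentioned earlier.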

Erich56 Send message Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1 Level Scientific publications	Message 52687 - Posted: 21 Sep 2019, 5:09:34 UTC - in response to Message 52686. Failed task here: http://gpugrid.net/result.php?resultid=21397083 what caught my eye: in line 8 of the stderr it says "Detected memory leaks!" - whatever this means. ID: 52687 · Rating: 0 · rate: / Reply Quote

rod4x4 Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0 Level Scientific publications	Message 52689 - Posted: 21 Sep 2019, 10:24:03 UTC - in response to Message 52687. what caught my eye: in line 8 of the stderr it says "Detected memory leaks!" - whatever this means. It is a programming error indicating memory is not allocated or de-allocated correctly. This is the suspend/resume bug they are looking to fix. ID: 52689 · Rating: 0 · rate: / Reply Quote

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,447 Level Scientific publications	Message 52695 - Posted: 22 Sep 2019, 10:00:57 UTC Last modified: 22 Sep 2019, 10:02:22 UTC My following W10 computer, GTX1050TI graphics card: https://www.gpugrid.net/show_host_detail.php?hostid=105442 Got an ACEMD3 V2.07 test WU: https://www.gpugrid.net/result.php?resultid=21400000 It was processed till the end, no pauses, and then errored out with indication "195 (0xc3) EXIT_CHILD_FAILED". The same WU but V2.06 was processed successfully by a second computer: WU: https://www.gpugrid.net/result.php?resultid=21400083 Computer: https://www.gpugrid.net/show_host_detail.php?hostid=459450 Something is still to be polished in V2.07 application code and/or scheduler... I also miss previously available information in old ACEMD WUs about graphics card model, clocks, temperatures reached... Perhaps it is not possible in new WUs due to Wrapper's philosophy (?) ID: 52695 · Rating: 0 · rate: / Reply Quote

Billy Ewell 1931 Send message Joined: 22 Oct 10 Posts: 42 Credit: 1,752,050,315 RAC: 57 Level Scientific publications	Message 52704 - Posted: 23 Sep 2019, 19:49:16 UTC e1s20_ubiquitin_50ns_3-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_2-0-2-RND1315. This task had downloaded and failed without my immediate knowledge as I was doing some routine computer updating/restarting. Failure occurred at 52.31 seconds which is of course irrelevant. Machine: I7 Windows10 RTX2080. Do I understand that Toni still wants these failure reports? ID: 52704 · Rating: 0 · rate: / Reply Quote

JStateson Send message Joined: 31 Oct 08 Posts: 186 Credit: 3,578,903,157 RAC: 0 Level Scientific publications	Message 52709 - Posted: 23 Sep 2019, 23:58:38 UTC I lost a pair of those "new" tasks http://www.gpugrid.org/results.php?hostid=467730&offset=0&show_names=0&state=5&appid= I had to reboot to install a fix for a problem with an app. I did stop the Boinc client before rebooting but the two "new" tasks did not survive the reboot. Looking at errors, my only other errors on this system were almost a year ago. ID: 52709 · Rating: 0 · rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891 Level Scientific publications	Message 52718 - Posted: 25 Sep 2019, 5:29:55 UTC Just had my first failure on a restarted CUDA100 task that obeyed the set 60 minute run per project setting. Restarted on a different device and failed. Stderr output <core_client_version>7.16.2</core_client_version> <![CDATA[ <message> process exited with code 195 (0xc3, -61)</message> <stderr_txt> 18:14:25 (25157): wrapper (7.7.26016): starting 18:14:25 (25157): wrapper (7.7.26016): starting 18:14:25 (25157): wrapper: running acemd3 (--boinc input --device 0) 21:09:51 (23108): wrapper (7.7.26016): starting 21:09:51 (23108): wrapper (7.7.26016): starting 21:09:51 (23108): wrapper: running acemd3 (--boinc input --device 1) ERROR: /home/user/conda/conda-bld/acemd3_1566914012210/work/src/mdsim/context.cpp line 324: Cannot use a restart file on a different device! 21:09:55 (23108): acemd3 exited; CPU time 3.826201 21:09:55 (23108): app exit status: 0x9e 21:09:55 (23108): called boinc_finish(195) </stderr_txt> ]]> So this has a failure in common with the Windows wrapper app. https://www.gpugrid.net/result.php?resultid=21408545 ID: 52718 · Rating: 0 · rate: / Reply Quote
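Editor's note: the stderr above shows the wrapper starting acemd3 first with `--device 0` and later with `--device 1`, just before the "Cannot use a restart file on a different device!" error. A minimal sketch for pulling that device sequence out of a stderr dump follows. The regular expression is based only on the wrapper lines quoted in this thread, and the function names are hypothetical.

```python
import re

# Matches lines such as:
#   wrapper: running acemd3 (--boinc input --device 0)
#   wrapper: running acemd3.exe (--boinc input --device 0)
DEVICE_RE = re.compile(r"wrapper: running acemd3\S* \([^)]*--device (\d+)\)")


def devices_used(stderr_text):
    """Return the device index of every acemd3 launch, in order."""
    return [int(match.group(1)) for match in DEVICE_RE.finditer(stderr_text)]


def switched_device(stderr_text):
    """True if the task was launched on more than one device index."""
    return len(set(devices_used(stderr_text))) > 1


sample = (
    "18:14:25 (25157): wrapper: running acemd3 (--boinc input --device 0) "
    "21:09:51 (23108): wrapper: running acemd3 (--boinc input --device 1)"
)
print(devices_used(sample))
print(switched_device(sample))
```

Running this over the stderr of a failed task gives a quick check of whether a restart landed on a different device index.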

Toby Broom Send message Joined: 11 Dec 08 Posts: 26 Credit: 648,944,294 RAC: 584 Level Scientific publications	Message 52719 - Posted: 25 Sep 2019, 6:20:35 UTC I see the same as others on suspend. http://www.gpugrid.net/result.php?resultid=21408995 These WUs don't stress my Titan V much: the load is about 75% and power 100W. My 10xx GPUs are another 10% load and double the power draw. ID: 52719 · Rating: 0 · rate: / Reply Quote

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0 Level Scientific publications	Message 52720 - Posted: 25 Sep 2019, 11:51:49 UTC Last modified: 25 Sep 2019, 11:53:35 UTC I noticed that the new ACEMD3 Windows app v2.06 does not update the boinc_task_state.xml file in the slot directory. It may be related to the checkpoint + "resuming does not work" issue. BTW I don't know why this host received the v2.06, instead of the more recent v2.07. (NVIDIA GeForce GTX 1080 Ti (4095MB) driver: 436.15) ID: 52720 · Rating: 0 · rate: / Reply Quote

KAMasud Send message Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0 Level Scientific publications	Message 52721 - Posted: 25 Sep 2019, 17:49:47 UTC I restarted my machine and the WU crashed. e1s29_ubiquitin_50ns_5-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_1-1-2-RND8951_2 Stderr output <core_client_version>7.14.2</core_client_version> <![CDATA[ <message> (unknown error) - exit code 195 (0xc3)</message> <stderr_txt> 19:49:47 (10992): wrapper (7.9.26016): starting 19:49:47 (10992): wrapper: running acemd3.exe (--boinc input --device 0) Detected memory leaks! Dumping objects -> ..\api\boinc_api.cpp(309) : {1755} normal block at 0x000001D0701589B0, 8 bytes long. Data: < p > 00 00 11 70 D0 01 00 00 ..\lib\diagnostics_win.cpp(417) : {202} normal block at 0x000001D07015C010, 1080 bytes long. Data: <@ > 40 02 00 00 CD CD CD CD E0 01 00 00 00 00 00 00 Object dump complete. 22:43:23 (11696): wrapper (7.9.26016): starting 22:43:23 (11696): wrapper: running acemd3.exe (--boinc input --device 0) # Engine failed: The periodic box size has decreased to less than twice the nonbonded cutoff. 22:43:27 (11696): acemd3.exe exited; CPU time 0.000000 22:43:27 (11696): app exit status: 0x1 22:43:27 (11696): called boinc_finish(195) 0 bytes in 0 Free Blocks. 506 bytes in 8 Normal Blocks. 1144 bytes in 1 CRT Blocks. 0 bytes in 0 Ignore Blocks. 0 bytes in 0 Client Blocks. Largest number used: 0 bytes. Total allocations: 141102 bytes. Dumping objects -> {1814} normal block at 0x0000016B06965620, 48 bytes long. Data: <ACEMD_PLUGIN_DIR> 41 43 45 4D 44 5F 50 4C 55 47 49 4E 5F 44 49 52 {1803} normal block at 0x0000016B069657E0, 48 bytes long. Data: <HOME=D:\ProgramD> 48 4F 4D 45 3D 44 3A 5C 50 72 6F 67 72 61 6D 44 {1792} normal block at 0x0000016B06965BD0, 48 bytes long. Data: <TMP=D:\ProgramDa> 54 4D 50 3D 44 3A 5C 50 72 6F 67 72 61 6D 44 61 {1781} normal block at 0x0000016B06965690, 48 bytes long. 
Data: <TEMP=D:\ProgramD> 54 45 4D 50 3D 44 3A 5C 50 72 6F 67 72 61 6D 44 {1770} normal block at 0x0000016B06965150, 48 bytes long. Data: <TMPDIR=D:\Progra> 54 4D 50 44 49 52 3D 44 3A 5C 50 72 6F 67 72 61 {1759} normal block at 0x0000016B069460C0, 141 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 ..\api\boinc_api.cpp(309) : {1756} normal block at 0x0000016B06966A10, 8 bytes long. Data: < k > 00 00 93 06 6B 01 00 00 {981} normal block at 0x0000016B06944D40, 141 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 {204} normal block at 0x0000016B069661F0, 8 bytes long. Data: < k > 10 90 96 06 6B 01 00 00 {197} normal block at 0x0000016B06965460, 48 bytes long. Data: <--boinc input --> 2D 2D 62 6F 69 6E 63 20 69 6E 70 75 74 20 2D 2D {196} normal block at 0x0000016B06966BA0, 16 bytes long. Data: < k > A8 C4 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {195} normal block at 0x0000016B069662E0, 16 bytes long. Data: < k > 80 C4 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {194} normal block at 0x0000016B06966BF0, 16 bytes long. Data: <X k > 58 C4 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {193} normal block at 0x0000016B06966010, 16 bytes long. Data: <0 k > 30 C4 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {192} normal block at 0x0000016B06966290, 16 bytes long. Data: < k > 08 C4 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {191} normal block at 0x0000016B06966F60, 16 bytes long. Data: < k > E0 C3 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {190} normal block at 0x0000016B06965540, 48 bytes long. Data: <ComSpec=C:\Windo> 43 6F 6D 53 70 65 63 3D 43 3A 5C 57 69 6E 64 6F {189} normal block at 0x0000016B06966150, 16 bytes long. Data: <@O k > 40 4F 96 06 6B 01 00 00 00 00 00 00 00 00 00 00 {188} normal block at 0x0000016B0695A5D0, 32 bytes long. Data: <SystemRoot=C:\Wi> 53 79 73 74 65 6D 52 6F 6F 74 3D 43 3A 5C 57 69 {187} normal block at 0x0000016B06966420, 16 bytes long. 
Data: < O k > 18 4F 96 06 6B 01 00 00 00 00 00 00 00 00 00 00 {185} normal block at 0x0000016B069666A0, 16 bytes long. Data: < N k > F0 4E 96 06 6B 01 00 00 00 00 00 00 00 00 00 00 {184} normal block at 0x0000016B06966650, 16 bytes long. Data: < N k > C8 4E 96 06 6B 01 00 00 00 00 00 00 00 00 00 00 {183} normal block at 0x0000016B06966C40, 16 bytes long. Data: < N k > A0 4E 96 06 6B 01 00 00 00 00 00 00 00 00 00 00 {182} normal block at 0x0000016B06966DD0, 16 bytes long. Data: <xN k > 78 4E 96 06 6B 01 00 00 00 00 00 00 00 00 00 00 {181} normal block at 0x0000016B06966510, 16 bytes long. Data: <PN k > 50 4E 96 06 6B 01 00 00 00 00 00 00 00 00 00 00 {180} normal block at 0x0000016B06964E50, 280 bytes long. Data: < e k PQ k > 10 65 96 06 6B 01 00 00 50 51 96 06 6B 01 00 00 {179} normal block at 0x0000016B06966100, 16 bytes long. Data: < k > C0 C3 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {178} normal block at 0x0000016B069661A0, 16 bytes long. Data: < k > 98 C3 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {177} normal block at 0x0000016B06966560, 16 bytes long. Data: <p k > 70 C3 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {176} normal block at 0x0000016B0694C370, 496 bytes long. Data: <`e k acemd3.e> 60 65 96 06 6B 01 00 00 61 63 65 6D 64 33 2E 65 {65} normal block at 0x0000016B069590E0, 16 bytes long. Data: < > 80 EA 11 A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {64} normal block at 0x0000016B06959950, 16 bytes long. Data: <@ > 40 E9 11 A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {63} normal block at 0x0000016B06959770, 16 bytes long. Data: < W > F8 57 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {62} normal block at 0x0000016B06959040, 16 bytes long. Data: < W > D8 57 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {61} normal block at 0x0000016B06959720, 16 bytes long. Data: <P > 50 04 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {60} normal block at 0x0000016B06959680, 16 bytes long. 
Data: <0 > 30 04 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {59} normal block at 0x0000016B069595E0, 16 bytes long. Data: < > E0 02 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {58} normal block at 0x0000016B06958FF0, 16 bytes long. Data: < > 10 04 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {57} normal block at 0x0000016B06958F00, 16 bytes long. Data: <p > 70 04 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {56} normal block at 0x0000016B06959590, 16 bytes long. Data: < > 18 C0 0C A0 F6 7F 00 00 00 00 00 00 00 00 00 00 Object dump complete. </stderr_txt> ]]> ID: 52721 · Rating: 0 · rate: / Reply Quote

RFGuy_KCCO Send message Joined: 13 Feb 14 Posts: 6 Credit: 1,068,161,100 RAC: 0 Level Scientific publications	Message 52730 - Posted: 27 Sep 2019, 2:02:26 UTC Last modified: 27 Sep 2019, 2:02:36 UTC No issues here with suspending and resuming tasks under Linux. Just suspended a WU and it resumed on the other GPU in that box without issue (both GPUs are RTX 2080's). ID: 52730 · Rating: 0 · rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891 Level Scientific publications	Message 52731 - Posted: 27 Sep 2019, 7:09:48 UTC - in response to Message 52730. No issues here with suspending and resuming tasks under Linux. Just suspended a WU and it resumed on the other GPU in that box without issue (both GPUs are RTX 2080's). Curious. I wonder if my Linux failure was because the paused task did not start back up on the same type of card. ID: 52731 · Rating: 0 · rate: / Reply Quote

zombie67 [MM] Send message Joined: 16 Jul 07 Posts: 209 Credit: 5,496,860,456 RAC: 12,111 Level Scientific publications	Message 52732 - Posted: 27 Sep 2019, 13:31:41 UTC FWIW, my single machine with two GPUs will successfully process CUDA 101 tasks, but fail on CUDA 100 tasks. My other three machines with a single GPU will successfully process both CUDA 101 and CUDA 100 tasks. Reno, NV Team: SETI.USA ID: 52732 · Rating: 0 · rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891 Level Scientific publications	Message 52733 - Posted: 27 Sep 2019, 19:20:10 UTC - in response to Message 52731. No issues here with suspending and resuming tasks under Linux. Just suspended a WU and it resumed on the other GPU in that box without issue (both GPUs are RTX 2080's). Curious. I wonder if my Linux failure was because the paused task did not start back up on the same type of card. Here is a workunit that was run for three different periods on two different cards. But they were the same card type and the WU successfully finished. https://www.gpugrid.net/result.php?resultid=21411774 <core_client_version>7.16.2</core_client_version> <![CDATA[ <stderr_txt> 03:44:01 (19192): wrapper (7.7.26016): starting 03:44:01 (19192): wrapper (7.7.26016): starting 03:44:01 (19192): wrapper: running acemd3 (--boinc input --device 0) 14:06:30 (1677): wrapper (7.7.26016): starting 14:06:30 (1677): wrapper (7.7.26016): starting 14:06:30 (1677): wrapper: running acemd3 (--boinc input --device 2) 19:30:32 (12479): wrapper (7.7.26016): starting 19:30:32 (12479): wrapper (7.7.26016): starting 19:30:32 (12479): wrapper: running acemd3 (--boinc input --device 0) 20:16:14 (12479): acemd3 exited; CPU time 2012.925385 20:16:14 (12479): called boinc_finish(0) So the wrapper app can handle being stopped and restarted on different cards AS LONG as they are the same card type. Two examples now of this fact. But when the WU is restarted on a different card type, something about the previous configuration is kept and does not match up with the new configuration. Could be something as simple as card name or maybe CC capabilities. ID: 52733 · Rating: 0 · rate: / Reply Quote
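Editor's note: the hypothesis in the post above is that a restart only fails when it lands on a different card type, not merely a different device index. The sketch below shows one way to test that against a task's launch sequence. The device-to-model tables are made-up examples, not data from any host in this thread, and nothing here reflects how acemd3 itself validates a restart file.

```python
def model_changes(device_sequence, device_models):
    """Return (previous, current) device pairs where the card model changed.

    device_sequence: device indices in launch order, e.g. [0, 2, 0]
    device_models:   mapping of device index to card model name
    """
    changes = []
    for prev, cur in zip(device_sequence, device_sequence[1:]):
        if device_models[prev] != device_models[cur]:
            changes.append((prev, cur))
    return changes


# Hypothetical hosts, for illustration only.
same_type_host = {0: "GTX 1070 Ti", 1: "GTX 1070 Ti", 2: "GTX 1070 Ti"}
mixed_host = {0: "GTX 1080", 1: "GTX 1070"}

# Launch order 0 -> 2 -> 0, as in the successful task quoted above.
print(model_changes([0, 2, 0], same_type_host))
# Launch order 0 -> 1 on a host with two different card models.
print(model_changes([0, 1], mixed_host))
```

If the hypothesis holds, failed restarts should coincide with a non-empty result and successful ones with an empty list.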

Keith Myers Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891 Level Scientific publications	Message 52734 - Posted: 27 Sep 2019, 23:40:00 UTC I have a request for help from Windows users. Does anyone want to try a development branch of the client that may be able to handle the pause/suspend issues on the acemd3 wrapper apps? I was browsing through the latest commits and came upon PR#3307, which has the tantalizing description of: Description of the Change On Windows, CreateProcess() is used to launch tasks, but this on its own does not handle child processes; if the parent task process exits, the workunit will be terminated. If <wait_for_children> is set in the job file, attach the task process to a job object instead, which can then be monitored to determine when all child processes are finished. Alternate Designs Release Notes Add <wait_for_children> option for tasks in job.xml This sounds like it may address some of the error messages I see in stderr.txt when a wrapper app is suspended or paused. It may also be why Toni has asked whether the wrapper app and the child process acemd3 app are still in the Task Manager list. You can download the latest AppVeyor artifact here for the client. https://ci.appveyor.com/api/buildjobs/y4gd2lvbjjwoa54l/artifacts/deploy%2Fwin-client%2Fwin-client_PR3307_2019-09-26_8665946a.7z ID: 52734 · Rating: 0 · rate: / Reply Quote

JStateson Send message Joined: 31 Oct 08 Posts: 186 Credit: 3,578,903,157 RAC: 0 Level Scientific publications	Message 52735 - Posted: 27 Sep 2019, 23:57:42 UTC - in response to Message 52732. Last modified: 28 Sep 2019, 0:04:05 UTC FWIW, my single machine with two GPUs will successfully process CUDA 101 tasks, but fail on CUDA 100 tasks. My other three machines with a single GPU will successfully process both CUDA 101 and CUDA 100 tasks. I looked at the output of both failing and passing tasks on your system with a pair of 1030s. I did not see anything in the output identifying the type of coprocessor. However, that may be due to the BIOS missing code that identifies itself to BOINC or, more likely, the app does not bother to identify the device or report temperatures like the older apps here. From other projects, MW for example, I see that the work units make timing calculations and adjust parameters accordingly, so as to time out hung tasks and for other purposes. There are two different GT 1030s. One is significantly slower than the other; otherwise they are identical. The newer versions are crippled. I was wondering if the pair you have together are matched. Just a guess, as that could cause unexpected timing values if the app simply checks the name and does not bother to recalculate parameters. ID: 52735 · Rating: 0 · rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891 Level Scientific publications	Message 52736 - Posted: 28 Sep 2019, 0:11:09 UTC The one host that is getting the dominant amount of new acemd3 work just so happens to have three identical EVGA GTX 1070 Ti Black Edition cards and the tasks can apparently restart and run on any one of them after being switched off by the "switch between projects" standard 60 minute delimiter. ID: 52736 · Rating: 0 · rate: / Reply Quote