More Acemd3 tests

ServicEnginIC
Message 52679 - Posted: 20 Sep 2019, 15:30:36 UTC - in response to Message 52674.  

There is a performance loss of 16% due to x1, but on the other hand, Windows with a 1070 Ti and a full x16 is slightly slower than the 1660 Ti hanging on an x1 riser on Ubuntu! Both of my systems have swan_sync enabled and both run CUDA 10.0. Not sure about the other user.

As seen in the table at the following link, a GTX 1660 Ti with SWAN_SYNC enabled demands 33% of PCIe x16 bandwidth in my system.
https://www.gpugrid.net/forum_thread.php?id=4987&nowrap=true#52633

Aurum
Message 52680 - Posted: 20 Sep 2019, 18:34:19 UTC - in response to Message 52678.  

Sorry, my bad. I'm talking about 1080 Ti's and you're running 2080 Ti's. No, acemd does not work for Turing GPUs.

I get confused as my single 2080 Ti is on a Linux computer.

rod4x4
Message 52684 - Posted: 21 Sep 2019, 1:14:58 UTC

Received a TEST work unit a43-TONI_TESTDHFR207c-23-30-RND4156_0 on a Win10 Host with GTX1060 GPU.

Applied the following test:
Let work unit run for 11 minutes 13 seconds
suspended for 1 minute 20 seconds (approx)
Resumed work unit.

Results:
Work unit had computational error several seconds after resuming

Observations:
Work unit predicted a run time of 36 minutes. This is an improvement on work unit a89-TONI_TESTDHFR206b-23-30-RND6008_0, which had a run time of 66 minutes. Speed issues seem to be improved.
ACEMD3 task and Wrapper task disappeared from Task Manager after suspending task.
After resumption / failure, the run time reverted to 2 minutes 12 seconds.
STDerr Output time line reflects the full run time of 11 minutes, but Run Time summary only reflects 2 minutes 12 seconds.
nvidia-smi reported 78% GPU utilization, which is in line with CUDA80 tasks on this host.
nvidia-smi reported similar power usage to CUDA80 tasks on this host.

Link to Work unit here:
http://gpugrid.net/result.php?resultid=21396885
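
For anyone who wants to repeat this suspend/resume test from a script rather than by hand, here is a minimal sketch. It assumes boinccmd is on the PATH and can reach the local client; the project URL shown in the usage note is a placeholder (use the master URL that `boinccmd --get_project_status` reports), and the helper names are made up for illustration.

```python
import subprocess
import time

def task_command(project_url, task_name, op):
    """Build a boinccmd call for one task; op is 'suspend' or 'resume'."""
    if op not in ("suspend", "resume"):
        raise ValueError(f"unsupported operation: {op}")
    return ["boinccmd", "--task", project_url, task_name, op]

def suspend_resume_cycle(project_url, task_name, pause_seconds,
                         run=subprocess.run, sleep=time.sleep):
    """Suspend a task, wait pause_seconds, then resume it.

    run and sleep are injectable so the sequence can be dry-run
    without a BOINC client. Returns the commands that were issued.
    """
    issued = []
    for op in ("suspend", "resume"):
        cmd = task_command(project_url, task_name, op)
        run(cmd, check=True)
        issued.append(cmd)
        if op == "suspend":
            sleep(pause_seconds)
    return issued
```

A call such as `suspend_resume_cycle("http://www.gpugrid.net/", "a43-TONI_TESTDHFR207c-23-30-RND4156_0", 80)` would mirror the 1 minute 20 second pause used above. Whether the task survives is still something to check in the task's stderr afterwards.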

rod4x4
Message 52686 - Posted: 21 Sep 2019, 2:07:53 UTC

Received another TEST work unit a6-TONI_TESTDHFR207-2-3-RND1704

Same testing method as last post, this time allowed work unit to run 40 minutes 37 seconds before suspending. (54% complete)
Task failed after resuming 1 minute later.

The run time may not have improved as indicated in my last post. After 40 minutes 37 seconds the task was 54% complete, so Windows 10 tasks still seem to have a speed issue compared to Linux ACEMD3 tasks.

Additionally, I did notice the ACEMD3 task and Wrapper task reappear in Task Manager for a few seconds before the task failed.

All other observations consistent with last post.

Failed task here:
http://gpugrid.net/result.php?resultid=21397083

Erich56
Message 52687 - Posted: 21 Sep 2019, 5:09:34 UTC - in response to Message 52686.  

Failed task here:
http://gpugrid.net/result.php?resultid=21397083

what caught my eye:
in line 8 of the stderr it says "Detected memory leaks!" - whatever this means.

rod4x4
Message 52689 - Posted: 21 Sep 2019, 10:24:03 UTC - in response to Message 52687.  

what caught my eye:
in line 8 of the stderr it says "Detected memory leaks!" - whatever this means.


It is a programming error indicating memory is not allocated or de-allocated correctly.

This is the suspend/resume bug they are looking to fix.

ServicEnginIC
Message 52695 - Posted: 22 Sep 2019, 10:00:57 UTC
Last modified: 22 Sep 2019, 10:02:22 UTC

My W10 computer with a GTX 1050 Ti graphics card:
https://www.gpugrid.net/show_host_detail.php?hostid=105442
Got an ACEMD3 V2.07 test WU:
https://www.gpugrid.net/result.php?resultid=21400000
It was processed till the end, no pauses, and then errored out with indication "195 (0xc3) EXIT_CHILD_FAILED".
The same WU but V2.06 was processed successfully by a second computer:
WU: https://www.gpugrid.net/result.php?resultid=21400083
Computer: https://www.gpugrid.net/show_host_detail.php?hostid=459450
Something is still to be polished in V2.07 application code and/or scheduler...

I also miss previously available information in old ACEMD WUs about graphics card model, clocks, temperatures reached...
Perhaps it is not possible in new WUs due to Wrapper's philosophy (?)

Billy Ewell 1931
Message 52704 - Posted: 23 Sep 2019, 19:49:16 UTC

e1s20_ubiquitin_50ns_3-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_2-0-2-RND1315.
This task had downloaded and failed without my immediate knowledge as I was doing some routine computer updating/restarting. Failure occurred at 52.31 seconds which is of course irrelevant. Machine: I7 Windows10 RTX2080.

Do I understand that Toni still wants these failure reports?

JStateson
Message 52709 - Posted: 23 Sep 2019, 23:58:38 UTC

I lost a pair of those "new" tasks

http://www.gpugrid.org/results.php?hostid=467730&offset=0&show_names=0&state=5&appid=

I had to reboot to install a fix for a problem with an app. I did stop the BOINC client before rebooting, but the two "new" tasks did not survive the reboot.

Looking at errors, my only other errors on this system were almost a year ago.

Keith Myers
Message 52718 - Posted: 25 Sep 2019, 5:29:55 UTC

Just had my first failure on a restarted CUDA100 task that obeyed the 60-minute per-project run setting. It restarted on a different device and failed.

Stderr output
<core_client_version>7.16.2</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
18:14:25 (25157): wrapper (7.7.26016): starting
18:14:25 (25157): wrapper (7.7.26016): starting
18:14:25 (25157): wrapper: running acemd3 (--boinc input --device 0)
21:09:51 (23108): wrapper (7.7.26016): starting
21:09:51 (23108): wrapper (7.7.26016): starting
21:09:51 (23108): wrapper: running acemd3 (--boinc input --device 1)
ERROR: /home/user/conda/conda-bld/acemd3_1566914012210/work/src/mdsim/context.cpp line 324: Cannot use a restart file on a different device!
21:09:55 (23108): acemd3 exited; CPU time 3.826201
21:09:55 (23108): app exit status: 0x9e
21:09:55 (23108): called boinc_finish(195)

</stderr_txt>
]]>

So this has a failure in common with the Windows wrapper app.

https://www.gpugrid.net/result.php?resultid=21408545
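
To spot this pattern quickly in other results, the wrapper launch lines can be scanned for the `--device` index. This is only a sketch: it assumes the launch lines keep the format shown in the stderr above, and the function names are made up for illustration.

```python
import re

# Matches e.g. "wrapper: running acemd3 (--boinc input --device 1)"
# and the Windows form "wrapper: running acemd3.exe (...)".
DEVICE_RE = re.compile(r"wrapper: running acemd3\S* \(.*--device (\d+)\)")

def devices_used(stderr_text):
    """Return the --device index from each wrapper launch line, in order."""
    return [int(m.group(1)) for m in DEVICE_RE.finditer(stderr_text)]

def device_switches(stderr_text):
    """Return (previous, next) pairs wherever a restart changed device."""
    devs = devices_used(stderr_text)
    return [(a, b) for a, b in zip(devs, devs[1:]) if a != b]

sample = """\
18:14:25 (25157): wrapper: running acemd3 (--boinc input --device 0)
21:09:51 (23108): wrapper: running acemd3 (--boinc input --device 1)
"""
print(devices_used(sample))     # [0, 1]
print(device_switches(sample))  # [(0, 1)]
```

A device switch only says that the index changed between runs. It does not say whether the two cards are the same model, so it flags results worth a closer look rather than proving the cause of a failure.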

Toby Broom
Message 52719 - Posted: 25 Sep 2019, 6:20:35 UTC

I see the same as others on suspend.

http://www.gpugrid.net/result.php?resultid=21408995

These WUs don't stress my Titan V much: the load is about 75% and power draw about 100 W. My 10xx GPUs show about 10% more load and double the power draw.

Retvari Zoltan
Message 52720 - Posted: 25 Sep 2019, 11:51:49 UTC
Last modified: 25 Sep 2019, 11:53:35 UTC

I noticed that the new ACEMD3 Windows app v2.06 does not update the boinc_task_state.xml file in the slot directory.
It may be related to the checkpoint / "resuming does not work" issue.
BTW I don't know why this host received the v2.06, instead of the more recent v2.07. (NVIDIA GeForce GTX 1080 Ti (4095MB) driver: 436.15)
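
One way to check this on another host is to look at the file's modification time while a task is running. A minimal sketch, assuming you pass the path of the task's slot directory yourself (the location depends on your BOINC data directory); the helper names and the age threshold are hypothetical.

```python
import os
import time

def seconds_since_update(slot_dir, filename="boinc_task_state.xml", now=None):
    """Seconds since the file was last modified, or None if it is missing."""
    path = os.path.join(slot_dir, filename)
    try:
        mtime = os.path.getmtime(path)
    except FileNotFoundError:
        return None
    return (time.time() if now is None else now) - mtime

def looks_stale(slot_dir, max_age_seconds, now=None):
    """True if the state file is missing or older than max_age_seconds."""
    age = seconds_since_update(slot_dir, now=now)
    return age is None or age > max_age_seconds
```

If `looks_stale(slot_dir, 600)` stays true on a task that has been running for much longer than ten minutes, that would match the observation above. The threshold is a guess, since how often the file should be rewritten is not stated in this thread.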

KAMasud
Message 52721 - Posted: 25 Sep 2019, 17:49:47 UTC

I restarted my machine and the WU crashed.

e1s29_ubiquitin_50ns_5-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_1-1-2-RND8951_2

Stderr output
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
19:49:47 (10992): wrapper (7.9.26016): starting
19:49:47 (10992): wrapper: running acemd3.exe (--boinc input --device 0)
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {1755} normal block at 0x000001D0701589B0, 8 bytes long.
Data: < p > 00 00 11 70 D0 01 00 00
..\lib\diagnostics_win.cpp(417) : {202} normal block at 0x000001D07015C010, 1080 bytes long.
Data: <@ > 40 02 00 00 CD CD CD CD E0 01 00 00 00 00 00 00
Object dump complete.
22:43:23 (11696): wrapper (7.9.26016): starting
22:43:23 (11696): wrapper: running acemd3.exe (--boinc input --device 0)
# Engine failed: The periodic box size has decreased to less than twice the nonbonded cutoff.
22:43:27 (11696): acemd3.exe exited; CPU time 0.000000
22:43:27 (11696): app exit status: 0x1
22:43:27 (11696): called boinc_finish(195)
0 bytes in 0 Free Blocks.
506 bytes in 8 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 141102 bytes.
Dumping objects ->
{1814} normal block at 0x0000016B06965620, 48 bytes long.
Data: <ACEMD_PLUGIN_DIR> 41 43 45 4D 44 5F 50 4C 55 47 49 4E 5F 44 49 52
{1803} normal block at 0x0000016B069657E0, 48 bytes long.
Data: <HOME=D:\ProgramD> 48 4F 4D 45 3D 44 3A 5C 50 72 6F 67 72 61 6D 44
{1792} normal block at 0x0000016B06965BD0, 48 bytes long.
Data: <TMP=D:\ProgramDa> 54 4D 50 3D 44 3A 5C 50 72 6F 67 72 61 6D 44 61
{1781} normal block at 0x0000016B06965690, 48 bytes long.
Data: <TEMP=D:\ProgramD> 54 45 4D 50 3D 44 3A 5C 50 72 6F 67 72 61 6D 44
{1770} normal block at 0x0000016B06965150, 48 bytes long.
Data: <TMPDIR=D:\Progra> 54 4D 50 44 49 52 3D 44 3A 5C 50 72 6F 67 72 61
{1759} normal block at 0x0000016B069460C0, 141 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {1756} normal block at 0x0000016B06966A10, 8 bytes long.
Data: < k > 00 00 93 06 6B 01 00 00
{981} normal block at 0x0000016B06944D40, 141 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{204} normal block at 0x0000016B069661F0, 8 bytes long.
Data: < k > 10 90 96 06 6B 01 00 00
{197} normal block at 0x0000016B06965460, 48 bytes long.
Data: <--boinc input --> 2D 2D 62 6F 69 6E 63 20 69 6E 70 75 74 20 2D 2D
{196} normal block at 0x0000016B06966BA0, 16 bytes long.
Data: < k > A8 C4 94 06 6B 01 00 00 00 00 00 00 00 00 00 00
{195} normal block at 0x0000016B069662E0, 16 bytes long.
Data: < k > 80 C4 94 06 6B 01 00 00 00 00 00 00 00 00 00 00
{194} normal block at 0x0000016B06966BF0, 16 bytes long.
Data: <X k > 58 C4 94 06 6B 01 00 00 00 00 00 00 00 00 00 00
{193} normal block at 0x0000016B06966010, 16 bytes long.
Data: <0 k > 30 C4 94 06 6B 01 00 00 00 00 00 00 00 00 00 00
{192} normal block at 0x0000016B06966290, 16 bytes long.
Data: < k > 08 C4 94 06 6B 01 00 00 00 00 00 00 00 00 00 00
{191} normal block at 0x0000016B06966F60, 16 bytes long.
Data: < k > E0 C3 94 06 6B 01 00 00 00 00 00 00 00 00 00 00
{190} normal block at 0x0000016B06965540, 48 bytes long.
Data: <ComSpec=C:\Windo> 43 6F 6D 53 70 65 63 3D 43 3A 5C 57 69 6E 64 6F
{189} normal block at 0x0000016B06966150, 16 bytes long.
Data: <@O k > 40 4F 96 06 6B 01 00 00 00 00 00 00 00 00 00 00
{188} normal block at 0x0000016B0695A5D0, 32 bytes long.
Data: <SystemRoot=C:\Wi> 53 79 73 74 65 6D 52 6F 6F 74 3D 43 3A 5C 57 69
{187} normal block at 0x0000016B06966420, 16 bytes long.
Data: < O k > 18 4F 96 06 6B 01 00 00 00 00 00 00 00 00 00 00
{185} normal block at 0x0000016B069666A0, 16 bytes long.
Data: < N k > F0 4E 96 06 6B 01 00 00 00 00 00 00 00 00 00 00
{184} normal block at 0x0000016B06966650, 16 bytes long.
Data: < N k > C8 4E 96 06 6B 01 00 00 00 00 00 00 00 00 00 00
{183} normal block at 0x0000016B06966C40, 16 bytes long.
Data: < N k > A0 4E 96 06 6B 01 00 00 00 00 00 00 00 00 00 00
{182} normal block at 0x0000016B06966DD0, 16 bytes long.
Data: <xN k > 78 4E 96 06 6B 01 00 00 00 00 00 00 00 00 00 00
{181} normal block at 0x0000016B06966510, 16 bytes long.
Data: <PN k > 50 4E 96 06 6B 01 00 00 00 00 00 00 00 00 00 00
{180} normal block at 0x0000016B06964E50, 280 bytes long.
Data: < e k PQ k > 10 65 96 06 6B 01 00 00 50 51 96 06 6B 01 00 00
{179} normal block at 0x0000016B06966100, 16 bytes long.
Data: < k > C0 C3 94 06 6B 01 00 00 00 00 00 00 00 00 00 00
{178} normal block at 0x0000016B069661A0, 16 bytes long.
Data: < k > 98 C3 94 06 6B 01 00 00 00 00 00 00 00 00 00 00
{177} normal block at 0x0000016B06966560, 16 bytes long.
Data: <p k > 70 C3 94 06 6B 01 00 00 00 00 00 00 00 00 00 00
{176} normal block at 0x0000016B0694C370, 496 bytes long.
Data: <`e k acemd3.e> 60 65 96 06 6B 01 00 00 61 63 65 6D 64 33 2E 65
{65} normal block at 0x0000016B069590E0, 16 bytes long.
Data: < > 80 EA 11 A0 F6 7F 00 00 00 00 00 00 00 00 00 00
{64} normal block at 0x0000016B06959950, 16 bytes long.
Data: <@ > 40 E9 11 A0 F6 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x0000016B06959770, 16 bytes long.
Data: < W > F8 57 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x0000016B06959040, 16 bytes long.
Data: < W > D8 57 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x0000016B06959720, 16 bytes long.
Data: <P > 50 04 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00
{60} normal block at 0x0000016B06959680, 16 bytes long.
Data: <0 > 30 04 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00
{59} normal block at 0x0000016B069595E0, 16 bytes long.
Data: < > E0 02 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00
{58} normal block at 0x0000016B06958FF0, 16 bytes long.
Data: < > 10 04 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00
{57} normal block at 0x0000016B06958F00, 16 bytes long.
Data: <p > 70 04 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00
{56} normal block at 0x0000016B06959590, 16 bytes long.
Data: < > 18 C0 0C A0 F6 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.

</stderr_txt>
]]>
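
The leak dump makes up most of this output, while the line that actually explains the failure ("# Engine failed: ...") is easy to miss. Here is a small sketch for separating the two. It assumes the dump lines keep the "{N} normal block at 0x..., M bytes long." format shown above and that real errors start with "# Engine failed" or "ERROR:", as in the outputs posted in this thread.

```python
import re

# One entry per leaked allocation in the debug heap dump.
BLOCK_RE = re.compile(
    r"\{(\d+)\} normal block at 0x[0-9A-Fa-f]+, (\d+) bytes long\."
)

def summarize_stderr(stderr_text):
    """Split a wrapper stderr into leak-dump totals and real error lines."""
    blocks = [(int(n), int(size)) for n, size in BLOCK_RE.findall(stderr_text)]
    errors = [
        line.strip()
        for line in stderr_text.splitlines()
        if line.startswith(("# Engine failed", "ERROR:"))
    ]
    return {
        "leaked_blocks": len(blocks),
        "leaked_bytes": sum(size for _, size in blocks),
        "errors": errors,
    }

sample = """\
Detected memory leaks!
Dumping objects ->
..\\api\\boinc_api.cpp(309) : {1755} normal block at 0x000001D0701589B0, 8 bytes long.
..\\lib\\diagnostics_win.cpp(417) : {202} normal block at 0x000001D07015C010, 1080 bytes long.
Object dump complete.
# Engine failed: The periodic box size has decreased to less than twice the nonbonded cutoff.
"""
print(summarize_stderr(sample))
```

On the sample this reports 2 leaked blocks totalling 1088 bytes and one error line, which puts the engine failure in front instead of burying it under the dump.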

RFGuy_KCCO
Message 52730 - Posted: 27 Sep 2019, 2:02:26 UTC
Last modified: 27 Sep 2019, 2:02:36 UTC

No issues here with suspending and resuming tasks under Linux. Just suspended a WU and it resumed on the other GPU in that box without issue (both GPUs are RTX 2080's).

Keith Myers
Message 52731 - Posted: 27 Sep 2019, 7:09:48 UTC - in response to Message 52730.  

No issues here with suspending and resuming tasks under Linux. Just suspended a WU and it resumed on the other GPU in that box without issue (both GPUs are RTX 2080's).

Curious. I wonder if my Linux failure was because the paused task did not start back up on the same type of card.

zombie67 [MM]
Message 52732 - Posted: 27 Sep 2019, 13:31:41 UTC

FWIW, my single machine with two GPUs will successfully process CUDA 101 tasks, but fail on CUDA 100 tasks. My other three machines with a single GPU will successfully process both CUDA 101 and CUDA 100 tasks.
Reno, NV
Team: SETI.USA

Keith Myers
Message 52733 - Posted: 27 Sep 2019, 19:20:10 UTC - in response to Message 52731.  

No issues here with suspending and resuming tasks under Linux. Just suspended a WU and it resumed on the other GPU in that box without issue (both GPUs are RTX 2080's).

Curious. I wonder if my Linux failure was because the paused task did not start back up on the same type of card.

Here is a workunit that was run for three different periods on two different cards. But they were the same card type and the WU successfully finished.
https://www.gpugrid.net/result.php?resultid=21411774

<core_client_version>7.16.2</core_client_version>
<![CDATA[
<stderr_txt>
03:44:01 (19192): wrapper (7.7.26016): starting
03:44:01 (19192): wrapper (7.7.26016): starting
03:44:01 (19192): wrapper: running acemd3 (--boinc input --device 0)
14:06:30 (1677): wrapper (7.7.26016): starting
14:06:30 (1677): wrapper (7.7.26016): starting
14:06:30 (1677): wrapper: running acemd3 (--boinc input --device 2)
19:30:32 (12479): wrapper (7.7.26016): starting
19:30:32 (12479): wrapper (7.7.26016): starting
19:30:32 (12479): wrapper: running acemd3 (--boinc input --device 0)
20:16:14 (12479): acemd3 exited; CPU time 2012.925385
20:16:14 (12479): called boinc_finish(0)


So the wrapper app can handle being stopped and restarted on different cards AS LONG as they are the same card type. Two examples now of this fact.

But when the WU is restarted on a different card type, something about the previous configuration is kept and does not match up with the new configuration. Could be something as simple as card name or maybe CC capabilities.

Keith Myers
Message 52734 - Posted: 27 Sep 2019, 23:40:00 UTC

I have a request for help from Windows users. Does anyone want to try a development branch of the client that may be able to handle the pause/suspend issues on the acemd3 wrapper apps?

I was browsing through the latest commits and came upon PR#3307, which has this tantalizing description:

Description of the Change
On Windows, CreateProcess() is used to launch tasks, but this on its own does not handle child processes; if the parent task process exits, the workunit will be terminated. If <wait_for_children> is set in the job file, attach the task process to a job object instead, which can then be monitored to determine when all child processes are finished.

Alternate Designs

Release Notes

Add <wait_for_children> option for tasks in job.xml


This sounds like it may address some of the error messages I see in stderr.txt when a wrapper app is suspended or paused. It may also be why Toni asked whether the wrapper app and the child acemd3 process are still in the Task Manager list.

You can download the latest AppVeyor artifact here for the client.

https://ci.appveyor.com/api/buildjobs/y4gd2lvbjjwoa54l/artifacts/deploy%2Fwin-client%2Fwin-client_PR3307_2019-09-26_8665946a.7z
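
For reference, here is a rough sketch of what a wrapper job file using the new option might look like, generated with Python so it can be checked. The `<job_desc>`, `<task>`, `<application>` and `<command_line>` elements are the usual wrapper job file layout. Where `<wait_for_children>` goes and whether it is an empty flag are assumptions based only on the release note quoted above, and the command line is illustrative.

```python
import xml.etree.ElementTree as ET

def build_job_xml(application, command_line, wait_for_children=True):
    """Build a wrapper job description containing a single task."""
    job = ET.Element("job_desc")
    task = ET.SubElement(job, "task")
    ET.SubElement(task, "application").text = application
    ET.SubElement(task, "command_line").text = command_line
    if wait_for_children:
        # Assumption: the option is an empty flag element inside <task>.
        ET.SubElement(task, "wait_for_children")
    return ET.tostring(job, encoding="unicode")

print(build_job_xml("acemd3.exe", "--boinc input"))
```

Note that the job file ships with the project's application, so this is something for the project side to change. Volunteers testing the PR build would only be swapping the client.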

JStateson
Message 52735 - Posted: 27 Sep 2019, 23:57:42 UTC - in response to Message 52732.  
Last modified: 28 Sep 2019, 0:04:05 UTC

FWIW, my single machine with two GPUs will successfully process CUDA 101 tasks, but fail on CUDA 100 tasks. My other three machines with a single GPU will successfully process both CUDA 101 and CUDA 100 tasks.


I looked at the output of both failing and passing tasks on your system with a pair of GT 1030s. I did not see anything in the output identifying the type of coprocessor. However, that may be because the BIOS is missing the code that identifies the card to BOINC or, more likely, because the app does not bother to identify the device or report temperatures like the older apps here.

From other projects, MW for example, I see that the work units make timing calculations and adjust parameters accordingly, to time out tasks that are hung and for other purposes.

There are two different GT 1030 variants. One is significantly slower than the other; otherwise they are identical. The newer versions are crippled. I was wondering if the pair you have together are matched. Just a guess, as a mismatch could cause unexpected timing values if the app simply checks the name and does not bother to recalculate parameters.

Keith Myers
Message 52736 - Posted: 28 Sep 2019, 0:11:09 UTC

The one host that is getting the dominant amount of new acemd3 work just so happens to have three identical EVGA GTX 1070 Ti Black Edition cards and the tasks can apparently restart and run on any one of them after being switched off by the "switch between projects" standard 60 minute delimiter.

©2025 Universitat Pompeu Fabra