ubuntu cuda100 not surviving restart of client

JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,578,903,157
RAC: 0
Message 53165 - Posted: 27 Nov 2019, 20:09:31 UTC
Last modified: 27 Nov 2019, 20:09:54 UTC

Restarted the client and lost all 3 Linux cuda100 tasks. I did not realize this was a problem.

I probably should have suspended them all before restarting BOINC. This is unfortunate, as I don't always get GPUGrid Linux tasks and I hate to lose the few I get this way.


Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 53173 - Posted: 27 Nov 2019, 22:50:09 UTC - in response to Message 53165.  
Last modified: 27 Nov 2019, 22:51:22 UTC

Restarted the client and lost all 3 Linux cuda100 tasks. I did not realize this was a problem.

I probably should have suspended them all before restarting BOINC. This is unfortunate, as I don't always get GPUGrid Linux tasks and I hate to lose the few I get this way.
The reason for this error is in the stderr output of the task:
<core_client_version>7.16.1</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
09:41:49 (11866): wrapper (7.7.26016): starting
09:41:49 (11866): wrapper (7.7.26016): starting
09:41:49 (11866): wrapper: running acemd3 (--boinc input --device 1)
13:57:59 (13231): wrapper (7.7.26016): starting
13:57:59 (13231): wrapper (7.7.26016): starting
13:57:59 (13231): wrapper: running acemd3 (--boinc input --device 0)
ERROR: /home/user/conda/conda-bld/acemd3_1570536635323/work/src/mdsim/context.cpp line 322:
 Cannot use a restart file on a different device!
13:58:05 (13231): acemd3 exited; CPU time 5.243312
13:58:05 (13231): app exit status: 0x9e
13:58:05 (13231): called boinc_finish(195)

</stderr_txt>
]]>
This can happen only on hosts with multiple GPUs (it is a known bug in the ACEMD3 app).
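The device switch can be read straight from the wrapper lines in the stderr output quoted above. As a rough sketch (the function names are my own, and the regular expression only assumes the line format shown in that output), a few lines of Python can flag a task that was started on more than one device:

```python
import re

# Matches wrapper lines such as:
#   09:41:49 (11866): wrapper: running acemd3 (--boinc input --device 1)
DEVICE_RE = re.compile(r"wrapper: running acemd3 \(.*--device (\d+)\)")


def devices_used(stderr_text):
    """Return the GPU device index of each acemd3 start, in log order."""
    return [int(m.group(1)) for m in DEVICE_RE.finditer(stderr_text)]


def device_changed(stderr_text):
    """True if the task was started on more than one distinct device."""
    return len(set(devices_used(stderr_text))) > 1


# Excerpt of the stderr output from the failed task.
sample = """\
09:41:49 (11866): wrapper (7.7.26016): starting
09:41:49 (11866): wrapper: running acemd3 (--boinc input --device 1)
13:57:59 (13231): wrapper (7.7.26016): starting
13:57:59 (13231): wrapper: running acemd3 (--boinc input --device 0)
"""

print(devices_used(sample))    # [1, 0]
print(device_changed(sample))  # True
```

Here the task first ran on device 1 and was restarted on device 0, which matches the "Cannot use a restart file on a different device!" error.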
To work around it you should:
1. note which task is running on which device
2. suspend all GPUGrid tasks (first the ones that are not running ["ready to start"])
3. restart your host
4. resume your GPUGrid tasks in order of device number (the task that was running on device 0 first, and so on)
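Steps 1 and 4 could be scripted roughly as below. This is only a sketch: it prints `boinccmd --task <url> <task> resume` commands in device order rather than running them, the project URL is an assumption (use whatever your client shows for GPUGrid), and the task names are simply the ones from this thread. It does not by itself guarantee that the client hands each task back to its old device; it only reproduces the resume order described above.

```python
# Assumption: the project URL as your BOINC client lists it; adjust as needed.
PROJECT_URL = "https://www.gpugrid.net/"


def resume_commands(task_devices, project_url=PROJECT_URL):
    """Build one boinccmd resume command per task, lowest device number first.

    task_devices maps task name -> device index noted before the restart (step 1).
    """
    ordered = sorted(task_devices.items(), key=lambda item: item[1])
    return [
        ["boinccmd", "--task", project_url, name, "resume"]
        for name, _device in ordered
    ]


# Task-device pairs noted before the restart (step 1).
noted = {
    "test449-TONI_GSNTEST3-6-100-RND1891_0": 1,
    "initial_1911-ELISA_GSN4V1-9-100-RND1684_0": 0,
    "initial_1243-ELISA_GSN4V1-1-100-RND2537_0": 2,
}

# Print the commands in the order they should be issued (step 4).
for cmd in resume_commands(noted):
    print(" ".join(cmd))
```

Issuing the commands one at a time, and checking that each task has actually started before resuming the next, is probably the safer way to use this.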

JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,578,903,157
RAC: 0
Message 53174 - Posted: 27 Nov 2019, 23:29:14 UTC - in response to Message 53173.  
Last modified: 27 Nov 2019, 23:33:28 UTC

This can happen only on hosts with multiple GPUs (it is a known bug in the ACEMD3 app).
To work around it you should:
1. note which task is running on which device
2. suspend all GPUGrid tasks (first the ones that are not running ["ready to start"])
3. restart your host
4. resume your GPUGrid tasks in order of device number (the task that was running on device 0 first, and so on)


Thanks, I was not aware of that! It is going to be a real problem, as there is a Windows 10 "feature 1909" pending. However, Ubuntu will be unaffected.

Not sure if you noticed, but my "El Cheapo" P102-100 mining card "D1" is far and away faster than the 1660Ti "D0" and especially the GTX-1070 "D2".

GPUGRID	2.10 New version of ACEMD (cuda100)	0.983C + 1NV (d1)	99.87	02:30:22 (02:30:10)	04:16:50	57.000	Running	tb85-nvidia	test449-TONI_GSNTEST3-6-100-RND1891_0	12/2/2019 9:53:34 AM			JStateson	
GPUGRID	2.10 New version of ACEMD (cuda100)	0.983C + 1NV (d0)	99.91	02:30:20 (02:30:12)	04:40:43	53.000	Running	tb85-nvidia	initial_1911-ELISA_GSN4V1-9-100-RND1684_0	12/2/2019 11:52:22 AM			JStateson	
GPUGRID	2.10 New version of ACEMD (cuda100)	0.983C + 1NV (d2)	99.89	02:30:19 (02:30:09)	05:28:30	45.000	Running	tb85-nvidia	initial_1243-ELISA_GSN4V1-1-100-RND2537_0	12/2/2019 1:44:26 PM			JStateson


All 3 tasks above started within 3 seconds of each other and show an elapsed time of about 2:30:19. My guess is that the mining card will finish an hour ahead of the 1660Ti and 2 hours ahead of the 1070.
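For a rough cross-check of that guess, the remaining time can be extrapolated from the listing above. This sketch assumes that the first time column is elapsed time, that the 57.000 / 53.000 / 45.000 column is percent complete, and that each task keeps its average rate so far; the function names are my own.

```python
def hms_to_seconds(hms):
    """Convert an "HH:MM:SS" string to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def projected_remaining(elapsed_hms, progress_percent):
    """Seconds left if the task keeps its average rate so far (linear extrapolation)."""
    elapsed = hms_to_seconds(elapsed_hms)
    total = elapsed * 100.0 / progress_percent
    return total - elapsed


# (elapsed time, percent complete) read from the three rows above.
tasks = {
    "d1 (P102-100)": ("02:30:22", 57.0),
    "d0 (1660Ti)": ("02:30:20", 53.0),
    "d2 (GTX-1070)": ("02:30:19", 45.0),
}

for label, (elapsed, progress) in tasks.items():
    minutes = projected_remaining(elapsed, progress) / 60
    print(f"{label}: about {minutes:.0f} min left")
```

The client's own time-left estimates in the listing (04:16:50, 04:40:43 and 05:28:30) use a different method, so the two sets of numbers will not agree exactly.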