195 (0xc3) EXIT_CHILD_FAILED

Message boards : Number crunching : 195 (0xc3) EXIT_CHILD_FAILED

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
Message 57973 - Posted: 29 Nov 2021, 21:04:16 UTC - in response to Message 57971.  

That's because another failure doesn't reset the failure count. We need to find out where that's stored, and reduce it to less than 10.
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
Message 57975 - Posted: 29 Nov 2021, 21:37:29 UTC - in response to Message 57959.  

> after yesterday's snafu, I picked up two cuda101 tasks this morning on my Linux Ubuntu 20.04 3080Ti system. currently running ok. been running about 20 mins now, and is utilizing the GPU @99% so it's definitely working. I basically executed a project reset yesterday on this host, so I don't think my previous modifications to swap out the 101 app for 1121 carried over.
That's easy to check: the CUDA1121 app is 990 MB, while the CUDA101 app is 491 MB (503,406 KB).
I think it's impossible to run the CUDA101 app on the RTX 3000 series; that incompatibility was the main reason for demanding a CUDA 11 app not so long ago.

ID: 57975 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Message 57976 - Posted: 30 Nov 2021, 2:50:49 UTC - in response to Message 57975.  

>> after yesterday's snafu, I picked up two cuda101 tasks this morning on my Linux Ubuntu 20.04 3080Ti system. currently running ok. been running about 20 mins now, and is utilizing the GPU @99% so it's definitely working. I basically executed a project reset yesterday on this host, so I don't think my previous modifications to swap out the 101 app for 1121 carried over.
> That's easy to check: the CUDA1121 is 990MB while the CUDA101 is 491MB (503406KB).
> I think it's impossible to run the CUDA101 on RTX3000 series, as that was the main reason demanding a CUDA11 client not so long ago.

my gpugrid project folder contains two compressed files for acemd3.

x86_64-pc-linux-gnu__cuda101.zip.<alphanumeric> (515.5 MB)
x86_64-pc-linux-gnu__cuda1121.zip.<alphanumeric> (1.0 GB)

so it seems it did indeed use the cuda101 code on my 3080Ti and both tasks succeeded.

https://www.gpugrid.net/result.php?resultid=32707549
https://www.gpugrid.net/result.php?resultid=32701203

Since both apps use the same filename, just 'acemd3', it's possible some bug is causing the wrong (or is it right? lol) one to be used, or something to that effect.
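The size comparison suggested above can be sketched in code. This is purely illustrative: the path and the 700 MB threshold are assumptions based on the sizes quoted in this thread (~491-515 MB for the cuda101 app vs. ~1.0 GB for cuda1121), not project-documented values.

```python
import os

# Assumed location of the unpacked binary; adjust for your BOINC data directory.
ACEMD3_PATH = "/var/lib/boinc-client/projects/www.gpugrid.net/acemd3"

def guess_acemd3_build(path, threshold=700_000_000):
    """Guess which CUDA build a binary is, from its size in bytes.

    The thread quotes ~491-515 MB for the cuda101 app and ~1.0 GB for
    cuda1121, so anything above ~700 MB is likely the cuda1121 build.
    """
    size = os.path.getsize(path)
    return "cuda1121" if size > threshold else "cuda101"
```

Since both builds unpack to the same 'acemd3' filename, size (or a checksum against the downloaded archives) is about the only quick way to tell them apart from the outside.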
PDW · Joined: 7 Mar 14 · Posts: 18 · Credit: 6,575,125,525 · RAC: 2
Message 57978 - Posted: 30 Nov 2021, 8:08:50 UTC - in response to Message 57976.  

> since both apps use the same filename of just 'acemd3', it's possible some bug is causing the wrong (or is it right? lol) one to be used or something to that effect.

Don't forget it could be this...
http://www.gpugrid.net/forum_thread.php?id=5256&nowrap=true#57473

Have completed 5 of the recent cuda101 tasks on Ampere hosts now; a sixth is running and a seventh is lined up. No failures seen as yet.
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
Message 57985 - Posted: 1 Dec 2021, 9:43:31 UTC - in response to Message 57978.  

>> since both apps use the same filename of just 'acemd3', it's possible some bug is causing the wrong (or is it right? lol) one to be used or something to that effect.
>
> Don't forget it could be this...
> http://www.gpugrid.net/forum_thread.php?id=5256&nowrap=true#57473
>
> Have completed 5 of the recent cuda101 tasks on Ampere hosts now, a sixth is running and a seventh lined up. Have seen no failures as yet.
I guess that you still use the "special" BOINC manager (compiled for SETI@home), which handles apps in a different way. That would explain this anomaly.
PDW · Joined: 7 Mar 14 · Posts: 18 · Credit: 6,575,125,525 · RAC: 2
Message 57987 - Posted: 1 Dec 2021, 9:59:34 UTC - in response to Message 57985.  

> I guess that you still use the "special" BOINC manager (compiled for SETI@home), and that handles apps in a different way. That would explain this anomaly.

No.
No modified manager or client here, just the bog standard BOINC 7.16.6
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 428
Message 57988 - Posted: 1 Dec 2021, 10:21:40 UTC - in response to Message 57987.  
Last modified: 1 Dec 2021, 10:24:55 UTC

> ... just the bog standard BOINC 7.16.6

You are recommended to upgrade to v7.16.20 - it's pretty good code, and - importantly - it has updated SSL security certificates needed by some BOINC projects.

(Edit - the above advice applies only to Windows machines. If you're running Linux, you can ignore it. Your computers are hidden, so I don't know which applies)
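Incidentally, version strings like these compare numerically per component, not lexicographically, which is why 7.16.20 is newer than 7.16.6 despite "6" sorting after "2" as text. A small illustration (plain Python, no BOINC APIs assumed):

```python
def version_tuple(v):
    """Split a dotted version string into a tuple of ints for comparison."""
    return tuple(int(part) for part in v.split("."))

# 7.16.6 really is older than 7.16.20, even though a plain string
# comparison claims the opposite ("6" > "2" character-by-character).
assert version_tuple("7.16.6") < version_tuple("7.16.20")
assert "7.16.6" > "7.16.20"  # the naive string compare gets it backwards
```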
Erich56 · Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
Message 58071 - Posted: 11 Dec 2021, 15:16:28 UTC

a few hours ago, I had another task which failed after a few seconds with

195 (0xc3) EXIT_CHILD_FAILED

ACEMD failed:
Particle coordinate is nan


https://www.gpugrid.net/workunit.php?wuid=27099407

As can be seen, the task failed on a total of 8 different hosts.
I question the rationale behind sending out a faulty task 8 times :-(((

©2025 Universitat Pompeu Fabra