All tasks failed with Exit status 195 (0xc3) EXIT_CHILD_FAILED

Greg _BE

Joined: 30 Jun 14
Posts: 153
Credit: 129,654,684
RAC: 0
Message 59048 - Posted: 29 Jul 2022, 15:52:47 UTC - in response to Message 59047.  

Disk space limits can be solved by tweaking BOINC's limits.

They're quite separate and distinct from the memory (RAM) problems you were having here earlier.



Ok thanks...fixed

Greg _BE
Message 59050 - Posted: 30 Jul 2022, 8:18:40 UTC

New problem... I stopped last night with 98% done and about an hour and a half to go on the task. I did all the normal shutdown procedures: suspend all computing, shut down the client, exit the program. When I restarted this morning the task had gone to hell: time to finish 159 days, 2% done, and time remaining counting UP and not down.

CPU time: 6d 11:39:36
CPU time since checkpoint: 00:14:10
Elapsed time: 3d 06:14:26
Estimated time remaining: 159d 17:47:48
Fraction done: 2.000%

Now after several restarts the time remaining goes down, but still 159 days.


I had another task that was also close to done, but the server considered it timed out. I guess I missed the deadline.

I'll let this task run for a bit longer, but to me it looks all messed up.
I don't see anything wrong in stderr or boinc_task_state
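
The 159-day figure is consistent with a simple linear extrapolation from the elapsed time and the 2% fraction done. The sketch below only illustrates that arithmetic; it assumes a plain linear estimate and is not a description of the BOINC client's actual estimator. The function names are made up for the example.

```python
def parse_duration(text):
    """Parse a 'Dd HH:MM:SS' or 'HH:MM:SS' string into seconds."""
    days = 0
    if "d " in text:
        day_part, text = text.split("d ", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def format_duration(total_seconds):
    """Format a number of seconds as 'Dd HH:MM:SS'."""
    minutes, seconds = divmod(int(total_seconds), 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return f"{days}d {hours:02d}:{minutes:02d}:{seconds:02d}"


def naive_remaining(elapsed_seconds, fraction_done):
    """Linear extrapolation (an assumption): remaining = elapsed * (1 - f) / f."""
    return elapsed_seconds * (1.0 - fraction_done) / fraction_done


elapsed = parse_duration("3d 06:14:26")  # elapsed time reported in the post

# Fraction done wrongly reset to 2% after the restart.
print(format_duration(round(naive_remaining(elapsed, 0.02))))  # 159d 17:47:14

# Fraction done at the real 98% reached before shutdown.
print(format_duration(round(naive_remaining(elapsed, 0.98))))  # 0d 01:35:48
```

With the fraction done reset to 2%, the extrapolation lands within a minute of the 159d 17:47:48 the client displayed; at 98% the same arithmetic gives about an hour and a half, which matches what was left before the shutdown.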

Greg _BE
Message 59051 - Posted: 30 Jul 2022, 10:01:14 UTC - in response to Message 59050.  

It settled down now. 47 minutes left.


Greg _BE
Message 59052 - Posted: 30 Jul 2022, 10:05:52 UTC

Different question: when looking at the CPU% in the BoincTasks program, why do I see 197% and 131% CPU usage? Is that just how these tasks work?
I thought the CPU was for control and guidance only? This almost looks like it is processing as well.

Keith Myers
Joined: 13 Dec 17
Posts: 1423
Credit: 9,189,196,190
RAC: 1,326,743
Message 59053 - Posted: 31 Jul 2022, 1:19:35 UTC - in response to Message 59052.  

It is normal for tasks to temporarily revert to 2% completion upon restart.

But within a few minutes they jump back to the completion percentage they had reached when they were stopped.

And then continue on till finish.

At least that is what they always do on all my Linux hosts.

But I have seen similar comments from others running Windows. Probably best not to chance stopping them on Windows.

The application does in fact use the cpu. Quite a bit in fact. The task will jump back and forth from running on the cpu to a quick spurt on the gpu and then back to the cpu.

The tasks spawn 32 individual python processes on the cpu so you are really using more than 100% of a single cpu core. That is what BoincTasks is detecting and showing.

From Message 59980: "The reason Reinforcement Learning agents do not currently use the whole potential of the cards is because the interactions between the AI agent and the simulated environment are performed on CPU while the agent "learning" process is the one that uses the GPU intermittently."
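
To illustrate why a single task can show more than 100% CPU: a monitor that sums the usage of all of a task's child processes reports the total in units of one core. The sketch below uses made-up per-process numbers purely for illustration (they are not measurements from any host), and the function name is hypothetical.

```python
def combined_cpu_percent(per_process_percent):
    """Sum per-process CPU usage, each expressed as a percentage of one core."""
    return sum(per_process_percent)


# Hypothetical load: 32 Python processes, most of them lightly busy.
hypothetical_loads = [6.0] * 30 + [8.5, 8.5]

total = combined_cpu_percent(hypothetical_loads)
print(f"{len(hypothetical_loads)} processes, {total:.0f}% of one core")
print(f"equivalent to about {total / 100:.2f} fully busy cores")
```

Under these assumed numbers the total comes to 197% of one core, which is the kind of figure BoincTasks shows for these tasks.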

Billy Ewell 1931
Joined: 22 Oct 10
Posts: 42
Credit: 1,758,800,315
RAC: 40,420
Message 59057 - Posted: 4 Aug 2022, 17:43:39 UTC - in response to Message 59018.  

The failure rate on the GPU tasks has reached the point where I feel it is a waste to even try to explain the processes of the failures: 97 out of 101 tasks have failed on either a GTX 1060 or an RTX 3080 and I aborted the RTX task after it wasted 5 days+ of running time, exceeded the return time limit, and still had double-digit days remaining. The three tasks that succeeded used only about 1800 to 3500 seconds of run time.

My patience has expired and I am terminating tasking on Grid for a couple of weeks or so and perhaps the problem can be solved using internal GPUs.

Added comment: Just for the hell of it, I downloaded a new task just now on the GTX 1060 machine and the initial time to compute was shown as 30 DAYS; OH SURE!!! This does not constitute a sound confidence builder.

Billy Ewell 1931 (Yes, my year of birth)




Keith Myers
Message 59058 - Posted: 4 Aug 2022, 21:20:06 UTC - in response to Message 59057.  

Sorry to see you go.

The estimated time to complete values can be completely ignored at GPUGrid.

BOINC has no mechanism for estimating the time remaining for tasks with this dual CPU-GPU nature, so it cannot estimate the time to complete correctly.

On modern GPUs of at least the Pascal generation, the tasks complete well within the standard 5-day deadline. Typical compute times are around 20 minutes to 12 hours.

Windows needs to be set up correctly however to run these tasks properly.

The Windows pagefile size needs to be increased to around 35-50GB for the tasks to run and finish properly.

Greg _BE
Message 59059 - Posted: 4 Aug 2022, 22:36:10 UTC - in response to Message 59057.  

Billy, scroll down this thread a bit.
There is a post where Keith gives some upper and lower limits to the page file size. This cleared things up for me really fast.

I run a 1080 and a 1050 and once I did the page file setting I have never had an error on either card. Run time is about 3 days on these cards, but I am sharing them with Folding At Home, so that might slow things down a bit.

Keith Myers
Message 59060 - Posted: 5 Aug 2022, 3:32:42 UTC - in response to Message 59059.  

Thanks for the confirmation Greg that the Python tasks CAN in fact be properly run to completion well within their deadlines AS LONG as Windows is configured correctly.

Glad to hear you are successfully processing this new work and contributing to cutting edge science.

kotenok2000
Joined: 18 Jul 13
Posts: 79
Credit: 218,028,292
RAC: 180,230
Message 59072 - Posted: 6 Aug 2022, 20:28:50 UTC - in response to Message 59045.  
Last modified: 6 Aug 2022, 20:29:26 UTC

Try installing BOINC on Rocky Linux 8 in VMware Workstation Player. It is free for home use.

Greg _BE
Message 59075 - Posted: 7 Aug 2022, 13:54:34 UTC - in response to Message 59060.  

Just chugging along now. Once that swap space issue was taken care of, no problems. This is a Win10 machine with AMD Ryzen.

God is Love, Jesus proves it. ...
Joined: 23 Mar 15
Posts: 1
Credit: 21,695,263
RAC: 0
Message 59089 - Posted: 9 Aug 2022, 19:05:17 UTC

Adria, please fix the bug in your WUs.
error code 195

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 2
Message 59090 - Posted: 9 Aug 2022, 19:10:10 UTC - in response to Message 59089.  

(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
23:23:38 (19824): wrapper (7.9.26016): starting
23:23:38 (19824): wrapper: running bin/acemd3.exe (--boinc --device 0)
ACEMD failed:
Error invoking kernel: CUDA_ERROR_LAUNCH_FAILED (719)
23:50:50 (19824): bin/acemd3.exe exited; CPU time 1611.078125

A GeForce GTX 1660 Ti should be OK: check your drivers.
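
The log above shows that exit code 195 (0xc3) is only what the wrapper reports; the underlying failure here is a CUDA error in acemd3, not a Python task problem. As a rough sketch of how such a log can be triaged, the snippet below pulls out the application the wrapper started and any CUDA error line. The function name and the regular expressions are assumptions written for this example, based only on the log format quoted in this post.

```python
import re

# stderr text copied from the post above.
SAMPLE_STDERR = """\
23:23:38 (19824): wrapper (7.9.26016): starting
23:23:38 (19824): wrapper: running bin/acemd3.exe (--boinc --device 0)
ACEMD failed:
Error invoking kernel: CUDA_ERROR_LAUNCH_FAILED (719)
23:50:50 (19824): bin/acemd3.exe exited; CPU time 1611.078125
"""


def summarize_stderr(text):
    """Extract the wrapped application and the first CUDA error, if present."""
    app = re.search(r"wrapper: running (\S+)", text)
    cuda = re.search(r"(CUDA_ERROR_[A-Z_]+) \((\d+)\)", text)
    return {
        "app": app.group(1) if app else None,
        "cuda_error": cuda.group(1) if cuda else None,
        "cuda_code": int(cuda.group(2)) if cuda else None,
    }


print(hex(195))                         # 0xc3, the exit status in the thread title
print(summarize_stderr(SAMPLE_STDERR))
```

A log with no CUDA error line returns None for the CUDA fields, which points to a different cause such as the pagefile problem discussed elsewhere in this thread.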

Billy Ewell 1931
Message 59103 - Posted: 12 Aug 2022, 1:58:25 UTC - in response to Message 59059.  

Keith: Thanks for the input but I am personally cautious in changing items for fear I will screw up what I cannot fix.

Here are the current pagefile settings, on automatic; I have changed nothing so far. This is as currently specified:

Minimum allowed: 16 MB
Recommended: 4957 MB
Currently: 45056 MB

As I understand the suggestion is I unclick the automatic setting option and set the Minimum as 35 and the others as ?????.

Await your reply: Bill

kotenok2000
Message 59106 - Posted: 12 Aug 2022, 12:21:16 UTC - in response to Message 59103.  

Try setting it to 51200.

Keith Myers
Message 59108 - Posted: 12 Aug 2022, 15:31:56 UTC - in response to Message 59103.  

Those settings are entered in MB, not GB, and the Python tasks need a pagefile in the tens of GB.

So you need to multiply your 35 by 1000, in other words 35000 MB.
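
The two values suggested in this thread come from two different conversions: 35000 is 35 GB at 1000 MB per GB, while 51200 is 50 GB at 1024 MB per GB. A minimal sketch of the arithmetic follows; the helper name is made up, and either convention should be close enough for setting a pagefile size.

```python
def gb_to_mb(gigabytes, binary=True):
    """Convert GB to MB using 1024 MB per GB (binary) or 1000 MB per GB (decimal)."""
    factor = 1024 if binary else 1000
    return gigabytes * factor


print(gb_to_mb(35, binary=False))  # 35000, the lower value suggested in the thread
print(gb_to_mb(35))                # 35840, the same 35 GB in binary units
print(gb_to_mb(50))                # 51200, the upper value suggested in the thread
```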

Billy Ewell 1931
Message 59135 - Posted: 19 Aug 2022, 17:59:12 UTC

Keith Myers and Kotenok2000:

Once I reset the pagefiles to the recommended values I have processed bunches of tasks without a skip. Thanks for the great advice. BET

The bottom number is 35000MB and the top is 51200MB.

It would seem practical to me for the admins/techs to incorporate the pagefiles criteria in such a way that all contributors will find it easy to find the instructions and likewise easy to modify their machines.

Keith Myers
Message 59137 - Posted: 19 Aug 2022, 23:05:57 UTC

I just made a post in the FAQ section about the pagefile mod needed for Python tasks.

Just need an admin to make it sticky.

Life v lies: Dont be a DNA-den...
Joined: 14 Feb 20
Posts: 16
Credit: 27,395,983
RAC: 0
Message 59151 - Posted: 23 Aug 2022, 16:31:52 UTC

If the GPUGrid project is willing to ask for and accept the in-kind donations of people's GPU time, then GPUGrid has an obligation to do what they can to resolve problematic tasks and code.

If WUs require mods to the defaults in config files, etc., people should NOT have to hunt around in forum posts to glean a solution.

BOINC Manager does have a Notices tab, and it is negligent of GPUGrid not to post needed instructions there, or at least a direct link to the specific forum post, for resolution,
...in particular when the problem is not an isolated issue affecting just a few PCs.

Other projects DO extend the courtesy of communicating via the Notices tab.

LLP, PhD, Prof. Engr.
©2026 Universitat Pompeu Fabra