195 (0xc3) EXIT_CHILD_FAILED

Message boards : Number crunching : 195 (0xc3) EXIT_CHILD_FAILED

ServicEnginIC
Message 57644 - Posted: 24 Oct 2021, 16:25:46 UTC - in response to Message 57642.  

Faulty data - a bad task. Not your fault.

+1

Windows Operating System:
EXCEPTIONAL CONDITION: src\mdio\bincoord.c, line 193: "nelems != 1"

Linux Operating System:
EXCEPTIONAL CONDITION: /home/user/conda/conda-bld/acemd3_1618916459379/work/src/mdio/bincoord.c, line 193: "nelems != 1"

When this warning appears, it usually indicates a definition error in the task's initial parameters.
The task was badly constructed at origin, and it will fail on every host that receives it.
Looking at Work Unit #27084895, from which this task derives, it has previously failed on several other hosts, under both Windows and Linux operating systems.
The destiny of a Work Unit like this is to be cancelled once the maximum allowed number of failed tasks (7) is reached...
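The cutoff described above can be sketched as a simplified version of BOINC's per-workunit error limit. This is an assumption-laden illustration, not project code: the limit of 7 comes from this thread, and the real scheduler logic is more involved.

```python
# Simplified sketch of BOINC's per-workunit error cutoff (assumed limit of 7,
# per this thread; the real server logic tracks more states than this).
def workunit_status(error_count: int, max_error_results: int = 7) -> str:
    """Decide whether a workunit is resent or cancelled after a failed task."""
    if error_count >= max_error_results:
        return "error (too many failed results)"
    return "resend to another host"

print(workunit_status(3))  # resend to another host
print(workunit_status(7))  # error (too many failed results)
```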
Greger
Message 57865 - Posted: 24 Nov 2021, 17:42:50 UTC

A new task from batch e1s34_0-ADRIA_GPCR2021_APJ_b0-0-1-RND2256 appears to fail at startup with a Not-A-Number (NaN) coordinate:

ACEMD failed:
    Particle coordinate is nan


Errorcode: process exited with code 195 (0xc3, -61)

WU: https://www.gpugrid.net/workunit.php?wuid=27087091

Richard Haselgrove
Message 57866 - Posted: 24 Nov 2021, 17:59:04 UTC
Last modified: 24 Nov 2021, 18:38:56 UTC

Same here with two different Linux machines:

e1s25_0-ADRIA_GPCR2021_APJ_b0-0-1-RND8388_5
e1s174_0-ADRIA_GPCR2021_APJ_b0-0-1-RND2767_0

The first task has failed for multiple users: I was the first to attempt the second task, but it looks to have gone the same way.

And the same error under Windows:

e1s401_0-ADRIA_GPCR2021_APJ_b0-0-1-RND6370_3
curiously_indifferent
Message 57867 - Posted: 24 Nov 2021, 18:16:08 UTC

Same here - RTX2060 card. Fails after 10 seconds.

https://www.gpugrid.net/result.php?resultid=32662557

https://www.gpugrid.net/result.php?resultid=32663017

https://www.gpugrid.net/result.php?resultid=32663196

https://www.gpugrid.net/result.php?resultid=32663734
ZUSE
Message 57868 - Posted: 24 Nov 2021, 18:45:20 UTC

I have the same problem with 4x Nvidia T600.
Keith Myers
Message 57869 - Posted: 24 Nov 2021, 20:10:02 UTC - in response to Message 57868.  

Looks like we all have the same issue with NaN. I've burned through a couple of dozen today, wasting download cap.
bozz4science
Message 57870 - Posted: 24 Nov 2021, 23:43:48 UTC

Same on my end. Had almost 80 tasks on a single machine today that all failed with said NaN error. Why were so many faulty tasks sent out in the first place?
Keith Myers
Message 57871 - Posted: 25 Nov 2021, 2:00:29 UTC

Thrown away 150 bad tasks today.
ServicEnginIC
Message 57872 - Posted: 25 Nov 2021, 6:40:41 UTC

Zero valid tasks returned overnight; it's clearly a faultily constructed batch.
At least it seems we'll have new work to crunch once it is corrected.
joeybuddy96
Message 57873 - Posted: 25 Nov 2021, 21:29:53 UTC

I got 16 errored tasks: 8 cuda101 and 8 cuda1121, with none completed successfully. The errors were "Particle coordinate is nan" and "The requested CUDA device could not be loaded".
Keith Myers
Message 57875 - Posted: 26 Nov 2021, 18:46:48 UTC

Looks like they are sending out corrected tasks now from that last batch.

I have several running correctly now.
ServicEnginIC
Message 57876 - Posted: 26 Nov 2021, 22:25:15 UTC - in response to Message 57875.  

Right,
Problems seem to have been solved in this new batch of ADRIA tasks.
I also estimate that this batch is considerably lighter than preceding ones, and my GTX 1660 Ti will hit the full bonus with its current task.
Billy Ewell 1931
Message 57877 - Posted: 27 Nov 2021, 5:01:18 UTC - in response to Message 57876.  

Only partial success!
My Xeon-powered machine with a GTX 1060 was restarted about 2 hours ago and is performing without a fault.

My i7 machine with an RTX 2070 and my i3 machine with a GTX 1060 were likewise restarted on GPUGrid, and several tasks have repeatedly failed within 10-13 seconds of starting.

I changed all drivers without any impact.

I suppose we let our boxes run until all the tasks that are going to fail have done so, and the few successful ones are recorded as winners.
Erich56
Message 57878 - Posted: 27 Nov 2021, 9:04:18 UTC

here the tasks failed within a few seconds:

https://www.gpugrid.net/result.php?resultid=32703425
https://www.gpugrid.net/result.php?resultid=32698761

excerpt from stderr:

ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)


so I guess these new WUs are not running on Ampere cards (here: 2 x RTX3070).

However, before the WUs were crunched perfectly with these cards, regardless of the cuda version.

Too bad :-(
Richard Haselgrove
Message 57879 - Posted: 27 Nov 2021, 9:05:10 UTC - in response to Message 57877.  

My i7 machine with an RTX 2070 and my i3 machine with a GTX 1060 were likewise restarted on GPUGrid, and several tasks have repeatedly failed within 10-13 seconds of starting.

Those two machines are running the same tasks - with 'Bandit' in their name - that are running successfully on other machines.

So the problem lies in your machine setup, not in the tasks themselves. A lot of other machines have the same problem - you're not alone.

Your tasks are failing with 'app exit status: 0xc0000135' - in all likelihood, you are missing a Microsoft runtime DLL file. Please refer to message 57353.
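A quick way to check for the suspect runtime from the affected host is a sketch like the following. It assumes `vcruntime140_1` is the DLL implied by the thread; `ctypes.util.find_library` simply asks whether the system loader can locate it, and on non-Windows machines it will naturally find nothing.

```python
import ctypes.util

def runtime_present(name: str = "vcruntime140_1") -> bool:
    """True if the named runtime library can be located by the system loader."""
    return ctypes.util.find_library(name) is not None

if not runtime_present():
    # On Windows this suggests installing the Microsoft VC++ redistributable.
    print("vcruntime140_1 not found by the loader")
```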
Richard Haselgrove
Message 57880 - Posted: 27 Nov 2021, 9:08:06 UTC - in response to Message 57878.  

excerpt from stderr:

ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)


so I guess these new WUs are not running on Ampere cards (here: 2 x RTX3070).

It's not the tasks that fail on Ampere cards - it's the CUDA 101 app. The 1121 app should be OK.
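As far as I understand it, the nvrtc shipped with CUDA 10.1 can only target compute capabilities up to 7.5 (Turing), while Ampere cards report 8.0/8.6, which the CUDA 11.2 build does support. A rough sketch of that mismatch (the version-to-capability mapping here is my assumption, not project documentation):

```python
# Assumed highest compute capability each app build's nvrtc can target
# (simplified: CUDA 10.1 tops out at sm_75, CUDA 11.2 supports sm_86).
MAX_TARGET_SM = {"cuda101": 75, "cuda1121": 86}

def app_can_compile_for(card_sm: int, app: str) -> bool:
    """Whether nvrtc in the given app build accepts the card's architecture."""
    return card_sm <= MAX_TARGET_SM[app]

# An RTX 3070 (Ampere) reports compute capability 8.6:
print(app_can_compile_for(86, "cuda101"))   # False -> invalid --gpu-architecture
print(app_can_compile_for(86, "cuda1121"))  # True
```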
Erich56
Message 57881 - Posted: 27 Nov 2021, 9:31:07 UTC - in response to Message 57880.  

excerpt from stderr:

ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)


so I guess these new WUs are not running on Ampere cards (here: 2 x RTX3070).

It's not the tasks that fail on Ampere cards - it's the CUDA 101 app. The 1121 app should be OK.


This was exactly the problem when the previous batch of WUs started: CUDA101 tasks could not be crunched on Ampere cards, while 1121 WUs went well.

Then someone here on the forum posted instructions on how to change the contents of a specific file in the - I guess - GPUGRID project folder (I forget which file it was). I followed those instructions, and from then on my RTX3070 cards were crunching both CUDA versions.

Since this is no longer the case with the current batch, I suppose something must be different about these new WUs.

From what I remember, about half of the WUs I crunched over several weeks were 101, and about the other half were 1121.

Of course, it is rather impractical to just try downloading task after task, hoping that a 1121 will show up at some point. As is known, after a number of unsuccessful WUs, downloads of new ones are blocked for a day.

I would have expected the GPUGRID people to have repaired this specific problem in the meantime, which obviously is not the case. So they keep blocking an increasing number of hosts, which makes no sense to me at all :-(
Erich56
Message 57882 - Posted: 27 Nov 2021, 9:38:01 UTC - in response to Message 57881.  


Then someone here on the forum posted instructions on how to change the contents of a specific file in the - I guess - GPUGRID project folder (I forget which file it was). I followed those instructions, and from then on my RTX3070 cards were crunching both CUDA versions.

I now remember: it was the Conda-pack.zip... file whose contents had to be changed.
Richard Haselgrove
Message 57883 - Posted: 27 Nov 2021, 9:53:27 UTC - in response to Message 57881.  

I would have expected the GPUGRID people to have repaired this specific problem in the meantime, which obviously is not the case. So they keep blocking an increasing number of hosts, which makes no sense to me at all :-(

I agree completely. But since the project doesn't seem to be (effectively) learning the lessons from previous mistakes, the best we can do is to perform the analysis for them, draw attention to the precise causes, and do what we can to ensure that at least some scientific research is completed successfully. Just burning up tasks with failures, until the maximum error limit for the WU is reached, doesn't help anyone.

The file you need to change is acemd3.exe - it can be found in your current conda-pack.zip.xxxxxx file, in the GPUGrid project folder. Check whether a newer version of that file has been downloaded since you last modified it. Mine is currently dated 05 October - later than our last major discussion on the subject.

That zip pack should also contain vcruntime140_1.dll, but I don't know whether simply placing it in the zip would help - it might need to be specifically added to the unpacking instruction list as well.
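To see what a given conda-pack zip actually contains before editing it, something like the following works. This is a hedged sketch: the file names are the ones mentioned above, and the example path is hypothetical.

```python
import zipfile

def pack_contents(zip_path, wanted=("acemd3.exe", "vcruntime140_1.dll")):
    """Report which of the wanted files appear inside the pack (by basename)."""
    with zipfile.ZipFile(zip_path) as zf:
        names = {entry.split("/")[-1] for entry in zf.namelist()}
    return {w: (w in names) for w in wanted}

# Hypothetical location inside the BOINC data directory:
# print(pack_contents("projects/www.gpugrid.net/conda-pack.zip.1d5...358"))
```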
Erich56
Message 57884 - Posted: 27 Nov 2021, 10:54:44 UTC - in response to Message 57883.  

The conda-pack file currently in the GPUGRID folder is named "Conda-pack.zip.1d5...358" and is dated 4.10.21.
However, the file whose contents I had changed was called "conda-pack.zip.aeb48...86a", and this file, for some reason, no longer exists in the GPUGRID folder.

Maybe it was deleted when GPUGRID was updated this morning while I was retrieving new tasks.
I am sure that both files were in the GPUGRID folder before.

In fact, I remember that the change I made in October was to overwrite the contents of "conda-pack.zip.aeb48...86a" with the contents of "Conda-pack.zip.1d5...358".

What is new since this morning, among other files, is "windows_x86_64_cuda101.zip.c0d...b21", dated 27.11 (the date of this morning's download).
Whether a similar file was in the GPUGRID folder before and may have been deleted this morning, I do not know.

So what I could do is copy the "conda-pack.zip.aeb48...86a" file, of which I had saved a copy in the "documents" folder, back to the GPUGRID folder. Whether it helps or not, I will only see after trying a new task (if it happens to be CUDA101 again).

Any other suggestions?

©2025 Universitat Pompeu Fabra