Message boards : Number crunching : 195 (0xc3) EXIT_CHILD_FAILED
| Author | Message |
|---|---|
ServicEnginIC Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,447
|
Faulty data - a bad task. Not your fault. +1
Windows Operating System: EXCEPTIONAL CONDITION: src\mdio\bincoord.c, line 193: "nelems != 1"
Linux Operating System: EXCEPTIONAL CONDITION: /home/user/conda/conda-bld/acemd3_1618916459379/work/src/mdio/bincoord.c, line 193: "nelems != 1"
When this warning appears, it usually means there is a definition error in the task's initial parameters. The task was badly constructed at origin, and it will fail on every host that receives it. Looking at Work Unit #27084895, from which this task descends, it had already failed on several other hosts, running both Windows and Linux. The fate of a Work Unit like this is to be retired once the maximum number of allowed failed tasks (7) is reached... |
|
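For reference, the condition that trips in mdio/bincoord.c is a consistency test on the task's binary coordinate input. Below is a minimal Python sketch of that kind of validation; the assumed layout (a 4-byte atom count followed by three float64 values per atom, as in NAMD-style .coor files) and the file handling are illustrative assumptions, not the actual ACEMD code.

```python
# Hypothetical sanity check on a NAMD-style binary coordinate file
# (assumed layout: little-endian int32 atom count, then natoms * 3 float64
# positions). A sketch only - not the actual mdio/bincoord.c logic.
import math
import struct
import sys

def check_bincoord(path: str) -> str:
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 4:
        return "file too short to hold an atom count"
    (natoms,) = struct.unpack("<i", data[:4])
    expected = 4 + natoms * 3 * 8
    if natoms <= 0 or len(data) != expected:
        # roughly what a "nelems != 1"-style consistency failure implies:
        # the header and the payload do not describe the same structure
        return f"inconsistent header: natoms={natoms}, size={len(data)}, expected={expected}"
    coords = struct.unpack(f"<{natoms * 3}d", data[4:])
    if any(math.isnan(x) for x in coords):
        return "Particle coordinate is nan"
    return "ok"

if __name__ == "__main__":
    print(check_bincoord(sys.argv[1]))
```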
Joined: 6 Jan 15 Posts: 76 Credit: 25,499,534,331 RAC: 0
|
New task from batch e1s34_0-ADRIA_GPCR2021_APJ_b0-0-1-RND2256 appears to break at start with Not A Number for a coordinate:
ACEMD failed: Particle coordinate is nan
Error code: process exited with code 195 (0xc3, -61)
WU: https://www.gpugrid.net/workunit.php?wuid=27087091 |
|
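As an aside, the three numbers reported in "process exited with code 195 (0xc3, -61)" are the same 8-bit exit status written three ways: decimal, hexadecimal, and as a signed byte. A one-liner to confirm the arithmetic:

```python
# 195 decimal == 0xc3 == -61 when the byte is interpreted as signed
code = 195
print(code, hex(code), code - 256 if code > 127 else code)  # 195 0xc3 -61
```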
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
Same here with two different Linux machines:
e1s25_0-ADRIA_GPCR2021_APJ_b0-0-1-RND8388_5
e1s174_0-ADRIA_GPCR2021_APJ_b0-0-1-RND2767_0
The first task had already failed for multiple users; I was the first to attempt the second task, but it looks to have gone the same way.
And the same error under Windows:
e1s401_0-ADRIA_GPCR2021_APJ_b0-0-1-RND6370_3 |
|
Joined: 20 Nov 17 Posts: 21 Credit: 1,589,581,263 RAC: 0
|
Same here - RTX 2060 card. Fails after 10 seconds.
https://www.gpugrid.net/result.php?resultid=32662557
https://www.gpugrid.net/result.php?resultid=32663017
https://www.gpugrid.net/result.php?resultid=32663196
https://www.gpugrid.net/result.php?resultid=32663734 |
|
Joined: 10 Jun 20 Posts: 7 Credit: 980,066,632 RAC: 0
|
I have the same problem with 4x Nvidia T600. |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
Looks like we all have the same issue with NaN. I've bombed through a couple dozen today - just wasted download cap. |
|
Joined: 22 May 20 Posts: 110 Credit: 115,525,136 RAC: 0
|
Same on my end. Had almost 80 tasks on a single machine today that all failed with said NaN error. Why were so many faulty tasks sent out in the first place? |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
I've thrown away 150 bad tasks today. |
ServicEnginIC Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,447
|
Zero valid tasks returned overnight; it's clearly a badly constructed batch. At least it seems we'll have new work to crunch once it is corrected. |
|
Joined: 1 Apr 20 Posts: 9 Credit: 146,536,770 RAC: 0
|
I got 16 errored tasks: 8 cuda101 and 8 cuda1121. No tasks completed successfully. The errors were "Particle coordinate is nan" and "The requested CUDA device could not be loaded".
|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
Looks like they are sending out corrected tasks now from that last batch. Have several running now correctly. |
ServicEnginIC Joined: 24 Sep 10 Posts: 592 Credit: 11,972,186,510 RAC: 1,447
|
Right, problems seem to have been solved in this new batch of ADRIA tasks. I also estimate that this batch is considerably lighter than the previous ones, and my GTX 1660 Ti will hit the full bonus with its current task. |
|
Joined: 22 Oct 10 Posts: 42 Credit: 1,752,050,315 RAC: 57
|
Only partial success! My Xeon-powered machine with a GTX 1060 was restarted about 2 hours ago and is performing without a fault. My i7 machine with an RTX 2070 and my i3 machine with a GTX 1060 were likewise restarted with GPUGrid, and their tasks have all repeatedly failed within 10-13 seconds of starting. I changed all drivers without any impact. I suppose we let our boxes run until all the tasks that are going to fail have failed, and the few that succeed are recorded as winners. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Here the tasks failed within a few seconds:
https://www.gpugrid.net/result.php?resultid=32703425
https://www.gpugrid.net/result.php?resultid=32698761
Excerpt from stderr:
ACEMD failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
So I guess these new WUs are not running on Ampere cards (here: 2 x RTX 3070). However, earlier WUs were crunched perfectly with these cards, regardless of the CUDA version. Too bad :-( |
|
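The nvrtc message above matches the known limitation that the CUDA 10.x runtime compiler cannot target Ampere: CUDA 10.1/10.2 top out at sm_75 (Turing), while an RTX 3070 is sm_86 and needs CUDA 11.1 or later. A rough Python sketch of that check is below; the capability table is simplified, and the `compute_cap` query field of nvidia-smi requires a reasonably recent driver.

```python
# Illustrative check only: compare each GPU's compute capability with the
# maximum architecture a given CUDA toolkit's nvrtc can target.
# Table values from NVIDIA's public release notes: CUDA 10.1/10.2 stop at
# sm_75 (Turing); sm_86 (Ampere GA10x, e.g. RTX 3070) needs CUDA 11.1+.
import subprocess

MAX_SM = {"10.1": 75, "10.2": 75, "11.0": 80, "11.2": 86}  # simplified

def gpu_compute_caps():
    # Requires a driver recent enough to support the compute_cap query field.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(float(line) * 10) for line in out.split() if line]

def can_compile(toolkit: str, sm: int) -> bool:
    return sm <= MAX_SM.get(toolkit, 0)

if __name__ == "__main__":
    for sm in gpu_compute_caps():
        for toolkit in ("10.1", "11.2"):
            status = "ok" if can_compile(toolkit, sm) else "invalid --gpu-architecture"
            print(f"sm_{sm} with CUDA {toolkit}: {status}")
```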
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
"My i7 machine with an RTX 2070 and my i3 machine with a GTX 1060 were likewise restarted with GPUGrid, and their tasks have all repeatedly failed within 10-13 seconds of starting."
Those two machines are running the same tasks - with 'Bandit' in their name - that are running successfully on other machines. So the problem lies in your machine setup, not in the tasks themselves. And you're not alone - a lot of other machines have the same problem. Your tasks are failing with 'app exit status: 0xc0000135' - in all likelihood, you are missing a Microsoft runtime DLL file. Please refer to message 57353. |
|
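For what it's worth, 0xc0000135 is the Windows STATUS_DLL_NOT_FOUND code, which fits the missing-runtime diagnosis above. Here is a small Python sketch a Windows volunteer could adapt to look for the DLL; the name vcruntime140_1.dll and the search locations are the usual suspects rather than anything confirmed by the project:

```python
# Look for the VC++ runtime DLL in the places Windows normally searches.
# Purely illustrative; adjust the DLL name or directories to your setup.
import os
from pathlib import Path

DLL_NAME = "vcruntime140_1.dll"

def find_dll(name: str = DLL_NAME):
    candidates = [Path(os.environ.get("SystemRoot", r"C:\Windows")) / "System32"]
    candidates += [Path(p) for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    return [d / name for d in candidates if (d / name).is_file()]

if __name__ == "__main__":
    hits = find_dll()
    if hits:
        print("found:")
        for h in hits:
            print(" ", h)
    else:
        print(f"{DLL_NAME} not found - installing the VC++ 2015-2019 redistributable should fix 0xc0000135")
```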
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
"Excerpt from stderr: ..."
It's not the tasks that fail on Ampere cards - it's the CUDA 101 app. The 1121 app should be OK. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"Excerpt from stderr: ..."
This was exactly the problem when the previous batch of WUs started: CUDA101 tasks could not be crunched on Ampere cards, while 1121 WUs went well. Then someone here on the forum posted instructions on how to change the content of a specific file in - I guess - the GPUGRID project folder (I forget which file it was). I followed those instructions, and from then on my RTX 3070 cards were crunching both CUDA versions. Since this no longer works with the current batch, I suppose something must be different about these new WUs. From what I remember, about half of the WUs I crunched over several weeks were 101, and about the other half were 1121. Of course, it is rather impractical to just try downloading task after task and hope that a 1121 shows up at some point. As is known, after a number of unsuccessful WUs, downloads of new ones are blocked for a day. I would have expected the GPUGRID people to have fixed this specific problem in the meantime, which obviously is not the case. So they keep blocking an increasing number of hosts, which to me does not make any sense at all :-( |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
I now remember: it was the Conda-pack.zip... file whose content had to be changed. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
"I would have expected the GPUGRID people to have fixed this specific problem in the meantime, which obviously is not the case. So they keep blocking an increasing number of hosts, which to me does not make any sense at all :-("
I agree completely. But since the project doesn't seem to be (effectively) learning the lessons from previous mistakes, the best we can do is to perform the analysis for them, draw attention to the precise causes, and do what we can to ensure that at least some scientific research is completed successfully. Just burning up tasks with failures, until the maximum error limit for the WU is reached, doesn't help anyone.
The file you need to change is acemd3.exe - it can be found in your current conda-pack.zip.xxxxxx file, in the GPUGrid project folder. Check whether a newer version of that file has been downloaded since you last modified it. Mine is currently dated 05 October - later than our last major discussion on the subject.
That zip pack should also contain vcruntime140_1.dll, but I don't know if simply placing it in the zip would help - it might need to be specifically added to the unpacking instruction list as well. |
|
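For anyone who wants to see what their own copy contains, here is a minimal Python sketch that lists the conda-pack archives in the project folder and reports whether acemd3.exe and vcruntime140_1.dll are inside. The BOINC data path and the file-name pattern are assumptions based on the posts above; adjust them to your own installation:

```python
# Inspect the GPUGrid conda-pack archive(s) for the files discussed above.
# The BOINC data path and glob pattern are assumptions; point this at your
# own projects folder (on Linux typically /var/lib/boinc-client/projects/...).
import zipfile
from pathlib import Path

PROJECT_DIR = Path(r"C:\ProgramData\BOINC\projects\www.gpugrid.net")  # typical Windows default

def inspect_conda_packs(project_dir: Path = PROJECT_DIR):
    for pack in sorted(project_dir.glob("*onda-pack.zip*")):
        with zipfile.ZipFile(pack) as zf:
            names = [n.lower() for n in zf.namelist()]
        print(pack.name)
        for wanted in ("acemd3.exe", "vcruntime140_1.dll"):
            present = any(n.endswith(wanted) for n in names)
            print(f"  {wanted}: {'present' if present else 'MISSING'}")

if __name__ == "__main__":
    inspect_conda_packs()
```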
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
The Conda-pack file currently in the GPUGRID folder is named "Conda-pack.zip.1d5...358" and is dated 4.10.21. However, the file whose content I had changed was called "conda-pack.zip.aeb48...86a", and this file, for some reason, no longer exists in the GPUGRID folder. Maybe it was deleted this morning when GPUGRID was updated while I was retrieving new tasks. I am sure that both files were in the GPUGRID folder before. In fact, I remember that the change I made in October was to overwrite the content of "conda-pack.zip.aeb48...86a" with the content of "Conda-pack.zip.1d5...358". What is new since this morning, among other files, is "windows_x86_64_cuda101.zip.c0d...b21", dated 27.11. (the date of this morning's download). Whether a similar file was in the GPUGRID folder before and may have been deleted this morning, I do not know. So what I could do is copy "conda-pack.zip.aeb48...86a", of which I had saved a copy in the "documents" folder, back to the GPUGRID folder. Whether it helps or not, I will only see after retrying a new task (if it again happens to be CUDA101). Any other suggestions? |