195 (0xc3) EXIT_CHILD_FAILED

Message boards : Number crunching : 195 (0xc3) EXIT_CHILD_FAILED

ServicEnginIC
Message 57644 - Posted: 24 Oct 2021, 16:25:46 UTC - in response to Message 57642.  

Faulty data - a bad task. Not your fault.

+1

Windows Operating System:
EXCEPTIONAL CONDITION: src\mdio\bincoord.c, line 193: "nelems != 1"

Linux Operating System:
EXCEPTIONAL CONDITION: /home/user/conda/conda-bld/acemd3_1618916459379/work/src/mdio/bincoord.c, line 193: "nelems != 1"

When this warning appears, it usually indicates a definition error in the task's initial parameters.
The task was badly constructed at origin, and it will fail on every host that receives it.
Looking at Work Unit #27084895, from which this task derives, it has previously failed on several other hosts, under both Windows and Linux operating systems.
The destiny of a Work Unit like this is to be cancelled once the maximum allowed number of failed tasks (7) is reached...
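The cutoff described above can be sketched as a simplified version of BOINC's per-workunit error limit. This is an assumption-laden illustration, not project code: the limit of 7 comes from this thread, and the real scheduler logic is more involved.

```python
# Simplified sketch of BOINC's per-workunit error cutoff (assumed limit of 7,
# per this thread; the real server logic tracks more states than this).
def workunit_status(error_count: int, max_error_results: int = 7) -> str:
    """Decide whether a workunit is resent or cancelled after a failed task."""
    if error_count >= max_error_results:
        return "error (too many failed results)"
    return "resend to another host"

print(workunit_status(3))  # resend to another host
print(workunit_status(7))  # error (too many failed results)
```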
Greger
Message 57865 - Posted: 24 Nov 2021, 17:42:50 UTC

A new task from batch e1s34_0-ADRIA_GPCR2021_APJ_b0-0-1-RND2256 appears to fail at startup with a Not-A-Number (NaN) coordinate:

ACEMD failed:
    Particle coordinate is nan


Errorcode: process exited with code 195 (0xc3, -61)

WU: https://www.gpugrid.net/workunit.php?wuid=27087091

Richard Haselgrove
Message 57866 - Posted: 24 Nov 2021, 17:59:04 UTC
Last modified: 24 Nov 2021, 18:38:56 UTC

Same here with two different Linux machines:

e1s25_0-ADRIA_GPCR2021_APJ_b0-0-1-RND8388_5
e1s174_0-ADRIA_GPCR2021_APJ_b0-0-1-RND2767_0

The first task has failed for multiple users: I was the first to attempt the second task, but it looks to have gone the same way.

And the same error under Windows:

e1s401_0-ADRIA_GPCR2021_APJ_b0-0-1-RND6370_3
curiously_indifferent
Message 57867 - Posted: 24 Nov 2021, 18:16:08 UTC

Same here - RTX2060 card. Fails after 10 seconds.

https://www.gpugrid.net/result.php?resultid=32662557

https://www.gpugrid.net/result.php?resultid=32663017

https://www.gpugrid.net/result.php?resultid=32663196

https://www.gpugrid.net/result.php?resultid=32663734
ZUSE
Message 57868 - Posted: 24 Nov 2021, 18:45:20 UTC

I have the same problem with 4x Nvidia T600.
Keith Myers
Message 57869 - Posted: 24 Nov 2021, 20:10:02 UTC - in response to Message 57868.  

Looks like we all have the same issue with NaN. I've burned through a couple of dozen today, wasting download cap.
bozz4science
Message 57870 - Posted: 24 Nov 2021, 23:43:48 UTC

Same on my end. Had almost 80 tasks on a single machine today that all failed with said NaN error. Why were so many faulty tasks sent out in the first place?
Keith Myers
Message 57871 - Posted: 25 Nov 2021, 2:00:29 UTC

Thrown away 150 bad tasks today.
ServicEnginIC
Message 57872 - Posted: 25 Nov 2021, 6:40:41 UTC

Zero valid tasks returned overnight; it's clearly a faultily constructed batch.
At least it seems we'll have new work to crunch once it is corrected.
joeybuddy96
Message 57873 - Posted: 25 Nov 2021, 21:29:53 UTC

I got 16 errored tasks: 8 cuda101 and 8 cuda1121, with none completed successfully. The errors were "Particle coordinate is nan" and "The requested CUDA device could not be loaded".
Keith Myers
Message 57875 - Posted: 26 Nov 2021, 18:46:48 UTC

Looks like they are sending out corrected tasks now from that last batch.

I have several running correctly now.
ServicEnginIC
Message 57876 - Posted: 26 Nov 2021, 22:25:15 UTC - in response to Message 57875.  

Right,
Problems seem to have been solved in this new batch of ADRIA tasks.
I also estimate that this batch is considerably lighter than preceding ones, and my GTX 1660 Ti will hit the full bonus with its current task.
Billy Ewell 1931
Message 57877 - Posted: 27 Nov 2021, 5:01:18 UTC - in response to Message 57876.  

Only partial success!
My Xeon-powered machine with a GTX 1060 was restarted about 2 hours ago and is performing without a fault.

My i7 machine with an RTX 2070 and my i3 machine with a GTX 1060 were likewise restarted on GPUGrid, and several tasks have repeatedly failed within 10-13 seconds of starting.

I changed all drivers without any impact.

I suppose we let our boxes run until all the tasks that are going to fail have done so, and the few successful ones are recorded as winners.
Erich56
Message 57878 - Posted: 27 Nov 2021, 9:04:18 UTC

here the tasks failed within a few seconds:

https://www.gpugrid.net/result.php?resultid=32703425
https://www.gpugrid.net/result.php?resultid=32698761

excerpt from stderr:

ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)


so I guess these new WUs are not running on Ampere cards (here: 2 x RTX3070).

However, before the WUs were crunched perfectly with these cards, regardless of the cuda version.

Too bad :-(
Richard Haselgrove
Message 57879 - Posted: 27 Nov 2021, 9:05:10 UTC - in response to Message 57877.  

My i7 machine with an RTX 2070 and my i3 machine with a GTX 1060 were likewise restarted on GPUGrid, and several tasks have repeatedly failed within 10-13 seconds of starting.

Those two machines are running the same tasks - with 'Bandit' in their name - that are running successfully on other machines.

So the problem lies in your machine setup, not in the tasks themselves. A lot of other machines have the same problem - you're not alone.

Your tasks are failing with 'app exit status: 0xc0000135' - in all likelihood, you are missing a Microsoft runtime DLL file. Please refer to message 57353.
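A quick way to check for the suspect runtime from the affected host is a sketch like the following. It assumes `vcruntime140_1` is the DLL implied by the thread; `ctypes.util.find_library` simply asks whether the system loader can locate it, and on non-Windows machines it will naturally find nothing.

```python
import ctypes.util

def runtime_present(name: str = "vcruntime140_1") -> bool:
    """True if the named runtime library can be located by the system loader."""
    return ctypes.util.find_library(name) is not None

if not runtime_present():
    # On Windows this suggests installing the Microsoft VC++ redistributable.
    print("vcruntime140_1 not found by the loader")
```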
Richard Haselgrove
Message 57880 - Posted: 27 Nov 2021, 9:08:06 UTC - in response to Message 57878.  

excerpt from stderr:

ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)


so I guess these new WUs are not running on Ampere cards (here: 2 x RTX3070).

It's not the tasks that fail on Ampere cards - it's the CUDA 101 app. The 1121 app should be OK.
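As far as I understand it, the nvrtc shipped with CUDA 10.1 can only target compute capabilities up to 7.5 (Turing), while Ampere cards report 8.0/8.6, which the CUDA 11.2 build does support. A rough sketch of that mismatch (the version-to-capability mapping here is my assumption, not project documentation):

```python
# Assumed highest compute capability each app build's nvrtc can target
# (simplified: CUDA 10.1 tops out at sm_75, CUDA 11.2 supports sm_86).
MAX_TARGET_SM = {"cuda101": 75, "cuda1121": 86}

def app_can_compile_for(card_sm: int, app: str) -> bool:
    """Whether nvrtc in the given app build accepts the card's architecture."""
    return card_sm <= MAX_TARGET_SM[app]

# An RTX 3070 (Ampere) reports compute capability 8.6:
print(app_can_compile_for(86, "cuda101"))   # False -> invalid --gpu-architecture
print(app_can_compile_for(86, "cuda1121"))  # True
```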
Erich56
Message 57881 - Posted: 27 Nov 2021, 9:31:07 UTC - in response to Message 57880.  

excerpt from stderr:

ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)


so I guess these new WUs are not running on Ampere cards (here: 2 x RTX3070).

It's not the tasks that fail on Ampere cards - it's the CUDA 101 app. The 1121 app should be OK.


This was exactly the problem when the previous batch of WUs started: CUDA101 tasks could not be crunched on Ampere cards, while 1121 WUs went well.

Then someone here on the forum posted instructions on how to change the contents of a specific file in the - I guess - GPUGRID project folder (I forget which file it was). I followed those instructions, and from then on my RTX3070 cards were crunching both CUDA versions.

Since this is no longer the case with the current batch, I suppose something must be different about these new WUs.

From what I remember, about half of the WUs I crunched over several weeks were 101, and about the other half were 1121.

Of course, it is rather impractical to just try downloading task after task, hoping that a 1121 will show up at some point. As is known, after a number of unsuccessful WUs, downloads of new ones are blocked for a day.

I would have expected the GPUGRID people to have repaired this specific problem in the meantime, which obviously is not the case. So they keep blocking an increasing number of hosts, which makes no sense to me at all :-(
Erich56
Message 57882 - Posted: 27 Nov 2021, 9:38:01 UTC - in response to Message 57881.  


Then someone here on the forum posted instructions on how to change the contents of a specific file in the - I guess - GPUGRID project folder (I forget which file it was). I followed those instructions, and from then on my RTX3070 cards were crunching both CUDA versions.

I now remember: it was the Conda-pack.zip... file whose contents had to be changed.
Richard Haselgrove
Message 57883 - Posted: 27 Nov 2021, 9:53:27 UTC - in response to Message 57881.  

I would have expected the GPUGRID people to have repaired this specific problem in the meantime, which obviously is not the case. So they keep blocking an increasing number of hosts, which makes no sense to me at all :-(

I agree completely. But since the project doesn't seem to be (effectively) learning the lessons from previous mistakes, the best we can do is to perform the analysis for them, draw attention to the precise causes, and do what we can to ensure that at least some scientific research is completed successfully. Just burning up tasks with failures, until the maximum error limit for the WU is reached, doesn't help anyone.

The file you need to change is acemd3.exe - it can be found in your current conda-pack.zip.xxxxxx file, in the GPUGrid project folder. Check whether a newer version of that file has been downloaded since you last modified it. Mine is currently dated 05 October - later than our last major discussion on the subject.

That zip pack should also contain vcruntime140_1.dll, but I don't know whether simply placing it in the zip would help - it might need to be specifically added to the unpacking instruction list as well.
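To see what a given conda-pack zip actually contains before editing it, something like the following works. This is a hedged sketch: the file names are the ones mentioned above, and the example path is hypothetical.

```python
import zipfile

def pack_contents(zip_path, wanted=("acemd3.exe", "vcruntime140_1.dll")):
    """Report which of the wanted files appear inside the pack (by basename)."""
    with zipfile.ZipFile(zip_path) as zf:
        names = {entry.split("/")[-1] for entry in zf.namelist()}
    return {w: (w in names) for w in wanted}

# Hypothetical location inside the BOINC data directory:
# print(pack_contents("projects/www.gpugrid.net/conda-pack.zip.1d5...358"))
```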
Erich56
Message 57884 - Posted: 27 Nov 2021, 10:54:44 UTC - in response to Message 57883.  

The conda-pack file currently in the GPUGRID folder is named "Conda-pack.zip.1d5...358" and is dated 4.10.21.
However, the file whose contents I had changed was called "conda-pack.zip.aeb48...86a", and this file, for some reason, no longer exists in the GPUGRID folder.

Maybe it was deleted when GPUGRID was updated this morning while I was retrieving new tasks.
I am sure that both files were in the GPUGRID folder before.

In fact, I remember that the change I made in October was to overwrite the contents of "conda-pack.zip.aeb48...86a" with the contents of "Conda-pack.zip.1d5...358".

What is new since this morning, among other files, is "windows_x86_64_cuda101.zip.c0d...b21", dated 27.11 (the date of this morning's download).
Whether a similar file was in the GPUGRID folder before and may have been deleted this morning, I do not know.

So what I could do is copy the "conda-pack.zip.aeb48...86a" file, of which I had saved a copy in the "documents" folder, back to the GPUGRID folder. Whether it helps or not, I will only see after trying a new task (if it happens to be CUDA101 again).

Any other suggestions?

©2025 Universitat Pompeu Fabra