195 (0xc3) EXIT_CHILD_FAILED

Profile Michael H.W. Weber

Joined: 9 Feb 16
Posts: 78
Credit: 656,229,684
RAC: 0
Level
Lys
Message 57480 - Posted: 5 Oct 2021, 8:42:33 UTC
Last modified: 5 Oct 2021, 8:55:54 UTC

My RTX 3080 machine completed a first task successfully. Afterwards, two more tasks crashed with a 195 (0xc3) EXIT_CHILD_FAILED error message and the following log (after only a few seconds of run time):

Name	e2s184_e1s254p0f959-ADRIA_AdB_KIXCMYB_HIP-0-2-RND9959_9
Work unit	27080023
Created	4 Oct 2021 | 9:59:05 UTC
Sent	4 Oct 2021 | 10:48:16 UTC
Received	4 Oct 2021 | 22:07:40 UTC
Server state	Completed
Outcome	Computation error
Client state	Computation error
Exit status	195 (0xc3) EXIT_CHILD_FAILED
Computer ID	584499
Report deadline	9 Oct 2021 | 10:48:16 UTC
Run time	25.51
CPU time	0.00
Validation state	Invalid
Credit	0.00
Application version	New version of ACEMD v2.18 (cuda101)

Stderr output

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
00:05:49 (30732): wrapper (7.9.26016): starting
00:05:49 (30732): wrapper: running bin/acemd3.exe (--boinc --device 0)
ACEMD failed:
    Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)

00:05:59 (30732): bin/acemd3.exe exited; CPU time 0.000000
00:05:59 (30732): app exit status: 0x1
00:05:59 (30732): called boinc_finish(195)
0 bytes in 0 Free Blocks.
186 bytes in 4 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 239849 bytes.
Dumping objects ->
{323256} normal block at 0x000001B7D23D3BC0, 85 bytes long.
 Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 
..\api\boinc_api.cpp(309) : {323253} normal block at 0x000001B7D23D4940, 8 bytes long.
 Data: <  1&#210;&#183;   > 00 00 31 D2 B7 01 00 00 
{322608} normal block at 0x000001B7D23D3C60, 85 bytes long.
 Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 
{321994} normal block at 0x000001B7D23D46C0, 8 bytes long.
 Data: <@&#202;?&#210;&#183;   > 40 CA 3F D2 B7 01 00 00 
..\zip\boinc_zip.cpp(122) : {146} normal block at 0x000001B7D23D3090, 260 bytes long.
 Data: <                > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
{133} normal block at 0x000001B7D23D48A0, 16 bytes long.
 Data: < &#248;<&#210;&#183;           > 10 F8 3C D2 B7 01 00 00 00 00 00 00 00 00 00 00 
{132} normal block at 0x000001B7D23CF810, 40 bytes long.
 Data: <&#160;H=&#210;&#183;   conda-pa> A0 48 3D D2 B7 01 00 00 63 6F 6E 64 61 2D 70 61 
{125} normal block at 0x000001B7D23CF340, 48 bytes long.
 Data: <--boinc --device> 2D 2D 62 6F 69 6E 63 20 2D 2D 64 65 76 69 63 65 
{124} normal block at 0x000001B7D23D49E0, 16 bytes long.
 Data: <XN=&#210;&#183;           > 58 4E 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 
{123} normal block at 0x000001B7D23D4C60, 16 bytes long.
 Data: <0N=&#210;&#183;           > 30 4E 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 
{122} normal block at 0x000001B7D23D4850, 16 bytes long.
 Data: < N=&#210;&#183;           > 08 4E 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 
{121} normal block at 0x000001B7D23D3DB0, 16 bytes long.
 Data: <&#224;M=&#210;&#183;           > E0 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 
{120} normal block at 0x000001B7D23D4030, 16 bytes long.
 Data: <&#184;M=&#210;&#183;           > B8 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 
{119} normal block at 0x000001B7D23D4080, 16 bytes long.
 Data: < M=&#210;&#183;           > 90 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 
{118} normal block at 0x000001B7D23D4120, 16 bytes long.
 Data: <pM=&#210;&#183;           > 70 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 
{117} normal block at 0x000001B7D23D4990, 16 bytes long.
 Data: <HM=&#210;&#183;           > 48 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 
{116} normal block at 0x000001B7D23D42B0, 16 bytes long.
 Data: < M=&#210;&#183;           > 20 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 
{115} normal block at 0x000001B7D23D4D20, 496 bytes long.
 Data: <&#176;B=&#210;&#183;   bin/acem> B0 42 3D D2 B7 01 00 00 62 69 6E 2F 61 63 65 6D 
{65} normal block at 0x000001B7D23C2D80, 16 bytes long.
 Data: < &#234;&#151;{&#247;           > 80 EA 97 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{64} normal block at 0x000001B7D23C2B50, 16 bytes long.
 Data: <@&#233;&#151;{&#247;           > 40 E9 97 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{63} normal block at 0x000001B7D23C2B00, 16 bytes long.
 Data: <&#248;W&#148;{&#247;           > F8 57 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{62} normal block at 0x000001B7D23C2AB0, 16 bytes long.
 Data: <&#216;W&#148;{&#247;           > D8 57 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{61} normal block at 0x000001B7D23C3370, 16 bytes long.
 Data: <P &#148;{&#247;           > 50 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{60} normal block at 0x000001B7D23C2A60, 16 bytes long.
 Data: <0 &#148;{&#247;           > 30 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{59} normal block at 0x000001B7D23C3500, 16 bytes long.
 Data: <&#224; &#148;{&#247;           > E0 02 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{58} normal block at 0x000001B7D23C3640, 16 bytes long.
 Data: <  &#148;{&#247;           > 10 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{57} normal block at 0x000001B7D23C2A10, 16 bytes long.
 Data: <p &#148;{&#247;           > 70 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{56} normal block at 0x000001B7D23C3870, 16 bytes long.
 Data: < &#192;&#146;{&#247;           > 18 C0 92 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
Object dump complete.

</stderr_txt>
]]>

Name	e4s109_e1s39p0f745-ADRIA_AdB_KIXCMYB_HIP-1-2-RND2493_0
Work unit	27081645
Created	4 Oct 2021 | 22:12:32 UTC
Sent	4 Oct 2021 | 22:14:12 UTC
Received	4 Oct 2021 | 22:16:12 UTC
Server state	Completed
Outcome	Computation error
Client state	Computation error
Exit status	195 (0xc3) EXIT_CHILD_FAILED
Computer ID	584499
Report deadline	9 Oct 2021 | 22:14:12 UTC
Run time	7.26
CPU time	0.00
Validation state	Invalid
Credit	0.00
Application version	New version of ACEMD v2.18 (cuda101)
Stderr output

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
00:14:24 (14320): wrapper (7.9.26016): starting
00:14:24 (14320): wrapper: running bin/acemd3.exe (--boinc --device 0)
ACEMD failed:
    Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)

00:14:26 (14320): bin/acemd3.exe exited; CPU time 0.000000
00:14:26 (14320): app exit status: 0x1
00:14:26 (14320): called boinc_finish(195)
0 bytes in 0 Free Blocks.
186 bytes in 4 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 241603 bytes.
Dumping objects ->
{323256} normal block at 0x000002061D1C3BC0, 85 bytes long.
 Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 
..\api\boinc_api.cpp(309) : {323253} normal block at 0x000002061D1C43F0, 8 bytes long.
 Data: <        > 00 00 02 1D 06 02 00 00 
{322608} normal block at 0x000002061D1C3C60, 85 bytes long.
 Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 
{321994} normal block at 0x000002061D1C42B0, 8 bytes long.
 Data: <@&#202;      > 40 CA 1E 1D 06 02 00 00 
..\zip\boinc_zip.cpp(122) : {146} normal block at 0x000002061D1C3090, 260 bytes long.
 Data: <                > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
{133} normal block at 0x000002061D1C3EF0, 16 bytes long.
 Data: <&#208;&#242;              > D0 F2 1B 1D 06 02 00 00 00 00 00 00 00 00 00 00 
{132} normal block at 0x000002061D1BF2D0, 40 bytes long.
 Data: <&#240;>      conda-pa> F0 3E 1C 1D 06 02 00 00 63 6F 6E 64 61 2D 70 61 
{125} normal block at 0x000002061D1BF180, 48 bytes long.
 Data: <--boinc --device> 2D 2D 62 6F 69 6E 63 20 2D 2D 64 65 76 69 63 65 
{124} normal block at 0x000002061D1C4940, 16 bytes long.
 Data: <XN              > 58 4E 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 
{123} normal block at 0x000002061D1C4490, 16 bytes long.
 Data: <0N              > 30 4E 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 
{122} normal block at 0x000002061D1C4800, 16 bytes long.
 Data: < N              > 08 4E 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 
{121} normal block at 0x000002061D1C47B0, 16 bytes long.
 Data: <&#224;M              > E0 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 
{120} normal block at 0x000002061D1C48A0, 16 bytes long.
 Data: <&#184;M              > B8 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 
{119} normal block at 0x000002061D1C4710, 16 bytes long.
 Data: < M              > 90 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 
{118} normal block at 0x000002061D1C48F0, 16 bytes long.
 Data: <pM              > 70 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 
{117} normal block at 0x000002061D1C4990, 16 bytes long.
 Data: <HM              > 48 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 
{116} normal block at 0x000002061D1C4A80, 16 bytes long.
 Data: < M              > 20 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 
{115} normal block at 0x000002061D1C4D20, 496 bytes long.
 Data: < J      bin/acem> 80 4A 1C 1D 06 02 00 00 62 69 6E 2F 61 63 65 6D 
{65} normal block at 0x000002061D1B36E0, 16 bytes long.
 Data: < &#234;&#151;{&#247;           > 80 EA 97 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{64} normal block at 0x000002061D1B3410, 16 bytes long.
 Data: <@&#233;&#151;{&#247;           > 40 E9 97 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{63} normal block at 0x000002061D1B3820, 16 bytes long.
 Data: <&#248;W&#148;{&#247;           > F8 57 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{62} normal block at 0x000002061D1B33C0, 16 bytes long.
 Data: <&#216;W&#148;{&#247;           > D8 57 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{61} normal block at 0x000002061D1B3190, 16 bytes long.
 Data: <P &#148;{&#247;           > 50 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{60} normal block at 0x000002061D1B3000, 16 bytes long.
 Data: <0 &#148;{&#247;           > 30 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{59} normal block at 0x000002061D1B2FB0, 16 bytes long.
 Data: <&#224; &#148;{&#247;           > E0 02 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{58} normal block at 0x000002061D1B3320, 16 bytes long.
 Data: <  &#148;{&#247;           > 10 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{57} normal block at 0x000002061D1B2F60, 16 bytes long.
 Data: <p &#148;{&#247;           > 70 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
{56} normal block at 0x000002061D1B3140, 16 bytes long.
 Data: < &#192;&#146;{&#247;           > 18 C0 92 7B F7 7F 00 00 00 00 00 00 00 00 00 00 
Object dump complete.

</stderr_txt>
]]>

Any idea what is going on?

What is particularly annoying is that after these two consecutive crashes, it took the GPUGRID server 4 hours to send out a new task (which is now in progress), leaving my machine uselessly idle for hours.

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
ID: 57480
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 57481 - Posted: 5 Oct 2021, 8:59:03 UTC - in response to Message 57480.  

Your computers are hidden, so I can't be certain, but your problem seems to be

Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)

There are two versions of the new GPUGrid application: cuda1121 and cuda101. You will be able to see in your task list which worked, and which didn't work.

Despite some posts to the contrary, the general consensus is that cuda1121 works on an RTX 3080, and cuda101 doesn't. And despite an assurance from the project that they have prevented the cuda101 application being sent to RTX cards, clearly they haven't.

There's nothing you can do to prevent the wrong application being sent to your card: just take comfort from the fact that cuda101 tasks will fail very quickly, and you won't waste computing power on the tasks. The only hit you and the project are taking is the waste of bandwidth downloading the inappropriate tasks.
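
For anyone wondering where that nvrtc line comes from: the app compiles its GPU kernels at run time with NVRTC, and an older NVRTC simply doesn't recognise the Ampere architecture. A rough sketch of my own (not the project's code) that reproduces the same class of failure, assuming the CUDA 10.1 nvrtc library is the one that gets loaded:

/* Illustration only: ask NVRTC to compile a trivial kernel for compute_86
 * (RTX 30xx). NVRTC from CUDA 11.1 onwards accepts the option; the CUDA 10.1
 * library rejects it, which is the kind of error seen in the stderr above.
 * Build with: nvcc nvrtc_check.c -lnvrtc -o nvrtc_check */
#include <stdio.h>
#include <nvrtc.h>

int main(void)
{
    const char *src = "extern \"C\" __global__ void noop(void) {}";
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "noop.cu", 0, NULL, NULL);

    /* compute_86 is the Ampere / RTX 30xx architecture */
    const char *opts[] = { "--gpu-architecture=compute_86" };
    nvrtcResult rc = nvrtcCompileProgram(prog, 1, opts);
    printf("nvrtc: %s\n", nvrtcGetErrorString(rc));

    nvrtcDestroyProgram(&prog);
    return rc == NVRTC_SUCCESS ? 0 : 1;
}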
ID: 57481
Profile Michael H.W. Weber

Joined: 9 Feb 16
Posts: 78
Credit: 656,229,684
RAC: 0
Level
Lys
Message 57482 - Posted: 5 Oct 2021, 9:13:09 UTC

Thank you Richard - only cuda1121 works for me.

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
ID: 57482
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 57484 - Posted: 5 Oct 2021, 11:43:49 UTC - in response to Message 57481.  

And despite an assurance from the project that they have prevented the cuda101 application being sent to RTX cards, clearly they haven't.

I endorse this statement. I have been sent cuda101 tasks for my two RTX 3070s, the latest one this morning.

There's nothing you can do to prevent the wrong application being sent to your card: just take comfort from the fact that cuda101 tasks will fail very quickly, and you won't waste computing power on the tasks. The only hit you and the project are taking is the waste of bandwidth downloading the inappropriate tasks.

However, there is more to it: if one deletes erroneously downloaded cuda101 tasks from the BOINC task list too often, one will not receive any further tasks for the next 24 hours.

Hence, this problem should be solved by the project team ASAP!
ID: 57484
bozz4science

Joined: 22 May 20
Posts: 110
Credit: 115,525,136
RAC: 0
Level
Cys
Message 57486 - Posted: 5 Oct 2021, 13:02:45 UTC

I also don't quite understand what information determines the app version to be sent out. This task, for example, was sent out 6 times before my host caught it. Once it went out as the 1121 app version; all the others were sent as the 101 app version. It failed on all previous hosts and went through 3 Ampere cards (3060 Ti, 3070 & 3090). This seems to be quite an annoyance for anyone with the latest cards. And older cards take some serious chewing on the new tasks; mine takes a little over 31 hrs. This project could be working much more efficiently if it were able to fully capture the potential of these RTX 3000 series cards.

ID: 57486
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 57487 - Posted: 5 Oct 2021, 13:30:47 UTC - in response to Message 57486.  

But there are two different failure modes - three of each: three missing DLLs (probably vcruntime140_1), and three with the wrong architecture (cuda101 on RTX).

You need all three to align - the right version, on the right architecture, with the right software support - before it'll run. One out of eight is about the right probability.
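(Roughly speaking: if each of those three conditions is treated as an even coin flip, the chance of all of them lining up is 0.5 × 0.5 × 0.5 = 0.125, i.e. about one attempt in eight.)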
ID: 57487
bozz4science

Joined: 22 May 20
Posts: 110
Credit: 115,525,136
RAC: 0
Level
Cys
Message 57488 - Posted: 5 Oct 2021, 13:36:57 UTC

That sounds about right. I only meant to highlight that the Ampere cards all failed, obviously because the wrong version was sent to those hosts, while somehow older-generation cards get the 1121 app version instead on some occasions.

ID: 57488
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Message 57489 - Posted: 5 Oct 2021, 13:41:17 UTC

All the more reason to just retire the cuda101 app, and force everyone to update their drivers to use the cuda1121 app
ID: 57489
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 57490 - Posted: 5 Oct 2021, 13:45:41 UTC - in response to Message 57489.  

All the more reason to just retire the cuda101 app, and force everyone to update their drivers to use the cuda1121 app

I disagree. People should be allowed their own choice of driver (you don't know why they've kept an older one), but the project should manage the minimum limits better.
ID: 57490
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Message 57491 - Posted: 5 Oct 2021, 13:56:55 UTC - in response to Message 57490.  
Last modified: 5 Oct 2021, 14:32:33 UTC

IMO, the "choice" of driver in the ranges of CUDA101 and CUDA1121 compatibility will be arbitrary. the list of supported products is exactly the same so it's not like some older GPU wont be supported anymore with the newer driver. Nvidia drivers are very stable and it's pretty rare that a new driver fully breaks something. CUDA101 requires driver 418.xx, CUDA1121 requires driver 461.xx, there's not a huge difference here. but even still there's a large range of "choice" between the minimum driver required for cuda1121 and what is the current latest driver release. they don't need to be bleeding edge. CUDA 11.2 was introduced almost a year ago, and it's currently up to CUDA 11.4 Update 2.

If you have some software issue that actively prevents installing a new driver, then fix your software issues.

There's really no good reason not to update if you're already on hardware and drivers new enough to support the CUDA101 app. It's not a huge change to get more recent drivers, and the observed negative impacts from the project maintaining two app versions far outweigh the impact of requiring a user to update their drivers.
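
If you want to check what your installed driver actually supports versus what an app was built against, here's a quick sketch of my own using the standard CUDA runtime API (nvidia-smi also reports the driver's supported CUDA version in its header):

/* Prints the highest CUDA version the installed driver supports and the CUDA
 * runtime version this program was built with. For a cuda1121 app, the driver
 * number needs to come out as 11.2 or higher (461.xx on Windows); the cuda101
 * app only needs 10.1 (418.xx). Build with: nvcc cuda_check.c -o cuda_check */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);    /* e.g. 11020 means CUDA 11.2 */
    cudaRuntimeGetVersion(&runtime);
    printf("driver supports CUDA %d.%d, built against CUDA %d.%d\n",
           driver / 1000, (driver % 1000) / 10,
           runtime / 1000, (runtime % 1000) / 10);
    return 0;
}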
ID: 57491
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 57492 - Posted: 5 Oct 2021, 13:59:51 UTC - in response to Message 57491.  

it's not a huge change to get more recent drivers, and the observed negative impacts from the project maintaining two app versions far outweigh the impact of requiring a user to update their drivers.

It can be if your computer is managed by your employer's domain controller, and group policy prevents you updating it yourself. Just an example.
ID: 57492
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Message 57493 - Posted: 5 Oct 2021, 14:04:46 UTC - in response to Message 57492.  

it's not a huge change to get more recent drivers, and the observed negative impacts from the project maintaining two app versions far outweigh the impact of requiring a user to update their drivers.

It can be if your computer is managed by your employer's domain controller, and group policy prevents you updating it yourself. Just an example.


In this case, it's MORE likely that these systems will (*should*) be updated to recent drivers, as any competent sysadmin will (*should*) be keeping everything current in terms of security patches, and there has been a stronger push from Nvidia in this regard lately.
ID: 57493
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 57494 - Posted: 5 Oct 2021, 14:15:13 UTC

I think we're far enough off topic. Let's leave it there.
ID: 57494
Profile Michael H.W. Weber

Joined: 9 Feb 16
Posts: 78
Credit: 656,229,684
RAC: 0
Level
Lys
Message 57544 - Posted: 8 Oct 2021, 12:13:48 UTC

Well, this project's inability to deliver the proper GPU app/plan class to the corresponding GPU systems simply results in a massive loss of overall project performance: due to the repeated "compute errors", the clients do not receive further tasks for a while and idle around for hours. I figure that this way, instead of two tasks per day, I can deliver only one.
Well, not my problem. A second project is occupying the idle time now.

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
ID: 57544
Profile bcavnaugh

Joined: 8 Nov 13
Posts: 56
Credit: 1,002,640,163
RAC: 0
Level
Met
Message 57572 - Posted: 11 Oct 2021, 2:26:22 UTC
Last modified: 11 Oct 2021, 2:59:44 UTC

Glad at least one of my hosts is running, but all the others are NOT!
A 2080 on driver 441.20 is running; a 1080 on 431.86 is not running, and another 2080 on 431.86 is not running either.
What NVIDIA driver must I have to run GPUGRID?
As you can see, even with the new or current runtimes it still fails: 14.29.30135.0 is the current VS 2022 version, while the version shipped with the tasks is 14.28.29325.2.
https://live.staticflickr.com/65535/51574059037_5ae789d24d_b.jpg

Update:
For me, driver 441.20 seems to work on all my hosts. Yahoo!
ID: 57572
jjch

Joined: 10 Nov 13
Posts: 101
Credit: 15,773,211,122
RAC: 0
Level
Trp
Message 57575 - Posted: 11 Oct 2021, 4:26:33 UTC

Nvidia driver version 441.20 is a bit old. I am currently running version 471.11 on my Windows hosts.
ID: 57575
Profile Michael H.W. Weber

Joined: 9 Feb 16
Posts: 78
Credit: 656,229,684
RAC: 0
Level
Lys
Message 57598 - Posted: 13 Oct 2021, 10:29:25 UTC - in response to Message 57481.  
Last modified: 13 Oct 2021, 10:31:09 UTC


Despite some posts to the contrary, the general consensus is that cuda1121 works on an RTX 3080, and cuda101 doesn't. And despite an assurance from the project that they have prevented the cuda101 application being sent to RTX cards, clearly they haven't.

I second that.

There's nothing you can do to prevent the wrong application being sent to your card: just take comfort from the fact that cuda101 tasks will fail very quickly, and you won't waste computing power on the tasks.

Unfortunately, that is exactly NOT the case.
Here is an example of a task which ran for almost 15 hours before failing with an error:

Task: https://www.gpugrid.net/result.php?resultid=32653715

Name	e7s106_e5s196p1f1036-ADRIA_AdB_KIXCMYB_HIP-1-2-RND0214_4
Work unit	27082868
Created	11 Oct 2021 | 6:23:21 UTC
Sent	11 Oct 2021 | 6:24:56 UTC
Received	12 Oct 2021 | 9:32:05 UTC
Server state	Completed
Outcome	Computation error
Client state	Computation error
Exit status	195 (0xc3) EXIT_CHILD_FAILED
Computer ID	588794
Report deadline	16 Oct 2021 | 6:24:56 UTC
Run time	53,608.48
CPU time	53,473.36
Validation state	Invalid
Credit	0.00
Application version	New version of ACEMD v2.18 (cuda101)
Stderr output

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
08:25:11 (11620): wrapper (7.9.26016): starting
08:25:11 (11620): wrapper: running bin/acemd3.exe (--boinc --device 0)
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {323250} normal block at 0x000002C574996E70, 8 bytes long.
 Data: <  $v    > 00 00 24 76 C5 02 00 00 
..\lib\diagnostics_win.cpp(417) : {321999} normal block at 0x000002C576431310, 1080 bytes long.
 Data: <                > FC 08 00 00 CD CD CD CD 0C 01 00 00 00 00 00 00 
..\zip\boinc_zip.cpp(122) : {149} normal block at 0x000002C57499EBA0, 260 bytes long.
 Data: <                > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
Object dump complete.
10:58:41 (4808): wrapper (7.9.26016): starting
10:58:41 (4808): wrapper: running bin/acemd3.exe (--boinc --device 0)
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {323286} normal block at 0x000001DA265261F0, 8 bytes long.
 Data: <  J&    > 00 00 4A 26 DA 01 00 00 
..\lib\diagnostics_win.cpp(417) : {322035} normal block at 0x000001DA26591B80, 1080 bytes long.
 Data: <h               > 68 1A 00 00 CD CD CD CD 20 01 00 00 00 00 00 00 
..\zip\boinc_zip.cpp(122) : {149} normal block at 0x000001DA2652EB00, 260 bytes long.
 Data: <                > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
Object dump complete.
11:30:47 (13592): wrapper (7.9.26016): starting
11:30:47 (13592): wrapper: running bin/acemd3.exe (--boinc --device 0)
ACEMD failed:
    Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)

11:30:49 (13592): bin/acemd3.exe exited; CPU time 0.000000
11:30:49 (13592): app exit status: 0x1
11:30:49 (13592): called boinc_finish(195)
0 bytes in 0 Free Blocks.
298 bytes in 4 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 130740 bytes.
Dumping objects ->
{323289} normal block at 0x0000014269701A70, 141 bytes long.
 Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 
..\api\boinc_api.cpp(309) : {323286} normal block at 0x00000142696C62F0, 8 bytes long.
 Data: <  eiB   > 00 00 65 69 42 01 00 00 
{322649} normal block at 0x00000142697020F0, 141 bytes long.
 Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 
{322036} normal block at 0x00000142696C6890, 8 bytes long.
 Data: <p siB   > 70 1B 73 69 42 01 00 00 
..\zip\boinc_zip.cpp(122) : {149} normal block at 0x00000142696CE940, 260 bytes long.
 Data: <                > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
{136} normal block at 0x00000142696C7060, 16 bytes long.
 Data: <p&#171;liB           > 70 AB 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 
{135} normal block at 0x00000142696CAB70, 40 bytes long.
 Data: <`pliB   conda-pa> 60 70 6C 69 42 01 00 00 63 6F 6E 64 61 2D 70 61 
{128} normal block at 0x00000142696CAB00, 48 bytes long.
 Data: <--boinc --device> 2D 2D 62 6F 69 6E 63 20 2D 2D 64 65 76 69 63 65 
{127} normal block at 0x00000142696C6FC0, 16 bytes long.
 Data: <8&#248;liB           > 38 F8 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 
{126} normal block at 0x00000142696C6AC0, 16 bytes long.
 Data: < &#248;liB           > 10 F8 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 
{125} normal block at 0x00000142696C6A70, 16 bytes long.
 Data: <&#232;&#247;liB           > E8 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 
{124} normal block at 0x00000142696C6A20, 16 bytes long.
 Data: <&#192;&#247;liB           > C0 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 
{123} normal block at 0x00000142696C6C00, 16 bytes long.
 Data: < &#247;liB           > 98 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 
{122} normal block at 0x00000142696C6980, 16 bytes long.
 Data: <p&#247;liB           > 70 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 
{121} normal block at 0x00000142696C70B0, 16 bytes long.
 Data: <P&#247;liB           > 50 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 
{120} normal block at 0x00000142696C6930, 16 bytes long.
 Data: <(&#247;liB           > 28 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 
{119} normal block at 0x00000142696C6570, 16 bytes long.
 Data: < &#247;liB           > 00 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 
{118} normal block at 0x00000142696CF700, 496 bytes long.
 Data: <peliB   bin/acem> 70 65 6C 69 42 01 00 00 62 69 6E 2F 61 63 65 6D 
{68} normal block at 0x00000142696C62A0, 16 bytes long.
 Data: < &#234;&#188; &#246;           > 80 EA BC 1A F6 7F 00 00 00 00 00 00 00 00 00 00 
{67} normal block at 0x00000142696C6CF0, 16 bytes long.
 Data: <@&#233;&#188; &#246;           > 40 E9 BC 1A F6 7F 00 00 00 00 00 00 00 00 00 00 
{66} normal block at 0x00000142696C6480, 16 bytes long.
 Data: <&#248;W&#185; &#246;           > F8 57 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 
{65} normal block at 0x00000142696C6520, 16 bytes long.
 Data: <&#216;W&#185; &#246;           > D8 57 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 
{64} normal block at 0x00000142696C6840, 16 bytes long.
 Data: <P &#185; &#246;           > 50 04 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 
{63} normal block at 0x00000142696C6B60, 16 bytes long.
 Data: <0 &#185; &#246;           > 30 04 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 
{62} normal block at 0x00000142696C6390, 16 bytes long.
 Data: <&#224; &#185; &#246;           > E0 02 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 
{61} normal block at 0x00000142696C6250, 16 bytes long.
 Data: <  &#185; &#246;           > 10 04 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 
{60} normal block at 0x00000142696C66B0, 16 bytes long.
 Data: <p &#185; &#246;           > 70 04 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 
{59} normal block at 0x00000142696C67F0, 16 bytes long.
 Data: < &#192;&#183; &#246;           > 18 C0 B7 1A F6 7F 00 00 00 00 00 00 00 00 00 00 
Object dump complete.

</stderr_txt>
]]>

Currently, I have another cuda101 task (wrongly selected by the server for this client) which has now been running for several hours.

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
ID: 57598
Profile Michael H.W. Weber

Joined: 9 Feb 16
Posts: 78
Credit: 656,229,684
RAC: 0
Level
Lys
Message 57605 - Posted: 14 Oct 2021, 12:35:20 UTC - in response to Message 57598.  


Currently, I have another cuda101 task (wrongly selected by the server for this client) which has now been running for several hours.

...it actually caused my machine to crash and was restarting again after the reboot.
So I aborted it.

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
ID: 57605
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 57642 - Posted: 24 Oct 2021, 15:16:50 UTC

I had an "195 (0xc3) EXIT_CHILD_FAILED" case this afternoon, a few seconds after start:

https://www.gpugrid.net/result.php?resultid=32657585

Does anyone have any idea what the reason might have been?
ID: 57642
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Message 57643 - Posted: 24 Oct 2021, 15:52:09 UTC - in response to Message 57642.  

Look at your own link:

EXCEPTIONAL CONDITION: src\mdio\bincoord.c, line 193: "nelems != 1"

Faulty data - a bad task. Not your fault.
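
I don't have the ACEMD source to hand, but that message has the shape of a failed fread() on the input coordinate file: fread() reports how many elements it actually read, so anything other than 1 means the file is truncated or corrupt. A sketch of my own along those lines (the real bincoord.c and its file layout are assumptions here):

/* Hypothetical illustration of an "nelems != 1" check on a binary input file. */
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file.coor\n", argv[0]); return 2; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 2; }

    int32_t natoms = 0;
    size_t nelems = fread(&natoms, sizeof natoms, 1, f);  /* expect exactly 1 */
    if (nelems != 1) {
        fprintf(stderr, "EXCEPTIONAL CONDITION: \"nelems != 1\" (bad input file)\n");
        fclose(f);
        return 1;
    }
    printf("header says %d atoms\n", (int)natoms);
    fclose(f);
    return 0;
}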
ID: 57643