Message boards : Number crunching : 195 (0xc3) EXIT_CHILD_FAILED

Michael H.W. Weber
Joined: 9 Feb 16
Posts: 71
Credit: 607,916,391
RAC: 0
Level
Lys
Message 57480 - Posted: 5 Oct 2021 | 8:42:33 UTC
Last modified: 5 Oct 2021 | 8:55:54 UTC

My RTX 3080 machine completed a first task successfully. Afterwards, two more tasks crashed after only a few seconds of run time with a 195 (0xc3) EXIT_CHILD_FAILED error message and the following log:

Name e2s184_e1s254p0f959-ADRIA_AdB_KIXCMYB_HIP-0-2-RND9959_9
Workunit 27080023
Created 4 Oct 2021 | 9:59:05 UTC
Sent 4 Oct 2021 | 10:48:16 UTC
Received 4 Oct 2021 | 22:07:40 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 584499
Report deadline 9 Oct 2021 | 10:48:16 UTC
Run time 25.51
CPU time 0.00
Validate state Invalid
Credit 0.00
Application version New version of ACEMD v2.18 (cuda101)

Stderr output

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
00:05:49 (30732): wrapper (7.9.26016): starting
00:05:49 (30732): wrapper: running bin/acemd3.exe (--boinc --device 0)
ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)

00:05:59 (30732): bin/acemd3.exe exited; CPU time 0.000000
00:05:59 (30732): app exit status: 0x1
00:05:59 (30732): called boinc_finish(195)
0 bytes in 0 Free Blocks.
186 bytes in 4 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 239849 bytes.
Dumping objects ->
{323256} normal block at 0x000001B7D23D3BC0, 85 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {323253} normal block at 0x000001B7D23D4940, 8 bytes long.
Data: < 1&#210;&#183; > 00 00 31 D2 B7 01 00 00
{322608} normal block at 0x000001B7D23D3C60, 85 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{321994} normal block at 0x000001B7D23D46C0, 8 bytes long.
Data: <@&#202;?&#210;&#183; > 40 CA 3F D2 B7 01 00 00
..\zip\boinc_zip.cpp(122) : {146} normal block at 0x000001B7D23D3090, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{133} normal block at 0x000001B7D23D48A0, 16 bytes long.
Data: < &#248;<&#210;&#183; > 10 F8 3C D2 B7 01 00 00 00 00 00 00 00 00 00 00
{132} normal block at 0x000001B7D23CF810, 40 bytes long.
Data: <&#160;H=&#210;&#183; conda-pa> A0 48 3D D2 B7 01 00 00 63 6F 6E 64 61 2D 70 61
{125} normal block at 0x000001B7D23CF340, 48 bytes long.
Data: <--boinc --device> 2D 2D 62 6F 69 6E 63 20 2D 2D 64 65 76 69 63 65
{124} normal block at 0x000001B7D23D49E0, 16 bytes long.
Data: <XN=&#210;&#183; > 58 4E 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00
{123} normal block at 0x000001B7D23D4C60, 16 bytes long.
Data: <0N=&#210;&#183; > 30 4E 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00
{122} normal block at 0x000001B7D23D4850, 16 bytes long.
Data: < N=&#210;&#183; > 08 4E 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00
{121} normal block at 0x000001B7D23D3DB0, 16 bytes long.
Data: <&#224;M=&#210;&#183; > E0 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00
{120} normal block at 0x000001B7D23D4030, 16 bytes long.
Data: <&#184;M=&#210;&#183; > B8 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00
{119} normal block at 0x000001B7D23D4080, 16 bytes long.
Data: < M=&#210;&#183; > 90 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00
{118} normal block at 0x000001B7D23D4120, 16 bytes long.
Data: <pM=&#210;&#183; > 70 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00
{117} normal block at 0x000001B7D23D4990, 16 bytes long.
Data: <HM=&#210;&#183; > 48 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00
{116} normal block at 0x000001B7D23D42B0, 16 bytes long.
Data: < M=&#210;&#183; > 20 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00
{115} normal block at 0x000001B7D23D4D20, 496 bytes long.
Data: <&#176;B=&#210;&#183; bin/acem> B0 42 3D D2 B7 01 00 00 62 69 6E 2F 61 63 65 6D
{65} normal block at 0x000001B7D23C2D80, 16 bytes long.
Data: < &#234;&#151;{&#247; > 80 EA 97 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{64} normal block at 0x000001B7D23C2B50, 16 bytes long.
Data: <@&#233;&#151;{&#247; > 40 E9 97 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x000001B7D23C2B00, 16 bytes long.
Data: <&#248;W&#148;{&#247; > F8 57 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x000001B7D23C2AB0, 16 bytes long.
Data: <&#216;W&#148;{&#247; > D8 57 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x000001B7D23C3370, 16 bytes long.
Data: <P &#148;{&#247; > 50 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{60} normal block at 0x000001B7D23C2A60, 16 bytes long.
Data: <0 &#148;{&#247; > 30 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{59} normal block at 0x000001B7D23C3500, 16 bytes long.
Data: <&#224; &#148;{&#247; > E0 02 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{58} normal block at 0x000001B7D23C3640, 16 bytes long.
Data: < &#148;{&#247; > 10 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{57} normal block at 0x000001B7D23C2A10, 16 bytes long.
Data: <p &#148;{&#247; > 70 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{56} normal block at 0x000001B7D23C3870, 16 bytes long.
Data: < &#192;&#146;{&#247; > 18 C0 92 7B F7 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.

</stderr_txt>
]]>

Name e4s109_e1s39p0f745-ADRIA_AdB_KIXCMYB_HIP-1-2-RND2493_0
Workunit 27081645
Created 4 Oct 2021 | 22:12:32 UTC
Sent 4 Oct 2021 | 22:14:12 UTC
Received 4 Oct 2021 | 22:16:12 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 584499
Report deadline 9 Oct 2021 | 22:14:12 UTC
Run time 7.26
CPU time 0.00
Validate state Invalid
Credit 0.00
Application version New version of ACEMD v2.18 (cuda101)
Stderr output

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
00:14:24 (14320): wrapper (7.9.26016): starting
00:14:24 (14320): wrapper: running bin/acemd3.exe (--boinc --device 0)
ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)

00:14:26 (14320): bin/acemd3.exe exited; CPU time 0.000000
00:14:26 (14320): app exit status: 0x1
00:14:26 (14320): called boinc_finish(195)
0 bytes in 0 Free Blocks.
186 bytes in 4 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 241603 bytes.
Dumping objects ->
{323256} normal block at 0x000002061D1C3BC0, 85 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {323253} normal block at 0x000002061D1C43F0, 8 bytes long.
Data: < > 00 00 02 1D 06 02 00 00
{322608} normal block at 0x000002061D1C3C60, 85 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{321994} normal block at 0x000002061D1C42B0, 8 bytes long.
Data: <@&#202; > 40 CA 1E 1D 06 02 00 00
..\zip\boinc_zip.cpp(122) : {146} normal block at 0x000002061D1C3090, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{133} normal block at 0x000002061D1C3EF0, 16 bytes long.
Data: <&#208;&#242; > D0 F2 1B 1D 06 02 00 00 00 00 00 00 00 00 00 00
{132} normal block at 0x000002061D1BF2D0, 40 bytes long.
Data: <&#240;> conda-pa> F0 3E 1C 1D 06 02 00 00 63 6F 6E 64 61 2D 70 61
{125} normal block at 0x000002061D1BF180, 48 bytes long.
Data: <--boinc --device> 2D 2D 62 6F 69 6E 63 20 2D 2D 64 65 76 69 63 65
{124} normal block at 0x000002061D1C4940, 16 bytes long.
Data: <XN > 58 4E 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00
{123} normal block at 0x000002061D1C4490, 16 bytes long.
Data: <0N > 30 4E 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00
{122} normal block at 0x000002061D1C4800, 16 bytes long.
Data: < N > 08 4E 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00
{121} normal block at 0x000002061D1C47B0, 16 bytes long.
Data: <&#224;M > E0 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00
{120} normal block at 0x000002061D1C48A0, 16 bytes long.
Data: <&#184;M > B8 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00
{119} normal block at 0x000002061D1C4710, 16 bytes long.
Data: < M > 90 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00
{118} normal block at 0x000002061D1C48F0, 16 bytes long.
Data: <pM > 70 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00
{117} normal block at 0x000002061D1C4990, 16 bytes long.
Data: <HM > 48 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00
{116} normal block at 0x000002061D1C4A80, 16 bytes long.
Data: < M > 20 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00
{115} normal block at 0x000002061D1C4D20, 496 bytes long.
Data: < J bin/acem> 80 4A 1C 1D 06 02 00 00 62 69 6E 2F 61 63 65 6D
{65} normal block at 0x000002061D1B36E0, 16 bytes long.
Data: < &#234;&#151;{&#247; > 80 EA 97 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{64} normal block at 0x000002061D1B3410, 16 bytes long.
Data: <@&#233;&#151;{&#247; > 40 E9 97 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x000002061D1B3820, 16 bytes long.
Data: <&#248;W&#148;{&#247; > F8 57 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x000002061D1B33C0, 16 bytes long.
Data: <&#216;W&#148;{&#247; > D8 57 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x000002061D1B3190, 16 bytes long.
Data: <P &#148;{&#247; > 50 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{60} normal block at 0x000002061D1B3000, 16 bytes long.
Data: <0 &#148;{&#247; > 30 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{59} normal block at 0x000002061D1B2FB0, 16 bytes long.
Data: <&#224; &#148;{&#247; > E0 02 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{58} normal block at 0x000002061D1B3320, 16 bytes long.
Data: < &#148;{&#247; > 10 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{57} normal block at 0x000002061D1B2F60, 16 bytes long.
Data: <p &#148;{&#247; > 70 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00
{56} normal block at 0x000002061D1B3140, 16 bytes long.
Data: < &#192;&#146;{&#247; > 18 C0 92 7B F7 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.

</stderr_txt>
]]>

Any idea what is going on?

What is very annoying is that, after these two consecutive crashes, the GPUGRID server took 4 hours to send out a new task (which is now in progress), leaving my machine idle for hours.

Michael.
____________
President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,861,851
RAC: 8,767,822
Level
Tyr
Message 57481 - Posted: 5 Oct 2021 | 8:59:03 UTC - in response to Message 57480.

Your computers are hidden, so I can't be certain, but your problem seems to be

Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)

There are two versions of the new GPUGrid application: cuda1121 and cuda101. You will be able to see in your task list which worked, and which didn't work.

Despite some posts to the contrary, the general consensus is that cuda1121 works on an RTX 3080, and cuda101 doesn't. And despite an assurance from the project that they have prevented the cuda101 application being sent to RTX cards, clearly they haven't.

There's nothing you can do to prevent the wrong application being sent to your card: just take comfort from the fact that cuda101 tasks will fail very quickly, and you won't waste computing power on the tasks. The only hit you and the project are taking is the waste of bandwidth downloading the inappropriate tasks.
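
The nvrtc message fits how CUDA toolkits handle GPU generations: a toolkit's runtime compiler can only target the architectures it knew about when it shipped. Here is a minimal sketch of that check. It assumes the compute-capability figures in the tables below, which are from NVIDIA's public documentation to the best of my knowledge. The table and function names are made up for illustration, and this is not the project's scheduler code.

```python
# Illustrative only: highest compute capability each toolkit's NVRTC can target.
# The figures are assumptions taken from NVIDIA's public documentation.
MAX_COMPUTE_CAPABILITY = {
    "cuda101": (7, 5),   # CUDA 10.1: up to Turing
    "cuda1121": (8, 6),  # CUDA 11.2: up to Ampere
}

# Compute capability of a few example cards.
GPU_COMPUTE_CAPABILITY = {
    "GTX 1080": (6, 1),
    "RTX 2080": (7, 5),
    "RTX 3080": (8, 6),
}


def can_compile_for(app_version: str, gpu: str) -> bool:
    """True if the toolkit behind app_version knows the GPU's architecture."""
    return GPU_COMPUTE_CAPABILITY[gpu] <= MAX_COMPUTE_CAPABILITY[app_version]


for gpu in GPU_COMPUTE_CAPABILITY:
    for app in MAX_COMPUTE_CAPABILITY:
        print(gpu, app, can_compile_for(app, gpu))
```

Under these assumptions, only the combination of cuda101 with the RTX 3080 comes out as unsupported, which matches the failures reported here.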

Michael H.W. Weber
Joined: 9 Feb 16
Posts: 71
Credit: 607,916,391
RAC: 0
Level
Lys
Message 57482 - Posted: 5 Oct 2021 | 9:13:09 UTC

Thank you Richard - only cuda1121 works for me.

Michael.
____________
President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Erich56
Joined: 1 Jan 15
Posts: 1090
Credit: 6,603,906,926
RAC: 21,893,126
Level
Tyr
Message 57484 - Posted: 5 Oct 2021 | 11:43:49 UTC - in response to Message 57481.

And despite an assurance from the project that they have prevented the cuda101 application being sent to RTX cards, clearly they haven't.

I endorse this statement. cuda101 tasks have been sent to my two RTX 3070s, the latest one this morning.

There's nothing you can do to prevent the wrong application being sent to your card: just take comfort from the fact that cuda101 tasks will fail very quickly, and you won't waste computing power on the tasks. The only hit you and the project are taking is the waste of bandwidth downloading the inappropriate tasks.

However, there is more to it: if one deletes erroneously downloaded cuda101 tasks from the BOINC task list too often, one may not receive any more tasks for the next 24 hours.

Hence, this problem should be solved by the project team ASAP!

bozz4science
Joined: 22 May 20
Posts: 109
Credit: 68,936,176
RAC: 0
Level
Thr
Message 57486 - Posted: 5 Oct 2021 | 13:02:45 UTC

I also don't quite understand what information determines the app version to be sent out. This task, for example, had been sent out 6 times before my host caught it. Once it was the 1121 app version; all the others were sent out as the 101 app version. It failed on all previous hosts and went through 3 Ampere cards (3060 Ti, 3070 and 3090). This seems to be quite an annoyance for anyone with the latest cards. And older cards take some serious chewing on the new tasks: mine takes a little over 31 hrs. This project could be working much more efficiently if it were able to fully capture the potential of these RTX 3000 series cards.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,861,851
RAC: 8,767,822
Level
Tyr
Message 57487 - Posted: 5 Oct 2021 | 13:30:47 UTC - in response to Message 57486.

But there are two different failure modes, three of each: three missing DLLs (probably vcruntime140_1) and three wrong architectures (cuda101 on RTX).

You need all three to align - right version, on right architecture, with right software support - before it'll run. One out of eight is about the right probability.
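
As a back-of-envelope check on that figure, assume (purely for illustration) that each of the three conditions is met independently on half of the hosts:

```python
from fractions import Fraction

# Assumption for illustration only: each condition holds independently
# with probability 1/2. The real rates are not known from this thread.
conditions = {
    "right version": Fraction(1, 2),
    "right architecture": Fraction(1, 2),
    "right software support": Fraction(1, 2),
}

p_success = Fraction(1)
for probability in conditions.values():
    p_success *= probability

print(p_success)  # 1/8
```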

bozz4science
Joined: 22 May 20
Posts: 109
Credit: 68,936,176
RAC: 0
Level
Thr
Message 57488 - Posted: 5 Oct 2021 | 13:36:57 UTC

That sounds about right. I only meant to highlight the Ampere cards that all obviously failed because the wrong version had been sent to these hosts, while older-generation cards somehow got the 1121 app version instead on some occasions.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,497,232,483
RAC: 64,632,171
Level
Trp
Message 57489 - Posted: 5 Oct 2021 | 13:41:17 UTC

All the more reason to just retire the cuda101 app, and force everyone to update their drivers to use the cuda1121 app
____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,861,851
RAC: 8,767,822
Level
Tyr
Message 57490 - Posted: 5 Oct 2021 | 13:45:41 UTC - in response to Message 57489.

All the more reason to just retire the cuda101 app, and force everyone to update their drivers to use the cuda1121 app

I disagree. People should be allowed their own choice of driver (you don't know why they've kept an older one), but the project should manage the minimum limits better.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,497,232,483
RAC: 64,632,171
Level
Trp
Message 57491 - Posted: 5 Oct 2021 | 13:56:55 UTC - in response to Message 57490.
Last modified: 5 Oct 2021 | 14:32:33 UTC

IMO, the "choice" of driver within the ranges of CUDA101 and CUDA1121 compatibility is arbitrary. The list of supported products is exactly the same, so it's not as if some older GPU won't be supported anymore with the newer driver. Nvidia drivers are very stable, and it's pretty rare that a new driver fully breaks something. CUDA101 requires driver 418.xx and CUDA1121 requires driver 461.xx; there's not a huge difference here. Even so, there's a large range of "choice" between the minimum driver required for cuda1121 and the current latest driver release. Drivers don't need to be bleeding edge. CUDA 11.2 was introduced almost a year ago, and the toolkit is currently up to CUDA 11.4 Update 2.

If you have some software issue that actively prevents installing a new driver, then fix your software issues.

There's really no good reason not to update if you're already on hardware and drivers new enough to support the CUDA101 app. It's not a huge change to get more recent drivers, and the observed negative impacts from the project maintaining two app versions far outweigh the impact of requiring a user to update their drivers.
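
To make the driver thresholds concrete, here is a rough sketch of the check a host could run. The minimum major versions (418 and 461) are the ones quoted in this post. The function name is hypothetical, and only the major version is compared because the exact minor versions are not given here.

```python
from typing import List

# Minimum driver major versions as quoted in this post (418.xx and 461.xx).
MIN_DRIVER_MAJOR = {"cuda101": 418, "cuda1121": 461}


def usable_app_versions(driver_version: str) -> List[str]:
    """Return the app versions whose minimum driver major version is met."""
    major = int(driver_version.split(".")[0])
    return [app for app, minimum in MIN_DRIVER_MAJOR.items() if major >= minimum]


# Example driver version strings, chosen only to exercise both branches.
print(usable_app_versions("431.86"))  # ['cuda101']
print(usable_app_versions("470.00"))  # ['cuda101', 'cuda1121']
```

Note that this only covers the driver side. A driver that is new enough does not help if the toolkit behind the app version cannot target the card's architecture.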
____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,861,851
RAC: 8,767,822
Level
Tyr
Message 57492 - Posted: 5 Oct 2021 | 13:59:51 UTC - in response to Message 57491.

it's not a huge change to get more recent drivers, and the observed negative impacts from the project maintaining two app versions far outweigh the impact of requiring a user to update their drivers.

It can be if your computer is managed by your employer's domain controller, and group policy prevents you updating it yourself. Just an example.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,497,232,483
RAC: 64,632,171
Level
Trp
Message 57493 - Posted: 5 Oct 2021 | 14:04:46 UTC - in response to Message 57492.

it's not a huge change to get more recent drivers, and the observed negative impacts from the project maintaining two app versions far outweigh the impact of requiring a user to update their drivers.

It can be if your computer is managed by your employer's domain controller, and group policy prevents you updating it yourself. Just an example.


In this case, it's MORE likely that these systems will (*should*) be updated to a recent driver, as any competent SA will (*should*) be keeping everything up to date in terms of security patches, and there has been a stronger push from Nvidia in this regard lately.
____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,861,851
RAC: 8,767,822
Level
Tyr
Message 57494 - Posted: 5 Oct 2021 | 14:15:13 UTC

I think we're far enough off topic. Let's leave it there.

Michael H.W. Weber
Joined: 9 Feb 16
Posts: 71
Credit: 607,916,391
RAC: 0
Level
Lys
Message 57544 - Posted: 8 Oct 2021 | 12:13:48 UTC

Well, this project's inability to deliver the proper GPU app/plan class to the corresponding GPU systems simply results in a massive loss of overall project performance: because of the repeated "compute errors", the clients do not receive further tasks for a while and sit idle for hours. I figure that this way, instead of two tasks per day, I can deliver only one.
Well, not my problem. A second project is occupying the idle time now.

Michael.
____________
President of Rechenkraft.net - Germany's first and largest distributed computing organization.

bcavnaugh
Joined: 8 Nov 13
Posts: 56
Credit: 1,002,640,163
RAC: 0
Level
Met
Message 57572 - Posted: 11 Oct 2021 | 2:26:22 UTC
Last modified: 11 Oct 2021 | 2:59:44 UTC

Glad at least one of my hosts is running, but all the others are NOT!
2080 (441.20) running, 1080 (431.86) not running, also 2080 (431.86) not running.
What NVIDIA driver must we have to run GPUGRID?
As you can see, even with the new or current runtimes it still fails:
14.29.30135.0 is current (VS 2022); the version with the tasks is 14.28.29325.2.
https://live.staticflickr.com/65535/51574059037_5ae789d24d_b.jpg

Update:
For me, driver 441.20 seems to work on all my hosts. Yahoo!

jjch
Joined: 10 Nov 13
Posts: 98
Credit: 15,288,150,388
RAC: 1,732,962
Level
Trp
Message 57575 - Posted: 11 Oct 2021 | 4:26:33 UTC

Nvidia driver version 441.20 is a bit old. I am currently running version 471.11 on my Windows hosts.

Michael H.W. Weber
Joined: 9 Feb 16
Posts: 71
Credit: 607,916,391
RAC: 0
Level
Lys
Message 57598 - Posted: 13 Oct 2021 | 10:29:25 UTC - in response to Message 57481.
Last modified: 13 Oct 2021 | 10:31:09 UTC


Despite some posts to the contrary, the general consensus is that cuda1121 works on an RTX 3080, and cuda101 doesn't. And despite an assurance from the project that they have prevented the cuda101 application being sent to RTX cards, clearly they haven't.

I second that.

There's nothing you can do to prevent the wrong application being sent to your card: just take comfort from the fact that cuda101 tasks will fail very quickly, and you won't waste computing power on the tasks.

Unfortunately, that is exactly NOT the case.
Here is an example of a task which ran for almost 15 hrs before failing with an error:

Task: https://www.gpugrid.net/result.php?resultid=32653715

Name e7s106_e5s196p1f1036-ADRIA_AdB_KIXCMYB_HIP-1-2-RND0214_4
Workunit 27082868
Created 11 Oct 2021 | 6:23:21 UTC
Sent 11 Oct 2021 | 6:24:56 UTC
Received 12 Oct 2021 | 9:32:05 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 588794
Report deadline 16 Oct 2021 | 6:24:56 UTC
Run time 53,608.48
CPU time 53,473.36
Validate state Invalid
Credit 0.00
Application version New version of ACEMD v2.18 (cuda101)
Stderr output

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
08:25:11 (11620): wrapper (7.9.26016): starting
08:25:11 (11620): wrapper: running bin/acemd3.exe (--boinc --device 0)
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {323250} normal block at 0x000002C574996E70, 8 bytes long.
Data: < $v > 00 00 24 76 C5 02 00 00
..\lib\diagnostics_win.cpp(417) : {321999} normal block at 0x000002C576431310, 1080 bytes long.
Data: < > FC 08 00 00 CD CD CD CD 0C 01 00 00 00 00 00 00
..\zip\boinc_zip.cpp(122) : {149} normal block at 0x000002C57499EBA0, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Object dump complete.
10:58:41 (4808): wrapper (7.9.26016): starting
10:58:41 (4808): wrapper: running bin/acemd3.exe (--boinc --device 0)
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {323286} normal block at 0x000001DA265261F0, 8 bytes long.
Data: < J& > 00 00 4A 26 DA 01 00 00
..\lib\diagnostics_win.cpp(417) : {322035} normal block at 0x000001DA26591B80, 1080 bytes long.
Data: <h > 68 1A 00 00 CD CD CD CD 20 01 00 00 00 00 00 00
..\zip\boinc_zip.cpp(122) : {149} normal block at 0x000001DA2652EB00, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Object dump complete.
11:30:47 (13592): wrapper (7.9.26016): starting
11:30:47 (13592): wrapper: running bin/acemd3.exe (--boinc --device 0)
ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)

11:30:49 (13592): bin/acemd3.exe exited; CPU time 0.000000
11:30:49 (13592): app exit status: 0x1
11:30:49 (13592): called boinc_finish(195)
0 bytes in 0 Free Blocks.
298 bytes in 4 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 130740 bytes.
Dumping objects ->
{323289} normal block at 0x0000014269701A70, 141 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {323286} normal block at 0x00000142696C62F0, 8 bytes long.
Data: < eiB > 00 00 65 69 42 01 00 00
{322649} normal block at 0x00000142697020F0, 141 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{322036} normal block at 0x00000142696C6890, 8 bytes long.
Data: <p siB > 70 1B 73 69 42 01 00 00
..\zip\boinc_zip.cpp(122) : {149} normal block at 0x00000142696CE940, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{136} normal block at 0x00000142696C7060, 16 bytes long.
Data: <p&#171;liB > 70 AB 6C 69 42 01 00 00 00 00 00 00 00 00 00 00
{135} normal block at 0x00000142696CAB70, 40 bytes long.
Data: <`pliB conda-pa> 60 70 6C 69 42 01 00 00 63 6F 6E 64 61 2D 70 61
{128} normal block at 0x00000142696CAB00, 48 bytes long.
Data: <--boinc --device> 2D 2D 62 6F 69 6E 63 20 2D 2D 64 65 76 69 63 65
{127} normal block at 0x00000142696C6FC0, 16 bytes long.
Data: <8&#248;liB > 38 F8 6C 69 42 01 00 00 00 00 00 00 00 00 00 00
{126} normal block at 0x00000142696C6AC0, 16 bytes long.
Data: < &#248;liB > 10 F8 6C 69 42 01 00 00 00 00 00 00 00 00 00 00
{125} normal block at 0x00000142696C6A70, 16 bytes long.
Data: <&#232;&#247;liB > E8 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00
{124} normal block at 0x00000142696C6A20, 16 bytes long.
Data: <&#192;&#247;liB > C0 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00
{123} normal block at 0x00000142696C6C00, 16 bytes long.
Data: < &#247;liB > 98 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00
{122} normal block at 0x00000142696C6980, 16 bytes long.
Data: <p&#247;liB > 70 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00
{121} normal block at 0x00000142696C70B0, 16 bytes long.
Data: <P&#247;liB > 50 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00
{120} normal block at 0x00000142696C6930, 16 bytes long.
Data: <(&#247;liB > 28 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00
{119} normal block at 0x00000142696C6570, 16 bytes long.
Data: < &#247;liB > 00 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00
{118} normal block at 0x00000142696CF700, 496 bytes long.
Data: <peliB bin/acem> 70 65 6C 69 42 01 00 00 62 69 6E 2F 61 63 65 6D
{68} normal block at 0x00000142696C62A0, 16 bytes long.
Data: < &#234;&#188; &#246; > 80 EA BC 1A F6 7F 00 00 00 00 00 00 00 00 00 00
{67} normal block at 0x00000142696C6CF0, 16 bytes long.
Data: <@&#233;&#188; &#246; > 40 E9 BC 1A F6 7F 00 00 00 00 00 00 00 00 00 00
{66} normal block at 0x00000142696C6480, 16 bytes long.
Data: <&#248;W&#185; &#246; > F8 57 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00
{65} normal block at 0x00000142696C6520, 16 bytes long.
Data: <&#216;W&#185; &#246; > D8 57 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00
{64} normal block at 0x00000142696C6840, 16 bytes long.
Data: <P &#185; &#246; > 50 04 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x00000142696C6B60, 16 bytes long.
Data: <0 &#185; &#246; > 30 04 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x00000142696C6390, 16 bytes long.
Data: <&#224; &#185; &#246; > E0 02 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x00000142696C6250, 16 bytes long.
Data: < &#185; &#246; > 10 04 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00
{60} normal block at 0x00000142696C66B0, 16 bytes long.
Data: <p &#185; &#246; > 70 04 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00
{59} normal block at 0x00000142696C67F0, 16 bytes long.
Data: < &#192;&#183; &#246; > 18 C0 B7 1A F6 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.

</stderr_txt>
]]>

Currently, I have another one with cuda101 (wrongly selected by the server for this client) which has now been running for several hours.

Michael.
____________
President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Michael H.W. Weber
Joined: 9 Feb 16
Posts: 71
Credit: 607,916,391
RAC: 0
Level
Lys
Message 57605 - Posted: 14 Oct 2021 | 12:35:20 UTC - in response to Message 57598.


Currently, I have another one with cuda101 (wrongly selected by the server for this client) which has now been running for several hours.

...it actually caused my machine to crash, and the task restarted after the reboot.
So I aborted it.

Michael.
____________
President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Erich56
Joined: 1 Jan 15
Posts: 1090
Credit: 6,603,906,926
RAC: 21,893,126
Level
Tyr
Message 57642 - Posted: 24 Oct 2021 | 15:16:50 UTC

I had a "195 (0xc3) EXIT_CHILD_FAILED" case this afternoon, a few seconds after start:

https://www.gpugrid.net/result.php?resultid=32657585

Does anyone have any idea what the reason might have been?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,861,851
RAC: 8,767,822
Level
Tyr
Message 57643 - Posted: 24 Oct 2021 | 15:52:09 UTC - in response to Message 57642.

Look at your own link:

EXCEPTIONAL CONDITION: src\mdio\bincoord.c, line 193: "nelems != 1"

Faulty data - a bad task. Not your fault.
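
For anyone triaging their own failures: the exit code is 195 in all the cases reported here, so the distinguishing line is in the stderr output. Below is a minimal sketch that maps the stderr signatures quoted so far in this thread to a likely cause. The function name and the cause descriptions are my own wording, not anything provided by BOINC or ACEMD.

```python
# Substrings copied from stderr output quoted in this thread, each paired
# with an informal diagnosis. Extend the list as new signatures turn up.
SIGNATURES = [
    ("invalid value for --gpu-architecture",
     "app version cannot target this GPU (e.g. cuda101 on an RTX 3080)"),
    ('"nelems != 1"',
     "faulty input data: a bad task, not a host problem"),
]


def classify_stderr(stderr_text: str) -> str:
    """Return an informal diagnosis for a task's stderr output."""
    for needle, cause in SIGNATURES:
        if needle in stderr_text:
            return cause
    return "unknown: read the full stderr output"


sample = 'EXCEPTIONAL CONDITION: src\\mdio\\bincoord.c, line 193: "nelems != 1"'
print(classify_stderr(sample))
```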

ServicEnginIC
Joined: 24 Sep 10
Posts: 566
Credit: 5,939,502,024
RAC: 10,920,136
Level
Tyr
Message 57644 - Posted: 24 Oct 2021 | 16:25:46 UTC - in response to Message 57642.

Faulty data - a bad task. Not your fault.

+1

Windows Operating System:
EXCEPTIONAL CONDITION: src\mdio\bincoord.c, line 193: "nelems != 1"

Linux Operating System:
EXCEPTIONAL CONDITION: /home/user/conda/conda-bld/acemd3_1618916459379/work/src/mdio/bincoord.c, line 193: "nelems != 1"

When this warning appears, it usually means that there is an error in the definition of the task's initial parameters.
The task was badly constructed at the origin, and it will fail on every host that receives it.
Looking at Work Unit #27084895, to which this task belongs, it has previously failed on several other hosts, under both Windows and Linux.
The fate of a Work Unit like this is to be retired once the maximum number of allowed failed tasks (7) is reached...
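
The retirement rule described above can be sketched as a simple counter. This only illustrates the behaviour as described in this post, with the limit of 7 taken from it. The class and attribute names are hypothetical, and this is not BOINC server code.

```python
MAX_FAILED_TASKS = 7  # limit quoted in the post above


class WorkUnit:
    """Toy model of a work unit that is retired after too many failed tasks."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.failed_tasks = 0

    def report_failure(self) -> None:
        self.failed_tasks += 1

    @property
    def retired(self) -> bool:
        # Once the limit is reached, no further tasks are sent out.
        return self.failed_tasks >= MAX_FAILED_TASKS


wu = WorkUnit("example work unit")
attempts = 0
while not wu.retired:
    wu.report_failure()  # a badly constructed task fails on every host
    attempts += 1

print(attempts)  # 7
```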

Greger
Joined: 6 Jan 15
Posts: 74
Credit: 14,325,201,749
RAC: 31,074,710
Level
Trp
Message 57865 - Posted: 24 Nov 2021 | 17:42:50 UTC

A new task from batch e1s34_0-ADRIA_GPCR2021_APJ_b0-0-1-RND2256 appears to break at start with a Not a Number (NaN) particle coordinate:

ACEMD failed:
Particle coordinate is nan


Errorcode: process exited with code 195 (0xc3, -61)

WU: https://www.gpugrid.net/workunit.php?wuid=27087091
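
The three numbers in that error line are the same value written three ways. A quick check, assuming the negative figure is the exit code read as a signed 8-bit integer:

```python
exit_code = 195

print(hex(exit_code))  # 0xc3

# Reinterpret the single byte as a signed 8-bit value (two's complement).
signed = int.from_bytes(exit_code.to_bytes(1, "big"), "big", signed=True)
print(signed)  # -61
```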

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,861,851
RAC: 8,767,822
Level
Tyr
Message 57866 - Posted: 24 Nov 2021 | 17:59:04 UTC
Last modified: 24 Nov 2021 | 18:38:56 UTC

Same here with two different Linux machines:

e1s25_0-ADRIA_GPCR2021_APJ_b0-0-1-RND8388_5
e1s174_0-ADRIA_GPCR2021_APJ_b0-0-1-RND2767_0

The first task has failed for multiple users: I was the first to attempt the second task, but it looks to have gone the same way.

And the same error under Windows:

e1s401_0-ADRIA_GPCR2021_APJ_b0-0-1-RND6370_3

curiously_indifferent
Joined: 20 Nov 17
Posts: 18
Credit: 1,086,142,304
RAC: 620,935
Level
Met
Scientific publications
watwatwat
Message 57867 - Posted: 24 Nov 2021 | 18:16:08 UTC

Same here - RTX2060 card. Fails after 10 seconds.

https://www.gpugrid.net/result.php?resultid=32662557

https://www.gpugrid.net/result.php?resultid=32663017

https://www.gpugrid.net/result.php?resultid=32663196

https://www.gpugrid.net/result.php?resultid=32663734

ZUSE
Avatar
Send message
Joined: 10 Jun 20
Posts: 7
Credit: 417,413,397
RAC: 657,376
Level
Gln
Scientific publications
wat
Message 57868 - Posted: 24 Nov 2021 | 18:45:20 UTC

have the same problem with 4x Nvidia T600

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,923,256,959
RAC: 6,447,011
Level
Arg
Scientific publications
watwatwatwatwat
Message 57869 - Posted: 24 Nov 2021 | 20:10:02 UTC - in response to Message 57868.

Looks like we all have the same issue with NaN. I've bombed through a couple dozen today for wasted download cap.

bozz4science
Send message
Joined: 22 May 20
Posts: 109
Credit: 68,936,176
RAC: 0
Level
Thr
Scientific publications
wat
Message 57870 - Posted: 24 Nov 2021 | 23:43:48 UTC

Same on my end. Had almost 80 tasks on a single machine today that all failed with said NaN error. Why were so many faulty tasks sent out in the first place?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,923,256,959
RAC: 6,447,011
Level
Arg
Scientific publications
watwatwatwatwat
Message 57871 - Posted: 25 Nov 2021 | 2:00:29 UTC

Thrown away 150 bad tasks today.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 5,939,502,024
RAC: 10,920,136
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57872 - Posted: 25 Nov 2021 | 6:40:41 UTC

Zero valid tasks returned overnight; it's clearly a faultily constructed batch.
At least, it seems that we'll have new work to crunch when it is corrected.

joeybuddy96
Send message
Joined: 1 Apr 20
Posts: 9
Credit: 146,536,770
RAC: 0
Level
Cys
Scientific publications
wat
Message 57873 - Posted: 25 Nov 2021 | 21:29:53 UTC

I got 16 errored tasks: 8 of cuda101, 8 of cuda1121. No tasks successfully completed. The errors were "Particle coordinate is nan" and "The requested CUDA device could not be loaded".
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,923,256,959
RAC: 6,447,011
Level
Arg
Scientific publications
watwatwatwatwat
Message 57875 - Posted: 26 Nov 2021 | 18:46:48 UTC

Looks like they are sending out corrected tasks now from that last batch.

Have several running now correctly.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 5,939,502,024
RAC: 10,920,136
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57876 - Posted: 26 Nov 2021 | 22:25:15 UTC - in response to Message 57875.

Right,
Problems seem to have been solved in this new batch of ADRIA tasks.
I'm also estimating that this batch is considerably lighter than previous ones, and my GTX 1660 Ti will be hitting full bonus with its current task.

Billy Ewell 1931
Send message
Joined: 22 Oct 10
Posts: 34
Credit: 967,276,174
RAC: 610,853
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57877 - Posted: 27 Nov 2021 | 5:01:18 UTC - in response to Message 57876.

Only partial success!
My Xeon powered machine with a GTX 1060 was reinitiated about 2 hours ago and is performing without a fault.

My I7 machine with a RTX 2070 and my I3 machine with a GTX 1060 were likewise restarted with GPU Grid and several tasks have all repeatedly failed within 10-13 seconds of starting.

I changed all drivers without any impact.

I suppose we let our boxes run until all tasks that choose to fail have succeeded and the few that are successful are recorded as winners.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1090
Credit: 6,603,906,926
RAC: 21,893,126
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 57878 - Posted: 27 Nov 2021 | 9:04:18 UTC

here the tasks failed within a few seconds:

https://www.gpugrid.net/result.php?resultid=32703425
https://www.gpugrid.net/result.php?resultid=32698761

excerpt from stderr:

ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)


so I guess these new WUs are not running on Ampere cards (here: 2 x RTX3070).

However, earlier WUs were crunched perfectly on these cards, regardless of the CUDA version.

Too bad :-(

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,861,851
RAC: 8,767,822
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57879 - Posted: 27 Nov 2021 | 9:05:10 UTC - in response to Message 57877.

My I7 machine with a RTX 2070 and my I3 machine with a GTX 1060 were likewise restarted with GPU Grid and several tasks have all repeatedly failed within 10-13 seconds of starting.

Those two machines are running the same tasks - with 'Bandit' in their name - that are running successfully on other machines.

So the problem lies in your machine setup, not in the tasks themselves. A lot of other machines have the same problem - you're not alone.

Your tasks are failing with 'app exit status: 0xc0000135' - in all likelihood, you are missing a Microsoft runtime DLL file. Please refer to message 57353
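
For anyone wanting to check an 'app exit status' value themselves, here is a minimal Python sketch. It assumes the value is a Windows NTSTATUS code; the lookup table is a small hand-picked selection, not a complete list, and the function name is made up for illustration.

```python
# Hand-picked selection of Windows NTSTATUS values (assumption: not exhaustive).
KNOWN_NTSTATUS = {
    0xC0000135: "STATUS_DLL_NOT_FOUND (a required DLL could not be located)",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000013A: "STATUS_CONTROL_C_EXIT",
}


def describe_exit_status(text: str) -> str:
    """Translate a hex exit status string such as '0xc0000135' into a short description."""
    code = int(text, 16)  # int() accepts the '0x' prefix when the base is 16
    if code < 0xC0000000:
        # NTSTATUS error codes have the two top bits set; smaller values are plain exit codes.
        return f"ordinary application exit code {code}"
    return KNOWN_NTSTATUS.get(code, f"unrecognised status 0x{code:08x}")


print(describe_exit_status("0xc0000135"))
print(describe_exit_status("0x1"))
```

The first call covers the status quoted above; the second covers the plain 0x1 that acemd3 returns in the nvrtc failures elsewhere in this thread.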

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,861,851
RAC: 8,767,822
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57880 - Posted: 27 Nov 2021 | 9:08:06 UTC - in response to Message 57878.

excerpt from stderr:

ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)


so I guess these new WUs are not running on Ampere cards (here: 2 x RTX3070).

It's not the tasks that fail on Ampere cards - it's the CUDA 101 app. The 1121 app should be OK.
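
One plausible reading of the nvrtc message, sketched in Python under stated assumptions: a CUDA toolkit can only compile for GPU architectures it knows about. The two tables below are filled in from NVIDIA's published compute capabilities and are not taken from the project; the function name is hypothetical. The sketch covers the toolkit limit only and says nothing about how the app picks its target architecture.

```python
# Highest compute capability each app's CUDA toolkit can target (assumption:
# 'cuda101' means CUDA 10.1 and 'cuda1121' means CUDA 11.2.x).
MAX_COMPUTE_CAPABILITY = {
    "cuda101": (7, 5),   # CUDA 10.1 knows architectures up to Turing
    "cuda1121": (8, 6),  # CUDA 11.2 also knows Ampere
}

# Compute capability of some cards mentioned in this thread.
GPU_COMPUTE_CAPABILITY = {
    "GTX 1060": (6, 1),
    "GTX 1660 Ti": (7, 5),
    "RTX 2070": (7, 5),
    "RTX 3070": (8, 6),
    "RTX 3080": (8, 6),
}


def can_target(plan_class: str, gpu: str) -> bool:
    """Return True if the toolkit behind plan_class knows the GPU's architecture."""
    return GPU_COMPUTE_CAPABILITY[gpu] <= MAX_COMPUTE_CAPABILITY[plan_class]


for gpu in ("GTX 1060", "RTX 2070", "RTX 3070"):
    for plan_class in ("cuda101", "cuda1121"):
        print(f"{gpu:12s} {plan_class:9s} {can_target(plan_class, gpu)}")
```

With these tables, only the RTX 3070 with cuda101 comes out as False, which matches the failures reported above.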

Erich56
Send message
Joined: 1 Jan 15
Posts: 1090
Credit: 6,603,906,926
RAC: 21,893,126
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 57881 - Posted: 27 Nov 2021 | 9:31:07 UTC - in response to Message 57880.

excerpt from stderr:

ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)


so I guess these new WUs are not running on Ampere cards (here: 2 x RTX3070).

It's not the tasks that fail on Ampere cards - it's the CUDA 101 app. The 1121 app should be OK.


This was exactly the problem when the previous batch of WUs started: CUDA101 apps could not be crunched on Ampere cards, 1121 WUs went well.

Then, someone here from the forum posted instructions on how to change the content of a specific file in the - I guess - GPUGRID project folder (I forgot which file that was), I followed this instruction, and from then on my RTX3070 cards were crunching both CUDA versions.

Since this is no longer the case with the current batch, I suppose something must be different with these new WUs.

From what I remember, about half of the WUs I had crunched over several weeks were 101, about the other half was 1121.

Of course, it is rather impractical to just try downloading task after task and hoping that a 1121 will show up some time. As known, after a number of unsuccessful WUs, downloads of new ones are blocked for a day.

I would have expected that the GPUGRID people have repaired this specific problem in the meantime. Which obviously is not the case. So they keep blocking an increasing number of hosts, which to me does not make any sense at all :-(

Erich56
Send message
Joined: 1 Jan 15
Posts: 1090
Credit: 6,603,906,926
RAC: 21,893,126
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 57882 - Posted: 27 Nov 2021 | 9:38:01 UTC - in response to Message 57881.


Then, someone here from the forum posted instructions on how to change the content of a specific file in the - I guess - GPUGRID project folder (I forgot which file that was), I followed this instruction, and from then on my RTX3070 cards were crunching both CUDA versions.

I now remember: it was the Conda-pack.zip... file of which the content had to be changed.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,861,851
RAC: 8,767,822
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57883 - Posted: 27 Nov 2021 | 9:53:27 UTC - in response to Message 57881.

I would have expected that the GPUGRID people have repaired this specific problem in the meantime. Which obviously is not the case. So they keep blocking an increasing number of hosts, which to me does not make any sense at all :-(

I agree completely. But since the project doesn't seem to be (effectively) learning the lessons from previous mistakes, the best we can do is to perform the analysis for them, draw attention to the precise causes, and do what we can to ensure that at least some scientific research is completed successfully. Just burning up tasks with failures, until the maximum error limit for the WU is reached, doesn't help anyone.

The file you need to change is acemd3.exe - it can be found in your current conda-pack.zip.xxxxxx file, in the GPUGrid project folder. Check whether a newer version of that file has been downloaded since you last modified it. Mine is currently dated 05 October - later than our last major discussion on the subject.

That zip pack should also contain vcruntime140_1.dll, but I don't know if simply placing it in the zip would help - it might need to be specifically added to the unpacking instruction list as well.
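
To see whether a given pack actually contains a particular file, something like the following Python sketch could be used. The function name is hypothetical, and the path you pass in is whatever your own project folder holds; nothing here is specific to BOINC.

```python
import zipfile
from pathlib import PurePosixPath


def find_member(zip_path, filename):
    """Return every path inside the archive whose file name matches, ignoring case."""
    wanted = filename.lower()
    with zipfile.ZipFile(zip_path) as archive:
        return [
            name
            for name in archive.namelist()
            if PurePosixPath(name).name.lower() == wanted
        ]


# Example call (the archive name is a placeholder for the file in your project folder):
# print(find_member("conda-pack.zip.xxxxxx", "vcruntime140_1.dll"))
```

An empty list means the DLL is not in the pack. Work on a copy of the archive rather than the live file while the client is running.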

Erich56
Send message
Joined: 1 Jan 15
Posts: 1090
Credit: 6,603,906,926
RAC: 21,893,126
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 57884 - Posted: 27 Nov 2021 | 10:54:44 UTC - in response to Message 57883.

the Conda-pack file which is currently in the GPUGRID folder is named "Conda-pack.zip.1d5...358" and is dated 4.10.21.
However, the file whose content I changed was called "conda-pack.zip.aeb48...86a", and this file, for some reason, no longer exists in the GPUGRID folder.

Maybe it was deleted this morning, when GPUGRID was updated while I was retrieving new tasks.
I am sure that both files were in the GPUGRID folder before.

In fact, I remember that the change I made in October was to overwrite the content of "conda-pack.zip.aeb48...86a" with the content of the file "Conda-pack.zip.1d5...358".

What is new since this morning, among other files, is "windows_x86_64_cuda101.zip.c0d...b21", dated 27.11.(=date of download this morning).
Whether a similar file was in the GPUGRID folder before or not, and may have been deleted this morning - I do not know.

So what I could do is to copy the "conda-pack.zip.aeb48...86a", of which I had saved a copy in the "documents" folder, to the GPUGRID folder. Whether it helps or not, I will only see after retrying a new task (if it happens to be again CUDA101).

Any other suggestions ?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,861,851
RAC: 8,767,822
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57885 - Posted: 27 Nov 2021 | 11:17:30 UTC - in response to Message 57884.

Have a look at the job.xml.xxxxxx file.

I have one dated 22 September that says

<job_desc>
<task>
<application>bin/acemd3.exe</application>
<command_line>--boinc --device $GPU_DEVICE_NUM</command_line>
<stdout_filename>progress.log</stdout_filename>
<checkpoint_filename>restart.chk</checkpoint_filename>
<fraction_done_filename>progress</fraction_done_filename>
</task>
<unzip_input>
<zipfilename>conda-pack.zip</zipfilename>
</unzip_input>
</job_desc>

and another dated yesterday that says

<job_desc>
<task>
<application>bin/acemd3.exe</application>
<command_line>--boinc --device $GPU_DEVICE_NUM</command_line>
<stdout_filename>progress.log</stdout_filename>
<checkpoint_filename>restart.chk</checkpoint_filename>
<fraction_done_filename>progress</fraction_done_filename>
</task>
<unzip_input>
<zipfilename>windows_x86_64__cuda101.zip</zipfilename>
</unzip_input>
</job_desc>

To be certain, you'd need to look at the job specification in client_state.xml, but I think I'd go with the newest.

Note that you'd also need to have the matching versions of cudart and cufft for the app you end up using.
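
A quick way to pull the relevant fields out of a job.xml file, as a Python sketch. The XML is the newer of the two files quoted above; the function name and the dictionary keys are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The job description dated "yesterday" in the post above.
JOB_XML = """<job_desc>
<task>
<application>bin/acemd3.exe</application>
<command_line>--boinc --device $GPU_DEVICE_NUM</command_line>
<stdout_filename>progress.log</stdout_filename>
<checkpoint_filename>restart.chk</checkpoint_filename>
<fraction_done_filename>progress</fraction_done_filename>
</task>
<unzip_input>
<zipfilename>windows_x86_64__cuda101.zip</zipfilename>
</unzip_input>
</job_desc>"""


def summarize_job(xml_text):
    """Return the application, its command line and the archives the wrapper unpacks."""
    root = ET.fromstring(xml_text)
    return {
        "application": root.findtext("task/application"),
        "command_line": root.findtext("task/command_line"),
        "zip_files": [e.text for e in root.findall("unzip_input/zipfilename")],
    }


print(summarize_job(JOB_XML))
```

Running it on both job files side by side shows that the only difference is the archive name.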

Erich56
Send message
Joined: 1 Jan 15
Posts: 1090
Credit: 6,603,906,926
RAC: 21,893,126
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 57887 - Posted: 27 Nov 2021 | 13:46:24 UTC - in response to Message 57885.

Have a look at the job.xml.xxxxxx file. ...


My job.xml.xxxxxx files look exactly like yours. Also date-wise.

To me, this shows that the new tasks no longer use the former <zipfilename>conda-pack.zip< but rather the new <zipfilename>windows_x86_64__cuda101.zip<

And since no "...cuda1121.zip" was downloaded into the GPUGRID folder, I suppose that the new WUs are running cuda101 only.
Which further means that these new WUs will not work with Ampere cards :-(

Looks as simple as that, most sadly :-(

Unless someone here can report about successful completion of the new WUs with an Ampere card.

If possible, some kind of confirmation/statement/explanation or whatever from the team would also help a lot.


Profile PDW
Send message
Joined: 7 Mar 14
Posts: 15
Credit: 1,000,002,525
RAC: 0
Level
Met
Scientific publications
watwatwatwatwat
Message 57888 - Posted: 27 Nov 2021 | 14:23:34 UTC - in response to Message 57887.

Unless someone here can report about successful completion of the new WUs with an Ampere card.

I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine. All tasks on all cards have worked, am going to try some slower cards given the tasks are smaller.

I have never renamed any of the project files.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1090
Credit: 6,603,906,926
RAC: 21,893,126
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 57889 - Posted: 27 Nov 2021 | 14:49:32 UTC - in response to Message 57888.

I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine.

thanks for the information, sounds interesting.

Could you please let me/us know whether your www.gpugrid.net folder (in BOINC > projects) contains any conda-pack.zip files (if yes, which ones?), and whether besides the "windows_x86_64_cuda101.zip.c0d...b21" it contains such a file with "...cuda1121" (instead of cuda101).

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,861,851
RAC: 8,767,822
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57890 - Posted: 27 Nov 2021 | 15:31:47 UTC - in response to Message 57889.

I have completed a Windows x64 cuda1121 task, and I have a windows_x86_64__cuda1121.zip file on that machine.

You can download a copy from my Google drive.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 5,939,502,024
RAC: 10,920,136
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57891 - Posted: 27 Nov 2021 | 15:39:37 UTC - in response to Message 57888.

I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine.

This last-mentioned problem only affects Windows hosts.
If your hosts are running under any kind of Linux distribution, it is normal that they aren't being affected.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57892 - Posted: 27 Nov 2021 | 15:42:46 UTC

What's the point in keeping the CUDA10 app alive? The CUDA11 app works on older cards as well.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,497,232,483
RAC: 64,632,171
Level
Trp
Scientific publications
wat
Message 57893 - Posted: 27 Nov 2021 | 15:44:37 UTC - in response to Message 57892.

What's the point in keeping the CUDA10 app alive? The CUDA11 app works on older cards as well.


I agree and I've said this a few times also. no point in keeping the CUDA101 app when there's the 1121 app.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1090
Credit: 6,603,906,926
RAC: 21,893,126
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 57894 - Posted: 27 Nov 2021 | 15:52:30 UTC - in response to Message 57891.

I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine.

This last-mentioned problem only affects Windows hosts.
If your hosts are running under any kind of Linux distribution, it is normal that they aren't being affected.

too bad that the user PDW has hidden his computers in the profile. So no idea what OS is being used ... unless he tells us.


What's the point in keeping the CUDA10 app alive? The CUDA11 app works on older cards as well.

good question

Werinbert
Send message
Joined: 12 May 13
Posts: 5
Credit: 100,032,540
RAC: 0
Level
Cys
Scientific publications
wat
Message 57896 - Posted: 27 Nov 2021 | 21:18:57 UTC - in response to Message 57876.

I'm also estimating that this batch is considerably lighter than previous ones, and my GTX 1660 Ti will be hitting full bonus with its current task.

ServiceEnginIC, I noticed that your task completed in under 64,000 sec. My 1660 TI is looking to finish in just under 88,000 sec. I am wondering what could be causing such a big difference. The tasks, mine is a Cuda101 running under Win 7 and yours is Cuda1121 running under Linux. Are either of these the culprit?
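
To put numbers on that gap, a small Python calculation using the two run times quoted above. The 24-hour mark is included only as a reference point, on the assumption that the full bonus depends on returning a task within a day.

```python
linux_seconds = 64_000    # run time reported for the Cuda1121 task under Linux
windows_seconds = 88_000  # run time expected for the Cuda101 task under Win 7

slowdown = windows_seconds / linux_seconds
linux_hours = linux_seconds / 3600
windows_hours = windows_seconds / 3600
one_day_seconds = 24 * 3600  # assumed bonus window, for reference only

print(f"slowdown factor: {slowdown:.3f}")
print(f"Linux run:   {linux_hours:.2f} h")
print(f"Windows run: {windows_hours:.2f} h")
print(f"Windows run exceeds 24 h: {windows_seconds > one_day_seconds}")
```

So the Windows run takes 37.5% longer, and at about 24.4 hours it lands just past the one-day mark while the Linux run stays well inside it.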

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 5,939,502,024
RAC: 10,920,136
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57898 - Posted: 27 Nov 2021 | 22:44:29 UTC - in response to Message 57896.

Working under Linux helps to squeeze maximum performance.
Some optimized settings at BOINC Manager and a moderate overclocking do the rest.
In the "Managing non-high-end hosts" thread I try to share everything I know about it.

Profile PDW
Send message
Joined: 7 Mar 14
Posts: 15
Credit: 1,000,002,525
RAC: 0
Level
Met
Scientific publications
watwatwatwatwat
Message 57905 - Posted: 28 Nov 2021 | 10:05:52 UTC - in response to Message 57894.

I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine.

This last-mentioned problem only affects Windows hosts.
If your hosts are running under any kind of Linux distribution, it is normal that they aren't being affected.

too bad that the user PDW has hidden his computers in the profile. So no idea what OS is being used ... unless he tells us.

You asked:
Unless someone here can report about successful completion of the new WUs with an Ampere card.

As I posted previously I am using linux.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1090
Credit: 6,603,906,926
RAC: 21,893,126
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 57906 - Posted: 28 Nov 2021 | 10:53:41 UTC - in response to Message 57905.

Unless someone here can report about successful completion of the new WUs with an Ampere card.

As I posted previously I am using linux.

oh okay, thanks for the information (which explains why it works well on your system).

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57911 - Posted: 28 Nov 2021 | 13:55:24 UTC - in response to Message 57906.
Last modified: 28 Nov 2021 | 13:56:49 UTC

I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine.

This last-mentioned problem only affects Windows hosts.
If your hosts are running under any kind of Linux distribution, it is normal that they aren't being affected.

too bad that the user PDW has hidden his computers in the profile. So no idea what OS is being used ... unless he tells us.

You asked:
Unless someone here can report about successful completion of the new WUs with an Ampere card.

As I posted previously I am using linux.

oh okay, thanks for the information (which explains why it works well on your system).
No, it does not explain it.
I've tried to run a CUDA 101 task on my Ubuntu 18.04.6 host on an RTX 3080 Ti (driver: 495.44), and it's failed after a few minutes.
<core_client_version>7.16.17</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
14:33:16 (1675): wrapper (7.7.26016): starting
14:33:23 (1675): wrapper (7.7.26016): starting
14:33:23 (1675): wrapper: running bin/acemd3 (--boinc --device 0)
ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)

14:35:30 (1675): bin/acemd3 exited; CPU time 127.166324
14:35:30 (1675): app exit status: 0x1
14:35:30 (1675): called boinc_finish(195)

</stderr_txt>
]]>
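
Since every failure in this thread ends in the same exit code 195, the stderr text is the only thing that tells the causes apart. Here is a rough Python sketch that sorts a stderr dump by the messages seen so far. The signature list and the labels are just a summary of what posters here have concluded, not an official list, and the function name is hypothetical.

```python
# (text to look for, short description) pairs collected from this thread.
SIGNATURES = [
    ('"nelems != 1"', "badly constructed task (bincoord.c input error)"),
    ("Particle coordinate is nan", "badly constructed task (NaN coordinate)"),
    ("invalid value for --gpu-architecture", "app cannot compile for this GPU architecture"),
    ("app exit status: 0xc0000135", "missing runtime DLL on Windows"),
    ("The requested CUDA device could not be loaded", "CUDA device could not be loaded (cause not established here)"),
]


def classify(stderr_text):
    """Return the description of every known signature found in the stderr text."""
    lowered = stderr_text.lower()
    return [label for needle, label in SIGNATURES if needle.lower() in lowered]


sample = """ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
14:35:30 (1675): app exit status: 0x1
14:35:30 (1675): called boinc_finish(195)"""

print(classify(sample))
```

An empty result means none of the known messages matched, so the log needs reading by hand.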

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57918 - Posted: 28 Nov 2021 | 14:26:19 UTC - in response to Message 57911.
Last modified: 28 Nov 2021 | 14:39:37 UTC

Another example:
http://www.gpugrid.net/result.php?resultid=32706825
EDIT:
3rd attempt (failed as well):
http://www.gpugrid.net/result.php?resultid=32706943

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,497,232,483
RAC: 64,632,171
Level
Trp
Scientific publications
wat
Message 57959 - Posted: 29 Nov 2021 | 14:28:37 UTC

after yesterday's snafu, I picked up two cuda101 tasks this morning on my Linux Ubuntu 20.04 3080Ti system. currently running ok. been running about 20 mins now, and is utilizing the GPU @99% so it's definitely working. I basically executed a project reset yesterday on this host, so I don't think my previous modifications to swap out the 101 app for 1121 carried over.
____________

Billy Ewell 1931
Send message
Joined: 22 Oct 10
Posts: 34
Credit: 967,276,174
RAC: 610,853
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57966 - Posted: 29 Nov 2021 | 18:07:32 UTC - in response to Message 57880.

Quote:Your tasks are failing with 'app exit status: 0xc0000135' - in all likelihood, you are missing a Microsoft runtime DLL file. Please refer to message 57353.Quote

Richard: Thank you kindly for solving the problem. I installed both the x86 and x64 update packages and now both machines are processing GPU Grid tasks without fault.

Billy Ewell 1931; celebrating the passage of my 90th birthday a few days ago and am physically in good shape and still mentally quite capable.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,923,256,959
RAC: 6,447,011
Level
Arg
Scientific publications
watwatwatwatwat
Message 57967 - Posted: 29 Nov 2021 | 18:09:21 UTC

I missed out on all the new work because I had to get new master lists on all the hosts when their 24 hour timeouts finally expired.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,861,851
RAC: 8,767,822
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57969 - Posted: 29 Nov 2021 | 18:24:03 UTC - in response to Message 57967.

I missed out on all the new work because I had to get new master lists on all the hosts when their 24 hour timeouts finally expired.

I think the 24 hour (master file fetch) backoff is set by the client, rather than the server - so it can be over-ridden by a manual update.

That's unlike the 'daily quota exceeded' and the 'last request too recent' backoffs, which are enforced by the server and can't be bypassed.

I might use one of these boring lockdown days to force a client into 'master file fetch' mode, so I can see how it's recorded in client_state.xml, and hence how to remove it again - whenever and wherever that knowledge might be useful in the future.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,923,256,959
RAC: 6,447,011
Level
Arg
Scientific publications
watwatwatwatwat
Message 57971 - Posted: 29 Nov 2021 | 19:54:05 UTC

Manual updates did nothing but keep resetting the 24 hour timer backoff.
Same with an update script running every 15 minutes. Backoff never got below 23 hours before resetting back to 24.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,861,851
RAC: 8,767,822
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57973 - Posted: 29 Nov 2021 | 21:04:16 UTC - in response to Message 57971.

That's because another failure doesn't reset the failure count. We need to find out where that's stored, and reduce it to less than 10.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57975 - Posted: 29 Nov 2021 | 21:37:29 UTC - in response to Message 57959.

after yesterday's snafu, I picked up two cuda101 tasks this morning on my Linux Ubuntu 20.04 3080Ti system. currently running ok. been running about 20 mins now, and is utilizing the GPU @99% so it's definitely working. I basically executed a project reset yesterday on this host, so I don't think my previous modifications to swap out the 101 app for 1121 carried over.
That's easy to check: the CUDA1121 is 990MB while the CUDA101 is 491MB (503406KB).
I think it's impossible to run the CUDA101 on RTX3000 series, as that was the main reason demanding a CUDA11 client not so long ago.
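
A minimal Python sketch for that size check, listing the app archives in a project folder with their sizes. The function names are made up, and the filter simply looks for "cuda" and ".zip" in the file name, which is an assumption based on the archive names mentioned in this thread.

```python
from pathlib import Path


def list_app_archives(project_dir):
    """Return (file name, size in bytes) for each file that looks like a CUDA app archive."""
    archives = []
    for path in sorted(Path(project_dir).iterdir()):
        if path.is_file() and "cuda" in path.name and ".zip" in path.name:
            archives.append((path.name, path.stat().st_size))
    return archives


def describe_size(num_bytes):
    """Format a size in both decimal megabytes and binary mebibytes."""
    return f"{num_bytes / 1_000_000:.1f} MB ({num_bytes / 1_048_576:.1f} MiB)"


# The 503406KB figure quoted above, read as units of 1024 bytes:
print(describe_size(503_406 * 1024))  # prints "515.5 MB (491.6 MiB)"
```

Note that file managers differ in which unit they show, so the same archive can appear as roughly 491 or 515 "MB" depending on the tool.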

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,497,232,483
RAC: 64,632,171
Level
Trp
Scientific publications
wat
Message 57976 - Posted: 30 Nov 2021 | 2:50:49 UTC - in response to Message 57975.

after yesterday's snafu, I picked up two cuda101 tasks this morning on my Linux Ubuntu 20.04 3080Ti system. currently running ok. been running about 20 mins now, and is utilizing the GPU @99% so it's definitely working. I basically executed a project reset yesterday on this host, so I don't think my previous modifications to swap out the 101 app for 1121 carried over.
That's easy to check: the CUDA1121 is 990MB while the CUDA101 is 491MB (503406KB).
I think it's impossible to run the CUDA101 on RTX3000 series, as that was the main reason demanding a CUDA11 client not so long ago.



my gpugrid project folder contains two compressed files for acemd3.

x86_64-pc-linux-gnu__cuda101.zip.<alphanumeric> (515.5 MB)
x86_64-pc-linux-gnu__cuda1121.zip.<alphanumeric> (1.0 GB)

so it seems it did indeed use the cuda101 code on my 3080Ti and both tasks succeeded.

https://www.gpugrid.net/result.php?resultid=32707549
https://www.gpugrid.net/result.php?resultid=32701203

since both apps use the same filename of just 'acemd3', it's possible some bug is causing the wrong (or is it right? lol) one to be used or something to that effect.
____________

Profile PDW
Send message
Joined: 7 Mar 14
Posts: 15
Credit: 1,000,002,525
RAC: 0
Level
Met
Scientific publications
watwatwatwatwat
Message 57978 - Posted: 30 Nov 2021 | 8:08:50 UTC - in response to Message 57976.

since both apps use the same filename of just 'acemd3', it's possible some bug is causing the wrong (or is it right? lol) one to be used or something to that effect.

Don't forget it could be this...
http://www.gpugrid.net/forum_thread.php?id=5256&nowrap=true#57473

Have completed 5 of the recent cuda101 tasks on Ampere hosts now, a sixth is running and a seventh lined up. Have seen no failures as yet.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57985 - Posted: 1 Dec 2021 | 9:43:31 UTC - in response to Message 57978.

since both apps use the same filename of just 'acemd3', it's possible some bug is causing the wrong (or is it right? lol) one to be used or something to that effect.

Don't forget it could be this...
http://www.gpugrid.net/forum_thread.php?id=5256&nowrap=true#57473

Have completed 5 of the recent cuda101 tasks on Ampere hosts now, a sixth is running and a seventh lined up. Have seen no failures as yet.
I guess that you still use the "special" BOINC manager (compiled for SETI@home), and that handles apps in a different way. That would explain this anomaly.

Profile PDW
Send message
Joined: 7 Mar 14
Posts: 15
Credit: 1,000,002,525
RAC: 0
Level
Met
Scientific publications
watwatwatwatwat
Message 57987 - Posted: 1 Dec 2021 | 9:59:34 UTC - in response to Message 57985.

I guess that you still use the "special" BOINC manager (compiled for SETI@home), and that handles apps in a different way. That would explain this anomaly.

No.
No modified manager or client here, just the bog standard BOINC 7.16.6

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,861,851
RAC: 8,767,822
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57988 - Posted: 1 Dec 2021 | 10:21:40 UTC - in response to Message 57987.
Last modified: 1 Dec 2021 | 10:24:55 UTC

... just the bog standard BOINC 7.16.6

You are recommended to upgrade to v7.16.20 - it's pretty good code, and - importantly - it has updated SSL security certificates needed by some BOINC projects.

(Edit - the above advice applies only to Windows machines. If you're running Linux, you can ignore it. Your computers are hidden, so I don't know which applies)

Erich56
Send message
Joined: 1 Jan 15
Posts: 1090
Credit: 6,603,906,926
RAC: 21,893,126
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58071 - Posted: 11 Dec 2021 | 15:16:28 UTC

a few hours ago, I had another task which failed after a few seconds with

195 (0xc3) EXIT_CHILD_FAILED

ACEMD failed:
Particle coordinate is nan


https://www.gpugrid.net/workunit.php?wuid=27099407

As can be seen, the task failed on a total of 8 different hosts.
I am questioning the rationale behind sending out a faulty task 8 times :-(((
