
Message boards : Number crunching : WU invalid because of an upload issue at GPUGRIDs server end?

Michael H.W. Weber
Joined: 9 Feb 16
Posts: 71
Credit: 607,916,391
RAC: 2,748
Message 57592 - Posted: 13 Oct 2021 | 7:46:14 UTC
Last modified: 13 Oct 2021 | 7:48:26 UTC

Any idea why this task was marked as invalid after approx. 40,000 seconds of precious run time on my RTX 3080? So far, this machine has not produced a single error and the end of the log appears to note an upload issue?

Task: https://www.gpugrid.net/workunit.php?wuid=27081741

The results are about 500 MB in size for each of these tasks - not at all a problem at my end but, judging by the snail-like data transfer speed, apparently a MAJOR problem at Barcelona's end?

Name e5s122_e2s172p0f91-ADRIA_AdB_KIXCMYB_HIP-1-2-RND2100_3
Workunit 27081741
Created 12 Oct 2021 | 11:45:25 UTC
Sent 12 Oct 2021 | 11:45:51 UTC
Received 13 Oct 2021 | 1:06:53 UTC
Server state Completed
Outcome Computation error
Client state Computation error
Exit status 0 (0x0)
Computer ID 584499
Report deadline 17 Oct 2021 | 11:45:51 UTC
Run time (s) 40,157.41
CPU time (s) 39,016.81
Validate state Invalid
Credit 0.00
Application version New version of ACEMD v2.18 (cuda1121)


Stderr output

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<stderr_txt>
15:52:17 (23288): wrapper (7.9.26016): starting
15:52:17 (23288): wrapper: running bin/acemd3.exe (--boinc --device 0)
03:01:09 (23288): bin/acemd3.exe exited; CPU time 39016.812500
03:01:20 (23288): called boinc_finish(0)
0 bytes in 0 Free Blocks.
186 bytes in 4 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 824084403 bytes.
Dumping objects ->
{389617} normal block at 0x0000028AAECC3BC0, 85 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {389614} normal block at 0x0000028AAECC4620, 8 bytes long.
Data: < &#160;&#174;&#138; > 00 00 A0 AE 8A 02 00 00
{388969} normal block at 0x0000028AAECC3C60, 85 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{388355} normal block at 0x0000028AAECC48F0, 8 bytes long.
Data: < &#206;&#174;&#138; > 10 9D CE AE 8A 02 00 00
..\zip\boinc_zip.cpp(122) : {146} normal block at 0x0000028AAECC3090, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{133} normal block at 0x0000028AAECC4670, 16 bytes long.
Data: <P&#226;&#203;&#174;&#138; > 50 E2 CB AE 8A 02 00 00 00 00 00 00 00 00 00 00
{132} normal block at 0x0000028AAECBE250, 40 bytes long.
Data: <pF&#204;&#174;&#138; conda-pa> 70 46 CC AE 8A 02 00 00 63 6F 6E 64 61 2D 70 61
{125} normal block at 0x0000028AAECBE480, 48 bytes long.
Data: <--boinc --device> 2D 2D 62 6F 69 6E 63 20 2D 2D 64 65 76 69 63 65
{124} normal block at 0x0000028AAECC4030, 16 bytes long.
Data: <XN&#204;&#174;&#138; > 58 4E CC AE 8A 02 00 00 00 00 00 00 00 00 00 00
{123} normal block at 0x0000028AAECC48A0, 16 bytes long.
Data: <0N&#204;&#174;&#138; > 30 4E CC AE 8A 02 00 00 00 00 00 00 00 00 00 00
{122} normal block at 0x0000028AAECC4CB0, 16 bytes long.
Data: < N&#204;&#174;&#138; > 08 4E CC AE 8A 02 00 00 00 00 00 00 00 00 00 00
{121} normal block at 0x0000028AAECC4440, 16 bytes long.
Data: <&#224;M&#204;&#174;&#138; > E0 4D CC AE 8A 02 00 00 00 00 00 00 00 00 00 00
{120} normal block at 0x0000028AAECC43A0, 16 bytes long.
Data: <&#184;M&#204;&#174;&#138; > B8 4D CC AE 8A 02 00 00 00 00 00 00 00 00 00 00
{119} normal block at 0x0000028AAECC4530, 16 bytes long.
Data: < M&#204;&#174;&#138; > 90 4D CC AE 8A 02 00 00 00 00 00 00 00 00 00 00
{118} normal block at 0x0000028AAECC3D60, 16 bytes long.
Data: <pM&#204;&#174;&#138; > 70 4D CC AE 8A 02 00 00 00 00 00 00 00 00 00 00
{117} normal block at 0x0000028AAECC4B20, 16 bytes long.
Data: <HM&#204;&#174;&#138; > 48 4D CC AE 8A 02 00 00 00 00 00 00 00 00 00 00
{116} normal block at 0x0000028AAECC4760, 16 bytes long.
Data: < M&#204;&#174;&#138; > 20 4D CC AE 8A 02 00 00 00 00 00 00 00 00 00 00
{115} normal block at 0x0000028AAECC4D20, 496 bytes long.
Data: <`G&#204;&#174;&#138; bin/acem> 60 47 CC AE 8A 02 00 00 62 69 6E 2F 61 63 65 6D
{65} normal block at 0x0000028AAECB3280, 16 bytes long.
Data: < &#234;&#181;&#164;&#246; > 80 EA B5 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{64} normal block at 0x0000028AAECB3230, 16 bytes long.
Data: <@&#233;&#181;&#164;&#246; > 40 E9 B5 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x0000028AAECB2FB0, 16 bytes long.
Data: <&#248;W&#178;&#164;&#246; > F8 57 B2 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x0000028AAECB3190, 16 bytes long.
Data: <&#216;W&#178;&#164;&#246; > D8 57 B2 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x0000028AAECB2BF0, 16 bytes long.
Data: <P &#178;&#164;&#246; > 50 04 B2 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{60} normal block at 0x0000028AAECB2BA0, 16 bytes long.
Data: <0 &#178;&#164;&#246; > 30 04 B2 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{59} normal block at 0x0000028AAECB3780, 16 bytes long.
Data: <&#224; &#178;&#164;&#246; > E0 02 B2 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{58} normal block at 0x0000028AAECB2B00, 16 bytes long.
Data: < &#178;&#164;&#246; > 10 04 B2 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{57} normal block at 0x0000028AAECB2EC0, 16 bytes long.
Data: <p &#178;&#164;&#246; > 70 04 B2 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{56} normal block at 0x0000028AAECB3910, 16 bytes long.
Data: < &#192;&#176;&#164;&#246; > 18 C0 B0 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>e5s122_e2s172p0f91-ADRIA_AdB_KIXCMYB_HIP-1-2-RND2100_3_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>

Technical data transfer issues due to poor server performance have persisted for many, many years with this project and should be resolved quickly.

Michael.
____________
President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Keith Myers
Joined: 13 Dec 17
Posts: 812
Credit: 1,084,289,831
RAC: 1,517,561
Message 57593 - Posted: 13 Oct 2021 | 7:50:20 UTC - in response to Message 57592.

Yes, problem with the project that can't accept large file sizes.
https://www.gpugrid.net/forum_thread.php?id=5261

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1249
Credit: 3,347,351,168
RAC: 1,045,276
Message 57594 - Posted: 13 Oct 2021 | 8:44:56 UTC

Actually, that one had an upload error on e5s122_e2s172p0f91-ADRIA_AdB_KIXCMYB_HIP-1-2-RND2100_3_0 - not the _9 file which usually grows to ~ 500 MB and sometimes more.

The error code was -240: that seems to mean that BOINC had a problem creating the file in the first place, before even trying to upload it.

https://www.gpugrid.net/result.php?resultid=32653943
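
For anyone who wants to pull the failing file name and error code out of a task's stderr page rather than reading it by eye, here is a minimal Python sketch. It assumes the <message> block always looks like the one quoted in the first post; the function name parse_upload_failure is made up for illustration and is not part of BOINC.

```python
import re

# Hypothetical helper (not a BOINC API): extracts every <file_xfer_error>
# block from a pasted stderr page, assuming the layout shown in this thread.
_XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)\s*\((?P<reason>[^<]*)\)</error_code>\s*"
    r"</file_xfer_error>"
)

def parse_upload_failure(stderr_text):
    """Return a list of (file_name, error_code, reason) tuples."""
    return [
        (m["name"], int(m["code"]), m["reason"])
        for m in _XFER_ERROR.finditer(stderr_text)
    ]

# The block below is copied from the task quoted in the first post.
sample = """<message>
upload failure: <file_xfer_error>
<file_name>e5s122_e2s172p0f91-ADRIA_AdB_KIXCMYB_HIP-1-2-RND2100_3_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>"""

for name, code, reason in parse_upload_failure(sample):
    print(name, code, reason)
```

Run on that task, it reports the _0 file with code -240, which is the file that failed there, not the large _9 one.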

Michael H.W. Weber
Joined: 9 Feb 16
Posts: 71
Credit: 607,916,391
RAC: 2,748
Message 57597 - Posted: 13 Oct 2021 | 10:17:55 UTC - in response to Message 57594.

Keith Myers wrote:
"Yes, problem with the project that can't accept large file sizes.
https://www.gpugrid.net/forum_thread.php?id=5261"

OMG. I can't believe this...
Check out this one.

Richard Haselgrove wrote:
"Actually, that one had an upload error on e5s122_e2s172p0f91-ADRIA_AdB_KIXCMYB_HIP-1-2-RND2100_3_0 - not the _9 file which usually grows to ~ 500 MB and sometimes more."

I have taken a look at some of the tasks I completed successfully. None of them had a _9 ending in the task name. Still, they had upload file sizes of approx. 500 MB. Hence, as far as I can see, the file name does not reliably indicate the result file size.

Michael.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1249
Credit: 3,347,351,168
RAC: 1,045,276
Message 57599 - Posted: 13 Oct 2021 | 10:35:05 UTC - in response to Message 57597.

The _9 doesn't refer to the task name, it refers to the upload file name. Each task generates multiple upload files.
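
To illustrate the naming, here is a small Python sketch that splits an upload file name into the task name and the file index. It assumes, based only on the names seen in this thread, that the upload file name is the task name followed by an underscore and a number; split_upload_name is a hypothetical helper, not something BOINC provides.

```python
def split_upload_name(upload_name):
    """Split '<task name>_<index>' into (task_name, index).

    Assumption from the names in this thread: the last underscore
    separates the task name from the numeric upload file index.
    """
    task_name, _, index = upload_name.rpartition("_")
    if not task_name or not index.isdigit():
        raise ValueError(f"unexpected upload file name: {upload_name!r}")
    return task_name, int(index)

# The file that failed with -240 in the first post:
print(split_upload_name("e5s122_e2s172p0f91-ADRIA_AdB_KIXCMYB_HIP-1-2-RND2100_3_0"))
```

So a task name ending in _3 can still have a _9 upload file; the final number belongs to the file, not the task.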

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1249
Credit: 3,347,351,168
RAC: 1,045,276
Message 57600 - Posted: 13 Oct 2021 | 10:37:41 UTC - in response to Message 57597.

Michael H.W. Weber wrote:
"Check out this one."

It contains the line

Temporarily failed upload of e2s67_e1s44p0f1240-ADRIA_AdB_KIXCMYB_HIP-0-2-RND1963_3_9

Life v lies: Dont be a DN...
Joined: 14 Feb 20
Posts: 5
Credit: 10,903,053
RAC: 298
Message 57611 - Posted: 14 Oct 2021 | 17:23:08 UTC

a partial cross-post
IN THE HOPES THAT SOME ADMIN WILL NOTICE:
WU ran for 33 HOURS, 38 Min
10/14/2021 12:59PM Computation for task e9s158_e7s140p0f143-ADRIA_AdB_KIXCMYB_HIP-0-2-RND0938_5 finished

e9s158_e7s140p0f143-ADRIA_AdB_KIXCMYB_HIP-0-2-RND0938_5_9 501MB (525,918,872 bytes) is the one showing in the Transfers tab.
WHAT can I do so as NOT to waste this WU's results or the 33 1/2 hours of GPU time?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 544
Credit: 4,460,126,357
RAC: 1,610,525
Message 57614 - Posted: 14 Oct 2021 | 18:09:18 UTC - in response to Message 57611.

Nothing you can do at the moment, unfortunately.

Life v lies: Dont be a DN...
Joined: 14 Feb 20
Posts: 5
Credit: 10,903,053
RAC: 298
Message 57616 - Posted: 14 Oct 2021 | 18:43:03 UTC - in response to Message 57611.

OK, I have made (multiple) backup copies of the entire GPU project folder, and have for now suspended transfers.
If the upload does abort, how can I get BOINC to retry the WU output file uploads from the backups?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 544
Credit: 4,460,126,357
RAC: 1,610,525
Message 57619 - Posted: 14 Oct 2021 | 19:39:43 UTC - in response to Message 57616.

Even if you restart the task from a backup, you will have the same issue. The problem is that the output file is too big and cannot be uploaded, since it is over the maximum file size allowed by the project's server. Restarting the computation will produce the same file, still too big.

The problem can only be solved by the project.
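
To make the size argument concrete, here is a rough Python sketch of the comparison. The byte count is the one reported for the _9 file earlier in this thread. The limit is NOT known: the project's actual maximum is not stated anywhere here, so ASSUMED_LIMIT_BYTES is a placeholder chosen purely for illustration.

```python
# HYPOTHETICAL placeholder: the real server-side limit is not given in this
# thread. 500 MiB is used only so the comparison has something to compare to.
ASSUMED_LIMIT_BYTES = 500 * 1024**2

def to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / 1024**2

def exceeds_limit(n_bytes, limit_bytes=ASSUMED_LIMIT_BYTES):
    """True if the file is larger than the (assumed) upload limit."""
    return n_bytes > limit_bytes

size = 525_918_872  # bytes, as shown in the Transfers tab for the _9 file
print(f"{to_mib(size):.2f} MiB, over assumed limit: {exceeds_limit(size)}")
```

The same file is produced on every rerun, so the comparison gives the same answer no matter how often the task is restarted; only a change on the server side changes the outcome.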

Life v lies: Dont be a DN...
Joined: 14 Feb 20
Posts: 5
Credit: 10,903,053
RAC: 298
Message 57620 - Posted: 14 Oct 2021 | 21:02:22 UTC - in response to Message 57619.
Last modified: 14 Oct 2021 | 21:04:38 UTC

It may be a moot point, but I am NOT in the least interested in rerunning the WU (or, frankly, running ANY GPUGRID WUs for the foreseeable future).
I made a backup of the OUTPUT files.
My question is: can I try (for what it's worth) to get BOINC to redo the UPLOAD attempt, not rerun another 33 1/2 hours of GPU usage?
LLP, PhD PE

Ian&Steve C.
Joined: 21 Feb 20
Posts: 544
Credit: 4,460,126,357
RAC: 1,610,525
Message 57621 - Posted: 14 Oct 2021 | 21:59:16 UTC - in response to Message 57620.

You said that there was a file in your Transfers tab. That’s the file that won’t upload because it’s too large. BOINC will already keep retrying it indefinitely; retransferring the files that have already been uploaded won’t make any difference. Each GPUGRID task produces several output files that all need to be uploaded. When the _9 file is too big, you run into this problem. Short of gaining control of the project’s upload server and changing their settings for them, there’s really nothing you can do at this point.
