Cause of quantum chemistry task failures: md5sum errors

Michael H.W. Weber
Joined: 9 Feb 16
Posts: 78
Credit: 656,229,684
RAC: 0
Message 51129 - Posted: 28 Dec 2018, 16:44:19 UTC
Last modified: 28 Dec 2018, 16:46:39 UTC

I have a single Ubuntu Linux machine participating in GPUGRID using its CPU. Apart from a few correctly completed QC tasks, by now this machine has produced 28 "compute errors" after just a few seconds of run time each (0 secs CPU time). Checking the error logs yields the following message for all 28 tasks:

WARNING: md5sum mismatch of tar archive
expected: 75a9f0faa822a01dfe0e0e5c43400ed0
     got: dfc9f09eb6b6771c69d6cf10b91bc6c9  -

bunzip2: Data integrity error when decompressing.

I noticed that WU downloads (and communication with the server in general) are extremely slow. Could this be causing transfer hiccups that leave corrupted WU archives on disk, which then fail their checksum on extraction?

As a result, this machine is barred from downloading additional tasks for 24 hours, which makes it rather pointless to keep participating in the current GPUGRID team challenge, or in GPUGRID QC task computation in general.

Maybe an upgrade of the GPUGRID server infrastructure would help improve the situation?

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 51130 - Posted: 28 Dec 2018, 17:09:27 UTC - in response to Message 51129.  

I seem to remember something about these errors from the past, but I can't recall the details.

Your computers are hidden, so we cannot check the error you report. If you unhide them, we can look at the entire stderr report and hopefully get an idea of what is occurring.
Michael H.W. Weber
Message 51132 - Posted: 28 Dec 2018, 19:21:16 UTC

Here is an example stderr log:

Name	m0000040872_65a1af79_n00050-SDOERR_QMML50_4-0-1-RND5714_1
Workunit	15707811
Created	27 Dec 2018 | 17:23:04 UTC
Sent	27 Dec 2018 | 18:30:32 UTC
Received	27 Dec 2018 | 18:32:06 UTC
Server state	Over
Outcome	Computation error
Client state	Compute error
Exit status	195 (0xc3) EXIT_CHILD_FAILED
Computer ID	428878
Report deadline	1 Jan 2019 | 18:30:32 UTC
Run time	2.72
CPU time	0.00
Validate state	Invalid
Credit	0.00
Application version	Quantum Chemistry v3.31 (mt)
Stderr output

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
19:30:37 (6677): wrapper (7.7.26016): starting
19:30:37 (6677): wrapper (7.7.26016): starting
19:30:37 (6677): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
                      /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p qmml3 --override-channels -c defaults  -c gpugrid  --file requirements.txt ")
WARNING: md5sum mismatch of tar archive
expected: 75a9f0faa822a01dfe0e0e5c43400ed0
     got: dfc9f09eb6b6771c69d6cf10b91bc6c9  -

bunzip2: Data integrity error when decompressing.
	Input file = /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/preconda.tar.bz2, output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
19:30:38 (6677): /usr/bin/flock exited; CPU time 0.185019
19:30:38 (6677): app exit status: 0x1
19:30:38 (6677): called boinc_finish(195)

</stderr_txt>
]]>
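
For anyone who wants to check the local files before the next task is fetched, here is a minimal Python sketch that hashes a file and tests whether a bzip2 archive decompresses cleanly. The archive path is the one named in the log above; the function names are made up for illustration. The log does not show exactly which bytes the installer hashes, so the expected digest is not hard-coded here and the MD5 helper is only meant for comparing two copies of the same file.

```python
import bz2
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bz2_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole bzip2 stream decompresses without error.

    A truncated file raises EOFError and corrupted data raises OSError;
    both are reported as "not intact".
    """
    try:
        with bz2.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError):
        return False
    return True
```

Calling `bz2_is_intact("/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/preconda.tar.bz2")` should return False on a machine showing the error above, which is roughly what `bzip2 -tvv` would report as well.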

Michael.
Michael H.W. Weber
Message 51135 - Posted: 29 Dec 2018, 13:32:38 UTC

44 tasks are now affected...

Michael.
Michael H.W. Weber
Message 51368 - Posted: 23 Jan 2019, 21:24:40 UTC

Is this issue resolved?

Michael.
Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 51417 - Posted: 2 Feb 2019, 16:49:03 UTC - in response to Message 51368.  
Last modified: 2 Feb 2019, 16:49:17 UTC

Not something we can fix from here. Try resetting the project, which should clear local files.
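
A project reset can be done from the BOINC Manager, or from the command line with `boinccmd --project URL reset`. The Python sketch below only wraps that command. The project URL is an assumption inferred from the project directory name in the log, and the helper names are hypothetical; use the URL your client actually lists for GPUGRID.

```python
import shutil
import subprocess

# Assumption: inferred from the directory name "www.gpugrid.net" in the log.
PROJECT_URL = "http://www.gpugrid.net/"


def build_reset_command(project_url, boinccmd="boinccmd"):
    """Build the boinccmd argument list that resets one project."""
    return [boinccmd, "--project", project_url, "reset"]


def reset_project(project_url):
    """Run the reset; raises if boinccmd is missing or exits non-zero."""
    command = build_reset_command(project_url)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("boinccmd not found on PATH")
    return subprocess.run(command, check=True)
```

Note that a reset discards any tasks in progress for that project along with the downloaded project files, so it is best done when no GPUGRID work is running.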

©2025 Universitat Pompeu Fabra