Major SNAFU in Effect

Message boards : Number crunching : Major SNAFU in Effect

Nick Name

Message 51786 - Posted: 14 May 2019, 1:14:37 UTC
Last modified: 14 May 2019, 1:20:02 UTC

I noticed a ton of errors on a previously 100% reliable host tonight. It looks like a bad batch of WUs got pushed out; both IDP and KIX jobs are affected.
IDP
http://www.gpugrid.net/workunit.php?wuid=16483464
http://www.gpugrid.net/workunit.php?wuid=16480175
http://www.gpugrid.net/workunit.php?wuid=16480417
http://www.gpugrid.net/workunit.php?wuid=16453242
KIX
http://www.gpugrid.net/workunit.php?wuid=16483553
http://www.gpugrid.net/workunit.php?wuid=16474311
http://www.gpugrid.net/workunit.php?wuid=16483548


I have 25 bad jobs in total that also have failed on numerous other hosts.

[edit]I should have said mine is a Linux host, and I just noticed most of the other hosts where work failed are also Linux machines.[/edit]
Team USA forum | Team USA page
Join us and #crunchforcures. We are now also folding: join team ID 236370!

PappaLitto
Message 51789 - Posted: 14 May 2019, 1:24:29 UTC

http://www.gpugrid.net/results.php?hostid=490728

Above is my host with erroring WUs

Bedrich Hajek
Message 51790 - Posted: 14 May 2019, 1:59:34 UTC - in response to Message 51789.  

http://www.gpugrid.net/results.php?hostid=490728

Above is my host with erroring WUs

Did someone forget to renew a license?

Keith Myers
Message 51791 - Posted: 14 May 2019, 3:29:32 UTC

I'm also getting nothing but computation errors on these new tasks.

Jim1348
Message 51792 - Posted: 14 May 2019, 4:18:58 UTC

Same here, of course. But I haven't seen anyone from the project around here for a while. Is anyone at home?

STARBASEn
Message 51793 - Posted: 14 May 2019, 4:42:08 UTC

Same here as well. Error 212 on WUs that were running fine up to 4-5 hours ago. Sounds like a license thing to me as well. I've suspended the project until the issue is resolved.

DRSMT
Message 51795 - Posted: 14 May 2019, 6:00:57 UTC

Have the same issues on two Linux machines, so not sure if this is a license thing.

rod4x4
Message 51796 - Posted: 14 May 2019, 6:18:42 UTC

For the last 2 years, the license error has usually come after July 1st. A 12-month license, I am assuming.

Keith Myers
Message 51798 - Posted: 14 May 2019, 7:19:59 UTC

Every task I had in my cache on 4 hosts errored out today. Since I don't run a very high resource allotment, some tasks had been running a couple of hours a day with no issues until today. The hosts are processing other projects without any errors during this time. I'd have to guess a license expired today.

Azmodes
Message 51799 - Posted: 14 May 2019, 7:56:34 UTC
Last modified: 14 May 2019, 8:02:17 UTC

Same. I have two Ubuntu machines that throw up nothing but immediate errors now. My two Windows crunchers are fine, though.

Retvari Zoltan
Message 51801 - Posted: 14 May 2019, 8:03:13 UTC

The Linux app is broken (most probably its license expired).
All of my Linux hosts run immediately into this error with every single workunit:
<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 212 (0xd4, -44)</message>
<stderr_txt>

</stderr_txt>
]]>
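
The three numbers in that message line are one value written three ways. A minimal sketch, assuming the last figure is simply the low byte read as a signed 8-bit integer (the helper name is made up for illustration, and this says nothing about what code 212 actually means to the app):

```python
def describe_exit_code(code: int) -> str:
    """Format an exit status as decimal, hex, and signed 8-bit."""
    low_byte = code & 0xFF  # exit statuses fit in one byte
    # Reinterpret the byte as two's-complement: values >= 128 wrap negative.
    signed = low_byte - 256 if low_byte >= 128 else low_byte
    return f"{code} (0x{low_byte:x}, {signed})"


print(describe_exit_code(212))
```

So the hex and the negative number carry no extra information beyond the plain 212.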

However, my Windows hosts are crunching happily, so I switched back to Windows on my Linux hosts.

The GPUGrid staff need to act on this without delay.

Michael H.W. Weber
Message 51802 - Posted: 14 May 2019, 8:12:57 UTC
Last modified: 14 May 2019, 8:13:21 UTC

Same over here:
http://www.gpugrid.net/forum_thread.php?id=4909&nowrap=true#51794

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Richard Haselgrove
Message 51804 - Posted: 14 May 2019, 10:15:56 UTC

The Linux ACEMD v9.19 apps were deployed on 13/14 February 2018 - so it possibly looks like a 15 month licence expiry.

The Windows v9.22 apps were deployed on 26 July 2018, so with luck we have until late October for those...
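
The 15-month guess can be checked with a little date arithmetic. A rough sketch: the 15-month term is only a guess based on the deployment dates above, and add_months is a throwaway helper, not part of any GPUGrid or BOINC tooling.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Return d shifted forward by a number of calendar months."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Clamp the day so e.g. 31 Jan + 1 month lands on the last day of Feb.
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


ASSUMED_TERM_MONTHS = 15  # assumption, not a confirmed licence term

linux_expiry = add_months(date(2018, 2, 14), ASSUMED_TERM_MONTHS)
windows_expiry = add_months(date(2018, 7, 26), ASSUMED_TERM_MONTHS)
print("Linux v9.19:", linux_expiry)
print("Windows v9.22:", windows_expiry)
```

Under that assumption the Linux date falls on the day the errors started, and the Windows date falls in late October.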


rod4x4
Message 51808 - Posted: 14 May 2019, 12:22:42 UTC
Last modified: 14 May 2019, 12:51:24 UTC

A temporary fix for Linux users is to set your system date back 1 year.

EDIT: Setting time back 1 year caused certificate errors with other projects. So I have now set time back 1 month. This seems to work better.

This has allowed me to start GPUgrid jobs successfully.

You may need to stop time sync services so the system does not reset the time back to current time.

For systemd-based distros (e.g. Ubuntu), sudo timedatectl set-ntp 0 will turn time sync off.

EDIT: you will need to reissue this command and reset time after each reboot. If this licensing issue persists, I will post a more permanent time sync fix

This was the temporary fix last year when license issues occurred.
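
For anyone scripting this, here is a dry-run sketch that only prints the commands rather than running them. timedatectl set-ntp and set-time are real systemd subcommands, but the exact set-time invocation is an assumption (the steps above only say to set the date back one month), and the function names are made up for illustration.

```python
import calendar
from datetime import datetime


def one_month_back(now: datetime) -> datetime:
    """Return the same time of day one calendar month earlier."""
    if now.month > 1:
        year, month = now.year, now.month - 1
    else:
        year, month = now.year - 1, 12
    # Clamp the day for short months (31 March -> 28 or 29 February).
    day = min(now.day, calendar.monthrange(year, month)[1])
    return now.replace(year=year, month=month, day=day)


def workaround_commands(now: datetime) -> list[str]:
    """Build the commands as strings only; nothing is executed here."""
    target = one_month_back(now)
    return [
        "sudo timedatectl set-ntp 0",  # stop automatic time sync first
        f"sudo timedatectl set-time '{target:%Y-%m-%d %H:%M:%S}'",
    ]


for command in workaround_commands(datetime(2019, 5, 14, 12, 22, 42)):
    print(command)
```

Review the printed commands before running them by hand, and remember that other projects may object to a wrong clock, as noted in the edit above.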

James C. Owens
Message 51810 - Posted: 14 May 2019, 14:35:36 UTC - in response to Message 51808.  

Is project leadership aware of the licensing expiration? Seems like someone should be keeping a tickler file for this so that renewals could happen before WUs start erroring out.

Retvari Zoltan
Message 51811 - Posted: 14 May 2019, 16:25:44 UTC - in response to Message 51810.  
Last modified: 14 May 2019, 16:25:53 UTC

"Is project leadership aware of the licensing expiration?"
Apparently not. That's why we have this SNAFU.

"Seems like someone should be keeping a tickler file for this so that renewals could happen before WUs start erroring out."
True.

Jim1348
Message 51814 - Posted: 14 May 2019, 17:37:27 UTC - in response to Message 51811.  

There wasn't any notification of the pending shutdown of the Quantum Chemistry (CPU) work units either, or when they might be restarted.
I am not sure that there is any project leadership at the moment.

Keith Myers
Message 51816 - Posted: 14 May 2019, 19:51:19 UTC

I'm going to just suspend the project on all my hosts. The fact I have to exclude my Turing cards makes it difficult to work with the project anyway.

I'll just check back in occasionally and see if a new Linux app is available with current licensing.

Erich56
Message 51817 - Posted: 14 May 2019, 20:25:16 UTC - in response to Message 51810.  

"Seems like someone should be keeping a tickler file for this so that renewals could happen before WUs start erroring out."

License renewals were missed in the past as well, and tasks failed then too. Too bad, but it really seems that the people at GPUGRID simply forget about these things.

Richard Haselgrove
Message 51821 - Posted: 15 May 2019, 6:46:56 UTC

Just in case anyone is still wondering, I've been sent WU 16485663.

It failed three times on Linux v9.19 hosts and is now running normally under Windows v9.22.

That confirms it's an application problem, not a data problem.

©2025 Universitat Pompeu Fabra