All Gerard WUs erroring

Trotador
Joined: 25 Mar 12
Posts: 103
Credit: 14,948,929,771
RAC: 0
Level
Trp
Message 42537 - Posted: 2 Jan 2016, 20:14:14 UTC

Hi,

I'm seeing this happen with the last downloaded units; wingmen also get the same error:

"process exited with code 212 (0xd4, -44)"

Not sure, but it could be only for Linux WUs.

Trotador
Joined: 25 Mar 12
Posts: 103
Credit: 14,948,929,771
RAC: 0
Level
Trp
Message 42538 - Posted: 2 Jan 2016, 23:41:03 UTC - in response to Message 42537.  

It also happens on Windows; the error message there is:

"(unknown error) - exit code -97 (0xffffff9f)"
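
In case the numbers in brackets look confusing: they appear to be the same exit status written in different ways. Here is a minimal Python sketch, assuming the client prints the raw code, its hex form, and a signed reinterpretation (8-bit in the Linux message, 32-bit in the Windows one). The helper names `as_signed` and `as_unsigned_hex` are made up for illustration and are not part of BOINC.

```python
def as_signed(value, bits):
    """Reinterpret an integer as a two's-complement signed value of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    # If the top bit is set, the value is negative in two's complement.
    return value - (1 << bits) if value >> (bits - 1) else value


def as_unsigned_hex(value, bits):
    """Show an integer (possibly negative) as unsigned hex of the given width."""
    mask = (1 << bits) - 1
    return hex(value & mask)


# Linux message: "process exited with code 212 (0xd4, -44)"
print(as_unsigned_hex(212, 8), as_signed(212, 8))    # 0xd4 -44

# Windows message: "exit code -97 (0xffffff9f)"
print(as_unsigned_hex(-97, 32))                      # 0xffffff9f
```

So 212 and -44 would be one byte read as unsigned or signed, and 0xffffff9f would be -97 stored in 32 bits. This only explains the notation; it says nothing about what causes the errors.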

Bedrich Hajek
Joined: 28 Mar 09
Posts: 490
Credit: 11,739,145,728
RAC: 116,723
Level
Trp
Message 42539 - Posted: 3 Jan 2016, 1:23:18 UTC - in response to Message 42537.  
Last modified: 3 Jan 2016, 1:24:23 UTC

Hi,

I'm seeing this happen with the last downloaded units; wingmen also get the same error:

"process exited with code 212 (0xd4, -44)"

Not sure, but it could be only for Linux WUs.



Yes, there seems to be a batch of WUs that are failing on previously reliable Linux machines and on some mostly bad Windows hosts, but they are running fine on my Windows computers. One has already completed successfully.

See links:

https://www.gpugrid.net/workunit.php?wuid=11397999


https://www.gpugrid.net/workunit.php?wuid=11398213


https://www.gpugrid.net/workunit.php?wuid=11398820


https://www.gpugrid.net/workunit.php?wuid=11398294

Max Ringler
Joined: 27 Apr 15
Posts: 2
Credit: 147,218,248
RAC: 0
Level
Cys
Message 42540 - Posted: 3 Jan 2016, 9:15:30 UTC

On my Windows 7 machine (i7-3770, GTX 980) I had ~10 GERARD WUs (with more in the queue and still coming in) that were running at less than 1% GPU usage (according to GPU-Z), while the progress in the BOINC manager appeared to be normal or a little slow (~15 hours estimated per WU). All these WUs suddenly disappeared from the BOINC manager without any error message, and without showing up in my results in my GPUGRID stats. Certainly there is something wrong with these WUs!

Max Ringler
Joined: 27 Apr 15
Posts: 2
Credit: 147,218,248
RAC: 0
Level
Cys
Message 42541 - Posted: 3 Jan 2016, 9:23:45 UTC - in response to Message 42540.  

I missed the other WUs, but right now this happened to the WU:

e14s27_e9s23p1f368-GERARD_CXCL12_DIM_HEP_GLYCAM-0-1-RND5008

This WU was running at <1% GPU usage but at close to normal progress speed; however, it was restarting every ~10 hours or so. I have now cancelled this WU, and the next one in my queue seems to work normally again (e13s16_e8s26p11f203-GERARD_CXCL12_DIMPROTO3-0-1-RND2849; estimated time ~12 hours, 82% GPU usage).

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Level
Trp
Message 42542 - Posted: 3 Jan 2016, 10:16:21 UTC


Bedrich Hajek
Joined: 28 Mar 09
Posts: 490
Credit: 11,739,145,728
RAC: 116,723
Level
Trp
Message 42546 - Posted: 3 Jan 2016, 12:15:47 UTC - in response to Message 42542.  
Last modified: 3 Jan 2016, 12:25:56 UTC

I have both kinds of these WUs:

1. Erroring on all hosts, including mine.
https://www.gpugrid.net/workunit.php?wuid=11396918
https://www.gpugrid.net/workunit.php?wuid=11396911

2. Erroring on all hosts, except on mine:
https://www.gpugrid.net/workunit.php?wuid=11397526
https://www.gpugrid.net/workunit.php?wuid=11398513
https://www.gpugrid.net/workunit.php?wuid=11397102
https://www.gpugrid.net/workunit.php?wuid=11397012
https://www.gpugrid.net/workunit.php?wuid=11398161
https://www.gpugrid.net/workunit.php?wuid=11398515
https://www.gpugrid.net/workunit.php?wuid=11396116
https://www.gpugrid.net/workunit.php?wuid=11398187


So how many errors did you get recently? If it's a small number, you could attribute that to running into the occasional bad WU. If you have a lot more, then it's more than just a Linux problem.

For the record, I have had 2 errors since the new year. All WUs on my machines are currently running okay and I hope it stays that way! So, I would say that I ran into 2 bad WUs.

ServicEnginIC
Joined: 24 Sep 10
Posts: 593
Credit: 12,146,936,510
RAC: 4,406,248
Level
Trp
Message 42547 - Posted: 3 Jan 2016, 12:16:31 UTC

I've found the same behavior on my Linux hosts, in WUs received since Jan-02-2016 after midday.
As a result, the statistics are getting worse, possibly because of those failing Linux WUs.
This can be seen at the bottom of the "Server status" page:

https://www.gpugrid.net/server_status.php

On Jan-02-2016 at 22:41 UTC, the average error rate over the 25 kinds of WUs in progress was 20.9952%.
By 11:44 UTC on Jan-03-2016 it had increased to 25.7552%.
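
To put the two figures in perspective, here is a quick Python sketch of the arithmetic. It only uses the two error rates quoted above; the function name `rate_change` is made up for illustration.

```python
def rate_change(before_pct, after_pct):
    """Return (change in percentage points, relative change in percent)."""
    points = after_pct - before_pct
    relative = points / before_pct * 100.0
    return points, relative


# Error rates read from the server status page at the two times quoted above.
points, relative = rate_change(20.9952, 25.7552)
print(f"{points:.2f} percentage points, {relative:.1f}% relative increase")
# 4.76 percentage points, 22.7% relative increase
```

That is a rise of about 4.76 percentage points in roughly 13 hours, or nearly a quarter more errors relative to the earlier figure.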


Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Level
Trp
Message 42549 - Posted: 3 Jan 2016, 14:07:23 UTC - in response to Message 42546.  

So how many errors did you get recently? If it's a small number, you could attribute that to running into the occasional bad WU. If you have a lot more, then it's more than just a Linux problem.

I have had four errors recently. That's a bit more than usual. The two aborted WUs are my fault.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 42550 - Posted: 3 Jan 2016, 15:00:26 UTC

I haven't seen the problem yet on a pair of GTX 960s.
https://www.gpugrid.net/results.php?hostid=194224&offset=0&show_names=0&state=0&appid=

I had originally boosted the P2 memory clock as per ETA's suggestion (https://einstein.phys.uwm.edu/forum_thread.php?id=11044), but saw a few "simulation unstable" messages, though I don't think they led to actual errors at that point. That was a little too close to the edge for me, so I removed the boost and the cards are back to factory default, which is not much of an overclock on these MSI 2GD5T OC cards. Maybe that keeps them stable on the most difficult work units.

northcup
Joined: 29 Dec 15
Posts: 1
Credit: 135,300
RAC: 0
Level

Message 42551 - Posted: 3 Jan 2016, 16:55:40 UTC

14814161 11399908 286919 3 Jan 2016 | 16:38:22 UTC 3 Jan 2016 | 16:39:05 UTC Error while computing 0.00 0.00 --- Long runs

14814079 11399366 286919 3 Jan 2016 | 16:16:38 UTC 3 Jan 2016 | 16:32:43 UTC Error while computing 0.00 0.00 ---

14813534 11399465 286919 3 Jan 2016 | 13:04:05 UTC 3 Jan 2016 | 13:06:03 UTC Error while computing 0.00 0.00 ---

14801182 11384321 286919 29 Dec 2015 | 20:21:34 UTC 1 Jan 2016 | 9:50:05 UTC Completed and validated 212,450.23 4,110.21 135,300.00 Long runs

Same problem here, with a valid run from December last year. Greetings, Klaus

Rion Family
Joined: 13 Jan 14
Posts: 21
Credit: 15,415,926,517
RAC: 0
Level
Trp
Message 42553 - Posted: 3 Jan 2016, 17:52:02 UTC
Last modified: 3 Jan 2016, 17:53:18 UTC

I have seen the same thing on my Linux host: all work units since the one below error out the same way.

Stderr output
<core_client_version>7.3.15</core_client_version>
<![CDATA[
<message>
process exited with code 212 (0xd4, -44)
</message>
<stderr_txt>

</stderr_txt>
]]>

14811283 11398815 176528 3 Jan 2016 | 0:28:00 UTC 3 Jan 2016 | 1:08:39 UTC Error while computing 0.00 0.00 --- Long runs (8-12 hours on fastest card) v8.46 (cuda65)
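
If anyone wants to sift through many of these reports, here is a rough Python sketch that pulls the exit code out of the stderr text and checks whether the app wrote anything at all. It assumes the report looks like the sample pasted above; the names `parse_exit` and `stderr_is_empty` are my own and not part of BOINC.

```python
import re

# Sample copied from the stderr output quoted in this post.
STDERR_SAMPLE = """<core_client_version>7.3.15</core_client_version>
<![CDATA[
<message>
process exited with code 212 (0xd4, -44)
</message>
<stderr_txt>

</stderr_txt>
]]>"""

# Matches both wordings seen in this thread:
#   "process exited with code 212 (0xd4, -44)"
#   "(unknown error) - exit code -97 (0xffffff9f)"
EXIT_RE = re.compile(r"(?:exited with code|exit code)\s+(-?\d+)\s+\((0x[0-9a-fA-F]+)")


def parse_exit(report):
    """Return (decimal code, hex string) or None if no exit code is found."""
    match = EXIT_RE.search(report)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def stderr_is_empty(report):
    """True if a <stderr_txt> section exists and contains only whitespace."""
    match = re.search(r"<stderr_txt>(.*?)</stderr_txt>", report, re.DOTALL)
    return match is not None and match.group(1).strip() == ""


print(parse_exit(STDERR_SAMPLE))       # (212, '0xd4')
print(stderr_is_empty(STDERR_SAMPLE))  # True
```

An empty `<stderr_txt>` only tells us that the app wrote nothing before it exited; it does not say why.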

opr
Joined: 24 May 11
Posts: 7
Credit: 93,272,937
RAC: 0
Level
Thr
Message 42554 - Posted: 3 Jan 2016, 19:04:13 UTC

Hello, I'm using Ubuntu 14.04 LTS. Gerard WUs stopped after 1 second and were uploaded, with "Output file was absent" reported for four files at a time. I ran some Collatz Conjecture work earlier today, but I guess that didn't mess up my computer, as others are having problems too.

opr

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 42556 - Posted: 3 Jan 2016, 20:26:43 UTC

Not sure if it's related, but I too just had an error with a Gerard unit, which is a rare thing to happen for me.

http://www.gpugrid.net/workunit.php?wuid=11389493
Exit status 194 (0xc2) EXIT_ABORTED_BY_CLIENT
(unknown error) - exit code 194 (0xc2)

Name e3s31_e2s25p1f424-GERARD_CXCL12_CHALC4_DIM1-0-1-RND7047_1
Workunit 11389493
Created 1 Jan 2016 | 22:45:42 UTC
Sent 1 Jan 2016 | 22:45:48 UTC
Received 3 Jan 2016 | 11:06:22 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 194 (0xc2) EXIT_ABORTED_BY_CLIENT
Computer ID 153764
Report deadline 6 Jan 2016 | 22:45:48 UTC
Run time 80,101.12
CPU time 11,903.64
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.47 (cuda65)
Stderr output

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 194 (0xc2)
</message>

Stroppy
Joined: 10 Feb 09
Posts: 4
Credit: 2,771,097,960
RAC: 163,981
Level
Phe
Message 42561 - Posted: 4 Jan 2016, 18:09:44 UTC

Since 16:48 UTC on 2 January, my Linux host (206986) has failed every WU it has received. My two Windows hosts are working as usual. A quick look through the task lists for the top 10 users shows the same pattern. Has anyone come up with a theory as to what is happening? In the meantime I have set that host to NNT (No New Tasks) to avoid causing any congestion on the server side.

Trotador
Joined: 25 Mar 12
Posts: 103
Credit: 14,948,929,771
RAC: 0
Level
Trp
Message 42562 - Posted: 4 Jan 2016, 18:29:57 UTC

This issue is still occurring on all my hosts (Linux).

My guess is that the administrators are still on holiday. No complaint; they deserve it.

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Message 42563 - Posted: 4 Jan 2016, 23:13:51 UTC - in response to Message 42562.  

The Linux app binary has expired and needs to be updated. I'll get that done tomorrow, hopefully.

Stoneageman
Joined: 25 May 09
Posts: 224
Credit: 34,057,374,498
RAC: 0
Level
Trp
Message 42565 - Posted: 5 Jan 2016, 10:46:48 UTC

Thanks, Matt. Hope the update will improve its performance.

God is Love, JC proves it. I t...
Joined: 24 Nov 11
Posts: 30
Credit: 201,648,059
RAC: 0
Level
Leu
Message 42566 - Posted: 7 Jan 2016, 0:05:06 UTC

WU e15s19_e14s24p1f286-GERARD_CXCL12_DIMPROTO3-0-1-RND3500_2 has been stuck at 85% "progress" for some 12 hours now.
I only have a 640, so WUs generally take 40-60 hours.
This task has already run for 69:58.
Is this part of a defective batch?
How many more hours should I sacrifice for this WU?
I presume that if I abort it, there will be zero credit for these 70 hours (even if it is a flawed WU?).

I run Win 7 on my HP-1120, i7-2600. (I am NOT going to 'upgrade' to Win 10 for months, until (I hope) MS gets all the garbage in Win8-10 patched up.)

Please advise.

Meanwhile I have paused it and am putting my GPU to better use.

Thanks.

I think ∴ I THINK I am
My thinking neither is the source of my being
NOR proves it to you
God Is Love, Jesus proves it! ∴ we are

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 42567 - Posted: 7 Jan 2016, 2:52:09 UTC - in response to Message 42566.  

I'd suggest restarting the PC. And if the problem still persists, then abort the task.

©2026 Universitat Pompeu Fabra