All Gerard WUs erroring

Author	Message
Trotador Send message Joined: 25 Mar 12 Posts: 103 Credit: 13,920,977,393 RAC: 695,388 Level Scientific publications	Message 42537 - Posted: 2 Jan 2016 \| 20:14:14 UTC
	Hi, I'm seeing this happen with the most recently downloaded units, and my wingmen get the same error: "process exited with code 212 (0xd4, -44)". I'm not sure, but it could be affecting only Linux WUs.

Trotador Send message Joined: 25 Mar 12 Posts: 103 Credit: 13,920,977,393 RAC: 695,388 Level Scientific publications	Message 42538 - Posted: 2 Jan 2016 \| 23:41:03 UTC - in response to Message 42537.
	It also happens on Windows, with the error message "(unknown error) - exit code -97 (0xffffff9f)".
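
The two error strings reported so far each show a single exit code in more than one notation. As a minimal sketch, assuming the values in parentheses are ordinary two's-complement readings (8-bit for the Linux message, 32-bit for the Windows one; this is inferred from the numbers themselves, not from BOINC documentation), the conversions can be checked as follows. The helper names `to_signed` and `to_unsigned` are hypothetical.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement signed integer."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        return value - (1 << bits)
    return value


def to_unsigned(value: int, bits: int) -> int:
    """Interpret a (possibly negative) integer as an unsigned value of width `bits`."""
    return value & ((1 << bits) - 1)


# Linux report: "process exited with code 212 (0xd4, -44)"
print(hex(212), to_signed(212, 8))      # 0xd4 -44

# Windows report: "exit code -97 (0xffffff9f)"
print(hex(to_unsigned(-97, 32)))        # 0xffffff9f
```

Under that assumption, 212 and -44 are the same byte read as unsigned and signed, so the Linux and Windows reports carry different codes (-44 versus -97) and should not be treated as one identical failure.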

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 486 Credit: 11,351,770,243 RAC: 9,384,821 Level Scientific publications	Message 42539 - Posted: 3 Jan 2016 \| 1:23:18 UTC - in response to Message 42537. Last modified: 3 Jan 2016 \| 1:24:23 UTC
	Quoting Message 42537: Hi, I'm seeing this happening with the last downloaded units, wingmen also have the same error "process exited with code 212 (0xd4, -44)" Not sure but it could be only for linux WUs
	Yes, there seems to be a batch of WUs that are failing on previously reliable Linux machines and on some mostly unreliable Windows hosts, but they are running fine on my Windows computers. One has already completed successfully at this time. See links:
	https://www.gpugrid.net/workunit.php?wuid=11397999
	https://www.gpugrid.net/workunit.php?wuid=11398213
	https://www.gpugrid.net/workunit.php?wuid=11398820
	https://www.gpugrid.net/workunit.php?wuid=11398294

Max Ringler Send message Joined: 27 Apr 15 Posts: 2 Credit: 147,218,248 RAC: 0 Level Scientific publications	Message 42540 - Posted: 3 Jan 2016 \| 9:15:30 UTC
	On my Windows 7 machine (i7-3770, GTX 980) I recently had about 10 GERARD WUs (with more in the queue and still coming in) that were running at less than 1% GPU usage (according to GPU-Z), while the progress shown in the BOINC Manager appeared normal or a little slow (about 15 hours estimated per WU). All of these WUs suddenly disappeared from the BOINC Manager without any error message, and without showing up in my results in my GPUGRID stats. There is certainly something wrong with these WUs!

Max Ringler Send message Joined: 27 Apr 15 Posts: 2 Credit: 147,218,248 RAC: 0 Level Scientific publications	Message 42541 - Posted: 3 Jan 2016 \| 9:23:45 UTC - in response to Message 42540.
	I missed the other WUs, but just now this happened to the WU e14s27_e9s23p1f368-GERARD_CXCL12_DIM_HEP_GLYCAM-0-1-RND5008. It was running at <1% GPU usage but at close to normal progress speed; however, it was restarting every ~10 hours or so. I have now cancelled this WU, and the next one in my queue seems to work normally again (e13s16_e8s26p11f203-GERARD_CXCL12_DIMPROTO3-0-1-RND2849; estimated time ~12 hours, 82% GPU usage).

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2353 Credit: 16,375,531,916 RAC: 5,811,976 Level Scientific publications	Message 42542 - Posted: 3 Jan 2016 \| 10:16:21 UTC
	I have both kinds of these WUs:
	1. Erroring on all hosts, including mine:
	https://www.gpugrid.net/workunit.php?wuid=11396918
	https://www.gpugrid.net/workunit.php?wuid=11396911
	2. Erroring on all hosts, except on mine:
	https://www.gpugrid.net/workunit.php?wuid=11397526
	https://www.gpugrid.net/workunit.php?wuid=11398513
	https://www.gpugrid.net/workunit.php?wuid=11397102
	https://www.gpugrid.net/workunit.php?wuid=11397012
	https://www.gpugrid.net/workunit.php?wuid=11398161
	https://www.gpugrid.net/workunit.php?wuid=11398515
	https://www.gpugrid.net/workunit.php?wuid=11396116
	https://www.gpugrid.net/workunit.php?wuid=11398187

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 486 Credit: 11,351,770,243 RAC: 9,384,821 Level Scientific publications	Message 42546 - Posted: 3 Jan 2016 \| 12:15:47 UTC - in response to Message 42542. Last modified: 3 Jan 2016 \| 12:25:56 UTC
	Quoting Message 42542: I have both kind of these WUs: 1. Erroring on all hosts, including mine. https://www.gpugrid.net/workunit.php?wuid=11396918 https://www.gpugrid.net/workunit.php?wuid=11396911 2. Erroring on all hosts, except on mine: https://www.gpugrid.net/workunit.php?wuid=11397526 https://www.gpugrid.net/workunit.php?wuid=11398513 https://www.gpugrid.net/workunit.php?wuid=11397102 https://www.gpugrid.net/workunit.php?wuid=11397012 https://www.gpugrid.net/workunit.php?wuid=11398161 https://www.gpugrid.net/workunit.php?wuid=11398515 https://www.gpugrid.net/workunit.php?wuid=11396116 https://www.gpugrid.net/workunit.php?wuid=11398187
	So how many errors did you get recently? If it's a small number, you could attribute that to running into the occasional bad WU. If you have a lot more, then it's more than just a Linux problem. For the record, I have had 2 errors since the new year. All WUs on my machines are currently running okay, and I hope it stays that way! So I would say that I ran into 2 bad WUs.

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 581 Credit: 10,205,412,529 RAC: 16,950,825 Level Scientific publications	Message 42547 - Posted: 3 Jan 2016 \| 12:16:31 UTC
	I've found the same behavior on my Linux hosts, in WUs received since Jan-02-2016 after midday. Consequently, the statistics are getting worse, possibly due to those failing Linux WUs. This can be seen at the bottom of the "Server status" page: https://www.gpugrid.net/server_status.php On Jan-02-2016 at 22:41 UTC, the average error rate over the 25 kinds of WUs in progress was 20.9952%. It had increased to 25.7552% by 11:44 UTC on Jan-03-2016. ____________
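
To put the two error-rate figures from the post above side by side, here is a small sketch that parses them and reports the change. It assumes only the two values quoted in the post; the helper name `parse_percent` is hypothetical and accepts either a decimal comma or a decimal point.

```python
def parse_percent(text: str) -> float:
    """Parse a percentage such as '20,9952 %' or '20.9952%' into a float."""
    return float(text.replace("%", "").strip().replace(",", "."))


before = parse_percent("20,9952 %")   # 2 Jan 2016, 22:41 UTC
after = parse_percent("25,7552 %")    # 3 Jan 2016, 11:44 UTC

delta_points = after - before             # change in percentage points
relative = delta_points / before * 100    # change relative to the earlier rate

print(f"{delta_points:.2f} percentage points, {relative:.1f}% relative increase")
```

The distinction matters when comparing reports: the rate rose by under five percentage points, but relative to the earlier figure that is an increase of more than a fifth in roughly 13 hours.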

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2353 Credit: 16,375,531,916 RAC: 5,811,976 Level Scientific publications	Message 42549 - Posted: 3 Jan 2016 \| 14:07:23 UTC - in response to Message 42546.
	Quoting Message 42546: So how many errors did you get recently? If it's a small number, you could attribute that to running into the occasional bad WU. If you have a lot more, then it's more than just a Linux problem.
	I have had four errors recently, which is a bit more than usual. The two aborted WUs are my fault.

Jim1348 Send message Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level Scientific publications	Message 42550 - Posted: 3 Jan 2016 \| 15:00:26 UTC
	I haven't seen the problem yet on a pair of GTX 960s. https://www.gpugrid.net/results.php?hostid=194224&offset=0&show_names=0&state=0&appid= I had originally boosted the P2 memory clock as per ETA's suggestion (https://einstein.phys.uwm.edu/forum_thread.php?id=11044), but saw a few "simulation unstable" messages, though I don't think they led to actual errors at that point. That was a little too close to the edge for me, so I removed the boost and the cards are back to factory default, which is not much of an overclock on these MSI 2GD5T OC cards. Maybe that keeps them stable on the most difficult work units.

northcup Send message Joined: 29 Dec 15 Posts: 1 Credit: 135,300 RAC: 0 Level Scientific publications	Message 42551 - Posted: 3 Jan 2016 \| 16:55:40 UTC
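	Task 14814161, work unit 11399908, computer 286919: sent 3 Jan 2016 \| 16:38:22 UTC, reported 3 Jan 2016 \| 16:39:05 UTC, Error while computing, run time 0.00, CPU time 0.00, credit ---, application Long runs
	Task 14814079, work unit 11399366, computer 286919: sent 3 Jan 2016 \| 16:16:38 UTC, reported 3 Jan 2016 \| 16:32:43 UTC, Error while computing, run time 0.00, CPU time 0.00, credit ---
	Task 14813534, work unit 11399465, computer 286919: sent 3 Jan 2016 \| 13:04:05 UTC, reported 3 Jan 2016 \| 13:06:03 UTC, Error while computing, run time 0.00, CPU time 0.00, credit ---
	Task 14801182, work unit 11384321, computer 286919: sent 29 Dec 2015 \| 20:21:34 UTC, reported 1 Jan 2016 \| 9:50:05 UTC, Completed and validated, run time 212,450.23, CPU time 4,110.21, credit 135,300.00, application Long runs
	Same problem here, with one valid run from December last year. Greets, Klaus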

Rion Family Send message Joined: 13 Jan 14 Posts: 21 Credit: 15,415,926,517 RAC: 22 Level Scientific publications	Message 42553 - Posted: 3 Jan 2016 \| 17:52:02 UTC Last modified: 3 Jan 2016 \| 17:53:18 UTC
	I have seen the same thing on my Linux host. All work units since the one below error out the same way.
	Stderr output:
	<core_client_version>7.3.15</core_client_version>
	<![CDATA[
	<message>
	process exited with code 212 (0xd4, -44)
	</message>
	<stderr_txt>
	</stderr_txt>
	]]>
	Task 14811283, work unit 11398815, computer 176528: sent 3 Jan 2016 \| 0:28:00 UTC, reported 3 Jan 2016 \| 1:08:39 UTC, Error while computing, run time 0.00, CPU time 0.00, credit ---, application Long runs (8-12 hours on fastest card) v8.46 (cuda65)
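
When comparing many failed tasks, it can help to pull the exit code out of the pasted stderr text rather than reading each one by eye. The following is a rough sketch, assuming only the message wording seen in this thread ("process exited with code N (0x..." and "exit code N (0x..."); the regular expression and the function name `extract_exit_code` are hypothetical and may not cover other BOINC message formats.

```python
import re

# Pattern inferred from the messages quoted in this thread, not from BOINC documentation.
EXIT_RE = re.compile(r"exit(?:ed with)? code (-?\d+) \((0x[0-9a-fA-F]+)")


def extract_exit_code(stderr_output: str):
    """Return (decimal_code, hex_text) from a stderr message, or None if absent."""
    match = EXIT_RE.search(stderr_output)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


samples = [
    "<message> process exited with code 212 (0xd4, -44) </message>",
    "<message> (unknown error) - exit code -97 (0xffffff9f) </message>",
    "<message> (unknown error) - exit code 194 (0xc2) </message>",
]
for text in samples:
    print(extract_exit_code(text))
```

Grouping results by the extracted code would separate the Linux failures (212) from the Windows ones (-97) and from client-side aborts (194).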

opr Send message Joined: 24 May 11 Posts: 7 Credit: 93,272,937 RAC: 0 Level Scientific publications	Message 42554 - Posted: 3 Jan 2016 \| 19:04:13 UTC
	Hello, I'm using Ubuntu 14.04 LTS. GERARD WUs stopped after 1 second and were uploaded, with "Output file was absent" reported for four files at a time. I ran some Collatz Conjecture tasks earlier today, but I guess that didn't mess up my computer, as others are having problems too. opr

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 42556 - Posted: 3 Jan 2016 \| 20:26:43 UTC
	Not sure if it's related, but I too just had an error with a GERARD unit, which rarely happens for me.
	http://www.gpugrid.net/workunit.php?wuid=11389493
	Exit status 194 (0xc2) EXIT_ABORTED_BY_CLIENT
	(unknown error) - exit code 194 (0xc2)
	Name: e3s31_e2s25p1f424-GERARD_CXCL12_CHALC4_DIM1-0-1-RND7047_1
	Workunit: 11389493
	Created: 1 Jan 2016 \| 22:45:42 UTC
	Sent: 1 Jan 2016 \| 22:45:48 UTC
	Received: 3 Jan 2016 \| 11:06:22 UTC
	Server state: Over
	Outcome: Computation error
	Client state: Compute error
	Exit status: 194 (0xc2) EXIT_ABORTED_BY_CLIENT
	Computer ID: 153764
	Report deadline: 6 Jan 2016 \| 22:45:48 UTC
	Run time: 80,101.12
	CPU time: 11,903.64
	Validate state: Invalid
	Credit: 0.00
	Application version: Long runs (8-12 hours on fastest card) v8.47 (cuda65)
	Stderr output:
	<core_client_version>7.6.22</core_client_version>
	<![CDATA[
	<message>
	(unknown error) - exit code 194 (0xc2)
	</message>

Stroppy Send message Joined: 10 Feb 09 Posts: 4 Credit: 2,506,407,065 RAC: 941,467 Level Scientific publications	Message 42561 - Posted: 4 Jan 2016 \| 18:09:44 UTC
	Since 16:48 UTC on 2 January, my Linux host (206986) has failed every WU it has received. My 2 Windows hosts are working as usual. A quick look through the task lists of the top 10 users shows the same pattern. Has anyone come up with a theory as to what is happening? In the meantime I have set that host to NNT (No New Tasks) to avoid causing any congestion on the server side.

Trotador Send message Joined: 25 Mar 12 Posts: 103 Credit: 13,920,977,393 RAC: 695,388 Level Scientific publications	Message 42562 - Posted: 4 Jan 2016 \| 18:29:57 UTC
	This issue continues to occur on all my hosts (Linux). My guess is that the administrators are still on holiday. No complaint, they deserve it.

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 42563 - Posted: 4 Jan 2016 \| 23:13:51 UTC - in response to Message 42562.
	The Linux app binary has expired and needs to be updated. I'll get that done tomorrow, hopefully.

Stoneageman Send message Joined: 25 May 09 Posts: 224 Credit: 34,057,374,498 RAC: 17 Level Scientific publications	Message 42565 - Posted: 5 Jan 2016 \| 10:46:48 UTC
	Thanks, Matt. I hope the update will improve its performance.

God is Love, JC proves it... Send message Joined: 24 Nov 11 Posts: 30 Credit: 201,648,059 RAC: 216 Level Scientific publications	Message 42566 - Posted: 7 Jan 2016 \| 0:05:06 UTC
	WU e15s19_e14s24p1f286-GERARD_CXCL12_DIMPROTO3-0-1-RND3500_2 has been stuck at 85% "progress" for some 12 hours now. I only have a 640, so WUs generally take 40-60 hours. This task has already run for 69:58. Is this part of a defective batch? How many more hours should I sacrifice for this WU? I presume that if I abort it, there will be zero credit for these 70 hours (even if it is a flawed WU?). I run Win 7 on my HP-1120, i7-2600. (I am NOT going to 'upgrade' to Win 10 for months, until (I hope) MS gets all the garbage in Win 8-10 patched up.) Please advise. Meanwhile I have paused it and am putting my GPU to better use. Thanks. ____________ I think ∴ I THINK I am My thinking neither is the source of my being NOR proves it to you God Is Love, Jesus proves it! ∴ we are

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 42567 - Posted: 7 Jan 2016 \| 2:52:09 UTC - in response to Message 42566.
	I'd suggest restarting the PC. And if the problem still persists, then abort the task.

Stephen Farrell Send message Joined: 3 Nov 14 Posts: 10 Credit: 57,322,675 RAC: 0 Level Scientific publications	Message 42573 - Posted: 8 Jan 2016 \| 11:26:53 UTC
	Hi, I was wondering if others are still having this problem as the issue still persists on both my Linux boxes.

captainjack Send message Joined: 9 May 13 Posts: 171 Credit: 4,011,048,452 RAC: 16,275,505 Level Scientific publications	Message 42574 - Posted: 8 Jan 2016 \| 13:31:18 UTC
	Yep, the GPUGRID tasks are still not processing on my Linux boxes. But my backup project is getting quite a bit of work done.

Stephen Farrell Send message Joined: 3 Nov 14 Posts: 10 Credit: 57,322,675 RAC: 0 Level Scientific publications	Message 42576 - Posted: 8 Jan 2016 \| 16:09:52 UTC
	Okay, thanks for the update. I guess I'll just add a backup project myself until the issue is resolved. Cheers.

Trotador Send message Joined: 25 Mar 12 Posts: 103 Credit: 13,920,977,393 RAC: 695,388 Level Scientific publications	Message 42577 - Posted: 8 Jan 2016 \| 19:22:01 UTC
	Same here, no joy for Linux hosts five days in a row. We don't seem to be worth anything to the project.

nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 42583 - Posted: 9 Jan 2016 \| 14:30:56 UTC Last modified: 9 Jan 2016 \| 14:31:37 UTC
	The tasks were doing OK on my XP box. I moved the cards to a Win7 box and now they all error out in 2 seconds. Looks like moving the cards was a mistake but I can't move them back ATM.

nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 42585 - Posted: 9 Jan 2016 \| 20:32:11 UTC Last modified: 9 Jan 2016 \| 20:38:43 UTC
	I managed to download 1 task that didn't error out in 2 seconds. Fingers crossed. I'm still having issues getting tasks to download all the files needed to run. From the event log:
	4680 GPUGRID 1/9/2016 3:25:00 PM Temporarily failed download of e17s19_e13s27p1f405-GERARD_CXCL12_CHALC2_MON1-0-pdb_file: transient HTTP error
	After 5 attempts and 30 minutes the last file did finally download.
	EDIT: Second task downloaded and is running. Stay tuned.

microchip Send message Joined: 4 Sep 11 Posts: 110 Credit: 326,102,587 RAC: 0 Level Scientific publications	Message 42587 - Posted: 10 Jan 2016 \| 15:12:24 UTC
	Same here on Linux. All WUs error out, even after a reset of the project. ____________ Team Belgium

Stephen Farrell Send message Joined: 3 Nov 14 Posts: 10 Credit: 57,322,675 RAC: 0 Level Scientific publications	Message 42591 - Posted: 12 Jan 2016 \| 11:13:57 UTC - in response to Message 42585.
	Hi nanoprobe, did you successfully complete the work unit that started?

bormolino Send message Joined: 16 May 13 Posts: 41 Credit: 88,193,390 RAC: 936 Level Scientific publications	Message 42597 - Posted: 13 Jan 2016 \| 13:31:53 UTC
	Still not running under linux ...

nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 42598 - Posted: 13 Jan 2016 \| 19:38:08 UTC - in response to Message 42591.
	Quoting Message 42591: Hi nanoprobe, did you successfully complete the work unit that started?
	Yes. It had previously errored out on a Linux machine with 0 run time and on a Windows machine after about 60 minutes of run time. I have also received 6 more since that one that have completed, and I currently have 2 more in progress. For me, all the version 8.4.1 tasks error out. Version 8.4.7 tasks seem to run fine with only an occasional error, but unfortunately they run for hours before they go south.

Trotador Send message Joined: 25 Mar 12 Posts: 103 Credit: 13,920,977,393 RAC: 695,388 Level Scientific publications	Message 42600 - Posted: 13 Jan 2016 \| 20:29:36 UTC
	One day more without Linux crunching and without status info...who cares?

nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 42602 - Posted: 14 Jan 2016 \| 1:36:14 UTC - in response to Message 42600.
	Quoting Message 42600: One day more without Linux crunching and without status info...who cares?
	Someone didn't get their nap today.

Bikermatt Send message Joined: 8 Apr 10 Posts: 37 Credit: 3,883,905,352 RAC: 3,665,129 Level Scientific publications	Message 42615 - Posted: 15 Jan 2016 \| 2:01:56 UTC - in response to Message 42602.
	Quoting Message 42600: One day more without Linux crunching and without status info...who cares?
	Quoting Message 42602: Someone didn't get their nap today.
	No, don't be a jerk. This has been a known problem with a known cause for a week now and no one has bothered to fix it. For many years there was a significant performance boost when crunching with Linux at this project. The developers actually recommended that you crunch with Linux. Many of us have dedicated Linux hosts to this project due to that fact. Now my Linux hosts are having to crunch mathematics crap and look for pulsars to keep my house warm. Could someone please fix this?

nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 42616 - Posted: 15 Jan 2016 \| 4:32:33 UTC - in response to Message 42615.
	Quoting Message 42600: One day more without Linux crunching and without status info...who cares?
	Quoting Message 42602: Someone didn't get their nap today.
	Quoting Message 42615: No, don't be a jerk. This has been a known problem with a known cause for a week now and no one has bothered to fix it. For many years there was a significant performance boost when crunching with Linux at this project. The developers actually recommended that you crunch with Linux. Many of us have dedicated Linux hosts to this project due to that fact. Now my Linux hosts are having to crunch mathematics crap and look for pulsars to keep my house warm. Could someone please fix this?
	No nap and lost your sense of humor? Go look in a mirror and take a chill pill, man. This ain't life or death and GPUGrid doesn't revolve around you.

nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 42617 - Posted: 15 Jan 2016 \| 4:33:04 UTC - in response to Message 42615. Last modified: 15 Jan 2016 \| 4:35:18 UTC
	That was weird. Triple post.?????

nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 42618 - Posted: 15 Jan 2016 \| 4:33:21 UTC - in response to Message 42615. Last modified: 15 Jan 2016 \| 4:36:53 UTC
	Can't explain the triple post.

Trotador Send message Joined: 25 Mar 12 Posts: 103 Credit: 13,920,977,393 RAC: 695,388 Level Scientific publications	Message 42619 - Posted: 15 Jan 2016 \| 5:19:04 UTC - in response to Message 42618.
	Quoting Message 42618: Can't explain the triple post.
	You missed your nap? :)

Gerard Send message Joined: 26 Mar 14 Posts: 101 Credit: 0 RAC: 0 Level Scientific publications	Message 42620 - Posted: 15 Jan 2016 \| 10:35:21 UTC - in response to Message 42615.
	Guys! Matt is trying to fix it, see https://www.gpugrid.net/forum_thread.php?id=4235 . Apparently the solution must not be trivial. Please be patient!

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 486 Credit: 11,351,770,243 RAC: 9,384,821 Level Scientific publications	Message 42621 - Posted: 15 Jan 2016 \| 11:45:24 UTC Last modified: 15 Jan 2016 \| 11:53:58 UTC
	Now I am getting this same "Linux" error on both my Windows machines. https://www.gpugrid.net/hosts_user.php?userid=19626 Also, when I downloaded a new unit and suspended a good unit to test the new one, the new unit crashed, and when I resumed the previously good unit, it crashed too.

nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 42622 - Posted: 15 Jan 2016 \| 12:24:05 UTC - in response to Message 42619. Last modified: 15 Jan 2016 \| 12:35:01 UTC
	Quoting Message 42618: Can't explain the triple post.
	Quoting Message 42619: You missed your nap? :)
	Or I fell asleep at the keyboard. ;-) FWIW, most of the tasks I'm getting are resends that have failed at least once on a Linux host. So far they have all run to completion on my host (Win7, Xeon E5 2683, twin GTX 970). Along with the GPUGrid tasks I'm also running a full load of CPU tasks, minus 2 threads for each of the cards, if that means anything.

TJ Send message Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level Scientific publications	Message 42625 - Posted: 15 Jan 2016 \| 14:42:34 UTC
	Six errors in the last day on Windows, so has nothing to do with the Linux app. Also the site is very slow at the moment. ____________ Greetings from TJ

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 42626 - Posted: 15 Jan 2016 \| 14:49:17 UTC
	New app 848. Matt

nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 42627 - Posted: 15 Jan 2016 \| 15:07:13 UTC - in response to Message 42626.
	Quoting Message 42626: New app 848. Matt
	Matt, is it just me or are others having download issues? From the logs:
	13444 GPUGRID 1/15/2016 10:05:43 AM Temporarily failed download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-e1s22_2-GERARD_A2ARFX_luf6806_b1-0-2-RND6779_1: transient HTTP error
	13445 GPUGRID 1/15/2016 10:05:43 AM Backing off 00:02:20 on download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-e1s22_2-GERARD_A2ARFX_luf6806_b1-0-2-RND6779_1
	13446 GPUGRID 1/15/2016 10:05:44 AM Temporarily failed download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-psf_file: transient HTTP error
	I'm only having this issue here. Sometimes it takes hours to get all the files for 1 task to run.
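
For anyone who wants to see how often downloads are failing, and for which files, a saved copy of the event log can be tallied with a few lines of Python. This is only a sketch: it assumes the log lines look like the ones pasted above, and the pattern and the function name `count_failed_downloads` are hypothetical rather than part of BOINC.

```python
import re
from collections import Counter

# Matches the "Temporarily failed download of <file>: transient HTTP error" lines quoted above.
FAILED_RE = re.compile(r"Temporarily failed download of (\S+): transient HTTP error")


def count_failed_downloads(log_lines):
    """Count transient download failures per file name."""
    failures = Counter()
    for line in log_lines:
        match = FAILED_RE.search(line)
        if match:
            failures[match.group(1)] += 1
    return failures


sample = [
    "13444 GPUGRID 1/15/2016 10:05:43 AM Temporarily failed download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-e1s22_2-GERARD_A2ARFX_luf6806_b1-0-2-RND6779_1: transient HTTP error",
    "13445 GPUGRID 1/15/2016 10:05:43 AM Backing off 00:02:20 on download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-e1s22_2-GERARD_A2ARFX_luf6806_b1-0-2-RND6779_1",
    "13446 GPUGRID 1/15/2016 10:05:44 AM Temporarily failed download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-psf_file: transient HTTP error",
]

for name, count in count_failed_downloads(sample).items():
    print(count, name)
```

The "Backing off" line is deliberately not counted, since it records the retry delay rather than a separate failure.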

caffeineyellow5 Send message Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0 Level Scientific publications	Message 42628 - Posted: 15 Jan 2016 \| 16:15:14 UTC - in response to Message 42622. Last modified: 15 Jan 2016 \| 16:23:07 UTC
	Across 5 computers, all Windows (7, 8.1, and 10), I have had about 130 errored-out WUs in the past 24-30 hours. Over 100 of them are 0.00-second errors, and the other 30 or so failed after some run time. One of the computers crashed unrecoverably and needed manual intervention in the BIOS and then the OS to get it back to a good state and running again. This either caused aborted WUs or was caused by WUs. I didn't get a chance to check logs, drivers, and recent updates; I just got it running again and walked away frustrated. I did notice that all my Windows machines have wanted or done reboots for several Windows Updates, some having to do with drivers such as power management, but most of them the generic, ominous "Update to Windows" or "Security Update for Windows" for which you need to actually read the KB articles to find out what they were. I am pretty sure they all wanted at least one if not two reboots in the past 7 days. I am not sure whether this is significant to any issue with errored WUs, but I would suppose it would affect more than just me if it was. I did notice that two of the task types (GERARD_CXCL12_FXCHALC4_DIM and GERARD_CXCL12_FXCHALC4_MON) had a 100% error rate, so I manually aborted about 5 of them at less than an hour in. They might have been fixed and actually finished, but it was a gut reaction. Before these recent errors I was hovering around 28-32 errors for all of the computers combined, so coming in to see 160 today was a shock. Then seeing that they didn't even spend time crunching eased my fears a bit that it was something entirely on my end. I hope this all works out; after all, WUs are going to error out occasionally, and bad batches have been released and fixed in every project since distributed computing began. But if there is something I can do on my end to help decrease these errors, let me know that too.
____________ 1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!" Ephesians 6:18-20, please ;-) http://tbc-pa.org

Trotador Send message Joined: 25 Mar 12 Posts: 103 Credit: 13,920,977,393 RAC: 695,388 Level Scientific publications	Message 42647 - Posted: 17 Jan 2016 \| 7:26:19 UTC - in response to Message 42626.
	Quoting Message 42626: New app 848. Matt
	Working OK on my host.

Stephen Farrell Send message Joined: 3 Nov 14 Posts: 10 Credit: 57,322,675 RAC: 0 Level Scientific publications	Message 42650 - Posted: 17 Jan 2016 \| 11:01:21 UTC - in response to Message 42647.
	Quoting Message 42626: New app 848. Matt
	Quoting Message 42647: working OK in my host
	Same here on my Linux hosts.

caffeineyellow5 Send message Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0 Level Scientific publications	Message 42678 - Posted: 23 Jan 2016 \| 6:41:20 UTC - in response to Message 42628.
	Quoting Message 42628: Across 5 computers, all windows ranging from 7, 8.1, and 10, I have had about 130 errored out WUs in the past 24-30 hours. Over 100 of them are 0.00 second errors and the other 30 or so are with time put in. One of the computers crashed to an unrecoverable crash and needed manual assistance in the BIOS then the OS to get it back to a good state and running again. This caused aborted WUs or was caused by WUs. I didn't get a chance to check logs and drivers and recent updates and stuff, I just got it running again and walked away frustrated. I did notice all my Windows machines have wanted or done reboots for several Windows Updates, some having to do with drivers such as power management, but most the general ominous "Update to Windows" or "Security Update for Windows" that you need to actually read KB articles to find out what they were. I am pretty sure they all wanted at least one if not two reboots in the past 7 days. I am not sure if this is significant to any issue with errored WUs, but I would suppose it would be more than just me if it was.
	Maybe it is a huge coincidence and maybe not, but another one of my computers died and would not boot to anything but the BIOS. The one mentioned above runs Windows 7 and this one runs Windows 10, but both have the same processor (i7-4790K) and motherboard (Asus Z97-AR). Both are just about 11 months old. With the first one, I was able to reset the BIOS to defaults and then got it to start in Safe Mode; after that, without any changes, it rebooted back to full Windows with no problems. On this one a stick of memory went bad, but even with the memory completely replaced, it needed BOINC completely uninstalled and all traces deleted before it would run outside of Safe Mode. I had to disable BOINC from starting with Sysinternals Autoruns, then uninstall and delete it in full mode. After that I reinstalled BOINC and added the projects again. Until the full deletion, it simply would not start without freezing.
	I suspect it was the actual WUs it was crunching, which were marked as abandoned when I deleted the Program Data BOINC folder. At first I didn't post here about it because, when I found that a bad memory stick was keeping it from starting, I suspected the power issue we had at the house that day. And the memory failure may very well have been caused by the power issue. But the machine would not start until I dumped BOINC: it restarted several times with BOINC disabled, then froze as soon as I started BOINC manually, several times over, even after uninstalling BOINC and all the NVIDIA drivers and services. That narrowed the cause of the freezing down to BOINC. Beyond that, I can only guess at the exact reason. I know that memory crashes can cause programs to 'break' if part of the program is still in memory at the time of the crash and is not recoverable, but with BOINC's checkpoints I would not expect this to be a problem after a reinstall: even if the program itself was damaged, the WU it was crunching should go back to the most recent checkpoint and continue. As this is now an old subject related to a specific batch of WUs, this may be a moot point, but I thought it worth a late mention. In case there are still some of these units roaming around, and computers that run into problems that may only be found late, this may serve as an answer and possibly a solution (using Autoruns in Safe Mode [With Networking if you need to download it; it can be found with a Google search for "Sysinternals Autoruns"]). As always, any feedback is appreciated, whether this helps fix an issue or you can help me avoid one. And if I can confirm or rule out any suspicions, or answer anything I forgot to mention, please ask. ____________ 1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!" Ephesians 6:18-20, please ;-) http://tbc-pa.org

caffeineyellow5 Send message Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0 Level Scientific publications	Message 42679 - Posted: 23 Jan 2016 \| 6:41:25 UTC - in response to Message 42628.
	Across 5 computers, all Windows (7, 8.1, and 10), I have had about 130 errored-out WUs in the past 24-30 hours. Over 100 of them are 0.00-second errors and the other 30 or so errored after some run time. One of the computers crashed unrecoverably and needed manual work in the BIOS and then the OS to get it back to a good state and running again. Either this caused the aborted WUs or it was caused by the WUs. I didn't get a chance to check logs, drivers, recent updates and so on; I just got it running again and walked away frustrated. I did notice that all my Windows machines have wanted or done reboots for several Windows Updates, some having to do with drivers such as power management, but most were the generic, ominous "Update to Windows" or "Security Update for Windows" where you need to actually read the KB articles to find out what they were. I am pretty sure they all wanted at least one if not two reboots in the past 7 days. I am not sure if this is significant to any issue with errored WUs, but I would suppose it would be more than just me if it were.
Maybe a huge coincidence and maybe not, but another one of my computers died and would not reboot to anything but the BIOS. The one mentioned above is a Windows 7 machine and this one was a Windows 10 machine, but both have the same processor (i7-4790K) and motherboard (Asus Z97-AR). Both are just about 11 months old. On the first one I was able to reset the BIOS to defaults, then got it to start in Safe Mode, and then without any changes it rebooted back to full Windows with no problems. On this one, a stick of memory went bad, but even with the memory completely replaced, it needed BOINC completely uninstalled and all traces deleted to get it running outside of Safe Mode. I had to disable BOINC from starting with Sysinternals Autoruns, then uninstall and delete it in full Windows. After that I reinstalled BOINC and added the projects again. It simply would not start without freezing until the full deletion.
I suspect it was the actual WUs it was crunching, which were marked as abandoned when I deleted the Program Data BOINC folder. At first I didn't post here about it because when I found it was a bad memory stick keeping it from starting, I suspected the power issue we had at the house that day. And the memory failure may very well have been from the power issue. But it would not start normally until I dumped BOINC: it restarted fine several times with BOINC disabled, then froze as soon as I started BOINC manually, several times over. After uninstalling BOINC and all the NVIDIA drivers and services, I narrowed the cause of the freezing down to BOINC. Beyond that, I can only guess at the exact reason. I know that memory crashes can cause programs to 'break' if part of the program is still in memory at the time of the crash and not recoverable, but with BOINC's checkpoints I would not expect this to be a problem after a reinstall: if the program itself was damaged, the WU it was crunching should go back to the most recent checkpoint and continue on.
As this is an old subject now, related to a specific batch of WUs, this may be a moot point, but I thought it worth a late mention. Just in case these units are still roaming around and there are computers that run or ran into problems that may only be found late, this can serve as maybe some answer and possibly a solution (with Autoruns in Safe Mode [With Networking if you need to download it, found with a Google search for "Sysinternals Autoruns"]). And as always, any feedback is appreciated, whether this helps fix an issue or you can help me avoid them. And if I can confirm or help rule out any suspicions about something I forgot to mention or may be able to answer, please ask.
____________ 1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!" Ephesians 6:18-20, please ;-) http://tbc-pa.org
	ID: 42679 \| Rating: 0 \| rate: / Reply Quote

caffeineyellow5 Send message Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0 Level Scientific publications	Message 42680 - Posted: 23 Jan 2016 \| 6:44:47 UTC
	Now I am the one who can't explain the double post. lol
	ID: 42680 \| Rating: 0 \| rate: / Reply Quote

nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 42760 - Posted: 6 Feb 2016 \| 1:27:20 UTC
	Matt: I'm still having the download issue. Anyone else?
	ID: 42760 \| Rating: 0 \| rate: / Reply Quote

Stephen Farrell Send message Joined: 3 Nov 14 Posts: 10 Credit: 57,322,675 RAC: 0 Level Scientific publications	Message 42792 - Posted: 10 Feb 2016 \| 10:06:31 UTC - in response to Message 42760.
	I haven't had any issues since the new app was released.
	ID: 42792 \| Rating: 0 \| rate: / Reply Quote

nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 42809 - Posted: 12 Feb 2016 \| 22:14:26 UTC
	I have on both machines running here. Can someone at least look into this and try to fix it?
	ID: 42809 \| Rating: 0 \| rate: / Reply Quote

fractal Send message Joined: 16 Aug 08 Posts: 87 Credit: 1,248,879,715 RAC: 0 Level Scientific publications	Message 42810 - Posted: 12 Feb 2016 \| 23:04:17 UTC - in response to Message 42809. Last modified: 12 Feb 2016 \| 23:05:31 UTC
	"I have on both machines running here. Can someone at least look into this and try to fix it?"
Non-stop successful work all month for me. I can't see your machines to see if there is a useful error message, since you have them hidden. Have you tried resetting the project? That sometimes clears up HTTP errors.
	ID: 42810 \| Rating: 0 \| rate: / Reply Quote

nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 42814 - Posted: 13 Feb 2016 \| 22:43:54 UTC - in response to Message 42810.
	"I have on both machines running here. Can someone at least look into this and try to fix it?"
"Non-stop successful work all month for me. I can't see your machines to see if there is a useful error message, since you have them hidden. Have you tried resetting the project? That sometimes clears up HTTP errors."
I'll unhide my machines but I don't see how that will help. The error message is always the same. Rebooting/resetting doesn't help. Here are the latest 2.
138 GPUGRID 2/13/2016 3:59:44 PM Temporarily failed download of e19s5_e17s11p1f237-GERARD_CXCL12_CHLKDER_mol01-0-psf_file: transient HTTP error
139 GPUGRID 2/13/2016 3:59:44 PM Backing off 00:02:33 on download of e19s5_e17s11p1f237-GERARD_CXCL12_CHLKDER_mol01-0-psf_file
140 GPUGRID 2/13/2016 3:59:47 PM Temporarily failed download of e19s6_e17s11p1f311-GERARD_CXCL12_CHLKDER_mol01-0-psf_file: transient HTTP error
141 GPUGRID 2/13/2016 3:59:47 PM Backing off 00:02:29 on download of e19s6_e17s11p1f311-GERARD_CXCL12_CHLKDER_mol01-0-psf_file
142 2/13/2016 3:59:48 PM Project communication failed: attempting access to reference site
143 2/13/2016 3:59:49 PM Internet access OK - project servers may be temporarily down.
	ID: 42814 \| Rating: 0 \| rate: / Reply Quote

Nick Name Send message Joined: 3 Sep 13 Posts: 53 Credit: 1,533,531,731 RAC: 0 Level Scientific publications	Message 42815 - Posted: 14 Feb 2016 \| 2:08:59 UTC - in response to Message 42814.
	I've seen this occasionally, but as far as I know have not had files stuck trying to download for hours as was mentioned in another thread. When I've seen it before it resolved in a few minutes. Tonight I observed a couple of tasks stuck (probably around ten files in total), with a backoff time of one hour and 45 minutes. I manually tried downloading again and most of them completed, but a couple hung with the transient HTTP error. I still have one file left which I guess will eventually finish. All my cards are busy so it's not a problem right now. Interestingly, I had no problems uploading a completed job while this was happening.
____________ Team USA forum \| Team USA page Join us and #crunchforcures. We are now also folding: join team ID 236370!
	ID: 42815 \| Rating: 0 \| rate: / Reply Quote

fractal Send message Joined: 16 Aug 08 Posts: 87 Credit: 1,248,879,715 RAC: 0 Level Scientific publications	Message 42816 - Posted: 14 Feb 2016 \| 2:13:58 UTC - in response to Message 42814. Last modified: 14 Feb 2016 \| 2:15:26 UTC
	You are right. There is nothing I can see from your tasks that would explain the errors. All your task list shows are the successful tasks, not the ones that it could not download. I can't offer any suggestions to find out where the issue is in the network. I can only tell you that everything in my path from my machine to GPUGrid does not exhibit that behavior any more. It did for a while a few weeks back, but it seems to have resolved itself with no action on my part.
	ID: 42816 \| Rating: 0 \| rate: / Reply Quote

kashi Send message Joined: 29 Jan 15 Posts: 3 Credit: 76,300,087 RAC: 0 Level Scientific publications	Message 42817 - Posted: 14 Feb 2016 \| 4:30:49 UTC
	Looked in the log. Thankfully I can't find any recent files that stuck for hours, like I've had in the past. However, to show that it's still happening, here are some recent ones that were stuck for a shorter time:
14/02/2016 10:33:03 AM \| GPUGRID \| Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file: transient HTTP error
14/02/2016 10:33:03 AM \| GPUGRID \| Backing off 00:02:51 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:33:04 AM \| \| Project communication failed: attempting access to reference site
14/02/2016 10:33:05 AM \| \| Internet access OK - project servers may be temporarily down.
14/02/2016 10:33:06 AM \| GPUGRID \| Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file: transient HTTP error
14/02/2016 10:33:06 AM \| GPUGRID \| Backing off 00:02:24 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:33:07 AM \| \| Project communication failed: attempting access to reference site
14/02/2016 10:33:08 AM \| \| Internet access OK - project servers may be temporarily down.
14/02/2016 10:35:31 AM \| GPUGRID \| Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:35:55 AM \| GPUGRID \| Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:36:42 AM \| GPUGRID \| Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file: transient HTTP error
14/02/2016 10:36:42 AM \| GPUGRID \| Backing off 00:07:46 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:36:43 AM \| \| Project communication failed: attempting access to reference site
14/02/2016 10:36:44 AM \| \| Internet access OK - project servers may be temporarily down.
14/02/2016 10:37:06 AM \| GPUGRID \| Temporarily failed download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file: transient HTTP error
14/02/2016 10:37:06 AM \| GPUGRID \| Backing off 00:04:55 on download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:37:07 AM \| \| Project communication failed: attempting access to reference site
14/02/2016 10:37:08 AM \| \| Internet access OK - project servers may be temporarily down.
14/02/2016 10:56:49 AM \| GPUGRID \| Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:56:49 AM \| GPUGRID \| Started download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
14/02/2016 10:56:55 AM \| GPUGRID \| Finished download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-vel_file
14/02/2016 10:57:50 AM \| GPUGRID \| Finished download of e5s20_e4s2p1f299-GERARD_CXCL12_CHLCPUBCHEM_chalcone4681-0-pdb_file
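For anyone who wants to put numbers on how often this happens, here is a minimal Python sketch that tallies the failed downloads and the total backoff time per file. It assumes the event log has been copied into plain text and that the messages are worded exactly as in the lines above; the function name `summarize_downloads` is made up for illustration and is not part of BOINC.

```python
import re
from collections import defaultdict
from datetime import timedelta

# Message patterns copied from the event log lines quoted in this thread.
# Matching on the message text only means the timestamp format and the
# column separators in front of it do not matter.
FAILED = re.compile(r"Temporarily failed download of (\S+): transient HTTP error")
BACKOFF = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on download of (\S+)")


def summarize_downloads(lines):
    """Return {file_name: (failure_count, total_backoff)} for the given log lines."""
    failures = defaultdict(int)
    backoff = defaultdict(timedelta)
    for line in lines:
        match = FAILED.search(line)
        if match:
            failures[match.group(1)] += 1
            continue
        match = BACKOFF.search(line)
        if match:
            hours, minutes, seconds, name = match.groups()
            backoff[name] += timedelta(
                hours=int(hours), minutes=int(minutes), seconds=int(seconds)
            )
    names = set(failures) | set(backoff)
    return {name: (failures[name], backoff[name]) for name in names}
```

To use it, pass any iterable of log lines, for example an open text file holding the copied log, and print the resulting dictionary. Note that the total is the sum of the announced backoff intervals, not the actual time a file was stuck, since a manual retry can cut a backoff short.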
	ID: 42817 \| Rating: 0 \| rate: / Reply Quote

nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 42819 - Posted: 14 Feb 2016 \| 18:17:04 UTC - in response to Message 42816. Last modified: 14 Feb 2016 \| 18:17:43 UTC
	"You are right. There is nothing I can see from your tasks that would explain the errors. All your task list shows are the successful tasks, not the ones that it could not download."
They all eventually download and run to completion. It's waiting for hours while the downloads are stuck that is the issue.
"I can't offer any suggestions to find out where the issue is in the network. I can only tell you that everything in my path from my machine to GPUGrid does not exhibit that behavior any more. It did for a while a few weeks back, but it seems to have resolved itself with no action on my part."
I wish the issue would "resolve itself" but so far that has not happened.
	ID: 42819 \| Rating: 0 \| rate: / Reply Quote
