*_pYEEI_* information and issues

Michael Goetz · Joined: 2 Mar 09 · Posts: 124 · Credit: 124,873,744 · RAC: 0
Message 14079 - Posted: 30 Dec 2009, 1:38:05 UTC - in response to Message 14078.
I just managed to squeak by. I had six of these error out, dropping my daily quota to 9. The next WU was the 9th of the day; fortunately, it's a different series and is crunching normally. If it had been another error I think this GPU would have been done for the day. (Unless it's still counting this as WUs per CPU core, in which case I had a lot of headway.)
Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.

Stoneageman · Joined: 25 May 09 · Posts: 224 · Credit: 34,057,374,498 · RAC: 0
Message 14080 - Posted: 30 Dec 2009, 2:21:20 UTC - Last modified: 30 Dec 2009, 2:57:17 UTC
UPDATE: Four more have trashed another GPU.
Aborted a boatload of these critters, yet still they come. It's like they are breeding!

Beyond · Joined: 23 Nov 08 · Posts: 1112 · Credit: 6,162,416,256 · RAC: 0
Message 14081 - Posted: 30 Dec 2009, 9:15:13 UTC
Can you PLEASE PLEASE PLEASE make sure WU batches are OK before sending them out.

Siegfried Niklas · Joined: 23 Feb 09 · Posts: 39 · Credit: 144,654,294 · RAC: 0
Message 14082 - Posted: 30 Dec 2009, 9:54:46 UTC
GTX295 - Nine _pYEEI_ WUs crashed in a row.
http://www.gpugrid.net/results.php?hostid=53295
"MDIO ERROR: syntax error in file "structure.psf", line number 1: failed to find PSF keyword
ERROR: mdioload.cu, line 172: Unable to read topology file"
No new work sent for 7.5 hours. (Just recently got new work.)
Should I abort _pYEEI_ on other GPUs (cache)?

hzels · Joined: 4 Sep 08 · Posts: 7 · Credit: 52,864,406 · RAC: 0
Message 14083 - Posted: 30 Dec 2009, 11:11:14 UTC - in response to Message 14082.
Last WUs all going down the drain:
<stderr_txt>
# Using CUDA device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 280"
# Clock rate: 1.55 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 1: "GeForce GTX 260"
# Clock rate: 1.51 GHz
# Total amount of global memory: 939524096 bytes
# Number of multiprocessors: 27
# Number of cores: 216
MDIO ERROR: syntax error in file "structure.psf", line number 1: failed to find PSF keyword
ERROR: mdioload.cu, line 172: Unable to read topology file
called boinc_finish
</stderr_txt>
I'm over to Collatz for some days.
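
The "failed to find PSF keyword" message in the logs above points at line 1 of structure.psf. CHARMM/X-PLOR-style topology files normally open with a header line whose first token is PSF, so a damaged or empty input file would be expected to fail in exactly this way. Below is a minimal Python sketch of such a header check, assuming a plain-text file. The function names are hypothetical, and this is not the GPUGRID application's own code (mdioload.cu is not shown in this thread); it also does not say why the file was bad in the first place.

```python
def first_line_has_psf_keyword(first_line):
    """Return True if the first whitespace-separated token of the line is 'PSF'."""
    tokens = first_line.split()
    # Headers such as "PSF", "PSF CMAP" or "PSF EXT" all pass; an empty line does not.
    return tokens[:1] == ["PSF"]


def file_has_psf_header(path):
    """Read only the first line of a file and apply the keyword check to it."""
    # errors="replace" keeps the check from raising on a file with stray binary bytes.
    with open(path, "r", errors="replace") as handle:
        return first_line_has_psf_keyword(handle.readline())
```

Running file_has_psf_header on the structure.psf in a task's slot directory would show whether the downloaded file is at least plausibly a topology file before the task is started.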

Michael Goetz · Joined: 2 Mar 09 · Posts: 124 · Credit: 124,873,744 · RAC: 0
Message 14085 - Posted: 30 Dec 2009, 16:17:39 UTC
I just had another one of these fail: 1057058

skgiven (Volunteer moderator, Volunteer tester) · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
Message 14086 - Posted: 30 Dec 2009, 16:29:08 UTC - in response to Message 14085. Last modified: 30 Dec 2009, 16:32:34 UTC
Had 2 fail in a few seconds on one system, 3 on another.
184-IBUCH_reverse_pYEEI_2912-0-40-RND6748 http://www.gpugrid.net/workunit.php?wuid=1056751
128-IBUCH_reverse_pYEEI_2912-0-40-RND3643 http://www.gpugrid.net/workunit.php?wuid=1056695
Also, could not get any tasks this morning between about 1am and noon, on the same system, but running a task now.
http://www.gpugrid.net/workunit.php?wuid=1056826
http://www.gpugrid.net/workunit.php?wuid=1056758
http://www.gpugrid.net/workunit.php?wuid=1056826

Beyond · Joined: 23 Nov 08 · Posts: 1112 · Credit: 6,162,416,256 · RAC: 0
Message 14092 - Posted: 31 Dec 2009, 0:36:13 UTC - in response to Message 14028.
[quote]Please use this thread to post any problem regarding all workunits tagged as _pYEEI_. Thanks, ignasi[/quote]
As you can see (I hope), massive problems have been reported, and many systems have been locked out of receiving new WUs (and are sitting idle) due to these faulty units. Don't you think it's about time to pull the rest? It looks like they're just being allowed to run until they fail so many times that the server cancels them. That's not showing any concern at all for the people who are doing your work. I know they're not being canceled because I've received 22 of them so far today. Every one of those 22 has failed on several machines before being sent to me. That's just wrong.

skgiven (Volunteer moderator, Volunteer tester) · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
Message 14094 - Posted: 31 Dec 2009, 14:09:24 UTC - in response to Message 14092.
In a way the _pYEEI_ tasks are SPAM! I had to take extreme action yesterday - shut down my system for a couple of hours ;)

ignasi · Joined: 10 Apr 08 · Posts: 254 · Credit: 16,836,000 · RAC: 0
Message 14107 - Posted: 3 Jan 2010, 17:50:40 UTC - in response to Message 14094.
My most sincere apologies to everybody for all this. I wanted to fill up the queue before going offline for some days, but obviously it didn't work as expected. The balance between keeping crunchers' support, not having an empty queue and having a private life is always very sensitive to human errors.
Sincerely, ignasi

Michael Goetz · Joined: 2 Mar 09 · Posts: 124 · Credit: 124,873,744 · RAC: 0
Message 14109 - Posted: 3 Jan 2010, 18:28:46 UTC - in response to Message 14107.
[quote]My most sincere apologies to everybody for all this.[/quote]
No worries here; stuff happens. It's the nature of "free" distributed computing that there are going to be minor problems along the way. Happy new year!

Beyond · Joined: 23 Nov 08 · Posts: 1112 · Credit: 6,162,416,256 · RAC: 0
Message 14111 - Posted: 3 Jan 2010, 19:26:08 UTC - in response to Message 14109.
[quote]My most sincere apologies to everybody for all this. Happy new year![/quote]
Thanks for letting us know what happened. Communication is appreciated. Happy new year everyone!

Stoneageman · Joined: 25 May 09 · Posts: 224 · Credit: 34,057,374,498 · RAC: 0
Message 14112 - Posted: 3 Jan 2010, 19:58:41 UTC - in response to Message 14107.
[quote]The balance between keeping crunchers' support, not having an empty queue and having a private life is always very sensitive to human errors. Sincerely, ignasi[/quote]
"A PRIVATE life".......... well ok. However, we expect you to sleep with the server :)

ignasi · Joined: 10 Apr 08 · Posts: 254 · Credit: 16,836,000 · RAC: 0
Message 14113 - Posted: 4 Jan 2010, 9:33:01 UTC - in response to Message 14112.
[quote]"A PRIVATE life".......... well ok. However, we expect you to sleep with the server :)[/quote]
I am afraid girlfriends are too jealous...

Snow Crash · Joined: 4 Apr 09 · Posts: 450 · Credit: 539,316,349 · RAC: 0
Message 14122 - Posted: 4 Jan 2010, 23:52:38 UTC - in response to Message 14113.
[quote][quote]"A PRIVATE life".......... well ok. However, we expect you to sleep with the server :)[/quote]
I am afraid girlfriends are too jealous...[/quote]
You have more than ONE!!! No wonder he can't get the WUs straight, he is sleep deprived :-)
Keep up the good work, we'll crunch the best we can!
Thanks - Steve

[AF>Libristes>Jip] Elgrande71 · Joined: 16 Jul 08 · Posts: 45 · Credit: 78,618,001 · RAC: 0
Message 14348 - Posted: 26 Jan 2010, 14:27:46 UTC - in response to Message 14122.
Three compute errors (1, 2, 3) on this host.

AndyMM · Joined: 27 Jan 09 · Posts: 4 · Credit: 582,988,184 · RAC: 0
Message 14396 - Posted: 27 Jan 2010, 10:10:20 UTC
Sorry, but I'm going to say goodbye. The last 3 days have been non-stop computation errors, made even worse by the fact that the cards just sat there doing nothing. Switching all my GPUs to F@H. I do not accept having my money wasted with units processing for 17 hours and then showing a computing error.

GPUGRID (Role account) · Joined: 15 Feb 07 · Posts: 134 · Credit: 1,349,535,983 · RAC: 0
Message 14410 - Posted: 27 Jan 2010, 13:56:28 UTC - in response to Message 14396.
Hi,
It's because you have been accepting beta work from us. If reliability of work is of paramount importance to you, don't track the beta application.
Matt

skgiven (Volunteer moderator, Volunteer tester) · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
Message 14524 - Posted: 28 Jan 2010, 0:30:51 UTC - in response to Message 14396.
[quote]Switching all my GPUs to F@H.[/quote]
Your cards do a lot more work here than they can at F@H. If the problem is Beta related you just need to turn the Betas off, as MJH said. It might also be that you need to restart the system. Sometimes one failure can cause continuous failures (a runaway) and you need to restart the system. I say this because the problem was limited to your GTX 295, and did not affect your GTX 275.
Many of your tasks seem to have been aborted by user, some immediately and one after running for a long time: 286-IBUCH_esrever_pYEEI_0301-10-40-RND7408 - Aborted by user after 43,189.28 seconds.
Turn off Betas, restart and see how you get on.

AndyMM · Joined: 27 Jan 09 · Posts: 4 · Credit: 582,988,184 · RAC: 0
Message 14789 - Posted: 29 Jan 2010, 14:08:33 UTC - in response to Message 14524.
Thanks for the comments. I looked in my GPUGrid preferences and did not notice anything saying Beta. I did see "Run test applications? This helps us develop applications, but may cause jobs to fail on your computer", which was already set to no.
Please advise: how do I turn off receiving Beta work units?
Thanks
Andy