Message boards : Graphics cards (GPUs) : restart.coor....
---

Webbie | Joined: 25 Sep 08 | Posts: 111 | Credit: 10,352,599 | RAC: 0
This one's killing me..... I've been through the computation error thread... it's touching on so many different things these days that I really don't want to mix this in with that.

This "restart.coor" has not been a problem in the past, but the last couple of weeks it's been killing me. If my PCs reboot, the WUs finish, then error out = restart.coor. This started on its own, out of nowhere. I've since upgraded the driver to 185.26; same thing. I upgraded the BOINC version today from 6.6.3 to 6.6.17, and now I have cores that have been running for hours, the first of which has just finished and errored out = "restart.coor". I'm now expecting the next 3 WUs to do the same, and maybe the ones in the queue as well. Yesterday I rebooted a rig and the WUs completed, errored out with restart.coor, and so did the WUs in the queue...

There's something going on here. Really not happy about the errors, but these WUs continue to run for hours (5-6 hrs today) and then error out when they are finished. Major waste of resources. This is a new thing that developed without any changes to my rigs. Today I did not reboot; I just upgraded the BOINC version. So, now every time I shut down BOINC I can expect this? What if I just put it on snooze?

Something is wrong here, and with all of the computation errors going around... I really don't believe it's on my end, or at least not solely on my end. Yes, all my stuff is OCed, but it's been that way for months. Everything is fine unless there's a restart of BOINC... plus I've tweaked down some OCs, and this situation continues to go from bad to worse. My PCs are unhidden if anyone wants to take a look.

Getting very frustrated here. If they just errored out it wouldn't be so bad, but they run for hours... until they complete... just to error out... HELP! ;)
---

Webbie | Joined: 25 Sep 08 | Posts: 111 | Credit: 10,352,599 | RAC: 0
Next one errored out = restart.coor. Expecting at least 2 more from different rigs...
---

Alain | Joined: 8 Sep 08 | Posts: 63 | Credit: 1,696,957,181 | RAC: 0
First of all, the real error message is "incorrect function - exit code 1". Disregard the restart.coor; it appears in all WUs, including the ones that complete successfully and validate.

Secondly, the errors occur on 2 of your 3 machines, but some of the WUs there do finish successfully and validate, which indicates that software is not the problem. So that leaves a hardware problem. My guess here is aging of certain components in the power supply and/or video card. Reducing the OC on your video cards even further is likely the best option to try.

Good luck.

kind regards
Alain
---

Webbie | Joined: 25 Sep 08 | Posts: 111 | Credit: 10,352,599 | RAC: 0
> First of all, the real error message is "incorrect function - exit code 1". Disregard the restart.coor; it appears in all WUs, including the ones that complete successfully and validate.

OK, thanx Alain for the info. I wasn't aware that the restart.coor appears in successful WUs. I rarely check them. I understand that the actual error is pretty meaningless, and thought that the constant restart.coor might have some value... guess not.

I have one other possible issue that may be causing this. I've already seriously downclocked one card... I'll get to the other 2 later today. I'll see how that goes and then check out the other possibility... I may end up having to make a choice, but if so... GPUGRID will win out. It's good to know, though, that the problem(s) should be on my end. This has been driving me nutz... ;)

Many thanx!
---

Paul D. Buck | Joined: 9 Jun 08 | Posts: 1050 | Credit: 37,321,185 | RAC: 0
> I may end up having to make a choice, but if so... GPUGRID will win out. It's good to know, though, that the problem(s) should be on my end. This has been driving me nutz... ;)

If you pick one system, describe it and list its essential features, tell us what you have tried to this point, and link to failed tasks, then we can all shout suggestions from the sidelines.

I suggest one system at a time because making too many changes in too many places sometimes causes confusion... and that is the last thing needed...
---

Joined: 1 Feb 09 | Posts: 139 | Credit: 575,023 | RAC: 0
Hmm, I don't agree that it can't be a software problem. As I check other projects, I see that even though these GPUs are damn fast, they make a lot of errors on their jobs; they are simply made for raw performance, not for very demanding computations. For a game, losing a pixel here or there won't do much, but in a computation that's a different kind of cake.

I've also found units that crash when reported to be completely done, when they only have to be sent to the project. Can someone tell me where the hardware comes in when a unit has to be sent?

I believe these GPU projects have a good future, but we must not forget they still have some problems to be solved. So I guess we will be stuck with what we have now and see units being destroyed sometimes for no good reason. It's not always certain that underclocking the VC is a solution for not getting errors; I underclocked my VC to the lowest settings possible, and even then units from SETI and SETI Beta crashed.

We can't complain much, since all these projects are basically still alpha or beta or in some kind of test. So failures will keep coming for no good reason, I guess, or the makers must find a cheap/fast way to give a chance to recalculate from the error given. Sadly, as it is now, even though it has been calculating the whole damn thing for 16 or more hours, you end up getting nothing.

P.S. Talking about the devil: my laptop just crashed hard with a SETI Beta unit running.
---

Michael Goetz | Joined: 2 Mar 09 | Posts: 124 | Credit: 124,873,744 | RAC: 0
> Hmm, I don't agree that it can't be a software problem.

I agree with some of what you said, but disagree with some, too.

I agree that the symptoms do not rule out software problems. Just because it sometimes fails and sometimes succeeds doesn't mean it's not software. It just means the bug isn't quite so obvious, or it would have been caught easily in testing. I've been writing software for 30+ years, and it's those bugs that only occur rarely that drive you nuts... but they're still software problems.

As for errors due to the GPUs being designed for speed, not accuracy, that's only partially correct. Yes, a few insignificant bits getting flipped won't make a noticeable difference in a 3D game, but they could in a scientific calculation. But those flipped bits can also cause crashes in games, so it's not as if occasional errors are an acceptable design feature. The GPUs *SHOULD* be running error free. They're designed to operate without errors. When overclocked, especially by users, the criterion for an acceptable clock speed (from the user's perspective) might be "my game doesn't crash", which may, indeed, allow errors to creep into the calculations. But as shipped from the factory, the GPU shouldn't be making errors. If it is, it's defective. I have a factory-overclocked EVGA GTX 280. It has yet to error out on any task.

One theory that I have, based on what I've read here and on other projects' boards, is that the fan control on some GPUs may not be operating properly, or the cooling may simply be insufficient. That may allow the GPU to overheat and become unstable. My GPU-equipped computer is currently in a fairly cold basement room, and it has no thermal problems at the moment. The highest temps I ever see are about 78 degrees C, and 75 degrees is the base temperature, below which the fan runs at idle speed (40%). I've never seen the fan running above 50%, and the maximum allowable operating temperature is over 100 degrees, so there's a lot of cooling capacity in reserve.

There are lots of things that can cause problems with the WUs. Two of them that are under our control are OC and temperature. Besides trying lower clock speeds, make sure that the GPU temperature is not too high.

Mike
---

Alain | Joined: 8 Sep 08 | Posts: 63 | Credit: 1,696,957,181 | RAC: 0
> ... Which indicates that software is not the problem...

This was indeed not accurate enough. It should have been "software setup", i.e., the software versions are the correct ones. And indeed this still means that occasionally a WU can crash; it has happened to all of us. Also, newer software versions (drivers) might do better.

But when so many crash in a short time, there are in my view really only three possibilities: a bad batch of WUs, wrong software versions, or hardware trouble. Here I do not believe the first two are applicable... However, always happy to learn something, of course.

kind regards
Alain
---

Joined: 1 Feb 09 | Posts: 139 | Credit: 575,023 | RAC: 0
Michael, my GPU temps are much below average. I also have an EVGA card and must admit it's a good brand. My card at full load only reaches 63 C with the fan at 40%, and if I turn that cool-monster up to 70% it goes down to 53 C. So I guess those temps should be fine ;)

Sometimes I wonder if the low temp is because of the low shader count, or simply because I clocked under the specs it should run at. It should run at the higher clocks provided by EVGA, and I am sure it can do more. But for the sake of the project I am going to keep it low.

I had a crash from Windows, and after the restart I found that a unit which had been running for 12 hours crashed immediately. And again it shows restart.coor as the reason, so somehow I get the feeling that restarting the unit where it left off is not working properly. But I simply blame the admins; that's easier ;)

The settings of my EVGA OC 9600 GT:

| Clock | Current | EVGA spec |
|---|---|---|
| GPU | 600 MHz | 675 MHz |
| Shader | 1600 MHz | 1800 MHz |
| Memory | 900 MHz (1800 effective) | 900 MHz (1800 effective) |
---

Stefan Ledwina | Joined: 16 Jul 07 | Posts: 464 | Credit: 298,573,998 | RAC: 0
Webbie, the restart.coor file is only the checkpoint file... So in your case, maybe the computer just crashed while the app was writing to the file, and it got corrupted when the PC crashed. So the computation of the WU couldn't restart...

Just pure speculation, of course... ;)

pixelicious.at - my little photoblog
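As an aside, the failure mode Stefan describes is exactly what the classic "write a temp file, then rename it into place" pattern protects against. Below is a minimal sketch in C++, assuming a hypothetical coordinate array and hypothetical file names; this is not the actual GPUGRID/ACEMD code, just the general technique:

```cpp
// Minimal sketch of crash-safe checkpointing (hypothetical file names and
// data layout; NOT the actual GPUGRID/ACEMD implementation). Writing to a
// temp file and renaming it in means a crash mid-write leaves the previous
// checkpoint intact instead of a half-written, corrupted one.
#include <cstddef>
#include <cstdio>
#include <vector>

bool write_checkpoint(const std::vector<double>& coords)
{
    const char* tmp_path   = "restart.coor.tmp";  // hypothetical name
    const char* final_path = "restart.coor";

    FILE* f = std::fopen(tmp_path, "wb");
    if (!f) return false;

    std::size_t n = coords.size();
    bool ok = std::fwrite(&n, sizeof n, 1, f) == 1 &&
              std::fwrite(coords.data(), sizeof(double), n, f) == n;
    ok = (std::fclose(f) == 0) && ok;  // flush stdio buffers to the OS

    // On POSIX, rename() atomically replaces the old checkpoint, so a crash
    // at any point leaves either the old file or the new one, never a torn
    // write. (On Windows, MoveFileEx with MOVEFILE_REPLACE_EXISTING plays
    // the same role, since std::rename there fails if the target exists.)
    return ok && std::rename(tmp_path, final_path) == 0;
}
```

With this pattern, a crash during the write leaves restart.coor untouched; only a fully written temp file ever replaces it. A fully robust version would also fsync the file and its directory before the rename.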
---

Joined: 1 Feb 09 | Posts: 139 | Credit: 575,023 | RAC: 0
Hihi, yea, lol... I figured the same, but then again, pure speculation again ;)
---

Michael Goetz | Joined: 2 Mar 09 | Posts: 124 | Credit: 124,873,744 | RAC: 0
Another potential source of errors is a power supply that is almost, but not quite, powerful enough to consistently drive the GPU. Since the GPU's power draw depends on how hard it's working, a marginal power supply might work most of the time but cause errors once in a while.

Actually, there's no end to the possible causes. The question is how you diagnose the problem.

One thing I would try is swapping hardware, if you have more than one rig. If you have another GPU, see if that works in this computer. If you have another computer, see if your GPU works in that computer. Perhaps you can isolate whether it's the GPU or the computer that's the root of the problem.
---

Webbie | Joined: 25 Sep 08 | Posts: 111 | Credit: 10,352,599 | RAC: 0
Hey everybody,

Just wanted to say thanx for the input, and a special thanx to Paul for offering to help weed through this with a fine-tooth comb. :)

There were 2 possible issues, and 1 is heat. I have a unique mega water-cooling setup for my rigs. It's too complex and detailed to go into, but I've made some corrections/adjustments and things are running properly again. I think it's the core issue... maybe the only issue. I turned my OCs back up, and I'll check out the last possible issue in a day or 2 to see whether it is an issue or not.

Ran all day today with successful WUs, so hopefully I'm back on track!
---

MrS | Joined: 17 Aug 08 | Posts: 2705 | Credit: 1,311,122,549 | RAC: 0
Just keep in mind that when you're running an 8800 GS / GT at around 1.9 GHz you're asking for a lot. The cards can take it, but you're definitely borderline here. Do you use FurMark to check stability?

MrS

Scanning for our furry friends since Jan 2002
---

Webbie | Joined: 25 Sep 08 | Posts: 111 | Credit: 10,352,599 | RAC: 0
Hi ETA,

Yes, I do push these cards hard... :) I don't use FurMark or any other "stability" program(s). Many have realized that they have limited value when it comes to crunching 24/7/365. I just OC until I get errors and then back down. Once I find the high end, I back it down a notch further into the sweet spot. And by "errors" I mean driver errors, boot problems, etc., not necessarily WU errors; those are usually the last step. But I only do the OC determinations when I have time to babysit. Otherwise I run "under", so to speak, until I have an opportunity to do it right. My stuff has basically been running fine this way for months.

However, I had an issue with my mega WCing setup. I understand what was going on now, but I was quite confused before. Ultimately, my cards are OCed to where they should work fine up to +55 C... after that it's hit and miss. With the flow issues that developed from the WCing changes, and other factors, temps were pushed upward in a somewhat erratic fashion. Again, it would take a book to try to explain everything...

Right now the cards are running at ~35 C, so I have quite a bit of headroom...

Back to 1 successful WU after another, so it seems as though heat was the main issue...
---

MrS | Joined: 17 Aug 08 | Posts: 2705 | Credit: 1,311,122,549 | RAC: 0
Well, that makes quite some sense. Actually, I ran my stability tests in a rather similar way: I used 3DMark to find the point where things broke, backed off a bit and ran for a while, then backed off a bit more and used those settings successfully for GPUGRID.

> Back to 1 successful WU after another, so it seems as though heat was the main issue...

Glad it works again! And although I may not be telling you anything new here: it was not "heat", it was the combination of clock speed and heat ;) The higher the temperature, the lower the maximum stable frequency becomes. So at 50 or 100 MHz less you would likely still have been fine at those higher temps.

BTW: I'm running an ATI 4870 at Milkyway now, and there's this nice tool "ATI Tray Tools", which also has a built-in artefact tester. It's much more convenient to use than, e.g., 3DMark, and although it doesn't generate as much stress as FurMark (that thing could kill my card...) it's almost in line with the temperatures I get when running MW, so it *may* be a good test. I don't know if the tool also runs on nVidia, though. RivaTuner surely does run on ATI :D

MrS

Scanning for our furry friends since Jan 2002
---

Michael Goetz | Joined: 2 Mar 09 | Posts: 124 | Credit: 124,873,744 | RAC: 0
Because of the likelihood of GPUs being overclocked to the point of producing non-crash errors, I would think that most projects would be well served by running with at least a quorum of 2, so that some flipped bits don't end up distorting the science results.

I'm sure very few people outside of the hardware manufacturers have done sufficient testing to really know at what point, for a given temperature, a particular GPU starts becoming unreliable in the sense that the data starts becoming erroneous. It's easy to know when it crashes. It's not that easy to know whether the calculations are correct.

To really ensure that the card is working correctly, you need to repeat the tests at a range of temperatures, using software that verifies that the results are correct. Furthermore, the tests need to be constructed so that all of the hardware is exercised. In other words, it's not enough that the results were correct; the test must use every multiprocessor and every shader on the GPU. Then, given the expected MTBF of the hardware at a given temperature and clock rate, the test needs to run continuously for a multiple of that MTBF. Determining the MTBF requires testing with multiple examples of the GPU. To make things worse, as the card, power supply, and fans age, the GPU may not perform as well as it did when it was new. So the tests should be repeated periodically.

That's how you ensure that your GPU is producing valid science data, as opposed to merely not crashing. Clearly, this is not only onerous but outright impossible for the average consumer. If you're running overclocked, you really can't be certain that all the calculations are occurring correctly. To be fair, even if you're running at factory speeds, you can't be certain either. That's why, IMHO, projects should always use a quorum of at least 2. Of course, there are some projects where validating 2 results against each other is impossible due to the nature of the work, but those projects are a small minority.

Frankly, given the current interest in GPU computing, I'm surprised that Nvidia and ATI don't include tools that let you test the accuracy of your calculations so that consumers can make wise choices when overclocking. Right now, we're flying blind.

Mike
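Mechanically, the quorum-of-2 idea reduces to an element-wise comparison of two hosts' outputs, with a tolerance wide enough to absorb benign floating-point differences between machines but tight enough to catch flipped bits in the significant digits. Here is a minimal sketch with a hypothetical tolerance and data layout; real BOINC validators are project-specific:

```cpp
// Minimal sketch of quorum-of-2 validation (hypothetical tolerance and
// data layout; real BOINC validators are project-specific). Two hosts
// compute the same WU; the results agree if every value matches within
// a relative tolerance.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

bool results_agree(const std::vector<double>& a,
                   const std::vector<double>& b,
                   double rel_tol = 1e-6)  // hypothetical tolerance
{
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double scale = std::max(std::fabs(a[i]), std::fabs(b[i]));
        if (std::fabs(a[i] - b[i]) > rel_tol * std::max(scale, 1.0))
            return false;  // disagreement: issue the WU to a third host
    }
    return true;
}
```

When the two results disagree, the scheduler would typically send the work unit to a third host and keep the majority result.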
---

Webbie | Joined: 25 Sep 08 | Posts: 111 | Credit: 10,352,599 | RAC: 0
Definitely understand and agree, but in my world (lol) the OC is a given... so the only change was the reduction in cooling. I certainly understand your point, though! That was the purpose of the "+" in the +55 C above... it's a conservative figure. These clocks will run fine above that... they'll run at +60 C, but I'm not quite sure where the breaking point is. However, they are clocked high enough that they won't perform at the temps a lot of folks get on air... but that's one of the benefits of water: higher clocks and cooler temps... hopefully leading to greater longevity, cuz atm I plan to run these suckers until the wheels fall off!

Mike, you are definitely right, but as you already know... it's near impossible for an average Joe like me to accomplish that... and the truth is the same even at factory clocks, as you posted. So, until I have reason to think/feel/believe that what I am doing is somehow skewing the results... I'm just going to keep moving forward. Hopefully, as this and other GPU projects advance, they'll be able to tell and let us know what's what. Right now I think we all have to rely upon them to determine the overall quality of the data... and if something's wrong, to find out what it is and let us all know. This GPU stuff is still relatively new, so there's little doubt that things will be learned about it in the future!

Probably the main reason that I have stuck with GPUGRID vs., say, F@H is that it was, and still is, a new advancement in the DC GPU world. I like that. Even though the original WUs were not for "science" directly... they were for the overall greater good of DC, and therefore of "science", over the long haul! Somebody needs to get this tech up, running, and sorted out... that is what GPUGRID has been doing, and I like being a part of that. Now we have SETI, MW, etc., and others will follow... DC will grow exponentially.
---

MrS | Joined: 17 Aug 08 | Posts: 2705 | Credit: 1,311,122,549 | RAC: 0
> Frankly, given the current interest in GPU computing, I'm surprised that Nvidia and ATI don't include tools that let you test the accuracy of your calculations so that consumers can make wise choices when overclocking. Right now, we're flying blind.

I'm not that surprised. Do you think it's really in NV's or ATI's interest to give us a tool to prove that their hardware is defective? There have been cases where new games (e.g., Doom 3) appeared and some factory-OCed cards failed on them. If even games occasionally reveal such faults, what would the failure rates (and associated RMA costs) be if there were a proper test tool? Which of the big two could afford to be first and take the sales hit from the bad press?

We as a community, however, would certainly want better test tools and a more general consciousness of this problem. And I don't think we're flying totally blind right now... it's not that bad ;) We have 3DMark to find the upper limits of stability. We have FurMark to fry our cards. We have the artefact tester of ATI Tray Tools, and maybe others. Sure, these don't execute exactly the same code as the projects. But taking this logic a bit further, a specialized test program supplied by NV or ATI would not be sufficient either, as it wouldn't run the same code.

In another thread, GDF said he's quite confident in the error detection mechanisms of GPUGRID. The part I understand and remember is that small errors will be corrected in future iterations due to the way atomic forces work, whereas large errors lead to detectable failures.

MrS

Scanning for our furry friends since Jan 2002
---

Michael Goetz | Joined: 2 Mar 09 | Posts: 124 | Credit: 124,873,744 | RAC: 0
> Do you think it's really in NV's or ATI's interest to give us a tool to prove that their hardware is defective?

That depends on how seriously they want to push their graphics cards as general-purpose supercomputing engines. You have an excellent point, however. I suppose it would be downright idiotic for ATI or NV to provide such a tool.

For that matter, if you read that GPU power use report (referenced in another thread here), it's clear the boards are built and marketed as 3D game boards (duh!) which won't run full out in actual usage. ATI even had to go so far as to have the Catalyst driver detect whether a specific benchmark was running, because when run full out, some of their boards' voltage regulators were overheating -- something that doesn't happen in real-life games or DC computing.

Fortunately, with the CUDA libraries, it would be really easy to write a burn-in program that would run the GPU at 100%.
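Such a burn-in is indeed simple to sketch with cuBLAS: hammer the card with the same large matrix multiply over and over, and check that every result is bit-identical to the first, since identical inputs on the same GPU should produce identical bits on stable hardware. A minimal sketch follows; the matrix size and iteration count are arbitrary choices, and error handling is omitted for brevity:

```cpp
// Sketch of a cuBLAS burn-in: repeat an identical SGEMM and verify that
// each result is bit-identical to the first run. Any difference means the
// card computed something wrong under load. (Matrix size and iteration
// count are arbitrary; error handling omitted for brevity.)
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    const int N = 2048, iters = 1000;
    std::vector<float> hA(N * N, 1.0f), hC(N * N), ref(N * N);
    float *dA, *dC;
    cudaMalloc(&dA, N * N * sizeof(float));
    cudaMalloc(&dC, N * N * sizeof(float));
    cudaMemcpy(dA, hA.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    const float one = 1.0f, zero = 0.0f;

    for (int i = 0; i < iters; ++i) {
        // C = A * A -- exactly the same operation every iteration
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &one, dA, N, dA, N, &zero, dC, N);
        cudaMemcpy(hC.data(), dC, N * N * sizeof(float),
                   cudaMemcpyDeviceToHost);
        if (i == 0)
            ref = hC;  // the first run becomes the reference result
        else if (std::memcmp(ref.data(), hC.data(), N * N * sizeof(float)))
            std::printf("mismatch at iteration %d - unstable card?\n", i);
    }
    cublasDestroy(h);
    cudaFree(dA);
    cudaFree(dC);
    return 0;
}
```

Because GEMM keeps every multiprocessor busy, a loop like this stresses the whole chip while also checking the arithmetic, which is precisely the combination a plain game or benchmark doesn't give you.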