Encounter 10-12 H-bond term == Client error 0x1 ?

Hydropower (Joined: 3 Apr 09, Posts: 70, Credit: 6,003,024, RAC: 0)
Message 10439 - Posted: 6 Jun 2009, 22:12:01 UTC
Virtually all my currently failing jobs show this "Found zero 10-12 H-bond term" warning. I have examined other people's results, and more than once the 'buddies' errored out as well. Is it known what side effects this "Found zero 10-12" warning has? Is it being investigated?
One job (519558) had an out-of-memory error and was terminated by XP. I had disabled my 'faulty' GPU3, so this is NOT the 'faulty' one. My errors have occurred on several GPUs today. Do we have a GPU testing program?
Join team Bletchley Park, the innovators.

Ulf Ohlsson (Joined: 1 Jan 09, Posts: 20, Credit: 616,384, RAC: 0)
Message 10440 - Posted: 7 Jun 2009, 4:28:47 UTC
I'm running on CUDA device: GeForce 9800 GTX/9800 GTX+ (driver version 18608, compute capability 1.1, 1024MB, est. 85GFLOPS) and have exactly the same problem. The OS is Vista 64. Only 5% of the WUs complete normally.

Neil A (Joined: 9 Oct 08, Posts: 50, Credit: 12,676,739, RAC: 0)
Message 10445 - Posted: 7 Jun 2009, 19:49:25 UTC - in response to Message 10440.
I have been experiencing this type of symptom for quite a while on one of my computers, which has 2x GTX 260 Core 216 SC. Backing off the clocks doesn't seem to have helped, nor has reloading, downgrading or upgrading the driver. I welcome a solution.
Crunching for the benefit of humanity and in memory of my dad and other family members.

MarkJ (Volunteer moderator, Volunteer tester; Joined: 24 Dec 08, Posts: 738, Credit: 200,909,904, RAC: 0)
Message 10447 - Posted: 8 Jun 2009, 0:52:07 UTC
I have a number of machines: 4 quaddies with GTS250s and an i7 with dual GTX260+. I run WinXP on them. Nothing is overclocked.
I found that WUs on the GTS250s fail unless I run the 182.50 drivers. Even after the so-called "work around" from the project team they still failed. The GTX260s also seemed to fail, though not as often, when I was running the 185.85 drivers. I downgraded to 182.50 and that seemed to resolve the issue.
It also seems that you have to uninstall the old drivers before installing new ones. I use Control Panel -> Add/Remove programs to uninstall.
I am running BOINC 6.6.28 on 3 of the quaddies and 6.6.33 on the other quaddie and the i7. There is a known bug in 6.6 (up to and including 6.6.28) to do with preempting tasks. 6.6.33 won't shut down the science apps on exit, but you can use Advanced -> Shutdown connected client and then exit.
BOINC blog

GDF (Volunteer moderator, Project administrator, Project developer, Project tester, Volunteer developer, Volunteer tester, Project scientist; Joined: 14 Mar 07, Posts: 1958, Credit: 629,356, RAC: 0)
Message 10448 - Posted: 8 Jun 2009, 7:27:42 UTC - in response to Message 10447.
Please check your driver version. It is possible that very new drivers have problems. You should use the one suggested by Nvidia for CUDA 2.1 unless you have a reason to use another one (for instance a game requiring a new driver). See the join section. The 181.xx drivers are stable.
gdf

Ulf Ohlsson (Joined: 1 Jan 09, Posts: 20, Credit: 616,384, RAC: 0)
Message 10457 - Posted: 8 Jun 2009, 23:52:10 UTC - in response to Message 10448. Last modified: 9 Jun 2009, 0:00:36 UTC
I tried every driver version and still get the same errors; only a few WUs finish correctly. I switched to running some WUs for SETI Beta and have finished 40 WUs in the last 20 hours; none of them had any errors at all.
Perhaps it would be possible to get some statistics on how many of the returned WUs have errors over, let's say, two months. If this shows a rising graph of corrupted WUs, there might be some errors on the server side. Perhaps also add info about OS and CPUs.
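
A minimal sketch of the kind of statistic suggested here, assuming the project could export one record per returned WU with its return date and a failed flag. The record layout, the `weekly_error_rate` name and the sample data are all hypothetical, not taken from the GPU-Grid server.

```python
from collections import defaultdict
from datetime import date


def weekly_error_rate(results):
    """Return {(iso_year, iso_week): fraction of failed WUs} in week order.

    `results` is an iterable of (return_date, failed) pairs; this layout is
    an assumption made for illustration only.
    """
    counts = defaultdict(lambda: [0, 0])  # week -> [failed, total]
    for day, failed in results:
        iso = day.isocalendar()
        week = (iso[0], iso[1])
        counts[week][1] += 1
        if failed:
            counts[week][0] += 1
    return {week: bad / total for week, (bad, total) in sorted(counts.items())}


# Hypothetical sample records, not real project data.
RESULTS = [
    (date(2009, 6, 1), False),
    (date(2009, 6, 2), True),
    (date(2009, 6, 4), False),
    (date(2009, 6, 6), False),
    (date(2009, 6, 9), True),
    (date(2009, 6, 11), False),
]

for week, rate in weekly_error_rate(RESULTS).items():
    print(week, f"{rate:.0%}")
```

A rate that climbs week over week across many hosts would point away from a single faulty card; grouping by OS or GPU model would only need a different key in place of the week.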

SkyeHunter (Joined: 7 Mar 09, Posts: 12, Credit: 1,254,285, RAC: 0)
Message 10459 - Posted: 9 Jun 2009, 9:13:04 UTC
A bit less than 2 weeks ago I had this kind of error almost constantly. Backing down the GPU clock (including the GPU memory clock) resolved the issue. With one exception, everything GPU-Grid has thrown at that system since then has run without a glitch, although a bit slower.

jrobbio (Joined: 13 Mar 09, Posts: 59, Credit: 324,366, RAC: 0)
Message 10460 - Posted: 9 Jun 2009, 10:21:26 UTC - in response to Message 10448.
GDF wrote: Please check your driver version. It is possible that very new drivers have problems. You should use the one suggested by Nvidia for CUDA 2.1 unless you have a reason to use another one (for instance a game requiring a new driver). See the join section. Driver 181.xx are stable. gdf
I thought that we were going to be moving up to CUDA 2.2, or did the error on Nvidia's part put a stop to that?
I always receive this message on my results, but they don't error out.
Rob

Hydropower (Joined: 3 Apr 09, Posts: 70, Credit: 6,003,024, RAC: 0)
Message 10468 - Posted: 9 Jun 2009, 22:22:40 UTC
I returned all cards to stock speeds but have not crunched since. I did RMA my GPU3 (later GPU5 after swapping slots) card. It showed a hardware failure on one test with OCCT. I still think there should be a testing / validation program for the shader processors.
I found there is a bug in the memtestg80 program. It cannot run the same test on the same GPU twice in a row: the memory allocation always fails, and after a little while the allocation fails with an 'unknown error'. This sounds familiar. Installing newer drivers should rule out driver errors on that one.
Join team Bletchley Park, the innovators.

Ulf Ohlsson (Joined: 1 Jan 09, Posts: 20, Credit: 616,384, RAC: 0)
Message 10470 - Posted: 10 Jun 2009, 13:00:17 UTC - in response to Message 10459.
I tried slowing down my GPU, but there is still the same problem with WUs from GPU-Grid. SETI@home Beta runs with 100% success. It feels like a waste of time to continue crunching for GPU-Grid as long as this problem isn't solved. (In the output below, the Swedish "Felaktig funktion." means "Incorrect function.")
<core_client_version>6.6.33</core_client_version>
<![CDATA[
<message>
Felaktig funktion. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce 9800 GTX/9800 GTX+"
# Clock rate: 1850000 kilohertz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 16
# Number of cores: 128
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
------------
<core_client_version>6.6.33</core_client_version>
<![CDATA[
<message>
Felaktig funktion. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce 9800 GTX/9800 GTX+"
# Clock rate: 1850000 kilohertz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 16
# Number of cores: 128
MDIO ERROR: cannot open file "restart.coor"
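
A minimal sketch for reading output like the above: it separates the `WARNING:` lines from lines that carry an error marker, so the last real error stands out from the H-bond warning. The `classify_stderr` helper and its matching rules are assumptions made for illustration; they are not part of BOINC or the GPU-Grid application.

```python
def classify_stderr(text):
    """Split stderr text into (warnings, errors) using simple prefix rules.

    The rules are an assumption based on the lines quoted in this thread:
    warnings start with "WARNING:", errors contain "ERROR:" or start with
    "Cuda error:".
    """
    warns, errors = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("WARNING:"):
            warns.append(line)
        elif "ERROR:" in line or line.startswith("Cuda error:"):
            errors.append(line)
    return warns, errors


# Excerpt of the stderr output posted in this message.
SAMPLE = '''# Using CUDA device 0
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
'''

warns, errors = classify_stderr(SAMPLE)
print(len(warns), "warning(s)")
print("errors:", errors)
```

In this excerpt both runs end on the same `MDIO ERROR` line, and the second run fails without printing the H-bond warning at all.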

ExtraTerrestrial Apes (Volunteer moderator, Volunteer tester; Joined: 17 Aug 08, Posts: 2705, Credit: 1,311,122,549, RAC: 0)
Message 10513 - Posted: 12 Jun 2009, 21:46:52 UTC
@Hydro: seems like you're back up and running. Was it "just" the downclocking?
@Ulf: you reverted to the older 182.50 driver, which is known to be good, and you still have the error. So it does not look like software. I'd suppose hardware, although you have already tried downclocking as well. Further evidence: your WUs take a long time until they fail, which is typical for temperature / hardware failures just at the edge of stability. Seti doesn't use the GPU as hard as GPU-Grid does, so it could still run nevertheless. Try running 3D Mark and / or FurMark for an hour.
jrobbio wrote: I thought that we were going to be moving up to CUDA 2.2
The next client is going to be 2.2, but there is no reason to hurry.
Hydro wrote: Is it known what side effects this "Found zero 10-12" warning has ? Is it being investigated ?
I think Ignasi said this warning is nothing to worry about (for us). Sounds like "no side effects are known".
MrS
Scanning for our furry friends since Jan 2002

Hydropower (Joined: 3 Apr 09, Posts: 70, Credit: 6,003,024, RAC: 0)
Message 10526 - Posted: 13 Jun 2009, 9:29:40 UTC - in response to Message 10513.
Hi, not sure; the absence of GPU3 may have something to do with it too. I currently have the remaining 6 mildly overclocked to 633 (as that is an EVGA-advertised speed for G200-based cards). So far so good. GPU3(5) has been RMA'd and its slot is currently empty. Fans are at 89% with temperatures not over 65°C. It was not a power issue, as there is plenty. It is also not a driver issue (at least not with 3 cards), because it is still the same driver. I may try Linux 64 today.
regards H.
ExtraTerrestrial Apes wrote: @Hydro: seems like you're back up'n running. Was it "just" the downclocking? @Ulf: you reverted to the older 182.50 driver, which is known to be good and you still have the error. So it does not look like software. I'd suppose hardware, although you also already tried downclocking. Further evidence: your WUs take a long time until they fail, which is typical for temperature / hardware failures just at the edge of stability. Hydro wrote: Is it known what side effects this "Found zero 10-12" warning has ? Is it being investigated ? I think Ignasi said this warning is nothing to worry about (for us). Sounds like "no side effects are known". MrS
Join team Bletchley Park, the innovators.

Hydropower (Joined: 3 Apr 09, Posts: 70, Credit: 6,003,024, RAC: 0)
Message 10527 - Posted: 13 Jun 2009, 10:18:54 UTC - in response to Message 10526.
Ubuntu 8 failed installation because of: MP-BIOS bug, 8254 timer not connected to IO-APIC
Ubuntu 9 failed because it cannot detect my CD-ROM, after booting from CD-ROM...
Join team Bletchley Park, the innovators.

mikaok (Joined: 16 Jan 09, Posts: 12, Credit: 639,094, RAC: 0)
Message 10529 - Posted: 13 Jun 2009, 11:11:22 UTC - in response to Message 10513.
Same error. The GPU isn't overclocked and the driver version is 182.08.
cheers Mika

ExtraTerrestrial Apes (Volunteer moderator, Volunteer tester; Joined: 17 Aug 08, Posts: 2705, Credit: 1,311,122,549, RAC: 0)
Message 10531 - Posted: 13 Jun 2009, 14:11:23 UTC - in response to Message 10529.
Your error "Incorrect function. (0x1) - exit code 1 (0x1)" is a very general one which, roughly speaking, can happen due to anything going wrong during the calculation.
MrS
Scanning for our furry friends since Jan 2002

Hydropower (Joined: 3 Apr 09, Posts: 70, Credit: 6,003,024, RAC: 0)
Message 10532 - Posted: 13 Jun 2009, 14:45:47 UTC - in response to Message 10531.
The error is caused by "Cuda error: Kernel [frc_sum_nb_forces] failed in file 'f ". The card is not overclocked by much: 1700 MHz compared to the stock 1650 MHz for a GTS 8800. Again, I think a good shader testing program would be useful, even for overclocking tests.
Join team Bletchley Park, the innovators.

mikaok (Joined: 16 Jan 09, Posts: 12, Credit: 639,094, RAC: 0)
Message 10535 - Posted: 13 Jun 2009, 15:05:23 UTC - in response to Message 10531.
ExtraTerrestrial Apes wrote: Your error "Incorrect function. (0x1) - exit code 1 (0x1)" is a very general one which, roughly speaking, can happen due to anything going wrong during the calculation. MrS
OK, I thought this was the same error we were talking about. My bad.
Hydropower, this is an XFX version of the card, so it is guaranteed to work with these clocks.

Hydropower (Joined: 3 Apr 09, Posts: 70, Credit: 6,003,024, RAC: 0)
Message 10537 - Posted: 13 Jun 2009, 16:16:59 UTC - in response to Message 10535.
That's what I mean: only 3%.
Join team Bletchley Park, the innovators.
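
The 3% figure follows from the clocks quoted in Message 10532 (1700 MHz against a stock 1650 MHz). A small worked check; the `overclock_percent` helper name is made up for illustration.

```python
def overclock_percent(stock_mhz, actual_mhz):
    """Return how far actual_mhz is above stock_mhz, in percent."""
    return (actual_mhz - stock_mhz) / stock_mhz * 100.0


# Clocks as quoted in Message 10532: stock 1650 MHz, running at 1700 MHz.
pct = overclock_percent(1650, 1700)
print(f"{pct:.1f}% over stock")  # prints "3.0% over stock"
```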

ExtraTerrestrial Apes (Volunteer moderator, Volunteer tester; Joined: 17 Aug 08, Posts: 2705, Credit: 1,311,122,549, RAC: 0)
Message 10539 - Posted: 13 Jun 2009, 20:31:44 UTC - in response to Message 10532.
Hydropower wrote: The error is caused by "Cuda error: Kernel [frc_sum_nb_forces] failed in file 'f"
I'm quite convinced that it's a transient error, which means that at some point some calculation threw out a bad result. That means it wouldn't matter in which file and on which code line it happened, unless we discovered some regularity. I'm not disagreeing with you, but IMO saying "The error is caused by.." probably misses the point.
Hydropower wrote: Again I think, even for overclocking tests, a good shader testing program would be useful.
We don't have the perfect tool yet, but I think if a card survives FurMark for an hour without artefacts it should be fine for GPU-Grid. Yes, it doesn't run exactly the same code (but nothing except GPU-Grid itself could do that), so there might be problems where only certain combinations of instructions trigger errors. But FurMark stresses the cards so hard that it could almost be called a thermal virus, and it should easily generate 20 - 30°C more than GPU-Grid (at constant fan speed). This reduces the maximum stable frequency by quite a bit, and thus errors are much more likely to show up.
Good old 3D Mark also has error detection built in. It's far from perfect, but if you can't finish it you know you're in trouble (it doesn't work the other way around, though).
MrS
Scanning for our furry friends since Jan 2002