Encounter 10-12 H-bond term == Client error 0x1 ?
Hydropower. Joined: 3 Apr 09, Posts: 70, Credit: 6,003,024, RAC: 0
Virtually all my currently failing jobs have this "Found zero 10-12 H-bond term" warning. I have examined other people's results, and more than once the 'buddies' error out as well. Is it known what side effects this "Found zero 10-12" warning has? Is it being investigated? One job (519558) had an out-of-memory error and was terminated by XP. I had disabled my 'faulty' GPU3, so this is NOT the 'faulty' one. My errors have occurred on several GPUs today. Do we have a GPU testing program? Join team Bletchley Park, the innovators.
Joined: 1 Jan 09, Posts: 20, Credit: 616,384, RAC: 0
I'm running on CUDA device: GeForce 9800 GTX/9800 GTX+ (driver version 18608, compute capability 1.1, 1024 MB, est. 85 GFLOPS) and have exactly the same problem. The OS is Vista 64. Only 5% of the WUs complete normally.
Joined: 9 Oct 08, Posts: 50, Credit: 12,676,739, RAC: 0
I have been experiencing this type of symptom for quite a while on one of my computers, which has 2x GTX 260 Core 216 SC. Backing off the clocks doesn't seem to have helped, nor has reloading, downgrading or upgrading the driver. I welcome a solution. Crunching for the benefit of humanity and in memory of my dad and other family members.
Joined: 24 Dec 08, Posts: 738, Credit: 200,909,904, RAC: 0
I have a number of machines: 4 quaddies with GTS 250s and an i7 with dual GTX 260+. I run WinXP on them. Nothing is overclocked. I found that WUs on the GTS 250s will fail unless I run the 182.50 drivers. Even after the so-called "work around" from the project team they still failed. The GTX 260s also seemed to fail, though not as often, when I was running the 185.85 drivers. I downgraded to 182.50 and that seemed to resolve the issue. Also, it seems that you have to uninstall the old drivers before installing new ones. I use Control Panel -> Add/Remove Programs to uninstall. I am running BOINC 6.6.28 on 3 of the quaddies and 6.6.33 on the other quaddie and the i7. There is a known bug in 6.6 (up to and including 6.6.28) to do with preempting tasks. 6.6.33 won't shut down the science apps on exit, but you can use Advanced -> Shutdown connected client and then exit. BOINC blog
GDF. Joined: 14 Mar 07, Posts: 1958, Credit: 629,356, RAC: 0
Please check your driver version. It is possible that very new drivers have problems. You should use the one suggested by Nvidia for CUDA 2.1 unless you have a reason to use another one (for instance, a game requiring a newer driver). See the join section. The 181.xx drivers are stable. gdf
Joined: 1 Jan 09, Posts: 20, Credit: 616,384, RAC: 0
I tried every driver version and still get the same errors; only a few WUs finish correctly. I switched to running some WUs for SETI Beta and have finished 40 WUs in the last 20 hours, none of them with any errors at all. Perhaps it would be possible to get some statistics on how many of the returned WUs have errors over, let's say, a 2-month period. If this shows a rising graph of corrupted WUs, there might be some error on the server side. Perhaps also add info about OS and CPUs.
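The per-period error statistics suggested above could be computed along these lines. This is only a sketch: it assumes a hypothetical list of returned results, each carrying a return date, the host OS and an "errored" flag (the `Result` record and `weekly_error_rates` function are illustrative names, not part of any GPUGRID or BOINC API). It groups results by ISO week and OS and reports the fraction that errored, so a rising trend would show up as increasing rates from week to week.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Result:
    """Hypothetical record for one returned WU (field names are illustrative)."""
    returned: date
    os: str
    errored: bool


def weekly_error_rates(results):
    """Map (ISO year, ISO week, OS) to the fraction of results that errored."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in results:
        iso_year, iso_week, _ = r.returned.isocalendar()
        key = (iso_year, iso_week, r.os)
        totals[key] += 1
        if r.errored:
            errors[key] += 1
    # defaultdict returns 0 for groups with no errors, so every group gets a rate
    return {key: errors[key] / totals[key] for key in sorted(totals)}


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the output.
    sample = [
        Result(date(2009, 6, 1), "Vista 64", True),
        Result(date(2009, 6, 2), "Vista 64", False),
        Result(date(2009, 6, 3), "XP", False),
    ]
    for (year, week, os_name), rate in weekly_error_rates(sample).items():
        print(f"{year}-W{week:02d} {os_name}: {rate:.0%} errored")
```

A CPU field could be added to `Result` and to the grouping key in the same way.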
Joined: 7 Mar 09, Posts: 12, Credit: 1,254,285, RAC: 0
A bit less than 2 weeks ago I almost constantly had this kind of error. Backing down the GPU clock (including the GPU memory clock) resolved the issues. With one exception, everything GPUGRID has thrown at that system since then has run without a glitch, although a bit slower.
Joined: 13 Mar 09, Posts: 59, Credit: 324,366, RAC: 0
GDF wrote: "Please check your driver version. It is possible that very new drivers have problems. You should use the one suggested by Nvidia for CUDA 2.1 unless you have a reason to use another one (for instance a game requiring a new driver)."
I thought that we were going to be moving up to CUDA 2.2, or did the error on Nvidia's part put a stop to that? I always receive this message on my results, but they don't error out. Rob
Hydropower. Joined: 3 Apr 09, Posts: 70, Credit: 6,003,024, RAC: 0
I returned all cards to stock speeds but have not crunched since. I did RMA my GPU3 card (later GPU5, after swapping slots). It showed a hardware failure on one test with OCCT. I still think there should be a testing/validation program for the shader processors. I found there is a bug in the memtestG80 program: it cannot run the same test on the same GPU twice in a row. The memory allocation always fails, and after a little while the allocation fails with an 'unknown error'. This sounds familiar. Installing newer drivers should rule out driver errors on that one.
Joined: 1 Jan 09, Posts: 20, Credit: 616,384, RAC: 0
I tried slowing down my GPU, but there is still the same problem with WUs from GPUGRID, while SETI@home Beta runs with 100% success. It feels like a waste of time to continue crunching for GPUGRID as long as this problem isn't solved.

    <core_client_version>6.6.33</core_client_version>
    <![CDATA[
    <message>
    Incorrect function. (0x1) - exit code 1 (0x1)
    </message>
    <stderr_txt>
    # Using CUDA device 0
    # Device 0: "GeForce 9800 GTX/9800 GTX+"
    # Clock rate: 1850000 kilohertz
    # Total amount of global memory: 1073741824 bytes
    # Number of multiprocessors: 16
    # Number of cores: 128
    # Amber: readparm : Reading parm file parameters
    # PARM file in AMBER 7 format
    # Encounter 10-12 H-bond term
    WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
    WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
    MDIO ERROR: cannot open file "restart.coor"
    ------------
    <core_client_version>6.6.33</core_client_version>
    <![CDATA[
    <message>
    Incorrect function. (0x1) - exit code 1 (0x1)
    </message>
    <stderr_txt>
    # Using CUDA device 0
    # Device 0: "GeForce 9800 GTX/9800 GTX+"
    # Clock rate: 1850000 kilohertz
    # Total amount of global memory: 1073741824 bytes
    # Number of multiprocessors: 16
    # Number of cores: 128
    MDIO ERROR: cannot open file "restart.coor"
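As a rough aid for reading logs like the two above, the sketch below counts the "Found zero 10-12 H-bond term" warnings separately from the lines that actually look fatal, and pulls out the exit code. The split is an assumption based on this thread (the warning also shows up on results that finish without error), not official project guidance; `summarize_stderr` is a hypothetical helper, and its patterns are copied from stderr excerpts quoted in this thread.

```python
import re

# Patterns copied from stderr excerpts quoted in this thread. Treating the
# H-bond warning as harmless and the other two patterns as fatal is an assumption.
HBOND_WARNING_RE = re.compile(r"Found zero 10-12 H-bond term")
FATAL_RES = {
    "mdio": re.compile(r"MDIO ERROR: (.+)"),
    "cuda": re.compile(r"Cuda error: (.+)"),
}
EXIT_CODE_RE = re.compile(r"exit code (\d+)")


def summarize_stderr(text):
    """Return the H-bond warning count, fatal-looking messages and the first exit code."""
    warnings = 0
    fatal = []
    exit_code = None
    for line in text.splitlines():
        if HBOND_WARNING_RE.search(line):
            warnings += 1
        for kind, pattern in FATAL_RES.items():
            match = pattern.search(line)
            if match:
                fatal.append((kind, match.group(1).strip()))
        match = EXIT_CODE_RE.search(line)
        if match and exit_code is None:
            exit_code = int(match.group(1))
    return {"hbond_warnings": warnings, "fatal": fatal, "exit_code": exit_code}


# Abridged copy of the first log in the post above.
SAMPLE_LOG = """<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
"""

if __name__ == "__main__":
    print(summarize_stderr(SAMPLE_LOG))
```

On the sample this reports two H-bond warnings, one MDIO failure and exit code 1, which makes it easier to see that the warning and the actual failure are separate lines.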
Joined: 17 Aug 08, Posts: 2705, Credit: 1,311,122,549, RAC: 0
@Hydro: seems like you're back up and running. Was it "just" the downclocking?

@Ulf: you reverted to the older 182.50 driver, which is known to be good, and you still have the error, so it does not look like software. I'd suspect hardware, although you have also already tried downclocking. Further evidence: your WUs take a long time before they fail, which is typical of temperature or hardware failures right at the edge of stability. SETI doesn't load the GPU as hard as GPU-Grid does, so it could still run fine. Try running 3DMark and/or FurMark for an hour.

jrobbio wrote: "I thought that we were going to be moving up to CUDA 2.2"
The next client is going to be 2.2, but there's no reason to hurry.

Hydro wrote: "Is it known what side effects this 'Found zero 10-12' warning has? Is it being investigated?"
I think Ignasi said this warning is nothing to worry about (for us). Sounds like "no side effects are known".

MrS
Scanning for our furry friends since Jan 2002
Hydropower. Joined: 3 Apr 09, Posts: 70, Credit: 6,003,024, RAC: 0
MrS wrote: "@Hydro: seems like you're back up and running. Was it "just" the downclocking?"
Hi, not sure; the absence of GPU3 may have something to do with it too. I currently have the remaining 6 mildly overclocked to 633 MHz (as that is an EVGA-advertised speed for G200-based cards). So far so good. GPU3 (GPU5) has been RMA'd and its slot is currently empty. Fans are at 89%, with temperatures not over 65 °C. It was not a power issue, as there is plenty of power. It's also not a driver issue (at least not with 3 cards), because it is still the same driver. I may try 64-bit Linux today.
Regards, H.
Hydropower. Joined: 3 Apr 09, Posts: 70, Credit: 6,003,024, RAC: 0
The Ubuntu 8 installation failed because of: MP-BIOS bug: 8254 timer not connected to IO-APIC. Ubuntu 9 failed because it could not detect my CD-ROM drive, after booting from that same CD-ROM...
mikaok. Joined: 16 Jan 09, Posts: 12, Credit: 639,094, RAC: 0
Same error. My GPU isn't overclocked and the driver version is 182.08. Cheers, Mika
Joined: 17 Aug 08, Posts: 2705, Credit: 1,311,122,549, RAC: 0
Your error "Incorrect function. (0x1) - exit code 1 (0x1)" is a very general one which, roughly speaking, can happen due to anything going wrong during the calculation. MrS
Hydropower. Joined: 3 Apr 09, Posts: 70, Credit: 6,003,024, RAC: 0
The error is caused by "Cuda error: Kernel [frc_sum_nb_forces] failed in file 'f". It's not overclocked much: 1700 MHz compared to the stock 1650 MHz for an 8800 GTS. Again, I think a good shader testing program would be useful, even for overclocking tests.
mikaok. Joined: 16 Jan 09, Posts: 12, Credit: 639,094, RAC: 0
MrS wrote: "Your error "Incorrect function. (0x1) - exit code 1 (0x1)" is a very general one which, roughly speaking, can happen due to anything going wrong during the calculation."
OK, I thought this was the same error we were talking about. My bad. Hydropower, this is an XFX version of the card, so it is guaranteed to work at these clocks.
Hydropower. Joined: 3 Apr 09, Posts: 70, Credit: 6,003,024, RAC: 0
That's what I mean: only 3%.
Joined: 17 Aug 08, Posts: 2705, Credit: 1,311,122,549, RAC: 0
Hydropower wrote: "The error is caused by "Cuda error: Kernel [frc_sum_nb_forces] failed in file 'f""
I'm quite convinced that it's a transient error, meaning that at some point some calculation produced a bad result. So it wouldn't matter in which file and on which code line it happened... unless we discovered some regularity. I'm not disagreeing with you, but IMO saying "The error is caused by..." probably misses the point.

Hydropower wrote: "Again I think, even for overclocking tests, a good shader testing program would be useful."
We don't have the perfect tool yet, but I think if a card survives FurMark for an hour without artefacts, it should be fine for GPU-Grid. Yes, it doesn't run exactly the same code (nothing except GPU-Grid itself could do that), so there might be problems where only certain combinations of instructions trigger errors. But FurMark stresses the cards so hard that it could almost be called a thermal virus, and it should easily push temperatures 20 to 30 °C higher than GPU-Grid does (at constant fan speed). This reduces the maximum stable frequency by quite a bit, so errors are much more likely to show up. Good old 3DMark also has error detection built in. It's far from perfect, but if you can't finish it you know you're in trouble (it doesn't work the other way around, though).

MrS
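One way to look for the "regularity" mentioned above is to tally which kernel name appears in the failure line across many logs. Under the transient-error theory, failures would be expected to land roughly wherever the GPU spends its time, so a count heavily concentrated on one kernel, out of proportion to its share of the work, would hint at something systematic. A minimal sketch, assuming the logs are available as plain strings and that failures follow the "Cuda error: Kernel [name] failed" pattern quoted in this thread; `failing_kernels` and the second kernel name in the demo are hypothetical.

```python
import re
from collections import Counter

# Failure-line pattern copied from the stderr excerpt quoted in this thread.
KERNEL_RE = re.compile(r"Cuda error: Kernel \[(\w+)\] failed")


def failing_kernels(logs):
    """Count, per kernel name, how many logs fail in that kernel (first match per log)."""
    counts = Counter()
    for log in logs:
        match = KERNEL_RE.search(log)
        if match:  # logs without a kernel failure line are skipped
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    logs = [
        "Cuda error: Kernel [frc_sum_nb_forces] failed in file 'f",
        "Cuda error: Kernel [hypothetical_kernel] failed",  # made-up kernel name
        'MDIO ERROR: cannot open file "restart.coor"',
    ]
    for name, n in failing_kernels(logs).most_common():
        print(f"{name}: {n}")
```

With only a handful of failed results the counts say little; the comparison only becomes meaningful over many hosts and WUs.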
©2025 Universitat Pompeu Fabra