Computational Error
| Author | Message |
|---|---|
|
Joined: 20 Aug 07 Posts: 18 Credit: 1,319,274 RAC: 0
|
51-KASHIF_HIVPR_dim_ba1-2-100-RND4878_1

After 53+ hours of continuous computing, the computer finishes the workunit, only to report a "COMPUTATIONAL ERROR" and award a big fat "0" points. Looks like I'm going to have to abort these longer workunits in the future. It's not worth the frustration of a computational error.

Name: 51-KASHIF_HIVPR_dim_ba1-2-100-RND4878_1
Workunit: 424829
Created: 1 May 2009 13:02:02 UTC
Sent: 1 May 2009 13:15:56 UTC
Received: 3 May 2009 18:51:26 UTC
Server state: Over
Outcome: Client error
Client state: Compute error
Exit status: -185 (0xffffffffffffff47)
Computer ID: 27410
Report deadline: 6 May 2009 13:15:56 UTC
CPU time: 1208.969
stderr out:
<core_client_version>6.4.7</core_client_version>
<![CDATA[
<message>
Can't write init file: -108
</message>
]]>
Validate state: Invalid
Claimed credit: 8076.97800925926
Granted credit: 0
application version: 6.64

I know you got some use out of this because it sent in a 51.94 MB completion file. |
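The exit status in the post above shows one value in two forms: -185 as a signed integer, and 0xffffffffffffff47 as its unsigned 64-bit two's-complement representation. A minimal Python sketch of that conversion follows; the helper names are illustrative and are not part of BOINC.

```python
def to_unsigned_hex(value: int, bits: int = 64) -> str:
    """Render a signed integer as its two's-complement hex form."""
    mask = (1 << bits) - 1          # e.g. 0xffffffffffffffff for 64 bits
    return hex(value & mask)


def from_unsigned_hex(text: str, bits: int = 64) -> int:
    """Recover the signed integer from a two's-complement hex string."""
    raw = int(text, 16)             # accepts an optional "0x" prefix
    if raw >= 1 << (bits - 1):      # top bit set means a negative value
        raw -= 1 << bits
    return raw


print(to_unsigned_hex(-185))
print(from_unsigned_hex("0xffffffffffffff47"))
```

Reading the hex form back as a signed number is usually the quicker way to match a reported status against a list of numeric error codes.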
|
Joined: 1 Feb 09 Posts: 139 Credit: 575,023 RAC: 0
|
Looks to me like there is a programming error:

<core_client_version>6.6.26</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce 9600 GT"
# Clock rate: 1674000 kilohertz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 8
# Number of cores: 64
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [fft_data_swizzle_out] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 61 : unknown error.
</stderr_txt>
]]>

That is the 3rd in a row which failed. |
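When comparing failed results like the one above, it can help to reduce the stderr text to the card, its memory, and the lines that report errors. A minimal Python sketch, assuming the stderr layout shown in this post; `summarize_stderr` and `STDERR_SAMPLE` are hypothetical names, not part of BOINC or the GPUGRID application.

```python
import re

# Excerpt of the stderr text quoted in the post above (raw string keeps the
# Windows path backslashes literal).
STDERR_SAMPLE = r"""# Using CUDA device 0
# Device 0: "GeForce 9600 GT"
# Clock rate: 1674000 kilohertz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 8
# Number of cores: 64
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [fft_data_swizzle_out] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 61 : unknown error.
"""


def summarize_stderr(text: str) -> dict:
    """Pull the device name, memory size and error lines out of a stderr dump."""
    summary = {"device": None, "memory_mib": None,
               "failed_kernel": None, "errors": []}
    for line in text.splitlines():
        line = line.strip()
        device = re.match(r'# Device \d+: "(.+)"', line)
        memory = re.match(r"# Total amount of global memory: (\d+) bytes", line)
        if device:
            summary["device"] = device.group(1)
        elif memory:
            # Convert bytes to MiB (1 MiB = 2**20 bytes).
            summary["memory_mib"] = int(memory.group(1)) // 2**20
        elif line.startswith(("MDIO ERROR:", "Cuda error:")):
            summary["errors"].append(line)
            kernel = re.search(r"Kernel \[(\w+)\] failed", line)
            if kernel:
                summary["failed_kernel"] = kernel.group(1)
    return summary


print(summarize_stderr(STDERR_SAMPLE))
```

For this post the summary names a GeForce 9600 GT with 512 MiB of global memory and the `fft_data_swizzle_out` kernel, which makes it easier to line up against other hosts reporting the same failure.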
Zydor Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
I just had one dump out on me a couple of minutes ago at the start of processing: http://www.gpugrid.net/workunit.php?wuid=431994 Looking at other threads, others have had this type go bang in the last 24 hrs, so maybe there is a bad one out there? Rare, I know, but it's an inescapable thought - some traditionally "reliable" high-volume crunchers have had one go bang (e.g. Paul) - so it would be worth digging a little, it seems a bit strange .... Regards Zy |
Zydor Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
Paul, the one I posted above is coming your way - you just downloaded it ..... :) Regards Zy |
|
Joined: 20 Aug 07 Posts: 18 Credit: 1,319,274 RAC: 0
|
I'd rather have a workunit dump at the beginning than after it has completed its processing and been reported back to the Grid servers. It was a waste of computing power and time. |
Zydor Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
That's for sure - don't know about this one, could have been my error, will be interesting to see if Paul gets through it. Regards Zy |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
Zydor wrote: "I just had one dump out on me a couple of minutes ago at the start of processing" Um, you are not going to like this ... I am two hours in (2:18) and 18.3% done. Running just fine on my GTX295 card ... 9:22 hours to go ... For a small batch run I sure am getting a lot of them ... Hmmm, I wonder if there is a memory issue? However, I do have a crash on 13-KASHIF_HIVPR_mon_ba3-6-100-RND2474_0. Though there is no really specific error, I got the "Incorrect function. (0x1) - exit code 1 (0x1)" error. It has already crashed for another person too ... I don't have as many as I first thought, only about 5 completed, plus the one error and the two in work. |
Zydor Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
Interesting, nicely done :) Hmmmm wonder why it went bang for me ? First one for a while, all seems ok, one of those things at present. Thanks for the heads up, I'll keep my eye open more than usual in case something lurketh. Regards Zy |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
I still wonder if it is not something to do with GPU memory size, mine is nearly twice yours ... |
|
Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 |
Quote: "Looks to me like there is a programming error" You are showing as running the (beta) 185.81 driver. I had problems with it too (and the swizzle_out error on one WU). I'm now running 182.50, which seems to work. BOINC blog |
Zydor Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
Paul D. Buck wrote: "I still wonder if it is not something to do with GPU memory size, mine is nearly twice yours ..." Had a quick look at past ones; I have done two other KASHIF_HIVPR WUs: http://www.gpugrid.net/workunit.php?wuid=414191 http://www.gpugrid.net/workunit.php?wuid=421636 They went through OK. No idea if they were the "same" as the one that went bang. The latter may well have been something I did at the time: the CUDA card runs on my home office main beastie. Normally my activities on it have not been an issue, but they may have been this time. Just posted the above for completeness in case it throws up anything of interest. Regards Zy |
|
Joined: 25 Oct 08 Posts: 42 Credit: 42,812,268 RAC: 0
|
I usually get these messages, and then the WUs error out: 05/05/2009 06:41:14 GPUGRID Computation for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 finished. This is on driver 185.81, BOINC 6.6.20 and Vista 64. :/ My host: http://www.gpugrid.net/results.php?hostid=31684 with 1 GTX 260 and 1 8800 GT |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
I had my only error ever a couple days ago on a KASHIF WU. They also take too long on slower cards. Is there a way to set the client not to DL these or do we just have to watch for them? |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
Quote: "I usually get these messages, and then the WUs error out" You have it backwards: actually you get the error first, then the WU is marked as finished, and then BOINC complains about the missing files. Those, I suppose, are not there because the WU was terminated abnormally instead of successfully writing its result files before gracefully shutting down. And my Vista 64 machine is running 185.66 and 6.5.0 without problems. You might want to try this driver. MrS Scanning for our furry friends since Jan 2002 |
|
Joined: 25 Oct 08 Posts: 42 Credit: 42,812,268 RAC: 0
|
So is it a problem with BOINC or with my drivers? Thanks. |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
Neither: this BOINC message tells us nothing except that there was an error. Apart from this: since the 8th of May all your WUs have errored out. What did you change? You clocked your 8800GT down, which shouldn't cause these errors. I suspect an upgrade to a new beta driver, which somehow messes things up. You might want to try a proven version like 182.50 or 182.08 and remove the newer one with a driver cleaner. You could also upgrade to BOINC 6.6.23, since it fixed at least one major bug in 6.6.20. You could also try with only one card installed to reduce the number of variables in your config. MrS Scanning for our furry friends since Jan 2002 |
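The driver reports scattered through this thread can be collected into a small lookup for checking a host before it picks up more long workunits. A sketch in Python that encodes only what posters here reported; it is anecdotal, not an official compatibility list, and `driver_status` is a hypothetical helper.

```python
# Versions as reported by posters in this thread (anecdotal, May 2009).
REPORTED_WORKING = {"182.08", "182.50", "185.66"}
REPORTED_PROBLEMATIC = {"185.81"}  # the beta driver tied to the swizzle_out error


def driver_status(version: str) -> str:
    """Classify an NVIDIA driver version by what this thread reported."""
    version = version.strip()
    if version in REPORTED_PROBLEMATIC:
        return "reported problematic"
    if version in REPORTED_WORKING:
        return "reported working"
    return "no report in this thread"


for candidate in ("185.81", "182.50", "190.00"):
    print(candidate, "->", driver_status(candidate))
```

A version missing from both sets says nothing either way, which is why the third outcome is kept separate rather than treated as safe.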
©2025 Universitat Pompeu Fabra