6 Errors Today [Problems with "KASHIF_HIVPR" and "IBUCH_KID"-WUs]

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 9435 - Posted: 7 May 2009, 14:33:09 UTC

My first 2 errors ever AFAIK: the 1st a 76-KASHIF_HIVPR WU and the 2nd one of the infamous 76-IBUCH_KID WUs.
Two different cards, both 9600 GSO. Notice a similarity in the error messages?

<core_client_version>6.6.24</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce 9600 GSO"
# Clock rate: 1674000 kilohertz
# Total amount of global memory: 402325504 bytes
# Number of multiprocessors: 12
# Number of cores: 96
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 50: cufftExecC2C (gridcalc2.1)
called boinc_finish

</stderr_txt>
]]>



<core_client_version>6.6.20</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce 9600 GSO"
# Clock rate: 1458000 kilohertz
# Total amount of global memory: 804978688 bytes
# Number of multiprocessors: 12
# Number of cores: 96
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
ERROR: c:\cy
</stderr_txt>
]]>
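
To compare reports like the two above mechanically, here is a minimal Python sketch. It assumes the stderr text has been saved as a plain string; the regular expression and the helper name `error_sites` are illustrative only and are not part of BOINC or the GPUGRID application.

```python
import re

# Matches lines such as:
#   ERROR: <path>\CPME_cufft.cu, line 50: cufftExecC2C (gridcalc2.1)
ERROR_RE = re.compile(r"^ERROR: (?P<path>.+), line (?P<line>\d+): (?P<what>.+)$")

def error_sites(stderr_txt):
    """Return (source file name, line number, failing call) for each ERROR line."""
    sites = []
    for raw in stderr_txt.splitlines():
        m = ERROR_RE.match(raw.strip())
        if m:
            # Keep only the file name; the build path is machine specific.
            name = re.split(r"[\\/]", m.group("path"))[-1]
            sites.append((name, int(m.group("line")), m.group("what")))
    return sites

# Excerpt copied from the first report above.
first_report = r"""
MDIO ERROR: cannot open file "restart.coor"
ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 50: cufftExecC2C (gridcalc2.1)
called boinc_finish
"""

print(error_sites(first_report))
```

The second report is cut off at `ERROR: c:\cy`, so it yields no match and cannot be compared this way; only the lines before the truncation (same warnings, same missing `restart.coor`) are known to be identical.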



Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9444 - Posted: 7 May 2009, 18:14:11 UTC - in response to Message 9435.  

Got one through ok, then the next went bang after 30 mins.

Successful one was:
http://www.gpugrid.net/result.php?resultid=636960 A GIANNI

The one that failed this time - a KASHIF_HIVPR
http://www.gpugrid.net/result.php?resultid=639025
ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 104: cufftExecC2R (gridcalc3)

With this one I was at the PC when it went. There was a system warning popup; I didn't get it word for word, as I only saw a flash before it disappeared: "something something could not be contacted, video driver restarted". Don't hang your hat on that wording, but essentially it looks as though the video driver lost connection and the system automatically restarted it. When it did that, instant computation error.

I will ferret in the log files; I have the PC logged to death, so hopefully I can dig something up about it.

Two more downloaded, a GIANNI and a KASHIF. I suspended the GIANNI and will try another KASHIF to see what happens.

Regards
Zy

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9446 - Posted: 7 May 2009, 19:22:20 UTC - in response to Message 9444.  
Last modified: 7 May 2009, 19:31:33 UTC

The KASHIF lasted 37 mins and went bang. A GIANNI is now running.
The failed KASHIF: http://www.gpugrid.net/result.php?resultid=640997
Error was: Cuda error: Kernel [fft_data_swizzle_out] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 94 : unknown error.

(Not seen a "swizzle_out" error before)

Started this one - a GIANNI - and on past performance it will probably go through ok:
http://www.gpugrid.net/result.php?resultid=641393

[Edit] If there is any debugging switch or log file, or whatever, that I can enable at this end that will help, please let me know and I will. If you want me to run a series of suspect ones (etc.), let me know how and I will. [/Edit]

Regards
Zy

[boinc.at] Nowi
Joined: 4 Sep 08
Posts: 44
Credit: 3,685,033
RAC: 0
Message 9451 - Posted: 7 May 2009, 20:59:37 UTC

I have gotten another error on a 2-KASHIF_HIVPR WU (result). The error appeared after more than 16 hours of computation on an 8800GT. Now I have three errors in a row. In my opinion this is unacceptable!

(_KoDAk_)
Joined: 18 Oct 08
Posts: 43
Credit: 6,924,807
RAC: 0
Message 9459 - Posted: 8 May 2009, 7:34:55 UTC
Last modified: 8 May 2009, 7:35:47 UTC

boinc 6.6.24 x64

By KoDAkthebest
and some errors:
http://www.gpugrid.net/results.php?hostid=31714

ignasi
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Message 9461 - Posted: 8 May 2009, 8:53:00 UTC - in response to Message 9459.  

We are digging into these problems.

thanks,
ignasi

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9462 - Posted: 8 May 2009, 9:12:01 UTC - in response to Message 9461.  
Last modified: 8 May 2009, 9:15:20 UTC

Hi Ignasi

I had a look at all my computation-error WUs this morning, now that most have finally gone through. All the KASHIF ones go bang when crunched by a 9800GTX+ or below. If the wingman is a 260 or above, they go through. I am aware this is a crude deduction on my part, as I have a very limited overview of the problems; however, it does now seem pretty solid that KASHIFs don't go through on cards rated 9800GTX+ and below.

If that's starting to be the case, do you still want cards of 9800GTX+ and below to run the KASHIFs? If you do, fine; I just hate running ones that will go bang, as it only delays their crunching by cards that can do it.

If you don't, I can just abort a KASHIF if I spot one coming through.

Regards
Zy

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1958
Credit: 629,356
RAC: 0
Message 9464 - Posted: 8 May 2009, 9:33:49 UTC - in response to Message 9462.  
Last modified: 8 May 2009, 9:53:02 UTC

Am I right to say that all the problems are related to older cards, like the 8800, 9800 and so on?
Did anyone experience repeated failures on those workunits with a 260, 275, 295 or 285?

gdf

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9465 - Posted: 8 May 2009, 9:47:32 UTC - in response to Message 9464.  
Last modified: 8 May 2009, 9:51:01 UTC

In addition to my post 9444 above.

Just remembered, and it's only a part of it (it's really annoying that I only got a flash of it as it went away): the error message referred to a file "nv???????". It may be a DLL reference, I can't remember. NV is probably no stunning revelation, but there it is for what it's worth. Whatever the final full name, the error message claimed it had "stopped" and that the system had restarted it. Instantly the WU went bang. All the CPU-based tasks for other projects I run have been unaffected by all this, whether during normal running or when the KASHIFs go bang.

I seem to remember another post about a week ago where a suspicion was voiced that the memory size might be too small for these, i.e. at present maybe it needs 1 GB cards and goes bang on 512 MB cards?

Regards
Zy
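
The memory-size suspicion can be checked against the two failed tasks in Message 9435, whose logs print the "Total amount of global memory" in bytes. The sketch below is only a unit conversion; the dictionary labels are made up for readability.

```python
MIB = 1024 * 1024

# "Total amount of global memory" values from the two failed tasks in Message 9435.
reported_bytes = {
    "9600 GSO (first report)": 402325504,
    "9600 GSO (second report)": 804978688,
}

def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB

for card, n_bytes in reported_bytes.items():
    print(f"{card}: {to_mib(n_bytes):.1f} MiB")
```

Both cards failed even though the second reports roughly twice the memory of the first (about 768 MiB versus 384 MiB). That hints that memory size alone may not explain the failures, although two data points prove little.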

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9467 - Posted: 8 May 2009, 10:05:25 UTC - in response to Message 9464.  

Just had another KASHIF go bang; it lasted 57 mins.

http://www.gpugrid.net/result.php?resultid=643475

Error message:
Cuda error: Kernel [fft_data_swizzle_out] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 94 : unknown error.

swizzle_out is starting to be a common one for me.

Got to go out now and meet a client, won't be back until around 4pm UTC.

Regards
Zy

mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Message 9468 - Posted: 8 May 2009, 10:41:34 UTC

I have had random failures on all my cards [8800gt/9600gso/9800gt/gts250] except the gtx260-192/216.

Some fail in a short period; others linger much longer.
mike

SkyeHunter
Joined: 7 Mar 09
Posts: 12
Credit: 1,254,285
RAC: 0
Message 9469 - Posted: 8 May 2009, 11:02:27 UTC
Last modified: 8 May 2009, 11:06:56 UTC

Yup, similar issue here.

Yesterday I got a WU that got stuck at 18% on my 8800GT. No error messages though; the BOINC Manager thought the process was still running, but it stayed at the same progress for at least 12 hours...

Cancelled the WU manually and started another one 18 hours ago. Usually WUs take a little less than 13 hours, and the current one hasn't reported yet (nor has a new WU been uploaded; I keep my queue very short...). Probably this evening I will see a similar issue.

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 9470 - Posted: 8 May 2009, 12:09:45 UTC - in response to Message 9465.  

[quote]
In addition to my post 9444 above.

Just remembered, and it's only a part of it (it's really annoying that I only got a flash of it as it went away): the error message referred to a file "nv???????". It may be a DLL reference, I can't remember. NV is probably no stunning revelation, but there it is for what it's worth. Whatever the final full name, the error message claimed it had "stopped" and that the system had restarted it. Instantly the WU went bang. All the CPU-based tasks for other projects I run have been unaffected by all this, whether during normal running or when the KASHIFs go bang.

I seem to remember another post about a week ago where a suspicion was voiced that the memory size might be too small for these, i.e. at present maybe it needs 1 GB cards and goes bang on 512 MB cards?

Regards
Zy
[/quote]

My GTS250s are only 512 MB and they seem to work with KASHIF WUs. I did suggest the driver version as a culprit. I was having problems last week on my GTX260s, and uninstalling the driver (a 185 variant) and going back to 182.50 seemed to cure them.

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 9471 - Posted: 8 May 2009, 12:14:16 UTC - in response to Message 9469.  

[quote]
Yup, similar issue here.

Yesterday I got a WU that got stuck at 18% on my 8800GT. No error messages though; the BOINC Manager thought the process was still running, but it stayed at the same progress for at least 12 hours...

Cancelled the WU manually and started another one 18 hours ago. Usually WUs take a little less than 13 hours, and the current one hasn't reported yet (nor has a new WU been uploaded; I keep my queue very short...). Probably this evening I will see a similar issue.
[/quote]

Ahh, the "never ending WU" bug. What version of BOINC are you running? It seems to have been fixed from 6.6.23 onwards.

dyeman
Joined: 21 Mar 09
Posts: 35
Credit: 591,434,551
RAC: 0
Message 9472 - Posted: 8 May 2009, 12:24:45 UTC - in response to Message 9471.  

See this thread also. I had hanging WUs using 6.6.17, and installing 6.6.23 didn't help. Installing Nvidia driver 185.85 fixed the hanging problem, but I haven't had a WU process successfully since (though it may not be a driver issue: I am currently running a GIANNI WU, which is at 67% and looking OK).

dataman
Joined: 18 Sep 08
Posts: 36
Credit: 100,352,867
RAC: 0
Message 9473 - Posted: 8 May 2009, 13:14:10 UTC - in response to Message 9464.  

[quote]
Am I right to say that all the problems are related to older cards, like the 8800, 9800 and so on?
Did anyone experience repeated failures on those workunits with a 260, 275, 295 or 285?

gdf
[/quote]

I have seven 9800GTs and one 8800GT. All have experienced failures. I'm on 6.6.20 and 185.85. I'm shutting them down until this problem is fixed. Good luck!

[boinc.at] Nowi
Joined: 4 Sep 08
Posts: 44
Credit: 3,685,033
RAC: 0
Message 9474 - Posted: 8 May 2009, 14:09:50 UTC - in response to Message 9468.  

[quote]
I have had random failures on all my cards [8800gt/9600gso/9800gt/gts250] except the gtx260-192/216.
[/quote]

All of these are GPUs lower than G200. Maybe this is a clue.
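
The grouping suggested here can be sketched in a few lines of Python. The `FAMILY` table is an assumption based on general NVIDIA product knowledge rather than anything stated by the project, and the reports are only mike047's per-model summary from Message 9468, not per-task results.

```python
from collections import defaultdict

# Assumed chip family per card model (general product knowledge, not project data).
FAMILY = {
    "8800GT": "G9x",
    "9600GSO": "G9x",
    "9800GT": "G9x",
    "GTS250": "G9x",
    "GTX260": "GT200",
}

# (card model, saw failures?) as summarised by mike047 in Message 9468.
reports = [
    ("8800GT", True),
    ("9600GSO", True),
    ("9800GT", True),
    ("GTS250", True),
    ("GTX260", False),
]

def failures_by_family(reports):
    """Count card models with and without failures, grouped by chip family."""
    tally = defaultdict(lambda: {"failed": 0, "ok": 0})
    for card, failed in reports:
        tally[FAMILY[card]]["failed" if failed else "ok"] += 1
    return dict(tally)

print(failures_by_family(reports))
```

One host's summary is a small sample, and MarkJ reports above that his 512 MB GTS250s do complete KASHIF WUs, so the split by chip family is a lead to test rather than a conclusion.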

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9476 - Posted: 8 May 2009, 14:48:13 UTC

I hate to be a wet blanket.

But my 9800GT has five (5) total successful runs on just page one of my task list, so it is NOT the card, unless it is related to memory, as this card has 1 GB of VRAM ...

I am using driver 182.50, so it may be THAT ... Win XP Pro, 32-bit is the other variant that may be an issue. BOINC Version 6.5.0 ...

The 6.6.x versions did have some scheduler problems, from something in the teens at least to 6.6.22 ... 6.6.23 and later seem to have cured that issue.

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9478 - Posted: 8 May 2009, 15:53:18 UTC - in response to Message 9476.  

Above I mentioned a file that was "stopped" and restarted at the same moment the WU went bang. I found the error message for it. I have no idea whether it means anything for the current problem, or what it means in itself; however, I am posting it for completeness, as it did happen at the exact moment the WU went bang. "nvlddmkm" was the name I was struggling to remember from the system error message.

The error message reads:

"The description for Event ID 4101 from source Display cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

nvlddmkm "


It was located in:
Event Viewer/Custom Views/Administrative Events
Source: display.

At the time it said it was "restarted" presumably referring to nvlddmkm - whatever that is :)

Regards
Zy
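
For anyone wanting to check their own machine for the same entry: Event ID 4101 from source Display naming nvlddmkm (NVIDIA's kernel-mode display driver) is what Windows logs when the display driver stopped responding and was reset. A hedged Python sketch for listing recent occurrences follows. It assumes the events sit in the System log and uses the `wevtutil` tool that ships with Windows Vista and later; the helper name `build_query` is made up for this example.

```python
import subprocess
import sys

def build_query(event_id=4101, count=5):
    """Build a wevtutil command listing the newest matching events in the System log."""
    xpath = f"*[System[(EventID={event_id})]]"
    return ["wevtutil", "qe", "System", f"/q:{xpath}", f"/c:{count}", "/rd:true", "/f:text"]

if __name__ == "__main__":
    cmd = build_query()
    print(" ".join(cmd))
    if sys.platform == "win32":
        # Only attempt the query on Windows, where wevtutil is available.
        print(subprocess.run(cmd, capture_output=True, text=True).stdout)
```

Comparing the timestamps of those events with the times at which WUs errored out would show whether every failure coincides with a driver reset, as it did in the case described above.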

mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Message 9480 - Posted: 8 May 2009, 16:24:22 UTC - in response to Message 9473.  

[quote]
[quote]
Am I right to say that all the problems are related to older cards, like the 8800, 9800 and so on?
Did anyone experience repeated failures on those workunits with a 260, 275, 295 or 285?

gdf
[/quote]

I have seven 9800GTs and one 8800GT. All have experienced failures. I'm on 6.6.20 and 185.85. I'm shutting them down until this problem is fixed. Good luck!
[/quote]


I'll give it one more day, maybe two and I will do likewise.

I am very surprised at the admin/developers this time. Usually there is a little more input/concern shown.

Have I missed a thread from the project that explains what is happening and their concern??
mike