6 Errors Today [Problems with "KASHIF_HIVPR" and "IBUCH_KID"-WUs]
| Author | Message |
|---|---|
Beyond · Joined: 23 Nov 08 · Posts: 1112 · Credit: 6,162,416,256 · RAC: 0
|
My first 2 errors ever AFAIK, the 1st a 76-KASHIF_HIVPR WU and the 2nd one of the infamous 76-IBUCH_KID WUs. Two different cards, both 9600 GSO. Notice a similarity in the error messages?

    <core_client_version>6.6.24</core_client_version>
    <![CDATA[
    <message>
     - exit code 98 (0x62)
    </message>
    <stderr_txt>
    # Using CUDA device 0
    # Device 0: "GeForce 9600 GSO"
    # Clock rate: 1674000 kilohertz
    # Total amount of global memory: 402325504 bytes
    # Number of multiprocessors: 12
    # Number of cores: 96
    # Amber: readparm : Reading parm file parameters
    # PARM file in AMBER 7 format
    # Encounter 10-12 H-bond term
    WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
    WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
    MDIO ERROR: cannot open file "restart.coor"
    ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 50: cufftExecC2C (gridcalc2.1)
    called boinc_finish
    </stderr_txt>
    ]]>

    <core_client_version>6.6.20</core_client_version>
    <![CDATA[
    <message>
     - exit code 98 (0x62)
    </message>
    <stderr_txt>
    # Using CUDA device 0
    # Device 0: "GeForce 9600 GSO"
    # Clock rate: 1458000 kilohertz
    # Total amount of global memory: 804978688 bytes
    # Number of multiprocessors: 12
    # Number of cores: 96
    # Amber: readparm : Reading parm file parameters
    # PARM file in AMBER 7 format
    # Encounter 10-12 H-bond term
    WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
    WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
    MDIO ERROR: cannot open file "restart.coor"
    ERROR: c:\cy
    </stderr_txt>
    ]]>

|
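For anyone wanting to compare failures across several results, here is a minimal Python sketch that pulls the failing source file, line number and call out of a pasted `stderr_txt`. It only handles the `ERROR: <path>, line N: <call>` form shown in the post above; the function name `failing_calls` and the regular expression are illustrative assumptions, not part of BOINC or the GPUGRID application.

```python
import re

# Matches lines such as:
#   ERROR: <path>\CPME_cufft.cu, line 50: cufftExecC2C (gridcalc2.1)
# Lines like "MDIO ERROR: ..." do not start with "ERROR: " and are skipped.
ERROR_RE = re.compile(r"^ERROR: (?P<path>.+?), line (?P<line>\d+): (?P<call>.+)$")


def failing_calls(stderr_text):
    """Return a list of (source file, line number, failing call) tuples."""
    hits = []
    for raw in stderr_text.splitlines():
        match = ERROR_RE.match(raw.strip())
        if match:
            # Keep only the file name; the build path is machine-specific.
            filename = re.split(r"[\\/]", match.group("path"))[-1]
            hits.append((filename, int(match.group("line")), match.group("call").strip()))
    return hits


# Excerpt copied from the first log in the post above.
sample = r'''
MDIO ERROR: cannot open file "restart.coor"
ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 50: cufftExecC2C (gridcalc2.1)
called boinc_finish
'''

print(failing_calls(sample))
```

Running this over each failed result makes it easier to see whether the errors cluster in the same source file. A truncated log, like the second one above, simply yields no match.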
Zydor · Joined: 8 Feb 09 · Posts: 252 · Credit: 1,309,451 · RAC: 0
|
Got one through OK, then the next went bang after 30 mins. The successful one was a GIANNI: http://www.gpugrid.net/result.php?resultid=636960 The one that failed this time was a KASHIF_HIVPR: http://www.gpugrid.net/result.php?resultid=639025 ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 104: cufftExecC2R (gridcalc3) With this one I was at the PC when it went. There was a system warning popup message; I didn't get it word for word, only saw a flash as it disappeared: "something something could not be contacted, video driver restarted". Don't hang your hat on that wording, but essentially it looks as though the video driver lost connection and the system auto-restarted it, and when it did that, instant computation error. I will ferret in the log files; I have the PC logged to death, so hopefully I can dig something up about it. Two more downloaded, a GIANNI and a KASHIF. I suspended the GIANNI and will try another KASHIF to see what happens. Regards Zy |
Zydor · Joined: 8 Feb 09 · Posts: 252 · Credit: 1,309,451 · RAC: 0
|
The KASHIF lasted 37 mins and went bang. A GIANNI is now running. The failed KASHIF: http://www.gpugrid.net/result.php?resultid=640997 Error was: Cuda error: Kernel [fft_data_swizzle_out] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 94 : unknown error. (Not seen a "swizzle_out" error before.) Started this one, a GIANNI, and on past performance it will probably go through OK: http://www.gpugrid.net/result.php?resultid=641393 [Edit] If there is any debugging switch or log file, whatever, that I can enable at this end that will help, please let me know and I will. If you want me to run a series of suspect ones (etc.), let me know how and I will. [/Edit] Regards Zy |
|
Joined: 4 Sep 08 · Posts: 44 · Credit: 3,685,033 · RAC: 0
|
I have gotten another error on a 2-KASHIF_HIVPR WU (result). The error appeared after more than 16 hours of computation on an 8800GT. Now I have three errors in a row. In my opinion this is unacceptable! |
(_KoDAk_) · Joined: 18 Oct 08 · Posts: 43 · Credit: 6,924,807 · RAC: 0
|
|
|
Joined: 10 Apr 08 · Posts: 254 · Credit: 16,836,000 · RAC: 0
|
We are digging into these problems. Thanks, ignasi |
Zydor · Joined: 8 Feb 09 · Posts: 252 · Credit: 1,309,451 · RAC: 0
|
Hi Ignasi, I had a look at all my computation-error ones this morning now that most have finally gone through. All the KASHIF ones go bang when crunched by a 9800GTX+ or below. If the wingman is a 260 or above, they go through. I am aware this is a crude deduction on my part as I have a very limited overview of the problems; however, it does now seem pretty solid that KASHIFs don't go through on cards rated 9800GTX+ and below. If that's starting to be the case, do you still want the cards of 9800GTX+ and below to run the KASHIFs? If you do, fine; I just hate running ones that will go bang, as it only delays their crunching by cards that can do it. If you don't, I can just abort a KASHIF if I spot one coming through. Regards Zy |
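The deduction above can be checked more systematically by tallying results per card class. The sketch below is a minimal Python illustration under stated assumptions: the split into "260 and above" versus "9800GTX+ and below" is modelled as a fixed set of model numbers, the names `card_class` and `tally` are hypothetical, and the sample records are placeholders for illustration, not real results from this thread.

```python
import re
from collections import defaultdict

# Assumption: "260 and above" is modelled as this fixed set of model numbers.
NEWER_MODELS = {"260", "275", "280", "285", "295"}


def card_class(card_name):
    """Classify a card by the first number in its name, e.g. 'GTX260-192' -> '260'."""
    match = re.search(r"\d+", card_name)
    model = match.group(0) if match else ""
    return "260 and above" if model in NEWER_MODELS else "9800GTX+ and below"


def tally(results):
    """Map each card class to [successes, failures] for (card_name, succeeded) pairs."""
    counts = defaultdict(lambda: [0, 0])
    for card, succeeded in results:
        counts[card_class(card)][0 if succeeded else 1] += 1
    return dict(counts)


# Placeholder records for illustration only.
example = [("9800GTX+", False), ("GTX260-192", True), ("8800GT", False)]
print(tally(example))
```

Feeding in one record per reported result, including successes on older cards, would show how well the split holds up rather than relying on a handful of failures.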
GDF · Joined: 14 Mar 07 · Posts: 1958 · Credit: 629,356 · RAC: 0 |
Am I right to say that all the problems are related to older cards, like 8800, 9800 and so on? Did anyone experience repeated failures on those workunits with a 260, 275, 295 or 285? gdf |
Zydor · Joined: 8 Feb 09 · Posts: 252 · Credit: 1,309,451 · RAC: 0
|
Additional to my post at 9444 above. Just remembered, and it's only a part of it (it's really annoying that I only got a flash of it as it went away): the error message referred to a file "nv???????". It may be a DLL reference, can't remember. NV is probably no stunning revelation, but there it is for what it's worth. Whatever the final full name, the error message claimed it had "stopped" and the system had restarted it. Instantly the WU went bang. All CPU-based models for other projects I run have been unaffected by all this, whether during normal running or when the KASHIFs go bang. I seem to remember another post about a week ago where there was a suspicion voiced about the memory size possibly being too small for these, i.e. at present maybe it needs 1GB cards and goes bang on 512MB cards? Regards Zy |
Zydor · Joined: 8 Feb 09 · Posts: 252 · Credit: 1,309,451 · RAC: 0
|
Just had another KASHIF go bang; it lasted 57 mins: http://www.gpugrid.net/result.php?resultid=643475 Error message: Cuda error: Kernel [fft_data_swizzle_out] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 94 : unknown error. swizzle_out is starting to be a common one for me. Got to go out now and meet a client, won't be back until around 4pm UTC. Regards Zy |
mike047 · Joined: 21 Dec 08 · Posts: 47 · Credit: 7,330,049 · RAC: 0
|
I have had random failures on all my cards [8800gt/9600gso/9800gt/gts250] except the gtx260-192/216. Some fail in a short period; others linger much longer. mike |
|
Joined: 7 Mar 09 · Posts: 12 · Credit: 1,254,285 · RAC: 0
|
Yup, similar issue here. Yesterday I got a WU that got stuck at 18% on my 8800GT. No error messages though; the BOINC manager thought the process was still running, but it remained at the same progress for at least 12 hours. I cancelled the WU manually and started another one 18 hours ago. WUs usually take a little less than 13 hours, and the current one hasn't reported yet (nor has a new WU been uploaded; I keep my queue very short). Probably this evening I will see a similar issue. |
|
Joined: 24 Dec 08 · Posts: 738 · Credit: 200,909,904 · RAC: 0 |
"Additional to my post at 9444 above." My GTS250s are only 512MB and they seem to work with KASHIF WUs. I did suggest the driver version as a culprit. I was having problems last week on my GTX260s; uninstalling the driver (a 185 variant) and going back to 182.50 seemed to cure them. BOINC blog |
|
Joined: 24 Dec 08 · Posts: 738 · Credit: 200,909,904 · RAC: 0 |
"Yup, similar issue here." Ahh, the "never ending WU" bug. What version of BOINC are you running? It seems to have been fixed from 6.6.23 onwards. BOINC blog |
|
Joined: 21 Mar 09 · Posts: 35 · Credit: 591,434,551 · RAC: 0
|
See this thread also. I had hanging WUs using 6.6.17, and installing 6.6.23 didn't help. Installing Nvidia driver 185.85 fixed the hanging problem, but I haven't had a WU process successfully since (though that may not be a driver issue: I am currently running a GIANNI WU, which is at 67% and looking OK). |
dataman · Joined: 18 Sep 08 · Posts: 36 · Credit: 100,352,867 · RAC: 0
|
"Am I right to say that all the problems are related to older cards, like 8800, 9800 and so on?" I have seven 9800GTs and one 8800GT. All have experienced failures. I'm on 6.6.20 and 185.85. I'm shutting them down until this problem is fixed. Good luck!
|
|
Joined: 4 Sep 08 · Posts: 44 · Credit: 3,685,033 · RAC: 0
|
"I have had random failures on all my cards [8800gt/9600gso/9800gt/gts250] except the gtx260-192/216." All of these are GPUs below G200. Maybe this is a clue. |
Paul D. Buck · Joined: 9 Jun 08 · Posts: 1050 · Credit: 37,321,185 · RAC: 0
|
I hate to be a wet blanket, but my 9800GT has five (5) total successful runs on just page one of my task list, so it is NOT the card, unless it is related to memory, as this card has 1GB VRAM ... I am using driver 182.50, so it may be THAT ... Win XP Pro, 32-bit, is the other variable that may be an issue. BOINC version 6.5.0 ... The 6.6.x versions did have some scheduler problems from something in the teens at least to 6.6.22 ... 6.6.23 and later seem to have cured that issue. |
Zydor · Joined: 8 Feb 09 · Posts: 252 · Credit: 1,309,451 · RAC: 0
|
Above I mentioned a file that was "stopped" and restarted at the same moment the WU went bang. I found the error message for it. I have no idea whether it means anything for the current problem, or what it means in itself; however, I am posting it for completeness as it did happen at the exact moment the WU went bang. "nvlddmkm" was what I was struggling to remember from the system error message. The error message reads: "The description for Event ID 4101 from source Display cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer. If the event originated on another computer, the display information had to be saved with the event. The following information was included with the event: nvlddmkm" It was located in: Event Viewer/Custom Views/Administrative Events, Source: Display. At the time it said it was "restarted", presumably referring to nvlddmkm, whatever that is :) Regards Zy |
mike047 · Joined: 21 Dec 08 · Posts: 47 · Credit: 7,330,049 · RAC: 0
|
"Am I right to say that all the problems are related to older cards, like 8800, 9800 and so on?" I'll give it one more day, maybe two, and I will do likewise. I am very surprised at the admin/developers this time. Usually there is a little more input/concern shown. Have I missed a thread from the project that explains what is happening and their concern? mike |
©2025 Universitat Pompeu Fabra