6 Errors Today [Problems with "KASHIF_HIVPR" and "IBUCH_KID"-WUs]

dataman
Message 9384 - Posted: 6 May 2009, 19:07:45 UTC

Everything has been running well, but I had 6 errors today across 3 different cards (9800GTs).

1 of these:

ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 84: cufftExecC2C (gridCalc2.2)

1 of these:

Cuda error: Kernel [shake_step_2] failed in file 'shake.cu' in line 128 : unknown error.

4 of these:

Cuda error: Kernel [PmeRealSpace_compute_forces] failed in file 'PmeRealSpace.cu' in line 172 : unknown error.

What's going on?
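For anyone comparing reports like these, a small script can tally which kernels fail most often. This is only a minimal sketch: the regular expression is an assumption based on the "Cuda error: Kernel [...] failed in file '...' in line N" messages quoted here, the function name `tally_kernel_errors` is hypothetical, and the differently formatted `ERROR: ... cufftExecC2C` line is deliberately not matched.

```python
import re
from collections import Counter

# Assumed pattern, modelled on the "Cuda error: Kernel [...]" lines quoted in this thread.
KERNEL_ERROR = re.compile(
    r"Cuda error: Kernel \[(?P<kernel>\w+)\] failed in file "
    r"'(?P<file>[^']+)' in line (?P<line>\d+)"
)


def tally_kernel_errors(log_text):
    """Count failures per (kernel name, source line) pair found in log_text."""
    counts = Counter()
    for match in KERNEL_ERROR.finditer(log_text):
        counts[(match["kernel"], int(match["line"]))] += 1
    return counts


# Sample input built from error lines quoted in this thread.
sample = """\
Cuda error: Kernel [shake_step_2] failed in file 'shake.cu' in line 128 : unknown error.
Cuda error: Kernel [PmeRealSpace_compute_forces] failed in file 'PmeRealSpace.cu' in line 172 : unknown error.
Cuda error: Kernel [PmeRealSpace_compute_forces] failed in file 'PmeRealSpace.cu' in line 172 : unknown error.
"""

# Print the most frequent failures first.
for (kernel, line), n in tally_kernel_errors(sample).most_common():
    print(f"{n} x {kernel} (line {line})")
```

Pasting the stderr text of several failed tasks into `sample` would show at a glance whether one kernel dominates.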

palmss
Message 9385 - Posted: 6 May 2009, 19:13:44 UTC

I have a "PmeRealSpace" error too, with an 8800GT, here: http://www.gpugrid.net/result.php?resultid=631932

K1atOdessa
Message 9391 - Posted: 6 May 2009, 19:55:41 UTC

Same here: a PmeRealSpace error, running an 8800GT, on "IBUCH_KID" WUs. Do I see a pattern forming, or is it just a coincidence?

Error WU 634715

[boinc.at] Nowi
Message 9400 - Posted: 6 May 2009, 21:25:38 UTC - in response to Message 9391.  

I have the same error on three WUs. The GPU is an 8800GT.

dataman
Message 9404 - Posted: 6 May 2009, 22:49:01 UTC

Cuda error: Kernel [fft_data_swizzle_in] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 44 : unknown error.

More errors ... :(

Zydor
Message 9405 - Posted: 6 May 2009, 22:54:55 UTC - in response to Message 9400.  

I had three fail in quick succession within a 40-minute period today on a 9800GTX+. The errors were similar to the above:

Two were the same:
Cuda error: Kernel [shake_step_1] failed in file 'shake.cu' in line 79

The third was:
Cuda error: Kernel [PmeRealSpace_compute_forces] failed in file 'PmeRealSpace.cu' in line 172 : unknown error.

I have had a replacement running for about three hours with no problems so far. We shall see in the morning :)

Regards
Zy

schizo1988
Message 9414 - Posted: 7 May 2009, 1:59:15 UTC - in response to Message 9405.  

I have a thread about failed jobs as well. One machine lost 5 jobs and I thought the problem was machine-specific, but then one of my other machines got the same error. That machine also had some tasks that were marked valid but listed warning messages that seem related to the actual errors. Those only show up after a task has finished, though; a real-time alert system would be impossible, not to mention useless, unless you could sit and monitor your apps 24/7. The project has released quite a few software updates recently, and problems can always arise, but not making the new version mandatory would not work either. If we post the errors, we make the people who actually understand the software aware of them. I have found this site to be about the best for getting help when you do encounter any type of problem.

loki126
Message 9415 - Posted: 7 May 2009, 4:11:56 UTC

Same here. It's the new 7000-credit WUs, IBUCH_KID_shao.
Here are the failed tasks: 1 and 2

I guess they don't get along well with OC:

K1atOdessa
Message 9416 - Posted: 7 May 2009, 4:11:59 UTC
Last modified: 7 May 2009, 4:19:42 UTC

I really think there is some issue related to "IBUCH_KID" and "KASHIF_HIVPR" WU's. I have had 4 errors today and those have also errored out for other users.

My Tasks


Error tasks:

KASHIF_HIVPR

IBUCH_KID

IBUCH_KID

IBUCH_KID



<edit>

I've turned the clocks back to stock to see if that matters. I've had them OC'd for 8 months, but we'll see if the new WUs are more sensitive.

</edit>

mike047
Message 9418 - Posted: 7 May 2009, 6:47:32 UTC - in response to Message 9416.  

I really think there is some issue related to "IBUCH_KID" and "KASHIF_HIVPR" WU's. I have had 4 errors today and those have also errored out for other users.

My Tasks


Error tasks:

KASHIF_HIVPR

IBUCH_KID

IBUCH_KID

IBUCH_KID



<edit>

I've turned the clocks back to stock to see if that matters. I've had them OC'd for 8 months, but we'll see if the new WUs are more sensitive.

</edit>



I have had errors with this series [IBUCH_KID] of work units too. My cards run at stock. The same cards seem to run the HIV ones OK.
mike

Zydor
Message 9419 - Posted: 7 May 2009, 7:32:41 UTC - in response to Message 9418.  

Another one last night

ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 84: cufftExecC2C (gridCalc2.2)

There is an issue lurking somewhere with these WUs.

For me it started when the new ones with the Amber facility came out; the failures started shortly after.

I am trying one more - if that fails, I stop until this is resolved

Zy

Paul D. Buck
Message 9420 - Posted: 7 May 2009, 7:50:32 UTC

There can be bad "batches", or tasks within a batch, that are just plain bad. The good news, such as it is, is that here at GPU Grid the tasks tend to die fairly quickly. I will note that they have just changed to some new tool, and this may be part of the problem.

I have seen similar issues in other projects, where a change in direction can lead to many tasks failing. When Rosetta started the Mini-Rosetta effort, so many tasks failed that I stopped giving the project major support for a long time. Now they have most of the bugs out and I am back again.

Keep reporting the bad tasks and I am sure they will figure it out ...

MarkJ
Volunteer moderator
Volunteer tester
Message 9421 - Posted: 7 May 2009, 8:21:49 UTC - in response to Message 9415.  

Same here. It's the new 7000-credit WUs, IBUCH_KID_shao.
Here are the failed tasks: 1 and 2

I guess they don't get along well with OC:


I had a similar issue. It went away when I went back to 182.50 drivers. You seem to be running beta drivers.
BOINC blog

Snow Crash
Message 9422 - Posted: 7 May 2009, 8:22:28 UTC - in response to Message 9420.  

I got a bunch of errors too, and was wondering: if we add system specs (including driver version), would it help narrow down where the real issue is?

i7-920 HT, 4 GHz on P6T
Corsair Dominator 1600 2Gx3
EVGA GTX 295 (626/1496/1036) 185.81
Corsair TX750W, WD Caviar Black 1TB
Cool Master HAF 932
Xigmatek Dark Knight-S1283V
BOINC 6.6.20 for WCG + GPUGrid 24/7/365

Steve

MarkJ
Volunteer moderator
Volunteer tester
Message 9423 - Posted: 7 May 2009, 8:23:54 UTC - in response to Message 9404.  

Cuda error: Kernel [fft_data_swizzle_in] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 44 : unknown error.

More errors ... :(


If you have beta drivers installed (your computers are hidden so I can't look) try the 182.50 drivers.
BOINC blog

ignasi
Message 9424 - Posted: 7 May 2009, 9:04:14 UTC - in response to Message 9423.  

On the new IBUCH_KID batch errors...
They don't fail completely, but the error rate is apparently higher.

We are stopping them for safety at the moment.

thanks for your patience,
ignasi

Bender10
Message 9426 - Posted: 7 May 2009, 10:27:54 UTC - in response to Message 9422.  

Yes Steve WCG,

Posting the specs (driver version, BOINC version, GPU, GPU overclock, OS) helps to narrow down where your issue may be.

But 'un-hiding' your computers so the mods can look at your output files also helps when you have a problem (they may ask for this sometimes). That, and enabling 'debugging' if you have a pesky problem...
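To illustrate why posted specs help, here is a rough sketch of grouping task reports by one attribute (driver, batch, overclock) and comparing failure counts. Everything in it is an assumption for illustration: the `TaskReport` fields and the `failure_rate_by` helper are hypothetical, and the sample records are made-up placeholders, not data from this thread.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskReport:
    """One task result plus the specs a poster would list (hypothetical fields)."""
    gpu: str
    driver: str
    overclocked: bool
    batch: str
    failed: bool


def failure_rate_by(reports, key):
    """Return {key(report): (failed, total)} so configurations can be compared."""
    totals = defaultdict(lambda: [0, 0])
    for report in reports:
        bucket = totals[key(report)]
        bucket[0] += int(report.failed)  # failed tasks in this group
        bucket[1] += 1                   # all tasks in this group
    return {group: tuple(pair) for group, pair in totals.items()}


# Made-up placeholder reports, NOT data from this thread.
reports = [
    TaskReport("card A", "driver X", False, "batch 1", True),
    TaskReport("card A", "driver X", False, "batch 2", False),
    TaskReport("card B", "driver Y", True, "batch 1", True),
    TaskReport("card B", "driver Y", True, "batch 1", True),
]

# Group by batch; swap the lambda to group by driver or overclock instead.
for batch, (failed, total) in sorted(failure_rate_by(reports, lambda r: r.batch).items()):
    print(f"{batch}: {failed}/{total} failed")
```

If failures cluster by batch rather than by driver or overclock, that points at the work units rather than the hosts.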


Consciousness: That annoying time between naps......

Experience is a wonderful thing: it enables you to recognize a mistake every time you repeat it.

Snow Crash
Message 9431 - Posted: 7 May 2009, 13:25:14 UTC - in response to Message 9426.  

Specs including versions are in my sig. I will also try to provide more specifics when I post about errors, but it sounds like this round is semi-global, so I doubt they need any more info at this time. If mods want details of my logs, all they need to do is ask and I will "unhide". Interesting way to phrase that ... I prefer to think of it as "Public" or "Private", and in general I like to keep things "Private" as much as possible.
Thanks - Steve

mike047
Message 9432 - Posted: 7 May 2009, 13:33:09 UTC - in response to Message 9431.  

Specs including versions are in my sig. I will also try to provide more specifics when I post about errors, but it sounds like this round is semi-global, so I doubt they need any more info at this time. If mods want details of my logs, all they need to do is ask and I will "unhide". Interesting way to phrase that ... I prefer to think of it as "Public" or "Private", and in general I like to keep things "Private" as much as possible.



I'll show mine if you'll show me yours:D
mike

Zydor
Message 9433 - Posted: 7 May 2009, 13:43:26 UTC - in response to Message 9420.  

Keep reporting the bad tasks and I am sure they will figure it out ...

Absolutely. I am totally behind them in trying to find out what's wrong; it could be at my end, I don't know. It's no good just pumping out errored ones, though: there are only so many they need to track an issue. Meanwhile, by stopping for a while I can put the hardware through proper testing, just to eliminate that side of the equation.

Having said all that, the one I started this morning is still running fine at 63% done, which, given the others that failed on my machine, is illogical on the face of it.

Regards
Zy