6 Errors Today [Problems with "KASHIF_HIVPR" and "IBUCH_KID"-WUs]
| Author | Message |
|---|---|
dataman Joined: 18 Sep 08 Posts: 36 Credit: 100,352,867 RAC: 0
|
Everything has been running well, but I had 6 errors today across 3 different cards (9800GTs). 1 of these: ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 84: cufftExecC2C (gridCalc2.2) 1 of these: Cuda error: Kernel [shake_step_2] failed in file 'shake.cu' in line 128 : unknown error. 4 of these: Cuda error: Kernel [PmeRealSpace_compute_forces] failed in file 'PmeRealSpace.cu' in line 172 : unknown error. What's going on?
|
|
Joined: 28 Aug 08 Posts: 7 Credit: 60,897,550 RAC: 0
|
I have a "PmeRealSpace" error too, with an 8800GT, here: http://www.gpugrid.net/result.php?resultid=631932 |
K1atOdessa Joined: 25 Feb 08 Posts: 249 Credit: 444,646,963 RAC: 0
|
Same here: PmeRealSpace error, running an 8800GT. "IBUCH_KID" WUs. Do I see a pattern forming, or is it just a coincidence? Error WU 634715 |
|
Joined: 4 Sep 08 Posts: 44 Credit: 3,685,033 RAC: 0
|
I have the same error on three WUs. The GPU is an 8800GT. |
dataman Joined: 18 Sep 08 Posts: 36 Credit: 100,352,867 RAC: 0
|
Cuda error: Kernel [fft_data_swizzle_in] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 44 : unknown error. More errors ... :(
|
Zydor Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
I had three go quickly, one after the other, in a 40-minute period today on a 9800GTX+. The errors were similar to the above. Two were the same: Cuda error: Kernel [shake_step_1] failed in file 'shake.cu' in line 79 The third was: Cuda error: Kernel [PmeRealSpace_compute_forces] failed in file 'PmeRealSpace.cu' in line 172 : unknown error. I have had a replacement running for about three hours with no problems so far; we shall see in the morning :) Regards Zy |
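The error lines quoted in the posts above come in two shapes: `Cuda error: Kernel [NAME] failed in file 'PATH' in line N` and `ERROR: PATH, line N: CALL (...)`. A minimal sketch, assuming Python, of how such lines could be tallied by kernel (or failing call), source file and line number when comparing reports across machines. The helper names `classify` and `tally_errors` are hypothetical and not part of BOINC or the GPUGRID application; the two patterns are inferred only from the messages quoted in this thread and may not cover other error formats.

```python
import re
from collections import Counter

# Shape 1: Cuda error: Kernel [NAME] failed in file 'PATH' in line N
KERNEL_RE = re.compile(
    r"Cuda error: Kernel \[(\w+)\] failed in file '([^']+)' in line (\d+)"
)
# Shape 2: ERROR: PATH, line N: CALL (...)
CALL_RE = re.compile(r"ERROR: (\S+), line (\d+): (\w+)")


def basename(path):
    """Return the file name, accepting both Windows and POSIX separators."""
    return re.split(r"[\\/]", path)[-1]


def classify(line):
    """Return (kernel_or_call, file_name, line_number), or None if no match."""
    m = KERNEL_RE.search(line)
    if m:
        return (m.group(1), basename(m.group(2)), int(m.group(3)))
    m = CALL_RE.search(line)
    if m:
        return (m.group(3), basename(m.group(1)), int(m.group(2)))
    return None


def tally_errors(lines):
    """Count how often each (kernel_or_call, file, line) signature occurs."""
    counts = Counter()
    for line in lines:
        key = classify(line)
        if key is not None:
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Messages copied from the posts above (one of each distinct signature).
    sample = [
        r"ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 84: cufftExecC2C (gridCalc2.2)",
        "Cuda error: Kernel [shake_step_2] failed in file 'shake.cu' in line 128 : unknown error.",
        "Cuda error: Kernel [PmeRealSpace_compute_forces] failed in file 'PmeRealSpace.cu' in line 172 : unknown error.",
        "Cuda error: Kernel [shake_step_1] failed in file 'shake.cu' in line 79",
    ]
    for signature, count in tally_errors(sample).most_common():
        print(count, signature)
```

Grouping on the full (name, file, line) triple keeps `shake_step_1` at line 79 separate from `shake_step_2` at line 128, which may matter if only one of them turns out to be tied to a particular batch.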
|
Joined: 16 Dec 08 Posts: 16 Credit: 10,644,256 RAC: 0
|
I have a thread about failed jobs as well. One machine lost 5 jobs and I thought it was machine-specific, but then one of my other machines got the same error. It also had some results that were valid but listed warning messages that seem related to the actual errors. Those only show up after the task has finished, though; a real-time system would be impossible, not to mention useless, unless you could sit and monitor your apps 24/7. They have come out with quite a few new software updates, and problems can always arise; not making it mandatory to use the new version would not work either. If we post the errors, the people who actually understand the software are made aware of them. I have found this site to be about the best for getting help when you do encounter any type of problem. |
|
Joined: 18 Nov 08 Posts: 14 Credit: 30,687,791 RAC: 0
|
|
K1atOdessa Joined: 25 Feb 08 Posts: 249 Credit: 444,646,963 RAC: 0
|
I really think there is some issue related to "IBUCH_KID" and "KASHIF_HIVPR" WU's. I have had 4 errors today and those have also errored out for other users. My Tasks Error tasks: KASHIF_HIVPR IBUCH_KID IBUCH_KID IBUCH_KID <edit> I've turned the clocks back to stock to see if that matters. I've had them OC'd for 8 months, but we'll see if the new WU's are more sensitive. </edit> |
mike047 Joined: 21 Dec 08 Posts: 47 Credit: 7,330,049 RAC: 0
|
I really think there is some issue related to "IBUCH_KID" and "KASHIF_HIVPR" WU's. I have had 4 errors today and those have also errored out for other users. I have had errors with this series [IBUCH_KID] of work units also. My cards run stock. The same cards seem to run the HIV ones OK. mike |
Zydor Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
Another one last night: ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 84: cufftExecC2C (gridCalc2.2) There is an issue lurking somewhere with these WUs. For me it started when the new ones with the Amber facility came out; shortly after, the failures started. I am trying one more. If that fails, I stop until this is resolved. Zy |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
There can be bad "batches", or tasks within a batch, that are just plain bad. The good news, such as it is, is that here at GPU Grid the tasks tend to die fairly quickly. I will note that they have just changed and are using some new tool, and this may be part of the problem. I have seen similar issues in other projects, where a change in direction can lead to significant issues with tasks failing. When Rosetta started up the Mini-Rosetta effort, so many tasks failed that I stopped giving the project major support for a long time. Now they have most of the bugs out and I am back again. Keep reporting the bad tasks and I am sure they will figure it out ... |
|
Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 |
Same here. It's the new 7000 Credit WUs, IBUCH_KID_shao. I had a similar issue. It went away when I went back to the 182.50 drivers. You seem to be running beta drivers. |
|
Joined: 4 Apr 09 Posts: 450 Credit: 539,316,349 RAC: 0
|
I got a bunch of errors also and was wondering: if we add system specs (including driver version), would it help narrow down where the real issue is? i7-920 HT, 4 GHz on P6T Corsair Dominator 1600 2Gx3 EVGA GTX 295 (626/1496/1036) 185.81 Corsair TX750W, WD Caviar Black 1TB Cool Master HAF 932 Xigmatek Dark Knight-S1283V BOINC 6.6.20 for WCG + GPUGrid 24/7/365 Steve |
|
Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 |
Cuda error: Kernel [fft_data_swizzle_in] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 44 : unknown error. If you have beta drivers installed (your computers are hidden so I can't look) try the 182.50 drivers. |
|
Joined: 10 Apr 08 Posts: 254 Credit: 16,836,000 RAC: 0
|
On the new IBUCH_KID batch errors... They don't fail completely, but the error rate is apparently higher. We are stopping them for safety at the moment. Thanks for your patience, ignasi |
Bender10 Joined: 3 Dec 07 Posts: 167 Credit: 8,368,897 RAC: 0
|
Yes, Steve WCG, posting the specs (driver version, BOINC version, GPU, GPU overclock, OS) helps to narrow down where your issue may be. But 'un-hiding' your computers so the mods can look at your output files also helps when you have a problem (they may ask for this sometimes). That, and enabling 'debugging' if you have a pesky problem... Consciousness: That annoying time between naps...... Experience is a wonderful thing: it enables you to recognize a mistake every time you repeat it. |
|
Joined: 4 Apr 09 Posts: 450 Credit: 539,316,349 RAC: 0
|
Specs, including versions, are in my sig. I will also try to provide more specifics when I post about errors, but it sounds like this round is semi-global, so I doubt they need any more info at this time. If mods want details of my logs, all they need to do is ask and I will "unhide". Interesting way to phrase that ... I prefer to think of it as "Public" or "Private", and in general I like to keep "Private" as much as possible. Thanks - Steve |
mike047 Joined: 21 Dec 08 Posts: 47 Credit: 7,330,049 RAC: 0
|
Specs, including versions, are in my sig. I will also try to provide more specifics when I post about errors, but it sounds like this round is semi-global, so I doubt they need any more info at this time. If mods want details of my logs, all they need to do is ask and I will "unhide". Interesting way to phrase that ... I prefer to think of it as "Public" or "Private", and in general I like to keep "Private" as much as possible. I'll show mine if you'll show me yours :D mike |
Zydor Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
Keep reporting the bad tasks and I am sure they will figure it out ... Absolutely. I am totally behind them in trying to find out what's wrong; it could be at my end, I don't know. It's no good just pumping out errored ones though; there are only so many they need to track an issue. Meanwhile, by stopping for a while I can put the hardware through proper testing, just to eliminate that side of the equation. Having said all that, at present the one I started this morning is still running fine, 63% done, which, given the others that failed on mine, is illogical on the face of it. Regards Zy |
©2025 Universitat Pompeu Fabra