unspecified launch failure
| Author | Message |
|---|---|
|
Joined: 18 Sep 08 · Posts: 368 · Credit: 4,174,624,885 · RAC: 0
|
I get the following error every so often on this box. It's a BFG 8800GT OC running at the speed it had when I bought it ... Cuda error: Kernel [frc_sum_kernel_dihed] failed in file 'force.cu' in line 252 : unspecified launch failure. |
K1atOdessa · Joined: 25 Feb 08 · Posts: 249 · Credit: 444,646,963 · RAC: 0
|
"Cuda error: Kernel [frc_sum_kernel_dihed] failed in file 'force.cu' in line 252 : unspecified launch failure."

I ran into the same issue on a single task recently, and I had never seen it before. Both of my 8800GTs are OC'd somewhat, but I haven't changed that in well over a month, so I wouldn't think it is related. I've since completed a couple of WUs fine, so I just chalked it up to something strange happening at one point in time. If it happens again, I'll have more reason to be concerned. http://www.gpugrid.net/result.php?resultid=115911 |
rebirther · Joined: 7 Jul 07 · Posts: 53 · Credit: 3,048,781 · RAC: 0
|
My first error with this log on 8800GT 1GB: http://www.ps3grid.net/result.php?resultid=124548 Any solution or info about this error yet? |
DoctorNow · Joined: 18 Aug 07 · Posts: 83 · Credit: 135,208,752 · RAC: 4
|
Just found out that my WU which crashed this morning (shortly before it was finished!) had the same error: http://www.gpugrid.net/result.php?resultid=122585 My card is a 9600GT. And it seems it crashed my Windows too! When I came back some hours later, I found that my computer had rebooted into Linux (I have a dual-boot setup and Linux is the default). Member of BOINC@Heidelberg and ATA!
|
DoctorNow · Joined: 18 Aug 07 · Posts: 83 · Credit: 135,208,752 · RAC: 4
|
Another one killed itself with the same message. What's wrong with them? It's getting really annoying; it costs me almost an entire day of crunching every time... >:-\ Member of BOINC@Heidelberg and ATA!
|
GDF · Joined: 14 Mar 07 · Posts: 1958 · Credit: 629,356 · RAC: 0 |
These are the same WUs as before. Have you updated the drivers? Which drivers do you have? gdf |
DoctorNow · Joined: 18 Aug 07 · Posts: 83 · Credit: 135,208,752 · RAC: 4
|
It's driver version 177.84, no change since I started crunching here. I have no clue what could be wrong, crunched two other WUs right before without any problems: http://www.gpugrid.net/result.php?resultid=126657 http://www.gpugrid.net/result.php?resultid=125288 Edit: Just found out on the NVidia page that version 180.48 is now recommended for my card. I will install and try it out, maybe it fixes the problem... Will take some days to discover that. ;-) Member of BOINC@Heidelberg and ATA!
|
|
Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549 · RAC: 0 |
2 observations:
1. WUs which give this error generally run fine on other machines.
2. Everyone who reported this error in this thread is running a (factory) overclocked GPU.

I think it's worth testing whether there's a link between 1. and 2. DrNow, you seem to get the errors most frequently. Could you take the core and shader clocks of your card back a bit? On G92 the core can only be adjusted in 27 MHz steps and the shader in 54 MHz steps. GPU-Z and other tools do not show you the real clock speed, but RivaTuner's hardware monitor does. So I suggest you either check the clocks with RivaTuner or back off far enough to be in a safe range, where you really do change the clocks: say 54 MHz for the core and 108 MHz for the shader. Then let it run for some time; if the error happens again, we know clock speed was not the cause.

Oh, and it might be a good idea to do a complete restart of your machine before the clock speed experiment. That means: switch it off, unplug the power cord from the power supply for >15 min and switch it on again.

BTW: driver 177.84 has been fine before, so I doubt it causes the errors. It could be possible, though, since the application code has changed since the time when most people ran 177.84.

MrS Scanning for our furry friends since Jan 2002 |
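To illustrate the point about clock steps, here is a minimal sketch. It assumes the real clock is the requested clock snapped down to a multiple of the step size; the exact rounding rule is an assumption, and `real_clock` and `steps_dropped` are made-up helpers, not part of RivaTuner or any NVIDIA tool. The requested shader clock of 1700 MHz is also just an example value.

```python
CORE_STEP_MHZ = 27    # G92 core clock granularity, as stated in the post above
SHADER_STEP_MHZ = 54  # G92 shader clock granularity, as stated in the post above


def real_clock(requested_mhz: int, step_mhz: int) -> int:
    """Hypothetical model: snap the requested clock down to a step multiple."""
    return (requested_mhz // step_mhz) * step_mhz


def steps_dropped(requested_mhz: int, backoff_mhz: int, step_mhz: int) -> int:
    """Real clock steps lost when the request is lowered by backoff_mhz."""
    before = real_clock(requested_mhz, step_mhz)
    after = real_clock(requested_mhz - backoff_mhz, step_mhz)
    return (before - after) // step_mhz


if __name__ == "__main__":
    shader = 1700  # example requested shader clock in MHz (assumption)
    print("real clock:", real_clock(shader, SHADER_STEP_MHZ))
    # A small back-off can leave the real clock unchanged under this model.
    print("steps dropped at -20 MHz:", steps_dropped(shader, 20, SHADER_STEP_MHZ))
    # Backing off by two full steps always lowers the real clock by two steps.
    print("steps dropped at -108 MHz:", steps_dropped(shader, 108, SHADER_STEP_MHZ))
```

Under this model, lowering the request by a whole number of steps (54 MHz core, 108 MHz shader) is guaranteed to move the real clock, whatever the starting value, which is why those back-off amounts are a safe choice when the real clock cannot be read directly.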
DoctorNow · Joined: 18 Aug 07 · Posts: 83 · Credit: 135,208,752 · RAC: 4
|
"2 observations:"

Well, you could be right. The first WU after the driver change ran fine so far. I will crunch two or three more WUs to see if the error appears again. If it does, I will take the shader clock down a bit. As you may have read in one of the other threads, RivaTuner accidentally lowered my shader clock without my knowledge; the WUs took much longer, but all of them finished without problems... Member of BOINC@Heidelberg and ATA!
|
rebirther · Joined: 7 Jul 07 · Posts: 53 · Credit: 3,048,781 · RAC: 0
|
Next one: http://www.ps3grid.net/result.php?resultid=131339 (driver 178.24, WinXP, RivaTuner 2.20). I can't explain why. And after losing many hours I wonder: why doesn't this error show up right at the start? ^^ Before that I ran an older version of RivaTuner to decrease the fan speed; the 8800GT sits at around 57°C, so no problem there. In my first week all WUs were fine, but after 3-4 days I got the same error twice. Something must be wrong somewhere, but where? Edit: I checked all my results: the errors came with 6.3.21; before that, with 6.3.19, everything was OK. If anyone runs >6.4 or <6.3.21, please let me know whether you get the same error or not! |
[BOINC@Poland]AiDec · Joined: 2 Sep 08 · Posts: 53 · Credit: 9,213,937 · RAC: 0
|
As I have read and heard many times, RivaTuner is not the best software and can cause problems with Nvidia graphics cards, especially with the newest GPUs. I had similar problems for as long as I used RivaTuner. I would suggest using nTune, which gives you a better chance of a "correct" OC. With it my GPUs are really stable even after a hard OC (3x280GTX 600@702MHz). |
rebirther · Joined: 7 Jul 07 · Posts: 53 · Credit: 3,048,781 · RAC: 0
|
[BOINC@Poland]AiDec wrote: "As I have read and heard many times, RivaTuner is not the best software and can cause problems with Nvidia graphics cards, especially with the newest GPUs. I had similar problems for as long as I used RivaTuner. I would suggest using nTune, which gives you a better chance of a 'correct' OC. With it my GPUs are really stable even after a hard OC (3x280GTX 600@702MHz)."

I haven't OC'd my card. I will try nTune next time and uninstall RivaTuner to see if the issue is still present, but I have read that some people on Linux got this error too. |
|
Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549 · RAC: 0 |
- People got the error before 6.3.21.
- Rebirther, your card is factory overclocked (shader at 1.67 GHz instead of 1.50 GHz).

"I can't explain why. And after losing many hours I wonder: why doesn't this error show up right at the start? ^^"

It is a temporary error on your machine. That means your machine is normally fine and the WUs are (normally) fine for others. The fact that the error occurs after many hours of crunching tells you that something probably goes wrong during the calculations. It's not a permanent error, it's a "transient" one. Such errors may be caused by really weird software combinations, bit-flips in the chip due to cosmic rays, hardware design faults which only show up in rare, exceptional situations (e.g. for CPUs, several interrupts at the same time), or by a chip which is right on the borderline of becoming unstable in the balance between clock frequency, voltage and operating temperature. Saying "but it was stable for ..." does not really help. It could be that a few transistors are worse than the others (or have degraded more over time) and fail every 10^15 cycles or so, leading to a "mean time between failures" of days.

- And I don't think the mere presence of RivaTuner causes these errors. I mean, it's not even running all the time, is it? Also, Rebirther's GPU is *old* enough (G92) to be supported properly.

"I had similar problems for as long as I used RivaTuner."

Which problems do you mean exactly? The "unspecified launch failure"?

"I would suggest using nTune, which gives you a better chance of a 'correct' OC."

Well, RivaTuner and (I think) Everest are the only tools which can show you the real clock of your NV card; all others only show you the clock which you request from the system. The real clock is adjusted in steps. So if you can clock higher using nTune, it may be that you're just below the next step, where it would become unstable. The internal clocks would be the same, but the number shown to you would be higher, hence it seems to be a higher OC.
MrS Scanning for our furry friends since Jan 2002 |
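As a rough check of the arithmetic behind "fail every 10^15 cycles ... mean time between failures of days", here is a minimal sketch. It assumes exactly one failure per 10^15 clock cycles and uses the two shader clocks mentioned in the post (1.50 GHz stock, 1.67 GHz factory OC); `mtbf_days` is a made-up name for illustration.

```python
SECONDS_PER_DAY = 86_400


def mtbf_days(cycles_per_failure: float, clock_hz: float) -> float:
    """Mean time between failures in days, given one failure every N cycles."""
    seconds_between_failures = cycles_per_failure / clock_hz
    return seconds_between_failures / SECONDS_PER_DAY


if __name__ == "__main__":
    for clock_ghz in (1.50, 1.67):
        days = mtbf_days(1e15, clock_ghz * 1e9)
        print(f"{clock_ghz:.2f} GHz: about {days:.1f} days between failures")
```

Both clocks come out at roughly a week, so a fault this rare would indeed show up only every few days of continuous crunching.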
rebirther · Joined: 7 Jul 07 · Posts: 53 · Credit: 3,048,781 · RAC: 0
|
"- people got the error before 6.3.21"

Factory OC, yes, but that is not the problem: you got this error too, as did many others, on newer cards and older ones. I don't think this is a hardware failure across all card models?! I have asked on the alpha mailing list about this issue to narrow down the error and am still waiting for an answer. So is it the hardware, the BOINC client or the project application? Drivers and other programs can be excluded. The GPU waiting for the CPU could be an issue, so that the WU aborts itself with this error because it cannot continue crunching from the last point.

Update: thanks to Nicolas for pointing out that it's not the BOINC client; app 6.48 has a 0% error rate, 6.52 a 20% error rate. @GDF: can you check the application code to find out what's wrong? Or can you switch back to the old app? |
|
Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549 · RAC: 0 |
"Factory OC, yes, but that is not the problem"

How can you be sure? Hardware errors can pop up very rarely. These are actually the hardest to detect, because you can never be sure that (i) your test software can reproduce the error at all and (ii) you tested long enough.

"you got this error too, as did many others"

Yeah, I also noticed this one yesterday.. and guess what, I'm also running OC'ed.

"I don't think this is a hardware failure across all card models?!"

But not every OC'ed card produces these errors, does it?

"I have asked on the alpha mailing list about this issue to narrow down the error and am still waiting for an answer. So is it the hardware, the BOINC client or the project application? Drivers and other programs can be excluded."

I agree that we can exclude drivers and other programs. However, I'd also suspect that the BOINC client has absolutely nothing to do with this. It just launches the aecmd_.exe, and all further CUDA-related launches are done by the science app.

"The GPU waiting for the CPU could be an issue, so that the WU aborts itself with this error because it cannot continue crunching from the last point."

Sounds somewhat improbable. The GPU cannot talk to BOINC, so if the CPU app stops working then no one would tell BOINC that an error happened. It would likely detect after a short time that the app has quit and restart it. This is the point where some trouble may be caused, when the GPU / driver is in a strange state because the CUDA app was not terminated properly. Is this just a guess on your side, or do you have anything hinting at such a scenario?

"Update:"

- Where do you get that 20% error rate from?
- I also had another one of these "unspecified launch failure" errors, with app 6.45.
- Switching back to the old app is probably not feasible, since there were changes in the science code.
- Oh, and who's Nicolas?

MrS Scanning for our furry friends since Jan 2002 |
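To put a number on point (ii), "you tested long enough", here is a minimal sketch. It assumes failures arrive at random with a constant average rate (a Poisson model, which is an assumption and not something measured in this thread), and the one-week mean time between failures is a hypothetical value. The function names are made up for illustration.

```python
import math


def pass_probability(test_hours: float, mtbf_hours: float) -> float:
    """Chance that a test of this length sees no failure at all (Poisson model)."""
    return math.exp(-test_hours / mtbf_hours)


def hours_needed(mtbf_hours: float, detect_probability: float) -> float:
    """Test length needed to see at least one failure with the given probability."""
    return -mtbf_hours * math.log(1.0 - detect_probability)


if __name__ == "__main__":
    mtbf = 7 * 24  # hypothetical: one failure per week on average
    print(f"one-day test passes cleanly with p = {pass_probability(24, mtbf):.2f}")
    print(f"days of testing for 95% detection: {hours_needed(mtbf, 0.95) / 24:.1f}")
```

Under these assumptions a one-day stability test would pass cleanly about 87% of the time, and it would take about three weeks of testing to catch the fault with 95% probability. That is why "it was stable in my test" says little about faults this rare.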
rebirther · Joined: 7 Jul 07 · Posts: 53 · Credit: 3,048,781 · RAC: 0
|
- 20% is my error rate, estimated from my last calculation
- Nicolas Alvarez, also a developer of BOINC/Primegrid/IMP/Renderfarm
- we must sort out what was changed in the code and causes this error
- cannot find any scenario yet (removed RivaTuner, installed nTune), will see what happens... (2 cores running VMware with Ubuntu Linux 64-bit + ABC, the other 2 cores BOINC in Windows with GPU + Milkyway, RCN, yoyo evo) |
|
Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549 · RAC: 0 |
Yeah, let's get some new hard facts. But by saying "we must sort out what was changed in the code and causes this error" you imply that you already know it's the science app's fault. We cannot know that yet. I think it's not the app, because these errors happen with different clients and the WUs run fine on other machines. .. gotta go to bed for today ;) MrS Scanning for our furry friends since Jan 2002 |
DoctorNow · Joined: 18 Aug 07 · Posts: 83 · Credit: 135,208,752 · RAC: 4
|
"Well, you could be right."

Well, after finishing three WUs without a problem (see here, here and here), I now have the error again with this WU, fortunately very early in the crunching. Looking at my host list, it seems the error keeps coming back repeatedly and is not caused by anything special. Okay, I will reduce my shader clock now to see if that breaks the rule. ;-) Member of BOINC@Heidelberg and ATA!
|
|
Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549 · RAC: 0 |
Well, the number of successful WUs between failures is anything between 2 and 6.. I'd rather call that a guideline ;) MrS Scanning for our furry friends since Jan 2002 |
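For what irregular gaps like "anything between 2 and 6" would look like if failures were purely random, here is a minimal sketch. It assumes each WU fails independently with the same probability (an assumption, not something established in this thread) and borrows the 20% error rate that was estimated earlier for a different host. The function names are made up for illustration.

```python
def run_length_probability(k: int, p_fail: float) -> float:
    """P(exactly k successful WUs, then a failure) for independent failures."""
    return (1.0 - p_fail) ** k * p_fail


def mean_successes_between_failures(p_fail: float) -> float:
    """Expected number of successful WUs before the next failure."""
    return (1.0 - p_fail) / p_fail


if __name__ == "__main__":
    p = 0.2  # per-WU failure probability (estimate quoted earlier in the thread)
    print("mean successes between failures:", mean_successes_between_failures(p))
    in_range = sum(run_length_probability(k, p) for k in range(2, 7))
    print(f"P(2 to 6 successes before a failure) = {in_range:.2f}")
```

With these numbers the model gives about four successful WUs between failures on average, with the individual gaps scattered widely around that, so a spread of 2 to 6 is compatible with random, transient errors rather than a fixed schedule.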
|
Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549 · RAC: 0 |
I had another one, luckily at the beginning of the WU. I scaled back the OC and will see what I get. MrS Scanning for our furry friends since Jan 2002 |
©2025 Universitat Pompeu Fabra