FAQ - Why does my run fail? Some answers

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Message 12178 - Posted: 28 Aug 2009, 8:19:01 UTC
Last modified: 11 Sep 2009, 22:50:10 UTC

We are tracking the causes of some crashes, in some cases with the help of volunteers who give us access to their machines for testing.

NORMAL BEHAVIOUR IS THAT YOU EXPERIENCE ESSENTIALLY NO CRASHES (let's say <1%).

These are the most common causes of errors:
1) OVERCLOCKING.
SYMPTOMS: The application usually succeeds, but every so often it crashes with errors appearing at random in several different GPU kernels (shake, langevin, pme, whatever).
SOLUTION: Reduce the clocks to the recommended clocks for your board (note that some manufacturers raise the clocks at the factory, so the GPU may be overclocked even though you did nothing). See Wikipedia for the correct frequencies.

2) POOR COOLING.
SYMPTOMS: Same as before: random errors in different kernels.
SOLUTIONS: If your board is not overclocked according to the numbers given by Wikipedia, then the cause could be cooling. Open your case or buy extra fans. Air has to come in from the front of the GPU and leave from the rear.

3) NVIDIA bugs
SYMPTOMS: You change the driver and the application stops working, or the error is always in the same kernel (PME, FFT; right now, for instance, we have the infamous FFT bug).
SOLUTIONS: If the driver works, do not update unless you need to for some game. If it stops working, then try updating the driver.
The FFT bug we reported to Nvidia was solved in the 190 drivers for G80 chips. It is still there for some GTX216 cards (it is unclear whether these 216 cards work with the 182 drivers. Try.)

4) BOINC bug
SYMPTOMS: Various
SOLUTIONS: Stick to a client that works for you; only change it if we require you to or you are willing to experiment.

5) POOR DRIVER INSTALLATION
SYMPTOMS: You can't run any workunits at all and the application crashes immediately. This is often a problem for Windows users.
SOLUTIONS: Reinstall the drivers properly. Try this: http://www.guru3d.com/category/driversweeper/

This is rather common on Windows machines.
In general, new drivers and new BOINC versions add features and fix old bugs, but they also introduce new ones. This is normal; find your best equilibrium.
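
For the check in cause 1, here is a minimal Python sketch of comparing a card's actual core clock with its reference clock. The function names and the 1% tolerance are illustrative assumptions, not part of any GPUGRID or NVIDIA tool. The 576 MHz and 655 MHz figures are the stock and factory-overclocked GTX 260 core clocks quoted by a poster further down this thread; look up the reference value for your own board.

```python
def overclock_percent(actual_mhz: float, reference_mhz: float) -> float:
    """Return how far the actual clock is above the reference clock, in percent."""
    if reference_mhz <= 0:
        raise ValueError("reference clock must be positive")
    return (actual_mhz - reference_mhz) / reference_mhz * 100.0


def classify_clock(actual_mhz: float, reference_mhz: float,
                   tolerance_pct: float = 1.0) -> str:
    """Label a clock as overclocked, underclocked or reference.

    tolerance_pct is an assumed margin for rounding in reported clocks.
    """
    pct = overclock_percent(actual_mhz, reference_mhz)
    if pct > tolerance_pct:
        return "overclocked"
    if pct < -tolerance_pct:
        return "underclocked"
    return "reference"


# Clocks quoted in this thread: 655 MHz on a factory-overclocked card
# versus 576 MHz stock.
print(classify_clock(655, 576))
print(round(overclock_percent(655, 576), 1))
```

A card that comes out as "overclocked" here without you having touched anything is probably a factory-overclocked edition.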

Happy crunching.

GDF

123bob
Message 12186 - Posted: 28 Aug 2009, 16:43:04 UTC - in response to Message 12178.  


The fft bug reported to Nvidia by us was solved on 190 drivers for G80 chips. It is still there for some GTX216 cards (it is unclear if these 216 work with 182 drivers. Try.)

GDF


GDF, it is very clear to me that the one eVGA 260-216 that I'm having issues with works just fine on driver 182.50. It will not work on anything 185.xx or higher. It just shoots out errors on the higher drivers. (Machine #20013) The card part number is 896-P3-1267-FR. It's their "superclocked" edition.

My other 260-216s seem to be working fine on a mix from 185.85 to the newest driver, 190.62.

Hope this helps others.

Bob

zpm
Message 12191 - Posted: 28 Aug 2009, 20:45:04 UTC - in response to Message 12186.  
Last modified: 28 Aug 2009, 20:46:28 UTC

TBH, my card is already factory overclocked (FOC), but just to try some stuff, I overclocked my 216-core 260 card from 630 MHz to 675 MHz. This didn't produce any errors, and not much of a speed-up really.

I keep my fan at a constant 70% fan speed whether I'm gaming or running CUDA.

I've gone through the 6 series, with no real errors to speak of.

Running 6.10.0 right now.

poppageek
Message 12193 - Posted: 28 Aug 2009, 21:24:27 UTC - in response to Message 12186.  

I have a GTX 260 192 that will not run any game or GPUgrid on any driver above 182.50. With 182.50 it ran GPUGrid fine and F@H fine with never an error.

zpm
Message 12194 - Posted: 28 Aug 2009, 21:47:44 UTC - in response to Message 12193.  

I have a GTX 260 192 that will not run any game or GPUgrid on any driver above 182.50. With 182.50 it ran GPUGrid fine and F@H fine with never an error.


No games either? Sounds like you now have a bad card!

RalphEllis
Message 12201 - Posted: 29 Aug 2009, 6:46:54 UTC - in response to Message 12194.  

As a general comment across different operating systems (Windows XP x64, SUSE Linux, Sabayon and Ubuntu), I have found the following to be true with my GTX 260 and AMD X2.
1. In Linux, I cannot start the BOINC manager and Gpugrid with the video card overclocked. I have to start the program, run it for 5 minutes and then suspend calculations to start the overclock.
2. To overclock successfully, I do much better if I use a light window manager like IceWM in Linux or change the Windows video settings to maximum performance versus highest quality.
3. I have been able to run Gpugrid with any Nvidia driver that was supported except for the period when the Linux 185 drivers would error out all the time. I am now using Ubuntu 64bit with the Nvidia 190 driver with no issues.
4. I can run any video operation with the current drivers and run Gpugrid. I prefer to suspend Gpugrid if I am transcoding video or doing heavy file copying operations.
5. Linux has been the most stable setup for running Gpugrid day in and day out without issues. Windows XP x64 was a close second. Windows Vista 32 bit was less stable with me for some reason.
6. If the system crashes for any reason in any operating system, I am better off either deleting the affected work units or resetting the operation as soon as I start BOINC.
7. The first time that I start a new install of BOINC and Gpugrid, it will always freeze. I then reboot, do a reset of the project and proceed.
8. Being more demanding on the video card means that Gpugrid is less stable than Folding@home, Seti, Aqua or Einstein cuda applications. The Gpugrid program does more useful work and demands less from my CPU.

Zydor
Message 12205 - Posted: 29 Aug 2009, 10:28:19 UTC - in response to Message 12201.  

Too many people trust the graphics driver deinstall routine in Windows. I used to be one of them, despite being told many times over the years to clean out the drivers after deinstalling and before installing new graphics drivers. I recently learnt the hard way not to be lazy about this, and to go through the clean-out routine religiously. There has been yet another example of this over on the number crunching forum.

It boils down to a simple fact: the Windows deinstall routine will not delete a file if it's flagged as "in use". On top of that, the file layout of graphics drivers is not always the same and varies from version to version. End result: bits of old driver installs are left behind and cause issues with the new driver.

NVidia used to make great play of proper deinstall; they don't now. I suspect some PR guru clown has poked his nose into the real world ... and this is not only NVidia, it applies to ATI drivers as well. The Guru3d Driver Cleaner sweeps ATI drivers too for that reason. Similar issues occur with sound drivers.

Some make great play of switching drivers left, right and centre, trying to squeeze out the last drop of performance. Whether or not that works is for another day; what it will achieve is a rapid build-up of undeleted garbage that is not always overwritten by a new install, no matter how the deinstall was done (if a deinstall was done at all; many just install over the top of an existing installation).

Deinstall the old drivers via Windows, then boot into safe mode, run Guru3d Driver Sweeper, reboot, and install the new drivers. It's only a few minutes of extra work, but it will save days if not weeks of grief.

It's not the be-all and end-all of graphics issues, that's for sure, but I'll bet my pension it accounts for the majority .....

Regards
Zy

Richard Haselgrove
Message 12251 - Posted: 1 Sep 2009, 10:37:31 UTC

Has anybody done any analysis of when tasks fail? I mean, what time of day? Mine show a distinct tendency to fail in the early hours of the morning (between 3am and 6am, local time).

It's not so farfetched that there could be a correlation. I've noticed when working on server installations with UPSs that the electricity supply voltage can vary over 24 hours - lower when local demand is high, higher when everyone is asleep in bed and most appliances are switched off.

So a cooling solution which is adequate when you're around to measure it might be inadequate under higher power draw (likely if the input voltage is higher).
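
One way to check for such a pattern is to bin failure timestamps by hour of day. The following is a minimal Python sketch, assuming the failure times have been collected as strings in the `DD/MM/YYYY HH:MM:SS` local-time format seen in BOINC message logs; the function names and the sample timestamps are made up for illustration and are not real results.

```python
from collections import Counter
from datetime import datetime


def failures_by_hour(timestamps, fmt="%d/%m/%Y %H:%M:%S"):
    """Count failures per local hour of day (0-23)."""
    hours = Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)
    return {h: hours.get(h, 0) for h in range(24)}


def share_in_window(counts, start_hour, end_hour):
    """Fraction of failures whose hour lies in [start_hour, end_hour)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    in_window = sum(n for h, n in counts.items() if start_hour <= h < end_hour)
    return in_window / total


# Hypothetical failure times, for illustration only.
sample = [
    "01/09/2009 03:12:40",
    "01/09/2009 04:55:02",
    "02/09/2009 05:30:11",
    "02/09/2009 16:32:48",
]
counts = failures_by_hour(sample)
print(share_in_window(counts, 3, 6))  # share of failures between 3am and 6am
```

With failures spread evenly over the day, a three-hour window would hold about one eighth of them, so a much larger share over a reasonable number of failures would support the early-morning hypothesis.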

=Lupus=
Message 12263 - Posted: 2 Sep 2009, 2:31:12 UTC

Palit GeForce GTX 260 Sonic 216 SP - Vista64 - 190.62 x64 drivers, no issues, everything working fine. No failures so far.
This card is slightly overclocked by the vendor, 625 MHz instead of 585 MHz, but it has two fans, so it stays at 55°C even after several days of continuous work.

Victor
Message 12412 - Posted: 7 Sep 2009, 18:57:38 UTC

I've got an MSI GTX 260 OCv3, Windows Media Center (32-bit) and an AMD64 3400+ (such an old processor for this video card) in an eMachines box, and I've never had a problem with errors in GPUGRID. The only error I got was from cancelling the first task, because I had a GeForce 8400GS at the time and tried GPUGRID before I read the list of GPUs supported by this project; the speed at which it was processing was ridiculous (like 0.04% in an hour or so).
Keep in mind that this graphics card runs at 655/1408/2100, compared with 576/1240/1998 for the stock one.
I have to admit, though, that in Seti@Home it produced a lot of errors compared with the 8400GS, with which I had no errors, or maybe only one or two. Although I had a lot of tasks turned in fast, I sometimes got a lot of errors, like 2 to 5 in a row.

ocgbargas
Message 12416 - Posted: 8 Sep 2009, 7:41:12 UTC

Hello everyone. First, I apologise for my English, as it is translated with Google.
I am posting again because I still cannot complete a single GPUGRID unit. I described this in the post http://www.gpugrid.net/forum_thread.php?id=1172&nowrap=true#10792 and the problem is still the same.
I have tried all of these drivers: 181.20, 185.85, 186.18, 190.38, 190.6, and all of these BOINC versions: 6.6.20, 6.6.28 and 6.6.36.
The graphics card is a Zotac GTX 260 (216) at the manufacturer's clocks of 576/1242/999. The computer is an i7 920 with 6 GB of RAM on a Gigabyte board. The OS is Windows 7 64-bit. The temperature of the graphics card does not exceed 70 °C.
I know that there is more trouble with the 260 than with any other card, but this is not normal: I have been running Folding or Collatz tasks for more than a month and not one has given me an error. I have also stressed the card for over 4 hours with the most demanding games without a single error.
Something is wrong and it is not being resolved. The error always comes from the same place, the famous nvlddmkm that you see in the Event Viewer.
I hope this helps. Thanks and best regards.

Jet
Message 12427 - Posted: 9 Sep 2009, 18:44:45 UTC - in response to Message 12251.  
Last modified: 9 Sep 2009, 18:46:13 UTC

I did some analysis of my failures; basically, there is no pattern.
I would say that the key problem is overclocking. Its consequence is slight overheating (close to the edge of stability), and then, probably, a power surge at peak load. A small spike is enough to cause a failure if your cards are running very close to the maximum power rating of the power supply. In short, the facts:
1. PowerLux power supply, rated at 750 W.
2. 3 x GTX 260 Matrix by ASUS, with two fans and a heat-pipe system.
3. Intel quad-core Q9550 on an ASUS WS Evolution board, overclocked by 10%. Running MW on all 4 cores, so 100+% power load.
4. GPUs manually overclocked from 576 MHz GPU / 999 MHz memory to 756 MHz GPU / 1111 MHz memory.
5. The cards sit very close to each other. The card in the middle, due to the lack of incoming air, normally runs at 57-60 °C. This is too high, which is why an additional big external fan is mounted over it, plus a constantly running room air-conditioning system set to 23 °C.
6. Every day I get one typical error: "redundant result" or "computation error"; sometimes these errors are combined, sometimes not.
7. Additionally, I should state that all the GTX 260s are not mechanically fixed in their slots; only their weight holds them on the motherboard. So there could be some purely mechanical or poor-contact causes as well.

So, in general, taking the above-mentioned facts into consideration, the system described should be an error generator, but in fact it is not.

Regards,

Jet

GDF
Message 12440 - Posted: 10 Sep 2009, 10:58:39 UTC - in response to Message 12427.  

I have added cause 5 to the starting message.
gdf

5) POOR DRIVER INSTALLATION
SYMPTOMS: You can't run any workunits at all and the application crashes immediately. This is often a problem for Windows users.
SOLUTIONS: Reinstall the drivers properly. Try this: http://www.gpugrid.net/forum_thread.php?id=1293

ocgbargas
Message 12447 - Posted: 10 Sep 2009, 14:05:07 UTC

I have always followed these steps to uninstall and install the drivers.
Restart in safe mode, uninstall the driver, then run Driver Sweeper, cleaning out everything that is NVIDIA. Then run CCleaner to clean up the debris.
Reboot into safe mode again and install the new driver. Reboot again into normal mode.
Regards.

_hiVe*
Message 12460 - Posted: 11 Sep 2009, 19:00:58 UTC - in response to Message 12440.  

I have added cause 5 to the starting message.
gdf

5) POOR DRIVER INSTALLATION
SYMPTOMS: You can't run any workunits at all and the application crashes immediately. This is often a problem for Windows users.
SOLUTIONS: Reinstall the drivers properly. Try this: http://www.gpugrid.net/forum_thread.php?id=1293


I suggest linking directly to http://www.guru3d.com/category/driversweeper/
That will save a lot of people the trouble of reading through loads of text; erm, it would be more efficient.
The thread itself doesn't include any practical info for the inexperienced, other than the link.

GDF
Message 12463 - Posted: 11 Sep 2009, 22:50:30 UTC - in response to Message 12460.  

Done.

gdf

BarryAZ
Message 12490 - Posted: 13 Sep 2009, 19:22:17 UTC - in response to Message 12178.  

Nice troubleshooting note. As a follow up, I've one workstation that has started erroring out on GPUGrid (but not Collatz) in the last couple of days.

It is the only workstation I have with a GTS 250. Running Windows XP, on an AMD 945. Some comments to the troubleshooting note for this workstation.






These are the most common causes of errors:

1) OVERCLOCKING.
SYMPTOMS: The application usually succeeds, but every so often it crashes with errors appearing at random in several different GPU kernels (shake, langevin, pme, whatever).
SOLUTION: Reduce the clocks to the recommended clocks for your board (note that some manufacturers raise the clocks at the factory, so the GPU may be overclocked even though you did nothing). See Wikipedia for the correct frequencies.

I thought that might be the case but I checked to make sure that I'm not overclocking this system.

2) POOR COOLING.
SYMPTOMS: Same as before: random errors in different kernels.
SOLUTIONS: If your board is not overclocked according to the numbers given by Wikipedia, then the cause could be cooling. Open your case or buy extra fans. Air has to come in from the front of the GPU and leave from the rear.

Not likely -- I have very good air flow on the workstation and as noted, another BOINC GPU application (Collatz) is quite happy.


3) NVIDIA bugs
SYMPTOMS: You change the driver and the application stops working, or the error is always in the same kernel (PME, FFT; right now, for instance, we have the infamous FFT bug).
SOLUTIONS: If the driver works, do not update unless you need to for some game. If it stops working, then try updating the driver.
The FFT bug we reported to Nvidia was solved in the 190 drivers for G80 chips. It is still there for some GTX216 cards (it is unclear whether these 216 cards work with the 182 drivers. Try.)

This was a recent build and the only driver installed was the 190.38. Again, the problem didn't show up until a couple of days ago -- it was working just fine with GPUGrid earlier in the week.

I will agree that Nvidia driver bugs are possible, though one would think this might show up more generally (I've not seen this with the same driver running on 9800GT cards).


4) BOINC bug
SYMPTOMS: Various
SOLUTIONS: Stick to a client that works for you; only change it if we require you to or you are willing to experiment.

Agreed big time -- for non GPU BOINC workstations, I run 5.4.5. For most GPU BOINC workstations (including the one having the *new* problem), I run 6.4.5. I have one workstation running 6.5 and another running the new 6.10 beta (for ATI GPU support).


5) POOR DRIVER INSTALLATION
SYMPTOMS: You can't run any workunits at all and the application crashes immediately. This is often a problem for Windows users.
SOLUTIONS: Reinstall the drivers properly. Try this: http://www.guru3d.com/category/driversweeper/

This is rather common on Windows machines.
In general, new drivers and new BOINC versions add features and fix old bugs, but they also introduce new ones. This is normal; find your best equilibrium.

Not a problem here -- but this point is a very good one generally.


Happy crunching.

GDF


[AF] Profanateur
Message 12491 - Posted: 13 Sep 2009, 21:33:10 UTC

Always the same problem here, with my GTX 260 (216) and one 8800 GT on an Intel 8400. Vista 32, driver 182.5 and BOINC 6.6.36.

13/09/2009 16:32:48 GPUGRID Computation for task 225-GIANNI_BIND001-24-100-RND7793_0 finished
13/09/2009 16:32:48 GPUGRID Output file 225-GIANNI_BIND001-24-100-RND7793_0_1 for task 225-GIANNI_BIND001-24-100-RND7793_0 absent
13/09/2009 16:32:48 GPUGRID Output file 225-GIANNI_BIND001-24-100-RND7793_0_2 for task 225-GIANNI_BIND001-24-100-RND7793_0 absent
13/09/2009 16:32:48 GPUGRID Output file 225-GIANNI_BIND001-24-100-RND7793_0_3 for task 225-GIANNI_BIND001-24-100-RND7793_0 absent
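
For anyone sifting through a long message log for this kind of failure, here is a minimal Python sketch that groups the "Output file ... absent" lines by task. The regular expression is an assumption based only on the four log lines above, so other BOINC client versions or languages may format these messages differently; the function name is made up for illustration.

```python
import re
from collections import defaultdict

# Pattern assumed from the log lines quoted in this post.
ABSENT_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<project>\S+) Output file (?P<file>\S+) "
    r"for task (?P<task>\S+) absent$"
)


def tasks_with_absent_output(lines):
    """Map each task name to the list of output files reported absent."""
    missing = defaultdict(list)
    for line in lines:
        match = ABSENT_RE.match(line.strip())
        if match:
            missing[match.group("task")].append(match.group("file"))
    return dict(missing)


log = [
    "13/09/2009 16:32:48 GPUGRID Computation for task 225-GIANNI_BIND001-24-100-RND7793_0 finished",
    "13/09/2009 16:32:48 GPUGRID Output file 225-GIANNI_BIND001-24-100-RND7793_0_1 for task 225-GIANNI_BIND001-24-100-RND7793_0 absent",
    "13/09/2009 16:32:48 GPUGRID Output file 225-GIANNI_BIND001-24-100-RND7793_0_2 for task 225-GIANNI_BIND001-24-100-RND7793_0 absent",
    "13/09/2009 16:32:48 GPUGRID Output file 225-GIANNI_BIND001-24-100-RND7793_0_3 for task 225-GIANNI_BIND001-24-100-RND7793_0 absent",
]
result = tasks_with_absent_output(log)
for task, files in result.items():
    print(task, len(files))
```

A task that finishes with all of its output files absent, as here, crashed before writing any results.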


BarryAZ
Message 12499 - Posted: 14 Sep 2009, 16:58:23 UTC

Troubleshooting for root causes should probably also include a look at the work units being sent.

When computation errors go from, say, 1 in 25 (still too high) to 1 in 4 or 1 in 3, with no change in the end user's hardware or software configuration, it strikes me that another variable should be considered.

So far, it seems that this possible problem source is not being considered and, frankly, there is nothing the end user can do to address it.
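
To judge whether a jump like that is more than bad luck, one can ask how likely the new failure count would be if the old rate still applied. The following is a minimal Python sketch using plain binomial arithmetic; it assumes tasks fail independently of each other, which may not hold in practice, and the sample of 5 failures in 20 tasks is hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more failures in n tasks."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical sample: 5 failures in 20 tasks (1 in 4),
# tested against the earlier rate of 1 in 25.
p_value = prob_at_least(5, 20, 1 / 25)
print(p_value)
```

For these numbers the probability comes out well under 1%, so a shift from 1 in 25 to 1 in 4 over even 20 tasks is unlikely to be chance alone. That supports looking for something that changed, whether on the host or in the work units.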

ocgbargas
Message 12505 - Posted: 14 Sep 2009, 22:19:24 UTC

I have always taken part in medical projects and I would love to continue with GPUGRID, but since the last update GPUGRID made to its CUDA application I have not been able to complete a single unit of this project.
I have my card processing Collatz, which is not a project I especially like, but doing that is better than leaving the card idle.
I would like you to come up with a solution, because this problem is happening to many of us and it does not occur when processing other projects.
Regards.