Work unit failure rate

Message boards : Graphics cards (GPUs) : Work unit failure rate

BarryAZ

Message 13290 - Posted: 27 Oct 2009, 17:49:29 UTC

I just looked into this on the seven systems I have actively running GPUGrid at the moment.

Of the past 70 workunits completed (10 per system), I had a total of 8 failures -- most of these quite close to completion. None of the systems involved has an overclocked GPU. The GPUs range from one 9600GT to one GTS 250, the rest being 9800GTs. The OS is either Windows XP or Windows 7. The driver version is either 190.38 or 190.62.

Every workstation had at least one failure (one had two).

Seems to be a pretty high failure rate to cope with.

philip
Message 13292 - Posted: 27 Oct 2009, 19:24:49 UTC - in response to Message 13290.  

Since I upgraded to Windows 7 I have had nothing but failures. Running 191.07 with 6.10.16 on Quad X9770 with 9800GX2.

Bailing out. Just not worth it.

BarryAZ
Message 13300 - Posted: 28 Oct 2009, 15:46:42 UTC - in response to Message 13292.  

Interesting -- I've not seen things as being worse with Win7 versus XP, but I think a 10% failure rate is 'suboptimal'.

Since I upgraded to Windows 7 I have had nothing but failures. Running 191.07 with 6.10.16 on Quad X9770 with 9800GX2.

Bailing out. Just not worth it.


Dennis-TW
Message 13304 - Posted: 29 Oct 2009, 7:49:51 UTC

Recently I have had the same kind of failure rate... 2 of the last 15 workunits failed, both also quite close to completion, with a total loss of about 34 hours of GPU time...

Also on a 9800 GT; however, I didn't change anything on the software side recently. I started a thread of my own to compare some details, since the workunits themselves don't seem to be bad: they were crunched perfectly by a GTX 260.

Dennis-TW
Message 13358 - Posted: 2 Nov 2009, 16:20:37 UTC

And just another one bit the dust......that makes it 3 fails out of the last 13.

Any comment on this issue or should I look for another CUDA project??

BarryAZ
Message 13361 - Posted: 3 Nov 2009, 2:05:03 UTC - in response to Message 13358.  

The thing is, at the moment, GPU project options are rather limited.

That 9800GT is supported on SETI -- when they have work, and on Collatz -- when it is running. One thing with Collatz: it is a very low-resource project, and it is also the only one supporting lower-power ATI GPUs. When the other CUDA/ATI GPU project (MW -- which supports only double-precision GPUs and so not that 9800GT, for instance) is in trouble, as it is at the moment, the load on Collatz seems to be simply too much. At this moment both Collatz and MW are offline.

I use Collatz as my primary GPU project with GPUGrid these days as my backup GPU project.

Add to that the current work unit famine here (which probably is only short term), and for GPU BOINC folks, life can get rather tedious.

One thing that I've seen here is that when we report problems with workunits (as we are doing here), or note that workunits are running longer for the same credit payout, the response (if there is a response) seems to be either 'no, you are not seeing what you are seeing' or 'what have you (the user) changed?', when you, the user, haven't changed anything. That tends to be a tad frustrating for me.


And just another one bit the dust......that makes it 3 fails out of the last 13.

Any comment on this issue or should I look for another CUDA project??

Paul D. Buck
Message 13365 - Posted: 3 Nov 2009, 15:59:28 UTC - in response to Message 13361.  

The thing is, at the moment, GPU project options are rather limited.

Still sad but true ... the only good news is that Einstein is working on a GPU version ... though it is taking a lot longer than I have been expecting... especially in that EaH has been one of the most reliable projects to have work up and available and to handle outages with grace ... and volume too ...

One thing that I've seen here is that when we report problems with workunits (as we are doing here), or note that workunits are running longer for the same credit payout, the response (if there is a response) seems to be either 'no, you are not seeing what you are seeing' or 'what have you (the user) changed?', when you, the user, haven't changed anything. That tends to be a tad frustrating for me.

I have been wondering if this is just a symptom of the "admin fatigue" that we have been discussing elsewhere ... it is about that time here too ...

As an answer to the situation, and as a reply to the admin's suggestion that problems with Resource Share are not a BOINC problem that is of interest to projects, I put GPU Grid in rotation with MW and Collatz and have noted that not only have my earnings here dropped through the floor, even the amount of time spent does not seem to be properly balanced ... of course I have seen this all for a long time but have not been able to get UCB's attention ... and am not likely to get GPU Grid's either ... then again ... a pox on both their houses ...

I guess that means I am frustrated too ... :)

GDF (project administrator)
Message 13398 - Posted: 7 Nov 2009, 10:12:09 UTC - in response to Message 13365.  

The very large majority of errors are produced by overclocking and/or poor cooling of the GPUs.
You might think that your GPU is not overclocked, but the manufacturer did it for you. Look up the Nvidia-suggested clock for your card here:
http://en.wikipedia.org/wiki/GeForce_200_Series

and compare it with your card. Reducing it to the lower, recommended values might well fix all your problems.
With a proper installation you should be able to get a nearly 100% success rate, as several users do.

GDF

Snow Crash
Message 13399 - Posted: 7 Nov 2009, 12:12:42 UTC

OC, heat, underpowered PSU, drivers.
Lately I have been taking BOINC offline, making a copy of my BOINC data folder and then using the copy to test out my new OC settings. This way, if I crash, I can keep making copies without hammering the GPUGrid servers, and I also don't run into any "allowed WUs per day" limits. I have found that this is about the best way for me, as there do not seem to be any really good testing tools for GPUs ... sorry, but relying on my visual inspection for artifacts is not a particularly rigorous process, and I wonder if they are stressing the GPU the same way that GPUGrid does. Aren't we more concerned with shaders first, memory second, and core really does not matter?
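
For anyone wanting to script the copy step described above, here is a minimal sketch. It only makes the timestamped copy; stopping the client first and pointing a test run at the copy are left out. The function name and both paths are hypothetical and supplied by the caller, since the location of the BOINC data folder differs between installs.

```python
import shutil
import time
from pathlib import Path


def snapshot_data_dir(data_dir, backup_root):
    """Copy a (stopped) client's data folder into a new timestamped folder.

    Both paths are supplied by the caller; nothing here knows where BOINC
    actually keeps its data on a given machine.
    """
    src = Path(data_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"no such data folder: {src}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dst = Path(backup_root) / f"{src.name}-{stamp}"
    # copytree refuses to overwrite an existing folder, so an earlier
    # snapshot can never be clobbered by accident.
    shutil.copytree(src, dst)
    return dst
```

Each call produces a fresh folder, so a crashed test run can be thrown away and the next one started from an untouched copy.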

@GDF - Could you please tell me which type of WUs are typically the most GPU intensive? This way I could refine my test process to make sure I am doing the best testing possible.

Thank you,
Steve

fractal
Message 13403 - Posted: 7 Nov 2009, 16:09:03 UTC
Last modified: 7 Nov 2009, 16:14:12 UTC

I have been running GPUGRID for more than a year now, with this machine on it the whole time. It has been pretty reliable, with no errors most of the time. Some batches of work have had it produce one error out of 20 work units. Lately it failed on five out of the last 18 work units, or just under 1/3 of them.

The only change of late has been upgrading to BOINC 6.10.3. This was nice since now all my work units (that complete) get me the bonus without my having to manually delete the 4 or 5 it would download with the older version of BOINC.

A typical failure looks like:

Using CUDA device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce 9600 GSO"
# Clock rate: 1.46 GHz
# Total amount of global memory: 805044224 bytes
# Number of multiprocessors: 12
# Number of cores: 96
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [pme_fill_charges_accumulate] failed in file 'fillcharges.cu' in line 73 : unspecified launch failure.

I will open the box and blow the dust out of it in case it is overheating, but I can confirm that the error rate appears to have increased of late.
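
To tally failures like the one above across many results, the "Cuda error" line can be picked apart with a regular expression. This is only a sketch: the pattern is modelled on the single line quoted in this post, and the function name is made up for illustration. Other application versions may word the message differently.

```python
import re

# Pattern modelled on the single "Cuda error" line quoted above; other
# application versions may word it differently.
CUDA_ERROR = re.compile(
    r"Cuda error: Kernel \[(?P<kernel>[^\]]+)\] failed in file "
    r"'(?P<file>[^']+)' in line (?P<line>\d+) : (?P<reason>.+?)\.?\s*$"
)


def find_cuda_errors(stderr_text):
    """Return one dict per 'Cuda error' line found in the stderr text."""
    errors = []
    for raw in stderr_text.splitlines():
        match = CUDA_ERROR.search(raw)
        if match:
            info = match.groupdict()
            info["line"] = int(info["line"])  # source line number as an int
            errors.append(info)
    return errors
```

Grouping the returned dicts by kernel and reason would show whether the failures all come from the same place or are scattered.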

Dennis-TW
Message 13436 - Posted: 10 Nov 2009, 0:46:15 UTC - in response to Message 13403.  

but I can confirm that the error rate appears to have increased of late.

Copy that, it turns out that about 1/3 of all my WUs have failed recently.

Though it's nice to read about the general hints.....overclocking, heat, driver, blabla, the fact is that none of it applies to my station here....in July/August everything was running fine, but now in October/November I get this failure rate.

Beyond
Message 13441 - Posted: 10 Nov 2009, 6:09:38 UTC - in response to Message 13403.  

I have been running GPUGRID for more than a year now, with this machine on it the whole time. It has been pretty reliable, with no errors most of the time. Some batches of work have had it produce one error out of 20 work units. Lately it failed on five out of the last 18 work units, or just under 1/3 of them.

The only change of late has been upgrading to BOINC 6.10.3.

Not sure of your video driver version but this was posted today in the BOINC change log:

The new Nvidia API that BOINC 6.10 uses has a minimum driver set of CUDA 2.2, 185.85.

If your present drivers are below this version, update first.
If your present drivers are below this and they're the last available for your hardware, you cannot update to 6.10; stay at the last 6.6.x version for your OS.

Spread the word.
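
A quick way to check a driver against that minimum is sketched below. It assumes driver versions compare correctly as (major, minor) integer pairs, which holds for the versions mentioned in this thread but is an assumption, not something Nvidia or BOINC documents here. The helper names are made up for illustration.

```python
# Minimum driver quoted in the change-log note above.
MIN_DRIVER = "185.85"


def parse_driver(version):
    """Turn '190.38' into (190, 38) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(version, minimum=MIN_DRIVER):
    """True if the given driver version is at or above the minimum."""
    return parse_driver(version) >= parse_driver(minimum)


# Driver versions mentioned by posters in this thread.
reported = ["190.38", "190.62", "191.07", "190.18", "190.42"]
too_old = [v for v in reported if not meets_minimum(v)]
print(too_old)  # prints []
```

None of the drivers reported in this thread falls below the quoted minimum, so an outdated driver alone would not account for those failures.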

fractal
Message 13554 - Posted: 14 Nov 2009, 20:57:25 UTC - in response to Message 13436.  
Last modified: 14 Nov 2009, 20:58:00 UTC

It has gone downhill for me. Up to 100% error rate. Nothing completed since the 11th. I just checked the machine and all the fans are running just the same as they have been for the past half year. I will let it run with the case open for a while to see if the new work units need better cooling.

oh, and fwiw, I am running UNIX x86_64 Kernel Module 190.18 on http://www.gpugrid.net/show_host_detail.php?hostid=35424

BarryAZ
Message 13601 - Posted: 18 Nov 2009, 16:09:50 UTC - in response to Message 13398.  

Right, I understand the natural administrative inclination to blame the user. This is a natural and human failing, and it has the advantage of avoiding a review of work for which the administrator has some accountability. (In my earlier life I had the admin tasks and know this inclination.)

That being said, these are not overclocked CPUs or cards, and the problem is spread across 7 different configurations with three different GPU processors (9600GT, 9800GT, GTS 250). And it is getting WORSE. Over the past two weeks, 29 out of 85 completed results were failures.

Frankly, if Collatz were not so overburdened with work (they support *normal* ATI cards as well as *normal* CUDA cards), I'd simply back off of GPUGrid and wait for this project to resolve its problems (note, I have not had failures over at Collatz for the much larger sampling over there).

Perhaps when MW comes back to life, and Slicker returns from vacation over at GPU, I will fully move over to there and watch to see here if there is a response other than user error.


The very large majority of errors are produced by overclocking and/or poor cooling of the GPUs.
You might think that your GPU is not overclocked, but the manufacturer did it for you. Look up the Nvidia-suggested clock for your card here:
http://en.wikipedia.org/wiki/GeForce_200_Series

and compare it with your card. Reducing it to the lower, recommended values might well fix all your problems.
With a proper installation you should be able to get a nearly 100% success rate, as several users do.

GDF

BarryAZ
Message 13602 - Posted: 18 Nov 2009, 16:13:51 UTC - in response to Message 13436.  

Same failure rate here -- and *with the same systems* the rate was much lower in September. The change for me included a move to 6.10.x for the client. It certainly is a possible culprit. But that is not something *I* can control, in that any project supporting ATI GPUs requires the 6.10 client. Then again, getting Berkeley to accept that they are part of any problem is even less likely than getting folks here to step up to the plate.

Curiously, I am NOT seeing these failures over at Collatz with either ATI or CUDA GPUs on the same computers.

but I can confirm that the error rate appears to have increased of late.

Copy that, it turns out that about 1/3 of all my WUs have failed recently.

Though it's nice to read about the general hints.....overclocking, heat, driver, blabla, the fact is that none of it applies to my station here....in July/August everything was running fine, but now in October/November I get this failure rate.

BarryAZ
Message 13603 - Posted: 18 Nov 2009, 16:16:04 UTC - in response to Message 13554.  

Perhaps if enough of us report this here with enough variation in OS and hardware, eventually the spotlight on user error might be changed to a mirror....

It has gone downhill for me. Up to 100% error rate. Nothing completed since the 11th. I just checked the machine and all the fans are running just the same as they have been for the past half year. I will let it run with the case open for a while to see if the new work units need better cooling.

oh, and fwiw, I am running UNIX x86_64 Kernel Module 190.18 on http://www.gpugrid.net/show_host_detail.php?hostid=35424

Richard Haselgrove
Message 13604 - Posted: 18 Nov 2009, 16:21:48 UTC

I second this call. My recent failure rate has been 12 out of 58 - over 20% - across three cards: two Zotac 9800GT at completely stock speeds, and a Zotac 'AMP' edition (factory overclock) 9800GTX+. The cards have all been running on SETI since January with no sign of failure, and are doing so again as I type.
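
For reference, the failure counts posted so far in this thread work out as follows. This is a plain tally of the numbers as posted, nothing more: the samples are small, cover different periods and machines, and some may overlap.

```python
# (failed, total) counts as posted in this thread.
reports = {
    "BarryAZ, 27 Oct": (8, 70),
    "Dennis-TW, 29 Oct": (2, 15),
    "Dennis-TW, 2 Nov": (3, 13),
    "fractal, 7 Nov": (5, 18),
    "BarryAZ, 18 Nov": (29, 85),
    "Richard Haselgrove, 18 Nov": (12, 58),
}


def failure_rate(failed, total):
    """Fraction of work units that failed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


for who, (failed, total) in reports.items():
    print(f"{who}: {failed}/{total} = {failure_rate(failed, total):.1%}")
```

The rates run from about 11% for the earliest report to about 34% for the most recent one from the same poster.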

GDF (project administrator)
Message 13605 - Posted: 18 Nov 2009, 16:27:13 UTC - in response to Message 13604.  
Last modified: 18 Nov 2009, 16:45:25 UTC

We will try to upload a new application compiled with cuda 2.3.
Let's see if this solves the problem. The only change we had was that we are now distributing only a cuda2.2 application.

gdf

BarryAZ
Message 13606 - Posted: 18 Nov 2009, 16:32:41 UTC - in response to Message 13605.  

OK -- let folks know when that is in place -- for the moment, my GPUGrid processing is going on hold.

We will try to upload a new application compiled with cuda 2.3.
Let's see if this solves the problem. The only change we had was that we are now distributing only a cuda2.2 application.

gdf

Richard Bertrand
Message 13614 - Posted: 19 Nov 2009, 11:26:47 UTC - in response to Message 13603.  

To put in my 2cts...

I have got CUDA working on a 9600M GS card within Kubuntu amd64 with the nVidia 190.42 driver and have a 100% failure rate thus far.

Just started on the 15th of November with BOINC 6.4.5 from the Ubuntu repositories. GPUGrid is running the v6.70 CUDA version.

Error messages state that a file couldn't be renamed, so I am not 100% sure whether it is the same issue as discussed here, but inspection of the permissions revealed no problems as far as I can see.

©2025 Universitat Pompeu Fabra