Recent problems for WUs on older GPUs

Message boards : Graphics cards (GPUs) : Recent problems for WUs on older GPUs

Bymark
Message 9721 - Posted: 13 May 2009, 18:20:00 UTC - in response to Message 9719.  

Yep, the best setup for a 260 is BOINC 6.4.7 with driver 178.28 and CUDA 2.
Working fine.........



Today the error rate for Kashif wus is lower, so it could have been a problem with drivers.

In the next few days we will perform a server update and application updates to use CUDA2.2.

gdf

Just finished looking at a LOT of KASHIF_HIVPR WUs. The situation is not improving at all and is not a driver issue. What happens is these WUs are downloaded and either fail or are aborted repeatedly until they happen to be assigned to a GTX 260 or above, then they complete. The problem is not fixed and is not improving. IMO it needs to be dealt with ASAP.

Here's just a few examples:

http://www.gpugrid.net/workunit.php?wuid=440561
http://www.gpugrid.net/workunit.php?wuid=442250
http://www.gpugrid.net/workunit.php?wuid=454479
http://www.gpugrid.net/workunit.php?wuid=449101
http://www.gpugrid.net/workunit.php?wuid=457871
http://www.gpugrid.net/workunit.php?wuid=458509

Here's a new KASHIF_HIVPR that was just downloaded to me (and I aborted). Notice that it just caused an error on a GTX 260 (after running a long time, I might add).

http://www.gpugrid.net/workunit.php?wuid=459189

That same GTX 260 has only 3 recent failed WUs, all of them KASHIF_HIVPR. Take a look for yourself:

http://www.gpugrid.net/results.php?hostid=32169

It sure looks like the KASHIF_HIVPR problem also bites the faster cards, just not as often. Our team members have also been reporting the same problem on the GTX 260 and above. So it's documented. Any chance of getting this fixed?


"Silakka"
Hello from Turku > Åbo.

Alain Maes
Message 9722 - Posted: 13 May 2009, 18:31:01 UTC - in response to Message 9714.  
Last modified: 13 May 2009, 18:31:51 UTC

Ubuntu 9.04 comes standard with driver version 180.44, which so far has avoided the need to fiddle with manual interventions.

Will they follow before or after GPUGRID decides to require the 185 version drivers? If a manual update by the Linux community is required, please advise in advance.

Many thanks

Kind regards

Alain

K1atOdessa
Message 9723 - Posted: 13 May 2009, 19:06:03 UTC - in response to Message 9719.  

That same GTX 260 has only 3 recent failed WUs, all of them KASHIF_HIVPR.


It looks like you have a GTX 260 and an 8800GT. All three tasks failed while running on the 8800GT (device 1), not on the GTX 260.
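
On a host with two cards, the device index each task ran on decides which card is actually at fault. A minimal Python sketch of that tally follows. The record layout and the sample task names are hypothetical placeholders for illustration, not data taken from the host linked above.

```python
from collections import Counter

def failures_by_device(results):
    """Count failed tasks per GPU device index.

    `results` is an iterable of (task_name, device_index, succeeded) tuples.
    This layout is an assumption made for illustration only.
    """
    return Counter(device for _, device, ok in results if not ok)

# Hypothetical records: device 0 stands in for the GTX 260,
# device 1 for the 8800GT.
sample = [
    ("task-a", 1, False),
    ("task-b", 0, True),
    ("task-c", 1, False),
    ("task-d", 1, False),
]

print(failures_by_device(sample))
```

If every failure lands on one device index, as in this made-up sample, the other card on the same host should not be counted as failing.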

K1atOdessa
Message 9724 - Posted: 13 May 2009, 19:12:45 UTC

In light of the issues with the older GPUs and the KASHIF_HIVPR WUs, what is the best version of the NVIDIA driver to use?

I have been aborting them when I see them, to get them over to a 200 series as quickly as possible. I don't think it is beneficial for the project for me to let this sit in my queue for 12 hours, then run for another several before failing anyway.

I'd prefer not to babysit, so should I roll back my current 185.66 to the last WHQL approved non-185.xx driver, which is 182.50?

I guess I could just try this and report the results, but I wanted to know if anyone has already tried this 182.50 driver w/ an older (non-200-series) card.

Beyond
Message 9725 - Posted: 13 May 2009, 20:21:33 UTC - in response to Message 9723.  

That same GTX 260 has only 3 recent failed WUs, all of them KASHIF_HIVPR.


It looks like you have a GTX 260 and an 8800GT. All three tasks failed while running on the 8800GT (device 1), not on the GTX 260.

You're right. Not my machine and I didn't see the 2 cards. But OK here's an example from a machine with only a GTX 260:

http://www.gpugrid.net/result.php?resultid=663665


K1atOdessa
Message 9727 - Posted: 13 May 2009, 21:16:45 UTC - in response to Message 9725.  

That same GTX 260 has only 3 recent failed WUs, all of them KASHIF_HIVPR.


It looks like you have a GTX 260 and an 8800GT. All three tasks failed while running on the 8800GT (device 1), not on the GTX 260.

You're right. Not my machine and I didn't see the 2 cards. But OK here's an example from a machine with only a GTX 260:

http://www.gpugrid.net/result.php?resultid=663665



:-) That one reports as "Aborted by user". So I don't think it errored out under normal circumstances -- it was manually aborted.

mike047
Message 9728 - Posted: 13 May 2009, 21:36:49 UTC - in response to Message 9715.  

CUDA 2.2 libs will be distributed with the application, but you will need to upgrade the driver to the latest 185 version.

gdf



Are you saying that without 185 version drivers we will not be able to successfully do GPUGRID work?

I have card/box combinations that will not accept that version and run properly.

If 185 version driver and above is "required" to crunch here, I will be taking my farm to FAH.



Is this query unworthy of an answer?
mike

MarkJ
Volunteer moderator
Volunteer tester
Message 9730 - Posted: 13 May 2009, 21:56:37 UTC - in response to Message 9714.  

CUDA 2.2 libs will be distributed with the application, but you will need to upgrade the driver to the latest 185 version.

gdf


Thanks.

I'd suggest a note in the news section on the home page. That way people can start organising things. I have already set GPUgrid to "no new work" so I can finish off what I have before doing the driver upgrades. I've got a few machines to do :)
BOINC blog

Aardvark
Message 9731 - Posted: 13 May 2009, 22:07:07 UTC

Task ID 665546 had been running well along with another task. As I was about to run a program that would "use" the GPU, I decided to suspend all tasks and exit BOINC. Once I had completed my task I launched BOINC, and all tasks still appeared suspended. So far so good. I then resumed all tasks, and task 665546 immediately went to "compute error". I also had another task, 652947, that had been running for 29 out of about 30 hours and failed (different machine). When I get the time I will compile a list of the failures and successes over the past few days.

uBronan
Message 9734 - Posted: 13 May 2009, 22:38:33 UTC

Which card/machine combinations can't use the 185.85 version, may I ask, mike047?

Beyond
Message 9736 - Posted: 13 May 2009, 23:19:18 UTC - in response to Message 9727.  
Last modified: 13 May 2009, 23:20:16 UTC

That same GTX 260 has only 3 recent failed WUs, all of them KASHIF_HIVPR.

It looks like you have a GTX 260 and an 8800GT. All three tasks failed while running on the 8800GT (device 1), not on the GTX 260.

You're right. Not my machine and I didn't see the 2 cards. But OK here's an example from a machine with only a GTX 260:

http://www.gpugrid.net/result.php?resultid=663665


:-) That one reports as "Aborted by user". So I don't think it errored out under normal circumstances -- it was manually aborted.

The user is one of my team members and he reported it as being stuck. It had processed for over twice as long as his other WUs and showed no progress. He was using BOINC client v6.6.28, not v6.6.20 so that wasn't the problem. :-)
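
The "stuck" call described here (more than twice the usual runtime, with no visible progress) can be written down as a simple heuristic. The Python sketch below is only an illustration: the function name, the parameters, and the default factor of 2 are assumptions drawn from the description above, not anything the BOINC client provides.

```python
from statistics import median

def looks_stuck(elapsed_hours, progress_before, progress_now,
                other_runtimes_hours, factor=2.0):
    """Flag a task as possibly stuck.

    A task is flagged only if it has run longer than `factor` times the
    median runtime of comparable finished tasks AND its progress fraction
    has not advanced between two observations.
    """
    if not other_runtimes_hours:
        # Nothing to compare against, so make no claim.
        return False
    ran_too_long = elapsed_hours > factor * median(other_runtimes_hours)
    no_progress = progress_now <= progress_before
    return ran_too_long and no_progress
```

For example, `looks_stuck(25.0, 0.4, 0.4, [10, 11, 12])` flags the task, while the same task with progress moving from 0.4 to 0.5 is left alone. Requiring both conditions avoids aborting a WU that is merely slow on an older card.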

Beyond
Message 9737 - Posted: 13 May 2009, 23:49:00 UTC - in response to Message 9727.  

That same GTX 260 has only 3 recent failed WUs, all of them KASHIF_HIVPR.


It looks like you have a GTX 260 and an 8800GT. All three tasks failed while running on the 8800GT (device 1), not on the GTX 260.

You're right. Not my machine and I didn't see the 2 cards. But OK here's an example from a machine with only a GTX 260:

http://www.gpugrid.net/result.php?resultid=663665


:-) That one reports as "Aborted by user". So I don't think it errored out under normal circumstances -- it was manually aborted.

Here's a bunch more for your viewing pleasure:

http://www.gpugrid.net/result.php?resultid=659111
http://www.gpugrid.net/result.php?resultid=664645
http://www.gpugrid.net/result.php?resultid=666952
http://www.gpugrid.net/result.php?resultid=647270
http://www.gpugrid.net/result.php?resultid=660927
http://www.gpugrid.net/result.php?resultid=666863


Certainly not as common as with the slower cards, but not at all hard to find.
The last 2 are test WUs...


K1atOdessa
Message 9738 - Posted: 14 May 2009, 0:00:01 UTC - in response to Message 9737.  

@Beyond - I didn't doubt you. :-)


@GDF/Admin:

Given that these KASHIF_HIVPR WUs seem to error out a lot, especially with "older, slower" cards, but also occasionally with the newer 200-series (as shown by Beyond), are no new ones going to be created?

I can understand cleaning out the queue, but I have gotten several today and with my cards I almost certainly expect them to error out. If I catch them in my queue, I try to abort them so they can move to a 200-series with a better chance of finishing in a timely manner.

Is there any analysis from the project on why these particular WUs are an issue? I've read comments about the drivers possibly being an issue, but given that the CUDA 2.2 software on the server will require these 185.xx drivers, I expect to continue having issues with these WUs if they are still in the queue. All others work fine.

dataman
Message 9739 - Posted: 14 May 2009, 2:10:31 UTC

As GPUGrid clearly does not want to put in much effort to support 8 and 9 series cards, I'm done here for now. I'd rather shut them down than to waste time and electricity in an endless circle jerk of BOINC versions and drivers. But hey, 3.7 million credits was a good run for me here. There will be a new GPU project out soon.
Sad really, as I think some of the science was worth doing here. :)
Ciao.


mike047
Message 9742 - Posted: 14 May 2009, 6:51:56 UTC - in response to Message 9734.  

Which card/machine combinations can't use the 185.85 version, may I ask, mike047?


I don't have that information at hand presently. Basically I use Ubuntu 8.04 LTS. The 260 and 250 cards have no trouble using 180.22 and might be able to use a higher driver without issue. Some of my 8800/9600GSO/9800 cards will not accept any driver above 177.82. All motherboards are Gigabyte P35/45.

I don't know what the issues are with this project and I am willing "to do" a little work to be able to run this project. BUT, I am unwilling to babysit and periodically change drivers to suit a project that is becoming unwilling to respond to my queries and the queries of others.

Unfortunately I have invested in many NVIDIA cards that at present cannot be used elsewhere in BOINC. FAH is the only other place that can use my cards. I have one box working there now and it has run absolutely trouble free with NO intervention on my part. The + to FAH is that my internet is not shut down when it has to upload; the 50+m uploads from here shut my internet down... I know that is not a project fault but it is an issue for me.

This is a good project with good science but it has gotten away from communicating with the participants in a timely manner. IMHO the project has slipped badly from where it was several months ago.
mike

JockMacMad TSBT
Message 9743 - Posted: 14 May 2009, 8:15:20 UTC - in response to Message 9742.  
Last modified: 14 May 2009, 8:20:00 UTC

I can confirm my BFG GTX-260 192 Shader card is also getting a lot of these errors with 185.81.

One example

JockMacMad TSBT
Message 9751 - Posted: 14 May 2009, 13:16:48 UTC - in response to Message 9743.  

Oh and SETI has nVidia support so there is another BOINC project.

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Message 9752 - Posted: 14 May 2009, 14:10:05 UTC - in response to Message 9751.  
Last modified: 14 May 2009, 14:18:20 UTC

We have tested with drivers 185.xx on an 8800GT. All the WUs fail.
With driver 180.xx all WUs are fine.

So, for now we can only suggest downgrading to the older drivers (180.xx), which seem to work.

We have reported the issue to Nvidia.
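
The finding above can be turned into a quick self-check. The Python sketch below encodes only that single reported test (185.xx failing on an 8800GT, 180.xx working). The function names, the `is_200_series` flag, and the returned strings are hypothetical, and the installed driver version still has to be read off by hand and passed in.

```python
def driver_major(version):
    """Return the major branch of a driver version string such as '185.66'."""
    return int(version.strip().split(".")[0])

def suggest_driver_action(version, is_200_series):
    """Apply the one rule reported in this thread.

    Assumption: pre-200-series cards behave like the tested 8800GT.
    Nothing here says 185.xx is safe on 200-series cards; the test
    simply did not cover them.
    """
    if driver_major(version) == 185 and not is_200_series:
        return "downgrade to 180.xx"
    return "no change suggested"
```

So `suggest_driver_action("185.66", False)` suggests the downgrade, while `suggest_driver_action("180.44", False)` does not. The rule is deliberately narrow, because errors on a GTX 260 with 185.81 were also reported earlier in the thread and are not explained by this test.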

gdf

Paul D. Buck
Message 9761 - Posted: 14 May 2009, 16:06:59 UTC - in response to Message 9736.  
Last modified: 14 May 2009, 16:13:19 UTC

The user is one of my team members and he reported it as being stuck. It had processed for over twice as long as his other WUs and showed no progress. He was using BOINC client v6.6.28, not v6.6.20 so that wasn't the problem. :-)

Yes, and maybe no ...

6.6.20 stunk in this regard... it really sucked swamp water ...

6.6.23 and later, *I* for one thought, fixed it ... now I am not so sure.

What I ***THINK*** happened is that most of the causes have been cleaned up ... but sometimes something bad happens. And THEN, you get a task that runs long.

There are still issues with the way that the resource scheduling is done. I am banging my head on the wall about things that *I* think I can clearly demonstrate, only to be patted on the head and told to go 'way, you bother me ...

I mean, just last night I had five tasks all start and die in less than a second. At the moment the answer is that this is not possible. My 2,200+ log file of those two seconds notwithstanding ...

Anyway, ... I am far less sanguine about how "fixed" we are ...

{edit}

An example: 12-TONI_HIVPR_mon_ba20-7-100-RND1398_0 and that was run on a 6.6.25 client ... 182.50 drivers I think at the time. 115 ms step size ...

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Message 9830 - Posted: 16 May 2009, 9:38:33 UTC - in response to Message 9761.  

We have managed to replicate the problem on one of our machines.
This should lead to a solution soon.

Be patient.

gdf

©2025 Universitat Pompeu Fabra