Recent problems for WUs on older GPUs

Message boards : Graphics cards (GPUs) : Recent problems for WUs on older GPUs

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Message 9642 - Posted: 11 May 2009, 17:05:36 UTC

We are having problems with several workunits on GPUs other than the GTX 260/275/285/295. Because we test on the newer cards, we had not spotted the problem before.

The problem appears only for workunits using the Amber format (all the KASHIF ones).

We have now removed all the workunits we could, but left some KASHIF ones in place, as they run just fine on newer cards.

We are testing KASHIF_HIV_* on two 8800 cards under Windows and Linux; they are running fine so far.

We will keep you updated.

gdf
Blackbird74
Message 9670 - Posted: 12 May 2009, 12:38:36 UTC - in response to Message 9642.  
Last modified: 12 May 2009, 12:39:03 UTC

I had a bunch of compute errors on my 8800GT, but then the latest KASHIF_HIVPR completed OK over a couple of days.
Full task list: http://www.gpugrid.net/results.php?userid=9833
Latest KASHIF_HIVPR WU completed fine: http://www.gpugrid.net/workunit.php?wuid=449234
Comp specs: http://www.gpugrid.net/show_host_detail.php?hostid=17613

There doesn't seem to be much rhyme or reason to the failures, other than the recent problems with WUs in general (blackout).
GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Message 9673 - Posted: 12 May 2009, 13:17:43 UTC - in response to Message 9670.  
Last modified: 12 May 2009, 13:36:30 UTC

Today the error rate for Kashif wus is lower, so it could have been a problem with drivers.

In the next few days we will perform a server update and application updates to use CUDA2.2.

gdf
TomaszPawel
Message 9678 - Posted: 12 May 2009, 19:46:57 UTC - in response to Message 9673.  

2009-05-12 15:08:32 GPUGRID Starting task 4-KASHIF_HIVPRFE_dim_ba1-2-4-RND6858_0 using acemd version 664

Hmm, it is now at 39.8% after 6:40 h, and it says 10 h remain...

Is that normal on a GTX 260 with driver 182.08, BOINC 6.6.20 and 32-bit XP?
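
As a rough sanity check on the client's estimate, here is a small sketch that extrapolates linearly from the figures in this post. It assumes the task progresses at a constant rate, which may not hold for every workunit, and the function name `estimate_remaining_hours` is just an illustration, not anything from BOINC.

```python
def estimate_remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Linear extrapolation: assumes the task progresses at a constant rate."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    total_hours = elapsed_hours / fraction_done
    return total_hours - elapsed_hours


elapsed = 6 + 40 / 60  # 6:40 h, as reported above
remaining = estimate_remaining_hours(elapsed, 0.398)
print(f"estimated total: {elapsed + remaining:.1f} h, remaining: {remaining:.1f} h")
# prints: estimated total: 16.8 h, remaining: 10.1 h
```

So the "10 h remaining" shown by the client is consistent with the progress so far; the open question is why the whole WU takes roughly 17 h.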
POLISH NATIONAL TEAM - Join! Crunch! Win!
Beyond
Message 9682 - Posted: 12 May 2009, 20:36:42 UTC - in response to Message 9673.  
Last modified: 12 May 2009, 20:39:12 UTC

Today the error rate for Kashif wus is lower, so it could have been a problem with drivers.

Or people avoiding them like the plague. People on our team have been reporting stuck and failed WUs like never before.

In the next few days we will perform a server update and application updates to use CUDA2.2.

Will we still be able to use our older non-CUDA2.2 cards?
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 9687 - Posted: 12 May 2009, 21:34:54 UTC - in response to Message 9682.  

Will we still be able to use our older non-CUDA2.2 cards?


That's just the software version and depends on the driver. There's also the CUDA hardware capability, which is the critical one. This one *should* stay as it was before (minimum of 1.1 required).
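
To make the distinction concrete, here is a minimal sketch of the hardware-side check, assuming the minimum compute capability of 1.1 mentioned above. The table and the name `meets_minimum` are illustrative and not part of the GPUGRID application. The capability values are the ones NVIDIA publishes for these cards as far as I know (worth verifying against NVIDIA's own list), and the 8800 GTX entry is included only as an example of a card below the minimum.

```python
# Hardware compute capability as (major, minor); illustrative table, not from the app.
COMPUTE_CAPABILITY = {
    "GeForce 8800 GTX": (1, 0),  # example of a card below the minimum
    "GeForce 8800 GT": (1, 1),
    "GeForce 9600 GT": (1, 1),
    "GeForce GTS 250": (1, 1),
    "GeForce GTX 260": (1, 3),
}

MIN_REQUIRED = (1, 1)  # minimum stated in this post


def meets_minimum(card: str) -> bool:
    """Tuples compare element by element, so (1, 3) >= (1, 1) is True."""
    return COMPUTE_CAPABILITY[card] >= MIN_REQUIRED


for card in COMPUTE_CAPABILITY:
    print(card, "OK" if meets_minimum(card) else "below minimum")
```

The CUDA 2.2 software version is independent of this table: it comes with the driver and toolkit, not with the card.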

Tomasz, your GTX 260 is not exactly an older card (as stated in the first post of this thread).

MrS
Scanning for our furry friends since Jan 2002
TomaszPawel
Message 9689 - Posted: 12 May 2009, 22:09:10 UTC - in response to Message 9687.  

It is as clear as crystal ...

But it usually crunches a WU in 7-8 h, not 17!

And this type of WU is mentioned in this thread, so maybe it is relevant?
POLISH NATIONAL TEAM - Join! Crunch! Win!
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 9691 - Posted: 12 May 2009, 22:23:59 UTC - in response to Message 9689.  

Alright.. could be the *usual* 6.6.20 bug.

MrS
Scanning for our furry friends since Jan 2002
Paul D. Buck
Message 9693 - Posted: 12 May 2009, 23:21:21 UTC - in response to Message 9691.  

Alright.. could be the *usual* 6.6.20 bug.

MrS

Sadly, I may have seen it on a task processed with 6.6.23. That would mean the real problem has not been addressed: the changes in 6.6.23 and later improve things but do not cure it.
The Brain QC
Message 9699 - Posted: 13 May 2009, 8:58:22 UTC - in response to Message 9693.  

I have 5-KASHIF_HIVPR_dim_ba1-4-100-RND6112_0 (acemd version 664) that has been running for 21 hours on a 9800 GX2 and is 68% done. I have never had such a long WU on GPUGRID; usually I complete 3 or 4 WUs in 21 hours. I hope the credit will be as great as the time it takes to compute ;).
MarkJ
Volunteer moderator
Volunteer tester
Message 9701 - Posted: 13 May 2009, 9:47:29 UTC - in response to Message 9673.  
Last modified: 13 May 2009, 9:58:36 UTC

In the next few days we will perform a server update and application updates to use CUDA2.2.

gdf


So do we need to upgrade to the 185.85 drivers and the CUDA 2.2 DLLs? Or will the app work out which CUDA version is installed and only use the instruction set that is supported?

Will GPUGRID download the CUDA 2.2 DLLs, or will we need to put them somewhere (like the projects\gpugrid folder) when the new app is released?

Oh, and seeing as you are changing the app, is there a chance you could report the driver version and the CUDA version in the WU info? It might help with debugging.

<core_client_version>6.6.28</core_client_version>
<![CDATA[
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce GTS 250"
#  Clock rate: 1836000 kilohertz
#  Total amount of global memory:                 536543232 bytes
#  Number of multiprocessors:                     16
#  Number of cores:                               128
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
#  Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568:  Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568:  Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
# Time per step: 	46.163 ms
# Approximate elapsed time for entire WU:  	46163.094 s
called boinc_finish

</stderr_txt>
]]>
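
For anyone who wants to pull the timing figures out of output like the above, here is a short sketch. It assumes the "approximate elapsed time" is simply the number of steps multiplied by the time per step, which is my reading of the log and not something confirmed by the project; `parse_timing` and `STDERR_EXCERPT` are made-up names for illustration.

```python
import re

# Three lines copied from the stderr output above (tabs kept as in the log).
STDERR_EXCERPT = """\
# Device 0: "GeForce GTS 250"
# Time per step: \t46.163 ms
# Approximate elapsed time for entire WU:  \t46163.094 s
"""


def parse_timing(text: str) -> tuple[float, float]:
    """Return (time per step in ms, estimated total time in s) from acemd stderr."""
    step_ms = float(re.search(r"Time per step:\s*([\d.]+)\s*ms", text).group(1))
    total_s = float(
        re.search(r"elapsed time for entire WU:\s*([\d.]+)\s*s", text).group(1)
    )
    return step_ms, total_s


step_ms, total_s = parse_timing(STDERR_EXCERPT)
steps = round(total_s / (step_ms / 1000.0))  # implied number of steps
hours = total_s / 3600.0
print(f"{steps} steps, about {hours:.1f} h on this card")
# prints: 1000002 steps, about 12.8 h on this card
```

Under that assumption the WU is about a million steps, which works out to nearly 13 hours on the GTS 250.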

BOINC blog
uBronan
Message 9707 - Posted: 13 May 2009, 12:12:15 UTC
Last modified: 13 May 2009, 12:18:19 UTC

Well, as always, I downloaded both the driver and the CUDA toolkit from the NVIDIA site.
After the initial pause on GPUGRID, I have to report that I have not had a failing unit for a few days now.
I am not sure whether anyone else downloads the CUDA toolkit or just the driver.
I am almost done with the test unit, which finishes in about half an hour or so.
I hope the newly received IBUCH ones will also finish without issues.
If they all finish without error, I will start to get the feeling the problems are solved... I hope :D
Beyond
Message 9711 - Posted: 13 May 2009, 13:42:36 UTC - in response to Message 9673.  
Last modified: 13 May 2009, 14:07:26 UTC

Today the error rate for Kashif wus is lower, so it could have been a problem with drivers.

In the next few days we will perform a server update and application updates to use CUDA2.2.

gdf

Just finished looking at a LOT of KASHIF_HIVPR WUs. The situation is not improving at all and is not a driver issue. What happens is these WUs are downloaded and either fail or are aborted repeatedly until they happen to be assigned to a GTX 260 or above, then they complete. The problem is not fixed and is not improving. IMO it needs to be dealt with ASAP.

Here's just a few examples:

http://www.gpugrid.net/workunit.php?wuid=440561
http://www.gpugrid.net/workunit.php?wuid=442250
http://www.gpugrid.net/workunit.php?wuid=454479
http://www.gpugrid.net/workunit.php?wuid=449101
http://www.gpugrid.net/workunit.php?wuid=457871
http://www.gpugrid.net/workunit.php?wuid=458509
TomaszPawel
Message 9713 - Posted: 13 May 2009, 14:40:23 UTC - in response to Message 9711.  
Last modified: 13 May 2009, 14:41:25 UTC

"2009-05-12 15:08:32 GPUGRID Starting task 4-KASHIF_HIVPRFE_dim_ba1-2-4-RND6858_0 using acemd version 664

Hmm, it is now at 39.8% after 6:40 h, and it says 10 h remain...

Is that normal on a GTX 260 with driver 182.08, BOINC 6.6.20 and 32-bit XP?"


Well, now it has been crunching that WU for 18 h and it is at 83%; it says 3 h 30 min remaining...
POLISH NATIONAL TEAM - Join! Crunch! Win!
GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Message 9714 - Posted: 13 May 2009, 14:52:02 UTC - in response to Message 9713.  

CUDA 2.2 libs will be distributed with the application, but you will need to upgrade the driver to the latest 185 version.

gdf
mike047
Message 9715 - Posted: 13 May 2009, 15:01:22 UTC - in response to Message 9714.  

CUDA 2.2 libs will be distributed with the application, but you will need to upgrade the driver to the latest 185 version.

gdf



Are you saying that without version 185 drivers we will not be able to do GPUGRID work successfully?

I have card/box combinations that will not accept that version and run properly.

If a driver of version 185 or above is "required" to crunch here, I will be taking my farm to FAH.
mike
uBronan
Message 9716 - Posted: 13 May 2009, 15:06:05 UTC

The test unit also finished without problems, and new IBUCHs are on the way.
I haven't had any cancelled, other than one that had been in the queue for almost 2 days, so nothing special there.
I am still running 185.85 and BOINC 6.6.28.
Apart from the usual BOINC issues like fetch and such, it runs stable for me; my slow 9600 GT seems to do well.
But I had to lower my CPU clock since I had to disable my water cooling; the 9850 BE is a hothead, because even with the huge cooler on it reaches 55 C.
But it is extremely warm here today; I measured an ambient temperature of 31 C in the room.
Beyond
Message 9717 - Posted: 13 May 2009, 15:31:00 UTC - in response to Message 9716.  

The test unit also finished without problems, and new IBUCHs are on the way.
I haven't had any cancelled, other than one that had been in the queue for almost 2 days, so nothing special there.
I am still running 185.85 and BOINC 6.6.28.
Apart from the usual BOINC issues like fetch and such, it runs stable for me; my slow 9600 GT seems to do well.
But I had to lower my CPU clock since I had to disable my water cooling; the 9850 BE is a hothead, because even with the huge cooler on it reaches 55 C.
But it is extremely warm here today; I measured an ambient temperature of 31 C in the room.

Your computers are hidden so how can we verify?
Beyond
Message 9719 - Posted: 13 May 2009, 17:06:28 UTC - in response to Message 9711.  
Last modified: 13 May 2009, 17:09:56 UTC

Today the error rate for Kashif wus is lower, so it could have been a problem with drivers.

In the next few days we will perform a server update and application updates to use CUDA2.2.

gdf

Just finished looking at a LOT of KASHIF_HIVPR WUs. The situation is not improving at all and is not a driver issue. What happens is these WUs are downloaded and either fail or are aborted repeatedly until they happen to be assigned to a GTX 260 or above, then they complete. The problem is not fixed and is not improving. IMO it needs to be dealt with ASAP.

Here's just a few examples:

http://www.gpugrid.net/workunit.php?wuid=440561
http://www.gpugrid.net/workunit.php?wuid=442250
http://www.gpugrid.net/workunit.php?wuid=454479
http://www.gpugrid.net/workunit.php?wuid=449101
http://www.gpugrid.net/workunit.php?wuid=457871
http://www.gpugrid.net/workunit.php?wuid=458509

Here's a new KASHIF_HIVPR that was just downloaded to me (and that I aborted). Notice that it just caused an error on a GTX 260 (after running a long time, I might add).

http://www.gpugrid.net/workunit.php?wuid=459189

That same GTX 260 has only 3 recent failed WUs, all of them KASHIF_HIVPR. Take a look for yourself:

http://www.gpugrid.net/results.php?hostid=32169

It sure looks like the KASHIF_HIVPR problem also bites the faster cards, just not as often. Our team members have also been reporting the same problem on the GTX 260 and above. So it's documented. Any chance of getting this fixed?
TomaszPawel
Message 9720 - Posted: 13 May 2009, 18:14:44 UTC - in response to Message 9713.  
Last modified: 13 May 2009, 18:16:05 UTC

"2009-05-12 15:08:32 GPUGRID Starting task 4-KASHIF_HIVPRFE_dim_ba1-2-4-RND6858_0 using acemd version 664

Hmm, it is now at 39.8% after 6:40 h, and it says 10 h remain...

Is that normal on a GTX 260 with driver 182.08, BOINC 6.6.20 and 32-bit XP?"


Well, now it has been crunching that WU for 18 h and it is at 83%; it says 3 h 30 min remaining...

LOL, after 24 h of crunching: 3600 points...
POLISH NATIONAL TEAM - Join! Crunch! Win!

©2025 Universitat Pompeu Fabra