Recent problems for WUs on older GPUs
| Author | Message |
|---|---|
GDF (Joined: 14 Mar 07; Posts: 1958; Credit: 629,356; RAC: 0) |
We are having problems with several workunits on GPUs that are NOT 260/275/285/295. Because we test on newer cards, we did not spot the problem earlier. It appears only for workunits using the Amber format (all the KASHIF ones). We have now removed all that we could remove, but left some KASHIF ones out there, as they run just fine on newer cards. We are testing KASHIF_HIV_* on two 8800 cards under Windows and Linux, running fine so far. We will keep you updated. gdf |
|
Joined: 20 Nov 08; Posts: 3; Credit: 362,118; RAC: 0
|
I had a bunch of compute errors on my 8800GT, but then the latest KASHIF_HIVPR completed OK over a couple of days. Full task list: http://www.gpugrid.net/results.php?userid=9833 Latest KASHIF_HIVPR WU completed fine: http://www.gpugrid.net/workunit.php?wuid=449234 Comp specs: http://www.gpugrid.net/show_host_detail.php?hostid=17613 There doesn't seem to be much rhyme or reason to the failures, other than the recent problems with WUs in general (blackout). |
GDF (Joined: 14 Mar 07; Posts: 1958; Credit: 629,356; RAC: 0) |
Today the error rate for Kashif wus is lower, so it could have been a problem with drivers. In the next few days we will perform a server update and application updates to use CUDA2.2. gdf |
|
Joined: 18 Aug 08; Posts: 121; Credit: 59,836,411; RAC: 0
|
2009-05-12 15:08:32 GPUGRID Starting task 4-KASHIF_HIVPRFE_dim_ba1-2-4-RND6858_0 using acemd version 664. Hmm, it is now at 39.8% after 6:40 h, and it says 10 h remain. Is that normal on a GTX 260 with driver 182.08, BOINC 6.6.20 and XP 32-bit? POLISH NATIONAL TEAM - Join! Crunch! Win! |
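The remaining-time figure reported above is consistent with a simple linear extrapolation from elapsed time and percent done. A minimal Python sketch of that arithmetic, assuming progress is linear in time (the function name `estimate_remaining_hours` is hypothetical and not part of BOINC or acemd):

```python
def estimate_remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Linear extrapolation: assumes the task advances at a constant rate."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    total_hours = elapsed_hours / fraction_done
    return total_hours - elapsed_hours


# Figures taken from the post: 39.8% done after 6 h 40 min.
elapsed = 6 + 40 / 60
remaining = estimate_remaining_hours(elapsed, 0.398)
print(f"estimated total: {elapsed + remaining:.1f} h, remaining: {remaining:.1f} h")
```

With the numbers from the post this gives roughly 16.8 h in total and about 10.1 h remaining, which matches the 10 h shown by the client and would be about twice the usual 7-8 h runtime if the rate stays constant.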
Beyond (Joined: 23 Nov 08; Posts: 1112; Credit: 6,162,416,256; RAC: 0)
|
"Today the error rate for Kashif wus is lower, so it could have been a problem with drivers." Or people are avoiding them like the plague. People on our team have been reporting stuck and failed WUs like never before. "In the next few days we will perform a server update and application updates to use CUDA2.2." Will we still be able to use our older non-CUDA2.2 cards? |
|
Joined: 17 Aug 08; Posts: 2705; Credit: 1,311,122,549; RAC: 0 |
"Will we still be able to use our older non-CUDA2.2 cards?" That's just the software version, and it depends on the driver. There's also the CUDA hardware capability, which is the critical one. This one *should* stay as it was before (minimum of 1.1 required). Thomasz, your GTX 260 is not exactly an older card (as stated in the first post of this thread). MrS Scanning for our furry friends since Jan 2002 |
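The distinction drawn above, between the driver-dependent CUDA software version and the card's hardware compute capability, can be sketched as two independent checks. This is an illustration only: the `Gpu` type, the `can_run` function and the `(185, 0)` driver threshold are assumptions made for the example and are not the project's actual scheduling logic. Only the minimum capability of 1.1 comes from the post.

```python
from typing import List, NamedTuple, Tuple


class Gpu(NamedTuple):
    name: str
    compute_capability: Tuple[int, int]  # fixed by the hardware
    driver_version: Tuple[int, int]      # changes when the driver is upgraded


MIN_COMPUTE_CAPABILITY = (1, 1)  # stated in the post above
MIN_DRIVER_VERSION = (185, 0)    # hypothetical threshold for a CUDA 2.2 app


def can_run(gpu: Gpu) -> Tuple[bool, List[str]]:
    """Return (ok, reasons); hardware and driver are checked separately."""
    reasons = []
    if gpu.compute_capability < MIN_COMPUTE_CAPABILITY:
        reasons.append("hardware compute capability too low (cannot be fixed in software)")
    if gpu.driver_version < MIN_DRIVER_VERSION:
        reasons.append("driver too old (fixable by upgrading the driver)")
    return (not reasons, reasons)


# Illustrative cards; the driver versions are the ones mentioned in this thread.
for gpu in (
    Gpu("GeForce 8800 GT", (1, 1), (185, 85)),
    Gpu("GeForce GTX 260", (1, 3), (182, 8)),
):
    ok, reasons = can_run(gpu)
    print(gpu.name, "OK" if ok else "; ".join(reasons))
```

The point of keeping the two checks apart is that an older card with capability 1.1 passes once its driver is current, while a newer card on an old driver fails only the part that an upgrade can fix.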
|
Joined: 18 Aug 08; Posts: 121; Credit: 59,836,411; RAC: 0
|
That is crystal clear... But it usually crunches a WU in 7-8 h, not 17! And this type of WU is mentioned in this thread, so maybe it is relevant? POLISH NATIONAL TEAM - Join! Crunch! Win! |
|
Joined: 17 Aug 08; Posts: 2705; Credit: 1,311,122,549; RAC: 0 |
|
Paul D. Buck (Joined: 9 Jun 08; Posts: 1050; Credit: 37,321,185; RAC: 0)
|
Alright.. could be the *usual* 6.6.20 bug. Sadly I may have seen it on a 6.6.23 processed task. That means that the real problem has not been addressed, though the changes in 6.6.23 and later make it better, but not cured. |
|
Joined: 27 Oct 08; Posts: 27; Credit: 3,211,916; RAC: 0
|
I have 5-KASHIF_HIVPR_dim_ba1-4-100-RND6112_0 (acemd version 664) that has been running for 21 hours on a 9800 GX2 and is 68% done. I have never had such a long WU on GPUGRID; usually I complete 3 or 4 WUs in 21 hours. I hope the credit will be as great as the time it takes to compute ;). |
|
Joined: 24 Dec 08; Posts: 738; Credit: 200,909,904; RAC: 0 |
"In the next few days we will perform a server update and application updates to use CUDA2.2." So do we need to upgrade to the 185.85 drivers and the CUDA 2.2 DLLs? Or will the app work out which CUDA version is present and only use the instruction set that is supported? Will GPUGRID download the CUDA 2.2 DLLs, or will we need to put them somewhere (like the projects\gpugrid folder) when the new app is released? Oh, and seeing as you are changing the app, is there a chance you could report the driver version and the CUDA version in the WU info? It might help with the debugging.
<core_client_version>6.6.28</core_client_version>
<![CDATA[
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce GTS 250"
# Clock rate: 1836000 kilohertz
# Total amount of global memory: 536543232 bytes
# Number of multiprocessors: 16
# Number of cores: 128
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
# Time per step: 46.163 ms
# Approximate elapsed time for entire WU: 46163.094 s
called boinc_finish
</stderr_txt>
]]>
BOINC blog |
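The `# key: value` lines in the stderr output quoted above already carry most of what is needed for debugging, and the timing lines allow a quick sanity check. A minimal Python sketch, assuming the excerpt is available as plain text (the `parse_stderr` helper is hypothetical and not part of BOINC or acemd):

```python
import re

# Excerpt copied from the stderr output in the post above.
STDERR = """\
# Using CUDA device 0
# Device 0: "GeForce GTS 250"
# Clock rate: 1836000 kilohertz
# Total amount of global memory: 536543232 bytes
# Number of multiprocessors: 16
# Number of cores: 128
# Time per step: 46.163 ms
# Approximate elapsed time for entire WU: 46163.094 s
"""


def parse_stderr(text: str) -> dict:
    """Collect '# key: value' lines into a dict; other lines are ignored."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"#\s*([^:]+):\s*(.+)", line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


info = parse_stderr(STDERR)
ms_per_step = float(info["Time per step"].split()[0])
total_seconds = float(info["Approximate elapsed time for entire WU"].split()[0])

steps = round(total_seconds / (ms_per_step / 1000.0))
cores_per_mp = int(info["Number of cores"]) // int(info["Number of multiprocessors"])

print(f"device: {info['Device 0']}")
print(f"implied steps: {steps}, estimated runtime: {total_seconds / 3600:.1f} h")
print(f"cores per multiprocessor: {cores_per_mp}")
```

For this excerpt the two timing lines imply about one million steps and an estimated runtime of about 12.8 h on the GTS 250. A driver or CUDA version line, if the application printed one in the same format, would be picked up by the same parser without changes.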
|
Joined: 1 Feb 09; Posts: 139; Credit: 575,023; RAC: 0
|
Well, as always I downloaded both the driver and the CUDA toolkit from the NVIDIA site. After the initial pause on GPUGRID, I can report that I have not had a failing unit for a few days now. I am not sure whether anyone else downloads the CUDA toolkit or just the driver. I am almost done with the test unit, which finishes in about half an hour or so. I hope the newly received IBUCH ones will also finish without issues. If they all finish without error, I will start to get the feeling the problems are solved... I hope :D |
Beyond (Joined: 23 Nov 08; Posts: 1112; Credit: 6,162,416,256; RAC: 0)
|
"Today the error rate for Kashif wus is lower, so it could have been a problem with drivers." Just finished looking at a LOT of KASHIF_HIVPR WUs. The situation is not improving at all and is not a driver issue. What happens is these WUs are downloaded and either fail or are aborted repeatedly until they happen to be assigned to a GTX 260 or above, then they complete. The problem is not fixed and is not improving. IMO it needs to be dealt with ASAP. Here are just a few examples: http://www.gpugrid.net/workunit.php?wuid=440561 http://www.gpugrid.net/workunit.php?wuid=442250 http://www.gpugrid.net/workunit.php?wuid=454479 http://www.gpugrid.net/workunit.php?wuid=449101 http://www.gpugrid.net/workunit.php?wuid=457871 http://www.gpugrid.net/workunit.php?wuid=458509 |
|
Joined: 18 Aug 08; Posts: 121; Credit: 59,836,411; RAC: 0
|
2009-05-12 15:08:32 GPUGRID Starting task 4-KASHIF_HIVPRFE_dim_ba1-2-4-RND6858_0 using acemd version 664. Well, it has now been crunching that WU for 18 h and is at 83%; it says 3 h 30 min remaining... POLISH NATIONAL TEAM - Join! Crunch! Win! |
GDF (Joined: 14 Mar 07; Posts: 1958; Credit: 629,356; RAC: 0) |
CUDA 2.2 libs will be distributed with the application, but you will need to upgrade the driver to the latest 185 version. gdf |
mike047 (Joined: 21 Dec 08; Posts: 47; Credit: 7,330,049; RAC: 0)
|
"CUDA 2.2 libs will be distributed with the application, but you will need to upgrade the driver to the latest 185 version." Are you saying that without 185-version drivers we will not be able to do GPUGRID work successfully? I have card/box combinations that will not accept that version and run properly. If a 185-version driver or above is "required" to crunch here, I will be taking my farm to FAH. mike |
|
Joined: 1 Feb 09; Posts: 139; Credit: 575,023; RAC: 0
|
The test unit ended also without problem and new ibuchs on the way. I haven't had any cancelled other than one that had been in the queue for almost 2 days, so nothing special there. I am still running 185.85 and BOINC 6.6.28. Except for the usual BOINC issues like fetch and such, it runs stable for me; my slow 9600 GT seems to do well. But I had to lower my CPU clock since I had to disable my watercooling; the 9850 BE is a hothead, because even with the huge cooler on it reaches 55 C. But it is extremely warm here today: I measured 31 C ambient temperature in the room. |
Beyond (Joined: 23 Nov 08; Posts: 1112; Credit: 6,162,416,256; RAC: 0)
|
"The test unit ended also without problem and new ibuchs on the way." Your computers are hidden, so how can we verify? |
Beyond (Joined: 23 Nov 08; Posts: 1112; Credit: 6,162,416,256; RAC: 0)
|
"Today the error rate for Kashif wus is lower, so it could have been a problem with drivers." Here's a new KASHIF_HIVPR that was just downloaded to me (and I aborted). Notice that it just caused an error on a GTX 260 (after running a long time, I might add). http://www.gpugrid.net/workunit.php?wuid=459189 That same GTX 260 has only 3 recent failed WUs, all of them KASHIF_HIVPR. Take a look for yourself: http://www.gpugrid.net/results.php?hostid=32169 It sure looks like the KASHIF_HIVPR problem also bites the faster cards, just not as often. Our team members have also been reporting the same problem on the GTX 260 and above. So it's documented. Any chance of getting this fixed? |
|
Joined: 18 Aug 08; Posts: 121; Credit: 59,836,411; RAC: 0
|
2009-05-12 15:08:32 GPUGRID Starting task 4-KASHIF_HIVPRFE_dim_ba1-2-4-RND6858_0 using acemd version 664. lol, after 24 h of crunching: 3600 points... POLISH NATIONAL TEAM - Join! Crunch! Win! |
©2025 Universitat Pompeu Fabra