6 Errors Today [Problems with "KASHIF_HIVPR" and "IBUCH_KID"-WUs]

Message boards : Graphics cards (GPUs) : 6 Errors Today [Problems with "KASHIF_HIVPR" and "IBUCH_KID"-WUs]

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 9481 - Posted: 8 May 2009, 16:31:50 UTC
Last modified: 8 May 2009, 16:38:38 UTC

I have had my fair share of those too. I installed all the latest drivers (Win7 185.85, which includes CUDA 2.2) on this machine, along with BOINC 6.6.28.
To my surprise, I now see in BOINC that my 9600 GT seems to be able to do only CUDA 1.0 instructions.
So maybe the errors on these workunits are related to instructions that can only be performed by the newest 2x5 models,
since none of them seem to have many errors on these units.
But somehow I have had fewer problems with my machine since the latest drivers were installed; it runs kind of rock solid (only BF2 and GameGuard games are an issue).
BUT I'll remind you guys that everything I run is BETA, so problems can occur.
That it runs almost without a problem on my machine is no guarantee it will on yours.

I guess if you have a 2X5 card you will probably see a gain in processing speed if some of the CUDA 2.2 instructions can be and/or are implemented.
ID: 9481

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9482 - Posted: 8 May 2009, 17:12:29 UTC - in response to Message 9481.  

Some positives for comparison, as the KASHIFs are going bang with me. I've left the hardware/software setup alone so there is a fair comparison.

GIANNIs seem to run fine. I am 7 hrs into a TONI_HIVPR, so touch wood it looks like it will go through; it should finish in about 5-6 hours. I have an IBUCH_HIVPR lined up as the next to go.

Regards
Zy
ID: 9482

naja002
Joined: 25 Sep 08
Posts: 111
Credit: 10,352,599
RAC: 0
Message 9484 - Posted: 8 May 2009, 19:47:18 UTC
Last modified: 8 May 2009, 20:00:00 UTC

I have aborted all:

KASHIF_HIVPR
and
IBUCH_KID

and will now continue to do so.

I have 5x 8800GS and 1x 8800GT. Those WUs do not complete on my rigs and most of them hang. Yesterday I completed ONE WU instead of 9-11: 5K ppd instead of 50K ppd.

Was on 6.6.17, 3 rigs 185.26, 1 rig 182.50

As of last night all rigs are: 6.6.28 and 3x 185.26, 1x 185.85--seems to have helped some.

This is an "across the farm" thing for me now. Problems initially started on the dual gpu rigs, but now it's across the board....

My rigs are not hidden. The Phunam-PC is a new setup; the initial errors are from setup, OCing, etc. I understand those. The new ones are part of this mess.

Hoping it gets sorted out soon....

EDIT: I have kept 1 KASHIF_HIVPR that appears to be running OK on a single-GPU rig. However, at the first sign of trouble it's history.....
ID: 9484

Aardvark
Joined: 27 Nov 08
Posts: 28
Credit: 82,362,324
RAC: 0
Message 9488 - Posted: 8 May 2009, 21:46:09 UTC
Last modified: 8 May 2009, 21:54:19 UTC

Likewise here, failures on

KASHIF_HIVPR
and
IBUCH_KID

Two different machines. One with 32 bit Vista, 8800GT (O/C), client 6.6.20 & 185.86 driver. The other with 64 bit Vista, 9800 GX2 (not O/C), client 6.6.20 & 182.50 driver. I have now updated both drivers to 185.85, which is the latest release.
ID: 9488

SkyeHunter
Joined: 7 Mar 09
Posts: 12
Credit: 1,254,285
RAC: 0
Message 9492 - Posted: 8 May 2009, 22:58:30 UTC - in response to Message 9471.  



Ahh the "never ending wu" bug. What version of BOINC are you running? It seems to have been fixed in 6.6.23 onwards.


Indeed, a nice description of what happened here. I installed BOINC 6.5.0 and the WU picked up nicely where it had blocked ...

Although it was a KASHIF WU, it apparently was the scheduler that was to blame ....
ID: 9492

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 9493 - Posted: 8 May 2009, 23:31:56 UTC

Well, I again had a unit error out after 13 hours of work, and it looks like the big gun machines run them all fine.
I can't go on like this; I have lost hundreds of hours of time and money for nothing.
For the time being I am also shutting down GPUGRID until this issue is solved.
ID: 9493

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9495 - Posted: 8 May 2009, 23:50:11 UTC - in response to Message 9461.  

I am aware that there is hard work going on re finding the cause/fix. If it's possible for someone to take a timeout for 2 mins to advise us all whether you still want the KASHIFs run by lower based cards, I suspect it would help enormously, as we could then abort them and leave them to the big guns, knowing it's not going to cause issues in the bug-finding, and carry on with the other WUs.

At present it seems lots are shutting down from doing anything in the absence of any advice, understandably, but the other WUs seem OK.

Just a gentle suggestion ...

Regards
Zy
ID: 9495

naja002
Joined: 25 Sep 08
Posts: 111
Credit: 10,352,599
RAC: 0
Message 9499 - Posted: 9 May 2009, 4:27:01 UTC
Last modified: 9 May 2009, 4:51:52 UTC

The last KASHIF_HIVPR did in fact error out.....No more for me. I'm just going to have to check my rigs 1-2x/day and send them back....


I am aware that there is hard work going on re finding the cause/fix. If it's possible for someone to take a timeout for 2 mins to advise us all whether you still want the KASHIFs run by lower based cards, I suspect it would help enormously, as we could then abort them and leave them to the big guns, knowing it's not going to cause issues in the bug-finding, and carry on with the other WUs.

At present it seems lots are shutting down from doing anything in the absence of any advice, understandably, but the other WUs seem OK.

Just a gentle suggestion ...

Regards
Zy



My guess would be that they are still releasing them because they run on the higher end cards. They can still get the work completed. However, if that's the case, then I think the server needs to be set up to issue specific WUs to specific cards. The server gets plenty of info from our rigs, so I don't see why that can't be done....
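
As a rough illustration of the "specific WUs to specific cards" idea, here is a minimal sketch of such a filter. Everything in it is hypothetical: the `Host` record, the `may_send` function, and the two sets are made up for illustration and are not part of the BOINC scheduler. The GPU family labels ("G92", "G200") are also an assumption about what the server could derive from the card name a host reports.

```python
from dataclasses import dataclass

# Hypothetical: WU series that this thread reports as failing on some cards.
RESTRICTED_SERIES = {"KASHIF_HIVPR", "IBUCH_KID"}

# Hypothetical: GPU families assumed to run the restricted series reliably.
TRUSTED_FAMILIES = {"G200"}


@dataclass
class Host:
    """Illustrative host record; not a real BOINC data structure."""
    name: str
    gpu_family: str


def may_send(series: str, host: Host) -> bool:
    """Return True if a WU of this series may be issued to this host."""
    if series not in RESTRICTED_SERIES:
        # Unrestricted series go to every host.
        return True
    # Restricted series only go to GPU families believed to handle them.
    return host.gpu_family in TRUSTED_FAMILIES


if __name__ == "__main__":
    older = Host("rig-a", "G92")
    newer = Host("rig-b", "G200")
    for series in ("KASHIF_HIVPR", "GIANNI"):
        for host in (older, newer):
            print(series, host.name, may_send(series, host))
```

With a filter like this, the older cards would keep receiving the series that run fine while the suspect series go only to the cards that complete them.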
ID: 9499

mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Message 9504 - Posted: 9 May 2009, 7:23:36 UTC

Nothing will likely be done until sometime Monday. I am also at No New Work until the problem is resolved.
mike
ID: 9504

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1958
Credit: 629,356
RAC: 0
Message 9506 - Posted: 9 May 2009, 8:34:26 UTC - in response to Message 9504.  

The real problem is that we do not understand why these WUs crash. There are several Kashif_XXX workunits, and only some of them crash, and only on some machines.

We will stop the crashing WUs as more testing did not really help.

gdf
ID: 9506

Bymark
Joined: 23 Feb 09
Posts: 30
Credit: 5,897,921
RAC: 0
Message 9515 - Posted: 9 May 2009, 10:34:36 UTC
Last modified: 9 May 2009, 10:40:31 UTC

I have a big problem with my new Asus 260:

hostid=35303

I downgraded all drivers, and am now waiting to get more tasks.
"reached daily quota of 4 results" heh ;),
Any suggestions? SETI GPUs are working fine.......
"Silakka"
Hello from Turku > Åbo.
ID: 9515

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 9521 - Posted: 9 May 2009, 10:57:09 UTC
Last modified: 9 May 2009, 11:06:44 UTC

Sadly yes, the famous units which we are discussing all over the forum.
ID: 9521

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9525 - Posted: 9 May 2009, 11:28:04 UTC - in response to Message 9515.  
Last modified: 9 May 2009, 11:30:34 UTC

I have a big problem with my new Asus 260:

hostid=35303

I downgraded all drivers, and am now waiting to get more tasks.
"reached daily quota of 4 results" heh ;),
Any suggestions? SETI GPUs are working fine.......


The ones crashing on that machine are not the suspect WUs that they have now stopped issuing; those crashing on that machine usually run fine. He also has a 260, which is outside the problems; it's the lower cards that have had issues in the past. Something else lurketh. No idea what, personally; over to the Gurus for that.

Regards
Zy
ID: 9525

Sandro
Joined: 19 Aug 08
Posts: 22
Credit: 3,660,304
RAC: 0
Message 9528 - Posted: 9 May 2009, 11:59:27 UTC - in response to Message 9464.  

Am I right to say that all the problems are related to older cards, like 8800, 9800 and so on?
Did anyone experience repeated failures on those workunits with a 260, 275, 295 or 285?

gdf

Yes. My GTX 260 running under 64-bit Ubuntu also crashes WUs.

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce GTX 260"
# Clock rate: 1242000 kilohertz
# Total amount of global memory: 938803200 bytes
# Number of multiprocessors: 27
# Number of cores: 216
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"

</stderr_txt>
]]>


exit status: 11 (0xb)
<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce GTX 260"
# Clock rate: 1242000 kilohertz
# Total amount of global memory: 938803200 bytes
# Number of multiprocessors: 27
# Number of cores: 216
MDIO ERROR: cannot open file "restart.coor"

</stderr_txt>
]]>
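
For anyone comparing many of these result pages, here is a minimal sketch of pulling the key fields out of a report like the ones above. The `summarize` function and the `SAMPLE` text are illustrative only: the sample is a shortened copy of the first report, and the regular expressions assume the report always uses exactly the wording shown there.

```python
import re

# Shortened copy of the first report above, used only as sample input.
SAMPLE = """<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce GTX 260"
# Total amount of global memory: 938803200 bytes
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"

</stderr_txt>
]]>"""


def summarize(report: str) -> dict:
    """Extract client version, signal, GPU name, memory and error lines."""

    def first(pattern: str):
        match = re.search(pattern, report)
        return match.group(1) if match else None

    signal_number = first(r"process got signal (\d+)")
    memory_bytes = first(r"global memory: (\d+) bytes")
    return {
        "client": first(r"<core_client_version>([^<]+)</core_client_version>"),
        "signal": int(signal_number) if signal_number else None,
        "device": first(r'# Device \d+: "([^"]+)"'),
        "memory_mib": round(int(memory_bytes) / 2**20) if memory_bytes else None,
        # Only lines that contain "ERROR:" count; WARNING lines are skipped.
        "errors": re.findall(r"^(?:MDIO )?ERROR: (.+)$", report, re.MULTILINE),
    }


if __name__ == "__main__":
    for field, value in summarize(SAMPLE).items():
        print(f"{field}: {value}")
```

Collecting these summaries per host would make it easier to see whether the failures share a signal number, a card model, or a memory size.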
ID: 9528

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 9530 - Posted: 9 May 2009, 12:22:08 UTC
Last modified: 9 May 2009, 12:33:49 UTC

Let's gather some of that information:

- all failures reported here affect G92 and G9x-class chips
- G200 usually runs them just fine
- there are some errors with G200 as well, but this could just be the normal error rate
- Paul's G92 runs fine (and hopefully others)

-> it's a bug which is triggered by a special client configuration

- BOINC 6.6.x, 6.5.0 and 6.4.7 are definitely affected -> the version likely doesn't matter
- drivers 185.8x, 185.6x and 182.50 are reported to be affected, but 182.50 for XP32 works for Paul

-> did anyone try older drivers? E.g. 182.08, which has a very solid track record

- Paul's card has 1 GB of memory, whereas most G92 cards have 512 MB or less

Do we have any other reports of G9x cards which run these tasks fine? Could anyone check the memory consumption of these WUs with RivaTuner?

EDIT: only certain WUs of the "IBUCH_KID" and "KASHIF_HIVPR" series are affected. Do we know which ones? Are the ones which work on Paul's card, by pure coincidence, all of the type which works?

For example my 9800GTX+ 512MB on Vista 64, 185.66 and 6.5.0 finished:

    * 88-KASHIF_HIVPR_dim_ba2-2-100-RND8763_0
    * 7-KASHIF_HIVPR_mon_ba5-6-100-RND3602_1
    * 57-KASHIF_HIVPR_mon_ba4-4-100-RND1833_1

and failed:

    * 79-KASHIF_HIVPR_n1_for_ba1-4-100-RND9984_0
    * 175-IBUCH_KID_shao_ba1-1-100-RND4198_2
    * 93-IBUCH_KID_shao_ba2-0-100-RND9546_1
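
One way to approach the "do we know which ones?" question is to group the task names listed above by series and variant and count finished versus failed. The sketch below does that with the six names from this post. How the name is split is an assumption: it treats the first two upper-case words as the series, the text before the numeric fields as the variant, and a trailing `_baN` as a batch suffix that can be dropped.

```python
import re
from collections import Counter

# Assumed layout: <number>-<SERIES>_<variant>-<n>-<n>-RND<n>_<n>
NAME_RE = re.compile(r"^\d+-([A-Z]+_[A-Z]+)_(.+?)-\d+-\d+-RND\d+_\d+$")

# Task names copied from the post above.
FINISHED = [
    "88-KASHIF_HIVPR_dim_ba2-2-100-RND8763_0",
    "7-KASHIF_HIVPR_mon_ba5-6-100-RND3602_1",
    "57-KASHIF_HIVPR_mon_ba4-4-100-RND1833_1",
]
FAILED = [
    "79-KASHIF_HIVPR_n1_for_ba1-4-100-RND9984_0",
    "175-IBUCH_KID_shao_ba1-1-100-RND4198_2",
    "93-IBUCH_KID_shao_ba2-0-100-RND9546_1",
]


def group_key(task_name: str) -> tuple:
    """Return (series, variant) with the assumed '_baN' suffix removed."""
    match = NAME_RE.match(task_name)
    if match is None:
        raise ValueError(f"unexpected task name: {task_name}")
    series, variant = match.groups()
    return series, re.sub(r"_ba\d+$", "", variant)


def tally(task_names) -> Counter:
    """Count tasks per (series, variant) group."""
    return Counter(group_key(name) for name in task_names)


if __name__ == "__main__":
    finished, failed = tally(FINISHED), tally(FAILED)
    for key in sorted(set(finished) | set(failed)):
        print(key, "finished:", finished[key], "failed:", failed[key])
```

Six tasks from one card are far too few to conclude anything, but the same tally over many hosts would show whether the crashes cluster in particular variants.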



MrS


Scanning for our furry friends since Jan 2002
ID: 9530

mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Message 9534 - Posted: 9 May 2009, 13:09:50 UTC

I am on 6.4.5 and use either 177.82 or 180.22 on Ubuntu 64.

I have had many failures on all cards except my 260s [192/216].
mike
ID: 9534

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 9538 - Posted: 9 May 2009, 13:39:04 UTC - in response to Message 9530.  

I have 4 machines with GTS250s (512 MB). They are running under XP32 with 182.50 drivers and seem fine.

I have an i7 with dual GTX260s. It is running under XP32 with 182.50 drivers and also seems fine. I had problems a week ago with 185.xx (beta) drivers and uninstalled them before reinstalling the 182.50 drivers. The problems seemed to go away after that.

All machines are currently running BOINC 6.6.28.

I had one IBUCH_KID WU, which I aborted after seeing the post from GDF regarding them being in error. KASHIF_HIVPR seem fine.
BOINC blog
ID: 9538

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 9540 - Posted: 9 May 2009, 13:45:59 UTC - in response to Message 9538.  

Oh, so it also affects Linux. Maybe there's not much point searching for Windows and driver versions then.

I had one IBUCH_KID WU, which I aborted after seeing the post from GDF regarding them being in error. KASHIF_HIVPR seem fine.


Some WUs of both series are affected, but not on G200 based cards (GTX 2xx).

MrS
Scanning for our furry friends since Jan 2002
ID: 9540

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9545 - Posted: 9 May 2009, 14:15:43 UTC

Well, I just had a crash on the i7: 67-KASHIF_HIVPR_n1_for_ba3-2-100-RND8737. This is a task that died at least twice before.

The thing is, I was playing a game at the time, a low-intensity turn-based strategy game. But I cannot say if that had any effect. The game seemed to die and the graphics driver crashed. That said, the other tasks in progress seemed to stay OK ...

More interesting is that there were three different errors ...

Of course, the task was run on three different class cards.

And I am running BOINC 6.6.28 on that machine ... still 182.50 drivers though.
ID: 9545

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9549 - Posted: 9 May 2009, 14:32:07 UTC - in response to Message 9545.  
Last modified: 9 May 2009, 14:34:33 UTC

I have been having a closer look at my errors, and a few from others. This bears some checking, but it appears on the face of it that the crashed ones do have a common element: "signal 11". The "h-bond" message is a red herring here, as it refers to the "Amber" processes (is that right?). No matter the detail, it was cleared up in another thread as a non-issue: just a text message about the internal processes in the WU, not about its validity as a successful WU.

"Signal 11" does appear virtually every time in the ones I looked at. I am aware signal 11 is an issue way down in the Communication Layer, which in itself rings a bell considering the way the current problems affect some cards and not others, and some operating systems and not others. But I have no idea where to take that logic further, or even if it has validity; I don't have that level of knowledge. I am aware signal 11 can appear for many, many reasons, and it can be difficult to work out what the reason is, but if it's the case this time, at least it's the start down the right road.
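
For reference, on Linux and other POSIX systems signal 11 is SIGSEGV (a segmentation fault, meaning the process made an invalid memory access) rather than a communication-layer error. A quick way to check a signal number is Python's standard `signal` module; the `describe_signal` helper below is just an illustrative wrapper, and the names it returns depend on the platform it runs on.

```python
import signal


def describe_signal(number: int) -> str:
    """Return the platform's symbolic name for a signal number."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a named signal on this platform.
        return f"unknown signal {number}"


if __name__ == "__main__":
    print(describe_signal(11))  # SIGSEGV on Linux
    print(hex(11))              # matches the "(0xb)" in "exit status: 11 (0xb)"
```

A segmentation fault can still have many root causes (a bug in the application, bad input data, a driver fault, unstable hardware), so this narrows the symptom without identifying the reason.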

Regards
Zy
ID: 9549

©2025 Universitat Pompeu Fabra