6 Errors Today [Problems with "KASHIF_HIVPR" and "IBUCH_KID"-WUs]
| Author | Message |
|---|---|
|
Joined: 1 Feb 09 Posts: 139 Credit: 575,023 RAC: 0
|
I have had my fair share of those too, and have installed all the latest drivers (Win7 185.85, which includes CUDA 2.2) and BOINC 6.6.28 on this machine. To my surprise, BOINC now reports that my 9600 GT seems able to do only CUDA 1.0 instructions. So maybe the errors in these workunits are related to instructions that can only be performed by the newest 2x5 models, since none of them seem to have many errors on these units. Somehow I have had fewer problems with my machine since the latest drivers were installed; it runs rock solid (only BF2 and GameGuard games are an issue). BUT I'll remind you guys that everything I run is BETA, so problems can occur. That it runs almost without a problem on my machine is no guarantee it will on yours. I guess if you have a 2X5 card you will probably see a gain in processing speed if some of the CUDA 2.2 instructions can be and/or are implemented. |
Zydor Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
Some positives for comparison, as the KASHIFs are going bang with me. I've left the hardware/software setup alone so there is a fair comparison. GIANNIs seem to run fine. I am 7 hrs into a TONI_HIVPR, so touch wood it seems like it will go through; it should finish in about 5-6 hours. I have an IBUCH_HIVPR lined up as the next to go. Regards Zy |
|
Joined: 25 Sep 08 Posts: 111 Credit: 10,352,599 RAC: 0
|
I have aborted all KASHIF_HIVPR and IBUCH_KID WUs and will continue to do so. I have 5x 8800GS and 1x 8800GT; those WUs do not complete on my rigs and most of them hang. Yesterday I completed ONE WU instead of 9-11: 5K ppd instead of 50K ppd. I was on 6.6.17, with 3 rigs on 185.26 and 1 rig on 182.50. As of last night all rigs are on 6.6.28, with 3x 185.26 and 1x 185.85, which seems to have helped some. This is an "across the farm" thing for me now. Problems initially started on the dual-GPU rigs, but now it's across the board.... My rigs are not hidden. The Phunam-PC is a new setup; the initial errors are from setup, OCing, etc. I understand those. The new ones are part of this mess. Hoping it gets sorted out soon.... EDIT: I have kept 1 KASHIF_HIVPR that appears to be running OK on a single-GPU rig. However, at the first sign of trouble it's history..... |
Aardvark Joined: 27 Nov 08 Posts: 28 Credit: 82,362,324 RAC: 0
|
Likewise here: failures on KASHIF_HIVPR and IBUCH_KID on two different machines. One has 32-bit Vista, an 8800GT (O/C), client 6.6.20 and the 185.86 driver. The other has 64-bit Vista, a 9800 GX2 (not O/C), client 6.6.20 and the 182.50 driver. I have now updated both drivers to 185.85, which is the latest release. |
|
Joined: 7 Mar 09 Posts: 12 Credit: 1,254,285 RAC: 0
|
Indeed, a nice description of what happened here. I installed BOINC 6.5.0 and the WU picked up nicely where it had blocked ... Although it was a KASHIF WU, it was apparently the scheduler that was to blame .... |
|
Joined: 1 Feb 09 Posts: 139 Credit: 575,023 RAC: 0
|
Well, I again had a unit error out after 13 hours of work, and it looks like the big-gun machines run them all fine. I can't go on like this; I have lost hundreds of hours of time and money for nothing. For the time being I am also shutting down GPUGRID until this issue is solved. |
Zydor Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
I am aware that there is hard work going on to find the cause/fix. If someone could take time out for 2 mins to advise us all whether you still want the KASHIFs run by lower-end cards, I suspect it would help enormously: we could then abort them and leave them to the big guns, knowing it's not going to cause issues in the bug-finding, and carry on with the other WUs. At present it seems lots are shutting down from doing anything in the absence of any advice, understandably, but the other WUs seem OK. Just a gentle suggestion ... Regards Zy |
|
Joined: 25 Sep 08 Posts: 111 Credit: 10,352,599 RAC: 0
|
The last KASHIF_HIVPR did in fact error out..... No more for me. I'm just going to have to check my rigs 1-2x/day and send them back.... "I am aware that there is hard work going on to find the cause/fix. If someone could take time out for 2 mins to advise us all whether you still want the KASHIFs run by lower-end cards, I suspect it would help enormously: we could then abort them and leave them to the big guns, knowing it's not going to cause issues in the bug-finding, and carry on with the other WUs." My guess would be that they are still releasing them because they run on the higher-end cards, so they can still get the work completed. However, if that's the case, then I think the server needs to be set up to issue specific WUs to specific cards. The server gets plenty of info from our rigs, so I don't see why that can't be done.... |
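The suggestion above (issue specific WUs only to specific cards) could look roughly like the sketch below. This is only an illustration and not how the BOINC or GPUGRID server actually works: `RESTRICTED_SERIES`, `is_gtx_2xx` and `may_send` are hypothetical names, and it assumes "higher-end cards" means the GTX 2xx models as reported by the client.

```python
# Hypothetical sketch of per-card work assignment; none of these names
# come from the BOINC or GPUGRID server code.

# Series reported as problematic in this thread.
RESTRICTED_SERIES = ("KASHIF_HIVPR", "IBUCH_KID")


def is_gtx_2xx(gpu_name):
    """Assumption: a GPU name containing 'GTX 2' is a GTX 2xx card."""
    return "GTX 2" in gpu_name


def may_send(wu_name, gpu_name):
    """Return True if the work unit may be sent to this GPU.

    Restricted series go only to GTX 2xx cards; everything else
    is sent to any card.
    """
    restricted = any(series in wu_name for series in RESTRICTED_SERIES)
    return (not restricted) or is_gtx_2xx(gpu_name)


if __name__ == "__main__":
    wu = "88-KASHIF_HIVPR_dim_ba2-2-100-RND8763_0"
    for gpu in ("GeForce GTX 260", "GeForce 8800 GT"):
        print(gpu, "->", may_send(wu, gpu))
```

Matching on the device name string is the simplest possible rule; a real scheduler would more likely key on compute capability or chip family.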
mike047 Joined: 21 Dec 08 Posts: 47 Credit: 7,330,049 RAC: 0
|
Nothing will likely be done until sometime Monday. I am also at No New Work until the problem is resolved. mike |
GDF Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0 |
The real problem is that we do not understand why these WUs crash. There are several Kashif_XXX workunits and only a subset of them crash on some machines. We will stop the crashing WUs, as more testing did not really help. gdf |
Bymark Joined: 23 Feb 09 Posts: 30 Credit: 5,897,921 RAC: 0
|
I have a big problem with my new Asus 260: hostid=35303. I downgraded all drivers, and am now waiting to get more tasks. "reached daily quota of 4 results" heh ;) Any suggestions? SETI GPU work is running fine....... "Silakka" Hello from Turku > Åbo. |
|
Joined: 1 Feb 09 Posts: 139 Credit: 575,023 RAC: 0
|
Sadly yes, the famous units which we are discussing all over the forum. |
Zydor Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
"I have a big problem with my new asus 260:" The ones crashing on that machine are not the suspect WUs that they have now stopped issuing; those crashing on that machine usually run fine. He also has a 260, which is outside the problems; it's the lower cards that have had issues in the past. Something else lurketh. No idea what, personally; over to the gurus for that. Regards Zy |
Sandro Joined: 19 Aug 08 Posts: 22 Credit: 3,660,304 RAC: 0
|
"I am right to say that all the problems are related to older cards, like 8800, 9800 and so on?" Yes. My GTX 260 running under 64-bit Ubuntu also crashes WUs:

    <core_client_version>6.4.5</core_client_version>
    <![CDATA[
    <message>
    process got signal 11
    </message>
    <stderr_txt>
    # Using CUDA device 0
    # Device 0: "GeForce GTX 260"
    # Clock rate: 1242000 kilohertz
    # Total amount of global memory: 938803200 bytes
    # Number of multiprocessors: 27
    # Number of cores: 216
    # Amber: readparm : Reading parm file parameters
    # PARM file in AMBER 7 format
    # Encounter 10-12 H-bond term
    WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
    WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
    MDIO ERROR: cannot open file "restart.coor"
    </stderr_txt>
    ]]>
    exit status: 11 (0xb)

    <core_client_version>6.4.5</core_client_version>
    <![CDATA[
    <message>
    process got signal 11
    </message>
    <stderr_txt>
    # Using CUDA device 0
    # Device 0: "GeForce GTX 260"
    # Clock rate: 1242000 kilohertz
    # Total amount of global memory: 938803200 bytes
    # Number of multiprocessors: 27
    # Number of cores: 216
    MDIO ERROR: cannot open file "restart.coor"
    </stderr_txt>
    ]]> |
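For anyone comparing many failed results, a small script can pull the signal number, the device name and the error lines out of a result's output like the one above. This is only a sketch: the `summarize` function and its regular expressions are my own assumptions about the layout shown in this post, not a BOINC tool, and the sample is a shortened copy of the output above.

```python
import re

# Shortened copy of the output quoted in the post above.
SAMPLE = """<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce GTX 260"
# Number of multiprocessors: 27
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
</stderr_txt>
]]>"""


def summarize(report):
    """Extract the signal number, device name and ERROR lines."""
    signal_match = re.search(r"process got signal (\d+)", report)
    device_match = re.search(r'# Device \d+: "([^"]+)"', report)
    # Any whole line containing "ERROR:" counts as an error line.
    errors = re.findall(r"^.*ERROR:.*$", report, flags=re.MULTILINE)
    return {
        "signal": int(signal_match.group(1)) if signal_match else None,
        "device": device_match.group(1) if device_match else None,
        "errors": errors,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Running this over several saved results would make it easier to see whether the same signal and the same error line show up on every affected card.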
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
Let's gather some of that information:

- all failures reported here affect G92 and G9x-class chips
- G200 usually runs them just fine
- there are some errors with G200 as well, but this could just be the normal error rate
- Paul's G92 runs fine (and hopefully others) -> it's a bug which is triggered by a special client configuration
- BOINC 6.6.x, 6.5.0 and 6.4.7 are definitely affected -> the version likely doesn't matter
- drivers 185.8x, 185.6x and 182.50 are reported to be affected, but 182.50 for XP32 works for Paul -> did anyone try older drivers? E.g. 182.08, which has a very solid track record
- Paul's card has 1 GB of memory, whereas most G92 cards have 512 MB or less

Do we have any other reports of G9x cards which run these tasks fine? Could anyone check the memory consumption of these WUs with RivaTuner?

EDIT: only certain WUs of the "IBUCH_KID" and "KASHIF_HIVPR" series are affected. Do we know which ones? Are the ones which work for Paul's card, by pure coincidence, all of the type which works? For example, my 9800GTX+ 512MB on Vista 64, 185.66 and 6.5.0 finished:

* 88-KASHIF_HIVPR_dim_ba2-2-100-RND8763_0
* 7-KASHIF_HIVPR_mon_ba5-6-100-RND3602_1
* 57-KASHIF_HIVPR_mon_ba4-4-100-RND1833_1
* 79-KASHIF_HIVPR_n1_for_ba1-4-100-RND9984_0
* 175-IBUCH_KID_shao_ba1-1-100-RND4198_2
* 93-IBUCH_KID_shao_ba2-0-100-RND9546_1
Scanning for our furry friends since Jan 2002 |
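To help answer "do we know which ones?", the WU names posted above can be grouped by series and sub-type. The sketch below is a guess at the naming scheme based only on the names in this thread (a leading number, the series, a variant, two counters, an RND number and a trailing digit); `WU_PATTERN`, `variant_of` and `tally` are hypothetical helpers, and the meaning of each field is an assumption.

```python
import re
from collections import Counter

# Assumed layout: <index>-<series>_<variant>-<step>-<total>-RND<n>_<suffix>
WU_PATTERN = re.compile(
    r"^(?P<index>\d+)-(?P<series>KASHIF_HIVPR|IBUCH_KID)_(?P<variant>.+)"
    r"-(?P<step>\d+)-(?P<total>\d+)-RND(?P<rnd>\d+)_(?P<suffix>\d+)$"
)

# The six work units reported as finished in the post above.
FINISHED = [
    "88-KASHIF_HIVPR_dim_ba2-2-100-RND8763_0",
    "7-KASHIF_HIVPR_mon_ba5-6-100-RND3602_1",
    "57-KASHIF_HIVPR_mon_ba4-4-100-RND1833_1",
    "79-KASHIF_HIVPR_n1_for_ba1-4-100-RND9984_0",
    "175-IBUCH_KID_shao_ba1-1-100-RND4198_2",
    "93-IBUCH_KID_shao_ba2-0-100-RND9546_1",
]


def variant_of(name):
    """Return (series, variant) for a WU name, or None if it does not match."""
    match = WU_PATTERN.match(name)
    if match is None:
        return None
    return (match["series"], match["variant"])


def tally(names):
    """Count how many of the given WU names fall into each (series, variant)."""
    return Counter(v for v in map(variant_of, names) if v is not None)


if __name__ == "__main__":
    for key, count in sorted(tally(FINISHED).items()):
        print(key, count)
```

Feeding in one list of finished names and one list of failed names, then comparing the two tallies, would show whether the failures cluster in particular variants.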
mike047 Joined: 21 Dec 08 Posts: 47 Credit: 7,330,049 RAC: 0
|
I am on 6.4.5 and use either 177.82 or 180.22 on Ubuntu 64. I have had many failures on all cards except my 260s [192/216]. mike |
|
Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 |
"Let's gather some of that information:" I have 4 machines with GTS250s (512 MB). They are running under XP32 with 182.50 drivers and seem fine. I have an i7 with dual GTX260s. It is running under XP32 with 182.50 drivers and also seems fine. I had problems a week ago with 185.xx (beta) drivers and uninstalled them before reinstalling the 182.50 drivers. The problems seemed to go away after that. All machines are currently running BOINC 6.6.28. I had one IBUCH_KID WU, which I aborted after seeing the post from GDF regarding them being in error. KASHIF_HIVPR seem fine. BOINC blog |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
Oh, so it also affects Linux. Maybe there's not much point searching for Windows and driver versions then. "I had one IBUCH_KID wu, which I aborted after seeing post from GDF regarding them being in error. KASHIF_HIVPR seem fine." Some WUs of both series are affected, but not on G200-based cards (GTX 2xx). MrS Scanning for our furry friends since Jan 2002 |
Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0
|
Well, I just had a crash on the i7 with 67-KASHIF_HIVPR_n1_for_ba3-2-100-RND8737; this is a task that died at least twice before. The thing is, I was playing a game at the time, a low-intensity turn-based strategy game, but I cannot say if that had any effect. The game seemed to die and the graphics driver crashed. That said, the other tasks in progress seemed to stay OK ... More interesting is that there were three different errors ... Of course, the task was run on three different classes of card. And I am running BOINC 6.6.28 on that machine ... still 182.50 drivers though. |
Zydor Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
|
I have been having a closer look at my errors, and a few from others. This bears some checking, but it appears on the face of it that the crashed ones do have a common element: "signal 11". The "h-bond" message is a red herring here, as it refers to the "Amber" processes (is that right?). No matter the detail, it was cleared up in another thread as a non-issue: just a text message about the internal processes in the WU, not its validity as a successful WU. "Signal 11" does appear virtually every time in the ones I looked at. I am aware signal 11 is an issue way down in the Communication Layer, which in itself rings a bell considering the way the current problems affect some cards and not others, and some operating systems and not others, but I have no idea where to take that logic further, or even whether it has validity; I don't have that level of knowledge. I am aware signal 11 can appear for many, many reasons, and it can be difficult to work out what the reason is, but if that's the case this time, at least it's the start down the right road. Regards Zy |
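As a side note on "signal 11": on Linux (the "process got signal 11" report earlier in the thread came from 64-bit Ubuntu), signal number 11 is SIGSEGV, a segmentation fault, meaning the process touched memory it was not allowed to access. That fits the point that it can have many different causes. A minimal way to look up a signal number, assuming a Python 3 interpreter is at hand (`describe_signal` is just an illustrative helper):

```python
import signal


def describe_signal(number):
    """Return the symbolic name for a signal number, or 'unknown'."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number does not correspond to a signal on this platform.
        return "unknown"


if __name__ == "__main__":
    print(11, "->", describe_signal(11))
```

This only names the signal; it says nothing about which part of the application or driver triggered it.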
©2025 Universitat Pompeu Fabra