SANTI Errors

Message boards : Number crunching : SANTI Errors
tomba
Joined: 21 Feb 09
Posts: 497
Credit: 700,690,702
RAC: 0
Message 34297 - Posted: 14 Dec 2013, 12:41:33 UTC

My last five WUs were SANTIs. Four gave errors. I wasted 25 hours of electricity.

Until someone fixes this, I will immediately abort any SANTI I get. Sorry...

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 34302 - Posted: 14 Dec 2013, 13:38:09 UTC - in response to Message 34297.  

I can "only" see 3 failed WUs in your account, with 1 of them also failing for others. And lots of successful ones, including SANTIs. From this data I'm not convinced something's fundamentally broken here. It could be as simple as a machine needing a cold boot.

MrS
Scanning for our furry friends since Jan 2002
tomba
Joined: 21 Feb 09
Posts: 497
Credit: 700,690,702
RAC: 0
Message 34304 - Posted: 14 Dec 2013, 14:08:31 UTC - in response to Message 34302.  

I can "only" see 3 failed WUs in your account, with 1 of them also failing for others.

Sorry - I added in the first (active) WU I aborted. But I still wasted a day's electricity!

Could be as simple as a machine needing a cold-boot.

I'll give that a try and not abort any more.

Thanks for posting.
John
Joined: 15 Oct 11
Posts: 17
Credit: 81,085,378
RAC: 0
Message 34330 - Posted: 15 Dec 2013, 16:06:51 UTC - in response to Message 34304.  

I have had 5 SANTIs fail in the last couple of days, 8 in total if I go back 4-5 days.
I have shut down the computer completely and restarted twice now.
Yes, 1 or 2 have completed, but the failure rate is unacceptable.
I have changed nothing in the system setup, so it's starting to look like these SANTIs are faulty... and yes, it is a waste of electricity...
dskagcommunity
Joined: 28 Apr 11
Posts: 463
Credit: 958,266,958
RAC: 34
Message 34331 - Posted: 15 Dec 2013, 17:02:00 UTC
Last modified: 15 Dec 2013, 17:02:32 UTC

Hm, I didn't want to write up my single SANTI failure, but as I see in this thread... On a 1 GB 560 Ti (384 cores) I had a SANTI long WU fail too. I immediately switched it back to short, because after POEM GPU stopped I need every credit I can get to "hold" the overall RAC. But I still have the 310.70 drivers on it.
DSKAG Austria Research Team: http://www.research.dskag.at
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 34334 - Posted: 15 Dec 2013, 21:20:00 UTC

I looked into the driver versions, but there's great variation and since you guys are able to process some of these WUs I couldn't expect to find anything conclusive there. However, the track record of 331.82 and 327.23 has been quite good - maybe try these?

MrS
Scanning for our furry friends since Jan 2002
Coleslaw
Joined: 24 Jul 08
Posts: 36
Credit: 363,857,679
RAC: 0
Message 34338 - Posted: 16 Dec 2013, 3:36:56 UTC

I had to set my third computer so far to No New Work because of the SANTI work units causing BSODs. So far the affected systems have been on Windows 7 Premium and Professional x64. The cards were a GT 430, a 650 Ti, and a 660 Ti. From what I could see, the BSODs come after the drivers crash many times in a very short period of time. These devices have run GPUGrid for a while and still run solid on other GPU projects. The BSODs only happen when running GPUGrid SANTI work.

Driver versions on these machines:
331.82
320.49
331.82
Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 34341 - Posted: 16 Dec 2013, 8:45:38 UTC - in response to Message 34338.  

I had to set my third computer so far to No New Work because of the SANTI work units causing BSODs. So far the systems I have had them on Windows 7 premium and professional x64. The cards were GT430, 650Ti, and 660Ti. From what I could see, it has been caused after the drivers crashing many times in a very short period of time..

Just had the same thing happen with a SANTI_bax2 WU on one machine. It BSODed every time BOINC started the SANTI WU. It even caused disk corruption once. 650 Ti GPU. If I suspended the WU, the machine ran fine. I finally aborted the WU and downloaded another SANTI_bax2, which is so far running OK. Before that WU the box in question had run for a VERY long time without a single crash. I've run 18 other SANTI_bax2 WUs without an issue. Wonder if perhaps some WUs got released with bad parameters? Maybe a corrupted download?
Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 34342 - Posted: 16 Dec 2013, 10:24:40 UTC

Hm, I don't know what could be causing this, as it doesn't seem to be something systematic. Santi_bax2 WUs only have a 6% error rate, which I would say is nearly a historical low for this project.
tomba
Joined: 21 Feb 09
Posts: 497
Credit: 700,690,702
RAC: 0
Message 34346 - Posted: 16 Dec 2013, 12:18:11 UTC

Another SANTI just wasted five hours of electric; here.

"The simulation has become unstable. Terminating to avoid lock-up".
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 34347 - Posted: 16 Dec 2013, 12:42:14 UTC

Different WU types need different amounts of (electrical) GPU power at a given GPU frequency. This kind of error can occur when processing the WU tricks the GPU's power scheme into supplying a slightly lower voltage than the GPU needs (or a slightly higher frequency than it can run at). It can be fixed by either lowering the GPU frequency or raising the voltage. Sometimes that's not easy to do on a Kepler (e.g. with MSI Afterburner). I had to use the Kepler BIOS Tweaker utility to permanently fix this kind of error on my overclocked ASUS GTX 670DC2OC. This is a very useful tool. If you put nvflash in its working directory, it can flash the modified BIOS directly to the card.
tomba
Joined: 21 Feb 09
Posts: 497
Credit: 700,690,702
RAC: 0
Message 34354 - Posted: 17 Dec 2013, 15:30:53 UTC

Another Santi errored out today. That's five in five days out of a total of 10.

That's 36 hours of wasted electricity.

No comment here from the scientist...
Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 34355 - Posted: 17 Dec 2013, 16:02:37 UTC - in response to Message 34354.  

Sorry tomba, but there is not really anything Santi can help you with. These WUs are a continuation of previous "bax" WUs which were simulated successfully. Also, as I mentioned, the error rate is around 6%, which is really very low. I asked him if there is anything fancy about the system, but it's apparently not very large and doesn't use any weird, barely-tested functionality, so there is really nothing we can do about it.

About the system, it's a protein that is responsible for the activation of apoptosis, a process that controls cell death and we are looking for a specific conformation of this protein.

The only one who could help would be Matt, from a technical standpoint, but I don't know if he really has that much time right now. If you want, you can message him (username MJH).
I would suggest maybe switching to the short queue for a while. Or do what Zoltan suggested about frequencies and voltages (I have no clue about that though; the forum members can help you).
Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 34356 - Posted: 17 Dec 2013, 16:17:05 UTC - in response to Message 34341.  
Last modified: 17 Dec 2013, 17:09:57 UTC

Just had the same thing happen with a SANTI_bax2 WU on one machine. It BSODed everytime BOINC started the SANTI WU. Even caused disk corruption once. 650ti GPU. If I suspended the WU the machine ran fine. Finally aborted the WU and DLed another SANTI_bax2 which is so far running OK. Before that WU the box in question had run for a VERY long time without a single crash. I've run 18 other SANTI_bax2 WUs without an issue. Wonder if perhaps some WUs got released with bad parameters? Maybe a corrupted DL?


Hummm. Now I am beginning to wonder. I just had a similar situation (I think), where I was copying a 5 GB video file from one drive to another, and it kept BSODing the machine, which I have never seen it do before. Since it would copy fine to another drive, I put it down to a controller/disk drive compatibility problem, since the drive with problems was on a Marvell controller, not the main Intel controller.

But it just so happens I was running a Santi_bax2 at the time, and noticed that it was taking a very long time to complete, and even increasing in estimated time left after 16 hours (only 26% complete), so I aborted it. But that card (a GTX 650 Ti 1 GB) has been very stable otherwise with all the other work units, including a couple of Santi_bax2 types. That work unit may be bad, but it has not finished yet on another machine, so I don't know. I had assumed that the drive problem had corrupted the Santi_bax2, but it could be the other way around.

EDIT: Actually, it started out on the GTX 650 Ti, but had switched over to a GTX 660 by the time I ended it. So it seems not to be a memory limitation, since it was running slowly even with 2 GB, unless they have gotten worse than that.
http://www.gpugrid.net/result.php?resultid=7557083
(The restarts are due to the BSODs.)
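For a sense of scale on the "16 hours, only 26% complete" figure, here is a rough sketch of a straight-line projection. It assumes progress is linear in time, which a task interrupted by BSOD restarts may well not be, and `projected_total_hours` is a hypothetical helper, not part of BOINC:

```python
def projected_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Linear projection of total run time from elapsed time and progress."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


# Figures from the post: 16 hours elapsed at 26% complete.
elapsed = 16.0
total = projected_total_hours(elapsed, 0.26)
print(f"projected total: {total:.1f} h, remaining: {total - elapsed:.1f} h")
```

Under that assumption the task would have needed roughly two and a half days in total, far beyond a normal run on this kind of card.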
tomba
Joined: 21 Feb 09
Posts: 497
Credit: 700,690,702
RAC: 0
Message 34357 - Posted: 17 Dec 2013, 18:12:33 UTC - in response to Message 34355.  

Also as I mentioned the error rate is around 6%

For me, right now, it's 50%...
Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 34358 - Posted: 17 Dec 2013, 18:48:46 UTC - in response to Message 34357.  
Last modified: 17 Dec 2013, 18:53:06 UTC

For me, right now, it's 50%...

Statistics are statistics...
Unfortunately, that doesn't mean there are no outliers.
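To put a number on "outlier", here is a rough sketch of how likely 5 failures in 10 tasks would be if every task failed independently at the quoted 6% rate. Independence and a uniform per-task rate are assumptions, and `binomial_tail` is a hypothetical helper, not anything from the project:

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Probability of at least k failures in n independent tasks."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Figures from the thread: ~6% project-wide error rate,
# and one host reporting 5 failures out of 10 SANTI tasks.
p_tail = binomial_tail(n=10, k=5, p=0.06)
print(f"P(at least 5 failures in 10) = {p_tail:.6f}")
```

A very small probability under these assumptions would point toward something host-specific (clocks, voltage, drivers) rather than plain bad luck, though it cannot say which.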
John
Joined: 15 Oct 11
Posts: 17
Credit: 81,085,378
RAC: 0
Message 34361 - Posted: 17 Dec 2013, 20:04:36 UTC - in response to Message 34347.  

I noticed that most of my failed WUs occurred on 1 of my 2 cards. Upon investigation I noticed that the card with the WU failures was running at a slightly lower voltage. Rather than mess with the voltage (up till now both cards have worked well), I lowered the GPU clocks by 10 MHz and so far all WUs have completed successfully... fingers crossed...
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 34365 - Posted: 17 Dec 2013, 21:02:28 UTC - in response to Message 34361.  

I lowered the GPU clocks by 10mhz and so far all WU's have completed successfully..... finger's crossed...

Please try the same, Tomba. Lower the clock speed of the offending GPU by 13 or 26 MHz and see if it helps too.

MrS
Scanning for our furry friends since Jan 2002
Dagorath
Joined: 16 Mar 11
Posts: 509
Credit: 179,005,236
RAC: 0
Message 34367 - Posted: 17 Dec 2013, 21:15:22 UTC - in response to Message 34357.  

Also as I mentioned the error rate is around 6%

For me, right now, it's 50%...


Check my results. Out of 66 results, I have 63 successes, 1 failed SANTI, 1 failed NOELIA, and 1 aborted. It seems SANTI and NOELIA are difficult tasks but not impossible.
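Per-host failure rates built from a handful of tasks are noisy, so a rough sketch with Wilson score intervals may help when comparing them with the 6% project-wide figure. The counts are the ones reported in this thread; how each poster counted is an assumption, and `wilson_interval` is a hypothetical helper:

```python
from math import sqrt


def wilson_interval(failures: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a failure proportion."""
    p_hat = failures / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Counts as reported in the thread (assumed to be counted consistently).
reports = {
    "tomba, SANTI only": (5, 10),
    "Beyond, SANTI_bax2 only": (1, 19),
    "Dagorath, all task types": (2, 66),
}

for name, (failures, total) in reports.items():
    low, high = wilson_interval(failures, total)
    print(f"{name}: {failures}/{total} failed, interval {low:.1%} to {high:.1%}")
```

An interval that contains 6% is consistent with the project-wide rate; one that sits well above it suggests that host differs from the average for some reason.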
BOINC <<--- credit whores, pedants, alien hunters
Damaraland
Joined: 7 Nov 09
Posts: 152
Credit: 16,181,924
RAC: 0
Message 34368 - Posted: 17 Dec 2013, 22:02:55 UTC - in response to Message 34367.  

It's impossible for me to find out if there's a relation, but I had processor usage set to 98%. I changed it to 100% and got 2 SANTI errors...
Maybe it's just luck? Or maybe the continuous interruptions I had before made these units run better. I think someone should have a look... A simple test I can see would be to force interruptions and send the same units to the same computer.
Maybe it's just chance, or Santiago doesn't get on well with the GPUs. :p

©2025 Universitat Pompeu Fabra