SANTI Errors
| Author | Message |
|---|---|
|
Joined: 21 Feb 09 Posts: 497 Credit: 700,690,702 RAC: 0
|
My last five WUs were SANTIs. Four gave errors. I wasted 25 hours of electricity. Until someone fixes this, I will immediately abort any SANTI I get. Sorry... |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
I can "only" see 3 failed WUs in your account, with 1 of them also failing for others. And lots of successful ones, including SANTIs. From this data I'm not convinced something's fundamentally broken here. Could be as simple as a machine needing a cold-boot. MrS Scanning for our furry friends since Jan 2002 |
|
Joined: 21 Feb 09 Posts: 497 Credit: 700,690,702 RAC: 0
|
*Quote: I can "only" see 3 failed WUs in your account, with 1 of them also failing for others.* Sorry - I added in the first (active) WU I aborted. But I still wasted a day's electric! *Quote: Could be as simple as a machine needing a cold-boot.* I'll give that a try and not abort any more. Thanks for posting. |
|
Joined: 15 Oct 11 Posts: 17 Credit: 81,085,378 RAC: 0
|
I have had 5 SANTIs fail in the last couple of days, 8 in total if I go back 4-5 days. I have shut down the computer completely and restarted twice now. Yes, 1 or 2 have completed, but the fail rate is unacceptable. I have changed nothing regarding system setup, so it's starting to look like these SANTIs are faulty?? ...and yes, it is a waste of electricity... |
dskagcommunity Joined: 28 Apr 11 Posts: 463 Credit: 958,266,958 RAC: 34
|
Hm, I didn't want to write down my single SANTI failure, but as I see in this thread... on a 1 GB 560 Ti (384 cores) I had a SANTI long WU fail too. I immediately switched it back to short, because after POEM GPU stopped I need every credit I can get to "hold" the overall RAC. But I still have the 310.70 drivers on it. DSKAG Austria Research Team: http://www.research.dskag.at
|
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
I looked into the driver versions, but there's great variation and since you guys are able to process some of these WUs I couldn't expect to find anything conclusive there. However, the track record of 331.82 and 327.23 has been quite good - maybe try these? MrS Scanning for our furry friends since Jan 2002 |
Coleslaw Joined: 24 Jul 08 Posts: 36 Credit: 363,857,679 RAC: 0
|
I had to set my third computer so far to No New Work because of the SANTI work units causing BSODs. So far the systems I have had them on run Windows 7 Premium and Professional x64. The cards were a GT 430, a 650 Ti, and a 660 Ti. From what I could see, it happens after the drivers crash many times in a very short period of time. These devices have run GPUGrid for a while and still run solid on other GPU projects. The BSODs only happen when running GPUGrid SANTI work. Driver versions on these machines: 331.82, 320.49, 331.82
|
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
*Quote: I had to set my third computer so far to No New Work because of the SANTI work units causing BSODs. So far the systems I have had them on run Windows 7 Premium and Professional x64. The cards were a GT 430, a 650 Ti, and a 660 Ti. From what I could see, it happens after the drivers crash many times in a very short period of time.* Just had the same thing happen with a SANTI_bax2 WU on one machine. It BSODed every time BOINC started the SANTI WU. Even caused disk corruption once. 650 Ti GPU. If I suspended the WU the machine ran fine. Finally aborted the WU and DLed another SANTI_bax2, which is so far running OK. Before that WU the box in question had run for a VERY long time without a single crash. I've run 18 other SANTI_bax2 WUs without an issue. Wonder if perhaps some WUs got released with bad parameters? Maybe a corrupted DL? |
|
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
Hm, I don't know what could be causing this, as it doesn't seem to be something systematic. Santi_bax2 WUs only have a 6% error rate, which I would say is nearly a historical low for this project. |
|
Joined: 21 Feb 09 Posts: 497 Credit: 700,690,702 RAC: 0
|
Another SANTI just wasted five hours of electric; here. "The simulation has become unstable. Terminating to avoid lock-up". |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
Different WU types need different GPU (electrical) power (at a given GPU frequency). This kind of error can occur when the processing of the WU tricks the GPU's power scheme, so that it gives the GPU a slightly lower voltage than it needs (or a slightly higher frequency than it can run at). It can be fixed by either lowering the GPU frequency or raising the voltage. Sometimes that's not easy to do on a Kepler (e.g. with MSI Afterburner). I had to use the Kepler BIOS tweaker utility to permanently fix this kind of error on my overclocked ASUS GTX 670DC2OC. This is a very useful tool. If you put nvflash in its working directory, it can directly flash the modified BIOS to the card. |
|
Joined: 21 Feb 09 Posts: 497 Credit: 700,690,702 RAC: 0
|
Another Santi errored out today. That's five in five days out of a total of 10. That's 36 hours of wasted electricity. No comment here from the scientist... |
|
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
Sorry tomba, but there is not really anything Santi can help you with. These WUs are a continuation of previous "bax" WUs which were simulated successfully. Also as I mentioned the error rate is around 6%, which is really very low. I asked him if there is anything fancy about the system, but it's apparently not very large and doesn't use any weird, barely-tested functionality, so there is really nothing we can do about it. About the system: it's a protein that is responsible for the activation of apoptosis, a process that controls cell death, and we are looking for a specific conformation of this protein. The only one who could help would be Matt, from the technical side, but I don't know if he really has that much time right now. If you want, you can message him (username MJH). I would suggest maybe switching to the short queue for a while. Or do what Zoltan suggested about frequencies and voltages (I have no clue about that though; the forum members can help you). |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
*Quote: Just had the same thing happen with a SANTI_bax2 WU on one machine. It BSODed every time BOINC started the SANTI WU. Even caused disk corruption once. 650 Ti GPU. If I suspended the WU the machine ran fine. Finally aborted the WU and DLed another SANTI_bax2, which is so far running OK. Before that WU the box in question had run for a VERY long time without a single crash. I've run 18 other SANTI_bax2 WUs without an issue. Wonder if perhaps some WUs got released with bad parameters? Maybe a corrupted DL?* Hummm. Now I am beginning to wonder. I just had a similar situation (I think), where I was copying a 5 GB video file from one drive to another, and it kept BSODing the machine, which I have never seen it do before. Since it would copy fine to another drive, I put it down to a controller/disk drive compatibility problem, since the drive with problems was on a Marvell controller, not the main Intel controller. But it just so happens I was running a Santi_bax2 at the time, and noticed that it was taking a very long time to complete, and even increasing in estimated time left after 16 hours (only 26% complete), so I aborted it. But that card (a GTX 650 Ti 1 GB) has been very stable otherwise with all the other work units, including a couple of Santi_bax2 types. That work unit may be bad, but it has not finished yet on another machine, so I don't know. I had assumed that the drive problem had corrupted the Santi_bax2, but it could be the other way around. EDIT: Actually, it started out on the GTX 650 Ti, but had switched over to a GTX 660 by the time I ended it. So it seems not to be a memory limitation, since it was running slowly even with 2 GB, unless they have gotten worse than that. http://www.gpugrid.net/result.php?resultid=7557083 (The restarts are due to the BSODs.) |
|
Joined: 21 Feb 09 Posts: 497 Credit: 700,690,702 RAC: 0
|
*Quote: Also as I mentioned the error rate is around 6%* For me, right now, it's 50%... |
|
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
*Quote: For me, right now, it's 50%...* Statistics are statistics... Unfortunately, that doesn't mean there are no outliers. |
|
Joined: 15 Oct 11 Posts: 17 Credit: 81,085,378 RAC: 0
|
I noticed that most of my failed WUs occurred on 1 of my 2 cards. Upon investigation I noticed that the card with the WU failures was running at a slightly lower voltage. Rather than mess with the voltage (up till now both cards have worked well) I lowered the GPU clocks by 10 MHz, and so far all WUs have completed successfully..... fingers crossed... |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
*Quote: I lowered the GPU clocks by 10 MHz, and so far all WUs have completed successfully..... fingers crossed...* Please try the same, Tomba. Lower the clock speed of the offending GPU by 13 or 26 MHz and see if it helps too. MrS Scanning for our furry friends since Jan 2002 |
|
Joined: 16 Mar 11 Posts: 509 Credit: 179,005,236 RAC: 0
|
*Quote: Also as I mentioned the error rate is around 6%* Check my results. Out of 66 results, I have 63 successes, 1 failed SANTI, 1 failed NOELIA, and 1 aborted. It seems SANTI and NOELIA are difficult tasks but not impossible. BOINC <<--- credit whores, pedants, alien hunters |
Damaraland Joined: 7 Nov 09 Posts: 152 Credit: 16,181,924 RAC: 0
|
Impossible for me to find out if there's a relation, but I had 98% processor usage. Changed to 100% and got 2 SANTI errors... Maybe just luck?? Or the continuous interruptions I had before made these units go better. I think someone should have a look... A simple test I can see would be to force interruptions and send the same units to the same computer. Maybe it's just chance, or Santiago doesn't get on well with the GPUs. :p |
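
On the "just luck?" question, and the 6% versus 50% exchange earlier in the thread: a rough check is to ask how likely five failures out of ten tasks would be if every task failed independently at the quoted 6% project-wide rate. What follows is only a sketch under that assumption. The helper name `prob_at_least` is made up for illustration, and independence between tasks is an assumption, not something the project has stated.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more failures in n tasks.

    Assumes every task fails independently with the same probability p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Figures quoted in the thread: ~6% error rate, 5 failures out of 10 SANTI tasks.
PROJECT_RATE = 0.06
TASKS = 10

for failures in range(1, 6):
    chance = prob_at_least(failures, TASKS, PROJECT_RATE)
    print(f"P(at least {failures} failures in {TASKS} tasks) = {chance:.6f}")
```

If the last printed probability comes out very small, that would hint that something specific to the host (clocks, voltage, drivers) is involved rather than bad luck alone, which would fit the downclocking suggestions above. It is not proof either way, since a 6% project-wide average can hide large differences between machines.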