Errors piling up, bad batch of NOELIA?

RaymondFO*

Joined: 22 Nov 12 · Posts: 72 · Credit: 14,040,706,346 · RAC: 0
Message 37857 - Posted: 7 Sep 2014, 11:04:36 UTC - in response to Message 37853.  

> Any update? Can I change back to long runs? My computers run without supervision over the weekend, so I do not like to pile up errors.

While you may still get a bad task here or there, I would venture to say the number of current bad tasks has sharply dwindled.
ID: 37857
Bjarke

Joined: 1 Mar 09 · Posts: 8 · Credit: 95,935,146 · RAC: 142
Message 37862 - Posted: 9 Sep 2014, 5:39:43 UTC

I have had 5 NOELIA failures since the 3rd of September. The last unit was sent to my workstation on 7 Sep 2014, 12:59:07 UTC.

Most of them run for a long time before the error shows.

ID: 37862
Retvari Zoltan
Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
Message 37863 - Posted: 9 Sep 2014, 7:31:01 UTC - in response to Message 37862.  

The tasks that run for a long time before the error shows (the NOELIA_tpam2 workunits) log a lot of "The simulation has become unstable. Terminating to avoid lock-up" messages before they actually fail. This kind of error is usually caused by one of the following (a sketch for checking these values while a task runs is below):
- too high a GPU frequency
- too high a GDDR5 frequency
- too high a GPU temperature
- too low a GPU voltage
The bad batch this thread is about consists of NOELIA_TRP188 workunits, which usually fail right after the start.
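
For anyone who wants to watch these values while a task is running, here is a minimal monitoring sketch in Python. It assumes the nvidia-ml-py (pynvml) bindings are installed and that the card is device 0 (both assumptions, so adjust for your own system); MSI Afterburner or nvidia-smi will show the same readings:

    import time
    import pynvml  # NVML bindings from the nvidia-ml-py package

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # device 0 assumed; change if needed

    try:
        while True:
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            core = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_GRAPHICS)
            mem = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_MEM)
            print(f"temp={temp}C  core={core}MHz  mem={mem}MHz")
            time.sleep(5)  # poll every 5 seconds while the task runs
    except KeyboardInterrupt:
        pass
    finally:
        pynvml.nvmlShutdown()

If the clocks or the temperature spike right before the "unstable" messages show up in the log, that points at one of the four causes above.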
ID: 37863
Bjarke

Joined: 1 Mar 09 · Posts: 8 · Credit: 95,935,146 · RAC: 142
Message 37874 - Posted: 11 Sep 2014, 7:23:43 UTC - in response to Message 37863.  

> The tasks that run for a long time before the error shows (the NOELIA_tpam2 workunits) log a lot of "The simulation has become unstable. Terminating to avoid lock-up" messages before they actually fail. This kind of error is usually caused by:
> - too high a GPU frequency
> - too high a GDDR5 frequency
> - too high a GPU temperature
> - too low a GPU voltage
> The bad batch this thread is about consists of NOELIA_TRP188 workunits, which usually fail right after the start.

I disagree that any of the four points is the issue.

My Nvidia Quadro K4000 GPU is completely stock, with no modifications or overclocking applied, so the frequencies are right. The case of my host, a Dell Precision T7610, is also completely unmodified, and the case fans are regulated automatically as always. The GPU runs at 75°C, which is well on the safe side. Further, I haven't performed a driver update for months.

May I add that I hadn't noticed any failed WUs on my system until now: within 5 days, 5 NOELIA WUs failed.
ID: 37874
Retvari Zoltan
Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
Message 37875 - Posted: 11 Sep 2014, 9:38:25 UTC - in response to Message 37874.  
Last modified: 11 Sep 2014, 9:39:43 UTC

>> The tasks that run for a long time before the error shows (the NOELIA_tpam2 workunits) log a lot of "The simulation has become unstable. Terminating to avoid lock-up" messages before they actually fail. This kind of error is usually caused by:
>> - too high a GPU frequency
>> - too high a GDDR5 frequency
>> - too high a GPU temperature
>> - too low a GPU voltage
>> The bad batch this thread is about consists of NOELIA_TRP188 workunits, which usually fail right after the start.
>
> I disagree that any of the four points is the issue.

Of course there are more possibilities, but these four are the most frequent ones, and they can be checked easily by tuning the card with software tools (such as MSI Afterburner). Furthermore, these errors can be caused by a faulty (or inadequate) power supply, or by the aging of the components (especially the GPU). Those are much harder to fix, but you can still have a stable system with such components if you reduce the GPU/GDDR5 frequency. It's better to have a 10% slower system than one producing (ever more frequent) random errors; the quick comparison below shows why.
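
To put numbers on that trade-off, here is a back-of-the-envelope comparison in Python; the 20% failure rate is purely illustrative, not a measurement:

    def effective_throughput(relative_speed, failure_rate):
        # Work that actually validates per unit time: raw speed times
        # the fraction of tasks that finish without erroring out.
        return relative_speed * (1.0 - failure_rate)

    stock = effective_throughput(1.00, 0.20)        # stock clocks, 1 in 5 WUs fails
    downclocked = effective_throughput(0.90, 0.00)  # 10% slower, but error-free

    print(stock, downclocked)  # 0.8 vs 0.9: the "slower" card returns more work

And this first-order model is still generous to the unstable card, since a long-running WU that errors near the end wastes almost its whole run time.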

> My Nvidia Quadro K4000 GPU is completely stock, with no modifications or overclocking applied, so the frequencies are right.

The second statement does not follow from the first. The frequencies (for the given system) are right when there are no errors. The GPUGrid client pushes the card very hard, like the infamous FurMark GPU test, so we have had a lot of surprises over the years regarding stock frequencies.

> The case of my host, a Dell Precision T7610, is also completely unmodified, and the case fans are regulated automatically as always. The GPU runs at 75°C, which is well on the safe side. Further, I haven't performed a driver update for months.

It may seem strange, but a card can produce errors even below 80°C. I have two GTX 780Ti's in the same system: one is an NVIDIA standard design, the other an OC model (both of them Gigabyte, by the way). I had errors with the OC model right from the start while its temperature stayed under 70°C (only with GPUGrid; no other testing tool showed any errors), but reducing its GDDR5 frequency from 3500 MHz to 2700 MHz (!) solved my problem. After a BIOS update this card runs error-free at 2900 MHz, but that is still way below the factory setting.

> May I add that I hadn't noticed any failed WUs on my system until now: within 5 days, 5 NOELIA WUs failed.

If you check the logs of your successful tasks, those also contain these "The simulation has become unstable. Terminating to avoid lock-up" messages, so you were lucky that those workunits succeeded. If you check my (similar NOELIA) workunits, none of them has these messages; a quick way to count them in your own logs is sketched below.
So give reducing the GPU frequency a try (it's harder to reduce the GDDR5 frequency, as you have to flash the GPU's BIOS).
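
If you save the stderr output of your finished tasks, a few lines of Python will do the counting for you. The task_logs folder and the *.txt pattern are hypothetical; point them at wherever you keep your copies:

    from pathlib import Path

    MARKER = "The simulation has become unstable"
    log_dir = Path("task_logs")  # hypothetical folder of saved stderr outputs

    for log_file in sorted(log_dir.glob("*.txt")):
        hits = log_file.read_text(errors="replace").count(MARKER)
        if hits:
            print(f"{log_file.name}: {hits} instability message(s)")

A task with a handful of these messages finished on borrowed time; a clean run should show none.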
ID: 37875
Stefan
Project administrator
Project developer
Project tester
Project scientist

Joined: 5 Mar 13 · Posts: 348 · Credit: 0 · RAC: 0
Message 37881 - Posted: 12 Sep 2014, 8:32:58 UTC - in response to Message 37875.  

The direct crashes should be fixed now.
ID: 37881
TJ

Joined: 26 Jun 09 · Posts: 815 · Credit: 1,470,385,294 · RAC: 0
Message 37888 - Posted: 13 Sep 2014, 0:26:43 UTC

The NOELIAs are doing okay on my systems after last week's hiccup, but the latest (final) SANTIs have this error: ERROR: file mdioload.cpp line 81: Unable to read bincoordfile.
I see errors with my wing(wo)men too. I run with Win7.
Greetings from TJ
ID: 37888