WU: NOELIA_INS1P

Message boards : News : WU: NOELIA_INS1P
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 33847 - Posted: 12 Nov 2013, 21:34:58 UTC - in response to Message 33823.  

Windows can't catch these calculation errors because, frankly, it doesn't see them. The GPU-Grid app sends commands to the GPU, the GPU processes them and returns results to the app. Unless the GPU misbehaves in some visible way (stops responding, etc.), there's no way for the OS to tell whether the data returned is correct or garbage. Not even GPU-Grid can know this, unless they already know the result. But they can check their results for sanity and, luckily for us, errors often have either no effect (on the long-term simulation result) or catastrophic effects.

I suppose molecular dynamics is comparatively tolerant of single calculation errors. Imagine it this way: if a force is calculated too large in one time step and, as a result, an atom is moved further than it should be in time step n, then it will likely get too close to other atoms in time step n+1 and hence receive a greater repelling force than it would have gotten in the correct position. Thus small errors don't build up over time. Not sure it really works like this, but I think Matt once said something which sounded to me like this :)
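
To make that intuition a bit more concrete, here is a toy sketch in Python. It is only an illustration under my own assumptions, not GPU-Grid's code: one particle in 1D, unit mass, a velocity-Verlet integrator, and a single corrupted force evaluation (`bad_extra`, a made-up parameter). With a restoring force the deviation from the clean run stays bounded; with no restoring force the same error keeps growing.

```python
def simulate(force, steps, dt=0.01, x0=1.0, v0=0.0, bad_step=None, bad_extra=0.0):
    """Velocity-Verlet integration of one particle (mass 1) in 1D.

    If bad_step is given, the force evaluated in that step is corrupted
    once by adding bad_extra (a stand-in for a single GPU calculation error).
    """
    x, v = x0, v0
    f = force(x)
    xs = [x]
    for n in range(steps):
        v_half = v + 0.5 * dt * f      # first half kick with the old force
        x = x + dt * v_half            # move the particle
        f = force(x)                   # new force at the new position
        if n == bad_step:
            f += bad_extra             # the one faulty force evaluation
        v = v_half + 0.5 * dt * f      # second half kick with the new force
        xs.append(x)
    return xs


def deviation(force, steps=5000, bad_step=100, bad_extra=50.0):
    """Absolute position difference between a clean and a faulty run."""
    clean = simulate(force, steps)
    faulty = simulate(force, steps, bad_step=bad_step, bad_extra=bad_extra)
    return [abs(a - b) for a, b in zip(clean, faulty)]


def spring(x):
    return -x        # restoring force pulls the particle back


def no_force(x):
    return 0.0       # nothing corrects a wrong position


spring_dev = deviation(spring)
free_dev = deviation(no_force)
print("max deviation with restoring force:", max(spring_dev))
print("final deviation without any force: ", free_dev[-1])
```

Note that in this toy the error does not vanish, it just stops growing. A real simulation with many interacting atoms is far more complicated, so take this as a picture of the idea rather than proof.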

MrS
Scanning for our furry friends since Jan 2002
ID: 33847 · Rating: 0
robertmiles
Joined: 16 Apr 09
Posts: 503
Credit: 769,991,668
RAC: 0
Level
Glu
Message 33975 - Posted: 22 Nov 2013, 1:26:42 UTC - in response to Message 33821.  

The claim is that errors can be caused by "not having enough voltage" or by "having too high of a temperature".

Do we have conclusive proof of this claim? Or is it more of a generalization based on experience? I'm struggling to understand how voltage or temperature can have any effect on error % rates, and would appreciate some guidance.

All semiconductor manufacturers create yield curves for their production lots. They show how much voltage/current it takes to achieve a given speed. In general, the more power you supply to the chip, the faster it can be clocked. Of course, it also gets hotter, which can eventually destroy the chip. That is why a power limit is also specified (e.g., 95 watts for some Intel CPUs, etc.). But the chips vary, with some being able to run fast at lower power, and some requiring higher power to achieve the same speeds. You can get errors due to a variety of reasons, with temperature being just one. But I have seen errors even below 70 C, so some other limitation may get you first.


[snip]

MrS


Something I've read that seems relevant to this explanation:

Today's CPU chips are approaching the lower limit of the voltages at which the transistors work properly. Therefore, the power used by each CPU core can't get much lower. Instead, the companies are increasing the total speed by putting more CPU cores in each CPU package. Intel is also using a different method: hyperthreading. This method gives each CPU core two sets of registers, so that while the core is waiting for memory operations for the program running with one of these sets, it can use the other set to run a second program. This makes the CPU act as if it had twice as many CPU cores as it actually does.

If a programmer wants to use more than one of these CPU cores at the same time for the same program, that programmer must study parallel programming first, in order to handle the communication between the different CPU cores properly.

I used to be an electronic engineer, specializing in logic simulation, often including timing analysis.
ID: 33975 · Rating: 0
MrJo
Joined: 18 Apr 14
Posts: 43
Credit: 1,192,135,172
RAC: 0
Level
Met
Message 36876 - Posted: 20 May 2014, 10:25:04 UTC
Last modified: 20 May 2014, 10:26:16 UTC

Just crunched my first one on a GTX 770 at 76° in 31,145.32 seconds. Nice 153,150.00 Points ;-)
Regards, Josef

ID: 36876 · Rating: 0
TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 36877 - Posted: 20 May 2014, 11:22:37 UTC - in response to Message 36876.  

Just crunched my first one on a GTX 770 at 76° in 31,145.32 seconds. Nice 153,150.00 Points ;-)

You did indeed finish a Noelia WU, though not this one but the new one: NOELIA_BI.
But more importantly, your 770 can do better: mine finishes these new Noelias in about 27000 seconds, and its temperature is only 66-67°C. The colder a GPU runs, the faster (and more error-free) it runs. So perhaps you can experiment with some settings to get the temperature a few degrees lower.
Greetings from TJ
ID: 36877 · Rating: 0
MrJo
Joined: 18 Apr 14
Posts: 43
Credit: 1,192,135,172
RAC: 0
Level
Met
Message 36880 - Posted: 20 May 2014, 13:31:13 UTC - in response to Message 36877.  

your 770 can do better: mine finishes these new Noelias in about 27000 seconds, and its temperature is only 66-67°C.


Thanks for your advice. To lower the temperature, I'm using NVIDIA Inspector with the following settings:

I unchecked Auto-Fan and set the fan to 60%, which speeds it up from 1300 to 1770 rpm and is still ear-friendly. But that reduces the temperature by only 3 degrees. So I have to check the Prioritize Temperature box and put the slider at 68°, which slows down the GPU clock a little bit. Is there a better approach?




Regards, Josef

ID: 36880 · Rating: 0
Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 36883 - Posted: 21 May 2014, 14:49:46 UTC - in response to Message 36877.  
Last modified: 21 May 2014, 14:50:16 UTC

But more importantly, your 770 can do better: mine finishes these new Noelias in about 27000 seconds, and its temperature is only 66-67°C. The colder a GPU runs, the faster (and more error-free) it runs. So perhaps you can experiment with some settings to get the temperature a few degrees lower.

You might have accidentally been looking at your 780 Ti. Here are your 3 Noelia results from the 770 so far:

# GPU [GeForce GTX 770] Platform [Windows] Rev [3301M] VERSION [42]
# Approximate elapsed time for entire WU: 29643.715 s

# GPU [GeForce GTX 770] Platform [Windows] Rev [3301M] VERSION [42]
# Approximate elapsed time for entire WU: 29572.861 s

# GPU [GeForce GTX 770] Platform [Windows] Rev [3301M] VERSION [42]
# Approximate elapsed time for entire WU: 29676.489 s
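
For anyone who wants to compare runs like these without doing it by hand, here is a small Python sketch. It assumes the stderr text has been copied into a string; the regular expression and variable names are my own and not part of any BOINC tool. It pulls out the "Approximate elapsed time" values quoted above and compares their average with the 31,145.32 s run reported earlier in the thread.

```python
import re

# The three stderr excerpts quoted above, pasted as plain text.
stderr_text = """\
# GPU [GeForce GTX 770] Platform [Windows] Rev [3301M] VERSION [42]
# Approximate elapsed time for entire WU: 29643.715 s
# GPU [GeForce GTX 770] Platform [Windows] Rev [3301M] VERSION [42]
# Approximate elapsed time for entire WU: 29572.861 s
# GPU [GeForce GTX 770] Platform [Windows] Rev [3301M] VERSION [42]
# Approximate elapsed time for entire WU: 29676.489 s
"""

# Capture the number of seconds that follows the fixed label.
ELAPSED = re.compile(r"Approximate elapsed time for entire WU:\s*([\d.]+)\s*s")

times = [float(m.group(1)) for m in ELAPSED.finditer(stderr_text)]
mean_time = sum(times) / len(times)

other_run = 31145.32  # the run time reported by MrJo earlier in the thread
gap = other_run - mean_time

print(f"runs found: {len(times)}")
print(f"mean elapsed time: {mean_time:.1f} s")
print(f"gap to the {other_run} s run: {gap:.1f} s ({100 * gap / other_run:.1f}%)")
```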
ID: 36883 · Rating: 0
TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 36886 - Posted: 21 May 2014, 17:26:50 UTC - in response to Message 36883.  

You are absolutely correct Beyond, my mistake.
Sorry for that MrJo.

Still a difference of 2000 seconds. I had never seen nVidia Inspector before; I use PrecisionX from EVGA or MSI's Afterburner. I have set a fan curve that goes to 100% at 70°C, but the card is allowed to go to 75°C before the program must throttle the GPU clock. Power target is set to 100%. Currently, with an ambient temperature of 32.6°C, the 770 runs at 68°C and 1149MHz. It sits in the second slot; the first is occupied by the 780Ti.
Hope this helps a bit.
Greetings from TJ
ID: 36886 · Rating: 0
MrJo
Joined: 18 Apr 14
Posts: 43
Credit: 1,192,135,172
RAC: 0
Level
Met
Message 36888 - Posted: 22 May 2014, 5:22:01 UTC

Now I have tested MSI Afterburner, where you can set a custom fan curve. However, I have a problem with that: in order to lower the temperature by 3-4°C, the fan speed has to increase to 3300 rpm. This is unpleasant. With my GTX 680 I was able to reduce the temperature by 8 degrees by removing the cooler and renewing the thermal paste ;-) Unfortunately, the same procedure did nothing for the GTX 770, since its thermal paste had not dried out. Too new ;-) So I will reduce the GPU clock a little bit to stay below 70 degrees. Reducing it from 1150 MHz to 1080-1100 MHz lowers the temperature by 5 degrees.
Regards, Josef

ID: 36888 · Rating: 0
GoodFodder
Joined: 4 Oct 12
Posts: 53
Credit: 333,467,496
RAC: 0
Level
Asp
Message 37368 - Posted: 23 Jul 2014, 8:06:38 UTC

Hi,

potx1x225-NOELIA_INSP-5-13-RND8250_1:

I have an odd error: the task failed within 3 seconds. Hopefully it is a one-off and not a bad batch; however, in case it is not:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

23:01:01 (3684): called boinc_finish


http://www.gpugrid.net/result.php?resultid=12864293

ID: 37368 · Rating: 0
Bedrich Hajek
Joined: 28 Mar 09
Posts: 490
Credit: 11,731,645,728
RAC: 47,738
Level
Trp
Message 37369 - Posted: 23 Jul 2014, 10:01:55 UTC - in response to Message 37368.  

Hi,

potx1x225-NOELIA_INSP-5-13-RND8250_1:

I have an odd error: the task failed within 3 seconds. Hopefully it is a one-off and not a bad batch; however, in case it is not:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

23:01:01 (3684): called boinc_finish


http://www.gpugrid.net/result.php?resultid=12864293




I had the same error in 4 units so far. Here is an example of one:


potx1x492-NOELIA_INSP-3-13-RND4560_6
Workunit: 9908013
Created: 22 Jul 2014 | 19:40:28 UTC
Sent: 22 Jul 2014 | 21:46:12 UTC
Received: 22 Jul 2014 | 23:03:18 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: -98 (0xffffffffffffff9e) Unknown error number
Computer ID: 127986
Report deadline: 27 Jul 2014 | 21:46:12 UTC
Run time: 4.05
CPU time: 2.06
Validate state: Invalid
Credit: 0.00
Application version: Long runs (8-12 hours on fastest card) v8.41 (cuda60)
Stderr output:

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -98 (0xffffff9e)
</message>
<stderr_txt>
# GPU [GeForce GTX 690] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 690
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:04:00.0
# Device clock : 1019MHz
# Memory clock : 3004MHz
# Memory width : 256bit
# Driver version : r337_00 : 33788
ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

19:03:38 (5576): called boinc_finish

</stderr_txt>
]]>
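
In case the two hex values look confusing: the exit status -98 is shown once as 0xffffffffffffff9e and once as 0xffffff9e. Both are the same negative number written in two's complement, at 64 and 32 bits respectively. A short Python sketch of the conversion follows; the helper names are my own, chosen just for illustration.

```python
def to_unsigned_hex(value, bits):
    """Show a (possibly negative) integer as a two's-complement hex string."""
    mask = (1 << bits) - 1
    return hex(value & mask)


def to_signed(value, bits):
    """Interpret an unsigned bit pattern as a signed two's-complement integer."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):   # top bit set means the number is negative
        value -= 1 << bits
    return value


print(to_unsigned_hex(-98, 32))    # 32-bit pattern, as in the <message> block
print(to_unsigned_hex(-98, 64))    # 64-bit pattern, as in the result summary
print(to_signed(0xFFFFFF9E, 32))   # back to the signed exit code
```

This only explains the notation. What exit code -98 actually means to the application is not something the log tells us, beyond the "Unable to read bincoordfile" line.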



http://www.gpugrid.net/result.php?resultid=12864314



ID: 37369 · Rating: 0
Grubix
Joined: 26 Sep 08
Posts: 4
Credit: 321,147,075
RAC: 0
Level
Asp
Message 37370 - Posted: 23 Jul 2014, 10:14:07 UTC

Same error here:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile


potx1x284-NOELIA_INSP-2-13-RND0923 : WU 9908067

potx1x225-NOELIA_INSP-5-13-RND8250 : WU 9907982

Bye, Grubix.
ID: 37370 · Rating: 0
Vagelis Giannadakis
Joined: 5 May 13
Posts: 187
Credit: 349,254,454
RAC: 0
Level
Asp
Message 37372 - Posted: 23 Jul 2014, 10:20:25 UTC

This error does not affect only NOELIAs: I had a SANTI_p53final fail on me the other day with the exact same error:
ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

ID: 37372 · Rating: 0

©2025 Universitat Pompeu Fabra