Message boards : Graphics cards (GPUs) : NVidia GPU Card comparisons in GFLOPS peak
dskagcommunity | Joined: 28 Apr 11 | Posts: 463 | Credit: 959,766,958 | RAC: 142,390
Nice stats! You should count the 560 Ti with 1280 MB memory among the CC 2.0 cards, because it is essentially a 570 (nearly so in performance, too) with some (defective?) shader cores deactivated.

DSKAG Austria Research Team: http://www.research.dskag.at
skgiven | Joined: 23 Apr 09 | Posts: 3968 | Credit: 1,995,359,260 | RAC: 0
A nice overall look at field performance variation. You did well to remove the GTX 460 from the chart - there are actually five different versions of the GTX 460, with native performance ranging from 601 to 1045 GFLOPS and bandwidth from 86 to 115 GB/s:

- 460 (Jul '10) 1 GB, 336:56:32 - let's call this the basal reference model
- 460 SE (Nov '10) 288 shaders
- 460 OEM (Oct '10) as reference, but reduced clocks
- 460 (Jul '10) 768 MB, 336:56:24
- 460 v2 (Sep '11) higher clocks, 336:56:24

So GPU identification may well be the issue in some cases, and as pointed out, especially for the GTX 560 Ti (384 vs 448 shaders, and CC 2.1 vs CC 2.0).

Last time I looked, and every time for years, there was an ~11% performance difference between Linux or XP and Vista/W7/W8. If you just look at GTX 670s, there is a performance range of ~10% between the reference models and the top bespoke cards. Different system architectures could also influence performance by around the same amount; an average CPU with DDR2 and PCIe 1.1 at x4, say, would cause a drop of around 10% or so from a top system. Then there is downclocking, overclocking, recoverable errors...

FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help
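For readers following the peak-GFLOPS figures in the thread title, here is a minimal sketch of how such numbers are usually derived for Fermi-class cards. The reference clocks below are assumptions taken from published specs, not data from this thread; they reproduce two of the GTX 460 variants mentioned above.

```python
# A minimal sketch (not from the thread) of how "GFLOPS peak" figures are usually
# derived for Fermi-class cards: 2 FLOPs per CUDA core per shader ("hot") clock cycle.
def peak_gflops(cuda_cores: int, shader_clock_mhz: float) -> float:
    return 2 * cuda_cores * shader_clock_mhz / 1000.0

# Reference clocks below are assumed from published specs; factory cards differ.
print(f"GTX 460 (reference): {peak_gflops(336, 1350):.0f} GFLOPS")  # ~907
print(f"GTX 460 v2:          {peak_gflops(336, 1556):.0f} GFLOPS")  # ~1046
```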
Joined: 13 Apr 13 | Posts: 61 | Credit: 726,605,417 | RAC: 0
dskagcommunity,

That was not confusing at all. :) I see I had pulled data from two different computers, which, as you noted, also had different memory configurations. I changed the compute capability below after you pointed it out.

| Card | Mem (MB) | Capability | Device clock (MHz) | Mem clock (MHz) | Mem width (bit) |
|---|---|---|---|---|---|
| GTX 560 Ti | 1280 | 2.0 | 1520 | 1700 | 320 |
| GTX 560 Ti | 1024 | 2.1 | 1644 | 2004 | 256 |

http://www.gpugrid.net/result.php?resultid=7267807
http://www.gpugrid.net/result.php?resultid=7274793

Initially I did not take the capability from the posts, since I was collecting the info manually by reviewing work units. I later went to http://en.wikipedia.org/wiki/CUDA, but got it a bit wrong for this mixture. Thank you.
Joined: 26 Jun 09 | Posts: 815 | Credit: 1,470,385,294 | RAC: 0
Yes, absolutely a neat overview. Two remarks from me, though. If I see it correctly, the 770 is only a little faster than the 660 in the chart. In my case the 770 is doing almost 35% better than my 660, but I do have a lot of problems with some tasks on the 660. Also, the GPU clock in the stderr output file is not the same as the actual running clock speed. I have watched this closely over the last weeks; for most WUs my 770 runs faster and my 660 slower than the reported clock, going by GPU-Z, Asus GPU Tweak and Precision X.

Greetings from TJ
skgiven | Joined: 23 Apr 09 | Posts: 3968 | Credit: 1,995,359,260 | RAC: 0
TJ wrote:
Yes, absolutely a neat overview.

The presented data is an overview of what's happening. You cannot go by such measurements to accurately determine relative GPU performance - there are too many ways a system or configuration can throw the results off. If the data set for each GPU type is small (fewer than several hundred cards), it's inherently prone to such variation; three bad setups would significantly skew the results if the data set were 10 or 20 cards per type. Micro-analysing overview results is fickle and subject to continuous change, but it can help understand things. It might be useful to remove the obviously erroneous results, i.e. the downclocks and bad setups.

TJ wrote:
The GPU clock in the stderr output file is not the same as the actual running clock speed. I have watched this closely over the last weeks; for most WUs my 770 runs faster and my 660 slower than the reported clock, going by GPU-Z, Asus GPU Tweak and Precision X.

Even for one type of GPU there is a range of around 10% performance variation between vendors, and that's without individuals overclocking their FOCs. The clock speed when crunching is the boost speed (for those cards that do boost). This is typically higher at GPUGrid than the average boost, and often higher than the max boost speed; it is different from what is quoted in the stderr file. My GTX 660 Ti operates at around 1200 MHz (without my messing with it), but the stderr file reports it as operating at 1110 MHz, which is the manufacturer's quoted boost speed. So: base core clock 1032 MHz, boost clock 1111 MHz, actual clock 1200 MHz when crunching. There is an 8% difference between what I think is the boost clock (1111 MHz) when using full power and the running clock (which rises until the power target is reached, or something like heat or reliable voltage stops it rising). Anyway, the actual boost is subject to change (voltage, temps...) and some bespoke cards have a higher boost delta than others.

I'm not sure if the quoted reference GFLOPS are taken from the core clock, the average boost, or the 'max' boost? Whichever way it is taken would still make comparisons slightly inaccurate.

Just for reference - GTX 660: base clock 980 MHz, boost clock 1033 MHz, actual clock 1071 MHz.

FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help
Joined: 26 Jun 09 | Posts: 815 | Credit: 1,470,385,294 | RAC: 0
skgiven wrote:
I'm not sure if the quoted reference GFLOPS are taken from the core clock, the average boost, or the 'max' boost? Whichever way it is taken would still make comparisons slightly inaccurate.

That is what I would say. Comparison is overall quite hard; as you said yourself, it depends on the setup (and what is a good setup and what is a wrong/faulty one), the type of OS, RAM, cooling and so on. Then, a 650 is not the same as a 780. Very likely the 780 will be faster, but is it more efficient? That depends on how you look at it and what you are comparing. If you want to compare scientifically, you need exactly the same setup with only one variable factor, the graphics card, and the WUs would need to be exactly the same as well. That can never be done in the BOINC world. Nevertheless, the overview Jeremy has made gives nice insight.

Greetings from TJ
skgiven | Joined: 23 Apr 09 | Posts: 3968 | Credit: 1,995,359,260 | RAC: 0
Another minor factor here is that different types of WU utilise the MCU (memory controller) to different extents. This means that running WUs that heavily utilise the MCU on memory-bandwidth-constrained cards would reduce their relative performance, and running WUs that don't tax the MCU as heavily would make such bandwidth-constrained cards appear faster. The problem with this situation is that we don't know which type of WU GPUGrid will be running in the future, so it's difficult to know which card will be relatively better in terms of performance per cost. I've seen the MCU load on my GTX 660 Ti as high as 44% and also down in the low 30s.

If you take a GTX 660 Ti and a GTX 660, the theoretical performance difference is 30.7% for reference cards; however, the actual performance difference is more like 20%. Both cards have a bandwidth of 144.2 GB/s, but the 660 has fewer shaders to feed, so its bandwidth per shader is higher. My 660 Ti also boosts much higher than a reference 660, 1202 MHz vs 1071 MHz, which exacerbates the bandwidth limitation. My MCU loads are 38% and 28% while running the Beta WUs. Roughly speaking, the GTX 660 Ti appears to be taking a 10% hit due to its bandwidth restrictions.

FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help
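A rough sketch of the arithmetic behind this, for anyone who wants to reproduce it. The core counts and reference base clocks are assumptions from published specs, not figures taken from this thread:

```python
# A rough sketch of the bandwidth argument above. Core counts and reference base
# clocks are assumptions from published specs, not figures taken from this thread.
CARDS = {
    #             cores, base clock (MHz), memory bandwidth (GB/s)
    "GTX 660 Ti": (1344, 915, 144.2),
    "GTX 660":    (960,  980, 144.2),
}

for name, (cores, clock_mhz, bandwidth_gbs) in CARDS.items():
    gflops = 2 * cores * clock_mhz / 1000.0      # Kepler: 2 FLOPs per core per cycle
    mb_per_core = bandwidth_gbs * 1000 / cores   # bandwidth available per core
    print(f"{name}: {gflops:.0f} GFLOPS peak, {mb_per_core:.0f} MB/s per core")

ti, plain = 2 * 1344 * 915, 2 * 960 * 980
print(f"theoretical gap: {(ti / plain - 1) * 100:.1f}%")  # ~30.7%, vs ~20% in practice
```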
Joined: 17 Aug 08 | Posts: 2705 | Credit: 1,311,122,549 | RAC: 0
Nice work, Jeremy! I'm wondering how to make the best use of this. I don't know exactly what you're doing, but do you think it would be feasible to implement this as part of the GPUGrid homepage, provided the project staff themselves are interested? I'd imagine polling the data maybe once a day and then presenting the user with different filters on what to show. The obvious benefit would be that this is live data, so it's by definition current and doesn't have to be updated manually (just checked for sanity every now and then).

And some comments regarding your current work. I'd prefer "projected RAC", "maximum RAC" or "maximum credits per day" instead of run times, for the following reasons:

- WUs differ in runtime and get their credits adjusted accordingly. By looking only at task lengths you introduce unnecessary noise into your data, because you neglect how much work the WUs contained (which the credits tell you).
- To take this into account you'd measure "credits per time interval" - so just use one day, as this coincides with the RAC we all know.
- On the current scale we can see the differences between slow cards, but everything faster looks like a wash in this plot; I'd argue that the performance of the current cards is actually more important. So invert the y-value: credits per time instead of time.

And while showing scatter plots is nice for getting a quick feeling for the noise of a data set, they're bad for presenting averages of tightly clustered data. We could use "value +/- standard deviation" instead, as sketched below.

I suppose the OS difference could be filtered out: either generate separate graphs for Win XP + Linux and Win Vista/7/8, or normalize to one of them (using that 11% difference SK mentioned). This and calculating RAC should make your data significantly more precise. And everyone please remember that we're not shooting for the moon here, but rather for "numbers as good as we can get them automatically".

And for the GTX 650 Ti you can see two distinct groups. I suppose one is the GTX 650 Ti and the other is the GTX 650 Ti Boost. Hurray for Nvidia's marketing team coming up with these extremely sane and intuitive names... I'm not sure there's anything you could do to separate the results of these two different cards.

MrS

Scanning for our furry friends since Jan 2002
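For what it's worth, a sketch of the "credits per day, value +/- standard deviation" presentation described above. The record layout and the sample numbers are hypothetical; real data would come from the scraped task pages.

```python
# A sketch of the "projected credits per day" metric suggested above, with a mean
# and standard deviation per card model. Field layout and sample records are
# hypothetical; real data would come from the scraped task pages.
from collections import defaultdict
from statistics import mean, stdev

tasks = [
    # (card model, credit granted, GPU run time in seconds) - made-up examples
    ("GTX 660", 97_500, 31_000),
    ("GTX 660", 97_500, 29_500),
    ("GTX 770", 102_000, 21_000),
    ("GTX 770", 99_000, 22_500),
]

per_card = defaultdict(list)
for model, credit, runtime_s in tasks:
    per_card[model].append(credit / runtime_s * 86_400)  # credits per day

for model, rates in sorted(per_card.items()):
    spread = stdev(rates) if len(rates) > 1 else 0.0
    print(f"{model}: {mean(rates):,.0f} +/- {spread:,.0f} credits/day (n={len(rates)})")
```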
Joined: 26 Jun 09 | Posts: 815 | Credit: 1,470,385,294 | RAC: 0
MrS wrote:
WUs differ in runtime and get their credits adjusted accordingly. By looking only at task lengths you introduce unnecessary noise into your data, because you neglect how much work the WUs contained (which the credits tell you).

You probably know better than I do, ETA, how the credit works, but I see the following: a NOELIA WU uses the card better than Nathan, Santi and SDoer at the moment, with 95-97% GPU usage; it is faster (around 3000 seconds in my case) but gives more credit. So task length cannot be used here for a complete comparison, but neither can run time.

Greetings from TJ
Joined: 1 Mar 10 | Posts: 147 | Credit: 1,077,535,540 | RAC: 0
Good job indeed, everybody who gathered the information for the comparison. One suggestion: there is a CPU benchmark in BOINC, why not a GPU benchmark?

Lubuntu 16.04.1 LTS x64
Michael Goetz | Joined: 2 Mar 09 | Posts: 124 | Credit: 124,873,744 | RAC: 0
"Good job indeed, everybody who gathered the information for the comparison. One suggestion: there is a CPU benchmark in BOINC, why not a GPU benchmark?"

The CPU benchmarks are a joke. They don't give you a good indication of how well a CPU runs a real application. Different CPUs have different strengths and weaknesses, and some apps will run great on CPU X and terribly on CPU Y, while other apps will behave in exactly the opposite way. With GPUs it's much worse than with CPUs, as the architectures of different GPUs are completely different, and performance on different types of math varies a lot more than it does with CPUs. A standard BOINC GPU benchmark would therefore be even more worthless than the current CPU benchmarks.

What you ideally would want (if it's possible) is to build a benchmark suite directly into your application, so you can run it in benchmark mode on various hardware. (My apps tend to be of the "repeat this big loop N times" type, where the time for each iteration is relatively constant, so a benchmark just consists of timing the first N iterations.)

If you really want to get good data, you can have every result perform a benchmark and return the benchmark result along with the science result. Depending on how the app works, you could even gather the benchmark data while the real app is running, so there doesn't need to be any wasted time. The server could then gather solid data on exactly how fast each app works on specific GPUs.
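As a rough illustration of the in-app benchmarking described above (timing the first N iterations of the main loop while the real work runs), here is a minimal sketch with a placeholder workload standing in for a simulation step:

```python
# An illustration of the "time the first N iterations of the main loop" idea
# described above; work() is just a stand-in for one step of the real computation.
import time

def work(step: int) -> None:
    sum(i * i for i in range(10_000))   # placeholder workload

def run(total_steps: int, bench_steps: int = 1_000) -> None:
    t0 = time.perf_counter()
    for step in range(total_steps):
        work(step)
        if step + 1 == bench_steps:
            per_step_ms = (time.perf_counter() - t0) / bench_steps * 1_000
            # A BOINC app could write this to stderr and report it with the result,
            # letting the server profile each GPU model with no wasted run time.
            print(f"# benchmark: {per_step_ms:.3f} ms/step over first {bench_steps} steps")

run(total_steps=5_000)
```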
Retvari Zoltan | Joined: 20 Jan 09 | Posts: 2380 | Credit: 16,897,957,044 | RAC: 0
Mike wrote:
If you really want to get good data, you can have every result perform a benchmark and return the benchmark result along with the science result. Depending on how the app works, you could even gather the benchmark data while the real app is running, so there doesn't need to be any wasted time. The server could then gather solid data on exactly how fast each app works on specific GPUs.

There is something like this already present in the acemd client, as there is a line at the end of the stderr stating:

# Time per step (avg over 12000000 steps): 2.104 ms
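Since that line has a fixed format, it would be easy to harvest automatically. A sketch based on the exact wording quoted above:

```python
# A sketch of extracting the per-step timing from a task's stderr, following the
# exact line format quoted above.
import re

stderr_text = "# Time per step (avg over 12000000 steps): 2.104 ms"

match = re.search(r"# Time per step \(avg over (\d+) steps\): ([\d.]+) ms", stderr_text)
if match:
    steps, ms_per_step = int(match.group(1)), float(match.group(2))
    print(f"{ms_per_step} ms/step, averaged over {steps:,} steps")
```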
Joined: 13 Apr 13 | Posts: 61 | Credit: 726,605,417 | RAC: 0
MrS,

Thank you for the feedback. Sifting the data took the longest. Basically I started with the top 10 users and machines and worked down until I got enough records, then had to review individual work units for the machines that had different types of cards. I ended up with a little over a thousand records over several hours. There is so much info in this forum on the different cards, but I was curious to see some more filtering. Automation would be great; however, I am not able to do that.

I do not have a complete understanding of how the credits are determined, just that there are bonuses for 24-hour turnaround. With the size of personal queues, card speeds, and even time allotted to crunching, I did not use this metric because of the large potential for skew. I think I heard there is something like a 25-50% bonus on credit for fast returns. For better or worse, that is why I looked at just GPU time. You are right that different WUs involve different amounts of calculation, but I am not sure how to normalize that out. The NOELIAs have a difference. Probably the best option would be uniform WU naming from the scientists for easier filtering. I guess here the credits could help pinpoint it. For example, the work units with two numerical digits in the name before NOELIA are the longest running; probably a different class of analyses being done. I cannot sort by just credit, as they get 180,000 when returned in under 24 hours but 150,000 when at 26 hours. Well, while writing this, I realized I could back-calculate a normalized credit score by using the Sent and Reported times, if I understood the complete bonus system and it is only based on time.

There were a few other plots I had created but did not post, because it was rather slow for me (turn the plot into an image, find a place to save pics online without ads in the created link, manually look at the code of the site to get the link, and then paste it here in the forum). Plus I was not sure about the length of the post.

I would probably try this again if the data collection were a bit easier, and if the stderr had an average actual boost clock rather than the device default. Maybe add that output to each temperature line, and then have an average in the stderr header summary. I would include a median analysis with 95% confidence intervals; most of the data is non-normal, so it skews averages. Anyway, it looks like there are a few additional ways to clean up the data set for a new look. I would really like a way to know the actual boost clocks while running, though, as that was a strong interaction.

Regards, Jeremy
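For the back-calculation idea above, a sketch of dividing out the quick-return bonus so tasks can be compared on a common credit scale. The bonus tiers used (+50% when returned within 24 h, +25% within 48 h) are an assumption, but they are consistent with the 180,000 / 150,000 figures quoted for the same WU type:

```python
# A sketch of back-calculating a normalised ("base") credit from the granted credit
# and the return time. The bonus tiers here are an assumption, consistent with the
# 180,000 (<24 h) vs 150,000 (26 h) figures quoted above.
def base_credit(granted: float, turnaround_hours: float) -> float:
    if turnaround_hours <= 24:
        return granted / 1.50
    if turnaround_hours <= 48:
        return granted / 1.25
    return granted

print(base_credit(180_000, 20))  # 120000.0
print(base_credit(150_000, 26))  # 120000.0 -> same underlying WU value
```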
Michael Goetz | Joined: 2 Mar 09 | Posts: 124 | Credit: 124,873,744 | RAC: 0
Mike wrote:
If you really want to get good data, you can have every result perform a benchmark and return the benchmark result along with the science result. Depending on how the app works, you could even gather the benchmark data while the real app is running, so there doesn't need to be any wasted time. The server could then gather solid data on exactly how fast each app works on specific GPUs.

The problem with stderr is that if a user has BOINC set up to stop when CPU usage goes above a certain percentage - and that's the default in at least some versions of the BOINC client - stderr may exceed the maximum allowed size of the stderr.txt file, which results in the beginning of the file being truncated. Combined with the fact that BOINC will switch the task between GPUs, unless you can see the entire stderr you can't be certain the task ran entirely on the GPU you think it ran on. Also, if there's more than one GPU in the system, you need the app to tell you which GPU was running the task.
skgiven | Joined: 23 Apr 09 | Posts: 3968 | Credit: 1,995,359,260 | RAC: 0
I asked for a benchmark app a few years back, for several reasons. One was to allow us to test our clocks, report performances, and get an accurate comparison of different GPUs. Another was that it could be used by the numerous review sites that now tend to show folding performance. I also thought it could be used as a reset mechanism: when you stop getting WUs because you fail too many tasks (runaway failures), you would be allowed to run the benchmark app (no credits), and once you completed it successfully you would be allowed to get normal tasks again. I think it would have needed its own queue, so the answer was no.

Jeremy, the actual boost clock of a GPU is a big consideration. While most cards seem to run higher than what the manufacturers say, that's not always the case, and it isn't the case for reference mid-range cards. With high-end cards it appears that most cards of a given model boost to around the same clock frequencies. For example, non-reference GTX 660 Tis tend to boost to around 1200 MHz, 8% higher than the boost quoted by the manufacturer and over 13% higher than the supposed max boost of a reference card. However, we can't be sure that's the case when looking at the data; some cards might not boost, and cards boost to different extents in different circumstances - power target, temperature, GPU usage... My reference GTX 660 only boosts to 1071 MHz (which is in fact 13 MHz below the reference max boost, and why I suggest getting a non-reference model that clocks a bit higher). My Gigabyte 670 is in a Linux system and is supposedly running at a reference boost of 1058 MHz, but I don't know if that's what it's really running at.

I would like to see the real boost clock while crunching a WU recorded in the stderr, especially on Linux, as XServer tells me the clock is 750 MHz (which is wrong), the stderr is almost blank, and there aren't any handy little apps such as GPU-Z, Afterburner, Precision... It would be useful to know what other people's cards boost to while running GPUGrid WUs.

FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help
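On Linux, the running SM clock can often be read with nvidia-smi, assuming the installed driver is recent enough to support these query options. A sketch that logs it while a work unit is crunching:

```python
# Logs the current SM clock via nvidia-smi (assumes a driver recent enough to
# support --query-gpu; older drivers may not expose these fields).
import subprocess
import time

def sm_clock_mhz(gpu_index: int = 0) -> int:
    out = subprocess.check_output([
        "nvidia-smi", "-i", str(gpu_index),
        "--query-gpu=clocks.sm", "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip())

for _ in range(10):                     # sample once a minute for ten minutes
    print(f"SM clock: {sm_clock_mhz()} MHz")
    time.sleep(60)
```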
skgiven | Joined: 23 Apr 09 | Posts: 3968 | Credit: 1,995,359,260 | RAC: 0
Relative performances of high-end and mid-range reference Keplers at GPUGrid:

100% GTX Titan
90% GTX 780
77% GTX 770
74% GTX 680
59% GTX 670
58% GTX 690 (each GPU)
55% GTX 660 Ti
53% GTX 760
51% GTX 660
43% GTX 650 Ti Boost
33% GTX 650 Ti

FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help
Joined: 8 Mar 12 | Posts: 411 | Credit: 2,083,882,218 | RAC: 0
The graph looks about right. And yes, boost clocks make a complete difference in actual GPU performance. So much so that the table presented, and pretty much all such tables as far as GPUs go, are wobbly at best - and downright lying on their side at worst.
Joined: 1 Mar 10 | Posts: 147 | Credit: 1,077,535,540 | RAC: 0
skgiven wrote:
Relative performances of high-end and mid-range reference Keplers at GPUGrid:

Well, what about the GTX 690 (even if you consider only 1 of its 2 GPU chips)?

Lubuntu 16.04.1 LTS x64
dskagcommunity | Joined: 28 Apr 11 | Posts: 463 | Credit: 959,766,958 | RAC: 142,390
Here is your answer: a GTX 690 does around 80% more work than a GTX 680, so each GTX 690 GPU is about 90% as fast as a GTX 680.

DSKAG Austria Research Team: http://www.research.dskag.at
Joined: 8 Mar 12 | Posts: 411 | Credit: 2,083,882,218 | RAC: 0
My 780 is neck and neck with a Titan. Like I said, it all depends on clock speed.