Message boards : Graphics cards (GPUs) : NVidia GPU Card comparisons in GFLOPS peak
Joined: 17 Feb 13 · Posts: 181 · Credit: 144,871,276 · RAC: 0

Thanks: which is the correct thread?
Joined: 5 Jan 09 · Posts: 670 · Credit: 2,498,095,550 · RAC: 0

"Number Crunching"
Joined: 17 Feb 13 · Posts: 181 · Credit: 144,871,276 · RAC: 0

Many thanks, Betting Slip! Boy, did you ever stir some memories with the team name "Radio Caroline"! :-)
Joined: 28 May 12 · Posts: 63 · Credit: 714,535,121 · RAC: 0

I wish the web site would allow us to mine the dataset, as some apparently have been able to do. I'd love to be able to see the differences in runtimes over a long period between the GPUs we have and the ones we may buy someday. For example, I recently added an EVGA GTX960 (4 GB RAM, about USD $240), and I'd like to compare the performance of these cards on GPU work units directly. At this point I have only the GTX960, two GTX760s, and one GTX550.
**skgiven** · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0

Robert, while running NOELIA_ETQ_bound tasks your GTX960 is ~11.7% faster than your GTX760. There isn't a GTX550, but there is a GTX 550 Ti: 192 CUDA cores on a 40 nm GF116 chip, Compute Capability 2.1, 691 GFLOPS peak (622 with the correction factor applied, albeit back in 2012). A GTX 550 Ti performs at around 15% of the original GTX Titan; to put it another way, your GTX960 would be ~4 times as fast as the GTX 550 Ti.

FAQs · HOW TO: Opt out of Beta Tests · Ask for Help
Joined: 8 Jun 14 · Posts: 18 · Credit: 19,804,091 · RAC: 0

10-06-15 09:59:20 | | CUDA: NVIDIA GPU 0: GeForce GTX TITAN X (driver version 353.12, CUDA version 7.5, compute capability 5.2, 4096MB, 3065MB available, 6611 GFLOPS peak)
10-06-15 09:59:20 | | OpenCL: NVIDIA GPU 0: GeForce GTX TITAN X (driver version 353.12, device version OpenCL 1.2 CUDA, 12288MB, 3065MB available, 6611 GFLOPS peak)
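Incidentally, the "6611 GFLOPS peak" in those log lines is consistent with the usual single-precision peak formula: CUDA cores × 2 FLOP per clock (one fused multiply-add) × clock rate. A minimal sketch for the Titan X line above; the clock value here is back-derived from the reported peak, so treat it as an assumption rather than a driver-reported figure:

```cuda
// Sketch: SP peak GFLOPS = CUDA cores x 2 (an FMA counts as two FLOPs) x clock.
// The 1.076 GHz is assumed: 6611 / (3072 * 2) = ~1.076, which happens to sit
// near the Titan X's nominal boost clock.
#include <cstdio>

int main() {
    const double cuda_cores = 3072.0;  // GTX Titan X (Maxwell, CC 5.2)
    const double clock_ghz  = 1.076;   // back-derived from the reported peak
    printf("SP peak: %.0f GFLOPS\n", cuda_cores * 2.0 * clock_ghz);  // ~6611
    return 0;
}
```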
**skgiven** · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0

This was updated with the GTX 980Ti using limited results from one task type only (subsequent observations show that performance of different cards varies by task type, with some jobs scaling better than others):

| Performance | GPU | Power | GPUGrid Performance/Watt |
|---|---|---|---|
| 211% | GTX Titan Z (both GPUs) | 375W | 141% |
| 156% | GTX Titan X | 250W | 156% |
| 143% | GTX 980Ti | 250W | 143% |
| 116% | GTX 690 (both GPUs) | 300W | 97% |
| 114% | GTX Titan Black | 250W | 114% |
| 112% | GTX 780Ti | 250W | 112% |
| 109% | GTX 980 | 165W | 165% |
| 100% | GTX Titan | 250W | 100% |
| 93% | GTX 970 | 145W | 160% |
| 90% | GTX 780 | 250W | 90% |
| 77% | GTX 770 | 230W | 84% |
| 74% | GTX 680 | 195W | 95% |
| 64% | GTX 960 | 120W | 134% |
| 59% | GTX 670 | 170W | 87% |
| 55% | GTX 660Ti | 150W | 92% |
| 53% | GTX 760 | 130W | 102% |
| 51% | GTX 660 | 140W | 91% |
| 47% | GTX 750Ti | 60W | 196% |
| 43% | GTX 650TiBoost | 134W | 80% |
| 37% | GTX 750 | 55W | 168% |
| 33% | GTX 650Ti | 110W | 75% |

Throughput performances and Performance/Watt are relative to a GTX Titan.

FAQs · HOW TO: Opt out of Beta Tests · Ask for Help
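As a quick sanity check on the last column: a minimal sketch, assuming both relative columns use the GTX Titan (100% throughput at 250 W) as the baseline, reproduces the table's Performance/Watt figures from the other two columns (shown here for four of the cards):

```cuda
// Sketch reproducing the Performance/Watt column above, assuming both relative
// columns are normalized to a GTX Titan: 100% throughput at 250 W.
#include <cstdio>

int main() {
    struct Card { const char *name; double perf_pct; double watts; };
    const Card cards[] = {
        {"GTX Titan X", 156.0, 250.0},
        {"GTX 980",     109.0, 165.0},
        {"GTX 970",      93.0, 145.0},
        {"GTX 750Ti",    47.0,  60.0},
    };
    const double baseline = 100.0 / 250.0;  // GTX Titan: 100% per 250 W
    for (const Card &c : cards)
        // e.g. GTX 750Ti: (47 / 60) / (100 / 250) = 1.96 -> 196%
        printf("%-12s %3.0f%%\n", c.name, 100.0 * (c.perf_pct / c.watts) / baseline);
    return 0;
}
```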
Joined: 20 Jul 14 · Posts: 732 · Credit: 130,089,082 · RAC: 0

Thanks for this new update, skgiven :)

[CSF] Thomas H.V. Dupont
Founder of the team CRUNCHERS SANS FRONTIERES 2.0
www.crunchersansfrontieres
**[AF>Amis des Lapins] Oncle Bob** · Joined: 21 Apr 13 · Posts: 3 · Credit: 54,953,606 · RAC: 0

Hi,

Just a question: is GPUGrid SP or DP? I thought it was SP, but I have done some math and it seems to be DP. I ran some long tasks of ~5,000,000 GFLOP each, and one ran on my GTX 750 Ti OC in 108,000 seconds. That works out to ~46 GFLOPS, which is my card's DP peak. Am I right? If so, why do the Titan and Titan Black, which have strong DP, seem weaker on GPUGrid than high-end GTX 900 cards with 10 times less DP throughput? As I planned to buy a Titan Black for GPUGrid, I'm very interested if you have the explanation. Thanks!
Joined: 11 Jan 13 · Posts: 216 · Credit: 846,538,252 · RAC: 0

I can confirm that this project is indeed SP.
**Retvari Zoltan** · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

> Just a question: is GPUGrid SP or DP?

The GPUGrid app does most of its calculations in SP, on the GPU. The rest (in DP) is done on the CPU.

> I thought it was SP, but I have done some math and it seems to be DP.

No. For a more detailed answer, perhaps you should share your performance calculation, and we could try to figure out what's wrong with it.

> ...why do the Titan and Titan Black, which have strong DP, seem weaker on GPUGrid than high-end GTX 900 cards with 10 times less DP throughput?

A Titan Black is nearly equal to a GTX 780Ti from GPUGrid's point of view. DP cards like the Titan Black usually have lower clocks than their gaming-card equivalents, and/or ECC makes the card's RAM run slower.

> As I planned to buy a Titan Black for GPUGrid...

Don't buy a DP card for GPUGrid. A Titan X is much better for this project, or a GTX980Ti, which has a higher performance/price ratio than a Titan X.

> I'm very interested if you have the explanation.

Now you have it too. :)

> Thanks!

You're welcome.
**[AF>Amis des Lapins] Oncle Bob** · Joined: 21 Apr 13 · Posts: 3 · Credit: 54,953,606 · RAC: 0

> For a more detailed answer, perhaps you should share your performance calculation, and we could try to figure out what's wrong with it.

I am currently running a long task (https://www.gpugrid.net/result.php?resultid=14262417). BOINC estimates its size at 5,000,000 GFLOP, and the ETA is a bit more than 20 hours. The GPU is a GTX 750 Ti, and GPU load is 90-91%. I am not running any tasks on the CPU (an i7 2600K), in order to evaluate the speed of the card on this task alone. If I am right, 5,000,000 GFLOP divided by 72,000 (the number of seconds in 20 hours) = ~69 GFLOPS, yet the GTX 750 Ti is rated at 1,300 GFLOPS (well, a bit more, as mine is slightly overclocked). I expected 5,000,000 / 1,300 = ~3,850 seconds (plus a few more, because I suppose the CPU, which runs slower than the GPU, is a bottleneck). So what's wrong with my understanding of GPUGrid?
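The arithmetic in that post, as a runnable sketch (the 5,000,000 GFLOP figure is BOINC's own size estimate, which later replies show to be unreliable):

```cuda
// Sketch of the naive estimate above: BOINC's task-size estimate divided by
// the card's peak GFLOPS, versus the speed implied by the actual ETA.
#include <cstdio>

int main() {
    const double task_gflop  = 5.0e6;    // BOINC's (questionable) size estimate
    const double peak_gflops = 1300.0;   // GTX 750 Ti SP peak, slightly OC'd
    const double eta_seconds = 72000.0;  // ~20 hours

    printf("ideal runtime: %.0f s\n", task_gflop / peak_gflops);      // ~3846 s
    printf("implied speed: %.1f GFLOPS\n", task_gflop / eta_seconds); // ~69.4
    return 0;
}
```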
**robertmiles** · Joined: 16 Apr 09 · Posts: 503 · Credit: 769,991,668 · RAC: 0

> For a more detailed answer, perhaps you should share your performance calculation, and we could try to figure out what's wrong with it.

For a typical computer, the CPU runs at about 4 times the clock speed of the GPU. However, typical GPU programs are capable of using many of the GPU cores at once. GPUs have a varying number of cores, usually more on the more expensive models, currently with a maximum of about 3,000 GPU cores per GPU. A GTX 750 Ti has 640 GPU cores. A GTX Titan Z board has 5,760 GPU cores, but only because it uses 2 GPUs. A CPU can have multiple cores, with the number being as high as 12 for the most expensive CPUs. However, BOINC CPU workunits usually use only one CPU core each.
Joined: 5 May 13 · Posts: 187 · Credit: 349,254,454 · RAC: 0

> I am currently running a long task (https://www.gpugrid.net/result.php?resultid=14262417). BOINC estimates its size at 5,000,000 GFLOP, and the ETA is a bit more than 20 hours.

If the only factors playing the key roles were the tasks' computational load and the cards' performance, then I guess the tasks' computation times would be roughly equal to [task GFLOP] / [card GFLOP/s]. But of course, this is not an ideal world :) Off the top of my head I can think of the following factors:

- Each task type's need for main-memory accesses. The GPU can perform its operations much more quickly if it doesn't need to access the card's main memory.
- It follows from the above that the size and speed of the in-GPU cache can have a major effect on actual performance.
- PCI-Express performance. The card eventually finishes the computational work it has been assigned, and then a) the results of the computation must be transferred back to the host and b) new computational work must be transferred from the host to the card. During these times the card sits idle, or at least is not too busy doing real work. (A rough timing sketch follows this post.)
- At least for GPUGrid, the CPU has to do some significant work too. For my (new!) GTX 970, the acemd process consumes ~25% of a logical core. This means a) there must be some CPU resources available, and b) the CPU has to provide some "acceptable" level of performance. Doing other CPU-bound computational work (e.g. CPU BOINC tasks) can have a significant effect.
- System memory. Especially if other CPU-bound work is running, all these tasks will need to access main memory frequently, and that access has limited bandwidth. Many tasks needing access simultaneously effectively lower the memory bandwidth available to each task, and the increased load on the memory controller also effectively increases memory latency.
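To put a rough number on the PCI-Express factor in the list above, one can time a transfer with CUDA events. A minimal sketch, not GPUGrid code; the buffer size is arbitrary and results vary by platform:

```cuda
// Sketch: timing one host-to-device transfer with CUDA events, to estimate
// the PCI-Express bandwidth a task's data transfers actually see.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 256 << 20;  // 256 MB test buffer (arbitrary)
    float *host, *dev;
    cudaMallocHost(&host, bytes);    // pinned memory for full PCIe throughput
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.1f ms -> %.1f GB/s\n", ms, bytes / ms / 1e6);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(dev); cudaFreeHost(host);
    return 0;
}
```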
**[AF>Amis des Lapins] Oncle Bob** · Joined: 21 Apr 13 · Posts: 3 · Credit: 54,953,606 · RAC: 0

Thank you, so my idea of a bottleneck was not so far from the truth (it seems there are MANY bottlenecks instead!). It would be great to know exactly what slows down the GPU computation and how we can improve the card's speed.

I run BOINC on an SSD, which is far faster than a classic HDD; would using a RAMDisk improve the task's total speed a bit more? For the RAM, would non-ECC, high-frequency, low-latency RAM help? And for the GPU itself, does memory bus width heavily affect the computation? Current high-end GPUs use a 384-bit bus, and past high-end cards went as high as 512 bits (GTX 280...). Does compute capability affect the speed?

It is very sad to see that average computation speed is so far from the theoretical GFLOPS peak; this is a waste of computational time and energy.
**Retvari Zoltan** · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

> I run BOINC on an SSD, which is far faster than a classic HDD; would using a RAMDisk improve the task's total speed a bit more?

It won't be noticeable, as the speed of the HDD (or SSD) subsystem matters only when a task is starting, ending, or making a checkpoint.

> For the RAM, would non-ECC, high-frequency, low-latency RAM help?

Yes.

> And for the GPU itself, does memory bus width heavily affect the computation? Current high-end GPUs use a 384-bit bus, and past high-end cards went as high as 512 bits (GTX 280...).

Not heavily, but it's noticeable (up to ~10%). The WDDM overhead is much more of a bottleneck, especially for high-end cards: my GTX 980 in a Windows XP host is only 10% slower than a GTX980Ti in a Windows 8 host.

> Does compute capability affect the speed?

What matters is the compute capability the client is written to use. The GPUGrid client is the most recent among BOINC projects, as it is written in CUDA 6.5.

> It is very sad to see that average computation speed is so far from the theoretical GFLOPS peak.

It's not as far as you think. In your calculation you took the 5,000,000 GFLOPs from BOINC Manager / task properties, but this value is incorrect, and so is the result. I suppose from the task's runtime that you had a GERARD_FXCXCL12. This workunit gives 255,000 credits, including the 50% bonus for fast return, so the base credit given for the FLOPs is 170,000. 1 BOINC credit is given for 432 GFLOPs, so the actual number of GFLOPs needed by the task is 73,440,000. Let's redo your calculation with this number (which is still an estimate):

73,440,000 GFLOPs / 1,300 GFLOPS = 56,492.3 seconds = 15h 41m 32.3s
73,440,000 GFLOPs / 72,000 sec = 1,020 GFLOPS

FLOPS stands for FLOating Point Operations per Second (it's the speed of computation).
FLOPs stands for FLOating Point OPerations (it's the total number of operations done).

The 1 BOINC credit per 432 GFLOPs comes from the definition of BOINC credit: 200 cobblestones (= credits) are given for 1 day of work on a 1,000 MFLOPS computer. 1,000 MFLOPS = 1 GFLOPS, so:

200 credits for 24h at 1 GFLOPS
200 credits for 86,400 sec at 1 GFLOPS
200 credits for 86,400 GFLOPs
1 credit for 432 GFLOPs
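The same derivation as a runnable sketch. The 255,000-credit figure and the 1,300 GFLOPS peak come from the posts above; everything else follows from the BOINC credit definition:

```cuda
// Sketch of the credit-based task-size estimate above:
// 1 BOINC credit = 432 GFLOPs (200 credits per 86,400 s at 1 GFLOPS).
#include <cstdio>

int main() {
    const double credits_awarded  = 255000.0;             // incl. 50% fast-return bonus
    const double base_credits     = credits_awarded / 1.5;       // 170,000
    const double gflop_per_credit = 86400.0 / 200.0;             // 432
    const double task_gflop = base_credits * gflop_per_credit;   // 73,440,000

    printf("task size      : %.0f GFLOPs\n", task_gflop);
    printf("at 1300 GFLOPS : %.1f s\n", task_gflop / 1300.0);    // ~56492.3 s
    printf("in 72000 s     : %.0f GFLOPS achieved\n", task_gflop / 72000.0);  // 1020
    return 0;
}
```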
**robertmiles** · Joined: 16 Apr 09 · Posts: 503 · Credit: 769,991,668 · RAC: 0

> For the RAM, would non-ECC, high-frequency, low-latency RAM help?

Especially for the graphics board's RAM; much less for the CPU's RAM (they are separate, except on some low-end graphics boards).
**robertmiles** · Joined: 16 Apr 09 · Posts: 503 · Credit: 769,991,668 · RAC: 0

> PCI-Express performance. The card eventually finishes the computational work it has been assigned, and then a) the results of the computation must be transferred back to the host and b) new computational work must be transferred from the host to the card. During these times the card sits idle, or at least is not too busy doing real work.

Not always true for CUDA workunits. With recent Nvidia GPUs, it's possible for CUDA workunits to transfer data to and from graphics memory at the same time that the GPU is performing calculations on something already in graphics memory. This requires starting some of the kernels asynchronously, though. I don't know whether GPUGRID offers any workunits that do this, or whether this is also possible for OpenCL workunits.
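For reference, the overlap described above is what CUDA streams provide. A minimal sketch, assuming two pinned host buffers and a trivial kernel (illustrative only, not GPUGrid's actual code):

```cuda
// Minimal copy/compute overlap with CUDA streams. Async copies require
// page-locked (pinned) host memory, hence cudaMallocHost.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int N = 1 << 20;
    float *host[2], *dev[2];
    cudaStream_t stream[2];
    for (int k = 0; k < 2; ++k) {
        cudaMallocHost(&host[k], N * sizeof(float));  // pinned host buffer
        for (int i = 0; i < N; ++i) host[k][i] = 1.0f;
        cudaMalloc(&dev[k], N * sizeof(float));
        cudaStreamCreate(&stream[k]);
    }
    // Work queued in different streams may overlap: while stream 0 runs its
    // kernel, stream 1's host-to-device copy can proceed over PCI-Express.
    for (int k = 0; k < 2; ++k) {
        cudaMemcpyAsync(dev[k], host[k], N * sizeof(float),
                        cudaMemcpyHostToDevice, stream[k]);
        scale<<<(N + 255) / 256, 256, 0, stream[k]>>>(dev[k], N, 2.0f);
        cudaMemcpyAsync(host[k], dev[k], N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[k]);
    }
    cudaDeviceSynchronize();  // wait for both streams to drain
    for (int k = 0; k < 2; ++k) {
        cudaStreamDestroy(stream[k]);
        cudaFree(dev[k]);
        cudaFreeHost(host[k]);
    }
    printf("done\n");
    return 0;
}
```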
Joined: 5 May 13 · Posts: 187 · Credit: 349,254,454 · RAC: 0

> Not always true for CUDA workunits. With recent Nvidia GPUs, it's possible for CUDA workunits to transfer data to and from graphics memory at the same time that the GPU is performing calculations on something already in graphics memory. This requires starting some of the kernels asynchronously, though. I don't know whether GPUGRID offers any workunits that do this, or whether this is also possible for OpenCL workunits.

Indeed. I would think that not all of the card's memory is being accessed by the GPU at the same time, so some part(s) of it could be updated without stopping the GPU. But to avoid data corruption you would need exclusive locks at some level (a range of addresses, banks, whatever). Depending mostly on timing and, I would guess, the cleverness of whatever algorithm decides which parts of memory to make available for external changes, these changes could happen without the GPU stopping at all. With such schemes, however, you generally get better latency (in our case, the CPU applying the changes it wants with a shorter delay) but lower overall throughput (both the CPU and the GPU access fewer memory addresses over time).