Message boards : Graphics cards (GPUs) : NVidia GPU Card comparisons in GFLOPS peak
Joined: 17 Feb 13 · Posts: 181 · Credit: 144,871,276 · RAC: 0

Thanks: which is the correct thread?
Joined: 5 Jan 09 · Posts: 670 · Credit: 2,498,095,550 · RAC: 0

"Number Crunching"
Joined: 17 Feb 13 · Posts: 181 · Credit: 144,871,276 · RAC: 0

Many thanks, Betting Slip! Boy, did you ever stir some memories with the team name "Radio Caroline"! :-)
Joined: 28 May 12 · Posts: 63 · Credit: 714,535,121 · RAC: 0

I wish the web site would allow us to mine the dataset, as some apparently have been able to do. I'd love to be able to see the differences in runtimes over a long period between the GPUs we have and the ones we may buy someday. For example, I recently added an EVGA GTX960 (4 GB RAM, about USD $240), and I'd like to compare the performance of these cards on GPU work units directly. At this point I have only the GTX960, two GTX760s, and one GTX550.
**skgiven** · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0

Robert, while running NOELIA_ETQ_bound tasks your GTX960 is ~11.7% faster than your GTX760. There isn't a GTX550, but there is a GTX 550 Ti: 192 CUDA cores on a 40 nm GF116 chip, Compute Capability 2.1, 691 GFLOPS peak (622 with the correction factor applied, albeit back in 2012). A GTX 550 Ti performs at around 15% of the original GTX Titan; to put it another way, your GTX960 would be ~4 times as fast as the GTX 550 Ti.

FAQs · HOW TO: Opt out of Beta Tests · Ask for Help
Joined: 8 Jun 14 · Posts: 18 · Credit: 19,804,091 · RAC: 0

10-06-15 09:59:20 | | CUDA: NVIDIA GPU 0: GeForce GTX TITAN X (driver version 353.12, CUDA version 7.5, compute capability 5.2, 4096MB, 3065MB available, 6611 GFLOPS peak)
10-06-15 09:59:20 | | OpenCL: NVIDIA GPU 0: GeForce GTX TITAN X (driver version 353.12, device version OpenCL 1.2 CUDA, 12288MB, 3065MB available, 6611 GFLOPS peak)
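Incidentally, the "6611 GFLOPS peak" in those log lines is consistent with the usual single-precision peak formula: CUDA cores × 2 FLOP per clock (one fused multiply-add) × clock rate. A minimal sketch for the Titan X line above; the clock value here is back-derived from the reported peak, so treat it as an assumption rather than a driver-reported figure:

```cuda
// Sketch: SP peak GFLOPS = CUDA cores x 2 (an FMA counts as two FLOPs) x clock.
// The 1.076 GHz is assumed: 6611 / (3072 * 2) = ~1.076, which happens to sit
// near the Titan X's nominal boost clock.
#include <cstdio>

int main() {
    const double cuda_cores = 3072.0;  // GTX Titan X (Maxwell, CC 5.2)
    const double clock_ghz  = 1.076;   // back-derived from the reported peak
    printf("SP peak: %.0f GFLOPS\n", cuda_cores * 2.0 * clock_ghz);  // ~6611
    return 0;
}
```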
**skgiven** · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0

This was updated with the GTX 980Ti using limited results from one task type only (subsequent observations show that performance of different cards varies by task type, with some jobs scaling better than others):

| Performance | GPU | Power | GPUGrid Performance/Watt |
|---|---|---|---|
| 211% | GTX Titan Z (both GPUs) | 375W | 141% |
| 156% | GTX Titan X | 250W | 156% |
| 143% | GTX 980Ti | 250W | 143% |
| 116% | GTX 690 (both GPUs) | 300W | 97% |
| 114% | GTX Titan Black | 250W | 114% |
| 112% | GTX 780Ti | 250W | 112% |
| 109% | GTX 980 | 165W | 165% |
| 100% | GTX Titan | 250W | 100% |
| 93% | GTX 970 | 145W | 160% |
| 90% | GTX 780 | 250W | 90% |
| 77% | GTX 770 | 230W | 84% |
| 74% | GTX 680 | 195W | 95% |
| 64% | GTX 960 | 120W | 134% |
| 59% | GTX 670 | 170W | 87% |
| 55% | GTX 660Ti | 150W | 92% |
| 53% | GTX 760 | 130W | 102% |
| 51% | GTX 660 | 140W | 91% |
| 47% | GTX 750Ti | 60W | 196% |
| 43% | GTX 650TiBoost | 134W | 80% |
| 37% | GTX 750 | 55W | 168% |
| 33% | GTX 650Ti | 110W | 75% |

Throughput performances and Performance/Watt are relative to a GTX Titan.

FAQs · HOW TO: Opt out of Beta Tests · Ask for Help
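As a quick sanity check on the last column: a minimal sketch, assuming both relative columns use the GTX Titan (100% throughput at 250 W) as the baseline, reproduces the table's Performance/Watt figures from the other two columns (shown here for four of the cards):

```cuda
// Sketch reproducing the Performance/Watt column above, assuming both relative
// columns are normalized to a GTX Titan: 100% throughput at 250 W.
#include <cstdio>

int main() {
    struct Card { const char *name; double perf_pct; double watts; };
    const Card cards[] = {
        {"GTX Titan X", 156.0, 250.0},
        {"GTX 980",     109.0, 165.0},
        {"GTX 970",      93.0, 145.0},
        {"GTX 750Ti",    47.0,  60.0},
    };
    const double baseline = 100.0 / 250.0;  // GTX Titan: 100% per 250 W
    for (const Card &c : cards)
        // e.g. GTX 750Ti: (47 / 60) / (100 / 250) = 1.96 -> 196%
        printf("%-12s %3.0f%%\n", c.name, 100.0 * (c.perf_pct / c.watts) / baseline);
    return 0;
}
```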
Joined: 20 Jul 14 · Posts: 732 · Credit: 130,089,082 · RAC: 0

Thanks for this new update, skgiven :)

[CSF] Thomas H.V. Dupont
Founder of the team CRUNCHERS SANS FRONTIERES 2.0
www.crunchersansfrontieres
**[AF>Amis des Lapins] Oncle Bob** · Joined: 21 Apr 13 · Posts: 3 · Credit: 54,953,606 · RAC: 0

Hi,

Just a question: is GPUGrid SP or DP? I thought it was SP, but I have done some math and it seems to be DP. I ran some long tasks of ~5,000,000 GFLOP each, and one ran on my GTX 750 Ti OC in 108,000 seconds. That works out to ~46 GFLOPS, which is my card's DP peak. Am I right? If so, why do the Titan and Titan Black, which have strong DP, seem weaker on GPUGrid than high-end GTX 900 cards with 10 times less DP throughput? As I planned to buy a Titan Black for GPUGrid, I'm very interested if you have the explanation. Thanks!
Joined: 11 Jan 13 · Posts: 216 · Credit: 846,538,252 · RAC: 0

I can confirm that this project is indeed SP.
**Retvari Zoltan** · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

> Just a question: is GPUGrid SP or DP?

The GPUGrid app does most of its calculations in SP, on the GPU. The rest (in DP) is done on the CPU.

> I thought it was SP, but I have done some math and it seems to be DP.

No. For a more detailed answer, perhaps you should share your performance calculation, and we could try to figure out what's wrong with it.

> ...why do the Titan and Titan Black, which have strong DP, seem weaker on GPUGrid than high-end GTX 900 cards with 10 times less DP throughput?

A Titan Black is nearly equal to a GTX 780Ti from GPUGrid's point of view. DP cards like the Titan Black usually have lower clocks than their gaming-card equivalents, and/or ECC makes the card's RAM run slower.

> As I planned to buy a Titan Black for GPUGrid...

Don't buy a DP card for GPUGrid. A Titan X is much better for this project, or a GTX980Ti, which has a higher performance/price ratio than a Titan X.

> I'm very interested if you have the explanation.

Now you have it too. :)

> Thanks!

You're welcome.
**[AF>Amis des Lapins] Oncle Bob** · Joined: 21 Apr 13 · Posts: 3 · Credit: 54,953,606 · RAC: 0

> For a more detailed answer, perhaps you should share your performance calculation, and we could try to figure out what's wrong with it.

I am currently running a long task (https://www.gpugrid.net/result.php?resultid=14262417). BOINC estimates its size at 5,000,000 GFLOP, and the ETA is a bit more than 20 hours. The GPU is a GTX 750 Ti, and GPU load is 90-91%. I am not running any tasks on the CPU (an i7 2600K), in order to evaluate the speed of the card on this task alone. If I am right, 5,000,000 GFLOP divided by 72,000 (the number of seconds in 20 hours) = ~69 GFLOPS, yet the GTX 750 Ti is rated at 1,300 GFLOPS (well, a bit more, as mine is slightly overclocked). I expected 5,000,000 / 1,300 = ~3,850 seconds (plus a few more, because I suppose the CPU, which runs slower than the GPU, is a bottleneck). So what's wrong with my understanding of GPUGrid?
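The arithmetic in that post, as a runnable sketch (the 5,000,000 GFLOP figure is BOINC's own size estimate, which later replies show to be unreliable):

```cuda
// Sketch of the naive estimate above: BOINC's task-size estimate divided by
// the card's peak GFLOPS, versus the speed implied by the actual ETA.
#include <cstdio>

int main() {
    const double task_gflop  = 5.0e6;    // BOINC's (questionable) size estimate
    const double peak_gflops = 1300.0;   // GTX 750 Ti SP peak, slightly OC'd
    const double eta_seconds = 72000.0;  // ~20 hours

    printf("ideal runtime: %.0f s\n", task_gflop / peak_gflops);      // ~3846 s
    printf("implied speed: %.1f GFLOPS\n", task_gflop / eta_seconds); // ~69.4
    return 0;
}
```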
**robertmiles** · Joined: 16 Apr 09 · Posts: 503 · Credit: 769,991,668 · RAC: 0

> For a more detailed answer, perhaps you should share your performance calculation, and we could try to figure out what's wrong with it.

For a typical computer, the CPU runs at about 4 times the clock speed of the GPU. However, typical GPU programs are capable of using many of the GPU cores at once. GPUs have a varying number of cores, usually more on the more expensive models, currently with a maximum of about 3,000 GPU cores per GPU. A GTX 750 Ti has 640 GPU cores. A GTX Titan Z board has 5,760 GPU cores, but only because it uses 2 GPUs. A CPU can have multiple cores, with the number being as high as 12 for the most expensive CPUs. However, BOINC CPU workunits usually use only one CPU core each.
Joined: 5 May 13 · Posts: 187 · Credit: 349,254,454 · RAC: 0

> I am currently running a long task (https://www.gpugrid.net/result.php?resultid=14262417). BOINC estimates its size at 5,000,000 GFLOP, and the ETA is a bit more than 20 hours.

If the only factors playing the key roles were the tasks' computational load and the cards' performance, then I guess the tasks' computation times would be roughly equal to [task GFLOP] / [card GFLOP/s]. But of course, this is not an ideal world :) Off the top of my head I can think of the following factors:

- Each task type's need for main-memory accesses. The GPU can perform its operations much more quickly if it doesn't need to access the card's main memory.
- It follows from the above that the size and speed of the in-GPU cache can have a major effect on actual performance.
- PCI-Express performance. The card eventually finishes the computational work it has been assigned, and then a) the results of the computation must be transferred back to the host and b) new computational work must be transferred from the host to the card. During these times the card sits idle, or at least is not too busy doing real work. (A rough timing sketch follows this post.)
- At least for GPUGrid, the CPU has to do some significant work too. For my (new!) GTX 970, the acemd process consumes ~25% of a logical core. This means a) there must be some CPU resources available, and b) the CPU has to provide some "acceptable" level of performance. Doing other CPU-bound computational work (e.g. CPU BOINC tasks) can have a significant effect.
- System memory. Especially if other CPU-bound work is running, all these tasks will need to access main memory frequently, and that access has limited bandwidth. Many tasks needing access simultaneously effectively lower the memory bandwidth available to each task, and the increased load on the memory controller also effectively increases memory latency.
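To put a rough number on the PCI-Express factor in the list above, one can time a transfer with CUDA events. A minimal sketch, not GPUGrid code; the buffer size is arbitrary and results vary by platform:

```cuda
// Sketch: timing one host-to-device transfer with CUDA events, to estimate
// the PCI-Express bandwidth a task's data transfers actually see.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 256 << 20;  // 256 MB test buffer (arbitrary)
    float *host, *dev;
    cudaMallocHost(&host, bytes);    // pinned memory for full PCIe throughput
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.1f ms -> %.1f GB/s\n", ms, bytes / ms / 1e6);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(dev); cudaFreeHost(host);
    return 0;
}
```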
**[AF>Amis des Lapins] Oncle Bob** · Joined: 21 Apr 13 · Posts: 3 · Credit: 54,953,606 · RAC: 0

Thank you, so my idea of a bottleneck was not so far from the truth (it seems there are MANY bottlenecks instead!). It would be great to know exactly what slows down the GPU computation and how we can improve the card's speed.

I run BOINC on an SSD, which is far faster than a classic HDD; would using a RAMDisk improve the task's total speed a bit more? For the RAM, would non-ECC, high-frequency, low-latency RAM help? And for the GPU itself, does memory bus width heavily affect the computation? Current high-end GPUs use a 384-bit bus, and past high-end cards went as high as 512 bits (GTX 280...). Does compute capability affect the speed?

It is very sad to see that average computation speed is so far from the theoretical GFLOPS peak; this is a waste of computational time and energy.
**Retvari Zoltan** · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

> I run BOINC on an SSD, which is far faster than a classic HDD; would using a RAMDisk improve the task's total speed a bit more?

It won't be noticeable, as the speed of the HDD (or SSD) subsystem matters only when a task is starting, ending, or making a checkpoint.

> For the RAM, would non-ECC, high-frequency, low-latency RAM help?

Yes.

> And for the GPU itself, does memory bus width heavily affect the computation? Current high-end GPUs use a 384-bit bus, and past high-end cards went as high as 512 bits (GTX 280...).

Not heavily, but it's noticeable (up to ~10%). The WDDM overhead is much more of a bottleneck, especially for high-end cards: my GTX 980 in a Windows XP host is only 10% slower than a GTX980Ti in a Windows 8 host.

> Does compute capability affect the speed?

What matters is the compute capability the client is written to use. The GPUGrid client is the most recent among BOINC projects, as it is written in CUDA 6.5.

> It is very sad to see that average computation speed is so far from the theoretical GFLOPS peak.

It's not as far as you think. In your calculation you took the 5,000,000 GFLOPs from BOINC Manager / task properties, but this value is incorrect, and so is the result. I suppose from the task's runtime that you had a GERARD_FXCXCL12. This workunit gives 255,000 credits, including the 50% bonus for fast return, so the base credit given for the FLOPs is 170,000. 1 BOINC credit is given for 432 GFLOPs, so the actual number of GFLOPs needed by the task is 73,440,000. Let's redo your calculation with this number (which is still an estimate):

73,440,000 GFLOPs / 1,300 GFLOPS = 56,492.3 seconds = 15h 41m 32.3s
73,440,000 GFLOPs / 72,000 sec = 1,020 GFLOPS

FLOPS stands for FLOating Point Operations per Second (it's the speed of computation).
FLOPs stands for FLOating Point OPerations (it's the total number of operations done).

The 1 BOINC credit per 432 GFLOPs comes from the definition of BOINC credit: 200 cobblestones (= credits) are given for 1 day of work on a 1,000 MFLOPS computer. 1,000 MFLOPS = 1 GFLOPS, so:

200 credits for 24h at 1 GFLOPS
200 credits for 86,400 sec at 1 GFLOPS
200 credits for 86,400 GFLOPs
1 credit for 432 GFLOPs
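The same derivation as a runnable sketch. The 255,000-credit figure and the 1,300 GFLOPS peak come from the posts above; everything else follows from the BOINC credit definition:

```cuda
// Sketch of the credit-based task-size estimate above:
// 1 BOINC credit = 432 GFLOPs (200 credits per 86,400 s at 1 GFLOPS).
#include <cstdio>

int main() {
    const double credits_awarded  = 255000.0;             // incl. 50% fast-return bonus
    const double base_credits     = credits_awarded / 1.5;       // 170,000
    const double gflop_per_credit = 86400.0 / 200.0;             // 432
    const double task_gflop = base_credits * gflop_per_credit;   // 73,440,000

    printf("task size      : %.0f GFLOPs\n", task_gflop);
    printf("at 1300 GFLOPS : %.1f s\n", task_gflop / 1300.0);    // ~56492.3 s
    printf("in 72000 s     : %.0f GFLOPS achieved\n", task_gflop / 72000.0);  // 1020
    return 0;
}
```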
**robertmiles** · Joined: 16 Apr 09 · Posts: 503 · Credit: 769,991,668 · RAC: 0

> For the RAM, would non-ECC, high-frequency, low-latency RAM help?

Especially for the graphics board's RAM; much less for the CPU's RAM (they are separate, except on some low-end graphics boards).
**robertmiles** · Joined: 16 Apr 09 · Posts: 503 · Credit: 769,991,668 · RAC: 0

> PCI-Express performance. The card eventually finishes the computational work it has been assigned, and then a) the results of the computation must be transferred back to the host and b) new computational work must be transferred from the host to the card. During these times the card sits idle, or at least is not too busy doing real work.

Not always true for CUDA workunits. With recent Nvidia GPUs, it's possible for CUDA workunits to transfer data to and from graphics memory at the same time that the GPU is performing calculations on something already in graphics memory. This requires starting some of the kernels asynchronously, though. I don't know whether GPUGRID offers any workunits that do this, or whether this is also possible for OpenCL workunits.
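For reference, the overlap described above is what CUDA streams provide. A minimal sketch, assuming two pinned host buffers and a trivial kernel (illustrative only, not GPUGrid's actual code):

```cuda
// Minimal copy/compute overlap with CUDA streams. Async copies require
// page-locked (pinned) host memory, hence cudaMallocHost.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int N = 1 << 20;
    float *host[2], *dev[2];
    cudaStream_t stream[2];
    for (int k = 0; k < 2; ++k) {
        cudaMallocHost(&host[k], N * sizeof(float));  // pinned host buffer
        for (int i = 0; i < N; ++i) host[k][i] = 1.0f;
        cudaMalloc(&dev[k], N * sizeof(float));
        cudaStreamCreate(&stream[k]);
    }
    // Work queued in different streams may overlap: while stream 0 runs its
    // kernel, stream 1's host-to-device copy can proceed over PCI-Express.
    for (int k = 0; k < 2; ++k) {
        cudaMemcpyAsync(dev[k], host[k], N * sizeof(float),
                        cudaMemcpyHostToDevice, stream[k]);
        scale<<<(N + 255) / 256, 256, 0, stream[k]>>>(dev[k], N, 2.0f);
        cudaMemcpyAsync(host[k], dev[k], N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[k]);
    }
    cudaDeviceSynchronize();  // wait for both streams to drain
    for (int k = 0; k < 2; ++k) {
        cudaStreamDestroy(stream[k]);
        cudaFree(dev[k]);
        cudaFreeHost(host[k]);
    }
    printf("done\n");
    return 0;
}
```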
Joined: 5 May 13 · Posts: 187 · Credit: 349,254,454 · RAC: 0

> Not always true for CUDA workunits. With recent Nvidia GPUs, it's possible for CUDA workunits to transfer data to and from graphics memory at the same time that the GPU is performing calculations on something already in graphics memory. This requires starting some of the kernels asynchronously, though. I don't know whether GPUGRID offers any workunits that do this, or whether this is also possible for OpenCL workunits.

Indeed. I would think that not all of the card's memory is being accessed by the GPU at the same time, so some part(s) of it could be updated without stopping the GPU. But to avoid data corruption you would need exclusive locks at some level (a range of addresses, banks, whatever). Depending mostly on timing and, I would guess, the cleverness of whatever algorithm decides which parts of memory to make available for external changes, these changes could happen without the GPU stopping at all. With such schemes, however, you generally get better latency (in our case, the CPU applying the changes it wants with a shorter delay) but lower overall throughput (both the CPU and the GPU access fewer memory addresses over time).