Ampere 10496 & 8704 & 5888 fp32 cores!

Author	Message
eXaPower Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0	Message 55227 - Posted: 1 Sep 2020, 23:03:41 UTC
Incredible amount of compute power compared to Turing! 38 TFLOPS single (32-bit) precision for the 3090, 30 TFLOPS for the 3080, and 20 TFLOPS for the 3070. New architecture details are forthcoming, and plenty of websites have current info. The process is a new Samsung(?) 8nm lithography with 28 billion transistors on the 3090; die size is currently unknown. The previous 12nm process was TSMC "FFN". Memory clocks are faster than Turing, and the cards are PCIe 4.0. Ampere boost should be around 2 GHz, as the PCIe 3.0 Pascal and Turing cards routinely managed. Will GPUGRID be ready for this newest generation? I'll be an early adopter. I'm looking at a couple of 3080s, since the 3090 offers 1800 more cores for a $700 premium, and I'd put the savings toward a new PCIe 4.0 motherboard and a 12+ core CPU.
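The TFLOPS figures in this post follow from the usual peak-throughput formula: cores x 2 floating-point operations per clock (one fused multiply-add) x clock speed. A rough sketch, where the core counts come from the thread title, the 1.7 GHz boost clock is an assumption rather than a figure from the post, and the function name is made up for illustration:

```python
def peak_fp32_tflops(cores: int, clock_ghz: float) -> float:
    """Theoretical peak FP32 throughput in TFLOPS.

    Counts one fused multiply-add (2 floating-point operations)
    per core per clock; real workloads rarely reach this peak.
    """
    return cores * 2 * clock_ghz / 1000.0


# Assumption: a 1.7 GHz boost clock for every card (not stated in the post).
ASSUMED_BOOST_GHZ = 1.7

# Core counts from the thread title.
CARDS = {"RTX 3090": 10496, "RTX 3080": 8704, "RTX 3070": 5888}

for name, cores in CARDS.items():
    tflops = peak_fp32_tflops(cores, ASSUMED_BOOST_GHZ)
    print(f"{name}: {tflops:.1f} TFLOPS at {ASSUMED_BOOST_GHZ} GHz")
```

With that assumed clock the 3080 and 3070 land close to the quoted 30 and 20 TFLOPS, while the 3090 comes out a little under 38; the result scales linearly with whatever clock the cards actually sustain.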

Keith Myers Joined: 13 Dec 17 Posts: 1423 Credit: 9,186,946,190 RAC: 1,288,374	Message 55229 - Posted: 2 Sep 2020, 1:15:32 UTC - in response to Message 55227.
I didn't see any publication of the actual memory clocks, just the bandwidths. Will wait and see what the actual specs and test results are once the cards are in the hands of testers. Just because the cards will be great pixel pushers for ray-tracing games doesn't mean they will produce commensurate compute improvements. And fully utilizing the new architectural differences between Ampere and Turing/Pascal means we will need new applications.

eXaPower Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0	Message 55230 - Posted: 2 Sep 2020, 1:45:33 UTC - in response to Message 55229. Last modified: 2 Sep 2020, 1:49:26 UTC
Is Integer32 performance the same as fp32, 1:1 as with fp32 and int32 on Turing? On gaming: the hardest pixel-pushing 4K ray tracing is currently in the 3rd iteration of the Metro game. Certainly others are crisp too. Another game, demo, or video at 8K will showcase Ampere's chops. I have 4K now, and 8K monitors are right around the corner for mainstream purchase. 2020 bleeding edge. I was surprised at the core counts: 10k, 8k, and nearly 6k on the 3070. Never mind Ampere tensors or fp64 performance.

rod4x4 Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0	Message 55231 - Posted: 2 Sep 2020, 2:04:00 UTC - in response to Message 55229.
"And to fully utilize the new architectural differences between Ampere and Turing/Pascal means we will need new applications."
The idea of going to a "Wrapper" with ACEMD3 was to enable easier development of apps for new CUDA / architectural releases. Interested to see how easy this path will be... The holdup may be Nvidia and how fast they release the next CUDA Toolkit.

Keith Myers Joined: 13 Dec 17 Posts: 1423 Credit: 9,186,946,190 RAC: 1,288,374	Message 55232 - Posted: 2 Sep 2020, 6:29:59 UTC - in response to Message 55231.
The trick will be to use all the new hardware for the best parallelization of the current and future searches. If you are a game designer, you have new SDKs for gaming. But is there going to be a compute SDK right after release? The best scenario would be yes, and even better would be some new automatic profilers for compute loads. That way you could just input the current source code and the profiler would spit out new code optimized for the new hardware resources in Ampere. Then look at the generated code and iterate on another revision that is better and faster. Rinse and repeat.

eXaPower Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0	Message 55237 - Posted: 2 Sep 2020, 11:22:40 UTC
What concerns me: will Turing's quality control issues crop up on Ampere? Out of the 6 Turing GPUs I purchased, 4 died: a Gigabyte 2070 in a day, a Zotac 2060 after 5 months, and an EVGA 2080 in 2 months, with another EVGA 2080 enduring 2 years. All had a 3-year warranty. I sold all the warranty replacements since I didn't want any more Turings due to their high death rate. I also had a Pascal 1080 die after 27 months. My EVGA 1070 is still holding on after 4 years of running 24/7; it would have been retired if the Turings had lasted. And none of my Maxwells quit; they were just retired.

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 55238 - Posted: 2 Sep 2020, 12:40:29 UTC - in response to Message 55227. Last modified: 2 Sep 2020, 12:48:43 UTC
I expect that the CUDA cores usable for crunching will be only half of the number stated in the title of this thread, similar to the GF116 architecture, where only 2/3 of the cores could be used for crunching (due to the 4/6 dispatch unit to CUDA core ratio). I think the computing performance relative to the RTX 2080Ti will be the following:
card          cores (stated)   cores (usable)   performance
RTX 2080Ti    4352             4352             100.0%
RTX 3090      10496            5248             120.6%
RTX 3080      8704             4352             100.0%
RTX 3070      5888             2944             67.6%
Perhaps a bit (say 10%) more, taking other factors into consideration.
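The percentages in this estimate can be reproduced by halving the stated core counts and dividing by the 2080Ti's 4352 cores. A minimal sketch of that arithmetic, assuming (as the post does) that performance scales only with usable cores and ignoring clocks and memory bandwidth; the function and constant names are illustrative:

```python
BASELINE_CORES = 4352  # RTX 2080Ti, the 100% reference in the post

# Stated core counts from the thread title.
STATED_CORES = {"RTX 3090": 10496, "RTX 3080": 8704, "RTX 3070": 5888}


def relative_performance(stated_cores: int, usable_fraction: float = 0.5) -> float:
    """Estimated performance relative to the 2080Ti, in percent.

    Assumes throughput scales linearly with usable cores only.
    """
    usable = stated_cores * usable_fraction
    return 100.0 * usable / BASELINE_CORES


for card, cores in STATED_CORES.items():
    print(f"{card}: {cores // 2} usable cores, {relative_performance(cores):.1f}%")
```

Changing `usable_fraction` shows how sensitive the estimate is to the halving assumption; at 1.0 the 3090 would come out at roughly 2.4 times the 2080Ti.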

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 423,674	Message 55239 - Posted: 2 Sep 2020, 15:23:03 UTC - in response to Message 55238.
I think the CUDA core counts might be a bit of marketing magic. They claimed in the article that the Ampere cores can do 2 operations per clock, which effectively doubles the work done on a single physical core. We’ll see how it shakes out for compute work. I’ll probably grab a 3070 when they are released to compare to my 2080ti and existing 2070s. I’m very interested in the performance per watt of the new cards. Don’t forget that while performance is looking great, all the new cards are seeing a healthy increase in power consumption as well. The 2080ti was a 250W card, the 2080 was a 215W card, and the 2070 was a 175W card. Now the new cards are 350W/320W/220W for the 3090/3080/3070 respectively. If the 3070 performs comparably to a 2080ti, that’s about 12% less power for the same work (220W vs 250W), which is believable. For CUDA compute work, the 2070 was as fast as or a little faster than a 1080ti at much less power draw (175W vs 250W). So I could see the same thing happening again for the 3070 vs the 2080ti.
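The roughly 12% figure comes from the two board power ratings alone. A small sketch of that comparison, assuming equal performance between the two cards (the post's hypothetical) and treating rated board power as actual draw; the helper names are illustrative:

```python
def power_saving_pct(old_watts: float, new_watts: float) -> float:
    """Percent less power drawn by the new card for the same work."""
    return 100.0 * (1.0 - new_watts / old_watts)


def perf_per_watt_gain_pct(old_watts: float, new_watts: float,
                           relative_perf: float = 1.0) -> float:
    """Percent gain in performance per watt.

    relative_perf is new-card performance divided by old-card
    performance; 1.0 means the two cards are equally fast.
    """
    return 100.0 * (relative_perf * old_watts / new_watts - 1.0)


# 3070 (220W) vs 2080ti (250W), assumed equally fast.
print(f"3070 vs 2080ti: {power_saving_pct(250, 220):.1f}% less power, "
      f"{perf_per_watt_gain_pct(250, 220):.1f}% more work per watt")

# 2070 (175W) vs 1080ti (250W), the earlier generation jump cited in the post.
print(f"2070 vs 1080ti: {power_saving_pct(250, 175):.1f}% less power, "
      f"{perf_per_watt_gain_pct(250, 175):.1f}% more work per watt")
```

Note the two measures differ: 12% less power at equal speed is about 13.6% more work per watt, so it matters which one is being quoted.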

Keith Myers Joined: 13 Dec 17 Posts: 1423 Credit: 9,186,946,190 RAC: 1,288,374	Message 55242 - Posted: 2 Sep 2020, 16:08:47 UTC - in response to Message 55238.
"I expect that the CUDA cores could be used for crunching will be only the half of that stated in the name of this thread."
I think you will be correct too. I saw the published CUDA core counts and thought "marketing nonsense". Unless they fundamentally changed the architecture design, I think they just doubled the physical core counts by way of the new 2 operands per cycle PAM memory operations.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 423,674	Message 55243 - Posted: 3 Sep 2020, 19:47:50 UTC - in response to Message 55238.
Did some more reading and I think I got a better grasp on the whole CUDA core count issue and relative performance. What’s being left to the footnotes is the comment that Jensen made in the announcement: “two instructions per clock”. What he didn’t say was how the SM partitioning works to allow this to happen, and that the doubling affects FP32 calculations only. They have a new datapath design where each SM partition can handle either 32x FP32 (which is double Turing) OR 16x FP32 and 16x INT32. For graphics loads, it’s mostly FP32 so this works to their advantage, but for INT32 workloads, you won’t have this doubling effect, and the performance will be closer to Turing, with the normal generational efficiency boost. A full size Ampere GPU core (GA100) is 128 SMs with 64 CUDA cores per SM, for a highest possible CUDA core count of 8192. The 3090, 3080, and 3070 are not full size Ampere cores. Given that the marketing number of the 3090 is “10496” cores, we can surmise that it’s really 5248 CUDA cores with 82 SMs. This is still more SMs and cores than the 2080ti has (68 SMs / 4352 cores). However, the 3070 with its “5888” cores really has 2944 cores. So it will depend on your workload. If you’re only living in the bubble of gaming, then yeah, enjoy that 2x boost. But for other non-FP32 loads (like more purely computational loads), you’re likely to see only normal generational improvements with Ampere over Turing.
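The SM counts in this analysis can be checked by halving each marketed core count and dividing by 64 cores per SM. A short sketch of that derivation, taking the post's 64-cores-per-SM figure and its halving premise as given; the function name is illustrative:

```python
CORES_PER_SM = 64  # per-SM figure used in the post


def physical_layout(marketed_cores: int, doubled: bool = True) -> tuple[int, int]:
    """Return (physical cores, SM count) for a marketed core count.

    If doubled is True, the marketed number is assumed to count each
    physical core twice, as the post argues for Ampere.
    """
    physical = marketed_cores // 2 if doubled else marketed_cores
    sms, remainder = divmod(physical, CORES_PER_SM)
    if remainder:
        raise ValueError(f"{physical} cores is not a whole number of SMs")
    return physical, sms


for name, marketed in {"RTX 3090": 10496, "RTX 3080": 8704, "RTX 3070": 5888}.items():
    cores, sms = physical_layout(marketed)
    print(f"{name}: {cores} cores in {sms} SMs")

# Turing's count is taken at face value (no doubling).
print("RTX 2080ti:", physical_layout(4352, doubled=False))

# Full-size GA100 as described in the post: 128 SMs x 64 cores.
print("GA100:", 128 * CORES_PER_SM, "cores")
```

Under these assumptions the 3080 works out to the same 68 SMs and 4352 cores as the 2080ti, which is why the halved estimate puts the two on par.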

rod4x4 Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0	Message 55246 - Posted: 3 Sep 2020, 23:28:58 UTC - in response to Message 55243.
"Did some more reading and I think I got a better grasp on the whole CUDA core count issue and relative performance. What’s being left to the footnotes is the comment that Jensen made in the announcement: “two instructions per clock”. What he didn’t say was how the SM partitioning works to allow this to happen, and that the doubling affects FP32 calculations only. They have a new datapath design where each SM partition can handle either 32x FP32 (which is double Turing) OR 16x FP32 and 16x INT32. For graphics loads, it’s mostly FP32 so this works to their advantage, but for INT32 workloads, you won’t have this doubling effect, and the performance will be closer to Turing, with the normal generational efficiency boost. A full size Ampere GPU core (GA100) is 128 SMs with 64 CUDA cores per SM, for a highest possible CUDA core count of 8192. The 3090, 3080, and 3070 are not full size Ampere cores. Given that the marketing number of the 3090 is “10496” cores, we can surmise that it’s really 5248 CUDA cores with 82 SMs. This is still more SMs and cores than the 2080ti has (68 SMs / 4352 cores). However, the 3070 with its “5888” cores really has 2944 cores. So it will depend on your workload. If you’re only living in the bubble of gaming, then yeah, enjoy that 2x boost. But for other non-FP32 loads (like more purely computational loads), you’re likely to see only normal generational improvements with Ampere over Turing."
Thanks for the analysis. We will have to wait and see how they assign SMs to each RTX model to gauge the performance increase for each one. Nvidia will not give their technology away, so I suspect you are right in saying it will only be a generational performance increase for us.

kain Joined: 3 Sep 14 Posts: 152 Credit: 918,557,369 RAC: 0	Message 55279 - Posted: 11 Sep 2020, 18:50:56 UTC
Well, I'm optimistic :) https://wccftech.com/nvidia-geforce-rtx-3080-flagship-2x-faster-than-rtx-2080-in-opencl-cuda-benchmarks/

Pop Piasa Joined: 8 Aug 19 Posts: 252 Credit: 458,054,251 RAC: 0	Message 55281 - Posted: 12 Sep 2020, 2:27:45 UTC
Hopefully, the Amperes will be showing up on the Passmark GPU Direct Compute ratings soon. https://www.videocardbenchmark.net/directCompute.html

eXaPower Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0	Message 55292 - Posted: 14 Sep 2020, 21:55:21 UTC
https://www.tomshardware.com/features/nvidia-ampere-architecture-deep-dive
Real fp32 cores confirmed: 64 fp32 per SM, with 32 cores for fp32 only and the remaining 32 cores handling int32 or fp32 concurrently. See the Ampere slides. Compute benchmarks released (tomorrow?) will show how well the new fp32 design performs. Ampere Integer32 performance is 50-66% of fp32, depending on code efficiency. Consumer Ampere GA102/104 double precision (fp64, 2 units per SM) is now at a 1/64 ratio to fp32. Turing has 1/32 (2 fp64 units per SM).
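One way to see where an integer rate of 50-66% of fp32 could come from is an idealized lane-counting model of the design described in this thread: half the lanes run fp32 only, and the other half run either fp32 or int32. The sketch below is a simplification under that assumption, not Nvidia's scheduler; it ignores instruction dependencies, issue limits, and memory, and the 16 + 16 lane split per SM partition is taken from the earlier analysis in the thread:

```python
def relative_throughput(int_fraction: float,
                        fp_only_lanes: int = 16,
                        shared_lanes: int = 16) -> float:
    """Throughput of a mixed FP32/INT32 stream relative to pure FP32.

    Idealized model: every instruction needs one lane for one cycle,
    INT32 instructions may only use the shared lanes, and FP32
    instructions may use any lane.
    """
    if not 0.0 <= int_fraction <= 1.0:
        raise ValueError("int_fraction must be between 0 and 1")
    total_lanes = fp_only_lanes + shared_lanes
    # Cycles per instruction if all lanes could be kept busy.
    cycles_all_lanes = 1.0 / total_lanes
    # Cycles per instruction needed just to drain the INT32 share
    # through the shared lanes.
    cycles_int_only = int_fraction / shared_lanes
    cycles = max(cycles_all_lanes, cycles_int_only)
    return cycles_all_lanes / cycles


for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{frac:.0%} INT32 -> {relative_throughput(frac):.0%} of pure FP32 rate")
```

In this model a pure-integer stream runs at half the fp32 peak and a stream that is three quarters integer runs at about two thirds, which brackets the quoted range; mixes with up to half integer instructions lose nothing.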

eXaPower Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0	Message 55302 - Posted: 16 Sep 2020, 21:12:25 UTC
https://www.tomshardware.com/news/nvidia-geforce-rtx-3080-review
Tom's will reveal their compute benchmarks soon and show detailed power profiles. Definitely an upgrade if you own a 2080ti. Are you purchasing an Ampere? If so, which? For me the 3080 and 3070 look good. The RTX 3090 is the ultra halo card, with an awful per-core cost compared to the 3080. Note: the RTX 3080 Founders Edition's real power consumption is similar to an overclocked non-Founders 2080ti, demanding 330 watts. Curious what the non-Founders RTX 3090 pulls - my guess is ~400W overclocked.
Edit: noticed I made a mistake in my previous post - meant to write that Ampere has 128 fp32 cores per SM.

Keith Myers Joined: 13 Dec 17 Posts: 1423 Credit: 9,186,946,190 RAC: 1,288,374	Message 55304 - Posted: 17 Sep 2020, 20:25:25 UTC - in response to Message 55302.
So far all the cards are power limited. Almost no overclocking potential. Maybe 2-3%. Doubt the 3090 cards will be any different.

eXaPower Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0	Message 55351 - Posted: 25 Sep 2020, 12:57:08 UTC - in response to Message 55304.
"So far all the cards are power limited. Almost no overclocking potential. Maybe 2-3%. Doubt the 3090 cards will be any different."
You are right: even with enough power, the OC wall is 2GHz, and many forums are reporting Ampere 3080s crashing at 2GHz. The Ampere release is eerily similar to Turing's. Remember when early model 2080 and 2080ti cards were glitching out at stock clocks? Ampere's improved Founders cooling didn't help. It might be memory related, due to the new technology, and/or quality control with the massive dies. Die density on 12nm Turing TU102 is 25M transistors per mm²; 8nm GA102 has 45M per mm². 7nm and 5nm are denser still. Finding a 3080 is another story, with limited availability. It is so bad that Nvidia released a statement about availability. Published RTX 3090 reviews show overclocked power consumption at 450W with a 480W power limit, and 360W at out-of-the-box clocks. A big power increase for little performance gain. Two 2080ti cards had three 8-pin power connectors: the MSI Lightning and the Galax. Now all 3090s and most 3080s have three 8-pins, or the new 12-pin on Founders cards.
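The density figures here, combined with the 28 billion transistor count given in the first post of the thread, allow a rough estimate of the die size that was unknown at the start. A back-of-the-envelope sketch, assuming the quoted densities are averages over the whole die and are per square millimetre; the names are illustrative:

```python
def die_area_mm2(transistors: float, density_per_mm2: float) -> float:
    """Estimated die area in mm^2 from transistor count and average density."""
    return transistors / density_per_mm2


GA102_TRANSISTORS = 28e9  # 3090, from the first post in the thread
GA102_DENSITY = 45e6      # transistors per mm^2 (8nm), from this post
TU102_DENSITY = 25e6      # transistors per mm^2 (12nm), from this post

area = die_area_mm2(GA102_TRANSISTORS, GA102_DENSITY)
print(f"Estimated GA102 die area: {area:.0f} mm^2")
print(f"Density improvement over TU102: {GA102_DENSITY / TU102_DENSITY:.1f}x")
```

The estimate is only as good as the rounded inputs, but it supports the point about massive dies: even at 1.8 times Turing's density, the chip comes out at over 600 mm².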

Keith Myers Joined: 13 Dec 17 Posts: 1423 Credit: 9,186,946,190 RAC: 1,288,374	Message 55354 - Posted: 25 Sep 2020, 17:21:18 UTC Last modified: 25 Sep 2020, 17:22:33 UTC
Until the compute apps get recompiled for CUDA 11.1 and the new PTX library, none are going to show the potential from using the dormant extra FP32 pipeline in the architecture.

eXaPower Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0	Message 55366 - Posted: 27 Sep 2020, 15:25:33 UTC
https://www.techpowerup.com/272591/rtx-3080-crash-to-desktop-problems-likely-connected-to-aib-designed-capacitor-choice

rod4x4 Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0	Message 55371 - Posted: 29 Sep 2020, 5:56:11 UTC - in response to Message 55366. Last modified: 29 Sep 2020, 6:01:38 UTC
https://www.techpowerup.com/272591/rtx-3080-crash-to-desktop-problems-likely-connected-to-aib-designed-capacitor-choice
Nvidia has released driver 456.55 to fix the issue. (The driver appears to lower the boost clock to prevent the crashes during games.)
https://videocardz.com/newz/nvidia-geforce-rtx-3080-owners-report-fewer-crashes-after-updating-drivers
ASUS and MSI have modified their designs to fix the crashes by implementing a different capacitor configuration.
https://videocardz.com/newz/asus-also-caught-modifying-geforce-rtx-3080-tuf-and-rog-strix-pcb-designs
and
https://videocardz.com/newz/msi-quietly-changes-geforce-rtx-3080-gaming-x-trio-design-amid-stability-concerns
Hopefully this is just a minor speed bump in the release of the new Ampere architecture. Plain sailing from here?