Ampere 10496 & 8704 & 5888 fp32 cores!

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 55227 - Posted: 1 Sep 2020, 23:03:41 UTC

An incredible amount of compute power compared to Turing!
38 TFLOPS of single-precision (FP32) compute for the 3090, 30 TFLOPS for the 3080, and 20 TFLOPS for the 3070.

New architecture details are forthcoming. Plenty of websites have current info.

New Samsung(?) 8 nm process with 28 billion transistors (3090); die size currently unknown.
The previous 12 nm process was TSMC "FFN".
Memory clocks are faster than Turing's.
PCIe 4.0 Ampere boost clocks should be around 2 GHz, which PCIe 3.0 Pascal and Turing cards reached routinely.

Will GPUGRID be ready for this newest generation?

I'll be an early adopter. I'm looking at a couple of 3080s, since the 3090 offers 1800 more cores for a $700 premium.
I'll put the savings toward a new PCIe 4.0 motherboard and a 12+ core CPU.
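
For reference, headline TFLOPS figures like these usually come from core count x 2 FLOPs per clock (one fused multiply-add) x clock speed. The sketch below is only an illustration: the 1.7 GHz boost clock is an assumption, not a figure from this thread, and the function name is made up for the example.

```python
# Peak FP32 throughput estimate: cores x 2 FLOPs per clock (one FMA) x clock.
# ASSUMED_BOOST_GHZ is an illustrative assumption, not a quoted specification.
ASSUMED_BOOST_GHZ = 1.7

# Advertised FP32 core counts from the thread title.
ADVERTISED_FP32_CORES = {"RTX 3090": 10496, "RTX 3080": 8704, "RTX 3070": 5888}


def peak_fp32_tflops(cores: int, clock_ghz: float) -> float:
    """Return the theoretical peak FP32 TFLOPS for a core count and clock."""
    flops_per_core_per_clock = 2  # a fused multiply-add counts as two FLOPs
    return cores * flops_per_core_per_clock * clock_ghz / 1000.0


if __name__ == "__main__":
    for card, cores in ADVERTISED_FP32_CORES.items():
        print(f"{card}: {peak_fp32_tflops(cores, ASSUMED_BOOST_GHZ):.1f} TFLOPS")
```

These are theoretical peaks only; sustained compute throughput depends on the workload and on the clocks a card actually holds.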

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Message 55229 - Posted: 2 Sep 2020, 1:15:32 UTC - in response to Message 55227.  

I didn't see any publication of the actual memory clocks. Just the bandwidths. Will wait and see what the actual specs and test results are once the actual cards are in the hands of testers. Just because the cards will be great pixel pushers for ray-tracing games doesn't mean they will produce the commensurate compute improvements.

And fully utilizing the architectural differences between Ampere and Turing/Pascal means we will need new applications.

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 55230 - Posted: 2 Sep 2020, 1:45:33 UTC - in response to Message 55229.  
Last modified: 2 Sep 2020, 1:49:26 UTC

Is INT32 performance the same as FP32, like Turing's 1:1 INT32 to FP32 ratio?

On gaming: the most demanding 4K ray-tracing title is currently the third Metro game.

Certainly others are crisp too. Another game, 8K demo, or video will showcase Ampere's chops. I have 4K now. 8K monitors are right around the corner for mainstream purchase. 2020 bleeding edge.
I was surprised by the core counts: 10k, 8k, and over 5k for the 3070. Never mind Ampere's tensor cores or FP64 performance.

rod4x4
Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Message 55231 - Posted: 2 Sep 2020, 2:04:00 UTC - in response to Message 55229.  


And fully utilizing the architectural differences between Ampere and Turing/Pascal means we will need new applications.

The idea of going to "Wrapper" with ACEMD3 was to enable easier development of apps for new CUDA / architectural releases.
Interested to see how easy this path will be... The holdup may be Nvidia and how fast they release the next CUDA Toolkit.

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Message 55232 - Posted: 2 Sep 2020, 6:29:59 UTC - in response to Message 55231.  

The trick will be to use all the new hardware in the best parallelization of the current and future searches.

If you are a game designer, you have new SDKs for gaming. But is there going to be a compute SDK right after release?

Best scenario would be yes, and even better would be some new automatic profilers for compute loads. That way you could just input the current source code and the profiler would spit out new code optimized for the new hardware resources in Ampere. Then look at the generated code and iterate on another revision that is better and faster. Rinse and repeat.

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 55237 - Posted: 2 Sep 2020, 11:22:40 UTC

What concerns me: will Turing's quality control issues crop up on Ampere?
Out of 6 Turing GPUs I purchased, 4 died: a Gigabyte 2070 in a day, a Zotac 2060 after 5 months, and an EVGA 2080 in 2 months, with another EVGA 2080 lasting 2 years.

All had 3-year warranties. I sold all the warranty replacements since I didn't want any more Turings due to their high death rate.
I had a Pascal 1080 GPU die after 27 months. My EVGA 1070 is still holding on after 4 years of running 24/7; it would have been retired if the Turings had lasted. And none of my Maxwells quit; they were just retired.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 55238 - Posted: 2 Sep 2020, 12:40:29 UTC - in response to Message 55227.  
Last modified: 2 Sep 2020, 12:48:43 UTC

I expect that the number of CUDA cores usable for crunching will be only half of that stated in the title of this thread.
This is similar to the GF116 architecture, where only 2/3 of the cores could be used for crunching (due to the 4/6 ratio of dispatch units to CUDA cores).


I think the relative performance in computing compared to the RTX 2080Ti will be the following:

card          advertised cores   usable cores   relative performance
RTX 2080Ti                   -           4352                 100.0%
RTX 3090                 10496           5248                 120.6%
RTX 3080                  8704           4352                 100.0%
RTX 3070                  5888           2944                  67.6%

Perhaps a bit (say 10%) more, taking other factors into consideration.
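
The percentages in the table follow from halving the advertised core count and dividing by the 2080Ti's 4352 cores. A minimal sketch of that arithmetic, assuming (as the post does) that only half of the advertised cores are usable and ignoring clocks and per-core efficiency; the function name is just for illustration:

```python
# Estimate relative compute performance from usable core count alone.
# Assumption (from the post above): only half of the advertised Ampere
# cores are usable for crunching; clock speed differences are ignored.
BASELINE_CORES = 4352  # RTX 2080 Ti

ADVERTISED_CORES = {"RTX 3090": 10496, "RTX 3080": 8704, "RTX 3070": 5888}


def relative_performance(advertised_cores: int, usable_fraction: float = 0.5) -> float:
    """Return the estimate as a percentage of RTX 2080 Ti performance."""
    usable_cores = advertised_cores * usable_fraction
    return 100.0 * usable_cores / BASELINE_CORES


if __name__ == "__main__":
    for card, cores in ADVERTISED_CORES.items():
        print(f"{card}: {relative_performance(cores):.1f}%")
```

Changing `usable_fraction` shows how sensitive the estimate is to the assumption about how many cores real workloads can keep busy.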

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 55239 - Posted: 2 Sep 2020, 15:23:03 UTC - in response to Message 55238.  

I think the CUDA core counts might be a bit of marketing magic. They claimed in the article that the Ampere cores can do 2 operations per clock, which effectively doubles the work done on a single physical core. We’ll see how it shakes out for compute work.

I’ll probably grab a 3070 when they are released to compare to my 2080ti and existing 2070s.

I’m very interested in performance per watt of the new cards. Don’t forget that while performance is looking great, all the new cards are seeing a healthy increase in power consumption as well. The 2080ti was a 250W card, the 2080 was a 215W card, the 2070 was a 175W card. Now these new cards are 350W/320W/220W for 3090/3080/3070 respectively.

If the 3070 performs comparably to a 2080ti, that’s about 12% less power for the same work (220W vs 250W), which is believable.

For CUDA compute work, the 2070 was as fast or a little faster than a 1080ti at much less power draw (175W vs 250W). So I could see the same thing happening again for 3070 vs 2080ti.
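
The efficiency comparison can be checked with a couple of lines of arithmetic. This is a rough sketch that uses the board power figures quoted above and assumes equal performance between the two cards being compared (`perf_ratio=1.0`); the function name is hypothetical.

```python
def efficiency_comparison(old_watts: float, new_watts: float, perf_ratio: float = 1.0):
    """Compare two cards by board power and new/old performance ratio.

    Returns (power_saving, perf_per_watt_gain), both as fractions.
    """
    power_saving = 1.0 - new_watts / old_watts
    perf_per_watt_gain = perf_ratio * old_watts / new_watts - 1.0
    return power_saving, perf_per_watt_gain


if __name__ == "__main__":
    # 2080 Ti (250 W) vs 3070 (220 W), assuming equal compute performance.
    saving, gain = efficiency_comparison(250, 220)
    print(f"3070 vs 2080 Ti: {saving:.0%} less power, {gain:.1%} more work per watt")

    # 1080 Ti (250 W) vs 2070 (175 W), the earlier generation jump.
    saving, gain = efficiency_comparison(250, 175)
    print(f"2070 vs 1080 Ti: {saving:.0%} less power, {gain:.1%} more work per watt")
```

Note that 12% less power at equal performance works out to roughly 13.6% more work per watt, so the two ways of stating the gain are not interchangeable.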

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Message 55242 - Posted: 2 Sep 2020, 16:08:47 UTC - in response to Message 55238.  

I expect that the number of CUDA cores usable for crunching will be only half of that stated in the title of this thread.

I think you will be correct too. I saw the published CUDA core counts and thought it was marketing nonsense. Unless they fundamentally changed the architecture design, I think they just doubled the physical core counts via the new 2-operands-per-cycle PAM memory operations.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 55243 - Posted: 3 Sep 2020, 19:47:50 UTC - in response to Message 55238.  

Did some more reading and I think I got a better grasp on the whole CUDA core count issue and relative performance.

What’s being left to the footnotes is the comment that Jensen made in the announcement: “two instructions per clock”. What he didn’t say was how the SM partitioning worked to allow this to happen, and that the doubling affects FP32 calculations only. They have a new datapath design where each SM partition can handle either 32x FP32 (which is double Turing) OR 16x FP32 and 16x INT32.

For graphics loads, it’s mostly FP32 so this works to their advantage, but for INT32 workloads, you won’t have this doubling effect, and the performance will be closer to Turing, with the normal generational efficiency boost.
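
To make the datapath argument concrete, here is an idealized lane-count model of one SM partition. It is a sketch under stated assumptions, not NVIDIA's scheduler: it reads the description above as 16 FP32-only lanes plus 16 lanes that can run either FP32 or INT32 on Ampere, versus 16 FP32 lanes plus 16 INT32 lanes on Turing, and it ignores issue limits, memory, and clocks. All function names are made up for the example.

```python
def cycles_turing(n_instr: float, int_fraction: float) -> float:
    """Idealized cycles on a Turing partition: 16 FP32 lanes + 16 INT32 lanes."""
    fp_work = n_instr * (1.0 - int_fraction)
    int_work = n_instr * int_fraction
    # The two pipes run concurrently, so the busier one sets the time.
    return max(fp_work / 16.0, int_work / 16.0)


def cycles_ampere(n_instr: float, int_fraction: float) -> float:
    """Idealized cycles on an Ampere partition: 16 FP32-only lanes plus
    16 shared lanes that can each run FP32 or INT32."""
    int_work = n_instr * int_fraction
    # INT32 work is limited to the 16 shared lanes; all work shares 32 lanes.
    return max(n_instr / 32.0, int_work / 16.0)


def ampere_speedup(int_fraction: float) -> float:
    """Per-partition, per-clock speedup of Ampere over Turing in this model."""
    return cycles_turing(1.0, int_fraction) / cycles_ampere(1.0, int_fraction)


if __name__ == "__main__":
    for f in (0.0, 0.25, 0.5, 1.0):
        print(f"INT32 share {f:.0%}: speedup x{ampere_speedup(f):.2f}")
```

In this model the doubling only appears for pure FP32 work and fades to no per-clock gain once half or more of the instructions are INT32, which is the point the post is making.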

A full size Ampere GPU core (GA100) is 128 SMs with 64 CUDA cores per SM, for a highest possible CUDA core count of 8192. The 3090, 3080, and 3070 are not full size Ampere cores. Given that the marketing number of the 3090 is “10496” cores, we can surmise that it’s really 5248 CUDA cores with 82 SMs. This is still more SMs and cores than the 2080ti has (68 SMs / 4352 cores). However, the 3070 with its “5888” cores really has 2944 cores.

So it will depend on your workload. If you’re only living in the bubble of gaming, then yeah enjoy that 2x boost. But for other non-FP32 loads (like more purely computational loads), you’re going to likely only see normal generational improvements with Ampere over Turing.
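
The SM counts implied by this reading can be derived from the advertised numbers. A small sketch, assuming (as the post does) 64 cores per SM with the advertised figure counting each one twice; the constant and function names are illustrative only.

```python
# Assumption taken from the post above: 64 "real" cores per SM, with the
# advertised figure being double that (128 per SM).
CORES_PER_SM_ASSUMED = 64

ADVERTISED_CORES = {"RTX 3090": 10496, "RTX 3080": 8704, "RTX 3070": 5888}


def sm_count(advertised_cores: int, advertised_per_sm: int = 2 * CORES_PER_SM_ASSUMED) -> int:
    """Derive the SM count from an advertised core figure."""
    if advertised_cores % advertised_per_sm:
        raise ValueError("advertised core count is not a whole number of SMs")
    return advertised_cores // advertised_per_sm


if __name__ == "__main__":
    for card, cores in ADVERTISED_CORES.items():
        sms = sm_count(cores)
        print(f"{card}: {sms} SMs, {sms * CORES_PER_SM_ASSUMED} cores at 64 per SM")
    print(f"GA100 full die: {128 * CORES_PER_SM_ASSUMED} cores")
```

Under this reading the 3080 comes out at 68 SMs, the same SM count as the 2080ti.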

rod4x4
Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Message 55246 - Posted: 3 Sep 2020, 23:28:58 UTC - in response to Message 55243.  

Thanks for the analysis.
Will have to wait and see how they assign SMs to each RTX model to gauge the performance increase for each model.
Nvidia will not give their technology away, so I suspect you are right in saying it will only be a generational performance increase for us.

kain
Joined: 3 Sep 14
Posts: 152
Credit: 918,557,369
RAC: 28
Message 55279 - Posted: 11 Sep 2020, 18:50:56 UTC


Pop Piasa
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Message 55281 - Posted: 12 Sep 2020, 2:27:45 UTC

Hopefully, the Amperes will be showing up on the Passmark GPU Direct Compute ratings soon.

https://www.videocardbenchmark.net/directCompute.html

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 55292 - Posted: 14 Sep 2020, 21:55:21 UTC

https://www.tomshardware.com/features/nvidia-ampere-architecture-deep-dive

Real FP32 cores confirmed: 64 FP32 per SM, with 32 cores for FP32 only and the remaining 32 cores able to run INT32 or FP32 concurrently. See the Ampere slides. Compute benchmarks released (tomorrow?) will show how well the new FP32 design performs.

Ampere INT32 performance is 50-66% of FP32, depending on code efficiency.

Consumer Ampere GA102/GA104 double precision (FP64): 2 units per SM, now a 1/64 ratio to FP32. Turing has 1/32 (2 FP64 per SM).


eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 55302 - Posted: 16 Sep 2020, 21:12:25 UTC

https://www.tomshardware.com/news/nvidia-geforce-rtx-3080-review

Tom's will reveal their compute benchmarks soon and show detailed power profiles.

Definitely an upgrade if you own a 2080ti.

Are you purchasing an Ampere? If so, which? For me the 3080 and 3070 look good.
The RTX 3090 is the ultra halo card, with an awful per-core cost compared to the 3080.

Note: the RTX 3080 Founders Edition's real power consumption is similar to an overclocked non-Founders 2080ti, demanding 330 watts.

Curious what the non-Founders RTX 3090 pulls; my guess is ~400W overclocked.

Edit: noticed I made mistake in my previous post - meant to write Ampere has 128 fp32 cores per sm.

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Message 55304 - Posted: 17 Sep 2020, 20:25:25 UTC - in response to Message 55302.  

So far all the cards are power limited. Almost no overclocking potential. Maybe 2-3%. Doubt the 3090 cards will be any different.

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 55351 - Posted: 25 Sep 2020, 12:57:08 UTC - in response to Message 55304.  

So far all the cards are power limited. Almost no overclocking potential. Maybe 2-3%. Doubt the 3090 cards will be any different.

You are right: even with enough power, the OC wall is 2GHz. Many forums are reporting Ampere 3080s crashing at 2GHz.
The Ampere release is eerily similar to Turing's.

Remember when early model 2080 and 2080ti cards were glitching out at stock clocks? Ampere's improved Founders cooling didn't help. It might be memory related due to the new technology, and/or quality control with the massive dies. Die density on 12nm Turing TU102 is 25M transistors per mm². 8nm GA102 has 45M per mm². 7nm and 5nm are denser.

Finding a 3080 is another story with limited availability. It is so bad that Nvidia released a statement about availability.

Published RTX 3090 reviews show power consumption of 450W at overclocked clocks with a 480W power limit, and 360W at out-of-the-box clocks. A big power increase for little performance gain. Two 2080 Ti cards had three 8-pin power connectors: the MSI Lightning and the Galax. Now all 3090s and most 3080s have three 8-pin connectors, or the new 12-pin on the Founders Edition.
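
The density figures in this post are just transistor count divided by die area. A minimal sketch follows; the transistor counts and die areas in it are commonly reported figures that are not stated in this thread (about 18.6 billion on about 754 mm² for TU102, about 28.3 billion on about 628 mm² for GA102), so treat them as assumptions.

```python
# Commonly reported figures, NOT taken from this thread; treat as assumptions.
ASSUMED_DIES = {
    "TU102 (12nm)": {"transistors": 18.6e9, "area_mm2": 754.0},
    "GA102 (8nm)": {"transistors": 28.3e9, "area_mm2": 628.4},
}


def density_m_per_mm2(transistors: float, die_area_mm2: float) -> float:
    """Return transistor density in millions of transistors per mm^2."""
    return transistors / die_area_mm2 / 1e6


if __name__ == "__main__":
    for name, die in ASSUMED_DIES.items():
        d = density_m_per_mm2(die["transistors"], die["area_mm2"])
        print(f"{name}: {d:.1f}M transistors per mm^2")
```

With those inputs the results land close to the 25M and 45M per mm² quoted above.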

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Message 55354 - Posted: 25 Sep 2020, 17:21:18 UTC
Last modified: 25 Sep 2020, 17:22:33 UTC

Until the compute apps get recompiled for CUDA 11.1 and the new PTX library, none are going to show the potential from using the dormant extra FP32 pipeline in the architecture.

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 55366 - Posted: 27 Sep 2020, 15:25:33 UTC

https://www.techpowerup.com/272591/rtx-3080-crash-to-desktop-problems-likely-connected-to-aib-designed-capacitor-choice



rod4x4
Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Message 55371 - Posted: 29 Sep 2020, 5:56:11 UTC - in response to Message 55366.  
Last modified: 29 Sep 2020, 6:01:38 UTC

https://www.techpowerup.com/272591/rtx-3080-crash-to-desktop-problems-likely-connected-to-aib-designed-capacitor-choice


Nvidia has released driver 456.55 to fix the issue. (The driver appears to lower the boost clock to prevent crashes during games.)
https://videocardz.com/newz/nvidia-geforce-rtx-3080-owners-report-fewer-crashes-after-updating-drivers

ASUS and MSI have modified their designs to fix the crashes by implementing a different capacitor configuration.
https://videocardz.com/newz/asus-also-caught-modifying-geforce-rtx-3080-tuf-and-rog-strix-pcb-designs
and
https://videocardz.com/newz/msi-quietly-changes-geforce-rtx-3080-gaming-x-trio-design-amid-stability-concerns

Hopefully just a minor speed bump in the release of the new Ampere Architecture.

Plain sailing from here?

©2025 Universitat Pompeu Fabra