Ampere 10496 & 8704 & 5888 fp32 cores!

Author	Message
Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 423,674	Message 58009 - Posted: 2 Dec 2021, 0:07:19 UTC - in response to Message 58005. Last modified: 2 Dec 2021, 0:25:06 UTC
And since the real performance doesn't scale with the FP core count, that leads us to the conclusion that the GPUGRID app must be coded with a fair number of INT operations, which cut into the FP cores available (half of the FP cores are actually shared FP/INT cores and can only do one type of operation at a time, while the other half are dedicated FP32). This explains the massive performance boost of Turing cards over Pascal (at the same FP core count) for GPUGRID, since Turing introduced concurrent FP/INT processing.
https://developer.nvidia.com/blog/nvidia-turing-architecture-in-depth/
"the Turing SM adds a new independent integer datapath that can execute instructions concurrently with the floating-point math datapath. In previous generations, executing these instructions would have blocked floating-point instructions from issuing"
Einstein scales much better with FP core count on Ampere, but is also more reliant on memory speed/latency than GPUGRID. If you take this "only half" rule, it doesn't stack up with the real-world gains seen on Einstein. A 3080Ti is 70-75% faster than a 2080Ti on Einstein, while under the 1/2 rule it "only" has 17% more cores. So one obviously can't paint with such a broad brush to include all of crunching. It all depends on how the app is coded to use the hardware, and sometimes you can't make an app totally optimized for a certain GPU architecture, depending on what kinds of computations you need to do or what coding methods you use.
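The "17% more cores" figure can be checked with a few lines of arithmetic. This is only a sketch: the core counts (10240 for the 3080Ti, 4352 for the 2080Ti) are NVIDIA's published FP32 core counts and are not stated in the post, and the 70-75% Einstein figure is simply quoted from the post.

```python
# Published FP32 core counts (assumption: taken from NVIDIA's spec sheets,
# not from the forum post itself).
CORES_3080TI = 10240
CORES_2080TI = 4352


def percent_more(new: float, old: float) -> float:
    """Return how many percent larger `new` is than `old`."""
    return (new / old - 1.0) * 100.0


# Counting every Ampere FP32 core.
full = percent_more(CORES_3080TI, CORES_2080TI)
# Counting only the dedicated half, i.e. the "1/2 rule" from the post.
half = percent_more(CORES_3080TI / 2, CORES_2080TI)

print(f"all FP32 cores counted: {full:.1f}% more than a 2080Ti")
print(f"'only half' rule:       {half:.1f}% more than a 2080Ti")
print("observed on Einstein (per the post): 70-75% faster")
```

The observed Einstein gain sits between the two bounds, which is the point being made: neither "all cores" nor "half the cores" predicts performance on its own.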

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 58017 - Posted: 2 Dec 2021, 9:46:42 UTC - in response to Message 58009. Last modified: 2 Dec 2021, 9:58:16 UTC
"... the GPUGRID app must be coded with a fair number of INT operations which cut into the FP cores available (half of the FP cores are actually shared FP/INT cores and can only do one type of operation at a time, while the other half are dedicated FP32)."
Perhaps MD simulations don't rely on that many INT operations, so it's independent of the coder and of the cruncher's wishes (demands).
"Einstein scales much better with FP core count on Ampere ..."
The Einstein app is not a native CUDA application (it's OpenCL), and it's not good at utilizing (previous) NVidia GPUs, which makes this comparison inconsequential regarding the GPUGrid app's performance improvement on Ampere. It's the Ampere architecture that saved the day for the Einstein app, so if the Einstein app were coded (in CUDA) in a way that ran great on Turing too, you would see the same (low) performance improvement on Ampere.
"all depends on how the app is coded to use the hardware, and sometimes you can't make an app totally optimized for a certain GPU architecture depending on what kinds of computations you need to do or what coding methods you use."
Well, how the app is coded depends on the problem (the research area), the methodology of the given research, and the programming language, which is chosen according to the targeted range of hardware. The method is the reason the "the GPUGRID app must be coded with a fair number of INT operations" demand is impossible to meet. (FP32 is needed to calculate trajectories.) The targeted (broader) range of hardware is the reason Einstein is coded in OpenCL, resulting in lower NVidia GPU utilization on previous GPU generations.
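The disagreement in the two posts above hinges on how much of an app's instruction mix is integer work. A toy model can show why that share matters. Everything here is an illustrative assumption rather than a description of NVIDIA's actual scheduler: `issue_time` and `ampere_over_turing` are hypothetical names, each datapath is assumed to retire one instruction per cycle, and memory, clocks and SM counts are ignored.

```python
def issue_time(int_fraction: float, arch: str) -> float:
    """Relative time to issue one unit of mixed FP32/INT32 work.

    Toy model: each datapath retires one instruction per cycle, and the
    work is int_fraction INT32 plus (1 - int_fraction) FP32.
    """
    if not 0.0 <= int_fraction <= 1.0:
        raise ValueError("int_fraction must be between 0 and 1")
    fp_fraction = 1.0 - int_fraction
    if arch == "pascal":
        # Single datapath: INT instructions block FP issue, so times add.
        return fp_fraction + int_fraction
    if arch == "turing":
        # Separate FP32 and INT32 datapaths run concurrently.
        return max(fp_fraction, int_fraction)
    if arch == "ampere":
        # One FP32-only path plus one shared FP32/INT32 path. INT must use
        # the shared path; FP is balanced across both when possible.
        return max(0.5, int_fraction)
    raise ValueError(f"unknown arch: {arch}")


def ampere_over_turing(int_fraction: float) -> float:
    """Per-SM, per-clock speedup of Ampere over Turing in this model."""
    return issue_time(int_fraction, "turing") / issue_time(int_fraction, "ampere")


for pct in (0, 10, 25, 40, 50):
    i = pct / 100
    print(f"INT share {pct:>2}%: Ampere/Turing = {ampere_over_turing(i):.2f}x")
```

In this model a pure-FP32 workload gets the full doubling from Ampere's extra FP32 lanes, and the advantage shrinks toward nothing as the integer share approaches half. Where a real app falls on that curve is exactly what the posters disagree about.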

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 423,674	Message 58018 - Posted: 2 Dec 2021, 12:41:09 UTC - in response to Message 58017. Last modified: 2 Dec 2021, 13:32:45 UTC
Nvidia cards only run CUDA. OpenCL code gets compiled into CUDA at runtime.
But just using OpenCL doesn’t mean that it can’t effectively use the GPU; it’s all in the coding. The compiling to CUDA at runtime adds only a very small overhead. A new Einstein app was released several months ago, and I was involved in its testing and development. It was mostly coded/modified by another user (petri, the same guy who wrote the SETI special CUDA app), and then I forwarded and explained the changes to the Einstein devs for integration and release. It’s now available to everyone and provides big gains in performance for all Nvidia GPUs all the way back to the Maxwell architecture, so yes, the new coding applies to Turing also, and Ampere saw major gains over Turing. Ampere by far had the best gains: Maxwell and Pascal cards see about a 40% improvement over the old app, Turing about a 60% improvement, and Ampere over 100%. It basically puts Nvidia back on level ground with, or even ahead of, AMD for Einstein. The issue was a certain command and its parameters, which caused memory access serialization on Nvidia GPUs, effectively holding them back. Petri recoded it a different way to basically remove the limiter, parallelize the memory access, and allow the GPU to run at full speed.
So yes, it’s relevant, and it shows how different code can be used for the exact same computation to utilize the hardware better and increase performance. It also shows that you can’t figure on only “half” of the cores for all of crunching, when Ampere clearly has benefits that scale much closer to the FP core count with certain projects. If you look at the Ampere architecture and understand it, you’ll see that the only reason “half” of the FP cores wouldn’t be used is if INT operations are running.
Regarding GPUGRID's app, if you observe what the app is doing with the hardware, you'll see that one trait stands out from most other projects: high PCIe bus utilization.
The app is sending A LOT of data over the PCIe bus, equivalent to almost a full PCIe 3.0 x4 link's worth of bandwidth, for the whole run. It is coded in a way where lots of data is sent between the CPU and the GPU over the PCIe bus, but very little is stored in GPU memory. These kinds of operations usually involve a lot of integer adds for data fetching as well as for FP compares or min/max processing. So while the meat of the computation is for trajectories, there are a lot of other things that the app needs to do in INT to get and send the data and organize the results. I imagine that if the devs changed their code philosophy to store more data in GPU memory, they could cut out a lot of the excess involved in sending so much data over the PCIe bus, keep things more local to the GPU, and speed up processing overall. This would have the caveat of excluding some low-VRAM GPUs, however.
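For scale, "almost a full PCIe 3.0 x4" can be turned into a rough number. This sketch uses only the nominal PCIe 3.0 link parameters (8 GT/s per lane, 128b/130b encoding); it gives a theoretical ceiling before packet and protocol overhead, and the one-hour run length is a made-up figure for illustration, not a measured GPUGRID task time.

```python
# Nominal PCIe 3.0 link parameters.
TRANSFERS_PER_S = 8.0e9   # 8 GT/s per lane, one bit per transfer
ENCODING = 128 / 130      # 128b/130b line encoding efficiency


def pcie3_bandwidth_gb_s(lanes: int) -> float:
    """Theoretical one-direction PCIe 3.0 bandwidth in GB/s (10^9 bytes)."""
    if lanes <= 0:
        raise ValueError("lanes must be positive")
    bytes_per_s_per_lane = TRANSFERS_PER_S * ENCODING / 8
    return bytes_per_s_per_lane * lanes / 1e9


x4 = pcie3_bandwidth_gb_s(4)
x16 = pcie3_bandwidth_gb_s(16)
print(f"PCIe 3.0 x4 : {x4:.2f} GB/s")
print(f"PCIe 3.0 x16: {x16:.2f} GB/s")

# Hypothetical one-hour run at the x4 ceiling, purely for a sense of scale.
HYPOTHETICAL_RUN_SECONDS = 3600
print(f"x4 ceiling over one hour: {x4 * HYPOTHETICAL_RUN_SECONDS / 1000:.1f} TB")
```

Sustained traffic anywhere near that ceiling means terabytes crossing the bus per task, which is why keeping more data resident in GPU memory is suggested above as a possible speedup.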

Keith Myers Joined: 13 Dec 17 Posts: 1423 Credit: 9,186,946,190 RAC: 1,288,374	Message 58019 - Posted: 2 Dec 2021, 23:27:14 UTC - in response to Message 58018. 👏👍