Maxwell now

Message boards : Graphics cards (GPUs) : Maxwell now
skgiven (Volunteer moderator, Volunteer tester) · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260
Message 38870 - Posted: 8 Nov 2014, 11:34:49 UTC - in response to Message 38869.  
Last modified: 8 Nov 2014, 11:36:49 UTC

I was able to clock the GDDR5 to 3500MHz and over, even when OC'ing the GPU core and with power >110%. That said I did get an error at silly speeds. Initially I was just being inquisitive to understand why the 256bit bus has such a high usage for some tasks, how much of a problem it was and looking for a happy spot WRT power and efficiency.

The P-states are still a bit of a conundrum and I suspect we are looking at a simplified GUI which attempts to control more complex functions than indicated by NV Inspector. It's certainly not easy to control them.

P-states are apparently distinct from boost and it's not clear what if any impact CPU usage has on boost:
When I don't fix the P states the GPU clocks drop under certain circumstances. For example, when running a CPU MT app the GPU had been 1050MHz (no boost) and when CPU usage changed to 6 individual CPU apps the power went from 58% to 61%, GPU usage rose from 82% to 84% and the GPU clock rose to 1075MHz - still no boost (all running a long SDOERR_thrombin WU).
When I force the GPU clock to 1190MHz while using 6 CPU cores for CPU work the GPU still does not boost. Not even when I reduce the CPU usage to 5, 4 or 3 (for CPU work). By that stage GPU usage is 86 to 87% and power is 66%.
I expect a system restart would allow the GPU to boost again, but IMO boost is basically broken and the P-states don't line up properly.
FAQs

HOW TO:
- Opt out of Beta Tests
- Ask for Help
eXaPower · Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978
Message 38871 - Posted: 8 Nov 2014, 14:14:09 UTC - in response to Message 38870.  

skgiven:

Have you tried NVAPI for the 343 branch? It might help to break through the P2 memory lock and boost inconsistency, or at the very least provide new information. New data structures are included for Maxwell, and reference documents are also available. I've seen 157 API functions with 343.98 and 163? with 344.60:
NV_GPU_PERF_PSTATES20_INFO_V2
NV_GPU_CLOCK_FREQUENCIES_V1
NV_GPU_CLOCK_FREQUENCIES_V2
NV_GPU_DYNAMIC_PSTATES_INFO_EX
NV_GPU_PERF_PSTATES20_INFO_V1 Used in NvAPI_GPU_GetPstates20() interface call
NV_GPU_PERF_PSTATES20_PARAM_DELTA Used to describe both voltage and frequency deltas
NV_GPU_PERF_PSTATES_INFO_V1
NV_GPU_PERF_PSTATES_INFO_V2
NV_GPU_PSTATE20_BASE_VOLTAGE_ENTRY_V1 Used to describe single base voltage entry
NV_GPU_PSTATE20_CLOCK_ENTRY_V1 Used to describe single clock entry
NV_GPU_THERMAL_SETTINGS_V1
NV_GPU_THERMAL_SETTINGS_V2


https://developer.nvidia.com/nvapi
Beyond · Joined: 23 Nov 08 · Posts: 1112 · Credit: 6,162,416,256
Message 38873 - Posted: 8 Nov 2014, 14:27:54 UTC - in response to Message 38866.  

(Sorry for TLDR posts)

Thanks eXaPower for those long posts. A lot of information concerning the NV architectures. Very interesting.
eXaPower
Message 38877 - Posted: 9 Nov 2014, 0:21:47 UTC

Maxwell ("SMM") and Kepler ("SMX") per-SM unit counts: CUDA/LD/ST/SFU/TMU/ROP/warp schedulers/instruction buffer/dispatch units/issue ports/crossbar/PolyMorph Engine.

"SMM": 128C/32LD-ST/4ROP/32SFU/8TMU = 204
4WS/4IB/8DPU/9Iss/1PME/4CrB = 30
SMM total: 234

"SMX": 192C/16TMU/8ROP/32LD-ST/32SFU = 280
1IC/4WS/8DPU/1CrB/9Iss/1PME = 24
SMX total: 304

An "SMM" equals 77% of a Kepler "SMX".

Kepler GK110: 304*15 = 4560
Maxwell GM204: 234*16 = 3744
Kepler GK104: 304*8 = 2432

GK104 is 65% of GM204.
GK104 is 53.3% of GK110.
A GK110/GK104 "SMX" consists of 63.1% CUDA cores.
A GM204 "SMM" is 54.7% CUDA cores.
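The tallies above can be checked with a quick script (a sketch using the counts as listed in this post, not official die specs):

```python
# Per-SM functional-unit tallies, Maxwell "SMM" vs Kepler "SMX",
# using the counts listed in the post above.
smm_exec = 128 + 32 + 4 + 32 + 8    # CUDA + LD/ST + ROP + SFU + TMU
smm_ctrl = 4 + 4 + 8 + 9 + 1 + 4    # WS + IB + DPU + issue + PME + crossbar
smm_total = smm_exec + smm_ctrl

smx_exec = 192 + 16 + 8 + 32 + 32   # CUDA + TMU + ROP + LD/ST + SFU
smx_ctrl = 1 + 4 + 8 + 1 + 9 + 1    # IC + WS + DPU + crossbar + issue + PME
smx_total = smx_exec + smx_ctrl

print(smm_total, smx_total)                 # 234 304
print(round(smm_total / smx_total * 100))   # SMM is ~77% of an SMX
print(304 * 15, 234 * 16, 304 * 8)          # GK110 4560, GM204 3744, GK104 2432
print(round(2432 / 3744 * 100))             # GK104 is ~65% of GM204
print(round(128 / smm_total * 100, 1))      # 54.7% of SMM units are CUDA cores
```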
ExtraTerrestrial Apes (Volunteer moderator, Volunteer tester) · Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549
Message 38886 - Posted: 9 Nov 2014, 21:21:04 UTC

I wrote my vendor (Galax) an email - hopefully I'm not just getting some useless standard reply.

@eXaPower: how could I "try NVAPI for 343 branch"? I know the tweak tools are using it, but have no experience on using it myself.

@SK: "but IMO boost is basically broken". Quite the contrary, I would say! It's better than ever: quicker and more powerful, with a wider dynamic range. However, it's obviously not perfect yet. Besides something sometimes getting messed up (e.g. your example), there's also a problem when a card with a borderline OC is switched from a low boost state to a high one. Here the clock is raised faster than the voltage, which can leave the card in an unstable state for a short period of time. That's enough time to cause a driver reset for gamers. But this could easily be fixed by changing the internal timings of boost; it's certainly not fundamentally broken.

@Zoltan: in my example it's easy to agree that the lower memory clock makes my card stable, while the higher one doesn't. But why is it stable in Heaven for hours? Running GPU-Grid I get BSODs within minutes. Surely a demanding benchmark can't be this fault-tolerant, and games etc. should also crash frequently.

Or put another way: if GP-GPU isn't stable at those clocks, IMO they couldn't sell the cards like this, because other software would crash / error too frequently.

MrS
Scanning for our furry friends since Jan 2002
ExtraTerrestrial Apes
Message 39082 - Posted: 5 Dec 2014, 22:49:15 UTC

Finally found the time to summarize it and post at Einstein. I also sent it as bug report to nVidia.

MrS
eXaPower
Message 39341 - Posted: 31 Dec 2014, 23:47:06 UTC - in response to Message 38125.  


I found one of many papers written by you and others, "ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale", from the golden days of GT200. A Maxwell update, if applicable, would be very informative.


I'm doing a bit of work to improve the performance of the code for Maxwell hardware - expect an update before the end of the year.

Matt

Any information regarding the Maxwell update?
skgiven
Message 39372 - Posted: 2 Jan 2015, 21:53:09 UTC - in response to Message 39341.  
Last modified: 2 Jan 2015, 22:00:37 UTC

In time there may well be a GTX960, 990, 960Ti, 950Ti, 950, 940, 930 &/or others, and when GM200/GM210 turns up there could be many variants in the GeForce, Quadro and Tesla ranges...
eXaPower
Message 39698 - Posted: 25 Jan 2015, 14:49:35 UTC

Nvidia released a statement about GTX970 memory allocation issues. There are reports that the GTX970 can't properly utilize its full 4GB.

For reference: Kepler's 8 dispatch units feed 1 large crossbar into 10 issue ports routed to the SMX CUDA/LD/ST/SFU units. An SMX has 1 issue port per 32 CUDA cores, 1 per 16 SFUs and 1 per 16 LD/ST units, totaling 2 issue ports for 32 SFUs, 2 for 32 LD/ST units and 6 for 192 CUDA cores inside 1 SMX.

An SMM consists of 12 issue ports and 8 dispatch units. Maxwell's crossbar is split into 4 slices per SMM, with 2 dispatch units per slice feeding 3 issue ports: 1 for 32 CUDA cores, 1 for 8 SFUs and 1 for 8 LD/ST units. In total an SMM has 4 issue ports for 128 CUDA cores, 4 for 32 SFUs and 4 for 32 LD/ST units.

A GTX980 consists of 64 crossbar slices for 192 issue ports and 128 dispatch units, while the 970 has 52 slices with 156 issue ports and 104 dispatch units. A GTX780Ti has 15 crossbars with 150 issue ports and 120 dispatch units.
Keep in mind: counting all resources within an SMX or SMM, the CUDA-core percentage is higher in an SMX than in an SMM:
A GK110/GK104 "SMX" consists of 63.1% CUDA cores.
A GM204 "SMM" is 54.7% CUDA cores.
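Those chip-level totals follow directly from the per-SM figures (a sketch using this post's counts, with SM counts of 16, 13 and 15):

```python
# Maxwell: each SMM has 4 crossbar slices; each slice has 2 dispatch
# units feeding 3 issue ports.
def smm_chip(smm_count):
    slices = smm_count * 4
    return slices, slices * 3, slices * 2   # slices, issue ports, dispatch units

# Kepler: each SMX has 1 crossbar, 10 issue ports and 8 dispatch units.
def smx_chip(smx_count):
    return smx_count, smx_count * 10, smx_count * 8

print(smm_chip(16))   # GTX980:   (64, 192, 128)
print(smm_chip(13))   # GTX970:   (52, 156, 104)
print(smx_chip(15))   # GTX780Ti: (15, 150, 120)
```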

Nvidia states:
“The GeForce GTX 970 is equipped with 4GB of dedicated graphics memory. However the 970 has a different configuration of SMs than the 980, and fewer crossbar resources to the memory system. To optimally manage memory traffic in this configuration, we segment graphics memory into a 3.5GB section and a 0.5GB section. The GPU has higher priority access to the 3.5GB section. When a game needs less than 3.5GB of video memory per draw command then it will only access the first partition, and 3rd party applications that measure memory usage will report 3.5GB of memory in use on GTX 970, but may report more for GTX 980 if there is more memory used by other commands. When a game requires more than 3.5GB of memory then we use both segments.

We understand there have been some questions about how the GTX 970 will perform when it accesses the 0.5GB memory segment. The best way to test that is to look at game performance. Compare a GTX 980 to a 970 on a game that uses less than 3.5GB. Then turn up the settings so the game needs more than 3.5GB and compare 980 and 970 performance again."

http://www.pcper.com/news/Graphics-Cards/NVIDIA-Responds-GTX-970-35GB-Memory-Issue
http://images.anandtech.com/doci/7764/SMX_575px.png
http://images.anandtech.com/doci/7764/SMMrecolored_575px.png
eXaPower
Message 39744 - Posted: 27 Jan 2015, 12:38:38 UTC
Last modified: 27 Jan 2015, 12:52:26 UTC

Nvidia admits the GTX970 only has 1792KB of L2 cache and 56 ROPs accessible.
http://www.pcper.com/reviews/Graphics-Cards/NVIDIA-Discloses-Full-Memory-Structure-and-Limitations-GTX-970

Despite initial reviews and information from NVIDIA, the GTX 970 actually has fewer ROPs and less L2 cache than the GTX 980. NVIDIA says this was an error in the reviewer’s guide and a misunderstanding between the engineering team and the technical PR team on how the architecture itself functioned. That means the GTX 970 has 56 ROPs and 1792 KB of L2 cache compared to 64 ROPs and 2048 KB of L2 cache for the GTX 980.

http://anandtech.com/show/8935/geforce-gtx-970-correcting-the-specs-exploring-memory-allocation

The benchmarking program SiSoftware Sandra has reported 1.8MB (1792KB) of cache for the GTX970 since the beginning. It was always chalked up as a bug.
skgiven
Message 40004 - Posted: 2 Feb 2015, 12:11:03 UTC - in response to Message 39744.  

Here, MCU usage is lower on the GTX970 than on the GTX980.
Although the 970's bus width is effectively 224+32 bits and suggestions are that the 224 bits are predominately used, the lower MCU usage might still be explained by the marginally favourable (by 7%) shader-to-bus ratio of the 970 over the 980 (1664 shaders over 224 bits vs 2048 over 256 bits) and slightly lower GPU clocks.
However, and despite some interpretations, I think it's possible that all 256 bits of the bus are actually used when accessing up to 3.5GB of GDDR5, and only beyond that does it become 224 bits for the first 3.5GB and 32 bits for the next 0.5GB.

While on the whole the 2MB-to-1.75MB cache reduction doesn't appear to have any impact, it might explain some of the relative performance variation between different WU types.
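The 7% figure works out from the shader-to-bus-width ratios quoted above (a quick check, not a measurement):

```python
# Shaders carried per bit of memory-bus width.
gtx970 = 1664 / 224   # ~7.43 shaders per bus bit
gtx980 = 2048 / 256   # 8.0 shaders per bus bit
print(round((1 - gtx970 / gtx980) * 100))   # the 970 loads its bus ~7% less
```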
ExtraTerrestrial Apes
Message 40008 - Posted: 2 Feb 2015, 21:36:56 UTC - in response to Message 40004.  

Well, we can surely say that the smaller L2 cache doesn't hurt GPU-Grid performance much. But we can't say anything more specific, can we?

Regarding the "new" bandwidth: from the explanation given to Anandtech it's clear that the card cannot read or write with the full 256 bits. The pipe between the memory controllers and the crossbar just isn't wide enough for this. You can see it where the traffic from the memory controller with the disabled ROP/L2 is routed through the ROP/L2 of its companion: at this point 2 x 32 bits would have to share a 1 x 32-bit bus.

This bus is bidirectional, though, so you can use all memory controllers at once if the 7- and 1-partitions are performing different operations. Which I imagine is difficult to exploit efficiently via software.

MrS
skgiven
Message 40061 - Posted: 5 Feb 2015, 23:39:20 UTC - in response to Message 40008.  

The 970 reminds me of the 660Ti, a lot.
eXaPower
Message 40077 - Posted: 6 Feb 2015, 23:19:11 UTC - in response to Message 40061.  

You're correct: Kepler disables a full memory controller and the 8 ROPs that come with it. The GTX970 is still the best NVidia performance/cost card for the feature set included.

To make matters even more confusing about SMX disabling: the GT630 (2 SMX/C.C. 3.5) has a 64-bit bus with 512KB cache. Another example: the GTX650Ti Boost (4 SMX/GK106) die includes the full GK106 cache (384KB), 24 ROPs and memory bus (192-bit) along with 3 GPCs - the same as the 5-SMX GTX660.
In the prior generation of Kepler-derived GPUs, Alben explained, any chips with faulty portions of L2 cache would need to have an entire memory partition disabled. For example, the GeForce GTX 660 Ti is based on a GK104 chip with several SMs and an entire memory partition inactive, so it has an aggregate 192-bit connection to memory, down 64 bits from the full chip's capabilities.

Nvidia's engineers built a new feature into Maxwell that allows the company to make fuller use of a less-than-perfect chip. In the event that a memory partition has a bad section of L2 cache, the firm can disable the bad section of cache. The remaining L2 cache in the memory partition can then service both memory controllers in the partition thanks to a "buddy interface" between the L2 and the memory controllers. That "buddy interface" is shown as active, in a dark, horizontal arrow, in the bottom right memory partition on the diagram. In the other three memory partitions, this arrow is grayed out because the "buddy" interface is not used.

From Damien Triolet at Hardware.fr:
The pixel fillrate can be linked to the number of ROPs for some GPUs, but it’s been limited elsewhere for years for many Nvidia GPUs. Basically there are 3 levels that might have a say at what the peak fillrate is :
•The number of rasterizers
•The number of SMs
•The number of ROPs

On both Kepler and Maxwell each SM appears to use a 128-bit datapath to transfer pixels color data to the ROPs. Those appears to be converted from FP32 to the actual pixel format before being transferred to the ROPs. With classic INT8 rendering (32-bit per pixel) it means each SM has a throughput of 4 pixels/clock. With HDR FP16 (64-bit per pixel), each SM has a throughput of 2 pixels/clock.

On Kepler each rasterizer can output up to 8 pixels/clock. With Maxwell, the rate goes up to 16 pixels/clock (at least with the currently released Maxwell GPUs).

So the actual pixels/cycle peak rate when you look at all the limits (rasterizers/SMs/ROPs) would be :

    GTX 750 : 16/16/16
    GTX 750 Ti : 16/20/16
    GTX 760 : 32/24/32 or 24/24/32 (as there are 2 die configuration options)
    GTX 770 : 32/32/32
    GTX 780 : 40/48/48 or 32/48/48 (as there are 2 die configuration options)
    GTX 780 Ti : 40/60/48
    GTX 970 : 64/52/64
    GTX 980 : 64/64/64
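Triolet's per-limit rates can be recomputed from unit counts. A sketch in which the rasterizer, SM and ROP counts are my assumptions from public specs, with the per-unit rates described above (4 px/clk per SM for INT8, 16 px/clk per Maxwell rasterizer, 8 per Kepler rasterizer):

```python
def pixel_limits(rasterizers, raster_rate, sms, rops):
    # Peak pixels/clock imposed by each level: rasterizers,
    # SMs (INT8 at 4 px/clk each), and ROPs.
    return rasterizers * raster_rate, sms * 4, rops

print(pixel_limits(4, 16, 13, 64))   # GTX970:   (64, 52, 64), SM-limited to 52
print(pixel_limits(4, 16, 16, 64))   # GTX980:   (64, 64, 64)
print(pixel_limits(5, 8, 15, 48))    # GTX780Ti: (40, 60, 48), rasterizer-limited
```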


Testing (forums) reveals that GTX970s from different manufacturers have varying results for peak rasterization rates and peak pixel fill rates (at the same clock/memory speeds). Is this because GTX970 SMM/cache/ROPs/memory structures are disabled differently from one another?

Reports also claim that not ALL GTX970s are affected by the 224+32-bit bus or the separate 512MB pool with its 20-28GB/s bandwidth slowdown. Does this reveal that NVidia changed the SMM disablement process? Is/was a second revision of the GTX970 produced?
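For the 512MB pool, the quoted ceiling follows from the bus split (a sketch assuming the stock 7Gbps effective GDDR5 data rate):

```python
def bandwidth_gb_s(bus_bits, gbps_per_pin=7.0):
    # bus width (bits) * per-pin data rate (Gb/s) / 8 bits per byte
    return bus_bits * gbps_per_pin / 8

print(bandwidth_gb_s(256))   # full 256-bit bus: 224.0 GB/s
print(bandwidth_gb_s(224))   # 3.5GB segment:    196.0 GB/s
print(bandwidth_gb_s(32))    # 0.5GB segment:     28.0 GB/s
```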
AnandTech's decent explanations of SMX structures can be found in its GTX660Ti and GTX650Ti reviews, to compare with the recent SMM articles.
When sorting through for reliable information, beware: some tech-forum threads' comments on the 970 are misinformed, hyperbolic, completely unprofessional and full of trolling.

©2025 Universitat Pompeu Fabra