Message boards : Graphics cards (GPUs) : Maxwell now
skgiven | Joined: 23 Apr 09 | Posts: 3968 | Credit: 1,995,359,260 | RAC: 0
The improvement was only 2 or 3% for me with the GPU at around 1406MHz. There is a new compression algorithm which presumably runs on the GPU, so any increase in GPU speed also increases the compression rate. Anyway, it appears there isn't a lot to be gained from increasing the GDDR frequency. That said, I just had a quick look at one GPU clock setting on one system, and it's not like I actually overclocked the memory; I just forced it to work at 3500MHz (as far as I can tell). It might be the case that at higher GPU clocks a memory frequency increase would result in greater performance, or that increasing the memory clock (as opposed to the GPU clock) allows you to run more efficiently (or not).

FAQ's
HOW TO: - Opt out of Beta Tests - Ask for Help
Joined: 17 Aug 08 | Posts: 2705 | Credit: 1,311,122,549 | RAC: 0
Regarding the question of whether ~50% MCL could be a bottleneck: the memory accesses don't have to be distributed homogeneously over time. So even if the memory controllers may be idle about 50% of the time, the chip may be waiting for new data to continue processing at other times.

Another point to consider is reduced latency, which simple clock speed increases will yield. This is not terribly important for a GPU, as it's designed to mask latency by keeping tons of threads in flight, but still, lower latency will certainly not perform worse.

SK, have you tried EVGA Precision to set memory clocks?

I have to confess I'm a bit jealous: I ordered my GTX970 over a week ago, but so far there's no sign of it being shipped. The shop is trustworthy, though. I went for the Galax EXOC model, because it's the cheapest one which can take my Thermalright Shaman :D

MrS
Scanning for our furry friends since Jan 2002
skgiven | Joined: 23 Apr 09 | Posts: 3968 | Credit: 1,995,359,260 | RAC: 0
I have not tried EVGA Precision (yet), but I did eventually work out how to increase the GDDR5 clock past 3505MHz using NV Inspector. You have to set the P-states while they are not in use! Presently working at 3605MHz...

FAQ's
HOW TO: - Opt out of Beta Tests - Ask for Help
Joined: 17 Aug 08 | Posts: 2705 | Credit: 1,311,122,549 | RAC: 0
Give that card a good spanking :D

MrS
Scanning for our furry friends since Jan 2002
Joined: 17 Aug 08 | Posts: 2705 | Credit: 1,311,122,549 | RAC: 0
I finally got my GTX970 yesterday! And I have the same problem as SK: by default the card crunches in power state P2, whereas my OC utility (EVGA Precision) manipulates state P0. This works well for games, and surprisingly for the GPU core clock as well, but not for the memory.

Using nVidia Inspector I can increase the memory clock for P2 to the default 3500 MHz. However, when I do this I quickly get blue screens! The card runs at that memory clock just fine in Unigine Heaven, so I'm not completely sure it's this setting which causes the blue screens. If I don't use Inspector things seem fine, though. Will test further.

Do you guys know anything about this? Does the memory use a lower voltage in P2? Can I force P0? Setting the power management to maximum performance had no effect. I also tried nvidiainspector.exe -forcepstate:0,0 to put my single nVidia GPU into P0, but all this did was set it to P2 with fixed clocks as set in Inspector.

MrS
Scanning for our furry friends since Jan 2002
Joined: 28 Jul 12 | Posts: 819 | Credit: 1,591,285,971 | RAC: 0
Are you using the latest Nvidia Inspector? They mention a couple of changes that might be of interest.
http://www.guru3d.com/files_details/nvidia_inspector_download.html
Joined: 17 Aug 08 | Posts: 2705 | Credit: 1,311,122,549 | RAC: 0
Yes, I'm using the current 1.9.7.3. Thanks,

MrS
Scanning for our furry friends since Jan 2002
Joined: 26 Jun 09 | Posts: 815 | Credit: 1,470,385,294 | RAC: 0
Congratulations ETA. Quite a coincidence: I ordered an EVGA GTX970 OC last week, but its shipment was delayed a few times, and I got it yesterday too. I put it in my system immediately to replace the GTX770.

I notice that it can run very warm. I use PrecisionX for all settings, and when the temperature is allowed to run to 80°C, it quickly gets there. So it runs warmer than my GTX780Ti's. When I allow 80°C, the clock boosts to 1341.6MHz and the TDP is ~95.3%; setting the maximum temperature to 74°C (prioritized), the clock boosts to 1303.2MHz and the TDP is ~81.1%. From what I have read I thought that the 9xx cards ran cooler than the 780Ti, but that is not what I see.

By the way, this is the version with only one radial fan, which is what I wanted.

Greetings from TJ
Joined: 17 Aug 08 | Posts: 2705 | Credit: 1,311,122,549 | RAC: 0
Thanks! Have fun with yours as well :)

Mine is currently running at a manually set 1315 MHz @ 1.075 V. From a quick Unigine Heaven test I know it should be able to take about 40 MHz more at that clock speed, which is really amazing for 28 nm. Power consumption is currently about 140 W at a moderate 87% GPU utilization. I hope I'll be able to increase this further when I get the memory clock up to where it should be.

Well, "cooler running" is only half of the story. It does consume much less power, for sure, which means it's far easier to keep cool. But if you pair it with a weak cooler it will run just as hot as any other card. Mine is the Galax with an open-air cooler, which handles the card nicely (quieter than my case fans at moderate speed, 66°C with 22°C ambient temperature).

But getting back to my question from yesterday: is your card also crunching with the memory running at 1500 / 3000 / 6000 MHz? Is it also in power state P2? (NV Inspector can show this.) So far I have left my card's memory clock at stock 1.5 GHz and have not faced a single blue screen since then.

MrS
Scanning for our furry friends since Jan 2002
Joined: 17 Aug 08 | Posts: 2705 | Credit: 1,311,122,549 | RAC: 0
BTW: the bitcoin folks are seeing the same problem. Some can't OC the memory, while one can. They don't get blue screens at stock memory clock, though.

MrS
Scanning for our furry friends since Jan 2002
Joined: 17 Aug 08 | Posts: 2705 | Credit: 1,311,122,549 | RAC: 0
OK, I'm getting closer: when I have GPU-Grid crunching in the background, not even running Heaven triggers P0! It just stays at P2, the same as with everything I've tried with Inspector. This is so strange. Maybe it's related to running CUDA?

BTW: 344.60 WHQL behaves the same.

MrS
Scanning for our furry friends since Jan 2002
Joined: 26 Jun 09 | Posts: 815 | Credit: 1,470,385,294 | RAC: 0
I installed Nvidia Inspector and it shows the card runs in P2. The clock is at 1304MHz (I have set the max temp to 73°C and the power target to 105% with PrecisionX). The clock can run higher if I set the temperature higher. Voltage is 1125mV, usage 90%, TDP 84%.

I am waiting for the real Maxwells, but in the meantime I want to have one GTX980 as well (or a GTX980Ti). What I noticed, however, is that EVGA cards are nowhere in stock; the price increased from EUR 585 last Friday to EUR 705 this Monday. I check daily whether it is available to order. That is more expensive than the better GTX780Ti.

Greetings from TJ
Joined: 17 Aug 08 | Posts: 2705 | Credit: 1,311,122,549 | RAC: 0
I suppose your memory clock is set to 1500 MHz (in GPU-Z) as well? I'm trying again to manually run 1750 MHz. I passed "memtestcl" at that clock, and for the last few minutes everything has looked fine. I could live with manually forcing the clock, while staying in P2, as long as I don't get blue screens again!

Edit: don't hold your breath for a GTX980Ti. Rumors say "big Maxwell" is relatively far along in production, but after the very successful launch of GM204 there's no reason for nVidia to bring it out yet, or to price it at some sane level. I expect they'll use those valuable chips for HPC, Tesla and Quadro first, just as they did with GK110. And I don't think the GTX980 is worth the added money, certainly not at those EVGA-scarcity prices.

MrS
Scanning for our furry friends since Jan 2002
Joined: 25 Sep 13 | Posts: 293 | Credit: 1,897,601,978 | RAC: 0
The 1920-DP64-core Titan-Z (5760 total cores, ~450W TDP?) is currently priced at 1399-1499 USD, equaling the cost of two GTX980s (4096 total cores, 128 DP64 cores, ~340W TDP). Big Maxwell is around the corner, maybe 6-12 weeks out. What will AMD's response to GM204 be, never mind GM2*0? A GTX970 equals an AMD 290X at 1440p/2160p gaming, and that's NVidia's fifth-ranked board by total core count against AMD's top 2816-core card; GM204's higher clocks give it an edge. GK110 lasted three years without a replacement, which speaks to how well engineered Kepler is.

Reading the Nvidia dev boards, there are "some" who think the C.C 3.5 code architecture is better than C.C 5.2, not including efficiency. (DP64 cores use more wattage than 32-bit cores; this is one reason Maxwell's TDP is lower: the loss of DP64 cores and the change from superscalar to scalar?) DP64 is a struggle: C.C 5.0/5.2 was trimmed even further away from Fermi's 1/2 or 1/8 and Kepler's 1/3 or 1/16 64-bit core ratio. (I'm curious to see how many DP64 cores a GM2*0 SMM will have, and at what TDP. C.C 3.5 has 64 DP cores per SMX, 1/3 of the total SMX core count, driver-limited to 8 DP per SMX, or 1/16, on the GTX780[Ti]; Maxwell C.C 5.0/5.2 is at 1/32 of the total SMM.)

Not specific to GPUGRID: C.C 5.0/5.2 L1/L2 cache, global and shared memory behavior is different. Maxwell lacks the 32-bank, 64-bit addressing mode; Kepler 3.0/3.5 can use 64-bit (8-byte-wide) banks if that code path is chosen, with 4 bytes as the default. Maxwell is faster at hashing, as Kepler is at DP64.

On the read-only data cache, the CUDA Programming Guide states: "[...] section by reading it using the __ldg() function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use __ldg() to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both the const and __restrict__ qualifiers increases the likelihood that the compiler will detect the read-only condition. Data that is not read-only for the entire lifetime of the kernel cannot be cached in the unified L1/texture cache for devices of compute capability 5.0. For devices of compute capability 5.2, it is, by default, not cached in the unified L1/texture cache, but caching may be enabled using the following mechanisms:
• Perform the read using inline assembly with the appropriate modifier as described in the PTX reference manual;
• Compile with the -Xptxas -dlcm=ca compilation flag, in which case all reads are cached, except reads that are performed using inline assembly with a modifier that disables caching;
• Compile with the -Xptxas -fscm=ca compilation flag, in which case all reads are cached, including reads that are performed using inline assembly regardless of the modifier used.
When caching is enabled using one of the three mechanisms listed above, devices of compute capability 5.2 will cache global memory reads in the unified L1/texture cache for all kernel launches except for the kernel launches for which thread blocks consume too much of the multiprocessor's resources. These exceptions are reported by the profiler."

C.C 3.0/3.5 shared memory: "Shared memory has 32 banks with two addressing modes that are described below. The addressing mode can be queried using cudaDeviceGetSharedMemConfig() and set using cudaDeviceSetSharedMemConfig() (see reference manual for more details). Each bank has a bandwidth of 64 bits per clock cycle."

64-Bit Mode: successive 64-bit words map to successive banks. "A shared memory request for a warp does not generate a bank conflict between two threads that access any sub-word within the same 64-bit word (even though the addresses of the two sub-words fall in the same bank): In that case, for read accesses, the 64-bit word is broadcast to the requesting threads and for write accesses, each sub-word is written by only one of the threads (which thread performs the write is undefined). In this mode, the same access pattern generates fewer bank conflicts than on devices of compute capability 2.x for 64-bit accesses and as many or fewer for 32-bit accesses."

32-Bit Mode: successive 32-bit words map to successive banks. "A shared memory request for a warp does not generate a bank conflict between two threads that access any sub-word within the same 32-bit word or within two 32-bit words whose indices i and j are in the same 64-word aligned segment (i.e., a segment whose first index is a multiple of 64) and such that j=i+32 (even though the addresses of the two sub-words fall in the same bank): In that case, for read accesses, the 32-bit words are broadcast to the requesting threads and for write accesses, each sub-word is written by only one of the threads (which thread performs the write is undefined). In this mode, the same access pattern generates as many or fewer bank conflicts than on devices of compute capability 2.x."

C.C 5.0/5.2 shared memory: "Shared memory has 32 banks that are organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per clock cycle."

C.C 2.0: "Shared memory has 32 banks that are organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per two clock cycles."

Applicable to C.C 2.0/3.0/3.5/5.0/5.2: "A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank): In that case, for read accesses, the word is broadcast to the requesting threads and for write accesses, each address is written by only one of the threads (which thread performs the write is undefined)."

For compute capability 2.x the guide adds: "A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank): In that case, for read accesses, the word is broadcast to the requesting threads (and unlike for devices of compute capability 1.x, multiple words can be broadcast in a single transaction) and for write accesses, each address is written by only one of the threads (which thread performs the write is undefined). This means, in particular, that unlike for devices of compute capability 1.x, there are no bank conflicts if an array of char is accessed as follows, for example:

extern __shared__ char shared[];
char data = shared[BaseIndex + tid];

Also, unlike for devices of compute capability 1.x, there may be bank conflicts between a thread belonging to the first half of a warp and a thread belonging to the second half of the same warp."

32-Bit Strided Access: "A common access pattern is for each thread to access a 32-bit word from an array indexed by the thread ID tid and with some stride s:

extern __shared__ float shared[];
float data = shared[BaseIndex + s * tid];

In this case, threads tid and tid+n access the same bank whenever s*n is a multiple of the number of banks (i.e., 32) or, equivalently, whenever n is a multiple of 32/d where d is the greatest common divisor of 32 and s. As a consequence, there will be no bank conflict only if the warp size (i.e., 32) is less than or equal to 32/d, i.e., only if d is equal to 1, i.e., s is odd."

Larger Than 32-Bit Access: "64-bit and 128-bit accesses are specifically handled to minimize bank conflicts as described below. Other accesses larger than 32-bit are split into 32-bit, 64-bit, or 128-bit accesses. The following code, for example:

struct type {
    float x, y, z;
};
extern __shared__ struct type shared[];
struct type data = shared[BaseIndex + tid];

results in three separate 32-bit reads without bank conflicts since each member is accessed with a stride of three 32-bit words.

64-Bit Accesses: For 64-bit accesses, a bank conflict only occurs if two threads in either of the half-warps access different addresses belonging to the same bank. Unlike for devices of compute capability 1.x, there are no bank conflicts for arrays of doubles accessed as follows, for example:

extern __shared__ double shared[];
double data = shared[BaseIndex + tid];

128-Bit Accesses: The majority of 128-bit accesses will cause 2-way bank conflicts, even if no two threads in a quarter-warp access different addresses belonging to the same bank. Therefore, to determine the ways of bank conflicts, one must add 1 to the maximum number of threads in a quarter-warp that access different addresses belonging to the same bank."
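To make the "s is odd" rule from the strided-access excerpt above concrete, here is a minimal CUDA sketch (my own illustration, not code from this thread or from GPUGRID; TILE and the kernel name are made up) of the usual padding trick: a 32x32 shared tile is padded to 33 columns so column-wise accesses use an odd stride and hit distinct banks.

```cuda
#define TILE 32

// Tile-based matrix transpose, assuming a (TILE, TILE) thread block.
// The +1 padding gives column accesses a stride of 33 (odd), which is
// bank-conflict free by the rule quoted above.
__global__ void transposeTile(const float* __restrict__ in, float* out, int n)
{
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];

    __syncthreads();

    // Block indices are swapped for the transposed write.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
```

Without the +1 pad, the column read tile[threadIdx.x][threadIdx.y] would have a stride of 32 words, so all 32 threads of a warp would hit the same bank.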
Joined: 25 Sep 13 | Posts: 293 | Credit: 1,897,601,978 | RAC: 0
Clarification for new Maxwell card owners and Kepler owners looking to upgrade: what will Maxwell GM2*0 bring to its brethren? We know of a few proven GM107/GM204 changes from Kepler: a refined crossbar/dispatch/issue design, better power efficiency per SMM compared to SMX (it runs cooler), lower DP64 performance from the loss of 64-bit banks and the 8-byte shared banking mode (fewer DP cores per SMM), higher integer throughput (more cores per SMM), a different TMU/CUDA core ratio, revised filtering/cache/memory algorithms, a barrel shifter (left out of Kepler), and an enhanced SIP core block for video encoding/decoding (Kepler's is first generation).

See the Version 1.1 Maxwell Tuning Guide for further comparisons:

The Maxwell Streaming Multiprocessor, SMM, is similar in many respects to the Kepler architecture's SMX. The key enhancements of SMM over SMX are geared toward improving efficiency without requiring significant increases in available parallelism per SM from the application.

1.4.1.1. Occupancy
The maximum number of concurrent warps per SMM remains the same as in SMX (i.e., 64), and factors influencing warp occupancy remain similar or improved over SMX:
• The register file size (64k 32-bit registers) is the same as that of SMX.
• The maximum registers per thread, 255, matches that of Kepler GK110. As with Kepler, experimentation should be used to determine the optimum balance of register spilling vs. occupancy, however.
• The maximum number of thread blocks per SM has been increased from 16 to 32. This should result in an automatic occupancy improvement for kernels with small thread blocks of 64 or fewer threads (shared memory and register file resource requirements permitting). Such kernels would have tended to under-utilize SMX, but less so SMM.
• Shared memory capacity is increased (see Shared Memory Capacity).
As such, developers can expect similar or improved occupancy on SMM without changes to their application. At the same time, warp occupancy requirements (i.e., available parallelism) for maximum device utilization are similar to or less than those of SMX (see Instruction Latencies).

1.4.1.2. Instruction Scheduling
The number of CUDA Cores per SM has been reduced to a power of two, however with Maxwell's improved execution efficiency, performance per SM is usually within 10% of Kepler performance, and the improved area efficiency of SMM means CUDA Cores per GPU will be substantially higher vs. comparable Fermi or Kepler chips. SMM retains the same number of instruction issue slots per clock and reduces arithmetic latencies compared to the Kepler design. As with SMX, each SMM has four warp schedulers. Unlike SMX, however, all SMM core functional units are assigned to a particular scheduler, with no shared units. Along with the selection of a power-of-two number of CUDA Cores per SM, which simplifies scheduling and reduces stall cycles, this partitioning of SM computational resources in SMM is a major component of the streamlined efficiency of SMM. The power-of-two number of CUDA Cores per partition simplifies scheduling, as each of SMM's warp schedulers issue to a dedicated set of CUDA Cores equal to the warp width. Each warp scheduler still has the flexibility to dual-issue (such as issuing a math operation to a CUDA Core in the same cycle as a memory operation to a load/store unit), but single-issue is now sufficient to fully utilize all CUDA Cores.
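As a quick way to see the occupancy point above (SMM raises the resident thread-block limit per SM from 16 to 32, which helps small 64-thread blocks), here is a hedged host-side sketch using the occupancy API introduced in CUDA 6.5; the kernel is a stand-in of my own, not GPUGRID code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel meant to be launched with small 64-thread blocks.
__global__ void smallBlockKernel(float* x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] *= 2.0f;
}

int main()
{
    const int blockSize = 64;
    int blocksPerSM = 0;

    // Reports how many blocks of this kernel can be resident on one SM;
    // the hardware ceiling is 32 blocks on SMM versus 16 on SMX.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM,
                                                  smallBlockKernel,
                                                  blockSize, 0);

    printf("Resident blocks per SM at %d threads/block: %d\n",
           blockSize, blocksPerSM);
    return 0;
}
```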
1.4.1.3. Instruction Latencies
Another major improvement of SMM is that dependent math latencies have been significantly reduced; a consequence of this is a further reduction of stall cycles, as the available warp-level parallelism (i.e., occupancy) on SMM should be equal to or greater than that of SMX (see Occupancy), while at the same time each math operation takes less time to complete, improving utilization and throughput.

1.4.1.4. Instruction Throughput
The most significant changes to peak instruction throughputs in SMM are as follows:
• The change in number of CUDA Cores per SM brings with it a corresponding change in peak single-precision floating point operations per clock per SM. However, since the number of SMs is typically increased, the result is an increase in aggregate peak throughput; furthermore, the scheduling and latency improvements also discussed above make this peak easier to approach.
• The throughput of many integer operations including multiply, logical operations and shift is improved. In addition, there are now specialized integer instructions that can accelerate pointer arithmetic. These instructions are most efficient when data structures are a power of two in size.
Note: As was already the recommended best practice, signed arithmetic should be preferred over unsigned arithmetic wherever possible for best throughput on SMM. The C language standard places more restrictions on overflow behavior for unsigned math, limiting compiler optimization opportunities.

1.4.2. Memory Throughput
1.4.2.1. Unified L1/Texture Cache
Maxwell combines the functionality of the L1 and texture caches into a single unit. As with Kepler, global loads in Maxwell are cached in L2 only, unless using the LDG read-only data cache mechanism introduced in Kepler. In a manner similar to Kepler GK110B, GM204 retains this behavior by default but also allows applications to opt-in to caching of global loads in its unified L1/Texture cache. The opt-in mechanism is the same as with GK110B: pass the -Xptxas -dlcm=ca flag to nvcc at compile time.
Local loads also are cached in L2 only, which could increase the cost of register spilling if L1 local load hit rates were high with Kepler. The balance of occupancy versus spilling should therefore be reevaluated to ensure best performance. Especially given the improvements to arithmetic latencies, code built for Maxwell may benefit from somewhat lower occupancy (due to increased registers per thread) in exchange for lower spilling.
The unified L1/texture cache acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. This function previously was served by the separate L1 cache in Fermi and Kepler.
Two new device attributes were added in CUDA Toolkit 6.0: globalL1CacheSupported and localL1CacheSupported. Developers who wish to have separately-tuned paths for various architecture generations can use these fields to simplify the path selection process.
Note: Enabling caching of globals in GM204 can affect occupancy. If per-thread-block SM resource usage would result in zero occupancy with caching enabled, the CUDA driver will override the caching selection to allow the kernel launch to succeed. This situation is reported by the profiler.
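The "LDG read-only data cache mechanism" mentioned in the excerpt above is reached from ordinary CUDA C by marking load pointers const and __restrict__, per the Programming Guide quote earlier in this thread. A minimal sketch, with names of my own invention:

```cuda
// const + __restrict__ lets the compiler prove the data is read-only for
// the kernel's lifetime and route the load of in[i] through the
// read-only (LDG) path on sm_35 and later; __ldg(&in[i]) would force it.
__global__ void scaleArray(const float* __restrict__ in,
                           float* __restrict__ out,
                           float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a * in[i];
    }
}
```

The separate GM204 opt-in for caching plain global loads in the unified L1/texture cache is the compile-time flag quoted above, passed as, for example, nvcc -arch=sm_52 -Xptxas -dlcm=ca kernel.cu.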
1.4.3. Shared Memory
1.4.3.1. Shared Memory Capacity
With Fermi and Kepler, shared memory and the L1 cache shared the same on-chip storage. Maxwell, by contrast, provides dedicated space to the shared memory of each SMM, since the functionality of the L1 and texture caches have been merged in SMM. This increases the shared memory space available per SMM as compared to SMX: GM107 provides 64 KB shared memory per SMM, and GM204 further increases this to 96 KB shared memory per SMM.
This presents several benefits to application developers:
• Algorithms with significant shared memory capacity requirements (e.g., radix sort) see an automatic 33% to 100% boost in capacity per SM on top of the aggregate boost from higher SM count.
• Applications no longer need to select a preference of the L1/shared split for optimal performance. For purposes of backward compatibility with Fermi and Kepler, applications may optionally continue to specify such a preference, but the preference will be ignored on Maxwell, with the full 64 KB per SMM always going to shared memory.
Note: While the per-SM shared memory capacity is increased in SMM, the per-thread-block limit remains 48 KB. For maximum flexibility on possible future GPUs, NVIDIA recommends that applications use at most 32 KB of shared memory in any one thread block, which would for example allow at least two such thread blocks to fit per SMM.

1.4.3.2. Shared Memory Bandwidth
Kepler SMX introduced an optional 8-byte shared memory banking mode, which had the potential to increase shared memory bandwidth per SM over Fermi for shared memory accesses of 8 or 16 bytes. However, applications could only benefit from this when storing these larger elements in shared memory (i.e., integers and fp32 values saw no benefit), and only when the developer explicitly opted into the 8-byte bank mode via the API. To simplify this, Maxwell returns to the Fermi style of shared memory banking, where banks are always four bytes wide. Aggregate shared memory bandwidth across the chip remains comparable to that of corresponding Kepler chips, given increased SM count. In this way, all applications using shared memory can now benefit from the higher bandwidth, even when storing only four-byte items into shared memory and without specifying any particular preference via the API.

1.4.3.3. Fast Shared Memory Atomics
Kepler introduced a dramatically higher throughput for atomic operations to global memory as compared to Fermi. However, atomic operations to shared memory remained essentially unchanged: both architectures implemented shared memory atomics using a lock/update/unlock pattern that could be expensive in the case of high contention for updates to particular locations in shared memory. Maxwell improves upon this by implementing native shared memory atomic operations for 32-bit integers and native shared memory 32-bit and 64-bit compare-and-swap (CAS), which can be used to implement other atomic functions with reduced overhead compared to the Fermi and Kepler methods.
Note: Refer to the CUDA C Programming Guide for an example implementation of an fp64 atomicAdd() using atomicCAS().

Source: Maxwell Tuning Guide, Version 1.1
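The Note above refers to the well-known CAS-loop workaround for fp64 atomicAdd from the CUDA C Programming Guide; it looks roughly like this (function name chosen here to avoid clashing with a built-in):

```cuda
// Software double-precision atomic add built on 64-bit atomicCAS,
// the pattern the tuning guide note above points to.
__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* addr = (unsigned long long int*)address;
    unsigned long long int old = *addr, assumed;
    do {
        assumed = old;
        old = atomicCAS(addr, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // retry if another thread updated the value
    return __longlong_as_double(old);
}
```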
Throughput of Native Arithmetic Instructions (number of operations per clock cycle per multiprocessor; for total throughput, multiply by the number of SMs/SMXs/SMMs):

| Instruction | 1.1/1.2 | 1.3 | 2.0 | 2.1 | 3.0 | 3.5 | 5.x |
|---|---|---|---|---|---|---|---|
| 32-bit floating-point add, multiply, multiply-add | 8 | 8 | 32 | 48 | 192 | 192 | 128 |
| 64-bit floating-point add, multiply, multiply-add | N/A | 1 | 16 | 4 | 8 | 64 | 1 |
| 32-bit floating-point reciprocal, reciprocal square root, base-2 logarithm (__log2f), base-2 exponential (exp2f), sine (__sinf), cosine (__cosf) | 2 | 2 | 4 | 8 | 32 | 32 | 32 |
| 32-bit integer add, extended-precision add, subtract, extended-precision subtract | 10 | 10 | 32 | 48 | 160 | 160 | 128 |
| 32-bit integer multiply, multiply-add, extended-precision multiply-add | Multiple instr. | Multiple instr. | 16 | 16 | 32 | 32 | Multiple instr. |
| 24-bit integer multiply (__[u]mul24) | 8 | 8 | Multiple instr. | Multiple instr. | Multiple instr. | Multiple instr. | Multiple instr. |
| 32-bit integer shift | 8 | 8 | 16 | 16 | 32 | 64 | 64 |
| compare, minimum, maximum | 10 | 10 | 32 | 48 | 160 | 160 | 64 |
| 32-bit integer bit reverse, bit field extract/insert | Multiple instr. | Multiple instr. | 16 | 16 | 32 | 32 | 64 |
| 32-bit bitwise AND, OR, XOR | 8 | 8 | 32 | 48 | 160 | 160 | 128 |
| count of leading zeros, most significant non-sign bit | Multiple instr. | Multiple instr. | 16 | 16 | 32 | 32 | Multiple instr. |
| population count | Multiple instr. | Multiple instr. | 16 | 16 | 32 | 32 | 32 |
| warp shuffle | N/A | N/A | N/A | N/A | 32 | 32 | 32 |
| sum of absolute difference | Multiple instr. | Multiple instr. | 16 | 16 | 32 | 32 | 64 |
| SIMD video instructions vabsdiff2 | N/A | N/A | N/A | N/A | 160 | 160 | Multiple instr. |
| SIMD video instructions vabsdiff4 | N/A | N/A | N/A | N/A | 160 | 160 | Multiple instr. |
| All other SIMD video instructions | N/A | N/A | 16 | 16 | 32 | 32 | Multiple instr. |
| Type conversions from 8-bit and 16-bit integer to 32-bit types | 8 | 8 | 16 | 16 | 128 | 128 | 32 |
| Type conversions from and to 64-bit types | Multiple instr. | 1 | 16 | 4 | 8 | 32 | 4 |
| All other type conversions | 8 | 8 | 16 | 16 | 32 | 32 | 32 |

Source: CUDA C Programming Guide
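And for the 8-byte bank mode that the Shared Memory Bandwidth section above says Kepler exposes via the API (and that Maxwell ignores in favour of fixed 4-byte banks), a minimal host-side sketch; the printed strings are my own:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Opt in to Kepler's 8-byte-wide shared memory banks (C.C 3.x).
    // On Maxwell the 4-byte banking described above is used regardless.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

    cudaSharedMemConfig cfg;
    cudaDeviceGetSharedMemConfig(&cfg);
    printf("Shared memory bank width: %s\n",
           cfg == cudaSharedMemBankSizeEightByte ? "8 bytes" : "4 bytes");
    return 0;
}
```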
Retvari Zoltan | Joined: 20 Jan 09 | Posts: 2380 | Credit: 16,897,957,044 | RAC: 0
That is absolutely and exhaustively TLDR. However - if you like - you can choose some further readings from this list.
Joined: 26 Jun 09 | Posts: 815 | Credit: 1,470,385,294 | RAC: 0
> I suppose your memory clock is set to 1500 MHz (in GPU-Z) as well? I'm trying again to manually run 1750 MHz. I passed "memtestcl" at that clock, and for the last few minutes everything has looked fine. I could live with manually forcing the clock, while staying in P2, as long as I don't get blue screens again!

Yes, indeed the memory clock runs at 1500MHz. I have now set the temperature maximum to 73°C. The GTX970 runs smoothly, throws hot air out of the case, and is around 10,000 seconds faster than my GTX770. Checked via an energy monitor, the system runs with 100W less energy than with the GTX770 in it. So not bad at all.

What will be the effect of getting it into the P0 state with the memory higher? Will that result in faster processing of WUs?

Greetings from TJ
Joined: 25 Sep 13 | Posts: 293 | Credit: 1,897,601,978 | RAC: 0
I'm looking forward to Matt's analysis. (Sorry for the TLDR posts.)

GPUGRID notwithstanding, Kepler still reigns in many places. C.C 5.2 has room for further improvements. C.C 5.2 struggles against C.C 3.5 and does okay versus C.C 3.0 in some paths (C.C 3.0 is better at float/double and at generating 64-bit random numbers). This is similar to the C.C 2.0 to C.C 2.1 realignment (32 cores + 16 DP per SM ~~> 48 cores + 4 DP per SM). A lot of GK110 features were transplanted into GM107/GM204.

The replacement for the flagship is now more interesting, judging from recent GK110/GM204 owner inquiries. C.C 3.5 is still faster in many ways. The C.C 5.2 differences became notable once many sectors had time to put C.C 5.2 through its paces. What will NVidia offer to replace a very successful and still meaningful flagship? A lower TDP with a less apt code structure in some areas is not equal, as there are some CUDA C.C 5.2 limitations compared to C.C 3.5/3.0.

Tesla/Quadro/Titan had a price drop lately to open up inventory. Looking at the prices for the GTX980/GTX780[Ti] (500-900 USD) in the current market, a 1000$ Titan looks to have been a bargain two years ago or even recently. (GTX780[Ti] cards were at a record low of 300-500$ a couple of weeks after GM204 launched.)

What will the specs for GM2*0 be? Will it be a 1/2 DP64 ratio, with 64 DP per SMM and 16 DP per [4] 32-core block? More than the 960 DP64 cores of a fully enabled GK110?
Joined: 17 Aug 08 | Posts: 2705 | Credit: 1,311,122,549 | RAC: 0
> What will be the effect of getting it into the P0 state with the memory higher? Will that result in faster processing of WUs?

It should get the memory clock up to 1750 MHz automatically. This should be good for about a 4% performance boost, according to SK. So it's not dramatic, but quite a few credits per day more. Memory controller load would drop from almost 50% to a good 40% (roughly 50% × 1500/1750 ≈ 43%). Scaling with higher GPU clocks should be improved. And the GTX980 might like this a lot, as with higher performance it needs even more bandwidth to feed the beast.

And I'm not sure whether P2 uses a lower memory voltage. That's pretty much the point I care about most. Initially I got blue screens when I tried to run my memory at stock 1750 MHz in P2, whereas Heaven ran fine for about 2h at that clock speed in P0. When I tried it again yesterday I got a calculation error. Other people are reporting memory clocks up to 2000 MHz for these cards, which might be totally out of reach if I can't even get 1750 MHz stable.

And finally: I'm asking myself, why is nVidia limiting the memory clock speed in P2? And why are they changing to P2 whenever CUDA or OpenCL is being used? It can't be low utilization (~50%, as in Heaven). Maybe they're using stricter timings in this mode, which might generally benefit GP-GPU more than higher throughput? This could explain why my card would take this clock speed in P0, but not in P2.

MrS
Scanning for our furry friends since Jan 2002
Retvari Zoltan | Joined: 20 Jan 09 | Posts: 2380 | Credit: 16,897,957,044 | RAC: 0
> I'm asking myself, why is nVidia limiting the memory clock speed in P2? And why are they changing to P2 whenever CUDA or OpenCL is being used? It can't be low utilization (~50%, as in Heaven). Maybe they're using stricter timings in this mode, which might generally benefit GP-GPU more than higher throughput? This could explain why my card would take this clock speed in P0, but not in P2.

I think that stability is a more important factor than speed while using CUDA or OpenCL. I had memory clock issues with my Gigabyte GTX780Ti OC, so perhaps NVidia is trying to avoid such issues as much as possible.

I've overclocked my GTX980 to 1407MHz @ 1.237V; the memory controller usage has risen to 56-60%, but it's still running at 3005MHz. I'm not sure I can increase the memory clock to 3505MHz, as according to MSI Afterburner my card is using 104-108% power.