Message boards : Graphics cards (GPUs) : CUDA 7 Release Candidate Feature Overview
---

*Joined: 20 Jul 14 · Posts: 732 · Credit: 130,089,082 · RAC: 0*
It’s almost time for the next major release of the CUDA Toolkit, so I’m excited to tell you about the CUDA 7 Release Candidate, now available to all CUDA Registered Developers. The CUDA Toolkit version 7 expands the capabilities and improves the performance of the Tesla Accelerated Computing Platform and of accelerated computing on NVIDIA GPUs.

CUDA 7 Release Candidate Feature Overview: C++11, New Libraries, and More

[CSF] Thomas H.V. Dupont
Founder of the team CRUNCHERS SANS FRONTIERES 2.0
www.crunchersansfrontieres
---

*Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0*
Changes from Version 6.5:
• Removed all references to devices of compute capability 1.x, as they are no longer supported.
• Mentioned in Default Stream the new --default-stream compilation flag that changes the behavior of the default stream.
• Clarified the behavior of some atomic operations in Floating-Point Standard.
• Updated Table 6 with improved ULP errors for rhypotf, rcbrtf, log2f, log10f, erfinvf, erfcf, erfcxf, and normcdff.
• Updated Table 14, as CUDA_VISIBLE_DEVICES now accepts gpu-uuids to enumerate devices.
• Added the new CUDA_AUTO_BOOST environment variable to Table 14.

PTX ISA version 4.2 introduces the following new features:
• Support for the memory_layout field for surfaces, and suq instruction support for querying this field.

Semantic Changes and Clarifications
• Semantics for parameter passing under the ABI were updated to indicate that ld.param and st.param instructions used for argument passing cannot be predicated.
• Semantics of {atom/red}.add.f32 were updated to indicate that subnormal inputs and results are flushed to sign-preserving zero for atomic operations on global memory, whereas atomic operations on shared memory preserve subnormal inputs and results and do not flush them to zero.

Features Unimplemented in PTX ISA Version 4.2
The following features remain unimplemented in PTX ISA version 4.2:
• Support for variadic functions.
• Allocation of per-thread, stack-based memory using alloca.
• Indirect branches.

Source: the PTX ISA documentation in the CUDA 7 toolkit and the CUDA 7 Programming Guide.

Driver 347.12 is part of the CUDA 7 toolkit, with the same CUDA driver version (7.0.18) as 347.09. Branch 346 is for CUDA 7, while branch 343 is for CUDA 6.5.

A new library is included: "The cuSolver library is a high-level package based on the cuBLAS and cuSPARSE libraries. It combines three separate libraries under a single umbrella, each of which can be used independently or in concert with other toolkit libraries."

From the CUDA 7 Release Notes:

2. New Features

2.1. General CUDA
• Added a method to the CUDA Driver API, cuDevicePrimaryCtxRetain(), that allows a program to create (or to access, if it already exists) the same CUDA context for a GPU device as the one used by the CUDART (CUDA Runtime API) library. This context is referred to as the primary context, and this new method allows for sharing the primary context between CUDART and other threads, which can reduce the performance overhead of creating and maintaining multiple contexts per device.
• Unified the device enumeration for CUDA, NVML, and related tools. The variable CUDA_DEVICE_ORDER can have a value of FASTEST_FIRST (default) or PCI_BUS_ID.
• Instrumented NVML (NVIDIA Management Library) and the CUDA driver to ignore GPUs that have been made inaccessible via cgroups (control groups). This enables schedulers that rely on cgroups to enforce device access restrictions for their jobs. Job schedulers wanting to use cgroups for device restriction need CUDA and NVML to handle those restrictions in a graceful way.
• Implemented the Multi-Process Service (MPS), which enables concurrent execution of GPU tasks from multiple CPU processes within a single node, allowing multiple MPI ranks on a node to share a GPU.
• The Windows and Mac OS X installers are now also available as network installers. A network installer is much smaller than the traditional local installer and downloads only the components selected for installation.

2.2. CUDA Tools

2.2.1. CUDA Compiler
• Added support for GCC 4.9.
• Added support for the C++11 language dialect.
• On Mac OS X, libc++ is supported with XCode 5.x. The command-line option -Xcompiler -stdlib=libstdc++ is no longer needed when invoking NVCC; instead, NVCC uses the default library that Clang chooses on Mac OS X. Users are still able to choose between libc++ and libstdc++ by passing -Xcompiler -stdlib=libc++ or -Xcompiler -stdlib=libstdc++ to NVCC.
• The Runtime Compilation library (nvrtc) provides an API to compile CUDA C++ device source code at runtime. The resulting compiled PTX can be launched on a GPU using the CUDA Driver API. More details can be found in the libNVRTC User Guide.
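To make the nvrtc workflow in the last bullet more concrete, here is a minimal sketch of compiling a kernel held as a string at runtime and launching the resulting PTX through the Driver API. The kernel source, the saxpy name, and the launch configuration are illustrative assumptions rather than anything given in the release notes, and error checking is omitted for brevity.

```cpp
#include <nvrtc.h>
#include <cuda.h>
#include <vector>

// Illustrative device source held as a string and compiled at runtime with libNVRTC.
const char *kernelSrc =
    "extern \"C\" __global__ void saxpy(float a, float *x, float *y, int n) {\n"
    "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
    "    if (i < n) y[i] = a * x[i] + y[i];\n"
    "}\n";

int main() {
    // 1. Compile the CUDA C++ source to PTX at runtime.
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kernelSrc, "saxpy.cu", 0, NULL, NULL);
    const char *opts[] = { "--gpu-architecture=compute_30" };
    nvrtcCompileProgram(prog, 1, opts);
    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // 2. Load the PTX and launch the kernel through the CUDA Driver API.
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;   cuModuleLoadData(&mod, ptx.data());
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "saxpy");

    int n = 1 << 20;
    float a = 2.0f;
    CUdeviceptr x, y;  // data initialization omitted in this sketch
    cuMemAlloc(&x, n * sizeof(float));
    cuMemAlloc(&y, n * sizeof(float));
    void *args[] = { &a, &x, &y, &n };
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, NULL, args, NULL);
    cuCtxSynchronize();

    cuMemFree(x); cuMemFree(y);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```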
2.2.2. CUDA-GDB
• Starting with CUDA 7.0, GPU core dumps can be read by CUDA-GDB with the target cudacore ${gpucoredump} and target core ${cpucoredump} ${gpucoredump} commands.
• Enabled CUDA applications to generate a GPU core dump when an exception is hit on the GPU. The feature is supported on Windows, Mac OS X, and Linux desktops. (Android, L4T, and Vibrante support may come in the future.) On Windows, this feature is only supported in TCC mode. On Unix-like OSs (Linux, OS X, etc.), a CPU core dump is generated along with a GPU core dump.

2.2.3. CUDA-MEMCHECK
• Enabled the tracking and reporting of uninitialized global memory.

2.2.4. CUDA Profiler
• On supported chips (sm_30 and beyond), all hardware counters exposed by the CUDA profiling tools (nvprof, nvvp, and Nsight Eclipse Edition) can now be profiled from multiple applications at the same time.

2.2.5. Nsight Eclipse Edition
• Cross-compiling to the POWER8 target architecture using the GNU toolchain is now supported within the Nsight IDE.

2.2.6. NVIDIA Visual Profiler
• With GPU PC sampling, which is supported for devices with compute capability 5.2, the Visual Profiler shows stall causes for each source and assembly line. This helps in pinpointing latency bottlenecks in a GPU kernel at the source level.

2.3. CUDA Libraries

2.3.1. cuBLAS Library
• The batched LU solver routine cublas{T}getrsBatched has been added to cuBLAS. It takes the output of the batched factorization routines cublas{T}getrfBatched to compute the solution given the provided batch of right-hand-side matrices.
• A license is no longer required in order to use cuBLAS-XT with more than two GPUs.

2.3.2. cuFFT Library
• For CUDA 7.0, support for callback routines, invoked when cuFFT loads and/or stores data, no longer requires an evaluation license file.
• For CUDA 7.0, cuFFT multiple-GPU execution is supported on up to four GPUs, except for single 1D complex-to-complex transforms, which are supported on two or four GPUs.
• In CUDA 7.0, transform execution may be distributed to four GPUs with the same CUDA architecture. In addition, multiple-GPU support for two or four GPUs is no longer constrained to GPUs on a single board. Use of this functionality requires a free evaluation license file, which is available to registered developers via the cuFFT developer page.
• For CUDA 7.0, single complex-to-complex 2D and 3D transforms with dimensions that can be factored into primes less than or equal to 127 are supported on multiple GPUs. Single complex-to-complex 1D transforms on multiple GPUs continue to be limited to sizes that are powers of 2.

2.3.3. cuSOLVER Library
• CUDA 7.0 introduces cuSOLVER, a new library that is a collection of routines to solve linear systems and eigenvalue problems. It includes dense and sparse linear solvers and sparse refactorization.
• Enabled offloading dense linear algebra calls to the GPUs in a sparse direct solver.
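The notes only outline cuSOLVER's scope, so here is a rough sketch of how its dense LU path (cusolverDn) might be used to solve A·X = B. The helper name, single precision, column-major layout, and omission of error checking are all assumptions made for illustration.

```cpp
#include <cusolverDn.h>
#include <cuda_runtime.h>

// Solve A * X = B for a dense n-by-n single-precision matrix already resident
// on the device (column-major), using LU factorization from cusolverDn.
void denseLuSolve(float *dA, float *dB, int n, int nrhs) {
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    // Workspace query and allocation for the factorization.
    int lwork = 0;
    cusolverDnSgetrf_bufferSize(handle, n, n, dA, n, &lwork);
    float *dWork;  cudaMalloc(&dWork, lwork * sizeof(float));
    int *dPiv;     cudaMalloc(&dPiv, n * sizeof(int));
    int *dInfo;    cudaMalloc(&dInfo, sizeof(int));

    // LU factorization with partial pivoting, performed in place on dA.
    cusolverDnSgetrf(handle, n, n, dA, n, dWork, dPiv, dInfo);

    // Back-substitution: overwrite dB with the solution X.
    cusolverDnSgetrs(handle, CUBLAS_OP_N, n, nrhs, dA, n, dPiv, dB, n, dInfo);

    cudaFree(dWork); cudaFree(dPiv); cudaFree(dInfo);
    cusolverDnDestroy(handle);
}
```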
2.3.4. cuSPARSE Library
• Added a new cusparse<t>csrgemm2() routine, optimized for small matrices and operations C = a*A*B + b*D, where A, B, and D are CSR matrices.
• Added graph coloring.

2.3.5. CUDA Math Library
• Support for the 3D and 4D Euclidean norms and the 3D Euclidean reciprocal norm has been added to the math library.

2.3.6. Thrust Library
• Thrust version 1.8.0 introduces support for algorithm invocation from CUDA __device__ code, support for CUDA streams, and algorithm performance improvements (a brief usage sketch follows after this list). Users may now invoke Thrust algorithms from CUDA __device__ code, providing a parallel algorithms library to CUDA programmers authoring custom kernels, as well as allowing Thrust programmers to nest their algorithm calls within functors. The thrust::seq execution policy allows users to require sequential algorithm execution in the calling thread and makes a sequential algorithms library available to individual CUDA threads. The .on(stream) syntax allows users to request a CUDA stream for kernels launched during algorithm execution. Finally, new CUDA algorithm implementations provide substantial performance improvements.

2.4. CUDA Samples
• The CUDA Samples makefile x86_64=1 and ARMv7=1 options have been deprecated. Please use TARGET_ARCH to set the targeted build architecture instead. The CUDA Samples makefile GCC option has been deprecated. Please use HOST_COMPILER to set the host compiler instead.
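As a rough sketch of the two Thrust 1.8 invocation styles mentioned above (device-side calls with thrust::seq, and host-side calls on a user-supplied stream via .on(stream)); the segment layout and sizes are made up for the example:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>

// Device-side invocation (Thrust 1.8): one thread per block sorts its own
// segment sequentially using the thrust::seq execution policy.
__global__ void sortSegments(int *data, int segLen) {
    if (threadIdx.x == 0) {
        int *seg = data + blockIdx.x * segLen;
        thrust::sort(thrust::seq, seg, seg + segLen);
    }
}

int main() {
    const int numSegs = 256, segLen = 1024;
    thrust::device_vector<int> v(numSegs * segLen, 1);

    // Each block sorts one segment from within the kernel.
    sortSegments<<<numSegs, 32>>>(thrust::raw_pointer_cast(v.data()), segLen);

    // Host-side invocation on a user-supplied CUDA stream via .on(stream).
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    thrust::sort(thrust::cuda::par.on(stream), v.begin(), v.end());
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```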
---

*Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0*
R352 driver branch, CUDA 7.5 toolkit: http://devblogs.nvidia.com/parallelforall/new-features-cuda-7-5/

Taking place this week in Lille, France, is the 2015 International Conference on Machine Learning. NVIDIA also released the CUDA Deep Neural Network library (cuDNN 3) and DIGITS 2, higher-level neural network software for general scientists and researchers.

The CUDA 7.5 Release Notes also introduce PTX ISA Version 4.3 (section 1.3).
---

*skgiven · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0*
There might be something in the libraries of use here. It looks like the ACEMD app might be better supported on multiple cards (one app running over several cards, 2 or 4). Matt and Gianni would know.

FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help
---

*Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0*
I wonder if that means that GPU projects can be run when BOINC is installed as a service, at least if the appropriate changes are made to BOINC.
---

*skgiven · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0*
In theory, yes, but the GPUGrid app and BOINC might need to be updated first. Then you would need to reinstall BOINC as a service. It would obviously need some research, development, and testing before it could become the norm.

FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help
---

*Joined: 5 Jan 09 · Posts: 670 · Credit: 2,498,095,550 · RAC: 0*
> BOINC might need to be updated first.

On that point, you may have to wait longer for the development of BOINC, as its funding has been withdrawn and it now relies on volunteers.
---

*Joined: 2 Jan 09 · Posts: 303 · Credit: 7,321,800,090 · RAC: 330*
> BOINC might need to be updated first.

And all the main players have found other paying jobs. They have NOT quit providing BOINC support; they are just not doing it full time anymore.