Message boards : Graphics cards (GPUs) : CUDA 7 Release Candidate Feature Overview
---

*Joined: 20 Jul 14 · Posts: 732 · Credit: 130,089,082 · RAC: 0*
It’s almost time for the next major release of the CUDA Toolkit, so I’m excited to tell you about the CUDA 7 Release Candidate, now available to all CUDA Registered Developers. The CUDA Toolkit version 7 expands the capabilities and improves the performance of the Tesla Accelerated Computing Platform and of accelerated computing on NVIDIA GPUs.

CUDA 7 Release Candidate Feature Overview: C++11, New Libraries, and More

[CSF] Thomas H.V. Dupont
Founder of the team CRUNCHERS SANS FRONTIERES 2.0
www.crunchersansfrontieres
---

*Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0*
Changes from Version 6.5:
• Removed all references to devices of compute capability 1.x, as they are no longer supported.
• Mentioned in Default Stream the new --default-stream compilation flag that changes the behavior of the default stream.
• Clarified the behavior of some atomic operations in Floating-Point Standard.
• Updated Table 6 with improved ULP errors for rhypotf, rcbrtf, log2f, log10f, erfinvf, erfcf, erfcxf, and normcdff.
• Updated Table 14, as CUDA_VISIBLE_DEVICES now accepts gpu-uuids to enumerate devices.
• Added the new CUDA_AUTO_BOOST environment variable to Table 14.

PTX ISA version 4.2 introduces the following new features:
• Support for the memory_layout field for surfaces, and suq instruction support for querying this field.

Semantic Changes and Clarifications
• Semantics for parameter passing under the ABI were updated to indicate that ld.param and st.param instructions used for argument passing cannot be predicated.
• Semantics of {atom/red}.add.f32 were updated to indicate that subnormal inputs and results are flushed to sign-preserving zero for atomic operations on global memory, whereas atomic operations on shared memory preserve subnormal inputs and results and do not flush them to zero.

Features Unimplemented in PTX ISA Version 4.2
The following features remain unimplemented in PTX ISA version 4.2:
• Support for variadic functions.
• Allocation of per-thread, stack-based memory using alloca.
• Indirect branches.

Source: the PTX ISA documentation in the CUDA 7 toolkit and the CUDA 7 Programming Guide.

Driver 347.12 is part of the CUDA 7 toolkit, with the same CUDA driver version (7.0.18) as 347.09. Branch 346 is for CUDA 7, while branch 343 is for CUDA 6.5.

A new library is included: "The cuSolver library is a high-level package based on the cuBLAS and cuSPARSE libraries. It combines three separate libraries under a single umbrella, each of which can be used independently or in concert with other toolkit libraries."

From the CUDA 7 Release Notes:

2. New Features

2.1. General CUDA
• Added a method to the CUDA Driver API, cuDevicePrimaryCtxRetain(), that allows a program to create (or to access, if it already exists) the same CUDA context for a GPU device as the one used by the CUDART (CUDA Runtime API) library. This context is referred to as the primary context, and this new method allows for sharing the primary context between CUDART and other threads, which can reduce the performance overhead of creating and maintaining multiple contexts per device.
• Unified the device enumeration for CUDA, NVML, and related tools. The variable CUDA_DEVICE_ORDER can have a value of FASTEST_FIRST (default) or PCI_BUS_ID.
• Instrumented NVML (NVIDIA Management Library) and the CUDA driver to ignore GPUs that have been made inaccessible via cgroups (control groups). This enables schedulers that rely on cgroups to enforce device access restrictions for their jobs. Job schedulers wanting to use cgroups for device restriction need CUDA and NVML to handle those restrictions in a graceful way.
• Implemented the Multi-Process Service (MPS), which enables concurrent execution of GPU tasks from multiple CPU processes within a single node, allowing multiple MPI ranks on a node to share a GPU.
• The Windows and Mac OS X installers are now also available as network installers. A network installer is much smaller than the traditional local installer and downloads only the components selected for installation.

2.2. CUDA Tools

2.2.1. CUDA Compiler
• Added support for GCC 4.9.
• Added support for the C++11 language dialect.
• On Mac OS X, libc++ is supported with XCode 5.x. The command-line option -Xcompiler -stdlib=libstdc++ is no longer needed when invoking NVCC; instead, NVCC uses the default library that Clang chooses on Mac OS X. Users are still able to choose between libc++ and libstdc++ by passing -Xcompiler -stdlib=libc++ or -Xcompiler -stdlib=libstdc++ to NVCC.
• The Runtime Compilation library (nvrtc) provides an API to compile CUDA C++ device source code at runtime. The resulting compiled PTX can be launched on a GPU using the CUDA Driver API. More details can be found in the libNVRTC User Guide.
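To make the nvrtc workflow in the last bullet more concrete, here is a minimal sketch of compiling a kernel held as a string at runtime and launching the resulting PTX through the Driver API. The kernel source, the saxpy name, and the launch configuration are illustrative assumptions rather than anything given in the release notes, and error checking is omitted for brevity.

```cpp
#include <nvrtc.h>
#include <cuda.h>
#include <vector>

// Illustrative device source held as a string and compiled at runtime with libNVRTC.
const char *kernelSrc =
    "extern \"C\" __global__ void saxpy(float a, float *x, float *y, int n) {\n"
    "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
    "    if (i < n) y[i] = a * x[i] + y[i];\n"
    "}\n";

int main() {
    // 1. Compile the CUDA C++ source to PTX at runtime.
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kernelSrc, "saxpy.cu", 0, NULL, NULL);
    const char *opts[] = { "--gpu-architecture=compute_30" };
    nvrtcCompileProgram(prog, 1, opts);
    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // 2. Load the PTX and launch the kernel through the CUDA Driver API.
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;   cuModuleLoadData(&mod, ptx.data());
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "saxpy");

    int n = 1 << 20;
    float a = 2.0f;
    CUdeviceptr x, y;  // data initialization omitted in this sketch
    cuMemAlloc(&x, n * sizeof(float));
    cuMemAlloc(&y, n * sizeof(float));
    void *args[] = { &a, &x, &y, &n };
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, NULL, args, NULL);
    cuCtxSynchronize();

    cuMemFree(x); cuMemFree(y);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```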
2.2.2. CUDA-GDB
• Starting with CUDA 7.0, GPU core dumps can be read by CUDA-GDB with the target cudacore ${gpucoredump} and target core ${cpucoredump} ${gpucoredump} commands.
• Enabled CUDA applications to generate a GPU core dump when an exception is hit on the GPU. The feature is supported on Windows, Mac OS X, and Linux desktops. (Android, L4T, and Vibrante support may come in the future.) On Windows, this feature is only supported in TCC mode. On Unix-like OSs (Linux, OS X, etc.), a CPU core dump is generated along with a GPU core dump.

2.2.3. CUDA-MEMCHECK
• Enabled the tracking and reporting of uninitialized global memory.

2.2.4. CUDA Profiler
• On supported chips (sm_30 and beyond), all hardware counters exposed by the CUDA profiling tools (nvprof, nvvp, and Nsight Eclipse Edition) can now be profiled from multiple applications at the same time.

2.2.5. Nsight Eclipse Edition
• Cross-compiling to the POWER8 target architecture using the GNU toolchain is now supported within the Nsight IDE.

2.2.6. NVIDIA Visual Profiler
• With GPU PC sampling, which is supported for devices with compute capability 5.2, the Visual Profiler shows stall causes for each source and assembly line. This helps in pinpointing latency bottlenecks in a GPU kernel at the source level.

2.3. CUDA Libraries

2.3.1. cuBLAS Library
• The batched LU solver routine cublas{T}getrsBatched has been added to cuBLAS. It takes the output of the batched factorization routines cublas{T}getrfBatched to compute the solution given the provided batch of right-hand-side matrices.
• A license is no longer required in order to use cuBLAS-XT with more than two GPUs.

2.3.2. cuFFT Library
• For CUDA 7.0, support for callback routines, invoked when cuFFT loads and/or stores data, no longer requires an evaluation license file.
• For CUDA 7.0, cuFFT multiple-GPU execution is supported on up to four GPUs, except for single 1D complex-to-complex transforms, which are supported on two or four GPUs.
• In CUDA 7.0, transform execution may be distributed to four GPUs with the same CUDA architecture. In addition, multiple-GPU support for two or four GPUs is no longer constrained to GPUs on a single board. Use of this functionality requires a free evaluation license file, which is available to registered developers via the cuFFT developer page.
• For CUDA 7.0, single complex-to-complex 2D and 3D transforms with dimensions that can be factored into primes less than or equal to 127 are supported on multiple GPUs. Single complex-to-complex 1D transforms on multiple GPUs continue to be limited to sizes that are powers of 2.

2.3.3. cuSOLVER Library
• CUDA 7.0 introduces cuSOLVER, a new library that is a collection of routines to solve linear systems and eigenvalue problems. It includes dense and sparse linear solvers and sparse refactorization.
• Enabled offloading dense linear algebra calls to the GPUs in a sparse direct solver.
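The notes only outline cuSOLVER's scope, so here is a rough sketch of how its dense LU path (cusolverDn) might be used to solve A·X = B. The helper name, single precision, column-major layout, and omission of error checking are all assumptions made for illustration.

```cpp
#include <cusolverDn.h>
#include <cuda_runtime.h>

// Solve A * X = B for a dense n-by-n single-precision matrix already resident
// on the device (column-major), using LU factorization from cusolverDn.
void denseLuSolve(float *dA, float *dB, int n, int nrhs) {
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    // Workspace query and allocation for the factorization.
    int lwork = 0;
    cusolverDnSgetrf_bufferSize(handle, n, n, dA, n, &lwork);
    float *dWork;  cudaMalloc(&dWork, lwork * sizeof(float));
    int *dPiv;     cudaMalloc(&dPiv, n * sizeof(int));
    int *dInfo;    cudaMalloc(&dInfo, sizeof(int));

    // LU factorization with partial pivoting, performed in place on dA.
    cusolverDnSgetrf(handle, n, n, dA, n, dWork, dPiv, dInfo);

    // Back-substitution: overwrite dB with the solution X.
    cusolverDnSgetrs(handle, CUBLAS_OP_N, n, nrhs, dA, n, dPiv, dB, n, dInfo);

    cudaFree(dWork); cudaFree(dPiv); cudaFree(dInfo);
    cusolverDnDestroy(handle);
}
```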
2.3.4. cuSPARSE Library
• Added a new cusparse<t>csrgemm2() routine, optimized for small matrices and operations C = a*A*B + b*D, where A, B, and D are CSR matrices.
• Added graph coloring.

2.3.5. CUDA Math Library
• Support for the 3D and 4D Euclidean norms and the 3D Euclidean reciprocal norm has been added to the math library.

2.3.6. Thrust Library
• Thrust version 1.8.0 introduces support for algorithm invocation from CUDA __device__ code, support for CUDA streams, and algorithm performance improvements (a brief usage sketch follows after this list). Users may now invoke Thrust algorithms from CUDA __device__ code, providing a parallel algorithms library to CUDA programmers authoring custom kernels, as well as allowing Thrust programmers to nest their algorithm calls within functors. The thrust::seq execution policy allows users to require sequential algorithm execution in the calling thread and makes a sequential algorithms library available to individual CUDA threads. The .on(stream) syntax allows users to request a CUDA stream for kernels launched during algorithm execution. Finally, new CUDA algorithm implementations provide substantial performance improvements.

2.4. CUDA Samples
• The CUDA Samples makefile x86_64=1 and ARMv7=1 options have been deprecated. Please use TARGET_ARCH to set the targeted build architecture instead. The CUDA Samples makefile GCC option has been deprecated. Please use HOST_COMPILER to set the host compiler instead.
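As a rough sketch of the two Thrust 1.8 invocation styles mentioned above (device-side calls with thrust::seq, and host-side calls on a user-supplied stream via .on(stream)); the segment layout and sizes are made up for the example:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>

// Device-side invocation (Thrust 1.8): one thread per block sorts its own
// segment sequentially using the thrust::seq execution policy.
__global__ void sortSegments(int *data, int segLen) {
    if (threadIdx.x == 0) {
        int *seg = data + blockIdx.x * segLen;
        thrust::sort(thrust::seq, seg, seg + segLen);
    }
}

int main() {
    const int numSegs = 256, segLen = 1024;
    thrust::device_vector<int> v(numSegs * segLen, 1);

    // Each block sorts one segment from within the kernel.
    sortSegments<<<numSegs, 32>>>(thrust::raw_pointer_cast(v.data()), segLen);

    // Host-side invocation on a user-supplied CUDA stream via .on(stream).
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    thrust::sort(thrust::cuda::par.on(stream), v.begin(), v.end());
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```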
---

*Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0*
R352 driver branch, CUDA 7.5 toolkit: http://devblogs.nvidia.com/parallelforall/new-features-cuda-7-5/

Taking place this week in Lille, France, is the 2015 International Conference on Machine Learning. NVIDIA also released the CUDA Deep Neural Network library (cuDNN 3) and DIGITS 2, higher-level neural network software for general scientists and researchers.

The CUDA 7.5 Release Notes also introduce PTX ISA Version 4.3 (section 1.3).
---

*skgiven · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0*
There might be something in the libraries of use here. It looks like the ACEMD app might be better supported on multiple cards (one app running over several cards, 2 or 4). Matt and Gianni would know.

FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help
---

*Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0*
I wonder if that means that GPU projects can be run when BOINC is installed as a service, at least if the appropriate changes are made to BOINC.
---

*skgiven · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0*
In theory, yes, but the GPUGrid app and BOINC might need to be updated first. Then you would need to reinstall BOINC as a service. It would obviously need some research, development, and testing before it could become the norm.

FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help
---

*Joined: 5 Jan 09 · Posts: 670 · Credit: 2,498,095,550 · RAC: 0*
> BOINC might need to be updated first.

On that point, you may have to wait longer for the development of BOINC, as its funding has been withdrawn and it now relies on volunteers.
---

*Joined: 2 Jan 09 · Posts: 303 · Credit: 7,321,800,090 · RAC: 330*
> BOINC might need to be updated first.

And all the main players have found other paying jobs. They have NOT quit providing BOINC support; they are just not doing it full time anymore.