https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
- Introduction
- GPGPU – Massive parallelism. Focus on data processing, not data caching or control flow. Hide memory access latencies through computation.
- CUDA – C++ Extension. Three key abstractions: thread groups, shared memories, and barrier synchronizations.
- Programming Model
- Kernel – C++ functions. Executed N times in parallel by N CUDA threads.
- __global__ declaration specifier.
- <<<…>>> execution configuration.
- threadIdx build in variable.
- Thread Hierarchy
- Grid of thread blocks of threads.
- Requirements:
- All threads of a block must run on the same processor core and share memory (L1) resources. A thread block may contain up to 1024 threads.
- Thread blocks are required to be executed independently.
- Hierarchy
- Threads are organized into a 1D/2D/3D block of threads.
- Blocks are organized into a 1D/2D/3D grid of thread blocks.
- Requirements:
- Build-in variables
- threadIdx: 3-component vector. threadIdx.x/y/z (inner – outer).
- blockIdx: 3-component vector.
- blockDim: 3-component vector.
- Execute configuration
- <<</*dim3 object for blocks*/, /*dim3 object for threads*/>>>
- Synchronization
- __synchthreads() all threads within a thread block will wait before any is allowed to proceed. (lightweight)
- Grid of thread blocks of threads.
- Memory Hierarchy
- Local memory: private to each thread
- Shared memory: private to each thread block
- Global memory: public to all threads
- Constant & Texture memory: Read-only. Persistent. Public to all.
- Heterogeneous Programming
- Compute: kernels on GPU; rest on CPU.
- Memory: separate.
- Host memory for CPU and device memory for GPU. Programmer manage memory (allocation, deallocation and transfer).
- Unified memory as managed memory. Shared. Coherrent.
- Compute Capability
- Version number. X.Y -> Main and minor.
- Kernel – C++ functions. Executed N times in parallel by N CUDA threads.
- Programming Interface
- Compilation with NVCC – NVCC is a compiler driver
- Compilation Workflow
- Offline Compilation – NVCC compiles the device code into binary code and/or PTX code. (NVRTC can dot it JIT). The host compiler compiles the host code into assembly.
- JIT Compilation – Devide driver compiles the PTX code into assembly. Compute cache caches the compiled code.
- Binary compatibility – `-code=` specify compatibility. X.z is compatible with X.y when z >= y.
- Architecture compatibility – `-arch=` specicify architecture.
- Application compatibility
- Dynamic dispatch.
- __CUDA_ARCH__ marco.
- C++ compatibility – Subset is supported.
- 64-Bit compatibility – Device and host code must be the same, 32-bit or 64-bit.
- Compilation Workflow
- CUDA Runtime – implemented in `cudart` library
- Initialization
- Initializes the first time a runtime function is called. (Overhead in Profiling)
- Creates a CUDA context for each device, shared by all the host threads.
- Device Memory
- Allocation: Linear memory (C fashion continuous memory) / CUDA arrays (opaque)
- Management:
- 1D Arrays: cudaMalloc, cudaMemcpy, and cudaFree on host code.
- 2D / 3D Arrays: cudaMallocPitch / cudaMalloc3D; cudaMemcpy2D / cudeMemcpy3D. (Allocation properly padded to meet alignment request)
- Query: cudaGetSymbolAddress(); cudaGetSymbolSize().
- Shared Memory
- Allocated using __shared__ memory space specifier. Act like a scratchpad.
- Page-Locked Host Memory
- Usage: cudaHostAlloc() and cudaFreeHost() / cudaHostRegister()
- Benifites Concurrent copy; Avoid copy; High bandwith
- Portable Memory: passing flag cudaHostAllocPortable to cudaHostAlloc or cudaHostRegisterPortable to cudaHostRegister.
- Write-Combining Memory:
- Usage: passing flag *WriteCombined to cudaHostAlloc.
- Pros:
- free Host L1 and L2 cache.
- not snopped PCI Express bus. improve performance up to 40%.
- Mapped Memory:
- Usage: passing flag *Mapped to cudaHostAlloc / cudaHostRegister.
- Pros:
- no need to alloc and copy in device
- no need to use streams to overlap data transfer.
- Cons:
- doesn’t have same bandwith
- synchronization required to avoid RAW, WAR, WAW hazards.
- Asynchronous Concurrent Execution
- Concurrent execution between host and device
- Concurrent kernel execution: concurrentKernels property
- Overlap of data transfer and kernel execution: asyncEngineCount
- Concurret data transfer: asyncEngineCount
- Streams
- Definition:
- Sequence of commands execute in order.
- Different streams may execute out-of-order respect to each other.
- Similar to C++ multi-threading. But puts commands sequentially.
- Creation and Destruction: cudaStreamCreate / cudaStreamDestory.
- Default stream:
- –default-stream per-thread: regular stream. each host thread has its own stream.
- –default-stream legacy: NULL stream. each device has a single null stream to be used for all host threads.
- Explicit Synchronization:
- cudaDeviceSynchronize() wait until all streams of all host treads
- cudaStreamSynchroniza() wait until a single stream. synchronize the host with a specific stream.
- cudaStreamWaitEvent() wait for given event.
- cudaStreamQuery() if all preceding commands are completed.
- UNFINISHED
- Definition:
- Initialization
- Compilation with NVCC – NVCC is a compiler driver
Leave a comment