CUDA Programming Note

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

Introduction
1. GPGPU – Massive parallelism. Focus on data processing, not data caching or control flow. Hide memory access latencies through computation.
2. CUDA – C++ Extension. Three key abstractions: thread groups, shared memories, and barrier synchronizations.
Programming Model
1. Kernel – C++ functions. Executed N times in parallel by N CUDA threads.
  1. __global__ declaration specifier.
  2. <<<…>>> execution configuration.
  3. threadIdx build in variable.
2. Thread Hierarchy
  1. Grid of thread blocks of threads.
    1. Requirements:
      1. All threads of a block must run on the same processor core and share memory (L1) resources. A thread block may contain up to 1024 threads.
      2. Thread blocks are required to be executed independently.
    2. Hierarchy
      1. Threads are organized into a 1D/2D/3D block of threads.
      2. Blocks are organized into a 1D/2D/3D grid of thread blocks.
  2. Build-in variables
    1. threadIdx: 3-component vector. threadIdx.x/y/z (inner – outer).
    2. blockIdx: 3-component vector.
    3. blockDim: 3-component vector.
  3. Execute configuration
    1. <<</*dim3 object for blocks*/, /*dim3 object for threads*/>>>
  4. Synchronization
    1. __synchthreads() all threads within a thread block will wait before any is allowed to proceed. (lightweight)
3. Memory Hierarchy
  1. Local memory: private to each thread
  2. Shared memory: private to each thread block
  3. Global memory: public to all threads
  4. Constant & Texture memory: Read-only. Persistent. Public to all.
4. Heterogeneous Programming
  1. Compute: kernels on GPU; rest on CPU.
  2. Memory: separate.
    1. Host memory for CPU and device memory for GPU. Programmer manage memory (allocation, deallocation and transfer).
    2. Unified memory as managed memory. Shared. Coherrent.
5. Compute Capability
  1. Version number. X.Y -> Main and minor.
Programming Interface
1. Compilation with NVCC – NVCC is a compiler driver
  1. Compilation Workflow
    1. Offline Compilation – NVCC compiles the device code into binary code and/or PTX code. (NVRTC can dot it JIT). The host compiler compiles the host code into assembly.
    2. JIT Compilation – Devide driver compiles the PTX code into assembly. Compute cache caches the compiled code.
  2. Binary compatibility – `-code=` specify compatibility. X.z is compatible with X.y when z >= y.
  3. Architecture compatibility – `-arch=` specicify architecture.
  4. Application compatibility
    1. Dynamic dispatch.
    2. __CUDA_ARCH__ marco.
  5. C++ compatibility – Subset is supported.
  6. 64-Bit compatibility – Device and host code must be the same, 32-bit or 64-bit.
2. CUDA Runtime – implemented in `cudart` library
  1. Initialization
    1. Initializes the first time a runtime function is called. (Overhead in Profiling)
    2. Creates a CUDA context for each device, shared by all the host threads.
  2. Device Memory
    1. Allocation: Linear memory (C fashion continuous memory) / CUDA arrays (opaque)
    2. Management:
      1. 1D Arrays: cudaMalloc, cudaMemcpy, and cudaFree on host code.
      2. 2D / 3D Arrays: cudaMallocPitch / cudaMalloc3D; cudaMemcpy2D / cudeMemcpy3D. (Allocation properly padded to meet alignment request)
    3. Query: cudaGetSymbolAddress(); cudaGetSymbolSize().
  3. Shared Memory
    1. Allocated using __shared__ memory space specifier. Act like a scratchpad.
  4. Page-Locked Host Memory
    1. Usage: cudaHostAlloc() and cudaFreeHost() / cudaHostRegister()
    2. Benifites Concurrent copy; Avoid copy; High bandwith
    3. Portable Memory: passing flag cudaHostAllocPortable to cudaHostAlloc or cudaHostRegisterPortable to cudaHostRegister.
    4. Write-Combining Memory:
      1. Usage: passing flag *WriteCombined to cudaHostAlloc.
      2. Pros:
        
        free Host L1 and L2 cache.
        
        not snopped PCI Express bus. improve performance up to 40%.
    5. Mapped Memory:
      1. Usage: passing flag *Mapped to cudaHostAlloc / cudaHostRegister.
      2. Pros:
        
        no need to alloc and copy in device
        
        no need to use streams to overlap data transfer.
      3. Cons:
        
        doesn’t have same bandwith
        
        synchronization required to avoid RAW, WAR, WAW hazards.
  5. Asynchronous Concurrent Execution
    1. Concurrent execution between host and device
    2. Concurrent kernel execution: concurrentKernels property
    3. Overlap of data transfer and kernel execution: asyncEngineCount
    4. Concurret data transfer: asyncEngineCount
    5. Streams
      1. Definition:
        
        Sequence of commands execute in order.
        
        Different streams may execute out-of-order respect to each other.
        
        Similar to C++ multi-threading. But puts commands sequentially.
      2. Creation and Destruction: cudaStreamCreate / cudaStreamDestory.
      3. Default stream:
        
        –default-stream per-thread: regular stream. each host thread has its own stream.
        
        –default-stream legacy: NULL stream. each device has a single null stream to be used for all host threads.
      4. Explicit Synchronization:
        
        cudaDeviceSynchronize() wait until all streams of all host treads
        
        cudaStreamSynchroniza() wait until a single stream. synchronize the host with a specific stream.
        
        cudaStreamWaitEvent() wait for given event.
        
        cudaStreamQuery() if all preceding commands are completed.
      5. UNFINISHED

levendlee

CUDA Programming Note

Leave a comment Cancel reply

CUDA Programming Note

Share this:

Leave a comment Cancel reply