Efficient Compute for ML

Based on my observation, data and compute are the biggest challenges of machine learning engineering.

  • Data determines the ultimate quality of the model in theory.
  • Compute determines the ultimate quality of the model in practice.

Aside from all the modeling research and advancements, the recent success of large language models relies heavily on applying a massive amount of compute to a massive amount of data to achieve the best model quality.

From the compute perspective, the challenges and popular solutions can be decomposed into different levels.

  • Reduce computation workload
    • Reduce parameters to be trained
      • Selective: Tune part of the layers and keep the others frozen (see the parameter-freezing sketch after this list).
      • Additive: Tune newly added layers and keep the original model frozen.
      • LoRA: Low-rank adaptation matrices added to the frozen weights for fine-tuning (see the LoRA sketch after this list).
      • Prompt tuning: Fine-tune a soft prompt, learning its embedding vectors while keeping the model frozen (see the prompt-tuning sketch after this list).
    • Reduce parameter size
      • Keep model architecture
        • Pruning: Skip compute on zeroed values with structured/unstructured sparsity (see the pruning sketch after this list).
      • Change model architecture
        • Knowledge distillation: Distill knowledge from a large teacher model into a small student model (see the distillation-loss sketch after this list).
        • Neural architecture search: Search for the model architecture with the best quality under compute constraints.
    • Reduce parameter precision
      • Mixed precision: Use reduced-precision floating-point representations (FP16/BF16/FP8/…) to accelerate compute/IO with dedicated tensor cores (see the autocast sketch after this list).
      • Quantization: Use integer representations (INT8/INT4/INT2/…) with an affine transformation to accelerate compute/IO with dedicated tensor cores (see the affine-quantization sketch after this list).
  • Improve computation efficiency
    • Communication efficiency
      • Inter-datacenter communication: Goes through the network, similar to other web application backend development.
      • Intra-datacenter, inter-node communication: Hardware solutions provided by vendors. For example, InfiniBand is commonly used for inter-node communication, providing up to 400 Gb/s of bandwidth per port.
      • Intra-node communication: Hardware solutions provided by vendors. For example, Nvidia provides NVLink + NVSwitch for fast communication across multiple GPUs inside the same node.
      • Host-device communication: Software solutions provided by frameworks. For example, Google provides asynchronous device-host communication in the XLA/TPU runtime to overlap it with computation.
      • Intra-device communication: Mostly handled by compilers or kernel engineers. The communication goes through the different levels of the memory hierarchy or through registers.
    • Compute & memory efficiency
      • Reduce online computation
        • Graph optimizations: Usually completed in compilers or runtimes. For example, constant folding and dead code elimination.
      • Improve parallelism
        • Model level parallelism: Implements parallelism at the model level. It involves not only computation parallelism, but also memory efficiency and intra-node/inter-node communication.
          • Data parallelism (including ZeRO-style sharded data parallelism).
          • Model parallelism (tensor parallelism, pipeline parallelism; see the tensor-parallel sketch after this list).
          • FlexFlow.
        • Software level parallelism: Implements parallelism at the software level. It involves not only computation parallelism, but also memory efficiency.
          • Programming language: CUDA. Triton. OpenCL. C++ threads.
          • Template expressions (Based on PLs): CUTLASS. Blaze. Eigen.
          • Precompiled libraries (Based on PLs): CuDNN. CuBLAS. CuSparse. MKL.
          • Compilers (Horizontal fusions. Tree reductions. etc): XLA. TensorRT. Triton.
        • Hardware level parallelism: Executes the parallelism defined in software.
          • GPU example:
            • Thread blocks: Scheduled across SMs; they run in parallel up to SM capacity and sequentially beyond it.
            • Warps: Interleaved by the warp schedulers on the same SM.
            • Threads: Threads within the same warp execute in lockstep (SIMT).
            • SIMD: Vector instructions process multiple data elements within the same thread.
            • Instructions: Intrinsics utilizing tensor cores on the xPU improve compute efficiency.
      • Improve memory efficiency
        • Model level memory efficiency:
          • The Reduce parameter size and Reduce parameter precision techniques apply here, as they reduce the total amount of data to be transferred.
          • The Improve communication efficiency techniques apply here, as they reduce the latency of data transfers.
        • Software level memory efficiency:
          • Vertical fusion techniques apply here, as they eliminate the need to transfer data by keeping it in registers or close to them. Vertical fusion is mostly done by keeping intermediates in caches (see the fusion sketch after this list).
            • Pipeline parallelism and some of the collective ops (e.g. reduce_scatter) defined in the Model level parallelism section are essentially doing vertical fusion.
            • The techniques defined in the Software level parallelism section can also perform fusion.
              • Programming language: The developer explicitly writes manual fusions.
              • Template expressions (based on PLs): The developer implicitly defines manual fusions with the help of C++ templates.
              • Precompiled libraries (based on PLs): Pre-defined and pre-implemented fusions.
              • Compilers (vertical fusion): Automatic fusion through pattern matching.
          • Compute/memory overlap techniques apply here, as they hide the latency of data transfers by overlapping them with compute. They can be applied at multiple levels in different ways (see the stream-overlap sketch after this list). Examples are:
            • Asynchronous I/O (d2d, h2d, d2h) with the help of compiler/runtime scheduling.
            • Software pipelining (SW prefetching) with the help of compiler optimizations or manual developer optimization.
            • HW prefetching using instructions/modifiers provided by the underlying hardware. It can also be done automatically at the hardware level.
        • Hardware level memory efficiency:
          • Caching avoids reloading data from memory and improves efficiency. This refers to transparent caching that isn’t managed by developers (i.e., not shared memory, but the L1/L2 caches).
          • Hardware-specific optimization tricks can make reads/writes faster. Examples are:
            • Memory coalescing: Within each warp, have multiple threads access contiguous global memory at the same time so that the hardware can coalesce the memory requests.
            • Memory swizzling and bank conflicts: Within each warp, avoid having multiple threads access the same shared memory bank at the same time by swizzling the data layout.
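
Below are the minimal sketches referenced in the list above. They are illustrative PyTorch snippets with hypothetical model and layer names, not production implementations.

Parameter-freezing sketch (Selective/Additive fine-tuning): freeze the pretrained parameters and train only a chosen subset; additive tuning would instead append new trainable layers while keeping everything original frozen. The 3-layer model here is a stand-in for a real pretrained backbone.

```python
import torch
import torch.nn as nn

# Hypothetical backbone: in practice this would be a pretrained model.
model = nn.Sequential(
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),          # task head
)

# Selective tuning: freeze everything, then unfreeze only the last layer.
for p in model.parameters():
    p.requires_grad_(False)
for p in model[-1].parameters():
    p.requires_grad_(True)

# Only the trainable parameters go to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```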
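
LoRA sketch: a low-rank adapter wrapped around a frozen linear layer, computing y = W0·x + (alpha/r)·B(A·x). The class and parameter names are mine, not taken from any particular LoRA library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W0 x + (alpha / r) * B (A x), with W0 frozen and A, B trainable."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # original weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: starts as identity delta
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(1024, 1024))
y = layer(torch.randn(4, 1024))                     # only A and B receive gradients
```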
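
Prompt-tuning sketch: a learnable soft prompt prepended to the token embeddings that are fed into a frozen model. The embedding dimensions and the model interface are hypothetical.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learn n_prompt virtual token embeddings; the backbone stays frozen."""
    def __init__(self, n_prompt: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)

    def forward(self, token_embeddings):            # (batch, seq, d_model)
        batch = token_embeddings.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeddings], dim=1)

soft_prompt = SoftPrompt(n_prompt=20, d_model=768)
embeds = torch.randn(2, 16, 768)                    # stand-in for embed(input_ids)
extended = soft_prompt(embeds)                      # (2, 36, 768), fed to the frozen model
```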
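
Pruning sketch: unstructured magnitude pruning with torch.nn.utils.prune. Note that zeroed weights only turn into real speedups when the kernels/hardware exploit the sparsity pattern (e.g. 2:4 structured sparsity on recent tensor cores).

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)

# Zero out the 50% of weights with the smallest L1 magnitude (unstructured sparsity).
prune.l1_unstructured(layer, name="weight", amount=0.5)
print(float((layer.weight == 0).float().mean()))    # ~0.5 sparsity

# Make the pruning permanent (removes the mask re-parametrization).
prune.remove(layer, "weight")
```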
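
Distillation-loss sketch: a standard blend of hard-label cross-entropy and a temperature-softened KL term between teacher and student logits. The temperature and weighting values are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Hard-label loss against the ground truth.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label loss: match the teacher's temperature-softened distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1.0 - alpha) * kd

student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits = torch.randn(8, 100)                 # from the frozen teacher, no grad
labels = torch.randint(0, 100, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```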
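
Autocast sketch (mixed precision): run the forward pass under BF16 autocast while master weights stay in FP32. Gradient scaling would additionally be needed for FP16; the model and data here are placeholders.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(32, 1024, device=device)
target = torch.randn(32, 1024, device=device)

# Matmuls run in BF16 (on tensor cores when available); weights stay in FP32.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), target)

loss.backward()
optimizer.step()
optimizer.zero_grad()
```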
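
Affine-quantization sketch: the affine transformation maps real values to integers via q = round(x / scale) + zero_point. This per-tensor INT8 version ignores the per-channel scaling and calibration that production stacks use.

```python
import torch

def affine_quantize_int8(x: torch.Tensor):
    qmin, qmax = -128, 127
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = (qmin - torch.round(x_min / scale)).clamp(qmin, qmax).int()
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.float() - zero_point.float()) * scale

x = torch.randn(4, 4)
q, scale, zp = affine_quantize_int8(x)
x_hat = dequantize(q, scale, zp)
print((x - x_hat).abs().max())    # quantization error, roughly bounded by scale/2
```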
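
Tensor-parallel sketch: column-wise (output-dimension) sharding of a linear layer. Each shard computes a slice of the output independently and the slices are gathered; here the two "devices" are simulated with two weight chunks on one device.

```python
import torch

# Full layer: y = x @ W.t(), with W of shape (out_features, in_features).
x = torch.randn(8, 512)
W = torch.randn(1024, 512)

# Column-wise sharding of the output dimension across two "devices".
W0, W1 = W.chunk(2, dim=0)          # each shard: (512, 512)

# Each device computes its slice of the output independently...
y0 = x @ W0.t()
y1 = x @ W1.t()

# ...and an all-gather (here a simple concat) reassembles the full output.
y = torch.cat([y0, y1], dim=-1)
assert torch.allclose(y, x @ W.t(), atol=1e-5)
```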
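
Fusion sketch: in eager mode each elementwise op below materializes an intermediate tensor in memory, while a graph compiler such as torch.compile (or XLA) can typically fuse the whole chain into a single kernel that keeps intermediates in registers. Whether fusion actually happens depends on the backend.

```python
import torch

def gelu_bias(x, bias):
    # A chain of elementwise ops; eager execution writes intermediates to memory.
    y = x + bias
    return y * 0.5 * (1.0 + torch.erf(y / 1.41421356))

x = torch.randn(4096, 4096)
bias = torch.randn(4096)

eager_out = gelu_bias(x, bias)

# A graph compiler can pattern-match the chain and emit one fused kernel.
fused_fn = torch.compile(gelu_bias)
fused_out = fused_fn(x, bias)

print((eager_out - fused_out).abs().max())   # numerically equivalent up to rounding
```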
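
Stream-overlap sketch (compute/memory overlap): prefetch the next batch with an asynchronous host-to-device copy on a side CUDA stream while the default stream keeps computing. Requires a CUDA device and pinned host memory; the stream handling is simplified.

```python
import torch

assert torch.cuda.is_available()
device = torch.device("cuda")

copy_stream = torch.cuda.Stream()

# Pinned host memory allows truly asynchronous H2D transfers.
next_batch_cpu = torch.randn(1024, 1024, pin_memory=True)
current_batch = torch.randn(1024, 1024, device=device)
weight = torch.randn(1024, 1024, device=device)

# Launch the copy of the next batch on the side stream...
with torch.cuda.stream(copy_stream):
    next_batch_gpu = next_batch_cpu.to(device, non_blocking=True)

# ...while the default stream keeps computing on the current batch.
out = current_batch @ weight

# Make the default stream wait for the copy before using the prefetched batch.
torch.cuda.current_stream().wait_stream(copy_stream)
out_next = next_batch_gpu @ weight
```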
