Reading Notes – 9

StreamRL

  • Reinforcement Learning Framework
    • Colocated architecture: verL, ReaL, RLHFuse
      • Pros:
        • Avoid resource idleness
      • Cons:
        • Resource coupling
          • Generation time plateaus sooner than training time as resources scale up
          • Both stages are forced onto the same resource quantity and hardware type
            • Cannot choose the most suitable or cost-effective hardware for each stage
            • Cannot leverage cross-datacenter heterogeneous resource pools
        • CPU-based context switching between the two stages
    • Disaggregated architecture: OpenRLHF, NeMo
      • Pros:
        • Flexible and efficient resource allocation
        • Suitable hardware
        • P2P data transfer between stages
      • Cons:
        • Pipeline bubbles <- Stage dependencies
          • Previous work
            • Mini-batch pipelining
            • Asynchronous pipelining
          • This work
            • Stream generation
        • Skewness bubbles <- long-tail output lengths
          • Previous work
            • Temporarily store partially generated samples in replay buffer
          • This work
            • Skewness-aware dispatching and scheduling
  • Reinforcement Learning for LLMs
    • PPO, GRPO
    • Two stages per iteration: generation, then training (which includes reward computation)
  • StreamRL Design
    • Stream Generation Service (SGS) + Trainer
    • Stream
      • SGS sends each completed sample to the trainer in a streaming fashion
        • Rather than waiting until all samples in the batch are generated
      • Improvement 1
        • Previous
          • Mini-batch pipelining. Batched at generator.
        • This
          • Dynamic-batch pipelining; batching happens at the trainer (first code sketch below)
      • Improvement 2
        • Previous
          • One-step asynchronous pipelining (off-policy)
          • Global synchronization to transmit weights
        • This
          • Fully asynchronous pipelining.
          • Weight transmission overlaps with training for the next iteration.
    • Resource Allocation
      • Profiler-based resource allocation
      • Dynamic resource adjustment
    • Long Tail Issue
      • Output length ranker model
      • Skewness-aware scheduling
        • Run long-tail samples with a smaller batch size
        • Use an online-trained ranker model to predict the relative output lengths of samples
        • Longest-Processing-Time-first (LPT) scheduling (second code sketch below)
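
A toy Python sketch of the stream generation and dynamic-batch pipelining described above (my own illustration, not StreamRL's code; the queue-based design and all names are assumptions): the generation service pushes each finished sample into a queue the moment it completes, and the trainer builds dynamically sized batches from whatever has arrived, so training overlaps with ongoing generation instead of waiting for the full rollout batch.

import queue
import random
import threading
import time

sample_queue: "queue.Queue[dict]" = queue.Queue()
NUM_SAMPLES = 64          # rollout batch size for one RL iteration
MIN_TRAIN_BATCH = 8       # smallest batch the trainer will accept

def generation_service():
    # Stands in for the Stream Generation Service (SGS).
    for i in range(NUM_SAMPLES):
        time.sleep(random.uniform(0.01, 0.05))    # skewed per-sample decoding time
        sample_queue.put({"prompt_id": i, "tokens": [], "reward": 0.0})
    sample_queue.put(None)                         # end-of-iteration marker

def train_step(batch):
    print(f"training on a dynamic batch of {len(batch)} samples")

def trainer():
    # Consumes dynamically sized batches as soon as enough samples arrive.
    done, batch = False, []
    while not done:
        item = sample_queue.get()
        if item is None:
            done = True
        else:
            batch.append(item)
        if batch and (len(batch) >= MIN_TRAIN_BATCH or done):
            train_step(batch)                      # overlaps with ongoing generation
            batch = []

threading.Thread(target=generation_service, daemon=True).start()
trainer()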
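
And a small sketch of skewness-aware dispatching with Longest-Processing-Time-first scheduling (again my own illustration; the predicted lengths are assumed to come from the online-trained ranker mentioned above, and long-tail samples could additionally be decoded with a smaller batch size):

import heapq

def lpt_dispatch(predicted_lengths, num_instances):
    # Assign sample indices to generation instances, longest predicted output first,
    # always placing the next sample on the currently least-loaded instance (classic LPT).
    heap = [(0, inst) for inst in range(num_instances)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_instances)]
    order = sorted(range(len(predicted_lengths)), key=lambda i: -predicted_lengths[i])
    for i in order:
        load, inst = heapq.heappop(heap)
        assignment[inst].append(i)
        heapq.heappush(heap, (load + predicted_lengths[i], inst))
    return assignment

# Example: the long-tail samples (4000, 3500) are spread across instances first.
print(lpt_dispatch([120, 950, 200, 180, 4000, 300, 250, 3500], num_instances=2))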

AReaL

  • Fully asynchronous RL system
    • Interruptible rollout workers
    • Dynamic batching for variable-length outputs
    • Parallel reward service
  • Components
    • Interruptible rollout workers
      • On a weight update: interrupt decoding, load the new weights, then discard the stale KV cache and recompute it
    • Reward service
    • Trainer workers
      • Continuously sample from the replay buffer; each sample is used only once
    • Rollout controller
      • Trajectories, along with their rewards, are stored in the replay buffer
  • Algorithmic Challenges
    • Staleness-Aware Training
      • Rejects new generation requests that would violate the staleness (policy-version lag) bound
    • Decoupled PPO Objective
      • Decouples the behavior policy that generated the rollouts from the proximal policy used for clipping (code sketch below)
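
A rough PyTorch sketch of what a decoupled PPO-style loss looks like (my reading of the idea, not AReaL's code): an importance weight corrects for the possibly stale behavior policy that generated the rollouts, while clipping is applied against the recent proximal policy.

import torch

def decoupled_ppo_loss(logp_new, logp_prox, logp_behav, advantages, clip_eps=0.2):
    # logp_new:   log pi_theta(a|s) for the policy being optimized
    # logp_prox:  log pi_proximal(a|s) for the recent policy used as the trust-region anchor
    # logp_behav: log pi_behavior(a|s) for the (stale) policy that generated the data
    iw = torch.exp(logp_prox - logp_behav).detach()            # off-policy correction
    ratio = torch.exp(logp_new - logp_prox)                    # clipped against the proximal policy
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -(iw * torch.minimum(unclipped, clipped)).mean()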

YOCO

  • Transformer Architecture
    • Encoder-only: BERT
    • Encoder-decoder: T5
    • Decoder-only: GPT
    • Decoder-decoder: YOCO
  • YOCO
    • Pros
      • Cache once. Low memory consumption.
      • Early exit from the prefill stage: the global KV cache depends only on the self-decoder, so the cross-decoder can be skipped during prefill
      • Efficient distributed long context training.
      • Gated retention.
    • Architecture
      • Self-decoder: Efficient attention
        • Sliding Window Attention
        • Gated Retention
          • Parallel representation: fully parallel across the sequence, most computation
          • Recurrent representation: fully sequential, cheapest per step
          • Chunkwise recurrent representation: balances parallelism and recurrence
      • Cross-decoder: Global attention
    • KV Cache
      • Self-decoder layers each keep only a small local KV cache (sliding window / retention state), totaling O(L·D) for L layers
      • Cross-decoder layers share the single global KV cache produced from the self-decoder output, O(N·D) for sequence length N (code sketch below)
    • Interleaved attention and retention delivers good model quality
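
A heavily simplified PyTorch sketch of the decoder-decoder layout (my own illustration, not the paper's code: single-head attention, no normalization, and gated retention replaced by plain sliding-window attention). The point it shows is that the global K/V is computed once from the self-decoder output and shared by every cross-decoder layer.

import torch
import torch.nn as nn

class SelfDecoderLayer(nn.Module):
    # Efficient-attention layer; approximated here with causal sliding-window attention.
    def __init__(self, d_model, window):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.window = window

    def forward(self, x):
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        idx = torch.arange(N, device=x.device)
        # Each token attends to at most the previous `window` tokens (local KV only).
        mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < self.window)
        attn = (q @ k.transpose(-1, -2)) / D ** 0.5
        attn = attn.masked_fill(~mask, float("-inf")).softmax(dim=-1)
        return x + self.out(attn @ v)

class CrossDecoderLayer(nn.Module):
    # Global-attention layer that reuses the single shared global KV cache.
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, k_global, v_global):
        B, N, D = x.shape
        causal = torch.tril(torch.ones(N, N, dtype=torch.bool, device=x.device))
        attn = (self.q(x) @ k_global.transpose(-1, -2)) / D ** 0.5
        attn = attn.masked_fill(~causal, float("-inf")).softmax(dim=-1)
        return x + self.out(attn @ v_global)

class YOCOSketch(nn.Module):
    def __init__(self, d_model=256, n_layers=8, window=128):
        super().__init__()
        half = n_layers // 2
        self.self_decoder = nn.ModuleList([SelfDecoderLayer(d_model, window) for _ in range(half)])
        self.kv_proj = nn.Linear(d_model, 2 * d_model)    # produces the global K/V once
        self.cross_decoder = nn.ModuleList([CrossDecoderLayer(d_model) for _ in range(half)])

    def forward(self, x):
        for layer in self.self_decoder:
            x = layer(x)
        # "Cache once": one global K/V from the self-decoder output, O(N*D) in total,
        # shared by all cross-decoder layers instead of a per-layer cache.
        k_global, v_global = self.kv_proj(x).chunk(2, dim=-1)
        for layer in self.cross_decoder:
            x = layer(x, k_global, v_global)
        return x

print(YOCOSketch()(torch.randn(1, 64, 256)).shape)    # torch.Size([1, 64, 256])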
