Reading Notes – 9

StreamRL

  • Reinforcement Learning Framework
    • Colocated architecture: verL, ReaL, RLHFuse
      • Pros:
        • Avoid resource idleness
      • Cons:
        • Resource coupling
          • Generation time plateaus sooner than training time as resources scale up
          • Both stages are forced onto the same resource quantity and hardware type
            • Cannot choose the most suitable or cost-effective hardware for each stage
            • Cannot leverage cross-datacenter heterogeneous resource pools
        • CPU-based context switching between the two stages
    • Disaggregated architecture: OpenRLHF, NeMo
      • Pros:
        • Flexible and efficient resource allocation
        • Suitable hardware
        • P2P data transfer between stages
      • Cons:
        • Pipeline bubbles <- Stage dependencies
          • Previous work
            • Mini-batch pipelining
            • Asynchronous pipelining
          • This work
            • Stream generation
        • Skewness bubbles <- long-tail output lengths
          • Previous work
            • Temporarily store partially generated samples in replay buffer
          • This work
            • Skewness-aware dispatching and scheduling
  • Reinforcement Learning for LLMs
    • PPO, GRPO
    • Two stages per iteration: generation, then training (which includes reward computation)
  • StreamRL Design
    • Stream Generation Service (SGS) + Trainer
    • Stream
      • SGS sends each completed sample to the trainer in a streaming fashion
        • Rather than waiting until all samples in the batch are generated
      • Improvement 1
        • Previous
          • Mini-batch pipelining. Batched at generator.
        • This
          • Dynamic-batch pipelining; batching happens at the trainer (first code sketch below)
      • Improvement 2
        • Previous
          • One-step asynchronous pipelining (off-policy)
          • Global synchronization to transmit weights
        • This
          • Fully asynchronous pipelining.
          • Weight transmission overlaps with training for the next iteration.
    • Resource Allocation
      • Profiler-based resource allocation
      • Dynamic resource adjustment
    • Long Tail Issue
      • Output length ranker model
      • Skewness-aware scheduling
        • Run long-tail samples with a smaller batch size
        • Use an online-trained ranker model to predict the relative output lengths of samples
        • Longest-Processing-Time-first (LPT) scheduling (second code sketch below)
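
A toy Python sketch of the stream generation and dynamic-batch pipelining described above (my own illustration, not StreamRL's code; the queue-based design and all names are assumptions): the generation service pushes each finished sample into a queue the moment it completes, and the trainer builds dynamically sized batches from whatever has arrived, so training overlaps with ongoing generation instead of waiting for the full rollout batch.

import queue
import random
import threading
import time

sample_queue: "queue.Queue[dict]" = queue.Queue()
NUM_SAMPLES = 64          # rollout batch size for one RL iteration
MIN_TRAIN_BATCH = 8       # smallest batch the trainer will accept

def generation_service():
    # Stands in for the Stream Generation Service (SGS).
    for i in range(NUM_SAMPLES):
        time.sleep(random.uniform(0.01, 0.05))    # skewed per-sample decoding time
        sample_queue.put({"prompt_id": i, "tokens": [], "reward": 0.0})
    sample_queue.put(None)                         # end-of-iteration marker

def train_step(batch):
    print(f"training on a dynamic batch of {len(batch)} samples")

def trainer():
    # Consumes dynamically sized batches as soon as enough samples arrive.
    done, batch = False, []
    while not done:
        item = sample_queue.get()
        if item is None:
            done = True
        else:
            batch.append(item)
        if batch and (len(batch) >= MIN_TRAIN_BATCH or done):
            train_step(batch)                      # overlaps with ongoing generation
            batch = []

threading.Thread(target=generation_service, daemon=True).start()
trainer()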
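
And a small sketch of skewness-aware dispatching with Longest-Processing-Time-first scheduling (again my own illustration; the predicted lengths are assumed to come from the online-trained ranker mentioned above, and long-tail samples could additionally be decoded with a smaller batch size):

import heapq

def lpt_dispatch(predicted_lengths, num_instances):
    # Assign sample indices to generation instances, longest predicted output first,
    # always placing the next sample on the currently least-loaded instance (classic LPT).
    heap = [(0, inst) for inst in range(num_instances)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_instances)]
    order = sorted(range(len(predicted_lengths)), key=lambda i: -predicted_lengths[i])
    for i in order:
        load, inst = heapq.heappop(heap)
        assignment[inst].append(i)
        heapq.heappush(heap, (load + predicted_lengths[i], inst))
    return assignment

# Example: the long-tail samples (4000, 3500) are spread across instances first.
print(lpt_dispatch([120, 950, 200, 180, 4000, 300, 250, 3500], num_instances=2))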

AReaL

  • Fully asynchronous RL system
    • Interruptible rollout workers
    • Dynamic batching for variable-length outputs
    • Parallel reward service
  • Components
    • Interruptible rollout workers
      • On a weight update: interrupt decoding, load the new weights, then discard the stale KV cache and recompute it
    • Reward service
    • Trainer workers
      • Continuously sample from the replay buffer; each sample is used only once
    • Rollout controller
      • Trajectories, along with their rewards, are stored in the replay buffer
  • Algorithmic Challenges
    • Staleness-Aware Training
      • Rejects new generation requests that would violate the staleness (policy-version lag) bound
    • Decoupled PPO Objective
      • Decouples the behavior policy that generated the rollouts from the proximal policy used for clipping (code sketch below)
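
A rough PyTorch sketch of what a decoupled PPO-style loss looks like (my reading of the idea, not AReaL's code): an importance weight corrects for the possibly stale behavior policy that generated the rollouts, while clipping is applied against the recent proximal policy.

import torch

def decoupled_ppo_loss(logp_new, logp_prox, logp_behav, advantages, clip_eps=0.2):
    # logp_new:   log pi_theta(a|s) for the policy being optimized
    # logp_prox:  log pi_proximal(a|s) for the recent policy used as the trust-region anchor
    # logp_behav: log pi_behavior(a|s) for the (stale) policy that generated the data
    iw = torch.exp(logp_prox - logp_behav).detach()            # off-policy correction
    ratio = torch.exp(logp_new - logp_prox)                    # clipped against the proximal policy
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -(iw * torch.minimum(unclipped, clipped)).mean()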

YOCO

  • Transformer Architecture
    • Encoder-only: BERT
    • Encoder-decoder: T5
    • Decoder-only: GPT
    • Decoder-decoder: YOCO
  • YOCO
    • Pros
      • Cache once. Low memory consumption.
      • Early exit from the prefill stage: the global KV cache depends only on the self-decoder, so the cross-decoder can be skipped during prefill
      • Efficient distributed long context training.
      • Gated retention.
    • Architecture
      • Self-decoder: Efficient attention
        • Sliding Window Attention
        • Gated Retention
          • Parallel representation: fully parallel across the sequence, most computation
          • Recurrent representation: fully sequential, cheapest per step
          • Chunkwise recurrent representation: balances parallelism and recurrence
      • Cross-decoder: Global attention
    • KV Cache
      • Self-decoder layers each keep only a small local KV cache (sliding window / retention state), totaling O(L·D) for L layers
      • Cross-decoder layers share the single global KV cache produced from the self-decoder output, O(N·D) for sequence length N (code sketch below)
    • Interleaved attention and retention delivers good model quality
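
A heavily simplified PyTorch sketch of the decoder-decoder layout (my own illustration, not the paper's code: single-head attention, no normalization, and gated retention replaced by plain sliding-window attention). The point it shows is that the global K/V is computed once from the self-decoder output and shared by every cross-decoder layer.

import torch
import torch.nn as nn

class SelfDecoderLayer(nn.Module):
    # Efficient-attention layer; approximated here with causal sliding-window attention.
    def __init__(self, d_model, window):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.window = window

    def forward(self, x):
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        idx = torch.arange(N, device=x.device)
        # Each token attends to at most the previous `window` tokens (local KV only).
        mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < self.window)
        attn = (q @ k.transpose(-1, -2)) / D ** 0.5
        attn = attn.masked_fill(~mask, float("-inf")).softmax(dim=-1)
        return x + self.out(attn @ v)

class CrossDecoderLayer(nn.Module):
    # Global-attention layer that reuses the single shared global KV cache.
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, k_global, v_global):
        B, N, D = x.shape
        causal = torch.tril(torch.ones(N, N, dtype=torch.bool, device=x.device))
        attn = (self.q(x) @ k_global.transpose(-1, -2)) / D ** 0.5
        attn = attn.masked_fill(~causal, float("-inf")).softmax(dim=-1)
        return x + self.out(attn @ v_global)

class YOCOSketch(nn.Module):
    def __init__(self, d_model=256, n_layers=8, window=128):
        super().__init__()
        half = n_layers // 2
        self.self_decoder = nn.ModuleList([SelfDecoderLayer(d_model, window) for _ in range(half)])
        self.kv_proj = nn.Linear(d_model, 2 * d_model)    # produces the global K/V once
        self.cross_decoder = nn.ModuleList([CrossDecoderLayer(d_model) for _ in range(half)])

    def forward(self, x):
        for layer in self.self_decoder:
            x = layer(x)
        # "Cache once": one global K/V from the self-decoder output, O(N*D) in total,
        # shared by all cross-decoder layers instead of a per-layer cache.
        k_global, v_global = self.kv_proj(x).chunk(2, dim=-1)
        for layer in self.cross_decoder:
            x = layer(x, k_global, v_global)
        return x

print(YOCOSketch()(torch.randn(1, 64, 256)).shape)    # torch.Size([1, 64, 256])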
