StreamRL
- Reinforcement Learning Framework
- Colocated architecture: veRL, ReaL, RLHFuse
- Pros:
- Avoid resource idleness
- Cons:
- Resource coupling
- Generation time plateaus faster than training time as more resources are added
- Both stages are forced to use the same resource quantity and hardware type
- The shared hardware type may not be suitable or cost-effective for both stages
- Cannot leverage heterogeneous resource pools across datacenters
- CPU-based context switching between the two stages
- Disaggregated architecture: OpenRLHF, NeMo
- Pros:
- Flexible and efficient resource allocation
- Each stage can use the most suitable hardware
- P2P data transfer between stages
- Cons:
- Pipeline bubbles <- Stage dependencies
- Previous work
- Mini-batch pipelining
- Asynchronous pipelining
- This work
- Stream generation
- Skewness bubbles <- Long-tail output lengths
- Previous work
- Temporarily store partially generated samples in replay buffer
- This work
- Skewness aware dispatching and scheduling
- Reinforcement Learning for LLMs
- PPO, GRPO
- Two stages per iteration: Generation, then Training (including reward computation); see the GRPO sketch below
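A minimal sketch of the group-relative advantage computation used by GRPO-style training, with hypothetical reward values; the function name is illustrative and not taken from either paper.

```python
import numpy as np

def grpo_advantages(group_rewards):
    """GRPO-style advantages: normalize each completion's reward by the
    mean and std of the rewards within its prompt group."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, four sampled completions with hypothetical scalar rewards.
print(grpo_advantages([1.0, 0.0, 0.5, 0.0]))
```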
- StreamRL Design
- Stream Generation Service (SGS) + Trainer
- Stream
- SGS streams each completed sample to the trainer as soon as it finishes
- Rather than waiting until all samples in the batch are generated
- Improvement 1
- Previous
- Mini-batch pipelining. Batched at generator.
- This
- Dynamic-batch pipelining. Batched at trainer.
- Improvement 2
- Previous
- One-step asynchronous pipelining. Off-policy.
- Global synchronization to transmit weights
- This
- Fully asynchronous pipelining.
- Weight transmission overlaps with training for the next iteration (see the pipeline sketch below).
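A minimal sketch of how the streaming, dynamic-batch, fully asynchronous pipeline could be wired together; `generate`, `train_step`, and `send_weights` are hypothetical callables, and the token budget is made up.

```python
import queue
import threading

sample_q = queue.Queue()      # completed samples streamed from SGS to the trainer
BATCH_TOKENS = 8192           # hypothetical dynamic-batch budget, in tokens

def sgs_loop(prompts, generate):
    """Stream Generation Service: push each sample the moment it completes,
    instead of waiting for the whole batch to finish."""
    for p in prompts:
        sample_q.put(generate(p))

def trainer_loop(train_step, send_weights, num_iters):
    for _ in range(num_iters):
        batch, tokens = [], 0
        while tokens < BATCH_TOKENS:              # dynamic batching at the trainer
            sample = sample_q.get()
            batch.append(sample)
            tokens += len(sample["output_ids"])
        new_weights = train_step(batch)
        # Fully asynchronous pipelining: ship the new weights in the background,
        # overlapping transmission with the next iteration's work.
        threading.Thread(target=send_weights, args=(new_weights,), daemon=True).start()
```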
- Resource Allocation
- Profiler-based resource allocation (see the allocation sketch below)
- Dynamic resource adjustment
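A sketch of profiler-based allocation under an assumed fixed GPU budget: pick the generation/training split that minimizes the slower stage. The throughput curves below are hypothetical stand-ins for profiled data.

```python
def best_split(total_gpus, gen_time, train_time):
    """Choose the GPU split that balances the pipeline, i.e. minimizes the
    per-iteration time of the slower stage (both times come from profiling)."""
    best = None
    for g in range(1, total_gpus):
        t = max(gen_time(g), train_time(total_gpus - g))
        if best is None or t < best[1]:
            best = (g, t)
    return best  # (GPUs for generation, bottleneck stage time)

# Hypothetical profiles: generation plateaus early, training keeps scaling.
gen = lambda n: max(120.0 / n, 40.0)
trn = lambda n: 240.0 / n
print(best_split(16, gen, trn))   # -> (3, 40.0): the extra GPUs go to the trainer
```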
- Long Tail Issue
- Output length ranker model
- Skewness-aware scheduling
- Run long-tail samples with a smaller batch size
- Use an online-trained ranker model to predict the relative output-length ranks of samples
- Longest-Processing-Time-first (LPT) scheduling (sketched below)
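A sketch of the skewness-aware dispatching idea: sort samples by the ranker's predicted output length and assign them Longest-Processing-Time-first to the least-loaded generator instance. The lengths and instance count are hypothetical.

```python
import heapq

def lpt_dispatch(predicted_lengths, num_instances):
    """Longest-Processing-Time-first dispatch: longest predicted outputs are
    assigned first, each to the currently least-loaded generator instance."""
    heap = [(0, i, []) for i in range(num_instances)]    # (load, instance id, sample ids)
    heapq.heapify(heap)
    order = sorted(range(len(predicted_lengths)),
                   key=lambda i: predicted_lengths[i], reverse=True)
    for idx in order:
        load, inst, assigned = heapq.heappop(heap)
        assigned.append(idx)
        heapq.heappush(heap, (load + predicted_lengths[idx], inst, assigned))
    return sorted(heap)

# A few long-tail samples (hypothetical ranker predictions) among short ones.
print(lpt_dispatch([2000, 150, 180, 90, 1800, 120, 200, 160], num_instances=2))
```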
AReaL
- Fully asynchronous RL system
- Interruptible rollout workers
- Dynamic batching for variable-length outputs
- Parallel reward service
- Components
- Interruptible rollout workers
- Interrupt generation, load the new weights, discard the stale KV cache and recompute it, then resume (see the sketch after this list).
- Reward service
- Trainer workers
- Continuously sample from the replay buffer; each sample is used only once.
- Rollout controller
- Trajectories, along with their rewards, are stored in the replay buffer.
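A conceptual sketch of an interruptible rollout worker; the model interface (`prefill`, `decode_step`, `load_weights`, `eos_token_id`) is assumed for illustration, not taken from AReaL's code.

```python
class InterruptibleRolloutWorker:
    """On a weight update: pause decoding, load the new weights, drop the now
    stale KV cache and recompute it for the tokens generated so far, then resume."""

    def __init__(self, model):
        self.model = model
        self.pending_weights = None   # set by the rollout controller

    def rollout(self, prompt_ids, max_new_tokens):
        tokens = list(prompt_ids)
        kv_cache = self.model.prefill(tokens)
        for _ in range(max_new_tokens):
            if self.pending_weights is not None:           # interruption point
                self.model.load_weights(self.pending_weights)
                self.pending_weights = None
                kv_cache = self.model.prefill(tokens)      # recompute the KV cache
            tok, kv_cache = self.model.decode_step(tokens[-1], kv_cache)
            tokens.append(tok)
            if tok == self.model.eos_token_id:
                break
        return tokens
```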
- Algorithmic Challenges
- Staleness-Aware Training
- Rejects new generation requests that would violate the staleness threshold
- Decoupled PPO Objective
- Separates the behavior policy (which generated the data) from the proximal policy used for clipping (see the loss sketch below)
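A rough sketch of a decoupled PPO loss, assuming per-token log-probabilities under the current, proximal, and behavior policies are available as arrays; the paper's exact formulation may differ in details.

```python
import numpy as np

def decoupled_ppo_loss(logp_new, logp_prox, logp_behav, advantages, eps=0.2):
    """Clip against the proximal policy (a recent anchor), and reweight by an
    importance ratio between proximal and behavior policy to account for the
    staleness of the data that produced the trajectory."""
    ratio = np.exp(logp_new - logp_prox)            # pi_theta / pi_prox
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    is_weight = np.exp(logp_prox - logp_behav)      # pi_prox / pi_behav, held constant
    per_token = is_weight * np.minimum(ratio * advantages, clipped * advantages)
    return -per_token.mean()                        # maximize the clipped objective

# Hypothetical per-token log-probs for a short, slightly stale trajectory.
print(decoupled_ppo_loss(np.array([-1.0, -2.0]), np.array([-1.1, -1.9]),
                         np.array([-1.3, -1.7]), np.array([0.5, -0.2])))
```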
YOCO
- Transformer Architecture
- Encoder-only: BERT
- Encoder-decoder: T5
- Decoder-only: GPT
- Decoder-decoder: YOCO
- YOCO
- Pros
- Cache once. Low memory consumption.
- Early exit of prefilling stage.
- Efficient distributed long context training.
- Gated retention.
- Architecture
- Self-decoder: Efficient attention
- Sliding Window Attention
- Gated Retention
- Parallel Representation: most recomputation
- Recurrent Representation: most serialization
- Chunkwise Recurrent Representation: balanced between the two
- Cross-decoder: Global attention
- KV Cache
- Self-decoder uses local KV caches produced by each individual self-decoder layer, O(L·D)
- Cross-decoder reuses the global KV cache produced by the output of the self-decoder, O(N·D) (see the rough memory comparison below)
- Interleaved attention and retention delivers good model quality
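A back-of-the-envelope comparison of KV cache memory, assuming fp16 caches, that half of the layers are self-decoder layers with a fixed window, and hypothetical model dimensions; the constants are illustrative only.

```python
def kv_bytes(n_tokens, n_layers, d_model, bytes_per=2):
    """Conventional decoder-only layout: every layer caches K and V for all tokens."""
    return 2 * n_tokens * n_layers * d_model * bytes_per          # O(N * L * D)

def yoco_kv_bytes(n_tokens, n_layers, d_model, window=1024, bytes_per=2):
    """YOCO-style layout (rough): self-decoder layers keep only a constant-size
    local cache; a single global cache serves every cross-decoder layer."""
    self_layers = n_layers // 2
    local = 2 * min(window, n_tokens) * self_layers * d_model * bytes_per   # local caches
    global_ = 2 * n_tokens * d_model * bytes_per                            # one global cache
    return local + global_

N, L, D = 1_000_000, 32, 4096   # hypothetical 1M-token context, 32 layers, d_model 4096
print(kv_bytes(N, L, D) / 2**30, "GiB vs", yoco_kv_bytes(N, L, D) / 2**30, "GiB")
```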