04/01/2024-04/07/2024
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Vision Transformer
- Transformer
- Tokens
- Split the H×W image into N = HW/P^2 patches of size P×P; each flattened patch becomes one token (see the sketch after this list).
- Embedding
- Prepend a learnable [class] embedding to the sequence of patch embeddings.
- Add standard 1D positional embeddings; 2D-aware positional embeddings give no significant gains.
- Transformer Encoder
- MLP Head
- Used for classification
- Inductive bias
- Much less image-specific inductive bias than CNNs; spatial relations must be learned from scratch.
- Hybrid architecture
- Use feature maps from a CNN as the input to the patch embedding.
- Serves as an alternative to raw image patches.
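A minimal sketch of the patch-tokenization and embedding steps above, assuming a torch-style implementation (class and variable names here are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches, project each to dimension D,
    prepend a learnable [class] token, and add 1D positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # N = HW / P^2
        # A strided conv is equivalent to flattening P x P patches and applying a linear layer.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))     # learnable [class] embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # 1D positions

    def forward(self, x):                                # x: (B, C, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)      # (B, N, D) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # one [class] token per image
        x = torch.cat([cls, x], dim=1)                   # prepend it to the sequence
        return x + self.pos_embed                        # add positional embedding

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))       # -> (2, 197, 768)
```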
Fine-tuning and Higher Resolution
- Remove the pre-trained prediction head and attach a newly initialized feedforward head for the downstream classes.
- Keep the same patch size and perform 2D interpolation of the pre-trained position embeddings according to their location in the image (sketched below).
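A sketch of that interpolation step, assuming a square token grid with the class-token embedding stored first (layout and function name are illustrative):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Interpolate pre-trained position embeddings to a larger grid for higher-resolution fine-tuning.
    pos_embed: (1, 1 + old_grid**2, D), class-token embedding first."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)   # (1, D, g, g)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid), mode="bicubic",
                             align_corners=False)                                 # 2D interpolation
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pe, patch_pe], dim=1)

# 224 px / 16 = 14x14 grid -> 384 px / 16 = 24x24 grid, patch size unchanged
new_pe = resize_pos_embed(torch.randn(1, 1 + 14 * 14, 768), old_grid=14, new_grid=24)  # (1, 577, 768)
```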
Experiments
- Simple training setup: Adam for pre-training, SGD with momentum for fine-tuning.
- The convolutional inductive bias of CNNs shows an advantage on small datasets but not on large ones.
High-Resolution Image Synthesis with Latent Diffusion Models
Background
- Diffusion Models (DMs) are likelihood-based models and tend to spend excessive capacity on modeling imperceptible details of the data.
- Learning is divided into two stages.
- Perceptual compression: removes high-frequency details but learns little semantic variation.
- Semantic compression: learns the semantic and conceptual composition of the data.
Generative Models for Image Synthesis
- GANs (Generative Adversarial Networks): allow efficient sampling but are difficult to optimize and struggle to capture the full data distribution.
- Likelihood-based methods emphasize good density estimation, which makes optimization better behaved.
- Variational autoencoders (VAEs) & flow-based models
- Efficient synthesis of high-resolution images, but sample quality lags behind.
- Autoregressive models (ARM)
- Computational demands and the sequential sampling process limit them to low-resolution images.
- To mitigate this limitation, two-stage approaches model a compressed latent representation of the image.
- Diffusion Probabilistic Models (DM)
- The UNet architecture provides a good inductive bias for images, but training and sampling in pixel space remain computationally expensive at high resolutions.
- Two Stage Image Synthesis
- Prior two-stage methods must carefully balance quality against compute via the compression rate; LDM scales more gracefully because it uses a convolutional backbone instead of a transformer one.
Method
- Perceptual Image Compression
- Explicit encoder + decoder structure.
- The encoder maps images into a lower-dimensional latent space.
- The decoder reconstructs images from the latent representation.
- Denoising UNet with cross-attention: the conditioning input (e.g., text embeddings) supplies K and V, while the UNet's features supply Q (see the sketch after this list).
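A minimal sketch of that conditioning mechanism: cross-attention inside the denoising UNet, with queries from the UNet's (flattened) spatial features and keys/values from the condition embedding. Names and dimensions are illustrative, not the reference LDM code.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Condition UNet features on an external embedding: Q from image features, K/V from the condition."""
    def __init__(self, dim, cond_dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(cond_dim, dim, bias=False)
        self.to_v = nn.Linear(cond_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, cond):                        # x: (B, N, D) flattened UNet features
        B, N, D = x.shape                              # cond: (B, M, cond_dim), e.g. text tokens
        h, hd = self.heads, D // self.heads
        q = self.to_q(x).view(B, N, h, hd).transpose(1, 2)
        k = self.to_k(cond).view(B, -1, h, hd).transpose(1, 2)
        v = self.to_v(cond).view(B, -1, h, hd).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * hd ** -0.5  # (B, h, N, M)
        out = attn.softmax(dim=-1) @ v                 # (B, h, N, hd)
        return self.to_out(out.transpose(1, 2).reshape(B, N, D))

out = CrossAttention(dim=320, cond_dim=768)(torch.randn(2, 64 * 64, 320), torch.randn(2, 77, 768))
```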
Evaluation
- Small downsampling factors result in slow training progress, while overly large factors put too much compression into the first (autoencoder) stage and cause information loss that limits achievable quality.
Scalable Diffusion Models with Transformers
Diffusion Transformers
Preliminaries
- Diffusion Formula
- Gaussian diffusion models define a forward noising process that gradually corrupts data; the model is trained to learn the reverse process (see the equations after this list).
- Classifier-free Guidance
- Conditional diffusion models take extra information such as a class label c. Randomly dropping the label during training and replacing it with a learned null embedding enables classifier-free guidance.
- Latent diffusion models
- An autoencoder compresses images into smaller spatial representations.
- Train a diffusion model on the latent representations.
- Decode new images from the sampled latents.
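For reference, the standard forward-noising process and the classifier-free guidance rule these preliminaries refer to, written in common DDPM/CFG notation (s is the guidance scale, ∅ the learned null embedding):

```latex
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\; \sqrt{\bar{\alpha}_t}\,x_0,\; (1 - \bar{\alpha}_t)\mathbf{I}\right)

\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing)
    + s \cdot \bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)
```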
Design Space
- Patchify
- An I×I×C (latent) input is patchified into (I/p) × (I/p) tiles, i.e., a token sequence of length T = (I/p)^2.
- DiT Block
- Additional conditioning inputs: timestep t, class label c, natural language, etc.
- In-context conditioning.
- Simply append the additional tokens to the noised image tokens.
- Negligible compute added.
- Cross-attention.
- Add one additional multi-head cross-attention layer, with the conditioning tokens providing K and V.
- Adds 15% compute overhead.
- Adaptive layer norm (adaLN)
- Regress the scale γ and shift β from the sum of the embedding vectors of t and c instead of learning them directly. The most compute-efficient variant.
- adaLN-Zero block
- Initializing each residual block as the identity function is beneficial.
- In addition to γ and β, regress dimension-wise scaling parameters α applied immediately before each residual connection.
- Initialize the final MLP layer to output zero for all α, so every DiT block starts as the identity function (see the sketch after this list).
- Model size
- Model sizes are obtained by jointly scaling the depth N, hidden size d, and number of attention heads.
- Transformer decoder
- A standard linear decoder; decoded tokens are rearranged into the original spatial layout to obtain the predicted noise and covariance.
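A minimal sketch of a DiT block with adaLN-Zero, assuming a torch-style implementation; the conditioning vector c is the sum of the timestep and class-label embeddings, and the zero-initialized modulation MLP makes the block start as the identity (simplified relative to the paper's code):

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Transformer block with adaLN-Zero: scale/shift (gamma, beta) and residual gates (alpha)
    are regressed from the conditioning vector c; zero-init makes the block an identity at start."""
    def __init__(self, dim, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.adaLN[-1].weight)   # zero-init => all gammas/betas/alphas start at 0
        nn.init.zeros_(self.adaLN[-1].bias)

    def forward(self, x, c):                    # x: (B, T, D) tokens, c: (B, D) = t_emb + y_emb
        beta1, gamma1, alpha1, beta2, gamma2, alpha2 = self.adaLN(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + gamma1.unsqueeze(1)) + beta1.unsqueeze(1)
        x = x + alpha1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + gamma2.unsqueeze(1)) + beta2.unsqueeze(1)
        return x + alpha2.unsqueeze(1) * self.mlp(h)

y = DiTBlock(384)(torch.randn(2, 256, 384), torch.randn(2, 384))
```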
Evaluation
- Training
- Default AdamW. No learning rate scheduling or weight decay.
- Diffusion
- Pre-trained VAE model.
- Evaluation metrics
- FID (Fréchet Inception Distance)
- Results
- adaLN-Zero shows the lowest FID while also being the most compute-efficient conditioning variant.
- Scaling up parameters (larger model size) or tokens (smaller patch size) can improve model quality.
- Total DiT Gflops, not parameter count alone, are critical to improving performance.
- Larger DiT models are more compute efficient.
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Innovations
- Combine text and image encoders with an ensemble of expert denoisers, each handling a different stage of the diffusion process.
- Text encoders matter most in the earlier, high-noise stage (T5 & CLIP).
- The image encoder matters more in the later, low-noise stage (CLIP).
Diffusion models
- Train on low-resolution images or latent variables, then use super-resolution diffusion models or latent-to-image decoders to reach the target resolution.
- Trained to recover images corrupted by added Gaussian noise.
Training
Expert Denoisers
- Three hard-coded experts specialized for low, intermediate, and high noise levels (see the routing sketch after this list).
Conditional Inputs
- Three conditioning encoders (T5 text, CLIP text, and CLIP image embeddings); embeddings are computed offline and randomly dropped out during training.
- A U-Net architecture is used for the base model.
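A toy sketch of routing by noise level; the number of experts matches the note above, but the threshold values and names are placeholders, not the paper's actual split of the noise range:

```python
def pick_expert(sigma, experts, thresholds=(0.5, 5.0)):
    """Route a sample to one of three expert denoisers based on its noise level sigma.
    Thresholds are illustrative placeholders."""
    low, high = thresholds
    if sigma < low:
        return experts["low"]     # low noise: refine fine details
    if sigma < high:
        return experts["mid"]     # intermediate noise
    return experts["high"]        # high noise: lay out global structure, text matters most

# Stand-in denoisers; in practice these are separately specialized diffusion models.
experts = {name: (lambda x: x) for name in ("low", "mid", "high")}
denoiser = pick_expert(sigma=3.2, experts=experts)
```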
Paint With Words
- The user-doodled canvas is encoded as a binary mask. The cross-attention logits (QK^T) are shifted by a weighted mask matrix wA; in this cross-attention, Q comes from the image tokens while K and V come from the text tokens (sketched below).
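A sketch of that shifted cross-attention; the mask encoding and the scalar weight w are simplified relative to the paper:

```python
import torch

def paint_with_words_attention(q, k, v, mask, w=1.0):
    """Cross-attention whose logits are shifted by a user-painted mask:
    softmax((Q K^T + w * A) / sqrt(d)) V.
    q: (B, N_img, d) image-token queries; k, v: (B, N_txt, d) text-token keys/values;
    mask A: (B, N_img, N_txt), nonzero where the user painted a text phrase onto an image region."""
    d = q.shape[-1]
    logits = (q @ k.transpose(-2, -1) + w * mask) / d ** 0.5
    return logits.softmax(dim=-1) @ v

B, n_img, n_txt, d = 1, 64, 16, 32
out = paint_with_words_attention(torch.randn(B, n_img, d), torch.randn(B, n_txt, d),
                                 torch.randn(B, n_txt, d), torch.zeros(B, n_img, n_txt))
```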
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Innovations
- New timestep (noise) samplers for rectified flow models.
- A new text-to-image architecture that bidirectionally mixes text and image token streams.
Simulation-Free Training of Flow
- Regress a vector field that transports the noise distribution to the data distribution, generating a probability path between them.
- The forward process is a simple time-dependent combination of a data sample and a noise sample.
- Derive a simple objective based on conditional flow matching (CFM).
Flow Trajectories
- Rectified Flow
- The forward process interpolates linearly between the data sample and the noise sample (see the equations after this list).
- EDM
- The forward process keeps a constant coefficient on the data sample, while the noise sample's coefficient is the exponential of the quantile function of a normal distribution.
- Cosine
- The forward process uses a cosine coefficient on the data sample and a sine coefficient on the noise sample.
- LDM-Linear
- The variance-preserving schedule of LDM, a modification of the DDPM schedule with linearly spaced coefficients.
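In this notation, the rectified-flow trajectory and the resulting conditional flow matching objective (up to the paper's timestep weighting) are:

```latex
z_t = (1 - t)\,x_0 + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})

\mathcal{L}_{CFM} = \mathbb{E}_{t,\,x_0,\,\epsilon}\,
    \bigl\| v_\Theta(z_t, t) - (\epsilon - x_0) \bigr\|_2^2
```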
Tailored SNR Samplers for RF models
- Put more weight on intermediate timesteps by sampling them more frequently.
- Logit-Normal Sampling (see the sketch after this list)
- The location parameter biases training toward earlier or later timesteps; the scale parameter controls the spread.
- Con: the density vanishes at the endpoints t = 0 and t = 1.
- Mode Sampling with Heavy Tails
- Use a scale parameter to control the degree to which the midpoint or the endpoints are favored.
- Includes uniform weighting as a special case.
- CosMap
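A minimal sketch of the logit-normal timestep sampler: draw from a normal with location m and scale s, then squash through a sigmoid (the default values here are illustrative):

```python
import torch

def sample_t_logit_normal(batch_size, loc=0.0, scale=1.0):
    """Logit-normal timestep sampling: t = sigmoid(u), u ~ N(loc, scale^2).
    Concentrates training timesteps around the middle of (0, 1); density vanishes at 0 and 1."""
    u = loc + scale * torch.randn(batch_size)
    return torch.sigmoid(u)

t = sample_t_logit_normal(4)   # four timesteps in (0, 1), biased toward the middle
```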
Text-to-Image Architecture
- Pre-trained text encoders: CLIP-G, CLIP-L, and T5 XXL; images are encoded into a latent space by a pre-trained autoencoder.
- MM-DiT block: two separate sets of transformer weights, one for text tokens and one for image tokens, joined by a shared self-attention over the concatenated sequence (sketched below).
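A simplified sketch of that joint attention in an MM-DiT-style block: text and image tokens keep separate projection weights but attend over the concatenation of both streams (adaLN modulation and MLPs are omitted; names are illustrative):

```python
import torch
import torch.nn as nn

class MMDiTAttention(nn.Module):
    """Joint attention with per-modality weights: separate QKV/output projections for image and
    text tokens, but a single attention over the concatenated token sequence."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads, self.hd = heads, dim // heads
        self.qkv_img = nn.Linear(dim, 3 * dim)
        self.qkv_txt = nn.Linear(dim, 3 * dim)
        self.out_img = nn.Linear(dim, dim)
        self.out_txt = nn.Linear(dim, dim)

    def _split(self, qkv, B):
        q, k, v = qkv.chunk(3, dim=-1)
        return [t.view(B, -1, self.heads, self.hd).transpose(1, 2) for t in (q, k, v)]

    def forward(self, img, txt):                       # img: (B, N, D), txt: (B, M, D)
        B, N, _ = img.shape
        qi, ki, vi = self._split(self.qkv_img(img), B)
        qt, kt, vt = self._split(self.qkv_txt(txt), B)
        q = torch.cat([qi, qt], dim=2)                 # concatenate the two token streams
        k = torch.cat([ki, kt], dim=2)
        v = torch.cat([vi, vt], dim=2)
        attn = (q @ k.transpose(-2, -1)) * self.hd ** -0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, -1, self.heads * self.hd)
        return self.out_img(out[:, :N]), self.out_txt(out[:, N:])

img_out, txt_out = MMDiTAttention(256)(torch.randn(2, 64, 256), torch.randn(2, 77, 256))
```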
Evaluations
- The RF objective with logit-normal timestep sampling performs best among the evaluated combinations.
Improvements
- Encoder
- The quality of latent diffusion models is bounded by the reconstruction quality of the latent autoencoder.
- Increasing the number of latent channels improves performance.
- Captions
- Using 50% synthetically generated captions from an image-to-text (captioning) model provides more detailed descriptions and improves results.
- Architecture
- Outperforms DiT (shared transformer parameters for both modalities), CrossDiT (a cross-attention variant of DiT), and UViT (a UNet/Transformer hybrid).
Training
- QK-normalization (RMSNorm on queries and keys) to prevent attention-logit growth and training instabilities (see the sketch after this list).
- Position Encodings
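A sketch of QK-normalization in an attention layer: apply an RMS-style normalization to the queries and keys before the dot product, which bounds the attention logits (the paper uses learnable RMSNorm; this unparameterized version just illustrates the idea):

```python
import torch

def qk_norm_attention(q, k, v, eps=1e-6):
    """Attention with RMS-normalized queries and keys, keeping logits bounded
    and helping avoid instabilities (e.g., in mixed-precision training).
    q, k, v: (B, heads, tokens, head_dim)."""
    q = q / (q.pow(2).mean(dim=-1, keepdim=True) + eps).sqrt()
    k = k / (k.pow(2).mean(dim=-1, keepdim=True) + eps).sqrt()
    logits = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return logits.softmax(dim=-1) @ v

out = qk_norm_attention(torch.randn(2, 8, 64, 32), torch.randn(2, 8, 64, 32),
                        torch.randn(2, 8, 64, 32))
```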