04/01/2024-04/07/2024
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Vision Transformer
- Transformer
- Tokens
- Split the H×W image into N = HW/P^2 patches of size P×P; each flattened patch becomes one token (see the sketch after this list).
- Embedding
- Prepend a learnable [class] embedding to the sequence of patch embeddings.
- Add standard 1D positional embeddings; 2D-aware positional embeddings give no significant gains.
- Transformer Encoder
- MLP Head
- Used for classification
- Inductive bias
- Much less image-specific inductive bias than CNNs; spatial relations must be learned from scratch.
- Hybrid architecture
- Use feature maps from a CNN as the input to the patch embedding.
- Serves as an alternative to raw image patches.
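A minimal sketch of the patch-tokenization and embedding steps above, assuming a torch-style implementation (class and variable names here are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches, project each to dimension D,
    prepend a learnable [class] token, and add 1D positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # N = HW / P^2
        # A strided conv is equivalent to flattening P x P patches and applying a linear layer.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))     # learnable [class] embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # 1D positions

    def forward(self, x):                                # x: (B, C, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)      # (B, N, D) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # one [class] token per image
        x = torch.cat([cls, x], dim=1)                   # prepend it to the sequence
        return x + self.pos_embed                        # add positional embedding

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))       # -> (2, 197, 768)
```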
Fine-tuning and Higher Resolution
- Remove the pre-trained prediction head and attach a newly initialized feedforward head for the downstream classes.
- Keep the same patch size and perform 2D interpolation of the pre-trained position embeddings according to their location in the image (sketched below).
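A sketch of that interpolation step, assuming a square token grid with the class-token embedding stored first (layout and function name are illustrative):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Interpolate pre-trained position embeddings to a larger grid for higher-resolution fine-tuning.
    pos_embed: (1, 1 + old_grid**2, D), class-token embedding first."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)   # (1, D, g, g)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid), mode="bicubic",
                             align_corners=False)                                 # 2D interpolation
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pe, patch_pe], dim=1)

# 224 px / 16 = 14x14 grid -> 384 px / 16 = 24x24 grid, patch size unchanged
new_pe = resize_pos_embed(torch.randn(1, 1 + 14 * 14, 768), old_grid=14, new_grid=24)  # (1, 577, 768)
```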
Experiments
- Simple training setup: Adam for pre-training, SGD with momentum for fine-tuning.
- The convolutional inductive bias of CNNs shows an advantage on small datasets but not on large ones.
High-Resolution Image Synthesis with Latent Diffusion Models
Background
- Diffusion Models (DMs) are likelihood-based models and tend to spend excessive capacity on modeling imperceptible details of the data.
- Learning is divided into two stages.
- Perceptual compression: removes high-frequency details but learns little semantic variation.
- Semantic compression: learns the semantic and conceptual composition of the data.
Generative Models for Image Synthesis
- GANs (Generative Adversarial Networks): allow efficient sampling but are difficult to optimize and struggle to capture the full data distribution.
- Likelihood-based methods emphasize good density estimation, which makes optimization better behaved.
- Variational autoencoders (VAEs) & flow-based models
- Efficient synthesis of high-resolution images, but sample quality lags behind.
- Autoregressive models (ARM)
- Computational demands and the sequential sampling process limit them to low-resolution images.
- To mitigate this limitation, two-stage approaches model a compressed latent representation of the image.
- Diffusion Probabilistic Models (DM)
- The UNet architecture provides a good inductive bias for images, but training and sampling in pixel space remain computationally expensive at high resolutions.
- Two Stage Image Synthesis
- Prior two-stage methods must carefully balance quality against compute via the compression rate; LDM scales more gracefully because it uses a convolutional backbone instead of a transformer one.
Method
- Perceptual Image Compression
- Explicit encoder + decoder structure.
- The encoder maps images into a lower-dimensional latent space.
- The decoder reconstructs images from the latent representation.
- Denoising UNet with cross-attention: the conditioning input (e.g., text embeddings) supplies K and V, while the UNet's features supply Q (see the sketch after this list).
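A minimal sketch of that conditioning mechanism: cross-attention inside the denoising UNet, with queries from the UNet's (flattened) spatial features and keys/values from the condition embedding. Names and dimensions are illustrative, not the reference LDM code.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Condition UNet features on an external embedding: Q from image features, K/V from the condition."""
    def __init__(self, dim, cond_dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(cond_dim, dim, bias=False)
        self.to_v = nn.Linear(cond_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, cond):                        # x: (B, N, D) flattened UNet features
        B, N, D = x.shape                              # cond: (B, M, cond_dim), e.g. text tokens
        h, hd = self.heads, D // self.heads
        q = self.to_q(x).view(B, N, h, hd).transpose(1, 2)
        k = self.to_k(cond).view(B, -1, h, hd).transpose(1, 2)
        v = self.to_v(cond).view(B, -1, h, hd).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * hd ** -0.5  # (B, h, N, M)
        out = attn.softmax(dim=-1) @ v                 # (B, h, N, hd)
        return self.to_out(out.transpose(1, 2).reshape(B, N, D))

out = CrossAttention(dim=320, cond_dim=768)(torch.randn(2, 64 * 64, 320), torch.randn(2, 77, 768))
```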
Evaluation
- Small downsampling factors result in slow training progress, while overly large factors put too much compression into the first (autoencoder) stage and cause information loss that limits achievable quality.
Scalable Diffusion Models with Transformers
Diffusion Transformers
Preliminaries
- Diffusion Formula
- Gaussian diffusion models define a forward noising process that gradually corrupts data; the model is trained to learn the reverse process (see the equations after this list).
- Classifier-free Guidance
- Conditional diffusion models take extra information such as a class label c. Randomly dropping the label during training and replacing it with a learned null embedding enables classifier-free guidance.
- Latent diffusion models
- An autoencoder compresses images into smaller spatial representations.
- Train a diffusion model on the latent representations.
- Decode new images from the sampled latents.
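For reference, the standard forward-noising process and the classifier-free guidance rule these preliminaries refer to, written in common DDPM/CFG notation (s is the guidance scale, ∅ the learned null embedding):

```latex
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\; \sqrt{\bar{\alpha}_t}\,x_0,\; (1 - \bar{\alpha}_t)\mathbf{I}\right)

\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing)
    + s \cdot \bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)
```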
Design Space
- Patchify
- An I×I×C (latent) input is patchified into (I/p) × (I/p) tiles, i.e., a token sequence of length T = (I/p)^2.
- DiT Block
- Additional conditioning inputs: timestep t, class label c, natural language, etc.
- In-context conditioning.
- Simply append the additional tokens to the noised image tokens.
- Negligible compute added.
- Cross-attention.
- Add one additional multi-head cross-attention layer, with the conditioning tokens providing K and V.
- Adds 15% compute overhead.
- Adaptive layer norm (adaLN)
- Regress the scale γ and shift β from the sum of the embedding vectors of t and c instead of learning them directly. The most compute-efficient variant.
- adaLN-Zero block
- Initializing each residual block as the identity function is beneficial.
- In addition to γ and β, regress dimension-wise scaling parameters α applied immediately before each residual connection.
- Initialize the final MLP layer to output zero for all α, so every DiT block starts as the identity function (see the sketch after this list).
- Model size
- Model sizes are obtained by jointly scaling the depth N, hidden size d, and number of attention heads.
- Transformer decoder
- A standard linear decoder; decoded tokens are rearranged into the original spatial layout to obtain the predicted noise and covariance.
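A minimal sketch of a DiT block with adaLN-Zero, assuming a torch-style implementation; the conditioning vector c is the sum of the timestep and class-label embeddings, and the zero-initialized modulation MLP makes the block start as the identity (simplified relative to the paper's code):

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Transformer block with adaLN-Zero: scale/shift (gamma, beta) and residual gates (alpha)
    are regressed from the conditioning vector c; zero-init makes the block an identity at start."""
    def __init__(self, dim, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.adaLN[-1].weight)   # zero-init => all gammas/betas/alphas start at 0
        nn.init.zeros_(self.adaLN[-1].bias)

    def forward(self, x, c):                    # x: (B, T, D) tokens, c: (B, D) = t_emb + y_emb
        beta1, gamma1, alpha1, beta2, gamma2, alpha2 = self.adaLN(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + gamma1.unsqueeze(1)) + beta1.unsqueeze(1)
        x = x + alpha1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + gamma2.unsqueeze(1)) + beta2.unsqueeze(1)
        return x + alpha2.unsqueeze(1) * self.mlp(h)

y = DiTBlock(384)(torch.randn(2, 256, 384), torch.randn(2, 384))
```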
Evaluation
- Training
- Default AdamW. No learning rate scheduling or weight decay.
- Diffusion
- Pre-trained VAE model.
- Evaluation metrics
- FID (Fréchet Inception Distance)
- Results
- adaLN-Zero shows the lowest FID while also being the most compute-efficient conditioning variant.
- Scaling up parameters (larger model size) or tokens (smaller patch size) can improve model quality.
- Total DiT Gflops, not parameter count alone, are critical to improving performance.
- Larger DiT models are more compute efficient.
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Innovations
- Combine text and image encoders with an ensemble of expert denoisers, each handling a different stage of the diffusion process.
- Text encoders matter most in the earlier, high-noise stage (T5 & CLIP).
- The image encoder matters more in the later, low-noise stage (CLIP).
Diffusion models
- Train on low-resolution images or latent variables, then use super-resolution diffusion models or latent-to-image decoders to reach the target resolution.
- Trained to recover images corrupted by added Gaussian noise.
Training
Expert Denoisers
- Three hard-coded experts specialized for low, intermediate, and high noise levels (see the routing sketch after this list).
Conditional Inputs
- Three conditioning encoders (T5 text, CLIP text, and CLIP image embeddings); embeddings are computed offline and randomly dropped out during training.
- A U-Net architecture is used for the base model.
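A toy sketch of routing by noise level; the number of experts matches the note above, but the threshold values and names are placeholders, not the paper's actual split of the noise range:

```python
def pick_expert(sigma, experts, thresholds=(0.5, 5.0)):
    """Route a sample to one of three expert denoisers based on its noise level sigma.
    Thresholds are illustrative placeholders."""
    low, high = thresholds
    if sigma < low:
        return experts["low"]     # low noise: refine fine details
    if sigma < high:
        return experts["mid"]     # intermediate noise
    return experts["high"]        # high noise: lay out global structure, text matters most

# Stand-in denoisers; in practice these are separately specialized diffusion models.
experts = {name: (lambda x: x) for name in ("low", "mid", "high")}
denoiser = pick_expert(sigma=3.2, experts=experts)
```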
Paint With Words
- The user-doodled canvas is encoded as a binary mask. The cross-attention logits (QK^T) are shifted by a weighted mask matrix wA; in this cross-attention, Q comes from the image tokens while K and V come from the text tokens (sketched below).
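A sketch of that shifted cross-attention; the mask encoding and the scalar weight w are simplified relative to the paper:

```python
import torch

def paint_with_words_attention(q, k, v, mask, w=1.0):
    """Cross-attention whose logits are shifted by a user-painted mask:
    softmax((Q K^T + w * A) / sqrt(d)) V.
    q: (B, N_img, d) image-token queries; k, v: (B, N_txt, d) text-token keys/values;
    mask A: (B, N_img, N_txt), nonzero where the user painted a text phrase onto an image region."""
    d = q.shape[-1]
    logits = (q @ k.transpose(-2, -1) + w * mask) / d ** 0.5
    return logits.softmax(dim=-1) @ v

B, n_img, n_txt, d = 1, 64, 16, 32
out = paint_with_words_attention(torch.randn(B, n_img, d), torch.randn(B, n_txt, d),
                                 torch.randn(B, n_txt, d), torch.zeros(B, n_img, n_txt))
```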
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Innovations
- New timestep (noise) samplers for rectified flow models.
- A new text-to-image architecture that bidirectionally mixes text and image token streams.
Simulation-Free Training of Flow
- Regress a vector field that transports the noise distribution to the data distribution, generating a probability path between them.
- The forward process is a simple time-dependent combination of a data sample and a noise sample.
- Derive a simple objective based on conditional flow matching (CFM).
Flow Trajectories
- Rectified Flow
- The forward process interpolates linearly between the data sample and the noise sample (see the equations after this list).
- EDM
- The forward process keeps a constant coefficient on the data sample, while the noise sample's coefficient is the exponential of the quantile function of a normal distribution.
- Cosine
- The forward process uses a cosine coefficient on the data sample and a sine coefficient on the noise sample.
- LDM-Linear
- The variance-preserving schedule of LDM, a modification of the DDPM schedule with linearly spaced coefficients.
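In this notation, the rectified-flow trajectory and the resulting conditional flow matching objective (up to the paper's timestep weighting) are:

```latex
z_t = (1 - t)\,x_0 + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})

\mathcal{L}_{CFM} = \mathbb{E}_{t,\,x_0,\,\epsilon}\,
    \bigl\| v_\Theta(z_t, t) - (\epsilon - x_0) \bigr\|_2^2
```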
Tailored SNR Samplers for RF models
- Put more weight on intermediate timesteps by sampling them more frequently.
- Logit-Normal Sampling (see the sketch after this list)
- The location parameter biases training toward earlier or later timesteps; the scale parameter controls the spread.
- Con: the density vanishes at the endpoints t = 0 and t = 1.
- Mode Sampling with Heavy Tails
- Use a scale parameter to control the degree to which the midpoint or the endpoints are favored.
- Includes uniform weighting as a special case.
- CosMap
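A minimal sketch of the logit-normal timestep sampler: draw from a normal with location m and scale s, then squash through a sigmoid (the default values here are illustrative):

```python
import torch

def sample_t_logit_normal(batch_size, loc=0.0, scale=1.0):
    """Logit-normal timestep sampling: t = sigmoid(u), u ~ N(loc, scale^2).
    Concentrates training timesteps around the middle of (0, 1); density vanishes at 0 and 1."""
    u = loc + scale * torch.randn(batch_size)
    return torch.sigmoid(u)

t = sample_t_logit_normal(4)   # four timesteps in (0, 1), biased toward the middle
```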
Text-to-Image Architecture
- Pre-trained text encoders: CLIP-G, CLIP-L, and T5 XXL; images are encoded into a latent space by a pre-trained autoencoder.
- MM-DiT block: two separate sets of transformer weights, one for text tokens and one for image tokens, joined by a shared self-attention over the concatenated sequence (sketched below).
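A simplified sketch of that joint attention in an MM-DiT-style block: text and image tokens keep separate projection weights but attend over the concatenation of both streams (adaLN modulation and MLPs are omitted; names are illustrative):

```python
import torch
import torch.nn as nn

class MMDiTAttention(nn.Module):
    """Joint attention with per-modality weights: separate QKV/output projections for image and
    text tokens, but a single attention over the concatenated token sequence."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads, self.hd = heads, dim // heads
        self.qkv_img = nn.Linear(dim, 3 * dim)
        self.qkv_txt = nn.Linear(dim, 3 * dim)
        self.out_img = nn.Linear(dim, dim)
        self.out_txt = nn.Linear(dim, dim)

    def _split(self, qkv, B):
        q, k, v = qkv.chunk(3, dim=-1)
        return [t.view(B, -1, self.heads, self.hd).transpose(1, 2) for t in (q, k, v)]

    def forward(self, img, txt):                       # img: (B, N, D), txt: (B, M, D)
        B, N, _ = img.shape
        qi, ki, vi = self._split(self.qkv_img(img), B)
        qt, kt, vt = self._split(self.qkv_txt(txt), B)
        q = torch.cat([qi, qt], dim=2)                 # concatenate the two token streams
        k = torch.cat([ki, kt], dim=2)
        v = torch.cat([vi, vt], dim=2)
        attn = (q @ k.transpose(-2, -1)) * self.hd ** -0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, -1, self.heads * self.hd)
        return self.out_img(out[:, :N]), self.out_txt(out[:, N:])

img_out, txt_out = MMDiTAttention(256)(torch.randn(2, 64, 256), torch.randn(2, 77, 256))
```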
Evaluations
- The RF objective with logit-normal timestep sampling performs best among the evaluated combinations.
Improvements
- Encoder
- The quality of latent diffusion models is bounded by the reconstruction quality of the latent autoencoder.
- Increasing the number of latent channels improves performance.
- Captions
- Using 50% synthetically generated captions from an image-to-text (captioning) model provides more detailed descriptions and improves results.
- Architecture
- Outperforms DiT (shared transformer parameters for both modalities), CrossDiT (a cross-attention variant of DiT), and UViT (a UNet/Transformer hybrid).
Training
- QK-normalization (RMSNorm on queries and keys) to prevent attention-logit growth and training instabilities (see the sketch after this list).
- Position Encodings
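A sketch of QK-normalization in an attention layer: apply an RMS-style normalization to the queries and keys before the dot product, which bounds the attention logits (the paper uses learnable RMSNorm; this unparameterized version just illustrates the idea):

```python
import torch

def qk_norm_attention(q, k, v, eps=1e-6):
    """Attention with RMS-normalized queries and keys, keeping logits bounded
    and helping avoid instabilities (e.g., in mixed-precision training).
    q, k, v: (B, heads, tokens, head_dim)."""
    q = q / (q.pow(2).mean(dim=-1, keepdim=True) + eps).sqrt()
    k = k / (k.pow(2).mean(dim=-1, keepdim=True) + eps).sqrt()
    logits = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return logits.softmax(dim=-1) @ v

out = qk_norm_attention(torch.randn(2, 8, 64, 32), torch.randn(2, 8, 64, 32),
                        torch.randn(2, 8, 64, 32))
```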