
Reading Notes – 5

04/01/2024-04/07/2024

AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

Vision Transformer

  • Transformer
    • Tokens
      • Split the image into N = HW/P^2 flattened patches, each linearly projected into a token (see the sketch after this list).
    • Embedding
      • Prepend a learnable [class] embedding to the sequence of patch embeddings.
      • A 1D positional embedding is added; a 2D-aware positional embedding is not needed.
    • Transformer Encoder
    • MLP Head
      • Used for classification
  • Inductive bias
    • Much less inductive bias than CNNs; spatial relations are learned from scratch.
  • Hybrid architecture
    • Use the feature map of a CNN as the input from which patch embeddings are computed.
    • Serves as an alternative to raw image patches.
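
A minimal PyTorch sketch of the patchify-and-embed step described above (module and tensor names are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and project each into a D-dim token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # N = HW / P^2
        # A strided convolution is equivalent to flatten-and-project per patch.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                     # learnable [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # 1D positions

    def forward(self, x):                                   # x: (B, C, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)     # prepend the class token
        tokens = torch.cat([cls, tokens], dim=1)            # (B, N + 1, D)
        return tokens + self.pos_embed                      # add 1D positional embedding

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))  # -> (2, 197, 768)
```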

Fine-tuning and Higher Resolution

  • Remove the pre-trained prediction head and attach a newly initialized feed-forward head.
  • Keep the same patch size and perform 2D interpolation of the pre-trained position embeddings according to their location in the image (sketch below).
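
A sketch of that 2D interpolation of pre-trained position embeddings for a higher-resolution fine-tuning grid (grid sizes and the bicubic mode are illustrative choices):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Interpolate ViT position embeddings from old_grid**2 to new_grid**2 patches.

    pos_embed: (1, 1 + old_grid**2, D) -- class-token embedding followed by patch embeddings.
    """
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    d = patch_pe.shape[-1]
    # Reshape to a 2D grid, interpolate, then flatten back to a sequence.
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pe, patch_pe], dim=1)

new_pe = resize_pos_embed(torch.randn(1, 1 + 14 * 14, 768), old_grid=14, new_grid=24)
```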

Experiments

  • Simple setting. Adam for pre-training. SGD + momentum for fine-tuning. No learning rate schedule. No weight decay.
  • The convolutional inductive bias of CNNs shows an advantage on small datasets but not on large datasets.

High-Resolution Image Synthesis with Latent Diffusion Models

Background

  • Diffusion Models (DMs) are likelihood-based models and are prone to spending excessive capacity (and compute) on modeling imperceptible details of the data.
  • Learning is divided into two stages.
    • Perceptual compression. Removes high frequency details but learns little semantic variation.
    • Semantic compression. Learn semantic and conceptual composition of data.

Generative Model for Image Synthesis

  • GAN (Generative Adversarial Networks): Allow for efficient sampling but are difficult to optimize and cannot capture full data distribution.
  • Likelihood-based methods emphasize good density estimation, which makes optimization better behaved.
    • Variational autoencoders (VAE) & Flow-based models
      • Efficient synthesis of high-resolution images, but sample quality is not on par with GANs.
    • Autoregressive models (ARM)
      • Computation demand and sequential processes limit them to low resolution.
      • To work around this limitation, two-stage approaches model a compressed latent representation of the image.
    • Diffusion Probabilistic Models (DM)
      • The UNet architecture provides a natural inductive bias for images, but training and sampling in pixel space remain computationally expensive at high resolution.
  • Two Stage Image Synthesis
    • The compression rate balances quality against compute. LDMs scale more gracefully thanks to a convolutional rather than transformer backbone.

Method

  • Perceptual Image Compression
    • Explicit encoder + decoder structure.
      • The encoder maps images into a latent space.
      • The decoder reconstructs images from latents.
      • Denoising happens in the latent space with a UNet + cross-attention, where K and V are computed from the supplied conditions (sketch after this list).
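
A minimal sketch of this cross-attention conditioning: queries come from the UNet's intermediate latent features, keys and values from the conditioning encoder's tokens (all dimensions are illustrative):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Attend from latent features (queries) to conditioning tokens (keys/values)."""
    def __init__(self, dim=320, cond_dim=768, heads=8):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)        # Q from latent features
        self.to_k = nn.Linear(cond_dim, dim, bias=False)   # K from the condition (e.g. text embedding)
        self.to_v = nn.Linear(cond_dim, dim, bias=False)   # V from the condition
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, cond):                            # x: (B, N, dim), cond: (B, M, cond_dim)
        b, n, _ = x.shape
        h, dh = self.heads, x.shape[-1] // self.heads
        q = self.to_q(x).view(b, n, h, dh).transpose(1, 2)       # (B, h, N, dh)
        k = self.to_k(cond).view(b, -1, h, dh).transpose(1, 2)   # (B, h, M, dh)
        v = self.to_v(cond).view(b, -1, h, dh).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)

out = CrossAttention()(torch.randn(2, 64 * 64, 320), torch.randn(2, 77, 768))
```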

Evaluation

  • Small downsampling factors result in slow training, while overly large downsampling factors in the first-stage model lose information and limit the achievable quality.

Scalable Diffusion Models with Transformers

Diffusion Transformers

Preliminaries

  • Diffusion Formula
    • Gaussian diffusion models assume a forward noising process, and the model is trained to learn the reverse of that process.
  • Classifier-free Guidance
    • Conditional diffusion models take extra information, such as a class label c. Randomly dropping the label during training and replacing it with a learned null embedding enables classifier-free guidance (see the sketch after this list).
  • Latent diffusions models
    • An autoencoder compresses images into smaller spatial representations.
    • A diffusion model is trained on these representations.
    • Decode the new image from the representation.
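
A sketch of classifier-free guidance as described above: labels are randomly dropped (replaced by a null embedding) during training, and at sampling time the conditional and unconditional predictions are mixed with a guidance scale (the `model` interface and the dropout rate are hypothetical):

```python
import torch

NULL_CLASS = 1000   # index of the learned null embedding (illustrative)

def drop_labels(labels, p_drop=0.1):
    """Training: randomly replace class labels with the null class."""
    drop = torch.rand(labels.shape[0]) < p_drop
    return torch.where(drop, torch.full_like(labels, NULL_CLASS), labels)

def guided_noise_pred(model, x_t, t, labels, scale=4.0):
    """Sampling: eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    eps_cond = model(x_t, t, labels)
    eps_uncond = model(x_t, t, torch.full_like(labels, NULL_CLASS))
    return eps_uncond + scale * (eps_cond - eps_uncond)
```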

Design Space

  • Patchify
    • An I*I*C input is patchified into (I/p) * (I/p) tiles/tokens.
  • DiT Block
    • Additional conditioning tokens: timestep t, class label c, natural language, etc.
    • In-context conditioning.
      • Simply append the additional tokens to the noised image tokens.
      • Negligible compute added.
    • Cross-attention.
      • Add one additional multi-head cross-attention layer, with K and V coming from the conditioning tokens.
      • Adds roughly 15% compute overhead.
    • Adaptive layer norm (adaLN)
      • Regress gamma and beta from the sum of the embedding vectors of t and c, instead of learning them directly. The most compute-efficient option.
    • adaLN-Zero block:
      • Initializing each residual block as the identity function is beneficial.
      • Besides regressing gamma and beta, also regress a dimension-wise scaling alpha applied immediately before the residual connection.
      • Initialize the MLP to output zero for all alpha, so each DiT block starts as the identity function (see the sketch after this list).
  • Model size
    • A range of model sizes obtained by jointly scaling the depth N, hidden size d, and number of attention heads.
  • Transformer decoder
    • Standard linear decoder. Rearrange decoded tokens to original spatial layout to get predicted noise and covariance.
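
A simplified PyTorch sketch of an adaLN-Zero DiT block: gamma, beta, and the pre-residual scale alpha are regressed from the conditioning embedding (t plus c), and the regression layer is zero-initialized so the block starts as the identity (layer sizes and module layout are illustrative):

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)   # affine params come from conditioning
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        # Regress shift (beta), scale (gamma), and pre-residual gate (alpha) for both sub-blocks.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.ada[-1].weight)   # zero-init => alpha = 0 => the block is the identity
        nn.init.zeros_(self.ada[-1].bias)

    def forward(self, x, cond):               # cond = embed(t) + embed(c), shape (B, dim)
        b1, g1, a1, b2, g2, a2 = self.ada(cond).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + g1) + b1
        x = x + a1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + g2) + b2
        return x + a2 * self.mlp(h)

x = DiTBlock()(torch.randn(2, 256, 768), torch.randn(2, 768))
```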

Evaluation

  • Training
    • Default AdamW. No learning rate scheduling or weight decay.
  • Diffusion
    • Pre-trained VAE model.
  • Evaluation metrics
    • FID (Frechet Inception Distance)
  • Results
    • adaLN-Zero shows the lowest FID while also being the most compute efficient.
    • Scaling up parameters (larger model size) or tokens (smaller patch size) can improve model quality.
    • DiT Gflops are critical to improving performance
    • Larger DiT models are more compute efficient.

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Innovations

  • Use an ensemble of expert denoisers, each specialized for a different stage of the diffusion process, combined with multiple text and image encoders for conditioning.
    • Text encoder for earlier stage. (T5 & CLIP)
    • Image encoder for later stage. (CLIP)

Diffusion models

  • Train on low-resolution images or latent variables, then use super-resolution diffusion models or latent-to-image decoders.
  • Trained to recover images corrupted by added Gaussian noise.

Training

Expert Denoisers

  • Three hard-coded experts specialized for low, intermediate, and high noise levels (sketch below).
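
A minimal sketch of routing a noisy sample to one of three noise-interval experts; the interval boundaries and the `experts` interface are illustrative assumptions, not the paper's values:

```python
# Hypothetical routing among three noise-interval experts (boundaries are illustrative).
def denoise_with_experts(experts, x_t, sigma, cond):
    """experts = {"high": ..., "mid": ..., "low": ...} -- each a denoising model."""
    if sigma > 5.0:        # early sampling steps: high noise level
        expert = experts["high"]
    elif sigma > 0.5:      # intermediate noise level
        expert = experts["mid"]
    else:                  # late steps: low noise, fine details
        expert = experts["low"]
    return expert(x_t, sigma, cond)
```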

Conditional Inputs

  • Three conditioning embedding encoders; embeddings are precomputed offline and randomly dropped out during training.
  • Use U-Net architecture for base models.

Paint With Words

  • The user-doodled canvas is encoded as a mask. The cross-attention matrix (QK^T) is shifted by a weighted mask matrix wA before the softmax. In this cross-attention, Q holds the query embeddings from image tokens, while K and V hold the key and value embeddings from text tokens (sketch below).
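
A sketch of that attention shift: the doodle is rasterized into a per-token mask A, scaled by a weight w, and added to the QK^T logits before the softmax (shapes and the weighting are illustrative):

```python
import torch
import torch.nn.functional as F

def paint_with_words_attention(q, k, v, mask, w=1.0):
    """q: (B, N_img, d) image-token queries; k, v: (B, N_txt, d) text-token keys/values.

    mask: (B, N_img, N_txt) binary matrix, 1 where a text token's doodled region
    covers the image token's spatial location.
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5   # standard cross-attention logits
    logits = logits + w * mask                    # shift logits inside painted regions
    return F.softmax(logits, dim=-1) @ v

out = paint_with_words_attention(torch.randn(1, 4096, 64),
                                 torch.randn(1, 77, 64),
                                 torch.randn(1, 77, 64),
                                 torch.zeros(1, 4096, 77))
```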

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Innovations

  • New noise (timestep) samplers for rectified flow models.
  • A text-to-image architecture with bi-directional mixing of text and image token streams (MM-DiT).

Simulation-Free Training of Flow

  • Regress a vector field that generates a probability path transporting the noise distribution to the data distribution.
    • The forward process is a linear combination of the data sample and noise.
    • Derive a simple objective based on conditional flow matching (written out below).
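
In LaTeX, a hedged reconstruction of the forward process and the conditional flow matching objective this refers to (notation follows the common rectified-flow formulation, not necessarily the paper's exact symbols):

```latex
% Forward process: linear interpolation between data x_0 and noise \epsilon
z_t = (1 - t)\, x_0 + t\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \quad t \in [0, 1]

% Conditional flow matching: regress the velocity field that transports noise to data
\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,\, x_0,\, \epsilon}
  \left\| v_\Theta(z_t, t) - (\epsilon - x_0) \right\|^2
```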

Flow Trajectories

  • Rectified Flow
    • The forward process uses linear coefficients for both the noise sample and the data sample (the schedules are written out after this list).
  • EDM
    • The forward process keeps a constant coefficient on the data sample but scales the noise sample by a schedule based on the exponential of the quantile function of a normal distribution.
  • Cosine
    • The forward process uses a cosine coefficient on the data sample and a sine coefficient on the noise sample.
  • LDM-Linear
    • ???

Tailored SNR Samplers for RF models

  • Put more weight on intermediate timesteps by sampling them more frequently.
  • Logit-Normal Sampling
    • The location parameter biases the sampled training timesteps toward earlier or later values (sketch after this list).
    • Con: the density vanishes at the endpoints 0 and 1.
  • Mode Sampling with Heavy Tails
    • Use a scale parameter to control the degree to which the midpoint or the endpoints are favored.
    • Includes uniform weighting as a special case.
  • CosMap
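
A sketch of logit-normal timestep sampling: draw u from a normal with location m and scale s, then map it to (0, 1) with the sigmoid (parameter values are illustrative):

```python
import torch

def sample_logit_normal(batch, loc=0.0, scale=1.0):
    """Sample t in (0, 1) with density concentrated around sigmoid(loc)."""
    u = torch.randn(batch) * scale + loc   # u ~ N(loc, scale^2)
    return torch.sigmoid(u)                # logit-normal: density vanishes at 0 and 1

t = sample_logit_normal(8, loc=0.0, scale=1.0)   # biased toward intermediate timesteps
```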

Text-to-Image Architecture

  • CLIP-G/14 and CLIP-L/14 text encoders.
  • T5-XXL text encoder.
  • MM-DiT block: two separate sets of transformer weights (one for text tokens, one for image tokens) joined through a shared self-attention operation (sketch below).
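
A simplified sketch of the MM-DiT idea: text and image tokens keep separate projections and MLPs but are concatenated for a single joint self-attention (dimensions and module layout are illustrative; layer norms and adaLN conditioning are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMDiTBlock(nn.Module):
    """Two parameter streams (text, image) joined by one shared self-attention."""
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.heads = heads
        # Separate per-modality projections and MLPs.
        self.qkv_txt, self.qkv_img = nn.Linear(dim, 3 * dim), nn.Linear(dim, 3 * dim)
        self.out_txt, self.out_img = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.mlp_txt = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_img = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, txt, img):               # txt: (B, T, D), img: (B, N, D)
        b, t, d = txt.shape
        qkv = torch.cat([self.qkv_txt(txt), self.qkv_img(img)], dim=1)   # joint token sequence
        q, k, v = (x.reshape(b, -1, self.heads, d // self.heads).transpose(1, 2)
                   for x in qkv.chunk(3, dim=-1))
        joint = F.scaled_dot_product_attention(q, k, v)                  # shared self-attention
        joint = joint.transpose(1, 2).reshape(b, -1, d)
        txt = txt + self.out_txt(joint[:, :t])                           # split back per modality
        img = img + self.out_img(joint[:, t:])
        return txt + self.mlp_txt(txt), img + self.mlp_img(img)

txt, img = MMDiTBlock()(torch.randn(1, 77, 1024), torch.randn(1, 256, 1024))
```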

Evaluations

  • RF loss with Logit-Normal sampling performs best with proper settings.

Improvements

  • Encoder
    • Latent diffusion models depend on the reconstruction quality of the latent space (the autoencoder).
    • Increasing the number of latent channels improves performance.
  • Captions
    • Using 50% synthetic captions generated by an image-to-text model provides more detailed descriptions.
  • Architecture
    • Outperforms: DiT (shared transformer parameters across modalities); CrossDiT (cross-attention variant of DiT); UViT (combination of UNet and transformer).

Training

  • QK-normalization of the attention inputs to help prevent training instabilities (sketch after this list).
  • Position Encodings
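
A sketch of QK-normalization: normalize the query and key vectors before computing attention logits so the logits stay bounded (RMSNorm is used here as one common choice; treat the exact norm and its placement as assumptions):

```python
import torch
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    """Normalize the last dimension to unit RMS."""
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def qk_normalized_attention(q, k, v):
    q, k = rms_norm(q), rms_norm(k)            # bound the attention logits
    return F.scaled_dot_product_attention(q, k, v)

out = qk_normalized_attention(torch.randn(1, 8, 64, 32),
                              torch.randn(1, 8, 64, 32),
                              torch.randn(1, 8, 64, 32))
```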
