
Reading Notes – 6

04/08/2024 – 04/21/2024

Longformer: The Long-Document Transformer

Sliding Window Attention

  • Reduce quadratic self-attention overhead to linear

Architecture

  • Attention Types (see the mask sketch after this list)
    • A windowed local-context self-attention.
    • A task-motivated global attention that encodes inductive bias about the end task.
  • Sliding Window Attention Setting
    • Window size:
      • Use small window sizes for the lower layers.
      • Increase window sizes when moving to higher layers.
    • Dilation size:
      • Do not use dilated sliding windows in the lower layers, to maximize their capacity to learn and use the immediate local context.
      • Use a small amount of increasing dilation, only on 2 heads, in the higher layers.
  • Similar Work
    • Sparse Transformer
      • Dilated sliding window
    • BP-Transformer
    • Blockwise attention
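
A minimal sketch of the combined sliding-window + global attention pattern, written from the description above rather than from the paper's code; the sequence length, window size, and global-token position below are illustrative assumptions.

```python
import torch

def longformer_attention_mask(seq_len: int, window: int, global_idx: list) -> torch.Tensor:
    """Boolean mask: entry (i, j) is True if query i may attend to key j.

    Local band: every token attends to tokens within +/- window // 2.
    Global tokens attend to, and are attended to by, all tokens, so the
    attention cost grows as O(seq_len * window) rather than O(seq_len ** 2).
    """
    i = torch.arange(seq_len).unsqueeze(1)   # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)   # key positions, row vector
    mask = (i - j).abs() <= window // 2      # sliding-window band
    mask[global_idx, :] = True               # global tokens see everything
    mask[:, global_idx] = True               # everything sees global tokens
    return mask

# Example: 16 tokens, window of 4, a [CLS]-style global token at position 0.
mask = longformer_attention_mask(seq_len=16, window=4, global_idx=[0])
```

For clarity the sketch materializes the full seq_len × seq_len mask; the actual Longformer implementation uses banded matrix operations so the linear memory footprint is realized in practice.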

Evaluation

  • Compared against other models with limited sequence lengths through the sequence-splitting trick.

Longformer-Encoder-Decoder 

  • Replace self-attention with window/global attention
  • Initialize with BART weights.
  • Delivers good performance even without further pre-training.

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Shifted Window Attention

  • Fixed window with shifts between adjacent layers.
    • Cyclic shift of the feature map for compute efficiency in the shifted-window configuration (see the sketch after this list).
  • A learned relative position bias works better than absolute position embedding.
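
A small sketch of the cyclic-shift trick, assuming a PyTorch feature map of shape (B, H, W, C); the window size and shift amount below are the Swin-T defaults, used here as illustrative values.

```python
import torch

def cyclic_shift(x: torch.Tensor, shift: int) -> torch.Tensor:
    """Roll the (B, H, W, C) feature map so that shifted windows line up with
    the regular, non-overlapping window grid; tokens that wrap around from the
    opposite edge are hidden later by an attention mask (not shown here)."""
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split (B, H, W, C) into (num_windows * B, ws * ws, C) token groups."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

# Example: a 56x56 stage-1 feature map, 7x7 windows, shift of 7 // 2 = 3.
feat = torch.randn(1, 56, 56, 96)
windows = window_partition(cyclic_shift(feat, shift=3), ws=7)   # (64, 49, 96)
```

Because the shift is realized with torch.roll, the shifted configuration reuses the same window partition and adds no extra windows, which is where the compute efficiency comes from.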

Architecture

  • A combination of
    • Patch partition (patchify the image into non-overlapping 4×4 patches)
    • Linear embedding (project each patch to the embedding dimension; see the sketch after this list)
    • Swin Transformer blocks (w/ window attention and shifted window attention)
    • Patch merging between stages to build the hierarchical representation
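
A sketch of the patchify + linear embedding step, assuming 4×4 patches and embedding dimension 96 (the Swin-T configuration); a strided convolution is used here because it is equivalent to cutting non-overlapping patches and applying one shared linear projection.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify an image and linearly embed every patch."""
    def __init__(self, patch_size: int = 4, in_chans: int = 3, embed_dim: int = 96):
        super().__init__()
        # kernel_size == stride == patch_size  =>  one linear projection per patch
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                   # (B, embed_dim, H/4, W/4)
        x = x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)
        return self.norm(x)

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))   # (2, 56 * 56, 96)
```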

Learning Transferable Visual Models From Natural Language Supervision

Background

  • NLP has succeeded at pre-training from raw text, surpassing models trained on high-quality crowd-labeled datasets.
  • CV has been moving to the transformer architecture.
    • Some models use raw text as natural language supervision for image representation learning, but static softmax classifiers and a limited set of image classes restrict their power.
    • Scale matters.

Approach

  • Natural language supervision
    • Enables a large dataset + zero-shot transfer.
  • Creating a large dataset
    • 400M image–text pairs.
    • 500K queries × up to 20K pairs per query.
  • Efficient Pre-Training
    • Contrastive instead of generative.
    • Predict which text as a whole is paired with which image. Not exact words.
  • Choosing and Scaling a Model
    • The goal is to select the matching text prompt for each image from the candidate texts in the batch.
    • Text encoder
      • Standard Transformer
      • Sequence length capped at 76
    • Vision encoder
      • ResNet-50 or ViT
    • Both embeddings are L2-normalized, and pairwise cosine similarities are computed (see the sketch after this list).
    • Interpret the image encoder as a computer vision backbone and the text encoder as a hypernetwork that generates the weights of a linear classifier.
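
A sketch of the contrastive objective above, close in spirit to the pseudocode in the CLIP paper but written from memory; the encoder outputs and the temperature value are stand-in assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          logit_scale: float) -> torch.Tensor:
    """Symmetric cross-entropy over pairwise cosine similarities.

    image_emb, text_emb: (N, D) joint-space embeddings of N paired examples.
    logit_scale: learned temperature (CLIP learns its log; about 14.3 at init).
    """
    image_emb = F.normalize(image_emb, dim=-1)         # unit norm, so dot product
    text_emb = F.normalize(text_emb, dim=-1)           # equals cosine similarity
    logits = logit_scale * image_emb @ text_emb.t()    # (N, N) similarity matrix
    targets = torch.arange(len(logits))                # true pairs lie on the diagonal
    loss_images = F.cross_entropy(logits, targets)     # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_images + loss_texts) / 2

# Random embeddings stand in for the ResNet/ViT and Transformer encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512), logit_scale=14.3)
```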

Training

  • Train for 32 epochs.
  • Cosine learning rate schedule.

Zero-Shot Transfer

  • Prompt engineering
    • Polysemy: Same word but different meanings. Provide context.
    • Single-word labels: use a template, e.g. “A photo of a {label}.” (see the sketch below).
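
A sketch of zero-shot classification with the template above; encode_image and encode_text are hypothetical stand-ins for the trained CLIP encoders, not real API calls.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, labels, encode_image, encode_text):
    """Pick the label whose templated prompt best matches the image.

    encode_image(image)   -> (1, D) image embedding   (stand-in function)
    encode_text(prompts)  -> (K, D) text embeddings   (stand-in function)
    """
    prompts = [f"A photo of a {label}." for label in labels]  # context helps with polysemy
    img = F.normalize(encode_image(image), dim=-1)
    txt = F.normalize(encode_text(prompts), dim=-1)
    probs = (100.0 * img @ txt.t()).softmax(dim=-1)           # (1, K) class probabilities
    return labels[probs.argmax().item()], probs

# Stand-in encoders (random embeddings) so the sketch runs end to end.
enc_i = lambda img: torch.randn(1, 512)
enc_t = lambda prompts: torch.randn(len(prompts), 512)
label, probs = zero_shot_classify(torch.randn(3, 224, 224),
                                  ["cat", "dog", "crane (bird)"], enc_i, enc_t)
```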

Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

Observation

  • Dealing with different resolutions and aspect ratios in batched training:
    • Resizing hurts model quality.
    • Padding hurts training efficiency.
  • Pack multiple images into one sequence. It enables variable resolution while preserving the aspect ratio, inspired by example packing in natural language processing.

Architecture

  • Masked self-attention and masked pooling.
    • Allow multiple images to be packed into one sequence during training (see the sketch after this list).
  • Factorized & fractional positional embeddings.
    • Let the position embeddings support various resolutions (image sizes) and aspect ratios.
      • Absolute (patch-index) coordinates: good for sensing different aspect ratios.
      • Fractional (relative) coordinates: good for generalizing across different resolutions.
  • Continuous token dropping
    • Random omission of input patches during training. 
    • Dropping rate can vary across different images.
  • Resolution sampling
    • Mixture of different resolutions to improve performance
  • Padding examples and the contrastive loss
    • Use a per-example loss instead of a per-token loss, since vision tasks are per-example rather than per-token.
    • Use the chunked contrastive loss
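
A minimal sketch of example packing with masked self-attention and masked pooling; the sequence length, padding id, and feature dimension are illustrative assumptions, not the paper's implementation.

```python
import torch

def packing_attention_mask(example_ids: torch.Tensor) -> torch.Tensor:
    """Boolean (L, L) mask for a packed sequence: a patch token may attend
    only to tokens that belong to the same image; padding uses id -1.

    example_ids: (L,) image index of every token in the packed sequence.
    """
    same_image = example_ids.unsqueeze(0) == example_ids.unsqueeze(1)
    not_pad = (example_ids != -1).unsqueeze(0) & (example_ids != -1).unsqueeze(1)
    return same_image & not_pad            # block-diagonal, one block per image

# Pack the patch tokens of a 6-patch and a 4-patch image, pad to length 12.
example_ids = torch.tensor([0] * 6 + [1] * 4 + [-1] * 2)
mask = packing_attention_mask(example_ids)                     # (12, 12)

# Masked pooling: average each image's own tokens into one per-example feature.
tokens = torch.randn(12, 256)
pooled = torch.stack([tokens[example_ids == k].mean(dim=0) for k in (0, 1)])  # (2, 256)
```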

Training

  • Classification pre-training and contrastive language–image pre-training.
