
Reading Notes – 6

04/08/2024 – 04/21/2024

Longformer: The Long-Document Transformer

Sliding Window Attention

  • Reduce quadratic self-attention overhead to linear

Architecture

  • Attention Types (see the mask sketch after this list)
    • A windowed local-context self-attention.
    • A task-motivated global attention that encodes inductive bias about the end task.
  • Sliding Window Attention Setting
    • Window size:
      • Use small window sizes for the lower layers.
      • Increase window sizes when moving to higher layers.
    • Dilation size:
      • Do not use dilated sliding windows in the lower layers, to maximize their capacity to learn and use the immediate local context.
      • Use a small amount of increasing dilation, only on 2 heads, in the higher layers.
  • Similar Work
    • Sparse Transformer
      • Dilated sliding window
    • BP-Transformer
    • Blockwise attention
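
A minimal sketch of the combined sliding-window + global attention pattern, written from the description above rather than from the paper's code; the sequence length, window size, and global-token position below are illustrative assumptions.

```python
import torch

def longformer_attention_mask(seq_len: int, window: int, global_idx: list) -> torch.Tensor:
    """Boolean mask: entry (i, j) is True if query i may attend to key j.

    Local band: every token attends to tokens within +/- window // 2.
    Global tokens attend to, and are attended to by, all tokens, so the
    attention cost grows as O(seq_len * window) rather than O(seq_len ** 2).
    """
    i = torch.arange(seq_len).unsqueeze(1)   # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)   # key positions, row vector
    mask = (i - j).abs() <= window // 2      # sliding-window band
    mask[global_idx, :] = True               # global tokens see everything
    mask[:, global_idx] = True               # everything sees global tokens
    return mask

# Example: 16 tokens, window of 4, a [CLS]-style global token at position 0.
mask = longformer_attention_mask(seq_len=16, window=4, global_idx=[0])
```

For clarity the sketch materializes the full seq_len × seq_len mask; the actual Longformer implementation uses banded matrix operations so the linear memory footprint is realized in practice.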

Evaluation

  • Compared against other models with limited sequence lengths through the sequence-splitting trick.

Longformer-Encoder-Decoder 

  • Replace self-attention with window/global attention
  • Initialize with BART weights.
  • Delivers good performance even without further pre-training.

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Shifted Window Attention

  • Fixed window with shifts between adjacent layers.
    • Cyclic shift of the feature map for compute efficiency in the shifted-window configuration (see the sketch after this list).
  • A learned relative position bias works better than absolute position embedding.
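
A small sketch of the cyclic-shift trick, assuming a PyTorch feature map of shape (B, H, W, C); the window size and shift amount below are the Swin-T defaults, used here as illustrative values.

```python
import torch

def cyclic_shift(x: torch.Tensor, shift: int) -> torch.Tensor:
    """Roll the (B, H, W, C) feature map so that shifted windows line up with
    the regular, non-overlapping window grid; tokens that wrap around from the
    opposite edge are hidden later by an attention mask (not shown here)."""
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split (B, H, W, C) into (num_windows * B, ws * ws, C) token groups."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

# Example: a 56x56 stage-1 feature map, 7x7 windows, shift of 7 // 2 = 3.
feat = torch.randn(1, 56, 56, 96)
windows = window_partition(cyclic_shift(feat, shift=3), ws=7)   # (64, 49, 96)
```

Because the shift is realized with torch.roll, the shifted configuration reuses the same window partition and adds no extra windows, which is where the compute efficiency comes from.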

Architecture

  • A combination of
    • Patch partition (patchify the image into non-overlapping 4×4 patches)
    • Linear embedding (project each patch to the embedding dimension; see the sketch after this list)
    • Swin Transformer blocks (w/ window attention and shifted window attention)
    • Patch merging between stages to build the hierarchical representation
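
A sketch of the patchify + linear embedding step, assuming 4×4 patches and embedding dimension 96 (the Swin-T configuration); a strided convolution is used here because it is equivalent to cutting non-overlapping patches and applying one shared linear projection.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify an image and linearly embed every patch."""
    def __init__(self, patch_size: int = 4, in_chans: int = 3, embed_dim: int = 96):
        super().__init__()
        # kernel_size == stride == patch_size  =>  one linear projection per patch
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                   # (B, embed_dim, H/4, W/4)
        x = x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)
        return self.norm(x)

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))   # (2, 56 * 56, 96)
```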

Learning Transferable Visual Models From Natural Language Supervision

Background

  • NLP has succeeded at pre-training from raw text, surpassing models trained on high-quality crowd-labeled datasets.
  • CV has been moving to the transformer architecture.
    • Some models use raw text as natural language supervision for image representation learning, but static softmax classifiers and a limited set of image classes restrict their power.
    • Scale matters.

Approach

  • Natural language supervision
    • Enables a large dataset + zero-shot transfer.
  • Creating a large dataset
    • 400M image–text pairs.
    • 500K queries × up to 20K pairs per query.
  • Efficient Pre-Training
    • Contrastive instead of generative.
    • Predict which text as a whole is paired with which image. Not exact words.
  • Choosing and Scaling a Model
    • The goal is to select the matching text prompt for each image from the candidate texts in the batch.
    • Text encoder
      • Standard Transformer
      • Sequence length capped at 76
    • Vision encoder
      • ResNet-50 or ViT
    • Both embeddings are L2-normalized, and pairwise cosine similarities are computed (see the sketch after this list).
    • Interpret the image encoder as a computer vision backbone and the text encoder as a hypernetwork that generates the weights of a linear classifier.
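
A sketch of the contrastive objective above, close in spirit to the pseudocode in the CLIP paper but written from memory; the encoder outputs and the temperature value are stand-in assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          logit_scale: float) -> torch.Tensor:
    """Symmetric cross-entropy over pairwise cosine similarities.

    image_emb, text_emb: (N, D) joint-space embeddings of N paired examples.
    logit_scale: learned temperature (CLIP learns its log; about 14.3 at init).
    """
    image_emb = F.normalize(image_emb, dim=-1)         # unit norm, so dot product
    text_emb = F.normalize(text_emb, dim=-1)           # equals cosine similarity
    logits = logit_scale * image_emb @ text_emb.t()    # (N, N) similarity matrix
    targets = torch.arange(len(logits))                # true pairs lie on the diagonal
    loss_images = F.cross_entropy(logits, targets)     # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_images + loss_texts) / 2

# Random embeddings stand in for the ResNet/ViT and Transformer encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512), logit_scale=14.3)
```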

Training

  • Train for 32 epochs.
  • Cosine learning rate schedule.

Zero-Shot Transfer

  • Prompt engineering
    • Polysemy: Same word but different meanings. Provide context.
    • Single-word labels: use a template, e.g. “A photo of a {label}.” (see the sketch below).
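
A sketch of zero-shot classification with the template above; encode_image and encode_text are hypothetical stand-ins for the trained CLIP encoders, not real API calls.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, labels, encode_image, encode_text):
    """Pick the label whose templated prompt best matches the image.

    encode_image(image)   -> (1, D) image embedding   (stand-in function)
    encode_text(prompts)  -> (K, D) text embeddings   (stand-in function)
    """
    prompts = [f"A photo of a {label}." for label in labels]  # context helps with polysemy
    img = F.normalize(encode_image(image), dim=-1)
    txt = F.normalize(encode_text(prompts), dim=-1)
    probs = (100.0 * img @ txt.t()).softmax(dim=-1)           # (1, K) class probabilities
    return labels[probs.argmax().item()], probs

# Stand-in encoders (random embeddings) so the sketch runs end to end.
enc_i = lambda img: torch.randn(1, 512)
enc_t = lambda prompts: torch.randn(len(prompts), 512)
label, probs = zero_shot_classify(torch.randn(3, 224, 224),
                                  ["cat", "dog", "crane (bird)"], enc_i, enc_t)
```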

Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

Observation

  • Dealing with different resolutions and aspect ratios in batched training:
    • Resizing hurts model quality.
    • Padding hurts training efficiency.
  • Pack multiple images into one sequence. It enables variable resolution while preserving the aspect ratio, inspired by example packing in natural language processing.

Architecture

  • Masked self-attention and masked pooling.
    • Allow multiple images to be packed into one sequence during training (see the sketch after this list).
  • Factorized & fractional positional embeddings.
    • Let the position embeddings support various resolutions (image sizes) and aspect ratios.
      • Absolute (patch-index) coordinates: good for sensing different aspect ratios.
      • Fractional (relative) coordinates: good for generalizing across different resolutions.
  • Continuous token dropping
    • Random omission of input patches during training. 
    • Dropping rate can vary across different images.
  • Resolution sampling
    • Mixture of different resolutions to improve performance
  • Padding examples and the contrastive loss
    • Use a per-example loss instead of a per-token loss, since vision tasks are per-example rather than per-token.
    • Use the chunked contrastive loss
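
A minimal sketch of example packing with masked self-attention and masked pooling; the sequence length, padding id, and feature dimension are illustrative assumptions, not the paper's implementation.

```python
import torch

def packing_attention_mask(example_ids: torch.Tensor) -> torch.Tensor:
    """Boolean (L, L) mask for a packed sequence: a patch token may attend
    only to tokens that belong to the same image; padding uses id -1.

    example_ids: (L,) image index of every token in the packed sequence.
    """
    same_image = example_ids.unsqueeze(0) == example_ids.unsqueeze(1)
    not_pad = (example_ids != -1).unsqueeze(0) & (example_ids != -1).unsqueeze(1)
    return same_image & not_pad            # block-diagonal, one block per image

# Pack the patch tokens of a 6-patch and a 4-patch image, pad to length 12.
example_ids = torch.tensor([0] * 6 + [1] * 4 + [-1] * 2)
mask = packing_attention_mask(example_ids)                     # (12, 12)

# Masked pooling: average each image's own tokens into one per-example feature.
tokens = torch.randn(12, 256)
pooled = torch.stack([tokens[example_ids == k].mean(dim=0) for k in (0, 1)])  # (2, 256)
```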

Training

  • Classification pre-training and contrastive language–image pre-training.
