04/08/2024-04/21/2024
Longformer: The Long-Document Transformer
Sliding Window Attention
- Reduce quadratic self-attention overhead to linear
Architecture
- Attention Types
- A windowed local-context self-attention.
- A global attention motivated by an end task that encodes inductive bias about the task.
- Sliding Window Attention Setting
- Window size:
- Use small window sizes for the lower layers.
- Increase window sizes when moving to higher layers.
- Dilation size:
- Do not use dilated sliding windows for lower layers, to maximize their capacity to learn and use the immediate local context.
- For higher layers, use a small amount of increasing dilation on only 2 heads (a mask sketch with dilation and global tokens follows this list).
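A minimal NumPy sketch of the attention pattern described above: a banded local window per token, optional dilation, and a few global positions that attend everywhere and are attended by everyone. The function name and parameters are illustrative, not the paper's API.

```python
import numpy as np

def sliding_window_mask(seq_len, window_size, dilation=1, global_idx=()):
    """Boolean mask where mask[i, j] = True means query i may attend to key j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        # Local band: every `dilation`-th neighbor within `window_size` steps.
        for offset in range(-window_size, window_size + 1, dilation):
            j = i + offset
            if 0 <= j < seq_len:
                mask[i, j] = True
    for g in global_idx:
        mask[g, :] = True   # the global token attends to all positions
        mask[:, g] = True   # all positions attend to the global token
    return mask

# 16 tokens, a window of 2 on each side, token 0 (e.g. [CLS]) made global.
m = sliding_window_mask(16, window_size=2, global_idx=(0,))
print(m.sum(), "allowed query-key pairs out of", 16 * 16)
```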
- Similar Work
- Sparse Transformer
- Dilated sliding window
- BPTransformer
- Blockwise attention
Evaluation
- Compare against other models with limited sequence length through the sequence-splitting trick (a chunking sketch follows).
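A small sketch, under assumed chunk sizes, of the sequence-splitting trick: a model with a limited context window processes a long document as overlapping chunks whose per-chunk predictions are aggregated afterwards.

```python
def split_into_chunks(token_ids, max_len=512, stride=256):
    """Split a long token sequence into overlapping chunks of at most max_len
    tokens; consecutive chunks overlap by (max_len - stride) tokens."""
    chunks, start = [], 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += stride
    return chunks

# A 1200-token document becomes four overlapping chunks of at most 512 tokens.
print([len(c) for c in split_into_chunks(list(range(1200)))])
```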
Longformer-Encoder-Decoder
- Replace the encoder self-attention with windowed/global attention.
- Initialize with BART weights.
- Delivers good performance even without additional pre-training.
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Shifted Window Attention
- Fixed window size, with the window partition shifted between consecutive layers.
- Cyclic shift of the feature map implements the shifted windows efficiently (a cyclic-shift sketch follows this list).
- A learned relative position bias works better than absolute position encoding.
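A minimal PyTorch sketch of the cyclic shift plus window partition. The shift keeps the number of windows fixed; in the real model an attention mask additionally blocks pairs that came from originally non-adjacent regions, and the output is rolled back after attention. Sizes and helper names here are illustrative.

```python
import torch

def cyclic_shift(x, shift):
    """x: (B, H, W, C) feature map. Roll spatially by -shift so the shifted
    windows become contiguous blocks; roll back by +shift after attention."""
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

def window_partition(x, window_size):
    """Split (B, H, W, C) into non-overlapping windows, returning a tensor of
    shape (num_windows * B, window_size * window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

x = torch.randn(1, 8, 8, 96)            # toy feature map
shifted = cyclic_shift(x, shift=2)      # shift = window_size // 2 in Swin
wins = window_partition(shifted, window_size=4)
print(wins.shape)                        # (4 windows, 16 tokens, 96 channels)
```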
Architecture
- A combination of
- Patch partition (patchify)
- Linear embedding (a strided-convolution sketch follows this list)
- Swin Transformer blocks (alternating window attention and shifted window attention)
- Patch merging between stages, which produces the hierarchical feature maps
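A common way to implement patchify + linear embedding in one step is a convolution whose kernel size and stride both equal the patch size; the sizes below are illustrative.

```python
import torch
import torch.nn as nn

# Patchify + linear embedding in one op: a conv with kernel = stride = patch
# size maps each non-overlapping patch to one embedding vector.
patch_size, embed_dim = 4, 96                # illustrative sizes
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img)                    # (1, 96, 56, 56)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 56*56, 96) token sequence
print(tokens.shape)
```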
Learning Transferable Visual Models From Natural Language Supervision
Background
- NLP succeeded with pre-training on raw text at scale, surpassing models trained on high-quality crowd-labeled datasets.
- CV has been adopting the Transformer architecture.
- Some models use raw text as natural language supervision for image representation learning. However, static softmax classifiers over a fixed, limited set of classes restrict their power.
- Scale matters.
Approach
- Natural language supervision
- Enable a large dataset + zero-shot transfer.
- Creating a large dataset
- 400M image + text pairs
- 500K queries × up to 20K (image, text) pairs per query.
- Efficient Pre-Training
- Contrastive instead of generative.
- Predict which text as a whole is paired with which image, rather than the exact words of the caption.
- Choosing and Scaling a Model
- The goal is to select the matching text prompt for each image from the batch of candidates.
- Text encoder
- Standard Transformer
- Sequence length capped at 76
- Vision encoder
- ResNet-50 or ViT
- Normalize both embeddings and compute pairwise cosine similarities (a loss sketch follows this list).
- Interpret the image encoder as a computer vision backbone and the text encoder as a hypernetwork that generates the weights of a linear classifier.
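A minimal sketch of the contrastive objective described in the bullets above: normalize both embeddings, compute the N×N cosine-similarity matrix scaled by a temperature, and apply cross-entropy in both directions with the diagonal as the targets. Shapes and names are illustrative, not CLIP's actual code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, logit_scale):
    """image_emb, text_emb: (N, D) embeddings of N paired images and texts.
    The i-th image matches the i-th text; every other pairing is a negative."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()   # (N, N) scaled cosine sims
    targets = torch.arange(len(image_emb))
    loss_i = F.cross_entropy(logits, targets)         # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)     # text -> image direction
    return (loss_i + loss_t) / 2

# Toy batch of 8 pairs with 512-d embeddings and a fixed temperature.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512),
                        logit_scale=torch.tensor(100.0))
print(loss.item())
```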
Training
- Train for 32 epochs.
- Cosine learning rate schedule.
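A small sketch of a cosine learning-rate schedule; the warmup phase and the concrete values are illustrative assumptions, not the paper's exact settings.

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_steps=0, min_lr=0.0):
    """Cosine decay from base_lr to min_lr over total_steps, with an optional
    linear warmup for the first warmup_steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Learning rate at the start, after warmup, mid-training, and near the end.
print([round(cosine_lr(s, 100, 1e-3, warmup_steps=10), 6) for s in (0, 10, 55, 99)])
```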
Zero-Shot Transfer
- Prompt engineering
- Polysemy: the same word can have different meanings, so provide context in the prompt.
- Single-word labels: use a template such as “A photo of a {label}.” (a classification sketch follows this list).
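A minimal sketch of zero-shot classification with prompt templates: embed one templated prompt per class, embed the image, and take the most similar class by cosine similarity. The encoder callables below are stand-ins, not CLIP's API, and the class names only illustrate the polysemy point.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, image_encoder, text_encoder,
                       template="A photo of a {}."):
    """Return the index of the class whose templated prompt embedding has the
    highest cosine similarity to the image embedding."""
    prompts = [template.format(name) for name in class_names]
    text_emb = F.normalize(text_encoder(prompts), dim=-1)   # (N, D)
    img_emb = F.normalize(image_encoder(image), dim=-1)     # (D,)
    scores = text_emb @ img_emb                             # (N,) similarities
    return int(scores.argmax())

# Toy stand-in encoders so the sketch runs end to end.
classes = ["crane (bird)", "crane (machine)", "boxer (dog)"]
fake_text_encoder = lambda prompts: torch.randn(len(prompts), 512)
fake_image_encoder = lambda image: torch.randn(512)
print(classes[zero_shot_classify(None, classes, fake_image_encoder, fake_text_encoder)])
```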
Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
Observation
- Dealing with different resolution and aspect ratios in batched training.
- Resizing hurts model quality.
- Padding hurts training efficiency.
- Pack multiple images into one sequence. It enables variable resolution while preserving the aspect ratio, inspired by example packing in natural language processing (a packing sketch follows).
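A small first-fit sketch of example packing under an assumed maximum sequence length: images of different resolutions contribute different numbers of patch tokens, and several images are packed into one fixed-length sequence.

```python
def greedy_pack(example_lengths, max_seq_len):
    """Greedily pack variable-length examples (given by their token counts)
    into sequences of at most max_seq_len tokens; returns the example indices
    assigned to each packed sequence (first-fit heuristic)."""
    packs, pack_lens = [], []
    for idx, n in enumerate(example_lengths):
        for p, used in enumerate(pack_lens):
            if used + n <= max_seq_len:
                packs[p].append(idx)
                pack_lens[p] += n
                break
        else:
            packs.append([idx])
            pack_lens.append(n)
    return packs

# Five images with different patch counts packed into 300-token sequences.
print(greedy_pack([196, 49, 256, 100, 64], max_seq_len=300))
```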
Architecture
- Masked self-attention and masked pooling.
- Allow multiple images to be packed into one sequence during training without attending across image boundaries (a block-diagonal mask sketch follows this list).
- Factorized & fractional positional embeddings.
- Let the positional embedding support various resolutions (image sizes) and aspect ratios.
- Absolute coordinates: retain information about the aspect ratio.
- Fractional coordinates (normalized by the image side length): independent of resolution.
- Continuous token dropping
- Random omission of input patches during training.
- Dropping rate can vary across different images.
- Resolution sampling
- Mixture of different resolutions to improve performance
- Padding examples and the contrastive loss
- Use a per-example loss (on pooled representations) instead of a per-token loss, since these are image-level tasks.
- Use the chunked contrastive loss for efficiency at scale.
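A minimal NumPy sketch of the masked self-attention idea for packed sequences: a block-diagonal mask lets tokens attend only within their own image, and padding tokens are excluded entirely (masked pooling groups tokens per image in the same way). Names and sizes are illustrative.

```python
import numpy as np

def packed_attention_mask(tokens_per_image, pad_to=None):
    """Boolean mask for a packed sequence: mask[i, j] = True only if tokens i
    and j belong to the same image. Padding tokens attend to nothing and are
    attended by nothing."""
    image_ids = np.concatenate([np.full(n, k) for k, n in enumerate(tokens_per_image)])
    if pad_to is not None and pad_to > len(image_ids):
        image_ids = np.concatenate([image_ids, np.full(pad_to - len(image_ids), -1)])
    same_image = image_ids[:, None] == image_ids[None, :]
    not_pad = (image_ids != -1)[:, None] & (image_ids != -1)[None, :]
    return same_image & not_pad

# Three images packed into one sequence of 4 + 3 + 2 = 9 tokens, padded to 12.
mask = packed_attention_mask([4, 3, 2], pad_to=12)
print(mask.astype(int))
```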
Training
- Classification training and contrastive language image training