Swin Transformer: Empirical Evaluation on Small Fine-Grained Data
A controlled comparison of four architecture families (Swin T/S/B, RegNetY CNNs, EfficientNet B3-B7, ViT-B/16) on the Oxford-IIIT Pet Dataset under RTX 4090 constraints. Three findings: Swin's hierarchical attention transfers cleanly to small datasets (93.8-96.35% accuracy), EfficientNet's compound scaling breaks down (B3 beats B7 by 8.66 points), and ViT fails catastrophically (7.17%, barely above the 2.7% random-guess baseline for 37 classes).
Why Swin Works on Small Data
Swin Transformers were designed for ImageNet-scale data. The interesting result of this study is that they also work without that scale. This page explains why, structurally.
The original problem Swin solves
The Vision Transformer (Dosovitskiy et al. 2021) demonstrated that attention could replace convolution for image classification, but only with massive pretraining data (JFT-300M, ImageNet-22K). Two structural problems limit it, one of compute cost and one of data efficiency:
1. Quadratic attention complexity. ViT computes attention globally across all image patches. For a 224×224 image divided into 16×16 patches, that's 196 tokens and 196² = 38,416 pairwise attention scores per head in every layer. Because the cost grows with the square of the token count, higher resolutions and dense prediction tasks quickly become impractical.
2. No inductive bias. A convolution layer assumes that nearby pixels are related and that the same operation should apply at every spatial location, so locality and translation equivariance are baked in. ViT makes neither assumption. It learns everything from data, which is great when you have unlimited data, and terrible when you don't.
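The arithmetic in point 1 can be checked with a few lines of Python. This is a minimal sketch: the helper name `vit_token_stats` is made up for this page, the extra class token is ignored, and the larger input sizes are illustrative assumptions, not resolutions used in this study.

```python
def vit_token_stats(image_size, patch_size):
    """Return (patch tokens, pairwise attention scores per head) for a square image.

    Assumption: the class token is ignored, so only patch tokens are counted.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    tokens = per_side * per_side
    return tokens, tokens * tokens


# 224 matches the example in the text; 384 and 896 are illustrative sizes.
for size in (224, 384, 896):
    tokens, pairs = vit_token_stats(size, 16)
    print(f"{size}x{size}: {tokens} tokens, {pairs:,} attention scores per head")
```

Doubling the side length of the image quadruples the token count and multiplies the number of attention scores by sixteen, which is the scaling problem that windowed attention is meant to avoid.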
Swin's two innovations
Swin (Liu et al. 2021) addresses both problems with shifted window attention and hierarchical patch merging.
Shifted window attention
Instead of computing attention globally, Swin partitions the feature map into non-overlapping windows of M×M patches (M = 7 by default) and computes attention inside each window. For a feature map of h×w patches with C channels, this drops the complexity from quadratic to linear in the number of patches hw:
ViT (global): Ω(MSA) = 4hwC² + 2(hw)²C (quadratic in hw)
Swin (windowed): Ω(W-MSA) = 4hwC² + 2M²hwC (linear in hw for fixed M)
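To make the gap concrete, the two cost formulas can be evaluated directly. The sketch below is illustrative only: the function names are made up for this page, and the example sizes (a 56×56 feature map with C = 96 and M = 7, which corresponds to the first stage of Swin-T on a 224×224 input) are assumptions chosen for the example, not measurements from this study.

```python
def msa_cost(h, w, C):
    """Global self-attention cost: 4hwC^2 + 2(hw)^2 C."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C


def wmsa_cost(h, w, C, M):
    """Windowed self-attention cost: 4hwC^2 + 2 M^2 hw C."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C


# Assumed example sizes: 56x56 patches, 96 channels, 7x7 windows.
h, w, C, M = 56, 56, 96, 7
print(f"global:   {msa_cost(h, w, C):,}")
print(f"windowed: {wmsa_cost(h, w, C, M):,}")

# Doubling the resolution: the windowed cost grows 4x (linear in hw),
# while the attention term of the global cost grows 16x.
print(wmsa_cost(2 * h, 2 * w, C, M) / wmsa_cost(h, w, C, M))
```

The projection term 4hwC² is identical in both formulas. Only the attention term changes, shrinking by a factor of hw/M², which is 3136/49 = 64 for these sizes.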
Windowed attention alone would have a problem: no information flows between windows. Swin fixes this by shifting the window partition between consecutive layers. Layer ℓ uses a regular grid; layer ℓ+1 displaces it by (⌊M/2⌋, ⌊M/2⌋) patches. The new windows straddle the boundaries of the old ones, so over a few blocks information propagates across the whole feature map without ever computing global attention.
A naive shifted partition would produce more windows, some of them smaller than M×M. The implementation instead uses cyclic shifting plus attention masking to keep the number of windows constant, which is what makes this fast on real hardware rather than just on paper.
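The effect of the shift can be sketched on a toy grid. This is a simplified illustration in plain Python, not the reference implementation: the function `window_ids` is hypothetical, the 8×8 grid with M = 4 is chosen for readability (Swin's default is M = 7), and the attention mask is left out.

```python
def window_ids(height, width, M, shift=0):
    """Map each (row, col) position to the index of the window it lands in.

    With shift > 0 the grid is cyclically rolled up and left by `shift`
    positions before the regular partition is applied. The attention mask
    used in practice is omitted here.
    """
    windows_per_row = width // M
    ids = {}
    for r in range(height):
        for c in range(width):
            rolled_r = (r - shift) % height
            rolled_c = (c - shift) % width
            ids[(r, c)] = (rolled_r // M) * windows_per_row + (rolled_c // M)
    return ids


regular = window_ids(8, 8, M=4)
shifted = window_ids(8, 8, M=4, shift=2)

# Neighbours on opposite sides of a regular window boundary...
print(regular[(3, 3)], regular[(4, 4)])
# ...fall into the same window after the shift.
print(shifted[(3, 3)], shifted[(4, 4)])
# The number of windows is unchanged by the cyclic shift.
print(len(set(regular.values())), len(set(shifted.values())))
```

The cyclic roll also places positions from opposite edges of the grid, such as (0, 0) and (6, 6) here, in the same window. Those pairs are not spatial neighbours, which is why the attention mask is needed in a real layer.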
Hierarchical patch merging
ViT produces a single-resolution feature map. Swin produces a four-stage hierarchy that matches what CNNs do:
Stage 1: H/4 × W/4 × C (high res, fine detail)
Stage 2: H/8 × W/8 × 2C
Stage 3: H/16 × W/16 × 4C
Stage 4: H/32 × W/32 × 8C (low res, global context)
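For a concrete input the stage shapes follow directly from the table above. The helper below is a hypothetical sketch; the 224×224 input and C = 96 (the base width of Swin-T) are assumed example values.

```python
def swin_stage_shapes(height, width, C, num_stages=4):
    """Return (rows, cols, channels) of the feature map after each stage.

    Stage 1 downsamples by 4; each later stage halves the resolution
    and doubles the channel count.
    """
    shapes = []
    for stage in range(num_stages):
        stride = 4 * 2**stage
        shapes.append((height // stride, width // stride, C * 2**stage))
    return shapes


for i, shape in enumerate(swin_stage_shapes(224, 224, 96), start=1):
    print(f"Stage {i}: {shape}")
```

With these values the last stage is a 7×7 map with 768 channels, so a single 7×7 window covers the entire feature map at that point.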
Between stages, patch merging concatenates the features of each 2×2 group of neighboring patches (C channels each, 4C in total) and applies a linear projection down to 2C channels, much like a strided convolution but applied to attention features. This gives Swin two things ViT doesn't have: multi-scale features (useful for any vision task), and an architecture that's a drop-in backbone for detection and segmentation.
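Patch merging can be sketched with nested lists. This is a toy illustration under stated assumptions: the function names are hypothetical, the concatenation order within each 2×2 block is an arbitrary choice, and the projection weights are placeholder ones where a real model would use a learned 4C-to-2C matrix.

```python
def merge_patches(fmap):
    """Concatenate each 2x2 block of C-dim features into one 4C-dim vector.

    `fmap` is a list of rows, each row a list of per-patch feature lists.
    """
    H, W = len(fmap), len(fmap[0])
    if H % 2 or W % 2:
        raise ValueError("height and width must be even")
    merged = []
    for r in range(0, H, 2):
        row = []
        for c in range(0, W, 2):
            # List concatenation: four C-dim vectors become one 4C-dim vector.
            row.append(fmap[r][c] + fmap[r + 1][c] + fmap[r][c + 1] + fmap[r + 1][c + 1])
        merged.append(row)
    return merged


def project(vec, weight):
    """Apply a linear map; `weight` holds one row per output channel."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


# Toy 4x4 feature map with C = 2: each patch stores its own (row, col).
fmap = [[[r, c] for c in range(4)] for r in range(4)]
merged = merge_patches(fmap)                # 2x2 grid, 4C = 8 channels
weight = [[1] * 8 for _ in range(4)]        # placeholder 4C -> 2C weights
reduced = [[project(v, weight) for v in row] for row in merged]
print(len(merged), len(merged[0]), len(merged[0][0]))
print(len(reduced[0][0]))
```

The spatial size halves in each dimension while the channel count doubles, which is the same trade a stride-2 convolution makes in a CNN.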
Why these innovations matter on small data
My experiments weren't designed to validate Swin against ViT at scale; that's already been done. They were designed to test whether Swin's structural choices help when data is scarce. They do, and here are the structural reasons:
1. Windowed attention is a soft inductive bias. By restricting attention to local windows, Swin imposes a structural assumption that nearby patches matter more than distant ones, without hard-coding a fixed local filter the way convolution does. This is the same locality assumption that lets CNNs train on small datasets. ViT throws that assumption away and has to learn it from data; when there is too little data to learn it from, it can't.
2. The hierarchy concentrates capacity where the model can learn from limited data. Multi-scale features mean the model isn't trying to learn 37 pet breeds from a single fixed-resolution feature map. The early stages pick up fine-grained distinguishing detail at high resolution, and the later stages aggregate it into coarse breed-level signals at low resolution. The stages run in sequence, each building on the output of the one before.
3. Pretrained Swin transfers well because its features look like CNN features. The hierarchical feature pyramid has the same shape that most downstream vision systems built around CNN backbones expect. Fine-tuning a Swin backbone on a new dataset works the way fine-tuning a ResNet works. ViT's single-scale feature map doesn't transfer the same way, which is likely part of why ViT-B/16 failed here.
The result in one sentence
Swin works on small data because its inductive biases are soft enough to learn from data but hard enough not to require massive data, and because its feature hierarchy plays well with transfer learning. The original paper's headline numbers are about scale; this study's contribution is showing the architecture still wins when the scale isn't there.