Swin Transformer: Empirical Evaluation on Small Fine-Grained Data
A controlled four-family architecture comparison (Swin T/S/B, RegNetY CNNs, EfficientNet B3-B7, ViT-B/16) on the Oxford-IIIT Pet Dataset under RTX 4090 constraints. Three findings: Swin's hierarchical attention transfers cleanly to small datasets (93.8-96.35%), EfficientNet's compound scaling breaks (B3 beats B7 by 8.66 points), and ViT catastrophically fails (7.17%: barely above the 2.7% random baseline).
Methodology
The goal was a controlled comparison: same dataset, same split, same epoch count, same loss, same hardware. The differences that remained: batch size, input resolution, optimizer: were forced by each model's memory profile, not by experimental drift.
Dataset
Oxford-IIIT Pet Dataset: 37 classes (25 dog breeds + 12 cat breeds), approximately 7,400 images, ~200 per class. Standard fine-grained classification benchmark with real-world challenges: intra-class variability in poses and colors, scale differences across samples, background clutter, and lighting variance.
Why this dataset: small enough to surface small-data scaling problems, structured enough to expect well-pretrained backbones to do well, fine-grained enough that the gap between architectures matters.
Split and preprocessing
- Train/validation split: 80% / 20%
- Image loading: custom PyTorch Dataset class, label extraction from filename convention
- Augmentation pipeline: Resize → ToTensor → Normalize
- Normalization: ImageNet statistics: mean
[0.485, 0.456, 0.406], std[0.229, 0.224, 0.225] - Pretraining: all models loaded with ImageNet pretrained weights via
timm. Only the final classifier head was replaced to match the 37 classes.
Training configuration
| Setting | Value |
|---|---|
| Epochs | 5 |
| Loss | CrossEntropyLoss |
| Optimizer | AdamW |
| Learning rate | 1e-3 |
| Weight decay | Default per timm settings |
| Workers | 0-4 (per family) |
Each training epoch was followed by a validation pass. FLOPs were measured with fvcore.
Why 5 epochs
Deliberately constrained, for two reasons. First, it surfaces the convergence behavior of each architecture: does it learn fast or does it need a long warmup? Swin models converged by epoch 2-3 with stable validation; ViT had barely moved off random by epoch 5. That's the signal. Second, it kept the comparison feasible on a single RTX 4090: longer training across 12 model variants would have run for weeks.
The tradeoff: some architectures might still have been improving at epoch 5. The conclusions are about convergence behavior and small-data transfer, not asymptotic performance. The challenges.md page treats this honestly.
Hardware
- GPU: NVIDIA RTX 4090 (24GB)
- System: AMD Ryzen 9 7950X3D, Ubuntu kernel 6.17.x, CUDA-enabled PyTorch
- Constraint: single consumer card, no multi-GPU
The hardware shaped the experiment. Batch sizes had to drop to 16-64 across all families to avoid OOM crashes. ViT-L/16 was attempted and immediately ran out of memory, it doesn't appear in results because it never trained. The CNN families were originally designed around batch sizes of 128-256, which we couldn't sustain. Some of the conclusions about training stability and convergence are partly conditioned on this constraint and are flagged where it matters.
Per-family configuration
| Family | Input size | Batch size | Notes |
|---|---|---|---|
| Swin-T | 224² | 64 | timm default |
| Swin-S | 224² | 64 | timm default |
| Swin-B (224²) | 224² | 64 | best peak accuracy |
| Swin-B (384²) | 384² | reduced | early overfitting at high res |
| RegNetY-4G | 224² | 64 | timm default |
| RegNetY-8G | 224² | 64 | timm default |
| RegNetY-16G | 224² | 64 | timm default |
| EfficientNet-B3 | 300² | 16 | native input size |
| EfficientNet-B4 | 380² | 16 | native input size |
| EfficientNet-B5 | 456² | 16 | native input size |
| EfficientNet-B6 | 528² | 16 | native input size |
| EfficientNet-B7 | 600² | 16 | native input size |
| ViT-B/16 | 384² | 64 | ImageNet-21K → 1K weights |
| ViT-L/16 | 384² | : | OOM, did not train |
Native input sizes were used for EfficientNet because they're integral to the compound-scaling design: running B7 at 224² would have defeated the comparison.