Swin Transformer: Empirical Evaluation on Small Fine-Grained Data
A controlled comparison of four architecture families (Swin-T/S/B, RegNetY CNNs, EfficientNet B3-B7, ViT-B/16) on the Oxford-IIIT Pet Dataset, run on a single RTX 4090. Three findings: Swin's hierarchical attention transfers cleanly to small datasets (93.8-96.35%), EfficientNet's compound scaling breaks (B3 beats B7 by 8.66 points), and ViT fails catastrophically (7.17%, barely above the 2.7% random baseline).
A comprehensive empirical study of Swin Transformer architectures for fine-grained image classification, conducted as part of CS 634: Deep Learning (Fall 2025) at NJIT. This isn't a Swin reimplementation. It's a deployment-realism question: do the architectural advantages demonstrated on ImageNet-1K and COCO actually translate to the resource-constrained, small-dataset scenarios that real-world projects face?
The study systematically compares four architecture families on the Oxford-IIIT Pet Dataset (37 classes, ~7,400 images), all trained on a single RTX 4090.
The four-family comparison
| Family | Models tested | Best result |
|---|---|---|
| Swin Transformers | Swin-T, Swin-S, Swin-B (224²), Swin-B (384²) | 96.35% (Swin-B 224²) |
| RegNetY CNNs | RegNetY-4G, 8G, 16G | 85.25% (RegNetY-16G) |
| EfficientNet | B3, B4, B5, B6, B7 | 80.58% (B3) |
| Vision Transformers | ViT-B/16 | 7.17% (catastrophic) |
Three findings worth the writeup
1. Swin's hierarchical attention scales down cleanly. On 7,400 images, roughly three orders of magnitude fewer than ImageNet-22K, Swin variants still reached 93.8% to 96.35% accuracy. The shifted-window mechanism and hierarchical patch merging transfer to small datasets through fine-tuning, without training from scratch. The original paper's headline result was at scale; this study shows the architecture works without the scale.
2. EfficientNet's compound scaling breaks on small fine-grained data. EfficientNet-B3 (11M parameters) outperformed EfficientNet-B7 (64M parameters) by 8.66 percentage points. This inverse-scaling result runs against the compound-scaling rule the family is built on: the scaling principles tuned on ImageNet-1K don't generalize to small, specialized domains. Higher input resolutions (600² for B7) also hurt rather than helped.
3. ViT fails catastrophically. ViT-B/16 achieved 7.17% accuracy, barely above the 2.7% random baseline (1/37) for a 37-class problem. The original paper reported 84% on ImageNet. The difference is data scale: ViT's quadratic global attention and near-zero inductive bias require ImageNet-21K-scale pretraining to work. On 7,400 images, the inductive bias deficit is fatal. ViT-L/16 couldn't be evaluated at all because it ran out of memory on the 24 GB GPU.
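The cost contrast between ViT's global attention and Swin's windowed attention can be made concrete with the complexity formulas given in the Swin paper (Liu et al. 2021). The sketch below is a back-of-envelope illustration, not part of the study's code. The function names are hypothetical, the formulas omit the softmax cost as the paper does, and the example grid sizes (56×56 tokens with 96 channels and 7×7 windows for the first Swin-T stage, 14×14 tokens with 768 channels for ViT-B/16 at 224²) are the standard published configurations rather than values taken from this README. It says nothing about accuracy; it only shows why global attention grows quadratically with token count while windowed attention grows linearly.

```python
def msa_flops(h: int, w: int, c: int) -> int:
    """Global self-attention cost over an h x w token grid with c channels.

    Omega(MSA) = 4*h*w*C^2 + 2*(h*w)^2*C, quadratic in the number of tokens.
    """
    tokens = h * w
    return 4 * tokens * c**2 + 2 * tokens**2 * c


def wmsa_flops(h: int, w: int, c: int, m: int) -> int:
    """Windowed self-attention cost with non-overlapping m x m windows.

    Omega(W-MSA) = 4*h*w*C^2 + 2*M^2*h*w*C, linear in the number of tokens.
    """
    tokens = h * w
    return 4 * tokens * c**2 + 2 * m**2 * tokens * c


def chance_accuracy(num_classes: int) -> float:
    """Accuracy (in percent) of uniform random guessing."""
    return 100.0 / num_classes


if __name__ == "__main__":
    # First Swin-T stage at 224x224 input: 56x56 tokens, 96 channels, 7x7 windows.
    global_cost = msa_flops(56, 56, 96)
    window_cost = wmsa_flops(56, 56, 96, 7)
    print(f"global / windowed cost on a 56x56 grid: {global_cost / window_cost:.1f}x")

    # ViT-B/16 at 224x224 input: 14x14 tokens, 768 channels (class token ignored).
    print(f"ViT-B/16 attention cost per layer: {msa_flops(14, 14, 768):,}")

    # Random baseline for the 37-class pet dataset.
    print(f"chance accuracy: {chance_accuracy(37):.1f}%")
```

Setting the window size equal to the grid side (`m=56` on a 56×56 grid) makes the two formulas coincide, which is a quick way to check that the windowed form is a strict generalization of the global one.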
At a glance
| Item | Details |
|---|---|
| Task | Fine-grained image classification |
| Dataset | Oxford-IIIT Pet Dataset (37 classes, ~7,400 images) |
| Architectures compared | 4 families, 12 model variants |
| Hardware | NVIDIA RTX 4090, 24 GB |
| Framework | PyTorch + timm |
| Training | 5 epochs, AdamW, CrossEntropyLoss, ImageNet pretrained |
| Split | 80% train / 20% validation |
| Course | CS 634: Deep Learning, NJIT, Fall 2025 |
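As a rough illustration of how the setup in the table maps onto PyTorch and timm, here is a minimal sketch. It is not the study's training script. The names `RunConfig`, `split_sizes`, and `build_training_objects` are hypothetical, the timm identifier `swin_tiny_patch4_window7_224` is an assumed choice for Swin-T, and the learning rate is an assumption because this README does not state one. The epoch count, class count, split, optimizer, and loss come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Settings from the table above unless marked as an assumption."""

    model_name: str = "swin_tiny_patch4_window7_224"  # assumed timm identifier
    num_classes: int = 37
    epochs: int = 5
    batch_size: int = 32          # the study used 16-64 depending on the family
    train_percent: int = 80       # 80% train / 20% validation
    learning_rate: float = 1e-4   # assumption: not stated in this README


def split_sizes(n_images: int, train_percent: int) -> tuple[int, int]:
    """Return (n_train, n_val), using integer arithmetic to avoid float rounding."""
    n_train = n_images * train_percent // 100
    return n_train, n_images - n_train


def build_training_objects(cfg: RunConfig):
    """Create the model, loss, and optimizer (requires torch and timm)."""
    # Imported lazily so the helpers above run without the GPU stack installed.
    import timm
    import torch

    # pretrained=True loads ImageNet weights; num_classes swaps in a 37-way head.
    model = timm.create_model(
        cfg.model_name, pretrained=True, num_classes=cfg.num_classes
    )
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.learning_rate)
    return model, criterion, optimizer


if __name__ == "__main__":
    cfg = RunConfig()
    # Approximate split for a dataset of about 7,400 images.
    print(split_sizes(7400, cfg.train_percent))
```

Keeping the configuration in a frozen dataclass makes it easy to run the same loop over every family by changing only `model_name` and `batch_size`.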
What this study honestly is, and isn't
It isn't a reproduction of the original Swin paper's headline numbers. The original used ImageNet-1K and ImageNet-22K pretraining, trained for 300 epochs on multi-GPU clusters, and applied Swin as a backbone for detection and segmentation. This study uses a single consumer GPU, 5 epochs, transfer learning from ImageNet, and a small fine-grained dataset.
It is a systematic answer to the question a practitioner actually asks: if I have a small dataset and one GPU, which architecture should I pick? The answer turns out to be more interesting than the question.
The hardware constraints are documented openly, not hidden. Batch sizes were forced down to 16-64 across all families to prevent OOM crashes. ViT-L/16 wouldn't fit at all. Some of the performance gaps are partly attributable to suboptimal training conditions rather than to the architectures themselves. That distinction matters and is called out in challenges.md.
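One common way to arrive at a workable batch size under a fixed memory budget is to try candidates from largest to smallest and keep the first that fits. The sketch below illustrates that idea only; it is not a description of how the batch sizes in this study were chosen. The helper `largest_fitting_batch_size` and the stand-in `fake_step` are hypothetical, and `MemoryError` is used as a placeholder exception so the example runs without a GPU.

```python
from typing import Callable, Iterable, Optional, Tuple, Type


def largest_fitting_batch_size(
    try_step: Callable[[int], None],
    candidates: Iterable[int] = (64, 32, 16),
    oom_errors: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> Optional[int]:
    """Return the first candidate for which try_step succeeds, or None.

    try_step should run one forward/backward pass at the given batch size
    and raise one of oom_errors if the pass does not fit in memory.
    """
    for batch_size in candidates:
        try:
            try_step(batch_size)
        except oom_errors:
            continue  # too large, fall through to the next smaller candidate
        return batch_size
    return None  # nothing fit, analogous to the ViT-L/16 case


def fake_step(batch_size: int) -> None:
    """Stand-in for a real training step: pretends anything above 32 is too big."""
    if batch_size > 32:
        raise MemoryError(f"batch size {batch_size} does not fit")


if __name__ == "__main__":
    print(largest_fitting_batch_size(fake_step))
```

With PyTorch, the real step would typically pass `oom_errors=(torch.cuda.OutOfMemoryError,)` (available in recent PyTorch releases) and call `torch.cuda.empty_cache()` between attempts; both are left out here so the sketch stays dependency-free.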
Where to go from here
- methodology.md: dataset, splits, optimizer, batch sizes per family, hardware
- results.md: full architecture comparison, accuracy-per-FLOP, training dynamics
- swin-deep-dive.md: shifted-window attention, hierarchical patch merging, why it transfers to small data
- baseline-analysis.md: the EfficientNet inverse-scaling result, RegNet consistency, ViT failure
- challenges.md: RTX 4090 constraints, ViT-L/16 OOM, batch-size compromises
- references.md: Liu et al. 2021 (Swin), Dosovitskiy et al. 2021 (ViT), Tan & Le 2019 (EfficientNet)