Swin Transformer: Empirical Evaluation on Small Fine-Grained Data
A controlled four-family architecture comparison (Swin T/S/B, RegNetY CNNs, EfficientNet B3-B7, ViT-B/16) on the Oxford-IIIT Pet Dataset, run on a single RTX 4090. Three findings: Swin's hierarchical attention transfers cleanly to small datasets (93.8-96.35%), EfficientNet's compound scaling breaks (B3 beats B7 by 8.66 points), and ViT fails catastrophically (7.17%, barely above the 2.7% random baseline).
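The 2.7% random baseline follows from the dataset's 37 breed classes. A minimal sketch of that arithmetic, assuming uniform guessing over roughly balanced classes (the variable names are illustrative, not taken from the study's code):

```python
# Chance-level accuracy for a 37-way classifier, compared with the
# reported ViT-B/16 result. Assumes uniform guessing over balanced classes.
NUM_CLASSES = 37          # Oxford-IIIT Pet: 37 cat and dog breeds
VIT_ACCURACY_PCT = 7.17   # ViT-B/16 accuracy reported above

chance_pct = 100.0 / NUM_CLASSES
ratio_to_chance = VIT_ACCURACY_PCT / chance_pct

print(f"chance baseline: {chance_pct:.2f}%")
print(f"ViT-B/16 vs chance: {ratio_to_chance:.2f}x")
```

On these numbers ViT-B/16 lands at under three times chance, which is why the result reads as a failure to learn rather than as a weak model.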
Challenges and Honest Limitations
The constraints of this study shaped its conclusions. This page documents them openly so the results can be read for what they are.
1. Single consumer GPU
The constraint: one RTX 4090 with 24GB of memory. No multi-GPU, no rented A100, no DGX.
The impact: every model family in this study was originally designed and benchmarked with substantially more compute. The original Swin paper used multi-GPU training with batch size 1024 across 300 epochs. EfficientNet was tuned with batch sizes of 128-256. CNN families in general expect to run with large effective batch sizes for stable training. On a single 4090, those numbers weren't reachable.
The compromise: batch sizes dropped to 16 for the EfficientNet B3-B7 variants (forced by their native input resolutions of 300×300 to 600×600 pixels) and 64 for the smaller models. Both are well below what each architecture's authors recommended. Some of the absolute accuracy numbers, particularly for the CNN families, would likely differ with proper batch sizes and more epochs.
Why it doesn't invalidate the findings: the comparisons are between architectures under the same constraints. Swin still beat the CNNs by 11-16 points, a gap large enough that batch-size tuning alone is unlikely to close it. EfficientNet's inverse-scaling result is even more robust to hardware constraints: bigger models did worse with the same training budget, which is the whole point.
2. ViT-L/16 wouldn't train
ViT-L/16 with 300M+ parameters and 384² input ran out of memory immediately on the 24GB card. It does not appear in the results because it could not be trained, not because it performed poorly.
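A rough back-of-the-envelope sketch of where the memory goes. It assumes fp32 training with an Adam-style optimizer (two moment buffers per parameter) and takes 300M as the parameter count. Both are assumptions for illustration, not measurements from this study, and the helper names are hypothetical.

```python
# Rough memory arithmetic for ViT-L/16 at 384x384 input.
# Assumptions: fp32 everywhere, Adam-style optimizer with two moment buffers.

def training_state_gib(num_params, bytes_per_param=4, optimizer_states=2):
    """GiB needed for weights + gradients + optimizer moments (no activations)."""
    copies = 1 + 1 + optimizer_states  # weights, gradients, optimizer buffers
    return num_params * bytes_per_param * copies / 2**30

def vit_tokens(image_size, patch_size, class_token=True):
    """Sequence length of a ViT: one token per patch, plus an optional class token."""
    patches = (image_size // patch_size) ** 2
    return patches + (1 if class_token else 0)

state_gib = training_state_gib(300_000_000)
tokens_384 = vit_tokens(384, 16)
tokens_224 = vit_tokens(224, 16)
# Self-attention score matrices grow with the square of the sequence length.
attention_growth = (tokens_384 / tokens_224) ** 2

print(f"fixed training state: {state_gib:.2f} GiB")
print(f"tokens at 384: {tokens_384}, at 224: {tokens_224}")
print(f"attention matrix growth 224 -> 384: {attention_growth:.1f}x")
```

Under these assumptions the fixed state is only a few GiB, so the out-of-memory failure most plausibly comes from activations, which scale with batch size, depth, and the much longer token sequence at 384² input.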
This is itself a deployment-relevant finding. If you're choosing an architecture for use on a single consumer GPU, ViT-L is simply off the table. The smaller ViT-B/16 fit but failed for different reasons (see baseline-analysis.md).
3. ViT-B/16 may have been partly throttled by hardware
ViT-B/16 caused multiple crashes during training. The ~450s-per-epoch training time on the 4090 is suspicious: it suggests the model was either memory-bandwidth-bound or hitting some other resource limit. With more memory and stable training conditions, ViT-B/16's 7.17% number could move.
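To put the epoch time in perspective, it can be converted into a throughput figure. This sketch assumes the standard Oxford-IIIT Pet trainval split of 3,680 images was used for training, and that the ~450 s covers one pass over it. Neither detail is stated on this page, so treat the result as an order-of-magnitude estimate.

```python
# Convert the reported epoch time into images per second.
TRAIN_IMAGES = 3680    # assumption: standard Oxford-IIIT Pet trainval split
EPOCH_SECONDS = 450    # approximate ViT-B/16 epoch time reported above

def images_per_second(num_images, seconds):
    """Average training throughput over one epoch."""
    return num_images / seconds

throughput = images_per_second(TRAIN_IMAGES, EPOCH_SECONDS)
print(f"approximate throughput: {throughput:.1f} images/s")
```

A single-digit images-per-second rate fits the reading that something other than raw compute was the bottleneck.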
But probably not by enough. The original ViT paper itself acknowledges that ViT needs large-scale pretraining (ImageNet-21K or JFT-300M) to be competitive, and even DeiT (the ViT variant explicitly designed for smaller datasets) relies on heavy augmentation and a distillation recipe that the base ViT lacks. The catastrophic-failure conclusion is robust; the exact failure number isn't.
4. Five epochs is short
Deliberately so: see methodology.md. But it does mean that some architectures might have continued improving with longer training. RegNet in particular looked like it was still learning at epoch 5. The conclusions in this study are about early training behavior and convergence stability, not about asymptotic performance with unlimited compute.
5. One dataset
Oxford-IIIT Pet is one fine-grained classification benchmark. Conclusions about small-data behavior would be stronger with multiple datasets: Stanford Cars, CUB-200 Birds, and FGVC Aircraft would all be natural extensions. The findings here are evidence, not proof.
6. No formal hyperparameter sweep
Each model was trained with timm-default optimizer settings rather than per-model tuned hyperparameters. This is the right choice for a controlled comparison (varying one thing at a time) but it means none of the absolute accuracy numbers are claimed to be peak performance. A targeted sweep on Swin-B might push it past 96.35%; a sweep on EfficientNet-B7 might soften the inverse-scaling result.
But the relative comparison is what matters here, and the relative comparison was apples-to-apples.
What I'd do differently with more resources
- Multi-GPU with 48GB+ per device. Proper batch sizes, ViT-L training, more epochs.
- Per-model hyperparameter optimization. Find each architecture's real ceiling on this dataset.
- Three more fine-grained datasets. Cars, Birds, Aircraft, to check whether the small-data findings generalize.
- Mixed-precision and gradient accumulation. Get effective batch sizes up without needing more cards.
- Track energy consumption, not just throughput. Real deployment cares about watts.
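The gradient-accumulation item in the list above can be illustrated without any framework. This is a minimal sketch on a toy one-parameter regression, not the training loop used in this study. All function names are hypothetical. It shows that averaging the gradients of equal-sized micro-batches reproduces the full-batch gradient, and how many accumulation steps a batch of 16 would need to match the Swin paper's 1024.

```python
import math

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch):
    """Accumulate gradients over equal-sized micro-batches.

    Each micro-batch gradient is scaled by 1/steps, which mirrors dividing
    the loss by the number of accumulation steps in a real training loop.
    """
    assert len(xs) % micro_batch == 0, "micro-batches must be equal-sized"
    steps = len(xs) // micro_batch
    total = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch, (i + 1) * micro_batch
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / steps
    return total

def accumulation_steps(target_batch, micro_batch):
    """Number of micro-batches needed to reach a target effective batch size."""
    return math.ceil(target_batch / micro_batch)

# Toy data: 64 points on the line y = 3x + 1.
xs = [float(i) for i in range(1, 65)]
ys = [3.0 * x + 1.0 for x in xs]

full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch=16)
steps_for_1024 = accumulation_steps(1024, 16)

print(math.isclose(full, accum, rel_tol=1e-9))
print(steps_for_1024)
```

One caveat for the CNN families: BatchNorm statistics are still computed per micro-batch, so accumulation matches a larger batch only approximately for EfficientNet and RegNet.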
The honest summary
The study answers a practitioner's question (given limited compute and a small dataset, which architecture should I pick?) with a clean answer: a Swin variant. The conclusions are robust enough to act on. They are not, and don't claim to be, the final word on any of these architectures' performance on small fine-grained data. Treating them as evidence rather than proof is the correct reading.