Swin Transformer: Empirical Evaluation on Small Fine-Grained Data
A controlled four-family architecture comparison (Swin T/S/B, RegNetY CNNs, EfficientNet B3-B7, ViT-B/16) on the Oxford-IIIT Pet Dataset under RTX 4090 constraints. Three findings: Swin's hierarchical attention transfers cleanly to small datasets (93.8-96.35%), EfficientNet's compound scaling breaks (B3 beats B7 by 8.66 points), and ViT fails catastrophically (7.17%, barely above the 2.7% random baseline).
Baseline Analysis: RegNet, EfficientNet, ViT
The three non-Swin families each produced a different kind of result. RegNet was consistent. EfficientNet broke its own scaling laws. ViT failed entirely. This page explains all three.
RegNetY: solid, capped
RegNetY-4G through RegNetY-16G all came in between 82.6% and 85.25%. Smooth, predictable, no surprises. RegNetY-8G achieved the highest throughput in the entire study at 285 img/s, substantially faster than any Swin variant and roughly 15× faster than ViT-B/16 (18.2 img/s).
What worked: convolutional inductive biases (locality, translation equivariance) plus group convolutions plus a clean ImageNet-pretrained starting point. CNNs are great at transfer learning to small datasets. RegNet shows that.
What didn't: all three variants overfit. Training accuracies reached 94-96% while validation accuracies stayed at 82.6-85.25%, a gap of roughly ten points. The model is memorizing rather than generalizing, which is typical CNN behavior on a small fine-grained dataset without strong regularization or more epochs.
The takeaway: if 85% accuracy is acceptable and you need to run at 285 img/s on a single GPU, RegNetY-8G is hard to beat. But there's a hard ceiling. The convolutional inductive bias that helps with transfer also caps representational capacity for fine-grained discrimination. Swin-T at 94.93% and 113 img/s wins on accuracy; RegNetY-8G at 84.10% and 285 img/s wins on throughput. Pick your tradeoff.
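The tradeoff above can be written down as a simple selection rule. This is a minimal sketch, not part of the original experiments: the `Result` type and `pick_model` helper are hypothetical names, and the table only contains the three accuracy/throughput pairs quoted on this page.

```python
from typing import List, NamedTuple, Optional


class Result(NamedTuple):
    name: str
    val_acc: float     # validation accuracy, percent
    throughput: float  # images per second


# Figures quoted in this write-up; other variants are omitted.
RESULTS = [
    Result("Swin-T", 94.93, 113.0),
    Result("RegNetY-8G", 84.10, 285.0),
    Result("ViT-B/16", 7.17, 18.2),
]


def pick_model(results: List[Result],
               min_acc: float = 0.0,
               min_throughput: float = 0.0) -> Optional[Result]:
    """Return the most accurate model that meets both floors, or None."""
    eligible = [r for r in results
                if r.val_acc >= min_acc and r.throughput >= min_throughput]
    return max(eligible, key=lambda r: r.val_acc) if eligible else None


# A throughput floor of 200 img/s rules out Swin-T and leaves RegNetY-8G.
print(pick_model(RESULTS, min_throughput=200).name)
# An accuracy floor of 90% leaves only Swin-T.
print(pick_model(RESULTS, min_acc=90).name)
```

Filtering on hard floors first and then maximizing accuracy is one reasonable way to frame the choice; a weighted score would work too, but it hides the constraint that usually drives the decision.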
EfficientNet: the inverse scaling result
This is the surprise of the study.
| Model | Params | Image size | Val accuracy |
|---|---|---|---|
| EfficientNet-B3 | 11M | 300² | 80.58% |
| EfficientNet-B4 | 18M | 380² | 75.60% |
| EfficientNet-B5 | 28M | 456² | 75.20% |
| EfficientNet-B6 | 41M | 528² | 64.70% |
| EfficientNet-B7 | 64M | 600² | 71.92% |
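B3 with 11M parameters beats B7 with 64M parameters by 8.66 points. Higher input resolutions (a core part of EfficientNet's compound scaling) actively hurt accuracy here: every larger variant scores below B3, with B6 the low point at 64.70% and B7 recovering only partway to 71.92%. That is the opposite of what compound scaling is designed to deliver.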
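The numbers in the table can be checked directly. The sketch below recomputes the B3-to-B7 gap and the correlation between input resolution and validation accuracy; the helper names are illustrative, and with only five data points the correlation is a rough indicator, not a statistical result.

```python
# (model, params in millions, input side in px, val accuracy %) from the table above
ROWS = [
    ("B3", 11, 300, 80.58),
    ("B4", 18, 380, 75.60),
    ("B5", 28, 456, 75.20),
    ("B6", 41, 528, 64.70),
    ("B7", 64, 600, 71.92),
]


def gap(rows, a, b):
    """Accuracy of model `a` minus accuracy of model `b`, in points."""
    acc = {name: val for name, _, _, val in rows}
    return round(acc[a] - acc[b], 2)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


resolutions = [side for _, _, side, _ in ROWS]
accuracies = [val for _, _, _, val in ROWS]

print(gap(ROWS, "B3", "B7"))             # 8.66
print(pearson(resolutions, accuracies))  # strongly negative
```

The trend is negative but not monotonic, since B7 scores above B6.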
Why this happens
EfficientNet's B0 baseline was found by neural architecture search on ImageNet-1K, and the compound-scaling rule (jointly scale depth, width, and resolution under a single compound coefficient) was tuned on that same benchmark. The rule is the output of an optimization process run on a specific dataset distribution. It's not a law of nature.
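For readers who haven't seen the rule written out, here is a minimal sketch of compound scaling. The constants are the ones reported in the EfficientNet paper for the B0 baseline, not values measured in this study, and the released B1-B7 configurations use hand-rounded settings, so this illustrates the rule rather than reproducing the exact model specs.

```python
# Base multipliers reported in the EfficientNet paper (grid-searched on B0).
# Assumption: quoted from the paper, not derived from this study.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scale(phi: float):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi: float) -> float:
    """Approximate FLOPs growth: linear in depth, quadratic in width and resolution."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{flops_multiplier(phi):.2f}")
```

Each step in phi roughly doubles compute. Nothing in the rule looks at how much training data is available, which is the gap this study runs into.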
On ImageNet (1.2M images across 1000 classes), bigger models with higher resolutions have more data to learn from, and the scaling law holds. On Oxford-IIIT Pet (roughly 7,400 images across 37 classes), bigger models overfit faster, and higher resolutions amplify the problem by giving the model more pixel-level noise to memorize.
This is a general lesson, not a quirk of one architecture: scaling laws derived from large-scale benchmarks don't necessarily transfer to small-dataset, fine-grained domains. NAS-optimized architectures inherit the assumptions of their optimization target.
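The data-scale contrast is easier to see per class. A quick back-of-the-envelope calculation, using the rounded dataset sizes quoted above and the B3/B7 input sizes from the table:

```python
def images_per_class(n_images: int, n_classes: int) -> float:
    """Average number of images available per class."""
    return n_images / n_classes


imagenet = images_per_class(1_200_000, 1000)  # 1200.0
pets = images_per_class(7_400, 37)            # 200.0

# Pixels per image at the B7 input size relative to the B3 input size.
pixel_ratio = (600 * 600) / (300 * 300)       # 4.0

print(f"ImageNet-1K: {imagenet:.0f} images/class")
print(f"Oxford-IIIT Pet: {pets:.0f} images/class ({imagenet / pets:.0f}x fewer)")
print(f"B7 sees {pixel_ratio:.0f}x the pixels per image that B3 does")
```

So B7 brings about six times the parameters and four times the pixels per image to a task with one sixth of the per-class data it was tuned for.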
Training dynamics confirmed it
- B3: fast convergence, stable validation loss, good generalization
- B4 / B5: peak accuracy by epoch 3, then validation loss rises while training accuracy keeps climbing (classic overfitting)
- B6: never properly converged in 5 epochs, very high initial training loss
- B7: signs of overfitting and underfitting simultaneously; the model is too big for the data yet still fails to learn the signal that is there
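The B4/B5 pattern in the list above (validation loss turning upward while training accuracy keeps rising) is easy to flag automatically. A minimal sketch follows; `overfit_onset` is a hypothetical helper, and the two histories are made-up illustrative curves, not logs from these runs.

```python
from typing import Optional, Sequence


def overfit_onset(train_acc: Sequence[float],
                  val_loss: Sequence[float]) -> Optional[int]:
    """Return the 1-based epoch with the lowest validation loss, provided
    training accuracy kept rising in every later epoch; otherwise None."""
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    if best == len(val_loss) - 1:
        return None  # validation loss was still improving at the end
    later = range(best + 1, len(val_loss))
    still_fitting = all(train_acc[i] >= train_acc[i - 1] for i in later)
    return best + 1 if still_fitting else None


# Illustrative 5-epoch histories (invented for this example).
b4_like = {"train_acc": [70, 85, 91, 95, 97],
           "val_loss": [0.95, 0.80, 0.74, 0.81, 0.90]}
b3_like = {"train_acc": [72, 84, 89, 92, 93],
           "val_loss": [0.90, 0.70, 0.62, 0.58, 0.57]}

print(overfit_onset(**b4_like))  # 3
print(overfit_onset(**b3_like))  # None
```

With only 5 epochs per run, a check like this is mostly useful for deciding which checkpoint to keep, not for stopping training early.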
The practical guidance
If you reach for EfficientNet on a small dataset, use B3. Don't assume bigger is better just because the paper said so.
Vision Transformer: catastrophic failure
ViT-B/16: 7.17% validation accuracy. Random baseline for 37 classes is 2.7%. The original ViT paper reported 84% on ImageNet. The 77-point gap is the story.
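For reference, the arithmetic behind those figures, assuming the baseline is a uniform random guess over the 37 breeds:

```python
def chance_accuracy(n_classes: int) -> float:
    """Expected accuracy (percent) of a uniform random guess."""
    return 100.0 / n_classes


chance = chance_accuracy(37)
vit_acc = 7.17          # ViT-B/16 validation accuracy in this study
paper_imagenet = 84.0   # ImageNet figure quoted above

print(round(chance, 2))                    # 2.7
print(round(vit_acc / chance, 2))          # 2.65
print(round(paper_imagenet - vit_acc, 2))  # 76.83
```

ViT-B/16 ends up at about 2.65 times chance, so it learned something, just very little.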
What went wrong
Three things compounded:
1. The inductive-bias deficit. ViT has essentially no spatial assumptions baked in: no locality, no translation equivariance, no hierarchy. Everything is learned from data. With ImageNet-21k pretraining followed by ImageNet-1K fine-tuning (as in the original paper), there's enough data to learn those biases. On 7,400 pet images, there isn't.
2. The quadratic complexity tax. ViT-B/16 needed ~450 seconds per training epoch on the RTX 4090, about 18× slower than RegNet, and processed only 18.2 images per second. Even if I'd doubled the epochs, the model most likely still wouldn't have learned.
3. ViT-L/16 wouldn't even train. Out of memory on a 24GB GPU. The next-larger ViT variant simply doesn't fit.
What the failure tells you
ViT is a great architecture when you have the data scale it was designed for. On small fine-grained datasets, it's the wrong tool, and not by a little. This is the cleanest negative result in the study, and it's worth reporting because it documents a real-world failure mode that researchers often gloss over.
Swin solves this exact problem (see swin-deep-dive.md): same transformer foundations, but with windowed attention and a hierarchy that re-introduce the inductive biases ViT shed. Swin-B achieves 96.35% on the same data where ViT-B/16 hits 7.17%. The architectural choices matter.
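To make the windowed-attention point concrete, the sketch below compares the attention cost of global self-attention with window-based self-attention, using the complexity expressions given in the Swin Transformer paper. The shapes are assumptions for illustration (a 224×224 input, 4×4 patches, channel width 96, window size 7, i.e. the first stage of Swin-T as described in that paper); they are not measurements from this study, and FLOP counts do not translate directly into wall-clock time.

```python
def msa_flops(h: int, w: int, c: int) -> int:
    """Global multi-head self-attention: 4*hw*C^2 + 2*(hw)^2*C."""
    n = h * w
    return 4 * n * c ** 2 + 2 * n ** 2 * c


def wmsa_flops(h: int, w: int, c: int, m: int) -> int:
    """Window attention with M x M windows: 4*hw*C^2 + 2*M^2*hw*C."""
    n = h * w
    return 4 * n * c ** 2 + 2 * m ** 2 * n * c


# Assumed first-stage shape: 224 / 4 = 56 tokens per side, C = 96, M = 7.
H = W = 56
C, M = 96, 7

global_cost = msa_flops(H, W, C)
window_cost = wmsa_flops(H, W, C, M)
print(f"global attention: {global_cost / 1e9:.2f} GFLOPs")
print(f"window attention: {window_cost / 1e9:.2f} GFLOPs")
print(f"ratio:            {global_cost / window_cost:.1f}x")
```

The second term is what matters: it grows with the square of the token count for global attention but only linearly for window attention, which is why Swin can afford a high-resolution first stage.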
The thread tying these together
RegNet succeeds within a ceiling. EfficientNet's scaling laws break. ViT collapses. Swin avoids all three failure modes: it has enough inductive bias to learn from small data, it doesn't over-scale into overfitting, and it doesn't pay ViT's quadratic complexity tax. That's the empirical case for hierarchical vision transformers as a default choice for resource-constrained vision work.