Swin Transformer: Empirical Evaluation on Small Fine-Grained Data
A controlled four-family architecture comparison (Swin-T/S/B, RegNetY CNNs, EfficientNet-B3 to B7, ViT-B/16) on the Oxford-IIIT Pet Dataset under RTX 4090 constraints. Three findings: Swin's hierarchical attention transfers cleanly to small datasets (93.80-96.35%), EfficientNet's compound scaling breaks (B3 beats B7 by 8.66 points), and ViT fails catastrophically (7.17%, barely above the 2.7% random baseline).
Results
This section reports the full architecture comparison after 5 epochs of transfer learning from ImageNet-pretrained weights on the Oxford-IIIT Pet Dataset.
Master comparison table
| Model | Image size | Params | FLOPs | Throughput (img/s) | Peak val accuracy | Notes |
|---|---|---|---|---|---|---|
| Swin-B (224²) | 224² | 86.8M | 15.47G | 61.75 | 96.35% | Best overall accuracy |
| Swin-S | 224² | 48.9M | 8.77G | 61.69 | 95.40% | Excellent balance |
| Swin-T | 224² | 27.5M | 4.51G | 112.92 | 94.93% | Best efficiency-accuracy balance |
| Swin-B (384²) | 384² | 86.9M | 47.19G | 54.5 | 93.80% | High res, early overfitting |
| RegNetY-16G | 224² | 10M | 1.7G | 281.2 | 85.25% | Best CNN result |
| RegNetY-8G | 224² | 6M | 0.8G | 285.0 | 84.10% | Highest throughput overall |
| RegNetY-4G | 224² | 4M | 0.4G | 269.1 | 82.61% | Efficient CNN baseline |
| EfficientNet-B3 | 300² | 11M | 1.9G | 142.7 | 80.58% | Best EfficientNet: beats B7 |
| EfficientNet-B4 | 380² | 18M | 4.5G | 127.3 | 75.60% | Early peak, then plateau |
| EfficientNet-B5 | 456² | 28M | 10.5G | 108.4 | 75.20% | Diminishing returns |
| EfficientNet-B7 | 600² | 64M | 38.3G | 64.9 | 71.92% | Overfitting symptoms |
| EfficientNet-B6 | 528² | 41M | 19.4G | 93.9 | 64.70% | Poor convergence |
| ViT-B/16 | 384² | 86M | 49.4G | 18.2 | 7.17% | Catastrophic failure |
| ViT-L/16 | 384² | 300M+ | >100G | n/a | OOM | Did not train |
Random baseline for 37 classes is 2.7%.
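The 2.7% figure is the expected accuracy of uniform random guessing over 37 classes. A minimal sketch of that arithmetic follows, assuming roughly balanced classes; the function name `chance_accuracy` is illustrative and not part of the study's code.

```python
NUM_CLASSES = 37  # class count stated in the text for Oxford-IIIT Pet


def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy (%) of uniform random guessing, assuming balanced classes."""
    return 100.0 / num_classes


baseline = chance_accuracy(NUM_CLASSES)
vit_val = 7.17  # ViT-B/16 peak validation accuracy from the table

print(f"chance level: {baseline:.2f}%")                   # chance level: 2.70%
print(f"ViT-B/16 vs chance: {vit_val / baseline:.2f}x")   # ViT-B/16 vs chance: 2.65x
```

So ViT-B/16 lands at under three times chance, while every other trained model in the table is above 60%.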
Performance hierarchy by family
Swin Transformers: 93.80% to 96.35%. Tight band, fast convergence by epoch 2-3, train/val accuracy gaps of only 3-5 points. Hierarchical attention and shifted windows produce reliable transfer learning from ImageNet to small fine-grained domains.
RegNet CNNs: 82.6% to 85.25%. Consistent and efficient. The throughput champion of the study (RegNetY-8G at 285 img/s). However, all three variants showed significant overfitting: training accuracies of 94-96% against validation accuracies in the 81-85% band. Solid baseline, hard ceiling.
EfficientNet B3-B7: 64.7% to 80.58%. Inverse scaling. The smallest variant (B3) was the best; the largest (B7) was nearly the worst. Higher input resolutions made things worse, not better. The compound-scaling principle that defines EfficientNet doesn't hold here. See baseline-analysis.md.
Vision Transformers: 7.17% (B/16) and OOM (L/16). Failed. ViT-B/16 took ~450 seconds per epoch on the RTX 4090, roughly 18× slower than RegNet, and never escaped near-random performance. ViT-L/16 ran out of memory immediately.
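One way to put a number on "inverse scaling" is to count, within each family, the pairs of models where the larger one scores lower. The sketch below uses the peak validation accuracies from the comparison table; the helper `inverted_pairs` and the choice of this metric are illustrative assumptions, not something the study itself reports.

```python
from itertools import combinations

# Peak validation accuracy (%) per family, ordered smallest to largest model,
# copied from the comparison table. Swin-B (384²) is left out so the Swin list
# varies model size only, not input resolution.
FAMILIES = {
    "Swin (224²)": [94.93, 95.40, 96.35],                  # T, S, B
    "RegNetY": [82.61, 84.10, 85.25],                      # 4G, 8G, 16G
    "EfficientNet": [80.58, 75.60, 75.20, 64.70, 71.92],   # B3, B4, B5, B6, B7
}


def inverted_pairs(accs):
    """Count (smaller, larger) model pairs where the larger model scores lower."""
    pairs = list(combinations(accs, 2))  # keeps the small -> large ordering
    inverted = sum(1 for small, large in pairs if large < small)
    return inverted, len(pairs)


for family, accs in FAMILIES.items():
    inverted, total = inverted_pairs(accs)
    print(f"{family:13s} {inverted}/{total} pairs inverted")
# Swin (224²)   0/3 pairs inverted
# RegNetY       0/3 pairs inverted
# EfficientNet  9/10 pairs inverted
```

The only EfficientNet pair that follows the expected direction is B6 to B7.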
Accuracy per GFLOP
A useful single-number efficiency view: validation accuracy divided by training-pass GFLOPs.
| Model | Accuracy / GFLOP (percentage points per GFLOP) |
|---|---|
| RegNetY-4G | 206.5 (note: tiny denominator) |
| EfficientNet-B3 | 42.41 |
| Swin-T | 21.06 |
| Swin-S | 10.88 |
| Swin-B (224²) | 6.23 |
| ViT-B/16 | 0.15 |
Swin-T is the clear winner among the models that reach 90%+ validation accuracy: about 2× the accuracy per GFLOP of Swin-S, about 3.4× that of Swin-B (224²), and roughly two orders of magnitude above ViT-B/16. RegNetY-4G and EfficientNet-B3 post higher raw ratios, but those are largely artifacts of very small FLOP counts; both architectures top out at a much lower accuracy (82.61% and 80.58%).
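The ratio is simple enough to recompute from the comparison table. The sketch below does so and then ranks the models twice, once on the raw ratio and once restricted to models above an accuracy floor. The 90% floor and the helper names are assumptions for illustration. Because the inputs are the rounded table values, the last digit of a recomputed ratio can differ from the published one.

```python
# (model, GFLOPs, peak validation accuracy in %), copied from the tables above.
MODELS = [
    ("Swin-T", 4.51, 94.93),
    ("RegNetY-4G", 0.4, 82.61),
    ("Swin-S", 8.77, 95.40),
    ("Swin-B (224²)", 15.47, 96.35),
    ("EfficientNet-B3", 1.9, 80.58),
    ("ViT-B/16", 49.4, 7.17),
]

ACCURACY_FLOOR = 90.0  # illustrative threshold, not a value from the study


def accuracy_per_gflop(acc_pct: float, gflops: float) -> float:
    """Validation accuracy (percentage points) per GFLOP."""
    return acc_pct / gflops


def rank(models, floor=0.0):
    """Sort models at or above `floor` accuracy by accuracy per GFLOP, best first."""
    eligible = [m for m in models if m[2] >= floor]
    return sorted(eligible, key=lambda m: accuracy_per_gflop(m[2], m[1]), reverse=True)


raw_ranking = rank(MODELS)
floored_ranking = rank(MODELS, ACCURACY_FLOOR)

for name, gflops, acc in raw_ranking:
    print(f"{name:16s} {accuracy_per_gflop(acc, gflops):7.2f} points/GFLOP")

print("best on raw ratio:", raw_ranking[0][0])          # RegNetY-4G
print("best above the floor:", floored_ranking[0][0])   # Swin-T
```

Applying the floor before ranking keeps very small models from dominating the list on the strength of a small denominator alone.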
Training dynamics
Swin: fast convergence, stable, minimal overfitting. Validation accuracy improved monotonically through epoch 5 in all variants except Swin-B (384²), which peaked early and then overfit at high resolution.
RegNet: converges fast but overfits hard. Strong training accuracy (94-96%) but a 10-15 point gap to validation. Stronger regularization or fewer epochs might close some of that gap.
EfficientNet: wildly variable across the family. B3 converges fast and generalizes well. B4 and B5 peak by epoch 3; after that, validation loss rises while training accuracy keeps climbing. B6 and B7 struggle to converge: high initial training loss, slow learning, signs of underfitting and overfitting at the same time.
ViT: never learned. Training accuracy peaked at 8.03% (against 7.17% validation), meaning the model didn't even fit the training set. The inductive-bias deficit is real, not a hyperparameter problem.
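The patterns described above (early peak, train/val gap, failure to fit) can be read off per-epoch accuracy curves with a few lines of code. The sketch below is one hypothetical way to do it. The helper `summarize_run` is not from the study, and the `toy_*` curves are invented placeholders rather than measured logs. Only the ViT-B/16 pair (8.03% train, 7.17% validation) comes from the text.

```python
def summarize_run(train_acc, val_acc):
    """Summarize one run from per-epoch accuracies (percent, epoch 1 first)."""
    if len(train_acc) != len(val_acc) or not val_acc:
        raise ValueError("need equal-length, non-empty accuracy lists")
    peak_idx = max(range(len(val_acc)), key=lambda i: val_acc[i])
    return {
        "peak_epoch": peak_idx + 1,                               # 1-based
        "peak_val": val_acc[peak_idx],
        "final_gap": round(train_acc[-1] - val_acc[-1], 2),       # train minus val
        "peaked_early": peak_idx < len(val_acc) - 1,
    }


# Invented placeholder curves for illustration only; NOT measurements from this study.
toy_train = [70.0, 85.0, 92.0, 95.0, 97.0]
toy_val = [72.0, 78.0, 80.0, 79.0, 78.5]

summary = summarize_run(toy_train, toy_val)
print(summary)
# {'peak_epoch': 3, 'peak_val': 80.0, 'final_gap': 18.5, 'peaked_early': True}

# The one train/val pair the text reports directly: ViT-B/16 at its peak.
vit_gap = round(8.03 - 7.17, 2)
print(f"ViT-B/16 train/val gap: {vit_gap} points")  # 0.86 points
```

A small gap is only good news when both accuracies are high. For ViT-B/16 the gap is under one point because the model fit neither split.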
Practical deployment guidance from the results
| Use case | Recommendation |
|---|---|
| Maximum accuracy needed (>95%) | Swin-B 224² (96.35%) |
| Balanced accuracy/efficiency (90-95%) | Swin-T (94.93% at 112.92 img/s) |
| Maximum throughput, accuracy compromise OK | RegNetY-8G (285 img/s, 84.10%) |
| Avoid on small fine-grained data | All ViT variants, EfficientNet B6/B7, high-resolution training |
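The guidance table amounts to a constrained choice over the measured accuracy and throughput figures. The sketch below encodes that as a small selector. The function `pick_model`, its parameters, and the example thresholds (100 img/s, 80%) are illustrative assumptions; the model figures are copied from the comparison table.

```python
# (model, peak validation accuracy %, throughput img/s) from the comparison table.
# ViT-L/16 is omitted because it did not train (OOM).
CANDIDATES = [
    ("Swin-B (224²)", 96.35, 61.75),
    ("Swin-S", 95.40, 61.69),
    ("Swin-T", 94.93, 112.92),
    ("Swin-B (384²)", 93.80, 54.5),
    ("RegNetY-16G", 85.25, 281.2),
    ("RegNetY-8G", 84.10, 285.0),
    ("RegNetY-4G", 82.61, 269.1),
    ("EfficientNet-B3", 80.58, 142.7),
    ("ViT-B/16", 7.17, 18.2),
]


def pick_model(min_accuracy=0.0, min_throughput=0.0, prefer="accuracy"):
    """Return the name of the best model meeting both floors, or None.

    prefer="accuracy" maximizes validation accuracy;
    prefer="throughput" maximizes images per second.
    """
    key_index = {"accuracy": 1, "throughput": 2}[prefer]
    eligible = [
        c for c in CANDIDATES
        if c[1] >= min_accuracy and c[2] >= min_throughput
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c[key_index])[0]


print(pick_model(min_accuracy=95.0))                         # Swin-B (224²)
print(pick_model(min_accuracy=90.0, min_throughput=100.0))   # Swin-T
print(pick_model(min_accuracy=80.0, prefer="throughput"))    # RegNetY-8G
```

The three calls reproduce the first three rows of the guidance table, which suggests the recommendations follow directly from the measured numbers rather than from a separate judgment.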