Completed, May to Dec 2025

Swin Transformer: Empirical Evaluation on Small Fine-Grained Data

A controlled four-family architecture comparison (Swin T/S/B, RegNetY CNNs, EfficientNet B3-B7, ViT-B/16) on the Oxford-IIIT Pet Dataset under RTX 4090 constraints. Three findings: Swin's hierarchical attention transfers cleanly to small datasets (93.8-96.35%), EfficientNet's compound scaling breaks down (B3 beats B7 by 8.66 points), and ViT fails catastrophically (7.17%, barely above the 2.7% random baseline).

Tags: PyTorch, timm, Swin-T/S/B, RegNetY, EfficientNet B3-B7, ViT-B/16, Oxford-IIIT Pet, RTX 4090

Results

The full architecture comparison after 5 epochs of transfer learning from ImageNet-pretrained weights on the Oxford-IIIT Pet Dataset.

Master comparison table

| Model | Image size | Params | FLOPs | Throughput (img/s) | Peak val accuracy | Notes |
|---|---|---|---|---|---|---|
| Swin-B (224²) | 224² | 86.8M | 15.47G | 61.75 | 96.35% | Best overall accuracy |
| Swin-S | 224² | 48.9M | 8.77G | 61.69 | 95.40% | Excellent balance |
| Swin-T | 224² | 27.5M | 4.51G | 112.92 | 94.93% | Best efficiency-accuracy balance |
| Swin-B (384²) | 384² | 86.9M | 47.19G | 54.5 | 93.80% | High res, early overfitting |
| RegNetY-16G | 224² | 10M | 1.7G | 281.2 | 85.25% | Best CNN result |
| RegNetY-8G | 224² | 6M | 0.8G | 285.0 | 84.10% | Highest throughput overall |
| RegNetY-4G | 224² | 4M | 0.4G | 269.1 | 82.61% | Efficient CNN baseline |
| EfficientNet-B3 | 300² | 11M | 1.9G | 142.7 | 80.58% | Best EfficientNet; beats B7 |
| EfficientNet-B4 | 380² | 18M | 4.5G | 127.3 | 75.60% | Early peak, then plateau |
| EfficientNet-B5 | 456² | 28M | 10.5G | 108.4 | 75.20% | Diminishing returns |
| EfficientNet-B7 | 600² | 64M | 38.3G | 64.9 | 71.92% | Overfitting symptoms |
| EfficientNet-B6 | 528² | 41M | 19.4G | 93.9 | 64.70% | Poor convergence |
| ViT-B/16 | 384² | 86M | 49.4G | 18.2 | 7.17% | Catastrophic failure |
| ViT-L/16 | 384² | 300M+ | >100G | n/a | OOM | Did not train |

Random baseline for 37 classes is 2.7%.
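
The baseline is simply uniform guessing over the 37 breeds, assuming roughly balanced classes. A minimal sketch of that arithmetic, which also expresses ViT-B/16's 7.17% as a multiple of chance (the variable names are illustrative and not taken from the project notebooks):

```python
# Chance-level accuracy for a 37-class problem under uniform guessing
# (assumes roughly balanced classes).
NUM_CLASSES = 37
chance_pct = 100.0 / NUM_CLASSES

# ViT-B/16 peak validation accuracy, copied from the master table.
vit_b16_pct = 7.17
times_chance = vit_b16_pct / chance_pct

print(f"chance: {chance_pct:.1f}%")
print(f"ViT-B/16: {times_chance:.2f}x chance")
```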

Performance hierarchy by family

Swin Transformers: 93.8% to 96.35%. Tight band, fast convergence by epoch 2-3, train/val accuracy gaps of only 3-5%. Hierarchical attention and shifted windows produce reliable transfer learning from ImageNet to small fine-grained domains.

RegNet CNNs: 82.6% to 85.25%. Consistent and efficient. The throughput champion of the study (RegNetY-8G at 285 img/s). However, all three variants showed significant overfitting: training accuracies of 94-96% against validation accuracies in the 81-85% band. Solid baseline, hard ceiling.

EfficientNet B3-B7: 64.7% to 80.58%. Inverse scaling. The smallest variant (B3) was the best; the largest (B7) was nearly the worst. Higher input resolutions made things worse, not better. The compound-scaling principle that defines EfficientNet doesn't hold here. See baseline-analysis.md.

Vision Transformers: 7.17% (B/16) and OOM (L/16). Failed. ViT-B/16 trained for ~450 seconds per epoch on the RTX 4090, roughly 18× slower than RegNet, and never escaped near-random performance. ViT-L/16 ran out of memory immediately.
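
The EfficientNet inverse-scaling pattern described above can be read directly off the master table. The sketch below sorts the variants by input resolution and reports each one's accuracy gap to the smallest variant (B3); the helper name `gaps_to_smallest` is hypothetical, and the numbers are copied from the table rather than recomputed from training logs.

```python
# (variant, input side in pixels, peak validation accuracy %) from the master table.
EFFICIENTNET = [
    ("B3", 300, 80.58),
    ("B4", 380, 75.60),
    ("B5", 456, 75.20),
    ("B6", 528, 64.70),
    ("B7", 600, 71.92),
]


def gaps_to_smallest(rows):
    """Map each larger variant to its accuracy minus the smallest-resolution variant's."""
    ordered = sorted(rows, key=lambda r: r[1])  # ascending input resolution
    base_acc = ordered[0][2]
    return {name: round(acc - base_acc, 2) for name, _, acc in ordered[1:]}


gaps = gaps_to_smallest(EFFICIENTNET)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} points vs B3")
```

Every larger variant comes out negative, which is the opposite of what compound scaling would predict if it held on this dataset.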

Accuracy per GFLOP

A useful single-number efficiency view: peak validation accuracy (in percent) divided by training-pass GFLOPs (the FLOPs column of the master table).

| Model | Accuracy / GFLOP (% per GFLOP) |
|---|---|
| Swin-T | 21.06 |
| RegNetY-4G | 206.5 (note: tiny denominator) |
| Swin-S | 10.88 |
| Swin-B (224²) | 6.23 |
| EfficientNet-B3 | 42.41 |
| ViT-B/16 | 0.15 |

Among the models that clear 90% validation accuracy, Swin-T is the clear winner on accuracy per compute: roughly 2× Swin-S, more than 3× Swin-B (224²), and over 100× ViT-B/16. The RegNetY-4G and EfficientNet-B3 ratios are higher, but they are partly an artifact of how small those FLOP counts are; both architectures have a much lower accuracy ceiling (82.61% and 80.58%).
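
As a rough sketch, the ratio can be recomputed for every trained model from the master table's rounded figures. The 90% accuracy floor used to filter the ranking is an illustrative assumption, and the last digit may differ slightly from the table above because the inputs here are already rounded.

```python
# (model, GFLOPs, peak validation accuracy %) copied from the master table.
MODELS = [
    ("Swin-B (224)", 15.47, 96.35),
    ("Swin-S", 8.77, 95.40),
    ("Swin-T", 4.51, 94.93),
    ("Swin-B (384)", 47.19, 93.80),
    ("RegNetY-16G", 1.7, 85.25),
    ("RegNetY-8G", 0.8, 84.10),
    ("RegNetY-4G", 0.4, 82.61),
    ("EfficientNet-B3", 1.9, 80.58),
    ("EfficientNet-B4", 4.5, 75.60),
    ("EfficientNet-B5", 10.5, 75.20),
    ("EfficientNet-B7", 38.3, 71.92),
    ("EfficientNet-B6", 19.4, 64.70),
    ("ViT-B/16", 49.4, 7.17),
]

# Accuracy (percent) per GFLOP for each model.
ratios = {name: acc / gflops for name, gflops, acc in MODELS}
accuracy = {name: acc for name, _, acc in MODELS}

# Rank all models by the raw ratio, highest first.
ranked = sorted(ratios, key=ratios.get, reverse=True)
for name in ranked:
    print(f"{name:16s} {ratios[name]:7.2f} %/GFLOP  (acc {accuracy[name]:.2f}%)")

# The raw ranking favours tiny models, so also pick the best ratio among
# models above an accuracy floor (90% is an illustrative threshold).
ACC_FLOOR = 90.0
best_above_floor = max(
    (name for name in ranked if accuracy[name] >= ACC_FLOOR),
    key=ratios.get,
)
print("best above floor:", best_above_floor)
```

Ranking by the raw ratio alone puts the small CNNs on top, which is why the accuracy floor is applied before naming a winner.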

Training dynamics

Swin: fast convergence, stable, minimal overfitting. Validation accuracy improved monotonically through epoch 5 in all variants except Swin-B (384²), which peaked early and then overfit at high resolution.

RegNet: converges fast but overfits hard. Strong training accuracy (94-96%) but a 10-15 point gap to validation. More regularization or fewer epochs might narrow that gap.

EfficientNet: wildly variable across the family. B3 converges fast and generalizes well. B4 and B5 peak by epoch 3, after which validation loss rises while training accuracy keeps climbing. B6 and B7 struggle to converge: high initial training loss, slow learning, and signs of underfitting and overfitting at the same time.

ViT: never learned. Training accuracy peaked at 8.03% (against 7.17% validation), meaning the model didn't even fit the training set. The inductive-bias deficit is real, not a hyperparameter problem.

Practical deployment guidance from the results

| Use case | Recommendation |
|---|---|
| Maximum accuracy needed (>95%) | Swin-B 224² (96.35%) |
| Balanced accuracy/efficiency (90-95%) | Swin-T (94.93% at 112.92 img/s) |
| Maximum throughput, accuracy compromise OK | RegNetY-8G (285 img/s, 84.10%) |
| Avoid on small fine-grained data | All ViT variants, EfficientNet B6/B7, high-resolution training |
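
The first three rows of this guidance follow from one simple rule applied to the master table: among models that meet an accuracy floor, take the one with the highest throughput. The sketch below is one way to encode that rule; `Result` and `recommend` are hypothetical names, and the rule is an interpretation of the table, not code from the study.

```python
from typing import NamedTuple, Optional


class Result(NamedTuple):
    name: str
    throughput: float  # images per second
    accuracy: float    # peak validation accuracy, percent


# Throughput and accuracy copied from the master table (trained models only).
RESULTS = [
    Result("Swin-B (224)", 61.75, 96.35),
    Result("Swin-S", 61.69, 95.40),
    Result("Swin-T", 112.92, 94.93),
    Result("Swin-B (384)", 54.5, 93.80),
    Result("RegNetY-16G", 281.2, 85.25),
    Result("RegNetY-8G", 285.0, 84.10),
    Result("RegNetY-4G", 269.1, 82.61),
    Result("EfficientNet-B3", 142.7, 80.58),
    Result("EfficientNet-B4", 127.3, 75.60),
    Result("EfficientNet-B5", 108.4, 75.20),
    Result("EfficientNet-B7", 64.9, 71.92),
    Result("EfficientNet-B6", 93.9, 64.70),
    Result("ViT-B/16", 18.2, 7.17),
]


def recommend(min_accuracy: float = 0.0) -> Optional[Result]:
    """Return the fastest model whose accuracy meets the floor, or None if none does."""
    eligible = [r for r in RESULTS if r.accuracy >= min_accuracy]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r.throughput)


for floor in (95.0, 90.0, 0.0):
    print(f"floor {floor:>4.1f}% ->", recommend(floor))
```

With floors of 95%, 90%, and none, the rule selects Swin-B (224²), Swin-T, and RegNetY-8G respectively, matching the table.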