Results

This section reports the full architecture comparison after 5 epochs of transfer learning from ImageNet-pretrained weights on the Oxford-IIIT Pet Dataset.

Master comparison table

Model	Image size	Params	FLOPs	Throughput (img/s)	Peak val accuracy	Notes
Swin-B (224²)	224²	86.8M	15.47G	61.75	96.35%	Best overall accuracy
Swin-S	224²	48.9M	8.77G	61.69	95.40%	Excellent balance
Swin-T	224²	27.5M	4.51G	112.92	94.93%	Best efficiency-accuracy balance
Swin-B (384²)	384²	86.9M	47.19G	54.5	93.80%	High res, early overfitting
RegNetY-16G	224²	10M	1.7G	281.2	85.25%	Best CNN result
RegNetY-8G	224²	6M	0.8G	285.0	84.10%	Highest throughput overall
RegNetY-4G	224²	4M	0.4G	269.1	82.61%	Efficient CNN baseline
EfficientNet-B3	300²	11M	1.9G	142.7	80.58%	Best EfficientNet: beats B7
EfficientNet-B4	380²	18M	4.5G	127.3	75.60%	Early peak, then plateau
EfficientNet-B5	456²	28M	10.5G	108.4	75.20%	Diminishing returns
EfficientNet-B7	600²	64M	38.3G	64.9	71.92%	Overfitting symptoms
EfficientNet-B6	528²	41M	19.4G	93.9	64.70%	Poor convergence
ViT-B/16	384²	86M	49.4G	18.2	7.17%	Catastrophic failure
ViT-L/16	384²	300M+	>100G	n/a	OOM	Did not train

Random baseline for 37 classes is 2.7%.
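
The chance level follows from uniform guessing over the 37 breeds. A minimal sketch of that arithmetic, assuming roughly balanced classes; the helper name `lift_over_chance` is illustrative and not part of the study's code:

```python
NUM_CLASSES = 37

# Expected accuracy (in percent) of guessing uniformly at random.
chance = 100.0 / NUM_CLASSES

def lift_over_chance(accuracy_pct):
    """Return how many times better than random guessing an accuracy is."""
    return accuracy_pct / chance

print(f"chance level: {chance:.1f}%")
print(f"ViT-B/16 lift over chance: {lift_over_chance(7.17):.2f}x")
print(f"Swin-B (224²) lift over chance: {lift_over_chance(96.35):.2f}x")
```

Expressing results as a multiple of chance makes the ViT-B/16 failure easier to read: its 7.17% is above random guessing, but only by a small factor.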

Performance hierarchy by family

Swin Transformers: 93.8% to 96.35%. Tight band, fast convergence by epoch 2-3, train/val accuracy gaps of only 3-5 percentage points. In this study, hierarchical attention and shifted windows transferred reliably from ImageNet to a small fine-grained domain.

RegNet CNNs: 82.6% to 85.25%. Consistent and efficient. The throughput champion of the study (RegNetY-8G at 285 img/s). However, all three variants showed significant overfitting: training accuracies of 94-96% against validation accuracies in the 81-85% band. Solid baseline, hard ceiling.

EfficientNet B3-B7: 64.7% to 80.58%. Inverse scaling. The smallest variant (B3) was the best; the largest (B7) was second worst, ahead only of B6. Higher input resolutions made things worse, not better. The compound-scaling principle that defines EfficientNet doesn't hold here. See baseline-analysis.md.
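
One way to put a number on the inverse scaling is a rank correlation between input resolution and peak validation accuracy for B3-B7, using the values from the master comparison table. The sketch below hand-rolls Spearman's rho for the no-ties case; the function names are illustrative, and this is not an analysis the study itself reports.

```python
# (variant, input side in pixels, peak val accuracy %) from the master comparison table.
EFFICIENTNETS = [
    ("B3", 300, 80.58),
    ("B4", 380, 75.60),
    ("B5", 456, 75.20),
    ("B6", 528, 64.70),
    ("B7", 600, 71.92),
]

def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman rank correlation for equal-length sequences without ties."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

sizes = [size for _, size, _ in EFFICIENTNETS]
accs = [acc for _, _, acc in EFFICIENTNETS]
rho = spearman_rho(sizes, accs)
print(f"Spearman rho (resolution vs. accuracy): {rho:.2f}")
```

With the five table rows the correlation comes out strongly negative (about -0.9). Five points is a very small sample, so this is a summary of the table rather than a statistical test.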

Vision Transformers: 7.17% (B/16) and OOM (L/16). Failed. ViT-B/16 trained for ~450 seconds per epoch on the RTX 4090, roughly 15× slower than the RegNet models by measured throughput (18.2 vs. 269-285 img/s), and never escaped near-random performance. ViT-L/16 ran out of memory immediately.

Accuracy per GFLOP

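A useful single-number efficiency view: peak validation accuracy divided by the model's GFLOPs (the FLOPs column of the master comparison table). Values are percentage points of accuracy per GFLOP.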

Model	Accuracy per GFLOP (% / GFLOP)
Swin-T	21.06
RegNetY-4G	206.5 (note: tiny denominator)
Swin-S	10.88
Swin-B (224²)	6.23
EfficientNet-B3	42.41
ViT-B/16	0.15

Among the models that clear 90% validation accuracy, Swin-T is the clear winner on accuracy per compute: roughly 2× better than Swin-S, more than 3× better than Swin-B (224²), and about two orders of magnitude better than ViT-B/16. RegNetY-4G and EfficientNet-B3 post higher raw ratios, but that is partly an artifact of how small their FLOP counts are; their accuracy ceilings (82.61% and 80.58%) are much lower.
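
The ratio can be recomputed from the master comparison table. The following is a minimal sketch, not the study's own script: the `min_accuracy` filter and the 90% cutoff are assumptions added here to separate "efficient and accurate" from "efficient because the denominator is tiny". Because the table values are rounded, a recomputed ratio may differ from the listed one in the last digit.

```python
# (model, peak val accuracy %, GFLOPs) copied from the master comparison table.
ROWS = [
    ("Swin-T", 94.93, 4.51),
    ("RegNetY-4G", 82.61, 0.4),
    ("Swin-S", 95.40, 8.77),
    ("Swin-B (224²)", 96.35, 15.47),
    ("EfficientNet-B3", 80.58, 1.9),
    ("ViT-B/16", 7.17, 49.4),
]

def accuracy_per_gflop(accuracy_pct, gflops):
    """Percentage points of validation accuracy per GFLOP."""
    return accuracy_pct / gflops

def rank_by_efficiency(rows, min_accuracy=0.0):
    """Sort models by accuracy per GFLOP, keeping only those above min_accuracy."""
    kept = [
        (name, accuracy_per_gflop(acc, gflops))
        for name, acc, gflops in rows
        if acc >= min_accuracy
    ]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Unfiltered ranking is dominated by small-FLOP models.
for name, ratio in rank_by_efficiency(ROWS):
    print(f"{name:18s} {ratio:8.2f}")

# Hypothetical 90% accuracy floor: only the Swin variants remain.
print(rank_by_efficiency(ROWS, min_accuracy=90.0)[0][0])
```

Filtering before ranking is a design choice: a raw ratio rewards any model with a small FLOP count, regardless of whether its accuracy is usable.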

Training dynamics

Swin: fast convergence, stable, minimal overfitting. Validation accuracy improved monotonically through epoch 5 in all variants except Swin-B (384²), which peaked early and then overfit at high resolution.

RegNet: converges fast but overfits hard. Strong training accuracy (94-96%) but a 10-15 point gap to validation. Stronger regularization or fewer epochs might narrow that gap.

EfficientNet: wildly variable across the family. B3 converges fast and generalizes well. B4 and B5 peak by epoch 3; after that, validation loss rises while training accuracy keeps climbing. B6 and B7 struggle to converge: high initial training loss, slow learning, and signs of underfitting and overfitting at the same time.

ViT: never learned. Training accuracy peaked at 8.03% (against 7.17% validation), meaning the model didn't even fit the training set. The inductive-bias deficit is real, not a hyperparameter problem.
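
The per-family descriptions above boil down to two checks: did the model fit the training set at all, and how far does validation trail training? A minimal sketch of that triage follows. The thresholds (`fit_floor`, `gap_threshold`) are hypothetical values chosen for illustration, not ones used in the study, and the "healthy" example uses made-up accuracies that merely fall within the 3-5 point gap reported for Swin.

```python
def diagnose(train_acc, val_acc, fit_floor=50.0, gap_threshold=8.0):
    """Classify a run from its training and validation accuracy (both in percent).

    fit_floor and gap_threshold are illustrative assumptions, not study settings.
    """
    if train_acc < fit_floor:
        # The model has not even learned the training set.
        return "failed to fit"
    if train_acc - val_acc > gap_threshold:
        # Training accuracy is far ahead of validation accuracy.
        return "overfitting"
    return "healthy"

# ViT-B/16: training 8.03%, validation 7.17% (from the text above).
print(diagnose(8.03, 7.17))
# RegNet: the two ends of the reported ranges (94-96% train, 81-85% val).
print(diagnose(94.0, 81.0))
print(diagnose(96.0, 85.0))
# Hypothetical run with a 4-point gap, in line with the Swin description.
print(diagnose(99.0, 95.0))
```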

Practical deployment guidance from the results

Use case	Recommendation
Maximum accuracy needed (>95%)	Swin-B 224² (96.35%)
Balanced accuracy/efficiency (90-95%)	Swin-T (94.93% at 112.92 img/s)
Maximum throughput, accuracy compromise OK	RegNetY-8G (285 img/s, 84.10%)
Avoid on small fine-grained data	All ViT variants, EfficientNet B6/B7, high-resolution training
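
The recommendations can be cross-checked by computing the Pareto front of throughput against peak validation accuracy from the master comparison table: a model is on the front if no other model is at least as good on both axes and strictly better on one. This is a sketch added for illustration (the function name `pareto_front` is not from the study), and ViT-L/16 is omitted because it has no measurements.

```python
# (model, throughput img/s, peak val accuracy %) from the master comparison table.
RESULTS = [
    ("Swin-B (224²)", 61.75, 96.35),
    ("Swin-S", 61.69, 95.40),
    ("Swin-T", 112.92, 94.93),
    ("Swin-B (384²)", 54.5, 93.80),
    ("RegNetY-16G", 281.2, 85.25),
    ("RegNetY-8G", 285.0, 84.10),
    ("RegNetY-4G", 269.1, 82.61),
    ("EfficientNet-B3", 142.7, 80.58),
    ("EfficientNet-B4", 127.3, 75.60),
    ("EfficientNet-B5", 108.4, 75.20),
    ("EfficientNet-B7", 64.9, 71.92),
    ("EfficientNet-B6", 93.9, 64.70),
    ("ViT-B/16", 18.2, 7.17),
]

def pareto_front(rows):
    """Return names of models not dominated on (throughput, accuracy)."""
    front = []
    for name, thr, acc in rows:
        dominated = any(
            other_thr >= thr and other_acc >= acc
            and (other_thr > thr or other_acc > acc)
            for _, other_thr, other_acc in rows
        )
        if not dominated:
            front.append(name)
    return front

print(pareto_front(RESULTS))
```

On these numbers the front contains Swin-B (224²), Swin-T, RegNetY-16G, and RegNetY-8G, which covers the three recommended models. Every EfficientNet variant and ViT-B/16 is dominated by at least one other model.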