Methodology

The goal was a controlled comparison: same dataset, same split, same epoch count, same loss, same hardware. The differences that remained (batch size, input resolution, optimizer) were forced by each model's memory profile, not by experimental drift.

Dataset

Oxford-IIIT Pet Dataset: 37 classes (25 dog breeds + 12 cat breeds), approximately 7,400 images, ~200 per class. Standard fine-grained classification benchmark with real-world challenges: intra-class variability in poses and colors, scale differences across samples, background clutter, and lighting variance.

Why this dataset: small enough to surface small-data scaling problems, structured enough to expect well-pretrained backbones to do well, fine-grained enough that the gap between architectures matters.

Split and preprocessing

Train/validation split: 80% / 20%
Image loading: custom PyTorch Dataset class, label extraction from filename convention
Transform pipeline: Resize → ToTensor → Normalize
Normalization: ImageNet statistics, mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225]
Pretraining: all models loaded with ImageNet pretrained weights via timm. Only the final classifier head was replaced to match the 37 classes.
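
The label extraction and split described above can be sketched with the standard library alone. This is a minimal sketch, not the project's actual Dataset class: it assumes the dataset's usual `<breed>_<index>.jpg` filename layout, and the helper names, the fixed seed, and the non-stratified cut are illustrative assumptions that the write-up does not specify.

```python
import random
from pathlib import Path


def breed_from_filename(filename: str) -> str:
    """Return the breed part of a name such as 'american_bulldog_100.jpg'."""
    stem = Path(filename).stem  # e.g. 'american_bulldog_100'
    breed, _, _index = stem.rpartition("_")  # split on the LAST underscore
    if not breed:
        raise ValueError(f"unexpected filename: {filename!r}")
    return breed


def build_label_map(filenames):
    """Map each breed to an integer id, ordered alphabetically (case-insensitive)."""
    breeds = sorted({breed_from_filename(f) for f in filenames}, key=str.lower)
    return {breed: i for i, breed in enumerate(breeds)}


def split_80_20(filenames, seed=0):
    """Shuffle with a fixed seed and cut at 80%.

    The seed and the plain (non-stratified) cut are assumptions for this sketch.
    """
    shuffled = sorted(filenames)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 4) // 5
    return shuffled[:cut], shuffled[cut:]


# Hypothetical filenames in the dataset's naming style.
files = [
    "Abyssinian_1.jpg",
    "Abyssinian_2.jpg",
    "american_bulldog_100.jpg",
    "american_bulldog_101.jpg",
    "yorkshire_terrier_7.jpg",
]
label_map = build_label_map(files)
train_files, val_files = split_80_20(files)
```

Splitting on the last underscore matters because several breed names contain underscores themselves. A stratified split would keep the per-class ratio closer to 80/20 on a dataset with only about 200 images per class.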

Training configuration

Setting	Value
Epochs	5
Loss	CrossEntropyLoss
Optimizer	AdamW
Learning rate	1e-3
Weight decay	Default per timm settings
DataLoader workers	0-4 (varied per family)

Each training epoch was followed by a validation pass. FLOPs were measured with fvcore.
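
The train-then-validate structure in the table above might look like the following. It is a sketch under stated assumptions: the tiny linear model and random tensors are stand-ins so the loop runs without downloading weights or data, and `run_epoch` is a hypothetical helper name. In the setup described here the model would instead come from something like `timm.create_model(name, pretrained=True, num_classes=37)`.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def run_epoch(model, loader, criterion, optimizer=None, device="cpu"):
    """One pass over `loader`. Trains if an optimizer is given, else evaluates."""
    training = optimizer is not None
    model.train(training)
    total_loss, correct, seen = 0.0, 0, 0
    with torch.set_grad_enabled(training):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            logits = model(images)
            loss = criterion(logits, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * labels.size(0)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            seen += labels.size(0)
    return total_loss / seen, correct / seen


# Stand-in data and model (assumptions for illustration only).
torch.manual_seed(0)
NUM_CLASSES = 37
train_set = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, NUM_CLASSES, (64,)))
val_set = TensorDataset(torch.randn(32, 3, 8, 8), torch.randint(0, NUM_CLASSES, (32,)))
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = DataLoader(val_set, batch_size=16)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, NUM_CLASSES))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

history = []
for epoch in range(5):  # 5 epochs, each followed by a validation pass
    train_loss, train_acc = run_epoch(model, train_loader, criterion, optimizer)
    val_loss, val_acc = run_epoch(model, val_loader, criterion)
    history.append((train_loss, train_acc, val_loss, val_acc))
```

Recording validation metrics after every epoch is what makes per-epoch convergence comparisons possible at all with such a short schedule.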

Why 5 epochs

Deliberately constrained, for two reasons. First, it surfaces the convergence behavior of each architecture: does it learn fast or does it need a long warmup? Swin models converged by epoch 2-3 with stable validation; ViT had barely moved off random by epoch 5. That's the signal. Second, it kept the comparison feasible on a single RTX 4090: longer training across 12 model variants would have run for weeks.

The tradeoff: some architectures might still have been improving at epoch 5. The conclusions are about convergence behavior and small-data transfer, not asymptotic performance. The challenges.md page discusses this limitation openly.

Hardware

GPU: NVIDIA RTX 4090 (24GB)
System: AMD Ryzen 9 7950X3D, Ubuntu (Linux kernel 6.17.x), CUDA-enabled PyTorch
Constraint: single consumer card, no multi-GPU

The hardware shaped the experiment. Batch sizes had to drop to 16-64 across all families to avoid OOM crashes. ViT-L/16 was attempted and immediately ran out of memory; it doesn't appear in results because it never trained. The CNN families were originally designed around batch sizes of 128-256, which we couldn't sustain on a single 24GB card. Some of the conclusions about training stability and convergence are partly conditioned on this constraint and are flagged where it matters.
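
One way to arrive at a workable batch size under a fixed memory budget is to back off through candidates until a trial step succeeds. The write-up does not say how the 16-64 values were chosen, so the following is purely an illustrative sketch: `first_batch_size_that_fits`, `try_step`, and the fake memory limit are hypothetical. With PyTorch on CUDA, the exception to catch would typically be `torch.cuda.OutOfMemoryError`.

```python
def first_batch_size_that_fits(try_step, candidates=(256, 128, 64, 32, 16),
                               oom_error=MemoryError):
    """Return the first candidate for which `try_step(batch_size)` does not
    raise `oom_error`, or None if every candidate fails."""
    for batch_size in candidates:
        try:
            try_step(batch_size)  # e.g. one forward/backward pass at this size
        except oom_error:
            continue  # too large for the memory budget, try the next one down
        return batch_size
    return None


# Hypothetical stand-in for a real training step: pretend anything above 64
# samples per batch exceeds the memory budget.
def fake_step(batch_size, limit=64):
    if batch_size > limit:
        raise MemoryError(f"batch size {batch_size} does not fit")


chosen = first_batch_size_that_fits(fake_step)
nothing_fits = first_batch_size_that_fits(lambda b: fake_step(b, limit=8))
```

A `None` result corresponds to the ViT-L/16 situation: no candidate fits, so the model is excluded instead of being trained under a different regime.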

Per-family configuration

Family	Input size	Batch size	Notes
Swin-T	224²	64	timm default
Swin-S	224²	64	timm default
Swin-B (224²)	224²	64	best peak accuracy
Swin-B (384²)	384²	reduced	early overfitting at high res
RegNetY-4G	224²	64	timm default
RegNetY-8G	224²	64	timm default
RegNetY-16G	224²	64	timm default
EfficientNet-B3	300²	16	native input size
EfficientNet-B4	380²	16	native input size
EfficientNet-B5	456²	16	native input size
EfficientNet-B6	528²	16	native input size
EfficientNet-B7	600²	16	native input size
ViT-B/16	384²	64	ImageNet-21K → 1K weights
ViT-L/16	384²	n/a	OOM, did not train

Native input sizes were used for EfficientNet because they're integral to the compound-scaling design: running B7 at 224² would have defeated the comparison.
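
To make the memory pressure from native input sizes concrete, the arithmetic below compares pixel counts per image against a 224² baseline, using the input sizes from the table. Pixel count is only a rough proxy for activation memory, which is an assumption of this sketch; actual memory use also depends on architecture and batch size.

```python
BASELINE_SIDE = 224

# Input side lengths taken from the per-family configuration table.
INPUT_SIDES = {
    "EfficientNet-B3": 300,
    "EfficientNet-B4": 380,
    "EfficientNet-B5": 456,
    "EfficientNet-B6": 528,
    "EfficientNet-B7": 600,
    "Swin-B (384²)": 384,
    "ViT-B/16": 384,
}


def pixel_ratio(side, baseline=BASELINE_SIDE):
    """Pixels per image relative to a baseline x baseline input."""
    return (side * side) / (baseline * baseline)


ratios = {name: round(pixel_ratio(side), 2) for name, side in INPUT_SIDES.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x the pixels of a 224x224 input")
```

EfficientNet-B7 at 600² processes roughly seven times the pixels of a 224² input, which is consistent with its batch size dropping to 16 while the 224² models ran at 64.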

Swin Transformer: Empirical Evaluation on Small Fine-Grained Data