rpmjp/portfolio
rpmjp/projects/swin-transformer-study/methodology.md
CompletedMay to Dec 2025

Swin Transformer: Empirical Evaluation on Small Fine-Grained Data

A controlled four-family architecture comparison (Swin T/S/B, RegNetY CNNs, EfficientNet B3-B7, ViT-B/16) on the Oxford-IIIT Pet Dataset under RTX 4090 constraints. Three findings: Swin's hierarchical attention transfers cleanly to small datasets (93.8-96.35%), EfficientNet's compound scaling breaks (B3 beats B7 by 8.66 points), and ViT catastrophically fails (7.17%: barely above the 2.7% random baseline).

PyTorchtimmSwin-T/S/BRegNetYEfficientNet B3-B7ViT-B/16Oxford-IIIT PetRTX 4090
Languages
Jupyter Notebook98%
Python2%
methodology.md

Methodology

The goal was a controlled comparison: same dataset, same split, same epoch count, same loss, same hardware. The differences that remained: batch size, input resolution, optimizer: were forced by each model's memory profile, not by experimental drift.

Dataset

Oxford-IIIT Pet Dataset: 37 classes (25 dog breeds + 12 cat breeds), approximately 7,400 images, ~200 per class. Standard fine-grained classification benchmark with real-world challenges: intra-class variability in poses and colors, scale differences across samples, background clutter, and lighting variance.

Why this dataset: small enough to surface small-data scaling problems, structured enough to expect well-pretrained backbones to do well, fine-grained enough that the gap between architectures matters.

Split and preprocessing

  • Train/validation split: 80% / 20%
  • Image loading: custom PyTorch Dataset class, label extraction from filename convention
  • Augmentation pipeline: Resize → ToTensor → Normalize
  • Normalization: ImageNet statistics: mean [0.485, 0.456, 0.406], std [0.229, 0.224, 0.225]
  • Pretraining: all models loaded with ImageNet pretrained weights via timm. Only the final classifier head was replaced to match the 37 classes.

Training configuration

SettingValue
Epochs5
LossCrossEntropyLoss
OptimizerAdamW
Learning rate1e-3
Weight decayDefault per timm settings
Workers0-4 (per family)

Each training epoch was followed by a validation pass. FLOPs were measured with fvcore.

Why 5 epochs

Deliberately constrained, for two reasons. First, it surfaces the convergence behavior of each architecture: does it learn fast or does it need a long warmup? Swin models converged by epoch 2-3 with stable validation; ViT had barely moved off random by epoch 5. That's the signal. Second, it kept the comparison feasible on a single RTX 4090: longer training across 12 model variants would have run for weeks.

The tradeoff: some architectures might still have been improving at epoch 5. The conclusions are about convergence behavior and small-data transfer, not asymptotic performance. The challenges.md page treats this honestly.

Hardware

  • GPU: NVIDIA RTX 4090 (24GB)
  • System: AMD Ryzen 9 7950X3D, Ubuntu kernel 6.17.x, CUDA-enabled PyTorch
  • Constraint: single consumer card, no multi-GPU

The hardware shaped the experiment. Batch sizes had to drop to 16-64 across all families to avoid OOM crashes. ViT-L/16 was attempted and immediately ran out of memory, it doesn't appear in results because it never trained. The CNN families were originally designed around batch sizes of 128-256, which we couldn't sustain. Some of the conclusions about training stability and convergence are partly conditioned on this constraint and are flagged where it matters.

Per-family configuration

FamilyInput sizeBatch sizeNotes
Swin-T224²64timm default
Swin-S224²64timm default
Swin-B (224²)224²64best peak accuracy
Swin-B (384²)384²reducedearly overfitting at high res
RegNetY-4G224²64timm default
RegNetY-8G224²64timm default
RegNetY-16G224²64timm default
EfficientNet-B3300²16native input size
EfficientNet-B4380²16native input size
EfficientNet-B5456²16native input size
EfficientNet-B6528²16native input size
EfficientNet-B7600²16native input size
ViT-B/16384²64ImageNet-21K → 1K weights
ViT-L/16384²:OOM, did not train

Native input sizes were used for EfficientNet because they're integral to the compound-scaling design: running B7 at 224² would have defeated the comparison.