Completed: May to Dec 2025

Swin Transformer: Empirical Evaluation on Small Fine-Grained Data

A controlled four-family architecture comparison (Swin-T/S/B, RegNetY CNNs, EfficientNet B3-B7, ViT-B/16) on the Oxford-IIIT Pet Dataset under single-GPU RTX 4090 constraints. Three findings: Swin's hierarchical attention transfers cleanly to small datasets (93.8% to 96.35% accuracy), EfficientNet's compound scaling breaks down (B3 beats B7 by 8.66 percentage points), and ViT fails catastrophically (7.17%, barely above the 2.7% random baseline).

PyTorch, timm, Swin-T/S/B, RegNetY, EfficientNet B3-B7, ViT-B/16, Oxford-IIIT Pet, RTX 4090
Languages: Jupyter Notebook 98%, Python 2%

A comprehensive empirical study of Swin Transformer architectures for fine-grained image classification, conducted as part of CS 634: Deep Learning (Fall 2025) at NJIT. This isn't a Swin reimplementation. It's a deployment-realism question: do the architectural advantages demonstrated on ImageNet-1K and COCO actually translate to the resource-constrained, small-dataset scenarios that real-world projects face?

The study systematically compares four architecture families on the Oxford-IIIT Pet Dataset (37 classes, ~7,400 images), all trained on a single RTX 4090.

The four-family comparison

| Family | Models tested | Best result |
|---|---|---|
| Swin Transformers | Swin-T, Swin-S, Swin-B (224²), Swin-B (384²) | 96.35% (Swin-B 224²) |
| RegNetY CNNs | RegNetY-4G, 8G, 16G | 85.25% (RegNetY-16G) |
| EfficientNet | B3, B4, B5, B6, B7 | 80.58% (B3) |
| Vision Transformers | ViT-B/16 | 7.17% (catastrophic) |

Three findings worth the writeup

1. Swin's hierarchical attention scales down cleanly. On 7,400 images, roughly three orders of magnitude fewer than ImageNet-22K, Swin variants still achieved 93.8% to 96.35% accuracy. The shifted-window mechanism and hierarchical patch merging transfer to small datasets through fine-tuning of ImageNet-pretrained weights, without training from scratch. The original paper's headline result was at scale; this study shows the architecture works without the scale.

2. EfficientNet's compound scaling breaks on small fine-grained data. EfficientNet-B3 (11M parameters) outperformed EfficientNet-B7 (64M parameters) by 8.66 percentage points. This inverse-scaling result runs against the compound-scaling rule (depth, width, and resolution grown together from a NAS-derived baseline) that the family was designed around. The scaling principles tuned on ImageNet-1K did not carry over to this small, specialized domain. Higher input resolutions (600² for B7) also hurt rather than helped.

3. ViT fails catastrophically. ViT-B/16 achieved 7.17% accuracy, barely above the 2.7% random baseline for a 37-class problem (1/37). The original paper reported roughly 84% on ImageNet after large-scale pretraining. The difference is the data scale: ViT's quadratic global attention plus near-zero inductive bias requires ImageNet-21K-scale pretraining to work. On 7,400 images, the inductive bias deficit is fatal. ViT-L/16 couldn't be evaluated at all because it ran out of memory on the 24 GB GPU.
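
The arithmetic behind these three findings can be sketched in a few lines. This is a back-of-the-envelope illustration, not code from the study: the function names are hypothetical, the 4-pixel patch size and 7-token window are the standard Swin settings from Liu et al. 2021, and the coefficients 1.2, 1.1, and 1.15 are the compound-scaling constants reported by Tan & Le 2019. Counting query-key pairs ignores the class token and the projection layers, so the numbers are best read as orders of magnitude.

```python
# Back-of-the-envelope arithmetic for the three findings (illustrative only;
# all function names here are hypothetical, not taken from the study's notebooks).


def chance_accuracy(num_classes: int) -> float:
    """Accuracy of uniform random guessing on a balanced classification problem."""
    return 1.0 / num_classes


def global_attention_pairs(image_size: int, patch_size: int) -> int:
    """Query-key pairs per head when every patch token attends to every token.

    The class token is ignored to keep the count simple.
    """
    tokens = (image_size // patch_size) ** 2
    return tokens * tokens


def windowed_attention_pairs(image_size: int, patch_size: int, window: int) -> int:
    """Query-key pairs per head when attention stays inside non-overlapping windows."""
    side = image_size // patch_size          # tokens along one image side
    num_windows = (side // window) ** 2      # windows tiling the token grid
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


def compound_scaling(phi: int, alpha: float = 1.2, beta: float = 1.1, gamma: float = 1.15):
    """Depth, width, and resolution multipliers for compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi


if __name__ == "__main__":
    print(f"chance accuracy, 37 classes: {chance_accuracy(37):.2%}")
    print("ViT-B/16 global pairs at 224x224:", global_attention_pairs(224, 16))   # 38416
    print("global pairs on a 56x56 token grid:", global_attention_pairs(224, 4))  # 9834496
    print("Swin stage-1 windowed pairs:", windowed_attention_pairs(224, 4, 7))    # 153664
    depth, width, resolution = compound_scaling(1)
    # FLOPs grow roughly as depth * width^2 * resolution^2, close to 2x per unit of phi.
    print(f"FLOPs multiplier per phi step: {depth * width**2 * resolution**2:.2f}")
```

On the same 56×56 token grid, windowed attention touches 64 times fewer pairs than global attention, and every pair is spatially local. That locality is one plausible reading of why Swin holds up on little data. The 7.17% ViT result is about 2.65 times the chance rate.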

At a glance

| | |
|---|---|
| Task | Fine-grained image classification |
| Dataset | Oxford-IIIT Pet Dataset (37 classes, ~7,400 images) |
| Architectures compared | 4 families, 12 model variants (Swin-B run at both 224² and 384²) |
| Hardware | NVIDIA RTX 4090, 24 GB |
| Framework | PyTorch + timm |
| Training | 5 epochs, AdamW, CrossEntropyLoss, ImageNet-pretrained weights |
| Split | 80% train / 20% validation |
| Course | CS 634: Deep Learning, NJIT, Fall 2025 |
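
The setup summarized here can be sketched as a small configuration plus a standard fine-tuning loop. This is a minimal sketch, not the study's notebook code: `TrainConfig`, `split_indices`, `build_training_objects`, and `train_one_epoch` are hypothetical names, and the learning rate, seed, and default batch size are assumptions, since this README does not state them. `swin_tiny_patch4_window7_224` is just one of the timm identifiers that could be plugged in. The PyTorch and timm imports sit inside the builder so the configuration and split helper can be inspected without a GPU environment.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    model_name: str = "swin_tiny_patch4_window7_224"  # timm identifier (example choice)
    num_classes: int = 37       # Oxford-IIIT Pet breeds
    epochs: int = 5
    batch_size: int = 32        # assumption: the study reports 16-64 depending on family
    lr: float = 1e-4            # assumption: learning rate is not stated in the README
    train_frac: float = 0.8     # 80% train / 20% validation
    seed: int = 0               # assumption: any fixed seed makes the split reproducible


def split_indices(n: int, train_frac: float, seed: int):
    """Shuffle 0..n-1 deterministically and cut into train/validation index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(n * train_frac)
    return indices[:cut], indices[cut:]


def build_training_objects(cfg: TrainConfig):
    """Create an ImageNet-pretrained model with a fresh 37-way head, AdamW, and the loss."""
    import timm   # local imports: only needed once training actually starts
    import torch

    model = timm.create_model(cfg.model_name, pretrained=True, num_classes=cfg.num_classes)
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.lr)
    criterion = torch.nn.CrossEntropyLoss()
    return model, optimizer, criterion


def train_one_epoch(model, loader, optimizer, criterion, device="cuda"):
    """One pass over the training loader with plain forward/backward/step updates."""
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

With the dataset's 7,349 images, `split_indices(7349, 0.8, 0)` yields 5,879 training and 1,470 validation indices, which can be wrapped with `torch.utils.data.Subset` before building the data loaders.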

What this study honestly is, and isn't

It isn't a reproduction of the original Swin paper's headline numbers. The original used ImageNet-1K and ImageNet-22K pretraining, trained for 300 epochs on multi-GPU clusters, and applied Swin as a backbone for detection and segmentation. This study uses a single consumer GPU, 5 epochs, transfer learning from ImageNet, and a small fine-grained dataset.

What it is: a systematic answer to the question a practitioner actually asks. If I have a small dataset and one GPU, which architecture should I pick? The answer turns out to be more interesting than the question.

The hardware constraints are documented openly, not hidden. Batch sizes were forced down to 16-64 across all families to prevent OOM crashes. ViT-L/16 wouldn't fit at all. Some of the performance gaps are partly attributable to suboptimal training conditions rather than to architecture alone. That distinction matters and is called out in challenges.md.

Where to go from here

  • methodology.md: dataset, splits, optimizer, batch sizes per family, hardware
  • results.md: full architecture comparison, accuracy-per-FLOP, training dynamics
  • swin-deep-dive.md: shifted-window attention, hierarchical patch merging, why it transfers to small data
  • baseline-analysis.md: the EfficientNet inverse-scaling result, RegNet consistency, ViT failure
  • challenges.md: RTX 4090 constraints, ViT-L/16 OOM, batch-size compromises
  • references.md: Liu et al. 2021 (Swin), Dosovitskiy et al. 2021 (ViT), Tan & Le 2019 (EfficientNet)