Swin Transformer: Empirical Evaluation on Small Fine-Grained Data

A comprehensive empirical study of Swin Transformer architectures for fine-grained image classification, conducted as part of CS 634: Deep Learning (Fall 2025) at NJIT. This isn't a Swin reimplementation. It's a deployment-realism question: do the architectural advantages demonstrated on ImageNet-1K and COCO actually translate to the resource-constrained, small-dataset scenarios that real-world projects face?

The study systematically compares four architecture families on the Oxford-IIIT Pet Dataset (37 classes, ~7,400 images), all trained on a single RTX 4090.

The four-family comparison

Family	Models tested	Best result
Swin Transformers	Swin-T, Swin-S, Swin-B (224²), Swin-B (384²)	96.35% (Swin-B 224²)
RegNetY CNNs	RegNetY-4G, 8G, 16G	85.25% (RegNetY-16G)
EfficientNet	B3, B4, B5, B6, B7	80.58% (B3)
Vision Transformers	ViT-B/16	7.17% (catastrophic)

Three findings worth the writeup

1. Swin's hierarchical attention scales down cleanly. On 7,400 images, roughly three orders of magnitude fewer than the ~14 million in ImageNet-22K, Swin variants still reached 93.8% to 96.35% accuracy. The shifted-window mechanism and hierarchical patch merging transfer to a small dataset through fine-tuning alone, with no training from scratch. The original paper's headline results were obtained at scale; this study shows the pretrained architecture holds up when the downstream dataset is small.

2. EfficientNet's compound scaling breaks on small fine-grained data. EfficientNet-B3 (11M parameters) outperformed EfficientNet-B7 (64M parameters) by 8.66 percentage points, an inverse-scaling result that runs against the compound-scaling rule the family is built around. Scaling coefficients tuned on ImageNet-1K did not carry over to this small, specialized domain. Higher input resolutions (600² for B7) also hurt rather than helped.

3. ViT fails catastrophically. ViT-B/16 reached 7.17% accuracy, barely above the 2.7% chance level for a 37-class problem. The original paper reported about 84% top-1 on ImageNet for the same model after ImageNet-21K pretraining. The difference is data scale: ViT's quadratic global attention and weak built-in inductive bias depend on ImageNet-21K-scale pretraining to work. On 7,400 images, the inductive bias deficit is fatal. ViT-L/16 couldn't be evaluated at all because it ran out of memory on the 24GB GPU.
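
The contrast between global attention and Swin's windowed attention can be made concrete with the complexity formulas from the Swin paper (Liu et al. 2021): global attention costs 4hwC² + 2(hw)²C, while window attention costs 4hwC² + 2M²hwC. The sketch below is only an illustration, not code from this study. It plugs in the stage-1 shape of Swin-T at 224² input (56×56 tokens, C = 96, M = 7), and the function names are made up for this example.

```python
def global_msa_flops(h: int, w: int, c: int) -> int:
    """Global self-attention: 4hwC^2 + 2(hw)^2 C, quadratic in the token count."""
    tokens = h * w
    return 4 * tokens * c**2 + 2 * tokens**2 * c


def window_msa_flops(h: int, w: int, c: int, m: int = 7) -> int:
    """Window self-attention: 4hwC^2 + 2 M^2 hw C, linear in the token count."""
    tokens = h * w
    return 4 * tokens * c**2 + 2 * m**2 * tokens * c


if __name__ == "__main__":
    # Assumed shape: Swin-T stage 1 at 224x224 input with 4x4 patches.
    h = w = 224 // 4
    c = 96
    g = global_msa_flops(h, w, c)
    win = window_msa_flops(h, w, c)
    print(f"global: {g:,}  windowed: {win:,}  ratio: {g / win:.1f}x")
```

The gap widens with resolution, because only the global term grows with the square of the token count. This is a per-layer operation count, not a measurement of the models trained here.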

At a glance


Task	Fine-grained image classification
Dataset	Oxford-IIIT Pet Dataset (37 classes, ~7,400 images)
Architectures compared	4 families, 13 evaluated configurations (12 distinct models; Swin-B at two resolutions)
Hardware	NVIDIA RTX 4090, 24GB
Framework	PyTorch + timm
Training	5 epochs, AdamW, CrossEntropyLoss, ImageNet pretrained
Split	80% train / 20% validation
Course	CS 634: Deep Learning, NJIT, Fall 2025
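
The setup in the table can be sketched roughly as follows. This is a minimal illustration and not the study's actual training script. The helper names (`build_model`, `split_80_20`, `fine_tune`), the learning rate, the split seed, and the default batch size are all assumptions, since they do not appear in this README. `swin_tiny_patch4_window7_224` is timm's identifier for Swin-T.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset, random_split


def build_model(name: str = "swin_tiny_patch4_window7_224", num_classes: int = 37) -> nn.Module:
    """Load an ImageNet-pretrained backbone with a fresh 37-way classification head."""
    import timm  # imported lazily so the helpers below also work without timm
    return timm.create_model(name, pretrained=True, num_classes=num_classes)


def split_80_20(dataset: Dataset, seed: int = 0):
    """Seeded 80/20 train/validation split (the seed value is an assumption)."""
    n_train = int(0.8 * len(dataset))
    gen = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, len(dataset) - n_train], generator=gen)


def fine_tune(model, train_set, val_set, epochs=5, batch_size=32, lr=1e-4, device="cpu"):
    """Fine-tune with AdamW + CrossEntropyLoss; return validation accuracy per epoch.

    lr and batch_size are placeholders, not values taken from the study.
    """
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)
    history = []
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                preds = model(images).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
        history.append(correct / total)
    return history
```

With an image dataset that yields `(image_tensor, label)` pairs, usage would look like `train_set, val_set = split_80_20(dataset)` followed by `fine_tune(build_model(), train_set, val_set, device="cuda")`.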

What this study honestly is, and isn't

It isn't a reproduction of the original Swin paper's headline numbers. The original used ImageNet-1K and ImageNet-22K pretraining, trained for 300 epochs on multi-GPU clusters, and applied Swin as a backbone for detection and segmentation. This study uses a single consumer GPU, 5 epochs, transfer learning from ImageNet, and a small fine-grained dataset.

It is a systematic answer to the question a practitioner actually asks: if I have a small dataset and one GPU, which architecture should I pick? The answer turns out to be more interesting than the question.

The hardware constraints are documented openly, not hidden. Batch sizes were forced down to 16-64 across all families to prevent OOM crashes. ViT-L/16 wouldn't fit at all. Some of the performance gaps are partly attributable to suboptimal training conditions rather than to the architectures themselves; that distinction matters and is called out in challenges.md.
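
One way to arrive at a workable batch size under a fixed memory budget is to try candidates from largest to smallest and keep the first that survives a trial step. The helper below is a hypothetical sketch of that idea, not the procedure used in this study, and the batch sizes here may well have been chosen by hand. It assumes the out-of-memory failure surfaces as a `RuntimeError` whose message contains "out of memory", which is how PyTorch typically reports CUDA OOM.

```python
from typing import Callable, Sequence


def largest_fitting_batch_size(
    try_step: Callable[[int], None],
    candidates: Sequence[int] = (64, 32, 16),
) -> int:
    """Return the first candidate batch size whose trial step does not run out of memory.

    try_step is a caller-supplied function (hypothetical here) that runs one
    forward/backward pass at the given batch size and raises on failure.
    """
    for batch_size in candidates:
        try:
            try_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Only swallow out-of-memory failures; re-raise anything else.
            if "out of memory" not in str(err).lower():
                raise
    raise RuntimeError(f"no candidate batch size fit: {list(candidates)}")
```

In a real run, `try_step` would build one batch, run the model forward and backward, and the caller would likely release cached GPU memory (for example with `torch.cuda.empty_cache()`) between attempts.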

Where to go from here

methodology.md: dataset, splits, optimizer, batch sizes per family, hardware
results.md: full architecture comparison, accuracy-per-FLOP, training dynamics
swin-deep-dive.md: shifted-window attention, hierarchical patch merging, why it transfers to small data
baseline-analysis.md: the EfficientNet inverse-scaling result, RegNet consistency, ViT failure
challenges.md: RTX 4090 constraints, ViT-L/16 OOM, batch-size compromises
references.md: Liu et al. 2021 (Swin), Dosovitskiy et al. 2021 (ViT), Tan & Le 2019 (EfficientNet)