Swin Transformer: Empirical Evaluation on Small Fine-Grained Data
A controlled four-family architecture comparison (Swin T/S/B, RegNetY CNNs, EfficientNet B3-B7, ViT-B/16) on the Oxford-IIIT Pet Dataset under RTX 4090 constraints. Three findings: Swin's hierarchical attention transfers cleanly to small datasets (93.8-96.35% accuracy), EfficientNet's compound scaling breaks down (B3 beats B7 by 8.66 points), and ViT fails catastrophically (7.17%, barely above the 2.7% random baseline for 37 classes).
References
Primary papers and tools that this study is built on.
Architecture papers
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
→ The foundational Swin paper. Introduces self-attention restricted to local windows, shifted windows that connect neighboring windows across consecutive blocks, and a four-stage hierarchical design built on patch merging. Reports 87.3% top-1 on ImageNet-1K (with ImageNet-22K pretraining), 58.7 box AP on COCO test-dev, and 53.5 mIoU on ADE20K.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR).
→ The original Vision Transformer paper. Demonstrates that pure transformers can match CNN performance on image classification, but only with very large pretraining datasets (JFT-300M, ImageNet-21K). Establishes the inductive-bias-versus-data-scale tradeoff that motivated Swin.
Tan, M., & Le, Q. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In International Conference on Machine Learning (ICML).
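→ Defines compound scaling: network depth, width, and input resolution are scaled jointly by a single compound coefficient, with fixed per-dimension constants found by a small grid search on the baseline network. The baseline (EfficientNet-B0) was obtained by neural architecture search, and the family was tuned on ImageNet-1K. The inverse-scaling result on Oxford-IIIT Pet in this study suggests that the compound-scaling rule is dataset-specific, not universal.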
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., & Dollár, P. (2020). Designing Network Design Spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
→ The RegNet paper. Introduces a parameterized design space approach to CNN architecture, producing a family of efficient models. RegNetY variants serve as the CNN baseline in this study.
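The compound-scaling rule from the EfficientNet entry above can be written down in a few lines. The sketch below uses the constants reported in the EfficientNet paper for the B0 baseline (alpha = 1.2, beta = 1.1, gamma = 1.15); the function name is hypothetical, and the released B1-B7 variants may not correspond exactly to integer values of the coefficient, so treat the output as illustrative rather than as the configuration used in this study.

```python
# Constants reported in the EfficientNet paper (Tan & Le, 2019) for the B0 baseline.
ALPHA = 1.2   # depth multiplier base
BETA = 1.1    # width multiplier base
GAMMA = 1.15  # input-resolution multiplier base


def compound_scale(phi):
    """Return (depth, width, resolution, flops) multipliers for coefficient phi.

    FLOPs grow linearly with depth and quadratically with width and
    resolution, so the FLOP multiplier is (ALPHA * BETA**2 * GAMMA**2) ** phi,
    which the paper constrains to be roughly 2 ** phi.
    """
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    flops = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return depth, width, resolution, flops


if __name__ == "__main__":
    for phi in range(0, 4):
        d, w, r, f = compound_scale(phi)
        print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
              f"resolution x{r:.2f}, FLOPs x{f:.2f}")
```

Each unit increase in the coefficient roughly doubles compute, which is why the larger variants are far more expensive to fine-tune than B3 on a small dataset.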
Dataset
Parkhi, O. M., Vedaldi, A., Zisserman, A., & Jawahar, C. V. (2012). Cats and Dogs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
→ The Oxford-IIIT Pet Dataset. 37 cat and dog breed categories with roughly 200 images each (~7,400 images in total). A standard fine-grained classification benchmark.
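The 2.7% random baseline quoted in the summary follows directly from the 37 categories. A minimal sketch of that arithmetic, assuming uniform guessing over roughly balanced classes (the function name is hypothetical):

```python
NUM_CLASSES = 37        # Oxford-IIIT Pet breed categories
VIT_ACCURACY = 7.17     # ViT-B/16 accuracy reported in this study, in percent


def chance_accuracy(num_classes):
    """Expected accuracy (in percent) of uniform random guessing."""
    return 100.0 / num_classes


if __name__ == "__main__":
    chance = chance_accuracy(NUM_CLASSES)
    print(f"Chance accuracy: {chance:.2f}%")
    print(f"ViT margin over chance: {VIT_ACCURACY - chance:.2f} points")
```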
Tools
Wightman, R. (2019). PyTorch Image Models (timm). https://github.com/rwightman/pytorch-image-models
→ The model zoo used for every architecture in this study. Provides ImageNet-pretrained weights, consistent input pipelines, and matching evaluation protocols across families, which is what made a controlled comparison possible.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., & Chintala, S. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (NeurIPS).
→ The deep learning framework used for training and evaluation.
fvcore. https://github.com/facebookresearch/fvcore
→ FLOP counting library used for throughput and FLOP measurements.
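The controlled setup described in the timm entry above, one model zoo with one classifier head size for every family, might look roughly like the sketch below. The checkpoint names are assumptions: they are real timm identifiers, but the source does not state which exact checkpoints, RegNetY sizes, or EfficientNet weights were used. `CANDIDATE_MODELS` and `build_model` are hypothetical names introduced for illustration.

```python
# Assumed timm checkpoint names, grouped by family. The study's exact
# identifiers are not given in the source, so these are placeholders.
CANDIDATE_MODELS = {
    "swin": [
        "swin_tiny_patch4_window7_224",
        "swin_small_patch4_window7_224",
        "swin_base_patch4_window7_224",
    ],
    "regnety": ["regnety_016"],
    "efficientnet": ["tf_efficientnet_b3", "tf_efficientnet_b7"],
    "vit": ["vit_base_patch16_224"],
}

NUM_PET_CLASSES = 37  # Oxford-IIIT Pet breed categories


def build_model(name, pretrained=True):
    """Create a timm model with its classifier head resized to 37 classes.

    timm is imported lazily so the model registry above can be inspected
    without timm (or PyTorch) installed.
    """
    import timm

    return timm.create_model(
        name, pretrained=pretrained, num_classes=NUM_PET_CLASSES
    )


if __name__ == "__main__":
    for family, names in CANDIDATE_MODELS.items():
        for name in names:
            # pretrained=False avoids downloading weights for a quick check.
            model = build_model(name, pretrained=False)
            n_params = sum(p.numel() for p in model.parameters())
            print(f"{family:12s} {name:32s} {n_params / 1e6:.1f}M params")
```

Building every model through the same factory call keeps the head size and weight source consistent across families, so that differences in accuracy are easier to attribute to the architecture.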
Course context
CS 634: Deep Learning, Fall 2025. New Jersey Institute of Technology. Student: Robert Jean Pierre (UCID: RJ447).