References

The primary papers and tools this study builds on.

Architecture papers

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

→ The foundational Swin paper. Introduces self-attention restricted to local windows, with shifted windows providing cross-window connectivity between consecutive blocks, and the four-stage hierarchical patch-merging design. Reports 87.3% top-1 on ImageNet-1K (with ImageNet-22K pretraining), 58.7 box AP on COCO test-dev, and 53.5 mIoU on ADE20K.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR).

→ The original Vision Transformer paper. Demonstrates that pure transformers can match CNN performance on image classification, but only with very large pretraining datasets (JFT-300M, ImageNet-21K). Establishes the inductive-bias-versus-data-scale tradeoff that motivated Swin.

Tan, M., & Le, Q. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In International Conference on Machine Learning (ICML).

→ Defines compound scaling: network depth, width, and input resolution are scaled together by a single compound coefficient, with fixed per-dimension constants. The EfficientNet-B0 baseline comes from neural architecture search and the scaling constants from a small grid search, both on ImageNet-1K. The inverse-scaling result on Oxford-IIIT Pet (this study) is evidence that the compound-scaling rule is dataset-specific, not universal.
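
The compound-scaling rule can be written out as a short calculation. The sketch below is illustrative only: it uses the constants the EfficientNet paper reports for its B0 grid search (α = 1.2, β = 1.1, γ = 1.15), while the function names `compound_scale` and `flops_multiplier` are hypothetical and not part of any library. Released model variants may use rounded values, so the numbers need not match a specific checkpoint.

```python
# Minimal sketch of EfficientNet-style compound scaling (Tan & Le, 2019).
# ALPHA, BETA, GAMMA are the constants reported in the paper; the helper
# names below are hypothetical and exist only for this illustration.

ALPHA = 1.2   # depth base
BETA = 1.1    # width base
GAMMA = 1.15  # input-resolution base


def compound_scale(phi: float) -> dict:
    """Return depth, width, and resolution multipliers for coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_multiplier(phi: float) -> float:
    """Approximate FLOP growth: FLOPs scale with depth * width^2 * resolution^2."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


if __name__ == "__main__":
    for phi in range(4):
        m = compound_scale(phi)
        print(
            f"phi={phi}: depth x{m['depth']:.2f}, width x{m['width']:.2f}, "
            f"resolution x{m['resolution']:.2f}, FLOPs x{flops_multiplier(phi):.2f}"
        )
```

Because α·β²·γ² is constrained to be close to 2, each unit increase in the coefficient roughly doubles the FLOP budget. The constants themselves were tuned on ImageNet, which is the sense in which the rule can be dataset-specific.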

Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., & Dollár, P. (2020). Designing Network Design Spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

→ The RegNet paper. Introduces a parameterized design space approach to CNN architecture, producing a family of efficient models. RegNetY variants serve as the CNN baseline in this study.

Dataset

Parkhi, O. M., Vedaldi, A., Zisserman, A., & Jawahar, C. V. (2012). Cats and Dogs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

→ The Oxford-IIIT Pet Dataset. 37 cat and dog breed categories with roughly 200 images each (7,349 images in total). Standard fine-grained classification benchmark.

Tools

Wightman, R. (2019). PyTorch Image Models (timm). https://github.com/rwightman/pytorch-image-models

→ The model zoo used for every architecture in this study. Provides ImageNet-pretrained weights, consistent input pipelines, and matching evaluation protocols across families, which is what made a controlled comparison possible.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., & Chintala, S. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (NeurIPS).

→ The deep learning framework that timm, and therefore every model in this study, is built on.

fvcore. FLOP counting library used for throughput and FLOP measurements.

Course context

CS 634: Deep Learning, Fall 2025. New Jersey Institute of Technology. Student: Robert Jean Pierre (UCID: RJ447).
