Yannic Kilcher explains why transformers are ruining convolutions.
This paper, under review at ICLR, shows that, given enough data, a standard Transformer can outperform Convolutional Neural Networks (CNNs) on image recognition tasks, a domain where CNNs classically excel. In this video, I explain the architecture of the Vision Transformer (ViT), the reason why it works better, and rant about why double-blind peer review is broken.
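The core trick covered in the video is that ViT treats an image as a sequence of flattened fixed-size patches, which a standard Transformer can then process like word tokens. A minimal NumPy sketch of that patching step (patch size and image dimensions here are illustrative, not the only configuration in the paper):

```python
# Sketch of the core ViT idea: cut an image into fixed-size patches and
# flatten each patch into a token vector for a standard Transformer.
import numpy as np

def image_to_patches(img: np.ndarray, patch: int = 16) -> np.ndarray:
    """Turn an (H, W, C) image into a (num_patches, patch*patch*C) token matrix."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly into patches"
    # Split height and width into a grid of patch-sized blocks...
    grid = img.reshape(h // patch, patch, w // patch, patch, c)
    # ...group each block's pixels together: (h/p, w/p, p, p, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    # Flatten every patch into one row: one "token" per patch.
    return grid.reshape(-1, patch * patch * c)

tokens = image_to_patches(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): 14x14 patches, each 16*16*3 values
```

In the actual model, each flattened patch is then linearly projected to the Transformer's embedding dimension and combined with a learned position embedding, as discussed in the architecture section of the video.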
OUTLINE:
- 0:00 – Introduction
- 0:30 – Double-Blind Review is Broken
- 5:20 – Overview
- 6:55 – Transformers for Images
- 10:40 – Vision Transformer Architecture
- 16:30 – Experimental Results
- 18:45 – What does the Model Learn?
- 21:00 – Why Transformers are Ruining Everything
- 27:45 – Inductive Biases in Transformers
- 29:05 – Conclusion & Comments
Related resources:
- Paper (Under Review): https://openreview.net/forum?id=YicbFdNTTy