Ad

Yannic Kilcher explains why transformers are ruining convolutions.

This paper, under review at ICLR, shows that given enough data, a standard Transformer can outperform Convolutional Neural Networks in image recognition tasks, which are classically tasks where CNNs excel. In this Video, I explain the architecture of the Vision Transformer (ViT), the reason why it works better and rant about why double-bline peer review is broken.

OUTLINE:

  • 0:00 – Introduction
  • 0:30 – Double-Blind Review is Broken
  • 5:20 – Overview
  • 6:55 – Transformers for Images
  • 10:40 – Vision Transformer Architecture
  • 16:30 – Experimental Results
  • 18:45 – What does the Model Learn?
  • 21:00 – Why Transformers are Ruining Everything
  • 27:45 – Inductive Biases in Transformers
  • 29:05 – Conclusion & Comments

Related resources:

  • Paper (Under Review): https://openreview.net/forum?id=YicbFdNTTy
tt ads

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.