Yannic Kilcher explains why transformers are ruining convolutions.

This paper, under review at ICLR, shows that given enough data, a standard Transformer can outperform Convolutional Neural Networks (CNNs) in image recognition, a domain where CNNs have classically excelled. In this video, I explain the architecture of the Vision Transformer (ViT) and the reason why it works better, and I rant about why double-blind peer review is broken.
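To give a feel for how an image can be handed to a standard Transformer at all, here is a minimal sketch of the patch-splitting idea behind ViT: the image is cut into fixed-size, non-overlapping patches, and each flattened patch is treated as one input token. The function name `split_into_patches`, the toy 4x4 image, and the patch size of 2 are illustrative assumptions, not taken from the paper's code, and the learned linear projection and position embeddings that follow this step are omitted.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (a list of rows) into flattened, non-overlapping patches.

    Returns the patches in row-major order. Each patch is a flat list of
    patch_size * patch_size pixel values, i.e. one "token" for the Transformer.
    (Hypothetical helper for illustration only.)
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 single-channel "image" with pixel values 0..15 (assumed data).
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = split_into_patches(image, patch_size=2)
print(len(tokens))  # 4
print(tokens[0])    # [0, 1, 4, 5]
```

The resulting list of tokens plays the same role as a sequence of word embeddings in a text Transformer, which is why no convolution-specific machinery is needed in this step.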

OUTLINE:

  • 0:00 – Introduction
  • 0:30 – Double-Blind Review is Broken
  • 5:20 – Overview
  • 6:55 – Transformers for Images
  • 10:40 – Vision Transformer Architecture
  • 16:30 – Experimental Results
  • 18:45 – What does the Model Learn?
  • 21:00 – Why Transformers are Ruining Everything
  • 27:45 – Inductive Biases in Transformers
  • 29:05 – Conclusion & Comments

Related resources:

  • Paper (Under Review): https://openreview.net/forum?id=YicbFdNTTy

Visual scenes are often composed of sets of independent objects. Yet, current vision models make no assumptions about the nature of the pictures they look at.

Yannic Kilcher explores a paper on object-centric learning.

By imposing an objectness prior, this paper presents a module that is able to recognize permutation-invariant sets of objects from pixels in both supervised and unsupervised settings. It does so by introducing a slot attention module that combines an attention mechanism with dynamic routing.
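The core routing idea can be sketched in a few lines of plain Python. This is a heavily simplified illustration, not the paper's implementation: it assumes no learned query/key/value projections, no layer normalization, and no GRU or MLP update, and the names `slot_attention_step`, `inputs`, and `slots` as well as the toy 2D data are made up for this example. What it keeps is the key twist: the softmax runs over the slots, so slots compete for each input, and each slot is then updated to the weighted mean of the inputs it won.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def slot_attention_step(inputs, slots, eps=1e-8):
    """One simplified routing iteration (assumption: identity projections).

    inputs: list of N feature vectors, slots: list of K vectors, same dimension.
    Returns the updated slots and the N x K attention matrix.
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    # Softmax over slots for each input: the slots compete to explain it.
    attn = [softmax([scale * dot(x, s) for s in slots]) for x in inputs]
    new_slots = []
    for k in range(len(slots)):
        # Normalize over inputs so the update is a weighted mean.
        weights = [attn[i][k] + eps for i in range(len(inputs))]
        total = sum(weights)
        new_slots.append([
            sum(w * x[d] for w, x in zip(weights, inputs)) / total
            for d in range(dim)
        ])
    return new_slots, attn


# Toy data (assumed): two "objects", each a small cluster of 2D features.
inputs = [[4.0, 0.0], [4.2, 0.1], [0.0, 4.0], [0.1, 4.2]]
slots = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(3):  # iterative routing
    slots, attn = slot_attention_step(inputs, slots)
print([[round(v, 2) for v in s] for s in slots])
```

Because each slot update is a sum over the inputs, reordering the inputs leaves the slots unchanged, which is the permutation invariance mentioned above. After a few iterations each slot settles near one cluster of features.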

Content index:

  • 0:00 – Intro & Overview
  • 1:40 – Problem Formulation
  • 4:30 – Slot Attention Architecture
  • 13:30 – Slot Attention Algorithm
  • 21:30 – Iterative Routing Visualization
  • 29:15 – Experiments
  • 36:20 – Inference Time Flexibility
  • 38:35 – Broader Impact Statement
  • 42:05 – Conclusion & Comments