Yannic Kilcher explains why transformers are ruining convolutions.

This paper, under review at ICLR, shows that given enough data, a standard Transformer can outperform Convolutional Neural Networks on image recognition tasks, a domain where CNNs have classically excelled. In this video, I explain the architecture of the Vision Transformer (ViT), why it works better, and rant about why double-blind peer review is broken.
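
To make the core idea concrete, here is a minimal sketch (not the authors' code) of the ViT input pipeline: the image is cut into fixed-size patches, each patch is linearly projected to an embedding, a learned class token and position embeddings are added, and the resulting sequence goes into a standard Transformer encoder. The sizes below (16×16 patches, 768-dim embeddings) follow the common ViT-Base configuration and are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ViTEmbed(nn.Module):
    """Minimal sketch of the ViT input step: image -> sequence of patch tokens."""
    def __init__(self, img_size=224, patch=16, dim=768, in_ch=3):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        # A strided convolution is equivalent to "split into patches,
        # flatten each patch, and apply a shared linear projection".
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                           # x: (B, 3, 224, 224)
        tokens = self.proj(x)                       # (B, dim, 14, 14)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, 196, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)    # prepend class token
        return tokens + self.pos_embed              # add learned positions

# The token sequence is then fed to an off-the-shelf Transformer encoder, e.g.:
# encoder = nn.TransformerEncoder(
#     nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
#     num_layers=12)
```

The point of the paper is that almost nothing image-specific is added: once pixels are turned into tokens, the rest is a vanilla Transformer.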

OUTLINE:

  • 0:00 – Introduction
  • 0:30 – Double-Blind Review is Broken
  • 5:20 – Overview
  • 6:55 – Transformers for Images
  • 10:40 – Vision Transformer Architecture
  • 16:30 – Experimental Results
  • 18:45 – What does the Model Learn?
  • 21:00 – Why Transformers are Ruining Everything
  • 27:45 – Inductive Biases in Transformers
  • 29:05 – Conclusion & Comments

Related resources:

  • Paper (Under Review): https://openreview.net/forum?id=YicbFdNTTy

Generally speaking, Neural Networks are somewhat of a mystery. While you can understand the mechanics and the math that powers them, exactly how the network comes to its conclusions is a bit of a black box.

Here’s an interesting story on how researchers are trying to peer into the mysteries of a neural net.

Using an “activation atlas,” researchers can plumb the hidden depths of a neural network and study how it learns visual concepts. Shan Carter, a researcher at Google Brain, recently visited his daughter’s second-grade class with an unusual payload: an array of psychedelic pictures filled with indistinct shapes and warped […]
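
The mechanical starting point for this kind of visualization is simply capturing a layer's activations. Here is a minimal, assumed sketch using a PyTorch forward hook on a pretrained ResNet; the activation-atlas work itself used different models and adds far more machinery (feature visualization, dimensionality reduction, tiling) on top.

```python
import torch
from torchvision import models

# Minimal sketch: capture intermediate activations with a forward hook.
model = models.resnet18(weights="IMAGENET1K_V1").eval()

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

model.layer3.register_forward_hook(save_activation("layer3"))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))  # stand-in for a real image

print(activations["layer3"].shape)  # torch.Size([1, 256, 14, 14])
```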

Siraj Raval explores how a team of researchers at Google Brain and Georgia Tech developed an AI that learned how to dress itself using various types of clothing.

They demonstrated their technology by presenting a video that shows an animated figure gracefully putting on clothing, and the most interesting part is that it learned how to do this by itself. The technique they used is called Trust Region Policy Optimization (TRPO), and it's one of the techniques at the forefront of AI research.
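
For reference, the core of TRPO is a constrained policy update. The standard statement of the objective, from the original TRPO paper rather than anything specific to this video, is:

```latex
\max_{\theta}\;
\mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
  \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\,
         A^{\pi_{\theta_{\text{old}}}}(s, a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}
  \left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s)
  \,\middle\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta
```

The KL constraint keeps each policy update inside a "trust region," which is what lets the dressing policy improve steadily without catastrophically forgetting what it has learned.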