Yannic Kilcher explains why transformers are ruining convolutions.

This paper, under review at ICLR, shows that given enough data, a standard Transformer can outperform Convolutional Neural Networks (CNNs) on image recognition tasks, which are classically tasks where CNNs excel. In this video, I explain the architecture of the Vision Transformer (ViT), the reason why it works better, and rant about why double-blind peer review is broken.
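
The core trick is simple: split the image into fixed-size patches, flatten each patch, project it linearly, and feed the resulting sequence (plus a class token and position embeddings) into a completely standard Transformer encoder. A minimal NumPy sketch of the tokenization step (the dimensions are illustrative, not the paper's exact configuration):

```python
import numpy as np

# Sketch of ViT-style patch tokenization (illustrative sizes only).
image = np.random.rand(224, 224, 3)   # H x W x C input image
P, D = 16, 768                        # patch size, embedding dimension

# Split the image into non-overlapping P x P patches and flatten each one.
patches = image.reshape(224 // P, P, 224 // P, P, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)  # (196, 768)

# Learned linear projection into the Transformer's embedding space.
W = np.random.randn(P * P * 3, D) * 0.02
tokens = patches @ W                                               # (196, D)

# Prepend a learnable class token and add position embeddings; this
# sequence then goes into an unmodified Transformer encoder.
cls_token = np.zeros((1, D))
pos_embed = np.random.randn(197, D) * 0.02
sequence = np.concatenate([cls_token, tokens]) + pos_embed         # (197, D)
```

Everything after this step is the vanilla Transformer; the paper's point is that with enough pretraining data, no convolutional inductive bias is needed.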

OUTLINE:

  • 0:00 – Introduction
  • 0:30 – Double-Blind Review is Broken
  • 5:20 – Overview
  • 6:55 – Transformers for Images
  • 10:40 – Vision Transformer Architecture
  • 16:30 – Experimental Results
  • 18:45 – What does the Model Learn?
  • 21:00 – Why Transformers are Ruining Everything
  • 27:45 – Inductive Biases in Transformers
  • 29:05 – Conclusion & Comments

Related resources:

  • Paper (Under Review): https://openreview.net/forum?id=YicbFdNTTy

In this deeplizard episode, learn how to prepare and process our own custom data set of sign language digits, which will be used to train our fine-tuned MobileNet model in a future episode.
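
In broad strokes, the pattern is to organize the images into per-class directories and then stream them with Keras' ImageDataGenerator, applying MobileNet's own preprocessing. A small sketch of that pipeline (the paths, split, and batch size below are illustrative assumptions, not the episode's exact code):

```python
import tensorflow as tf

# Stream a directory-organized data set, preprocessed the way MobileNet
# expects. The directory path is a hypothetical placeholder.
preprocess = tf.keras.applications.mobilenet.preprocess_input

train_batches = tf.keras.preprocessing.image.ImageDataGenerator(
    preprocessing_function=preprocess
).flow_from_directory(
    directory='data/Sign-Language-Digits/train',  # hypothetical path
    target_size=(224, 224),  # MobileNet's expected input size
    batch_size=10,
)
```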

VIDEO SECTIONS

  • 00:00 Welcome to DEEPLIZARD – Go to deeplizard.com for learning resources
  • 00:40 Obtain the Data
  • 01:30 Organize the Data
  • 09:42 Process the Data
  • 13:11 Collective Intelligence and the DEEPLIZARD HIVEMIND

deeplizard introduces MobileNets, a class of lightweight deep convolutional neural networks that are vastly smaller and faster than many other popular models.
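
Getting predictions from a pretrained MobileNet in Keras follows the standard tf.keras.applications pattern; here is a minimal sketch (the image file name is a placeholder):

```python
import numpy as np
import tensorflow as tf

# Load ImageNet-pretrained MobileNet (only ~4M parameters).
model = tf.keras.applications.mobilenet.MobileNet()

# Load and preprocess a single image; 'lizard.jpg' is a hypothetical file.
img = tf.keras.preprocessing.image.load_img('lizard.jpg',
                                            target_size=(224, 224))
x = tf.keras.preprocessing.image.img_to_array(img)
x = tf.keras.applications.mobilenet.preprocess_input(x[np.newaxis, ...])

# Map the 1000-way output back to human-readable ImageNet labels.
preds = model.predict(x)
print(tf.keras.applications.mobilenet.decode_predictions(preds, top=5))
```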

VIDEO SECTIONS

  • 00:00 Welcome to DEEPLIZARD – Go to deeplizard.com for learning resources
  • 00:17 Intro to MobileNets
  • 02:56 Accessing MobileNet with Keras
  • 07:25 Getting Predictions from MobileNet
  • 13:32 Collective Intelligence and the DEEPLIZARD HIVEMIND

In this video, Mandy from deeplizard demonstrates how to use the fine-tuned VGG16 Keras model that we trained in the last episode to predict on images of cats and dogs in our test set.
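
A minimal sketch of this workflow, assuming a saved fine-tuned model and a directory-based test set (the model file and data path below are hypothetical, not the episode's exact names):

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import confusion_matrix

# Load the previously fine-tuned model; file name is a placeholder.
model = tf.keras.models.load_model('fine_tuned_vgg16.h5')

# Stream the test set; shuffle=False keeps predictions aligned with labels.
test_batches = tf.keras.preprocessing.image.ImageDataGenerator(
    preprocessing_function=tf.keras.applications.vgg16.preprocess_input
).flow_from_directory(
    directory='data/dogs-vs-cats/test',  # hypothetical path
    target_size=(224, 224),
    shuffle=False,
)

# Predict, then summarize the results in a confusion matrix.
predictions = model.predict(test_batches)
cm = confusion_matrix(y_true=test_batches.classes,
                      y_pred=np.argmax(predictions, axis=-1))
print(cm)
```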

VIDEO SECTIONS

  • 00:00 Welcome to DEEPLIZARD – Go to deeplizard.com for learning resources
  • 00:17 Predict with a Fine-tuned Model
  • 05:16 Plot Predictions With A Confusion Matrix
  • 05:40 Collective Intelligence and the DEEPLIZARD HIVEMIND

Visual scenes are often composed of sets of independent objects. Yet current vision models make no assumptions about the nature of the pictures they look at.

Yannic Kilcher explores a paper on object-centric learning.

By imposing an objectness prior, this paper introduces a module that is able to recover permutation-invariant sets of objects from pixels in both supervised and unsupervised settings. It does so via a Slot Attention module that combines an attention mechanism with dynamic routing.
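
The key twist relative to standard attention is that the softmax is taken over the slots (the queries), so slots compete to explain each input feature; the resulting soft assignments drive a weighted-mean update that is iterated a few times. A stripped-down NumPy sketch of one such routing loop (the paper additionally uses learned q/k/v projections, a GRU update, layer norm, and an MLP, all omitted here):

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, iters=3):
    """Stripped-down Slot Attention. inputs: (N, D) per-location features."""
    N, D = inputs.shape
    rng = np.random.default_rng(0)
    slots = rng.normal(size=(num_slots, D))  # randomly sampled slot inits

    for _ in range(iters):
        # Softmax over the *slot* axis: slots compete for each input.
        attn = softmax(inputs @ slots.T / np.sqrt(D), axis=1)  # (N, num_slots)
        # Normalize per slot, then update each slot as a weighted mean
        # of the inputs it attends to.
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ inputs                             # (num_slots, D)
    return slots

slots = slot_attention(np.random.rand(64 * 64, 32))  # e.g. a 64x64 feature map
```

Because the slots are sampled i.i.d. and updated symmetrically, the output is a set rather than an ordered list, and the number of slots can be changed at inference time (the flexibility discussed at 36:20 in the outline below).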

OUTLINE:

  • 0:00 – Intro & Overview
  • 1:40 – Problem Formulation
  • 4:30 – Slot Attention Architecture
  • 13:30 – Slot Attention Algorithm
  • 21:30 – Iterative Routing Visualization
  • 29:15 – Experiments
  • 36:20 – Inference Time Flexibility
  • 38:35 – Broader Impact Statement
  • 42:05 – Conclusion & Comments