Yannic Kilcher explains why transformers are ruining convolutions.

This paper, under review at ICLR, shows that given enough data, a standard Transformer can outperform Convolutional Neural Networks on image recognition tasks, a domain where CNNs have classically excelled. In this video, Yannic explains the architecture of the Vision Transformer (ViT), why it works better, and rants about why double-blind peer review is broken.
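The core trick is to treat an image as a sequence: cut it into 16×16 patches, flatten each patch into a vector, and feed the resulting tokens to a standard Transformer encoder. A minimal sketch of the patch step (illustrative code under assumed shapes, not the authors' implementation):

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an image of shape (H, W, C) into flattened non-overlapping patches."""
    h, w, c = image.shape
    p = patch_size
    # (H, W, C) -> (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return patches

image = np.random.rand(224, 224, 3)
patches = patchify(image)   # (196, 768): a 14x14 grid of 16*16*3-dim patch vectors
```

From here, a learned linear projection maps each patch vector to the model dimension, a class token is prepended, and position embeddings are added before the standard Transformer encoder takes over.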

OUTLINE:

  • 0:00 – Introduction
  • 0:30 – Double-Blind Review is Broken
  • 5:20 – Overview
  • 6:55 – Transformers for Images
  • 10:40 – Vision Transformer Architecture
  • 16:30 – Experimental Results
  • 18:45 – What does the Model Learn?
  • 21:00 – Why Transformers are Ruining Everything
  • 27:45 – Inductive Biases in Transformers
  • 29:05 – Conclusion & Comments

Related resources:

  • Paper (Under Review): https://openreview.net/forum?id=YicbFdNTTy

vcubingx provides a visual introduction to the structure of an artificial neural network.

The Neural Network, A Visual Introduction | Visualizing Deep Learning, Chapter 1

  • 0:00 Intro
  • 1:55 One input Perceptron
  • 3:30 Two input Perceptron
  • 4:40 Three input Perceptron
  • 5:17 Activation Functions
  • 6:58 Neural Network
  • 9:45 Visualizing 2-2-2 Network
  • 10:59 Visualizing 2-3-2 Network
  • 12:33 Classification
  • 13:05 Outro

Yannic Kilcher explains the paper “Hopfield Networks is All You Need.”

Hopfield Networks are one of the classic models of biological memory networks. This paper generalizes modern Hopfield Networks to continuous states and shows that the corresponding update rule is equal to the attention mechanism used in modern Transformers. It further analyzes a pre-trained BERT model through the lens of Hopfield Networks and uses a Hopfield Attention Layer to perform Immune Repertoire Classification.
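The claimed equivalence is easy to see in code: the continuous Hopfield update rule has exactly the shape of attention, with the stored patterns supplying keys and values and the state acting as the query. A sketch based on the paper's stated update rule (variable names and the retrieval demo are ours):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hopfield_update(X, xi, beta=1.0):
    """One update of a modern continuous Hopfield network.

    X:  (d, N) matrix of N stored patterns; xi: (d,) current state.
    xi_new = X @ softmax(beta * X.T @ xi) -- the same form as Transformer
    attention, with X as keys/values and xi as the query.
    """
    return X @ softmax(beta * X.T @ xi)

# Retrieval: a noisy version of a stored pattern converges back to it.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 5))                  # five stored patterns
xi = X[:, 2] + 0.1 * rng.standard_normal(64)      # corrupted copy of pattern 2
for _ in range(3):
    xi = hopfield_update(X, xi, beta=4.0)
```

With a large enough `beta`, the softmax saturates and the update converges to the nearest stored pattern in very few steps, which is the "one-step retrieval" property the paper analyzes.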

Content outline:

  • 0:00 – Intro & Overview
  • 1:35 – Binary Hopfield Networks
  • 5:55 – Continuous Hopfield Networks
  • 8:15 – Update Rules & Energy Functions
  • 13:30 – Connection to Transformers
  • 14:35 – Hopfield Attention Layers
  • 26:45 – Theoretical Analysis
  • 48:10 – Investigating BERT
  • 1:02:30 – Immune Repertoire Classification

Here’s an interesting article that I found via LinkedIn.

Machine learning experts struggle to deal with “overfitting” in neural networks. Evolution solved it with dreams, says a new theory. The world of sport is filled with superstition. Michael Jordan famously wore University of North Carolina shorts under his Chicago Bulls kit; Serena Williams wears the same socks throughout […]

Visual scenes often comprise sets of independent objects. Yet current vision models make no assumptions about the nature of the pictures they look at.

Yannic Kilcher explores a paper on object-centric learning.

By imposing an objectness prior, this paper presents a module that is able to recognize permutation-invariant sets of objects from pixels, in both supervised and unsupervised settings. It does so by introducing a slot attention module that combines an attention mechanism with dynamic routing.
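The distinctive move in slot attention is the direction of the softmax: it normalizes over the slots rather than over the inputs, so slots compete for input features. A simplified sketch of one iteration (the paper additionally uses a GRU update, LayerNorm, and an MLP, omitted here; names and shapes are our assumptions):

```python
import numpy as np

def slot_attention_step(slots, inputs, W_q, W_k, W_v):
    """One simplified Slot Attention iteration.

    inputs: (N, d) feature embeddings; slots: (K, d) current slot vectors.
    """
    q = slots @ W_q                               # (K, d) queries from slots
    k = inputs @ W_k                              # (N, d) keys from inputs
    v = inputs @ W_v                              # (N, d) values from inputs
    logits = k @ q.T / np.sqrt(q.shape[1])        # (N, K)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)         # softmax over SLOTS
    weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)  # per-slot mean
    return weights.T @ v                          # (K, d) updated slots

rng = np.random.default_rng(0)
inputs = rng.standard_normal((100, 32))   # e.g. 100 pixel features
slots = rng.standard_normal((4, 32))      # 4 object slots
W = np.eye(32)                            # identity projections for the demo
new_slots = slot_attention_step(slots, inputs, W, W, W)
```

Because the first softmax runs across slots, each input feature distributes its "vote" among the slots, which is what lets the module carve the scene into a permutation-invariant set of objects.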

Content index:

  • 0:00 – Intro & Overview
  • 1:40 – Problem Formulation
  • 4:30 – Slot Attention Architecture
  • 13:30 – Slot Attention Algorithm
  • 21:30 – Iterative Routing Visualization
  • 29:15 – Experiments
  • 36:20 – Inference Time Flexibility
  • 38:35 – Broader Impact Statement
  • 42:05 – Conclusion & Comments

Yannic Kilcher retraces his first reading of Facebook AI’s DETR paper and explains his process of understanding it.

OUTLINE:

  • 0:00 – Introduction
  • 1:25 – Title
  • 4:10 – Authors
  • 5:55 – Affiliation
  • 7:40 – Abstract
  • 13:50 – Pictures
  • 20:30 – Introduction
  • 22:00 – Related Work
  • 24:00 – Model
  • 30:00 – Experiments
  • 41:50 – Conclusions & Abstract
  • 42:40 – Final Remarks

Original Video about DETR: https://youtu.be/T35ba_VXkMY

Text-to-speech engines are usually multi-stage pipelines that transform the signal into many intermediate representations and require supervision at each step.

When trying to train TTS end-to-end, the alignment problem arises: Which text corresponds to which piece of sound?

This paper uses an alignment module to tackle this problem and produces astonishingly good sound.
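One ingredient from the outline worth a sketch is the dynamic-time-warping idea behind the spectrogram prediction loss: predicted and ground-truth spectrograms are compared up to a monotonic warping of time. The paper uses a soft, differentiable variant; the classic hard-DTW recursion below is our own illustration of the underlying idea:

```python
import numpy as np

def dtw_cost(a, b):
    """Dynamic-time-warping alignment cost between frame sequences.

    a: (m, d), b: (n, d) feature frames (e.g. spectrogram columns).
    D[i, j] is the cheapest cost of aligning the first i frames of a
    with the first j frames of b under monotonic matches/insertions.
    """
    m, n = len(a), len(b)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.abs(a[i - 1] - b[j - 1]).sum()   # L1 frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]

a = np.arange(6.0).reshape(3, 2)
stretched = np.repeat(a, 2, axis=0)   # same content, twice as slow
```

Note that `dtw_cost(a, stretched)` is zero: a generated spectrogram that says the right thing slightly too fast or too slow is not penalized, which is exactly what an end-to-end TTS loss needs.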

Paper: https://arxiv.org/abs/2006.03575
Website: https://deepmind.com/research/publications/End-to-End-Adversarial-Text-to-Speech

Content index:

  • 0:00 – Intro & Overview
  • 1:55 – Problems with Text-to-Speech
  • 3:55 – Adversarial Training
  • 5:20 – End-to-End Training
  • 7:20 – Discriminator Architecture
  • 10:40 – Generator Architecture
  • 12:20 – The Alignment Problem
  • 14:40 – Aligner Architecture
  • 24:00 – Spectrogram Prediction Loss
  • 32:30 – Dynamic Time Warping
  • 38:30 – Conclusion

Yannic Kilcher explores a recent innovation at Facebook.

Code migration between languages is an expensive and laborious task: to translate from one language to another, one needs to be an expert in both. Current automatic tools often produce illegible and complicated code. This paper applies unsupervised neural machine translation to source code in Python, C++, and Java and is able to translate between them without ever being trained in a supervised fashion.
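The back-translation objective at the heart of the unsupervised setup can be sketched as a data-generation loop: translate Python into (imperfect) C++ with the current model, then use those pairs as supervised training data for the reverse direction. A heavily simplified, hypothetical sketch of the data flow (the paper actually trains a single shared Transformer steered by language tokens):

```python
def back_translation_pairs(translate_py_to_cpp, python_snippets):
    """Generate synthetic supervised pairs for the C++ -> Python direction.

    `translate_py_to_cpp` is a hypothetical callable standing in for the
    current model. Each (noisy C++, original Python) pair becomes a training
    example for the reverse translator, even though the C++ side was never
    written by a human.
    """
    return [(translate_py_to_cpp(src), src) for src in python_snippets]

# Toy stand-in translator, just to show the shape of the generated data:
pairs = back_translation_pairs(lambda s: "// " + s, ["print('hi')"])
```

Alternating this loop in both directions, on top of the MLM and denoising objectives from the outline, is what lets the system learn translation without any parallel Python/C++/Java corpus.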

Paper: https://arxiv.org/abs/2006.03511

Content index:

  • 0:00 – Intro & Overview
  • 1:15 – The Transcompiling Problem
  • 5:55 – Neural Machine Translation
  • 8:45 – Unsupervised NMT
  • 12:55 – Shared Embeddings via Token Overlap
  • 20:45 – MLM Objective
  • 25:30 – Denoising Objective
  • 30:10 – Back-Translation Objective
  • 33:00 – Evaluation Dataset
  • 37:25 – Results
  • 41:45 – Tokenization
  • 42:40 – Shared Embeddings
  • 43:30 – Human-Aware Translation
  • 47:25 – Failure Cases
  • 48:05 – Conclusion