Yannic Kilcher retraces his first reading of Facebook AI's DETR paper and explains his process of understanding it.

OUTLINE:

  • 0:00 – Introduction
  • 1:25 – Title
  • 4:10 – Authors
  • 5:55 – Affiliation
  • 7:40 – Abstract
  • 13:50 – Pictures
  • 20:30 – Introduction
  • 22:00 – Related Work
  • 24:00 – Model
  • 30:00 – Experiments
  • 41:50 – Conclusions & Abstract
  • 42:40 – Final Remarks

Original Video about DETR: https://youtu.be/T35ba_VXkMY

Machine Learning Street Talk: Tim Scarfe, Yannic Kilcher and Connor Shorten discuss their takeaways from OpenAI’s GPT-3 language model.

OpenAI trained a 175 BILLION parameter autoregressive language model. The paper demonstrates how self-supervised language modelling at this scale can perform many downstream tasks without fine-tuning. 
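The “without fine-tuning” part refers to few-shot in-context learning: task demonstrations are placed directly in the prompt and the frozen autoregressive model simply continues the text. Below is a toy sketch of that idea (not from the episode); the prompt format follows the translation example in the GPT-3 paper, while `next_token` and `generate` are placeholder names standing in for the real 175B-parameter model and its decoding loop.

```python
# Toy sketch of few-shot "in-context learning" with an autoregressive LM.
# The prompt carries the task; the model weights are never updated.
# `next_token` is only a placeholder for the real model's prediction.
FEW_SHOT_PROMPT = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

CANNED_COMPLETION = " fromage\n"  # what a competent model would continue with


def next_token(prefix: str) -> str:
    """Placeholder for the model's next-token prediction given `prefix`.
    A real autoregressive LM returns the most likely (or a sampled) token."""
    pos = len(prefix) - len(FEW_SHOT_PROMPT)
    return CANNED_COMPLETION[pos] if pos < len(CANNED_COMPLETION) else "\n"


def generate(prompt: str, max_tokens: int = 20, stop: str = "\n") -> str:
    """Greedy autoregressive decoding: condition on everything produced so far."""
    out = prompt
    for _ in range(max_tokens):
        tok = next_token(out)
        if tok == stop:
            break
        out += tok
    return out[len(prompt):]


print(generate(FEW_SHOT_PROMPT))  # -> " fromage"
```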

Paper Links:

Content index:

  • 00:00:00 Intro
  • 00:00:54 ZeRO 1+2 (model + data parallelism) [GPT-3 DOES *NOT* USE THIS] (Connor)
  • 00:03:17 Recent history of NLP (Tim)
  • 00:06:04 Yannic “Light-speed” Kilcher’s brief overview of GPT-3
  • 00:14:25 Reviewing Yannic’s YT comments on his GPT-3 video (Tim)
  • 00:20:26 Main show intro
  • 00:23:03 Is GPT-3 reasoning?
  • 00:28:15 Architecture discussion and autoregressive (GPT*) vs denoising autoencoder (BERT)
  • 00:36:18 Utility of GPT-3 in industry
  • 00:43:03 Can GPT-3 do math? (reasoning/system 1/system 2)
  • 00:51:03 Generalisation
  • 00:56:48 Esoterics of language models
  • 00:58:46 Architectural trade-offs
  • 01:07:37 Memorization machines and interpretability
  • 01:17:16 Nearest neighbour probes / watermarks
  • 01:20:03 YouTube comments on GPT-3 video
  • 01:21:50 GPT-3 news article generation issue
  • 01:27:36 Sampling data for language models / bias / fairness / politics
  • 01:51:12 Outro

Yannic Kilcher investigates BERT and the paper that introduced it: https://arxiv.org/abs/1810.04805

Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7% (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.
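The abstract’s claim that BERT can be fine-tuned “with just one additional output layer” is easy to see in code. A minimal sketch (not from the video), assuming the Hugging Face transformers library and PyTorch, with toy sentiment labels invented purely for illustration:

```python
# Sketch: fine-tuning pre-trained BERT for sentence classification by adding
# a single output layer, as described in the abstract.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# BertForSequenceClassification adds one linear classification head on top of
# the pre-trained bidirectional encoder -- the "one additional output layer".
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

batch = tokenizer(
    ["a delightfully sharp paper walkthrough", "the audio cuts out halfway"],
    padding=True, truncation=True, return_tensors="pt",
)
labels = torch.tensor([1, 0])  # toy sentiment labels for this sketch

outputs = model(**batch, labels=labels)
outputs.loss.backward()        # fine-tunes the encoder and the new head end-to-end
print(outputs.logits.shape)    # (batch_size, num_labels) -> torch.Size([2, 2])
```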