Machine Learning Street Talk: Tim Scarfe, Yannic Kilcher and Connor Shorten discuss their takeaways from OpenAI’s GPT-3 language model.

OpenAI trained a 175 BILLION parameter autoregressive language model. The paper demonstrates how self-supervised language modelling at this scale can perform many downstream tasks without fine-tuning. 

Paper Links:

Content index:

  • 00:00:00 Intro
  • 00:00:54 ZeRO1+2 (model + data parallelism) [GPT-3 DOES *NOT* USE THIS] (Connor)
  • 00:03:17 Recent history of NLP (Tim)
  • 00:06:04 Yannic “Light-speed” Kilcher’s brief overview of GPT-3
  • 00:14:25 Reviewing Yannic’s YT comments on his GPT-3 video (Tim)
  • 00:20:26 Main show intro
  • 00:23:03 Is GPT-3 reasoning?
  • 00:28:15 Architecture discussion and autoregressive (GPT*) vs denoising autoencoder (BERT)
  • 00:36:18 Utility of GPT-3 in industry
  • 00:43:03 Can GPT-3 do math? (reasoning/system 1/system 2)
  • 00:51:03 Generalisation
  • 00:56:48 Esoterics of language models
  • 00:58:46 Architectural trade-offs
  • 01:07:37 Memorization machines and interpretability
  • 01:17:16 Nearest neighbour probes / watermarks
  • 01:20:03 YouTube comments on GPT-3 video
  • 01:21:50 GPT-3 news article generation issue
  • 01:27:36 Sampling data for language models / bias / fairness / politics
  • 01:51:12 Outro

How far can you go with ONLY language modeling?

Can a large enough language model perform NLP tasks out of the box?

OpenAI take on these and other questions by training a transformer that is an order of magnitude larger than anything that has ever been built before, and the results are astounding.

Yannic Kilcher explores.

Paper

Time index:

  • 0:00 – Intro & Overview
  • 1:20 – Language Models
  • 2:45 – Language Modeling Datasets
  • 3:20 – Model Size
  • 5:35 – Transformer Models
  • 7:25 – Fine Tuning
  • 10:15 – In-Context Learning
  • 17:15 – Start of Experimental Results
  • 19:10 – Question Answering
  • 23:10 – What I think is happening
  • 28:50 – Translation
  • 31:30 – Winograd Schemes
  • 33:00 – Commonsense Reasoning
  • 37:00 – Reading Comprehension
  • 37:30 – SuperGLUE
  • 40:40 – NLI
  • 41:40 – Arithmetic Expressions
  • 48:30 – Word Unscrambling
  • 50:30 – SAT Analogies
  • 52:10 – News Article Generation
  • 58:10 – Made-up Words
  • 1:01:10 – Training Set Contamination
  • 1:03:10 – Task Examples

https://arxiv.org/abs/2005.14165
https://github.com/openai/gpt-3