Text-to-speech engines are usually multi-stage pipelines that transform the signal into many intermediate representations and require supervision at each step.

When trying to train TTS end-to-end, the alignment problem arises: Which text corresponds to which piece of sound?

This paper uses an alignment module to tackle this problem and produces astonishingly good sound.

Paper: https://arxiv.org/abs/2006.03575
Website: https://deepmind.com/research/publications/End-to-End-Adversarial-Text-to-Speech

Content index:

  • 0:00 – Intro & Overview
  • 1:55 – Problems with Text-to-Speech
  • 3:55 – Adversarial Training
  • 5:20 – End-to-End Training
  • 7:20 – Discriminator Architecture
  • 10:40 – Generator Architecture
  • 12:20 – The Alignment Problem
  • 14:40 – Aligner Architecture
  • 24:00 – Spectrogram Prediction Loss
  • 32:30 – Dynamic Time Warping
  • 38:30 – Conclusion

MIT Introduction to Deep Learning 6.S191: Lecture 6 with Ava Soleimany.

Subscribe to stay up to date with new deep learning lectures at MIT, or follow us @MITDeepLearning on Twitter and Instagram to stay fully-connected!!

Lecture Outline

  • 0:00 – Introduction
  • 0:58 – Course logistics
  • 3:59 – Upcoming guest lectures
  • 5:35 – Deep learning and expressivity of NNs
  • 10:02 – Generalization of deep models
  • 14:14 – Adversarial attacks
  • 17:00 – Limitations summary
  • 18:18 – Structure in deep learning
  • 22:53 – Uncertainty & Bayesian deep learning
  • 28:09 – Deep evidential regression
  • 33:08 – AutoML
  • 36:43 – Conclusion

Ben Marriott uses the style-transfer software EbSynth on animated footage to see if you really can change the style using a single reference frame.

EbSynth is technically not artificial intelligence software; it uses style transfer to get its results. I put AI in the title and thumbnail because it does what most people would think of when we talk about this type of AI in video… also for views.

You can get EbSynth here: https://ebsynth.com/

In this video from a recent talk at MIT, Demis Hassabis discusses the capabilities and power of self-learning systems. He illustrates this with reference to some of DeepMind’s recent breakthroughs, and talks about the implications of cutting-edge AI research for scientific and philosophical discovery.

What’s more impressive is Demis’ biography. From the description:

Speaker Biography: Demis is a former child chess prodigy, who finished his A-levels two years early before coding the multi-million-selling simulation game Theme Park aged 17. Following graduation from Cambridge University with a Double First in Computer Science, he founded the pioneering video games company Elixir Studios, producing award-winning games for global publishers such as Vivendi Universal. After a decade of experience leading successful technology startups, Demis returned to academia to complete a PhD in cognitive neuroscience at UCL, followed by postdocs at MIT and Harvard, before founding DeepMind. His research into the neural mechanisms underlying imagination and planning was listed in the top ten scientific breakthroughs of 2007 by the journal Science. Demis is a 5-times World Games Champion, a Fellow of the Royal Society of Arts, and the recipient of the Royal Society’s Mullard Award and the Royal Academy of Engineering’s Silver Medal.