Text-to-speech engines are usually multi-stage pipelines that pass the input through many intermediate representations and require supervision at each stage.

When trying to train TTS end-to-end, the alignment problem arises: Which text corresponds to which piece of sound?

This paper uses an alignment module to tackle this problem and produces astonishingly good sound.

Paper: https://arxiv.org/abs/2006.03575
Website: https://deepmind.com/research/publications/End-to-End-Adversarial-Text-to-Speech

Content index:

  • 0:00 – Intro & Overview
  • 1:55 – Problems with Text-to-Speech
  • 3:55 – Adversarial Training
  • 5:20 – End-to-End Training
  • 7:20 – Discriminator Architecture
  • 10:40 – Generator Architecture
  • 12:20 – The Alignment Problem
  • 14:40 – Aligner Architecture
  • 24:00 – Spectrogram Prediction Loss
  • 32:30 – Dynamic Time Warping
  • 38:30 – Conclusion