Text-to-speech engines are usually multi-stage pipelines that transform the input text through many intermediate representations and require supervision at each stage.
Training TTS end-to-end raises the alignment problem: which piece of text corresponds to which piece of sound?
This paper uses an alignment module to tackle this problem and produces astonishingly good sound.
Paper: https://arxiv.org/abs/2006.03575
Website: https://deepmind.com/research/publications/End-to-End-Adversarial-Text-to-Speech
Content index:
- 0:00 – Intro & Overview
- 1:55 – Problems with Text-to-Speech
- 3:55 – Adversarial Training
- 5:20 – End-to-End Training
- 7:20 – Discriminator Architecture
- 10:40 – Generator Architecture
- 12:20 – The Alignment Problem
- 14:40 – Aligner Architecture
- 24:00 – Spectrogram Prediction Loss
- 32:30 – Dynamic Time Warping
- 38:30 – Conclusion