The authors build a nearly end-to-end text-to-speech (TTS) synthesis pipeline that produces high-fidelity, natural-sounding speech approaching the quality of state-of-the-art TTS systems.
What can we learn from this paper?
That it is possible to train a good TTS model without multi-stage training that relies on expensive-to-create ground-truth annotations at each stage.
Prerequisites (to better understand the paper, what should one be familiar with?)
- Generative adversarial networks (GANs)
- Basics of speech representations (phonemes, Mel spectrograms, etc)
- Dynamic time warping
Discussion
This paper builds on a previous effort by the same team, in which they created GAN-TTS, a generative adversarial network (GAN) consisting of a feed-forward generator that produces raw speech audio conditioned on linguistic and pitch features, and an ensemble of discriminators operating on random time windows of different sizes. In this work, the GAN-TTS generator is used as the decoder of the model, and its input, instead of the manually constructed sequence of linguistic and pitch features at 200 Hz used in the original GAN-TTS paper, is the output of the aligner block of the network (see picture below).
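To make the random-window idea a bit more concrete, here is a minimal sketch of an ensemble of window discriminators. The window sizes, channel counts and the tiny convolutional architecture are placeholder choices of mine, not the authors' (GAN-TTS additionally uses both conditional and unconditional discriminators):

```python
import torch
import torch.nn as nn

class WindowDiscriminator(nn.Module):
    """Scores a random fixed-length window of raw audio (placeholder architecture)."""
    def __init__(self, window_size: int):
        super().__init__()
        self.window_size = window_size
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 128, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(128, 1),   # one realness score per window
        )

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # audio: (batch, total_samples); crop one random window per forward pass
        start = torch.randint(0, audio.shape[-1] - self.window_size + 1, (1,)).item()
        window = audio[:, start:start + self.window_size].unsqueeze(1)  # (B, 1, W)
        return self.net(window)

# Ensemble of discriminators, each looking at a different window size (values assumed).
window_sizes = [240, 480, 960, 1920, 3600]
discriminators = nn.ModuleList(WindowDiscriminator(w) for w in window_sizes)

fake_audio = torch.randn(4, 48000)                # stand-in for generator output
scores = [d(fake_audio) for d in discriminators]  # one score tensor per window size
```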
The main goal of the aligner is to predict the length of each input token; the token positions then follow from a cumulative sum of these predicted lengths. For each token, a representation is computed using a stack of dilated convolutions interspersed with batch normalization layers and ReLU activations, and the token length is predicted from this representation.
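The per-token length predictions are turned into a time-aligned sequence by placing each token at the centre implied by the cumulative sum of the lengths and interpolating the token representations with softmax-normalised, Gaussian-shaped weights. Here is a minimal sketch of that interpolation step; the temperature, shapes and names are my own choices rather than the paper's exact formulation:

```python
import torch

def monotonic_interpolation(token_repr, token_lengths, sigma=10.0):
    """Turn per-token representations + predicted lengths into a time-aligned sequence.

    token_repr:    (num_tokens, channels) representation of each input token
    token_lengths: (num_tokens,) non-negative predicted length of each token, in output frames
    Returns:       (num_frames, channels), where num_frames = round(sum of lengths)
    """
    # Token centres: end of each token minus half its length (cumulative-sum trick).
    ends = torch.cumsum(token_lengths, dim=0)
    centres = ends - 0.5 * token_lengths                       # (num_tokens,)

    num_frames = int(torch.round(ends[-1]).item())
    frames = torch.arange(num_frames, dtype=token_repr.dtype)  # output time steps

    # Softmax over tokens of a Gaussian-shaped score: closer tokens get more weight.
    scores = -((frames[:, None] - centres[None, :]) ** 2) / sigma**2  # (frames, tokens)
    weights = torch.softmax(scores, dim=1)

    return weights @ token_repr                                # (frames, channels)

# Toy example: 5 tokens with 16-dim representations and predicted lengths.
tokens = torch.randn(5, 16)
lengths = torch.tensor([3.0, 5.0, 2.0, 6.0, 4.0])
aligned = monotonic_interpolation(tokens, lengths)             # shape (20, 16)
```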
The entire generator architecture is differentiable and is trained end to end. As the authors note, since it’s a feed-forward convolutional network, it is well suited for fast batched inference.
For training, the total loss is calculated as a sum of three components: the adversarial loss; an explicit prediction loss in the log-scaled Mel-spectrogram domain against the ground-truth recording, with dynamic time warping used to better align the two; and an aligner length loss comparing the total predicted length of the generated output to the ground-truth length.
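To illustrate the second and third terms, here is a minimal sketch of a spectrogram prediction loss based on classic (hard) dynamic time warping, plus the length loss. The paper actually uses a differentiable soft DTW with a warp penalty and its own weighting of the terms; the shapes, distance metric and names below are my simplifications:

```python
import torch

def dtw_mel_loss(pred_mel, target_mel):
    """Dynamic-time-warping cost between two log-mel spectrograms.

    pred_mel:   (n_frames_pred, n_mels) predicted spectrogram
    target_mel: (n_frames_true, n_mels) ground-truth spectrogram
    Classic hard DTW, shown only to illustrate the alignment idea.
    """
    n, m = pred_mel.shape[0], target_mel.shape[0]
    # Pairwise L1 distance between every predicted frame and every target frame.
    dist = (pred_mel[:, None, :] - target_mel[None, :, :]).abs().sum(dim=-1)  # (n, m)

    cost = torch.full((n + 1, m + 1), float("inf"))
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + torch.min(torch.stack([
                cost[i - 1, j],      # consume a predicted frame
                cost[i, j - 1],      # consume a target frame
                cost[i - 1, j - 1],  # match the two frames
            ]))
    return cost[n, m] / (n + m)      # length-normalised alignment cost

# Toy usage with random stand-in spectrograms of different lengths.
pred = torch.randn(40, 80)           # 40 predicted frames, 80 mel bins
target = torch.randn(45, 80)         # 45 ground-truth frames
spec_loss = dtw_mel_loss(pred, target)

# Aligner length loss: penalise the mismatch between total predicted and true length.
predicted_lengths = torch.tensor([3.0, 5.0, 2.0, 6.0, 4.0])   # example per-token lengths
length_loss = (predicted_lengths.sum() - target.shape[0]) ** 2
```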
The model was trained using data from multiple native English speakers with varying amounts of recorded speech. In order to accommodate this, a speaker embedding vector was added to the inputs of the model.
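A minimal sketch of that kind of speaker conditioning, using a learned embedding table whose vector is appended to the per-token features (all sizes and names here are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

num_speakers, speaker_dim, token_dim = 10, 128, 256    # illustrative sizes only

speaker_embedding = nn.Embedding(num_speakers, speaker_dim)

token_features = torch.randn(2, 50, token_dim)          # (batch, tokens, channels)
speaker_ids = torch.tensor([3, 7])                      # one speaker id per utterance

# Look up each utterance's speaker vector and append it at every token position.
spk = speaker_embedding(speaker_ids)                    # (batch, speaker_dim)
spk = spk[:, None, :].expand(-1, token_features.shape[1], -1)
conditioned = torch.cat([token_features, spk], dim=-1)  # (batch, tokens, token_dim + speaker_dim)
```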
The primary metric for evaluating speech quality was the Mean Opinion Score (MOS) on a 5-point scale, given by human raters. Compared to the natural-speech score of 4.55 and the state-of-the-art WaveNet score of 4.41, the new system, built with far less supervision, achieved a score of 4.08.
To really compare these models and give an idea of the quality of the generated speech, here is a sample created by the model from the phonetic transcription of this paper's abstract (taken from the authors' page):
From the same source, here is what it sounds like without the phonetic transcription (just based on character input, hence the somewhat lower quality):
Finally, here is the same abstract that I converted to speech using a standard WaveNet voice (en-US-Wavenet-C) from Google's Cloud Text-to-Speech API:
To me, both the new model's and the WaveNet conversions, while identifiable as non-human, sound pretty good, and I even prefer the new model for sounding more natural. I guess you can form your own opinion! To be fair, the WaveNet sample was generated by me using the standard API and may not represent the best that WaveNet can do.
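For reference, that sample came from the standard Google Cloud Text-to-Speech Python client, with a call along these lines (the text string and output filename are placeholders):

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# Placeholder text: paste the paper's abstract here.
synthesis_input = texttospeech.SynthesisInput(text="(paste the paper's abstract here)")
voice = texttospeech.VoiceSelectionParams(language_code="en-US", name="en-US-Wavenet-C")
audio_config = texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.LINEAR16)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

# Write the returned audio bytes to a WAV file.
with open("abstract.wav", "wb") as out:
    out.write(response.audio_content)
```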
Original paper link
End-to-End Adversarial Text-to-Speech by J. Donahue, S. Dieleman, M. Binkowski et al., 2020