Pushing the Frontier of Neural Text to Speech

Tutorial @ ISCSLP 2021, January 24-26, 2021

Speakers

Xu Tan, Microsoft Research Asia, xuta@microsoft.com

Abstract

Text to speech (TTS), which aims to synthesize natural and intelligible speech from text, has been a hot research topic in the community and has become an important product service in industry. Although neural network based end-to-end TTS has significantly improved the quality of synthesized speech, great challenges remain in pushing the frontier of neural TTS and making it practical for product deployment. These challenges include 1) slow inference speed: neural TTS usually has a high computational cost and slow inference in online serving; 2) robustness: the synthesized voice often suffers from word skipping and repeating; 3) controllability: the synthesized voice usually lacks controllability in terms of speed, pitch, prosody, etc.; 4) over-smoothed prediction: the TTS model tends to predict the average of the training data, which leads to poor voice quality (e.g., a muffled or metallic voice); 5) high data cost: neural TTS requires a large amount of training data for a high-quality voice, which makes data collection costly when supporting low-resource languages; 6) diverse product scenarios: TTS systems need to cover multiple speakers, custom voice, noisy speech, singing voice synthesis, talking face synthesis, etc. In this tutorial, we review and introduce a series of TTS research works that address each of these challenges in turn, including non-autoregressive TTS, robust and controllable TTS, TTS with advanced optimizations, low-resource TTS, and TTS systems for different product scenarios. We further point out some open research problems that are critical to advancing the state of the art of neural text to speech and improving the TTS product experience.

Materials

Slides
Project page
Speech demo page

TTS tutorial @ IJCAI 2021
A talk on low-resource TTS @ Jiangmen
A talk on FastSpeech @ NVIDIA GTC China 2020
A webinar talk on TTS @ Microsoft Research