Neural Text to Speech Synthesis

Tutorial @ IJCAI 2021, August 19-26, 2021

Speakers

Xu Tan, Microsoft Research Asia, xuta@microsoft.com
Tao Qin, Microsoft Research Asia, taoqin@microsoft.com

Abstract

Text to speech (TTS), which aims to synthesize natural and intelligible speech from text, has long been an active research topic in the artificial intelligence community and has become an important product service in industry. With the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years. In this tutorial, we will give an introduction to neural text to speech in four parts. In the first part, we will briefly review the history of TTS technology. In the second part, we will introduce the key components of neural TTS: text analysis, the acoustic model, and the vocoder. In the third part, we will review work that pushes the frontier of TTS research and cover practical TTS products, including end-to-end TTS, non-autoregressive and lightweight TTS, robust/expressive/controllable TTS, low-resource TTS, and custom voice adaptation. At the end of the tutorial, we will describe several open challenges in TTS and discuss future research directions.

Outline

  1. Background
  2. Key components in TTS
    2.1 Text analysis
    2.2 Acoustic model
    2.3 Vocoder
    2.4 Towards end-to-end TTS
  3. Advanced topics in TTS
    3.1 Fast TTS
    3.2 Low-resource TTS
    3.3 Robust TTS
    3.4 Expressive TTS
    3.5 Adaptive TTS
  4. Challenges and future directions

Materials

Slides
Project page
Speech demo page

TTS tutorial @ ISCSLP 2021
A talk on low-resource TTS @ Jiangmen
A talk on FastSpeech @ NVIDIA GTC China 2020
A webinar talk on TTS @ Microsoft Research
A talk on "Towards Efficient Machine Learning for Speech and Music Applications"