Diffusion Probabilistic Modelling in Text-to-Speech

Student: Ivan Vovk

Supervisor: Dmitry Vetrov

Faculty: Faculty of Computer Science

Educational Programme: Statistical Learning Theory (Master)

Year of Graduation: 2021

Score matching and diffusion probabilistic modelling have shown great potential for estimating complex data distributions. Recent advances in stochastic calculus have unified these techniques in a continuous framework based on forward and reverse-time stochastic differential equations (SDEs). The forward SDE gradually transforms the original data distribution into a standard Gaussian by iterative noise injection. The reverse-time SDE reconstructs the reverse-time trajectories using a neural network that predicts the time-dependent gradient field of the noisy data distribution (a.k.a. the score). This work adapts these techniques to text-to-speech (TTS) synthesis. The TTS setting naturally leads to modified SDEs that transform the original data distribution into a Gaussian prior with arbitrary mean and diagonal covariance matrix. This generalization of the prior is shown to be essential for faster inference in the proposed parallel TTS model, Grad-TTS, whose score-based decoder transforms, at inference time, a prior distribution parameterized by the text encoder outputs. To learn the alignment between the input text sequence and the output acoustic features, Grad-TTS uses the Monotonic Alignment Search (MAS) algorithm, which operates in a fully unsupervised way and therefore removes the need for the external text-speech aligners that many other TTS solutions rely on. Additionally, to show the power of diffusion probabilistic modelling for speech synthesis, experiments with two models are presented: a mel-spectrogram feature generator stacked with a pre-trained vocoder, and an end-to-end pipeline that produces a raw waveform directly from the given text. Subjective human evaluation shows that the proposed mel-spectrogram feature generator matches the state of the art on the 5-point Mean Opinion Score scale, achieving 4.48, only 0.09 below ground-truth speech samples.
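The forward process described in the abstract can be illustrated with a minimal sketch. The summary itself gives no equations, so everything specific here is an assumption: an Ornstein-Uhlenbeck-type forward SDE dx = 0.5 * (mu - x) / sigma2 * beta(t) dt + sqrt(beta(t)) dW, a linear noise schedule, and the hypothetical helper names and default values `beta0` and `beta1`. Under those assumptions the distribution of x_t given x_0 is Gaussian in closed form, and it drifts from the data point towards a prior with mean `mu` and diagonal variance `sigma2`.

```python
import math
import random


def cumulative_noise(t, beta0=0.05, beta1=20.0):
    """Integral over [0, t] of an assumed linear schedule
    beta(s) = beta0 + (beta1 - beta0) * s."""
    return beta0 * t + 0.5 * (beta1 - beta0) * t * t


def forward_marginal(x0, mu, sigma2, t):
    """Per-dimension mean and variance of x_t given x_0 for the assumed SDE
    dx = 0.5 * (mu - x) / sigma2 * beta(t) dt + sqrt(beta(t)) dW."""
    b = cumulative_noise(t)
    means, variances = [], []
    for x, m, s2 in zip(x0, mu, sigma2):
        # Fraction of the initial offset (x - m) that survives until time t.
        decay = math.exp(-0.5 * b / s2)
        means.append(m + (x - m) * decay)
        # Variance grows from 0 towards the prior variance s2.
        variances.append(s2 * (1.0 - decay * decay))
    return means, variances


def sample_forward(x0, mu, sigma2, t, rng):
    """Draw one sample of x_t given x_0 from the closed-form marginal."""
    means, variances = forward_marginal(x0, mu, sigma2, t)
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0)
            for m, v in zip(means, variances)]


if __name__ == "__main__":
    rng = random.Random(0)
    x0 = [3.0, -2.0]      # toy "data" point (illustrative values)
    mu = [0.5, 1.0]       # prior mean, e.g. derived from text encoder outputs
    sigma2 = [1.0, 0.5]   # diagonal prior variance
    for t in (0.0, 0.5, 1.0):
        print(t, forward_marginal(x0, mu, sigma2, t),
              sample_forward(x0, mu, sigma2, t, rng))
```

At t = 0 the marginal collapses onto the data point, and by t = 1 it is numerically close to the Gaussian prior with mean `mu` and variance `sigma2`. In a score-based model of this kind, a network would be trained to predict the score of these noisy marginals so that the reverse-time SDE can be integrated from the prior back to data; that part is not shown here.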

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
