Voice Conversion Based on Phoneme Distribution and Encoder-Decoder Models

Student: Gogoryan Vladimir

Supervisor: Alexey Naumov

Faculty: Faculty of Computer Science

Educational Programme: Statistical Learning Theory (Master)

Year of Graduation: 2021

Voice conversion systems have attracted considerable attention from researchers in recent years. Despite progress in this area, the scenario in which the speakers speak different languages remains sparsely investigated. The task is complex because the phonetic sets in the training data are highly diverse. We investigate how effectively the voice conversion problem can be solved using phoneme distributions. This representation is independent of the language and carries no information about the original speaker, which can potentially simplify cross-language conversion. This work moves step by step towards a cross-lingual model that does not require parallel training data and can support an arbitrary number of target voices. To this end, we first train a module that extracts phoneme distributions from acoustic features and works not only on the training data but also for any new speaker. Next, we use single-speaker speech synthesis models (Tacotron 2, FastSpeech) to obtain results in the any-to-one mode. We then turn to architectures that support more than one target speaker (multi-speaker Tacotron 2, FastSpeech 2) to scale the model to the any-to-many mode and improve the results. We study the any-to-any mode using a pretrained acoustic encoder, which makes it possible to generate embeddings for previously unseen speakers. Additionally, we propose to address the cross-lingual problem by formulating it in terms of generative adversarial networks, an approach that also works in the any-to-any mode. The effectiveness of the proposed methods is evaluated on a dataset of Russian and English speakers. These languages were chosen because of the differences between their phonetic sets.
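To make the central representation concrete, a phoneme distribution can be read as a per-frame probability vector over a phoneme inventory. The sketch below is an illustration only and not the implementation from the thesis: the function name `phoneme_posteriors`, the softmax normalization, the three-phoneme inventory, and the example scores are all assumptions. In a real system the scores would come from a trained network applied to acoustic features.

```python
import math
from typing import List


def phoneme_posteriors(frame_logits: List[List[float]]) -> List[List[float]]:
    """Turn per-frame phoneme scores into per-frame probability distributions.

    Hypothetical helper: each inner list holds unnormalized scores for one
    audio frame, one score per phoneme class.
    """
    posteriors = []
    for logits in frame_logits:
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        posteriors.append([e / total for e in exps])
    return posteriors


# Made-up scores for 2 frames over an assumed 3-phoneme inventory.
logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
ppg = phoneme_posteriors(logits)
```

Each row of `ppg` sums to one and contains no explicit speaker identity, which is the property the summary relies on when passing such distributions to a synthesis model.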

