Deep Learning Models for Morpheme Segmentation

Student: Dorkin Aleksei

Supervisor: Svetlana Toldova

Faculty: Faculty of Humanities

Educational Programme: Computational Linguistics (Master)

Final Grade: 8

Year of Graduation: 2021

This master's thesis addresses the problem of morpheme segmentation from the point of view of subword-based tokenization. Contemporary approaches to tokenization widely employ unsupervised learning algorithms to split words into smaller units, which significantly reduces vocabulary size compared to word-based tokenization. These algorithms implicitly model a language's morphological system to a certain extent, which proves highly beneficial for downstream tasks. However, the resulting subword units only partially match actual morphemes. Linguistically motivated subword tokenization therefore appears to be the next logical step. Such tokenization entails supervised learning, which in turn requires a considerable amount of labeled data that is relatively hard to obtain for this task. Accordingly, we propose a novel method for generating labeled data for arbitrary word forms in the Russian language using the Russian Wiktionary. In addition, we carry out experiments with several deep learning models in an extensible and reproducible environment to gauge their performance on the task of morpheme segmentation. We present the code for the data generation and the experiments, as well as the data itself, in open-source repositories.
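As an illustration only, and not a description of the thesis's own implementation, supervised morpheme segmentation is often framed as per-character labeling: each character is marked by whether a morpheme starts at it. The sketch below assumes a hypothetical input format in which morphemes are separated by "/"; the function names and the English example word are likewise assumptions made for this example.

```python
from typing import List, Tuple


def to_boundary_labels(segmented: str, sep: str = "/") -> Tuple[str, List[int]]:
    """Convert a segmented word into (plain word, per-character labels).

    A label of 1 means a morpheme starts at that character, 0 means the
    character continues the current morpheme. The "/" separator is an
    assumed annotation format, not one taken from the thesis.
    """
    morphemes = [m for m in segmented.split(sep) if m]
    word = "".join(morphemes)
    labels: List[int] = []
    for morpheme in morphemes:
        labels.extend([1] + [0] * (len(morpheme) - 1))
    return word, labels


def from_boundary_labels(word: str, labels: List[int], sep: str = "/") -> str:
    """Rebuild the segmented form from a word and its boundary labels."""
    parts: List[str] = []
    for char, label in zip(word, labels):
        if label == 1 or not parts:
            parts.append(char)   # start a new morpheme
        else:
            parts[-1] += char    # extend the current morpheme
    return sep.join(parts)


if __name__ == "__main__":
    word, labels = to_boundary_labels("un/break/able")
    print(word, labels)
    print(from_boundary_labels(word, labels))
```

In this framing, a model would be trained to predict the label sequence from the characters of an unsegmented word, and the second function turns its predictions back into morphemes.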

Full text (added June 1, 2021)
