
Data Augmentation for Dialogue System Training

Student: Paschenko Anatoliy

Supervisor: Evgeny Sokolov

Faculty: Graduate School of Business

Educational Programme: Business Informatics (Bachelor)

Year of Graduation: 2021

Over the last three years, the natural language processing (NLP) field has undergone significant changes. Legacy LSTM architectures were replaced by transformer-based ones, with BERT as the breakthrough model: since its release in the fall of 2018, almost every new language model has built on it. Because this view of the field is so recent and new SOTA solutions appear constantly, little research has been devoted to augmentations for improving quality, and there is currently no comprehensive overview of how to create useful artificial examples for modern dialogue systems. This thesis therefore examines the main augmentations reported as effective not only for NLP problems and dialogue systems in particular, but also for visual question answering (VQA) and computer vision. The aim is to identify effective augmentations for question-answering problems and to compare their results, closing the mentioned gap in the literature. In addition, fine-tuning on synthetic examples is applied not only to the full-size BERT model but also to its distilled version, to assess how well small architectures with similar metric values handle artificial examples compared to the full-size version. Large models take a long time to tune, are expensive, and are difficult to integrate into products, so work usually starts with less bulky options; it is therefore necessary to determine whether they can work with synthetic examples at the level of larger architectures.

Achieving this goal required completing several tasks:
1. Model selection
2. Data set selection
3. Augmentation selection
4. Determining the results of the baseline solution (without augmentation) for each model
5. Implementing and applying the augmentations
6. Calculating the final metrics after applying each augmentation
7. Comparing the relative change in metrics as a function of augmentation for each model
8. Identifying effective ways to create synthetic examples for dialogue systems

The first stage of the work used theoretical methods: analysis and comparison of the literature to select the models, corpus, and augmentations used in the study. After that, mathematical methods were applied to process the data set, fine-tune the models, and make predictions on the original and modified data, followed by metric calculation. This step was followed by a comparison of the results of the previous stage.

The practical work led to several conclusions. First, every augmentation method improved the quality of BERT. However, many DistilBERT variations proved worse than the baseline, and the metric gains for the best ones are also significantly lower than for the full-size version; we can therefore conclude that the distilled model handles distorted texts significantly worse than the large one. The most effective augmentations are inserting a word using MLM and replacing a word with a synonym. The best quality among all compared variations, in both EM and F1, came from applying these two methods sequentially. Randomly choosing how to change texts at the batch level proved worse than all the other options, including the baseline.
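A minimal sketch of the two augmentations found most effective, MLM-based word insertion and synonym replacement, is given below. The model name (bert-base-uncased), the use of the Hugging Face fill-mask pipeline and NLTK WordNet, and the random word-selection heuristics are illustrative assumptions, not the exact pipeline used in the thesis.

# Sketch of the two best-performing augmentations, under the assumptions above.
import random

from nltk.corpus import wordnet           # pip install nltk; nltk.download('wordnet')
from transformers import pipeline         # pip install transformers

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def insert_word_mlm(text: str) -> str:
    """Insert a [MASK] token at a random position and let BERT fill it in."""
    tokens = text.split()
    pos = random.randint(0, len(tokens))
    masked = " ".join(tokens[:pos] + [fill_mask.tokenizer.mask_token] + tokens[pos:])
    best = fill_mask(masked, top_k=1)[0]["token_str"]
    return " ".join(tokens[:pos] + [best] + tokens[pos:])

def replace_with_synonym(text: str) -> str:
    """Replace one random word that has a WordNet synonym."""
    tokens = text.split()
    candidates = [i for i, t in enumerate(tokens) if wordnet.synsets(t)]
    if not candidates:
        return text
    i = random.choice(candidates)
    synonyms = {l.name().replace("_", " ")
                for s in wordnet.synsets(tokens[i]) for l in s.lemmas()}
    synonyms.discard(tokens[i])
    if synonyms:
        tokens[i] = random.choice(sorted(synonyms))
    return " ".join(tokens)

# The best variation in the comparison applied both augmentations one after the other.
augmented = replace_with_synonym(insert_word_mlm(
    "Dialogue systems answer user questions about a given passage."))
print(augmented)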
Based on the results of this work, we recommend abandoning distilled models as an option for testing the usefulness of BERT-based algorithms on your own data, since, despite their close scores on SQuAD, such versions perform significantly worse on lower-quality data. In addition, given their high efficiency, we recommend using each of the augmentations discussed in this paper to create artificial examples, whether to improve quality or to expand the covered cases.
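For reference, the EM and F1 values mentioned above follow the standard SQuAD evaluation scheme: exact match after answer normalization and token-level F1 between the predicted and gold answers. The sketch below is a simplified reimplementation; the normalization details are assumptions rather than the official evaluation script.

# Simplified SQuAD-style EM and F1, under the assumptions stated above.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the BERT model", "BERT model"))                            # 1.0
print(round(f1_score("a large transformer model", "transformer model"), 2))   # 0.8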
