
Service for Recognition Author’s Native Language Based on English Essay

Student: Remneva Vitaliya

Supervisor: Dmitry Alexandrov

Faculty: Faculty of Computer Science

Educational Programme: System and Software Engineering (Master)

Year of Graduation: 2018

This master's thesis addresses the problem of recognizing an author's native language from English essays by constructing a model based on machine learning algorithms. The model is trained on TOEFL 11: A Corpus of Non-Native English, which was created specifically for the native language recognition task. The goal of the work is to build a model whose accuracy surpasses previously achieved results. To this end, the support vector machine (SVM) is used, as it has proved to be the best-performing method for this problem. The work also considers a promising approach not previously applied to native language recognition: convolutional neural networks. Attention is also paid to representing text data in vector form, for which several promising approaches are used: the TF-IDF metric, Word2Vec, and vocabulary construction. A series of experiments was conducted using various vectorization methods and preprocessing of the training sample for both support vector machines and convolutional neural networks. The SVM using TF-IDF weighting over unigrams and bigrams, with a specific configuration of training-data preprocessing and parameters, achieves a maximum accuracy of 84.16%, exceeding the results of other authors who also used the TOEFL 11 corpus. Convolutional neural networks are usually used to process and classify images, but the approach of Y. Kim, adopted in this work, makes them applicable to text data. The approach shows satisfactory results: the maximum accuracy achieved by the model is 75.15%. Convolutional neural networks thus demonstrate their applicability to recognizing an author's native language from text and show prospects for further research.
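To make the vectorization step concrete, the following is a minimal sketch of TF-IDF weighting over unigrams and bigrams, written in plain Python. It is an illustration under stated assumptions, not the thesis's actual pipeline: the function names (`ngrams`, `extract_terms`, `tfidf_vectors`), the whitespace tokenization, the smoothed IDF variant, and the toy essays are all hypothetical and are not taken from the thesis or the TOEFL 11 corpus. In the thesis, vectors of this kind serve as input to a support vector machine; the classifier itself is omitted here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, each joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_terms(text):
    """Lowercase, split on whitespace, and return unigrams plus bigrams.

    Assumption: whitespace tokenization stands in for real preprocessing.
    """
    tokens = text.lower().split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)


def tfidf_vectors(documents):
    """Map each document to a sparse {term: weight} dict of L2-normalized TF-IDF."""
    term_lists = [extract_terms(doc) for doc in documents]
    n_docs = len(documents)

    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter()
    for terms in term_lists:
        doc_freq.update(set(terms))

    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        # Smoothed IDF, ln((1 + N) / (1 + df)) + 1: one common variant,
        # assumed here for illustration only.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Hypothetical toy essays, not drawn from the TOEFL 11 corpus.
essays = [
    "i am agree with this statement",
    "i agree with this statement",
    "in my opinion the statement is true",
]
vectors = tfidf_vectors(essays)
```

In this toy sample, the bigram "am agree" occurs in only one essay and therefore receives a higher weight than "this statement", which occurs in two. Rare word combinations of this kind are the sort of signal that bigram features can expose to a classifier.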

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
