National Research University Higher School of Economics, Student Theses

Student: Dmitrij Fyodorov
Title: Application of statistical approach in a problem of automatic identification of authorship of the text
Educational Programme: Bachelor’s programme
Year of Graduation: 2014
The topic of the work is “Application of statistical approach in a problem of automatic identification of authorship of the text”. The work analyses several statistical text features (average sentence length in words and in characters, percentage usage of each punctuation type, percentage usage of each part of speech, and bigram frequency) and four classifiers (Naïve Bayes, Support Vector Machines, Decision Tree and K Nearest Neighbors) applied to the automatic attribution of Russian texts.

The first chapter is devoted to the concept of the author’s invariant and gives an overview of the classifiers used in the work.

The second chapter describes a program written in C++ using the OpenCV and Qt libraries. It also gives a schematic diagram, a description of the GUI and the class diagrams.

The third chapter describes experiments carried out with the developed program. The experiments took place in five stages: analysis of the statistical features of the text; selection of classifier parameters; analysis of the classifiers; study of how identification accuracy depends on the training sample size and on the number of predefined classes; and comparison of the results with an existing analogue.

The analysis of statistical features showed that part-of-speech usage percentages and bigram frequencies reflect an author's individual style well and help to achieve high identification accuracy. Adding punctuation usage percentages and average sentence length helps to increase identification accuracy.

Among the classifiers, the best accuracy was shown by the support vector machine with regularization parameter C = 1000 and a polynomial kernel, and by the Naïve Bayes classifier, with average accuracies of 96.4% and 93.8% respectively. The K Nearest Neighbors method performed slightly worse, with an average accuracy of 91%. The decision tree method, with an average accuracy of 55.4%, proved unsuitable for determining the author of a text.
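
As an illustration of the kind of statistical features listed in the summary, the following is a minimal Python sketch, not the thesis's C++ program. It computes average sentence length in words and characters, the share of each punctuation mark, and relative bigram frequencies. The function names, the tracked punctuation set, the naive sentence splitter and the choice of character bigrams within words are all assumptions made for this example; the summary does not specify them. Part-of-speech percentages are omitted because they would require a morphological analyser.

```python
import re
from collections import Counter

# Assumed set of punctuation marks to track; the thesis summary does not list them.
PUNCTUATION = ".,;:!?-"


def split_sentences(text):
    """Naive sentence splitter (assumption): break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def style_features(text):
    """Return a dict of simple stylometric features for one text."""
    sentences = split_sentences(text)
    if not sentences:
        raise ValueError("text contains no sentences")

    # Average sentence length in words and in characters.
    words_per_sentence = [len(re.findall(r"\w+", s)) for s in sentences]
    chars_per_sentence = [len(s) for s in sentences]

    # Share of each tracked punctuation mark among all tracked marks.
    punct = Counter(ch for ch in text if ch in PUNCTUATION)
    total_punct = sum(punct.values())
    punct_share = {ch: n / total_punct for ch, n in punct.items()} if total_punct else {}

    # Relative frequency of character bigrams inside words (assumed bigram type).
    bigrams = Counter()
    for word in re.findall(r"\w+", text.lower()):
        for i in range(len(word) - 1):
            bigrams[word[i:i + 2]] += 1
    total_bigrams = sum(bigrams.values())
    bigram_freq = {bg: n / total_bigrams for bg, n in bigrams.items()} if total_bigrams else {}

    return {
        "avg_words_per_sentence": sum(words_per_sentence) / len(sentences),
        "avg_chars_per_sentence": sum(chars_per_sentence) / len(sentences),
        "punctuation_share": punct_share,
        "bigram_freq": bigram_freq,
    }
```

Since `\w` matches Unicode letters in Python 3, the same sketch works on Cyrillic text. Feature dictionaries of this kind could then be turned into fixed-length vectors and passed to any of the classifiers named in the summary.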

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
