Application of statistical approach in a problem of automatic identification of authorship of the text
The topic of this work is “Application of statistical approach in a problem of automatic identification of authorship of the text”. The work analyzes several statistical text features (average sentence length in words and in symbols, the percentage usage of each punctuation type, the percentage usage of each part of speech, and bigram frequencies) and four classifiers (Naïve Bayes, Support Vector Machines, Decision Tree and K Nearest Neighbors) as applied to the automatic attribution of Russian texts.

The first chapter is devoted to the concept of the author’s invariant and gives an overview of the classifiers used in the work.

The second chapter describes a program written in C++ using the OpenCV and Qt libraries. A schematic diagram, a description of the GUI and class diagrams are also given.

The third chapter describes the experiments carried out with the developed program. The experimental research took place in five stages: analysis of the statistical features of the text, selection of classifier parameters, analysis of the classifiers, study of how identification accuracy depends on the training sample size and on the number of predefined classes, and comparison of the results with an existing analogue.

The analysis of statistical features showed that part-of-speech percentage usage and bigram frequencies reflect an author's individual style well and help to achieve high identification accuracy. Adding punctuation type percentage usage and average sentence length helps to increase identification accuracy.

Among the classifiers, the best accuracy was shown by the support vector machine with regularization parameter C = 1000 and a polynomial kernel and by the naive Bayes classifier, with average accuracies of 96.4% and 93.8% respectively. The K nearest neighbors method showed slightly worse results, with an average accuracy of 91%. The decision tree method, with an average accuracy of 55.4%, proved unsuitable for the problem of determining the author of a text.
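
To make the statistical features listed above more concrete, the following is a minimal C++ sketch of how three of them might be computed from raw text. It is not the program described in the second chapter. The names `TextFeatures` and `extractFeatures` are hypothetical, and several choices are assumptions made for illustration: bigrams are taken to be pairs of adjacent words within a sentence, "symbols" are counted as the characters inside words, punctuation usage is measured as the share of each mark among all marks, and only ASCII punctuation is recognized. Part-of-speech usage is omitted because it requires a morphological analyzer.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical feature container; field names are illustrative only.
struct TextFeatures {
    double avgSentenceWords = 0.0;    // average sentence length in words
    double avgSentenceSymbols = 0.0;  // average sentence length in word characters
    std::map<char, double> punctuationShare;  // share of each mark among all marks
    std::map<std::pair<std::string, std::string>, double> bigramFreq;  // relative word-bigram frequency
};

// Assumes UTF-8 input with ASCII punctuation; sentences end at '.', '!' or '?'.
TextFeatures extractFeatures(const std::string& text) {
    TextFeatures f;
    std::map<char, int> punctCount;
    std::map<std::pair<std::string, std::string>, int> bigramCount;
    int punctTotal = 0, bigramTotal = 0;
    int sentences = 0, totalWords = 0, totalSymbols = 0;
    std::vector<std::string> sentenceWords;  // words of the sentence being read
    std::string word;

    // Close the current word and pair it with the previous word of the sentence.
    auto flushWord = [&]() {
        if (word.empty()) return;
        if (!sentenceWords.empty()) {
            ++bigramCount[std::make_pair(sentenceWords.back(), word)];
            ++bigramTotal;
        }
        sentenceWords.push_back(word);
        word.clear();
    };
    // Close the current sentence; empty sentences (e.g. inside "...") are ignored.
    auto endSentence = [&]() {
        flushWord();
        if (sentenceWords.empty()) return;
        ++sentences;
        totalWords += static_cast<int>(sentenceWords.size());
        sentenceWords.clear();
    };

    for (unsigned char c : text) {
        if (std::isspace(c)) {
            flushWord();
        } else if (std::ispunct(c)) {
            flushWord();
            ++punctCount[static_cast<char>(c)];
            ++punctTotal;
            if (c == '.' || c == '!' || c == '?') endSentence();
        } else {
            word.push_back(static_cast<char>(c));
            // Count code points, not bytes: skip UTF-8 continuation bytes.
            if ((c & 0xC0) != 0x80) ++totalSymbols;
        }
    }
    endSentence();  // the text may end without a terminal mark
    assert(word.empty() && sentenceWords.empty());

    if (sentences > 0) {
        f.avgSentenceWords = static_cast<double>(totalWords) / sentences;
        f.avgSentenceSymbols = static_cast<double>(totalSymbols) / sentences;
    }
    for (const auto& p : punctCount)
        f.punctuationShare[p.first] = static_cast<double>(p.second) / punctTotal;
    for (const auto& b : bigramCount)
        f.bigramFreq[b.first] = static_cast<double>(b.second) / bigramTotal;
    return f;
}
```

Calling `extractFeatures` on each text would yield values that could be flattened into a fixed-length feature vector before being passed to a classifier. In practice, the bigram map would probably have to be restricted to a fixed vocabulary so that all texts produce vectors of the same dimension.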