Year of Graduation
Application of machine learning for classification of text document style
School of Applied Mathematics and Information Science
The present bachelor's thesis (final qualification work) addresses the application of machine learning to automatic classification of text style. Functional styles are distinguished according to the basic functions of language (communication, information, influence) and are associated with different fields of human activity. The following functional styles are considered in this study: scientific, business, journalistic, artistic and conversational.
The k-Nearest Neighbors algorithm was chosen as the machine learning method and was implemented by the author. A collection of texts in the various functional styles was compiled for the experiments. All texts were collected from the Internet so that the conditions under which the algorithm operates approximate real-life situations in which the style of a web page needs to be determined. Every text in the collection was preprocessed with the morphological analysis module "mystem". Morphological processing transforms a text into a sequence of word forms together with their grammatical information.
After morphological analysis, every text was evaluated against specific functional characteristics. The following characteristics were considered: subjectivity, qualitativeness and the average number of letters in a word. All parts of speech were taken into account, so the functional characteristics also included the frequency of occurrence of the various parts of speech in the text. Fourteen style characteristics were considered in total.
The programs for morphological text analysis, for evaluation of the text characteristics and for the machine learning method itself were implemented in C++. A set of experiments on functional style identification with the implemented machine learning method demonstrated that this approach has relatively low effectiveness.
As a result of the study, the specific reasons for this low effectiveness are identified and possible options for future improvement of the method are recommended.
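For illustration only, the classification step described in the abstract can be sketched in C++ as follows. This is a minimal sketch and not the implementation from the thesis: the names `Sample`, `euclidean` and `classify_knn`, the choice of Euclidean distance, and the tie-breaking rule are all assumptions made here. Each text is assumed to be already reduced to a numeric feature vector (for example, the fourteen style characteristics).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container: one labelled text, reduced to its feature vector.
struct Sample {
    std::vector<double> features;  // e.g. the style characteristics of the text
    std::string style;             // e.g. "scientific", "business", ...
};

// Euclidean distance between two feature vectors (assumed metric).
double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("feature vectors differ in length");
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Returns the style that wins a majority vote among the k training samples
// closest to the query. Ties go to the alphabetically first style name.
std::string classify_knn(const std::vector<Sample>& train,
                         const std::vector<double>& query,
                         std::size_t k) {
    if (train.empty() || k == 0) {
        throw std::invalid_argument("need a non-empty training set and k > 0");
    }
    k = std::min(k, train.size());

    // Pair each training sample's distance to the query with its index.
    std::vector<std::pair<double, std::size_t>> dist;
    dist.reserve(train.size());
    for (std::size_t i = 0; i < train.size(); ++i) {
        dist.emplace_back(euclidean(train[i].features, query), i);
    }

    // Only the k smallest distances need to be in sorted order.
    std::partial_sort(dist.begin(),
                      dist.begin() + static_cast<std::ptrdiff_t>(k),
                      dist.end());

    // Count the votes of the k nearest neighbours per style.
    std::map<std::string, std::size_t> votes;
    for (std::size_t i = 0; i < k; ++i) {
        ++votes[train[dist[i].second].style];
    }

    // std::map iterates in key order, so a strict '>' keeps the first winner.
    std::string best;
    std::size_t best_count = 0;
    for (const auto& v : votes) {
        if (v.second > best_count) {
            best = v.first;
            best_count = v.second;
        }
    }
    return best;
}
```

A call such as `classify_knn(train, query, 3)` returns the predicted style label for `query`. Because characteristics such as average word length and part-of-speech frequencies lie on different scales, a distance-based method of this kind is usually applied to normalized features; whether and how that was done in the thesis is not stated in the abstract.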