Student: Artem Imaev
Title: Application of machine learning for classification of text document style
Faculty: School of Applied Mathematics and Information Science
Educational Programme: Bachelor’s programme
Year of Graduation: 2014
This bachelor’s thesis addresses the application of machine learning to automatic classification of text style. Functional styles are distinguished according to the basic functions of language (communication, information, influence) and are associated with different fields of human activity. Five functional styles are considered in this study: scientific, business, journalistic, artistic, and conversational. The k-Nearest Neighbors algorithm was chosen as the machine learning method and was implemented by the author.

A collection of texts in the various functional styles was compiled for the experiments. All texts were collected from the Internet so that the algorithm's performance would approximate real-life situations in which the style of a web page needs to be determined. Every text in the collection was preprocessed with the morphological analysis module “mystem”. Morphological processing converts a text into a sequence of word forms with their grammatical information.

After morphological analysis, every text was evaluated against specific functional characteristics: subjectivity, qualitativeness, and the average number of letters per word. All parts of speech were taken into account, and the characteristics included metrics on the frequency of occurrence of the various parts of speech in the text. Fourteen style characteristics were considered in total. The programs for morphological text analysis, evaluation of text characteristics, and the machine learning method itself were implemented in C++.

A set of experiments on functional style identification using the implemented machine learning method showed that this approach has relatively low effectiveness. As a result of the study, specific reasons for this low effectiveness are identified and possible options for future improvement of the method are recommended.
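
The classification step described in the summary can be illustrated with a minimal k-Nearest Neighbors sketch in C++. This is not the thesis code: the `Sample` structure, the function names, the use of Euclidean distance, and the tie-breaking rule are all assumptions made for illustration, since the summary does not specify them. Each text is assumed to have already been reduced to a numeric feature vector (for example, part-of-speech frequencies and average word length).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container: one text reduced to a feature vector plus its style label.
struct Sample {
    std::vector<double> features;  // e.g. part-of-speech frequencies, average word length
    std::string style;             // e.g. "scientific", "business"
};

// Euclidean distance over the common prefix of two feature vectors.
// The distance metric is an assumption; the summary does not name one.
double euclideanDistance(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Returns the majority style among the k training samples nearest to `query`.
// Returns an empty string if the training set is empty or k is zero.
// Vote ties are resolved in favour of the alphabetically first style label.
std::string knnClassify(const std::vector<Sample>& training,
                        const std::vector<double>& query,
                        std::size_t k) {
    if (training.empty() || k == 0) {
        return "";
    }

    // Pair every training sample's distance to the query with its label.
    std::vector<std::pair<double, std::string>> byDistance;
    byDistance.reserve(training.size());
    for (const Sample& s : training) {
        byDistance.emplace_back(euclideanDistance(s.features, query), s.style);
    }

    // Only the k smallest distances need to be ordered.
    k = std::min(k, byDistance.size());
    std::partial_sort(byDistance.begin(),
                      byDistance.begin() + static_cast<std::ptrdiff_t>(k),
                      byDistance.end());

    // Count votes per style among the k nearest neighbours.
    std::map<std::string, std::size_t> votes;
    for (std::size_t i = 0; i < k; ++i) {
        ++votes[byDistance[i].second];
    }

    std::string best;
    std::size_t bestCount = 0;
    for (const auto& entry : votes) {
        if (entry.second > bestCount) {
            best = entry.first;
            bestCount = entry.second;
        }
    }
    return best;
}
```

In such a setup, the fourteen characteristics mentioned above would form the feature vector of each text, and features measured on different scales (a frequency versus a letter count) would usually need normalizing before distances are compared.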

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
