
Classifying Automatically Generated vs. User-Generated Poetry

Student: Shaymardanova Alina

Supervisor: Olga Lyashevskaya

Faculty: Faculty of Humanities

Educational Programme: Fundamental and Computational Linguistics (Bachelor)

Final Grade: 7

Year of Graduation: 2019

This work addresses the task of selecting a set of criteria by which a poem can be automatically classified, with the highest probability, by the type of its author: a computer or a person. As part of the study, a dataset for training and testing the classifier was collected, based on 400,000 poems from the open portal stihi.ru and 7,000 automatically generated poetic texts produced by Ilya Gusev's generator, Yandex.Automata, and Sergey Teterin's "CYBER-PUSHKIN" generator. The resulting corpus was manually annotated by author type. In addition, to better understand how neural poetry is constructed, the methods used to develop the generator models were analyzed, and the way each of these algorithms generates poetic text was examined. The poems were then analyzed with respect to a number of features, on which the classifier was trained. The features were: the semantic density of the poem, the semantic density of its adjectives and adverbs, the semantic coherence of its quatrains, the most frequent word in the poem, and the presence of alliteration, expressed through bigrams and trigrams. The distance between the semantic vectors of words was measured in three ways: the cosine of the angle between the vectors, the Euclidean distance, and the scalar product of the word vectors scaled to unit length. Training the classifier and analyzing the quality metrics showed that the features most useful for automatically determining the author of a poem are: the length of the poetic text in words, the maximum number of occurrences of the most frequent word, bigrams with and without spaces, and trigrams with spaces.
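The three ways of comparing word vectors named above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names and the sample vectors are hypothetical and are not taken from the thesis. As written here, the unit-scaled scalar product coincides numerically with the cosine, so the thesis may define that third measure differently.

```python
import math

def dot(u, v):
    """Scalar (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return dot(u, v) / (norm(u) * norm(v))

def euclidean_distance(u, v):
    """Straight-line distance between the end points of u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def normalized_dot(u, v):
    """Scalar product after scaling both vectors to unit length.

    Assumption: this is one reading of the third measure in the text.
    """
    nu, nv = norm(u), norm(v)
    u_hat = [a / nu for a in u]
    v_hat = [b / nv for b in v]
    return dot(u_hat, v_hat)

# Hypothetical 3-dimensional "word vectors", for illustration only.
word_a = [1.0, 2.0, 0.0]
word_b = [2.0, 1.0, 1.0]
print(cosine_similarity(word_a, word_b))
print(euclidean_distance(word_a, word_b))
print(normalized_dot(word_a, word_b))
```

Unlike the cosine, the Euclidean distance is sensitive to vector length as well as direction, which is one possible reason for the measures to behave differently as classifier features.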
In addition, the Euclidean distance between semantic vectors proved to be the best measure of semantic proximity for training the classifier. The best models were RandomForestClassifier and DecisionTreeClassifier, since they train well on linearly inseparable data and are not very sensitive to noise, while the worst model, for the same reasons, was LogisticRegressionCV. The trained RandomForestClassifier model was also tested separately on the output of each of the generators above to assess the quality of their neural poetry. The classifier's accuracy was lowest on the poems of Yandex.Automata, which are therefore the best in quality. The resulting classifier was packaged as a function that takes a poem as input and returns an answer to the user: whether the author of the poem is a person or a computer.
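A minimal sketch of such a function, assuming scikit-learn is installed, might look as follows. The feature encoding is an assumption: the n-grams are treated as character n-grams, and each is reduced to the count of its most repeated item as a crude alliteration signal. The training texts and their labels are made-up placeholders, not data from the thesis corpus, and the names `surface_features` and `classify_author` are hypothetical.

```python
from collections import Counter

from sklearn.ensemble import RandomForestClassifier

def top_ngram_count(text, n):
    """Number of occurrences of the most repeated character n-gram."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return max(counts.values()) if counts else 0

def surface_features(poem):
    """Five surface features loosely following the list in the abstract."""
    words = poem.lower().split()
    spaced = " ".join(words)   # characters with spaces kept
    joined = "".join(words)    # characters with spaces removed
    top_word = max(Counter(words).values()) if words else 0
    return [
        len(words),                  # length of the text in words
        top_word,                    # occurrences of the most frequent word
        top_ngram_count(spaced, 2),  # bigrams with spaces
        top_ngram_count(joined, 2),  # bigrams without spaces
        top_ngram_count(spaced, 3),  # trigrams with spaces
    ]

# Placeholder texts and labels, invented purely to make the sketch runnable.
TOY_POEMS = [
    "the river keeps the light of evening\nand carries it beyond the hill",
    "my window holds a winter morning\nthe kettle sings the street is still",
    "moon moon silver moon the moon\nsoon the moon is moon again",
    "rain rain the rain the rain rain\nthe rain again the rain the rain",
]
TOY_LABELS = ["person", "person", "computer", "computer"]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit([surface_features(p) for p in TOY_POEMS], TOY_LABELS)

def classify_author(poem):
    """Return 'person' or 'computer' for a poem given as a string."""
    return str(model.predict([surface_features(poem)])[0])

print(classify_author("snow falls on the quiet yard\nand nobody is home"))
```

With only four toy examples the prediction carries no weight. The sketch only shows the shape of the pipeline: features are extracted from the text, a tree ensemble is fitted, and a single function maps a poem to an answer.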

Full text (added June 4, 2019)

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
