Modeling Genre and Topic Features of Text for Advertisement Detection

Student: Nikiforova Anastasiia

Supervisor: Alexey Malafeev

Faculty: Faculty of Humanities (Nizhny Novgorod)

Educational Programme: Fundamental and Applied Linguistics (Bachelor)

Year of Graduation: 2018

Advances in Natural Language Processing and Machine Learning are broadening the scope of what technology can do in everyday life. As a result, interest in these fields is growing, supported by the availability of large collections of documents online. One of the tasks solved with the help of text corpora is topic modeling, a method of building models of text documents that determine the correlations between topics and documents. Owing to their versatility and extensibility, modern methods of topic modeling are used in a wide range of applications, such as resolving term polysemy, performing text classification and annotation, and analyzing newsfeeds. In other words, topic modeling allows us to automatically systematize and summarize corpora that are too large for manual processing. A number of different topic models now exist. The widely known methods of probabilistic latent semantic analysis (PLSA) (Hofmann 1999) and latent Dirichlet allocation (LDA) (Blei et al. 2003) are universal and distinguish topics corresponding to broad subject areas. However, the topic model in the present study has to automatically identify distinctive terms of predefined genres. Therefore, a regularized topic model is used, as it focuses on identifying topics that make it possible to divide documents into relevant and irrelevant classes (Rubin 2012). Although there are many scientific publications on topic modeling, none of them applies topic models to identifying commercial texts. The aim of the current study is therefore to develop a genre classification algorithm based on topic modeling for automatic advertisement detection. The study addresses several tasks: building a custom corpus of texts, manually classifying the test sample as part of a semi-supervised machine learning process, creating a topic model for future classification of out-of-corpus texts, evaluating the model, and analyzing the results of automatic advertisement detection.
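
The idea of classifying an out-of-corpus text with an already trained topic model can be illustrated with a minimal Python sketch. It is not the regularized model used in the study: it assumes a plain PLSA-style mixture, p(w|d) = sum over t of p(w|t) p(t|d), with only two topics, and it estimates the topic shares of a new document by EM while the word distributions stay fixed. The table PHI, its probabilities, the topic names, the function names and the 0.5 threshold are all invented for illustration.

```python
from collections import Counter

# Toy p(word | topic) values, invented for illustration only;
# a real model would learn these from the training corpus.
PHI = {
    "advertisement": {"buy": 0.4, "discount": 0.4, "model": 0.1, "corpus": 0.1},
    "other": {"buy": 0.05, "discount": 0.05, "model": 0.45, "corpus": 0.45},
}


def infer_topic_shares(tokens, phi, iterations=50):
    """Estimate p(topic | document) by EM with p(word | topic) held fixed."""
    vocabulary = set().union(*(dist.keys() for dist in phi.values()))
    # Words unknown to the model carry no information and are skipped.
    counts = Counter(tok for tok in tokens if tok in vocabulary)
    topics = list(phi)
    theta = {t: 1.0 / len(topics) for t in topics}  # uniform starting point
    if not counts:
        return theta
    for _ in range(iterations):
        expected = {t: 0.0 for t in topics}
        for word, n in counts.items():
            # Mixture probability of the word: sum_t p(w|t) * p(t|d)
            p_word = sum(phi[t].get(word, 0.0) * theta[t] for t in topics)
            if p_word == 0.0:
                continue
            for t in topics:
                # E-step: share of this word's occurrences assigned to topic t
                expected[t] += n * phi[t].get(word, 0.0) * theta[t] / p_word
        norm = sum(expected.values())
        if norm == 0.0:
            break
        # M-step: renormalize the expected counts into topic shares
        theta = {t: expected[t] / norm for t in topics}
    return theta


def is_advertisement(tokens, phi=PHI, threshold=0.5):
    """Flag a document when the advertisement topic share exceeds the threshold."""
    return infer_topic_shares(tokens, phi)["advertisement"] > threshold


if __name__ == "__main__":
    print(is_advertisement(["buy", "discount", "buy"]))
    print(is_advertisement(["model", "corpus", "corpus"]))
```

In this sketch the decision depends only on the estimated share of the advertisement topic, so the threshold controls the trade-off between missed advertisements and false alarms.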

