Topic Modelling

Student: Chacko Steve

Supervisor: Dmitry Ilvovsky

Faculty: Faculty of Computer Science

Educational Programme: Master of Data Science (Master)

Year of Graduation: 2024

Abstract

The volume of text data is large and growing rapidly, so we need effective ways to organize and understand it. Topic modelling, a popular technique in Natural Language Processing and Machine Learning, uncovers the underlying thematic structure of a collection of documents. In this paper we examine two popular topic modelling algorithms, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), on a data set of movie descriptions from Wikipedia. The paper begins by explaining how LSA and LDA work, together with the preprocessing techniques and the metrics used to evaluate topic models. We then apply the two algorithms to the data set and report, for each model, the best number of topics as determined by the coherence score. Finally, we discuss the findings and conclude that, for the Wikipedia movie collection, the best result is obtained with an LDA model with 5 topics.

Keywords: Topic Modelling, LSA, LDA, Natural Language Processing, Machine Learning
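To make the idea of choosing the number of topics by coherence score concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the method used in the thesis: the abstract does not say which coherence measure was used, so the sketch assumes the UMass variant, and the function names, the toy documents, and the candidate topics are all hypothetical.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of one topic (assumed measure, see lead-in).

    top_words: the topic's top words, most probable first.
    documents: tokenized documents (lists of words).
    Sums log((D(w_m, w_l) + 1) / D(w_l)) over all pairs with l < m,
    where D counts the documents containing the given word(s).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # word never occurs in the corpus; skip the pair
            co = doc_freq(top_words[m], top_words[l])
            score += math.log((co + 1) / d_l)
    return score


def mean_coherence(topics, documents):
    """Average UMass coherence over all topics of one model."""
    return sum(umass_coherence(t, documents) for t in topics) / len(topics)


def best_num_topics(candidates, documents):
    """candidates maps a number of topics to that model's topics
    (each topic given as a list of top words). Returns the number of
    topics whose model has the highest mean coherence."""
    return max(candidates, key=lambda k: mean_coherence(candidates[k], documents))


# Hypothetical toy corpus standing in for tokenized movie descriptions.
docs = [
    ["space", "alien", "ship"],
    ["space", "ship", "crew"],
    ["love", "wedding", "family"],
    ["love", "family", "home"],
]

# Hypothetical top words for a 1-topic and a 2-topic model.
candidates = {
    1: [["space", "love"]],
    2: [["space", "ship"], ["love", "family"]],
}

best_k = best_num_topics(candidates, docs)
print(best_k)
```

In practice the candidate topics would come from fitted LSA or LDA models with different numbers of topics, and the same comparison would be run over each of them. Higher (less negative) coherence indicates that a topic's top words tend to occur in the same documents.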
