
Additive Regularization on Topic Model for Learning Complete Topic Set from Text Document Collection

Student: Ilya Golubev

Supervisor: Konstantin V. Vorontsov

Faculty: Faculty of Computer Science

Educational Programme: Data Science (Master)

Year of Graduation: 2017

Probabilistic topic modeling is a method for the statistical analysis of a collection of text documents, based on the EM algorithm. It reveals the topic structure of the collection as well as a description of the semantics of each document. The problem of topic modeling can be seen as a problem of stochastic matrix factorization, which in general does not have a unique solution. This means that factorizing the same matrix can yield different factors, which raises the question of the stability of topic models. Nevertheless, by fitting many topic models and then filtering the resulting set of all found topics, one can obtain a set of basic topics. These are the topics whose convex combinations allow us to express all other topics. The problem of completeness asks whether it is possible to find the maximum number of topics from the basic set when fitting only one topic model. In this work, we build the set of basic topics on semi-synthetic data. We then conduct an experiment showing that there is a strategy for tuning topic model regularization that leads to an increase in the completeness of the model.
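
The non-uniqueness of stochastic matrix factorization mentioned above can be illustrated with a small sketch. The matrices and the names `phi`, `theta`, and `s` are hypothetical values chosen for illustration and are not taken from the thesis. The sketch assumes a word-by-topic matrix `phi` and a topic-by-document matrix `theta`, both with columns summing to one, and shows that mixing the topics with an invertible matrix `s` gives a different pair of stochastic factors with exactly the same product.

```python
from fractions import Fraction as F  # exact arithmetic, so equality checks are reliable


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def is_column_stochastic(m):
    """True if all entries are non-negative and every column sums to 1."""
    cols = list(zip(*m))
    return all(x >= 0 for c in cols for x in c) and all(sum(c) == 1 for c in cols)


def inverse_2x2(m):
    """Inverse of a 2x2 matrix with non-zero determinant."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical toy model: 3 words, 2 topics, 2 documents.
phi = [[F(5, 10), F(1, 10)],
       [F(3, 10), F(2, 10)],
       [F(2, 10), F(7, 10)]]   # p(word | topic), columns sum to 1
theta = [[F(6, 10), F(3, 10)],
         [F(4, 10), F(7, 10)]]  # p(topic | document), columns sum to 1

# Hypothetical mixing matrix: each new topic is a convex combination of the old ones.
s = [[F(9, 10), F(2, 10)],
     [F(1, 10), F(8, 10)]]

phi_alt = matmul(phi, s)                   # new topics
theta_alt = matmul(inverse_2x2(s), theta)  # compensating document weights

same_product = matmul(phi, theta) == matmul(phi_alt, theta_alt)
```

Here `phi_alt` differs from `phi`, both pairs of factors are column-stochastic, and `same_product` is true, so the data alone cannot distinguish the two solutions. Not every mixing matrix works: `theta_alt` must remain non-negative, which is why this particular `s` was picked for the example.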

