
Student: Daria Mutovkina
Title: Applying Hierarchical Text Clustering for Building Category Tree on a Classified Website
Educational Programme: Business Informatics (Master’s programme)
Year of Graduation: 2016
The rapid growth in the volume of user-generated online content has created demand for efficient classification of such data. Document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. This paper proposes a clustering technique intended to build a category tree for the classified-ads website Avito.

The paper presents a comprehensive study of document clustering algorithms that build hierarchical solutions using different criterion functions and merging schemes, and proposes the method most appropriate for the data's volume, dimensionality, and noisiness: the density-based clustering algorithm OPTICS. The algorithm is implemented and optimized by creating spatial indexes in the database.
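To illustrate how density-based clustering with OPTICS separates dense groups from noise, here is a minimal sketch using scikit-learn. It is not the thesis implementation: the toy 2-D data, the parameter values (`min_samples=5`, `eps=1.0`), and the in-memory k-d tree (used here as a stand-in for the database spatial indexes mentioned above) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)

# Toy data (assumption): two tight groups plus a few isolated points.
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(30, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(30, 2))
noise = np.array([[-8.0, 9.0], [12.0, -7.0], [2.5, 12.0],
                  [-6.0, -6.0], [10.0, 10.0]])
X = np.vstack([blob_a, blob_b, noise])

# algorithm="kd_tree" makes neighbor queries use a spatial index;
# cluster_method="dbscan" cuts the reachability ordering at a fixed eps.
model = OPTICS(min_samples=5, cluster_method="dbscan", eps=1.0,
               algorithm="kd_tree")
labels = model.fit_predict(X)

# Label -1 marks points treated as noise.
n_clusters = len(set(labels) - {-1})
print("clusters:", n_clusters, "noise points:", int((labels == -1).sum()))
```

The reachability ordering is also available as `model.ordering_` and `model.reachability_`, which is what a hierarchical reading of the result would be based on.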

In addition, several dimensionality reduction methods are introduced to address data scalability: TF-IDF matrix pruning, feature extraction with latent semantic analysis, and feature selection with random forest. All techniques are implemented and tested on a corpus of 15,000 texts. A comparative analysis of the outputs resulted in a final list of clusters that was proposed to business stakeholders.
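Two of the reduction steps named above can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the six example ads, the pruning thresholds (`min_df=2`, `max_df=0.8`), and the number of latent components are hypothetical and not taken from the thesis. Random-forest feature selection is left out because it needs labelled documents, which the abstract does not describe.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical classified-ad texts (assumption, not thesis data).
docs = [
    "selling red bicycle almost new",
    "mountain bicycle for sale",
    "kids bicycle with helmet",
    "two room apartment for rent",
    "apartment for rent near metro",
    "studio apartment city center",
]

# TF-IDF pruning: drop terms that occur in fewer than 2 documents
# or in more than 80% of them.
vectorizer = TfidfVectorizer(min_df=2, max_df=0.8)
tfidf = vectorizer.fit_transform(docs)
kept_terms = sorted(vectorizer.vocabulary_)

# Latent semantic analysis: truncated SVD of the pruned TF-IDF matrix.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print("kept terms:", kept_terms)
print("matrix shape after pruning:", tfidf.shape)
print("shape after LSA:", reduced.shape)
```

The reduced matrix is what a clustering step would consume in place of the full term space.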

