Comparative Analysis of Search Query Clustering Algorithms Based on Side Information

Student: Chuev Ivan

Educational Programme: Software Engineering (Bachelor)

Year of Graduation: 2020

Data analysts often have to deal with very large amounts of data. Since analyzing that many objects individually is time-consuming, clustering is often used to form groups of objects and then analyze the resulting clusters as a whole. Clustering has applications in many fields, including biology, medicine, the World Wide Web, social science, and computer science. Its most common application is to determine internal relationships, dependencies, similarities, or differences between objects. It is also often used to study the distribution of object characteristics: the analyst groups the objects using a clustering method and then studies how the observed characteristic depends on the obtained groups. A number of popular clustering methods are widely used for data analysis, such as K-Means, Mean-Shift, and DBSCAN. However, it is not uncommon for these methods to fail at this task: the observed characteristic is distributed randomly across the clusters, and it is very difficult to draw any conclusions about it by examining the resulting groups. This research considers algorithms that use the observed characteristic as additional (side) information during clustering, so that the resulting clusters describe this characteristic in as much detail as possible. In such clusters, objects belonging to the same cluster have similar values of the observed characteristic, which makes it easier to identify how the characteristic depends on groups of objects.

Keywords: Data clustering, Semi-supervised learning, Pairwise constraints, Clustering with side information
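The goal stated in the abstract, that objects in the same cluster should have similar values of the observed characteristic, can be illustrated with a simple score. The sketch below is not taken from the thesis: the function name `within_cluster_variance_ratio`, the toy data, and the choice of a variance-based score are all assumptions made for illustration. It computes the share of the characteristic's variance that remains inside clusters, so a value near 0 suggests the clusters describe the characteristic well and a value near 1 suggests the characteristic is spread almost randomly across clusters.

```python
import numpy as np


def within_cluster_variance_ratio(labels, y):
    """Share of the variance of y left inside clusters.

    Hypothetical helper (not from the thesis): 0.0 means every cluster
    has a constant value of y; values near 1.0 mean the clusters say
    little about y.
    """
    labels = np.asarray(labels)
    y = np.asarray(y, dtype=float)
    total = np.sum((y - y.mean()) ** 2)
    if total == 0.0:
        # y is constant, so any clustering describes it trivially.
        return 0.0
    within = 0.0
    for c in np.unique(labels):
        yc = y[labels == c]
        within += np.sum((yc - yc.mean()) ** 2)
    return within / total


# Assumed toy data: six objects with an observed characteristic y.
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

# A clustering that agrees with y, and one that mixes its values.
aligned_labels = np.array([0, 0, 0, 1, 1, 1])
mixed_labels = np.array([0, 1, 0, 1, 0, 1])

ratio_aligned = within_cluster_variance_ratio(aligned_labels, y)
ratio_mixed = within_cluster_variance_ratio(mixed_labels, y)
print(ratio_aligned, ratio_mixed)
```

On this toy data the aligned clustering leaves no variance inside clusters, while the mixed clustering leaves most of it unexplained, which corresponds to the failure case described above.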

