Documents Сlustering by Type Using Inverted Index and Pairwise Classifier

Student: Darovskikh Igor

Educational Programme: Data Science (Master)

Year of Graduation: 2019

In this paper, we research a problem of documents clustering by type, and develop a realization in python. The problem has arose as an auxiliary task, while developing a product for the documents processing automation and information extraction. Within this task, each document is represented as a multi-page image and optical character recognition results, which is a list of words with coordinates on the document. Documents are to be processed sequentially and a decision must be taken on each of them - whether to classify it to some existing cluster or to create a new one. The clustering model must discriminate documents by structure that is by locations of some main objects, such as logo, emails, phones, addresses etc. The problem has been solved using a pairwise classifier - a machine learning model, trained to predict, whether documents in a given pair are from the same class or not. The paper contains a detailed description of the methods chosen to get features by documents pairs to make use of machine learning models. We also used an inverted index to decrease the number of classifications needed per document. The inverted index is a well-known data structure that allows getting the top-k most similar clusters for a new document, based on the weights of the shared words, the weights are calculated as IDF. Another valuable feature of the solution presented is that the time taken to make a decision about each document does not depend on the number of documents that have been processed before, but just on the number of clusters. While the most expensive part - classification - runs just a constant number of times. This allowed us to gain the speed of 200 documents per second in average for 1000 clusters. The result is arranged as a Python package. We managed to achieve quite good results for the pairwise classifier accuracy - 99.9% on average, as well as for the index - 99.5% for the top-5 predictions and finally 97% for the clustering model.

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.

Search all student theses