National Research University Higher School of Economics: Student Theses

Student: Dmitrij Frolov
Title: Distributed System for Document Storage and Clustering
Supervisor:
Faculty: School of Software Engineering
Educational Programme: Master’s programme
Final Grade: 7
Year of Graduation: 2014
The graduate work "Distributed System for Document Storage and Clustering" is 43 pages long and contains 5 figures and 4 tables. Its list of sources has 47 items. The work consists of an introduction, a review of the literature and of similar solutions, two main chapters, a chapter on the experimental testing of the developed system, and a conclusion.

The work is devoted to the design and experimental testing of a distributed system for processing collections of text documents. It implements clustering and document retrieval methods based on the PLSI and LDA models. These methods make it possible to cluster the documents in a collection and to extract topics from it, and they are also used in the work for document retrieval. In addition, a new document retrieval method based on Abstract Suffix Trees (ASD) was designed and implemented. It includes preliminary processing of the document collection, in which particular features are selected to build a feature index. All implemented methods were included in the distributed system for processing document collections.

The implemented methods were compared experimentally on collections of goods from the "Ozon" web store, which are available for download. The results show that the author's modifications can improve document retrieval quality, especially for incomplete search queries and queries that contain errors. Distributed processing reduces the time spent processing collections. Preliminary processing of a document collection reduces the time spent executing queries and makes it possible to use Abstract Suffix Trees (ASD) as a powerful method for document retrieval.

Keywords: Latent Dirichlet Allocation (LDA), Probabilistic Latent Semantic Analysis (PLSI), Abstract Suffix Trees (ASD), Text Documents Clustering, Document Retrieval, Distributed Text Processing Systems
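
The summary does not describe how the suffix-tree retrieval method works internally, so the following is only a minimal illustrative sketch of the general idea, not the author's implementation. It assumes a character-level suffix trie with occurrence counts on each node, built per document, and a simplified scoring rule (the sum of conditional frequencies along the longest match, averaged over the query's suffixes). The names `SuffixTrie`, `build_index` and `search`, the scoring rule, and the sample product titles are all hypothetical. It shows why suffix-based matching can tolerate queries that contain errors: a misspelled query still shares long fragments with the right document.

```python
class SuffixTrie:
    """Character-level suffix trie with an occurrence count on each node."""

    def __init__(self):
        self.children = {}
        self.count = 0

    def add_string(self, text):
        # Insert every suffix of text, incrementing counts along its path.
        for start in range(len(text)):
            node = self
            node.count += 1
            for ch in text[start:]:
                node = node.children.setdefault(ch, SuffixTrie())
                node.count += 1

    def match_score(self, fragment):
        # Follow the longest prefix of fragment that exists in the trie and
        # sum the conditional frequencies count(child) / count(parent).
        node, total = self, 0.0
        for ch in fragment:
            child = node.children.get(ch)
            if child is None:
                break
            total += child.count / node.count
            node = child
        return total


def build_index(documents):
    # Preliminary processing step (assumed): one trie per document.
    index = []
    for doc in documents:
        trie = SuffixTrie()
        trie.add_string(doc.lower())
        index.append((doc, trie))
    return index


def search(index, query):
    # Score each document by the average match score of all query suffixes,
    # then return the documents ordered from best to worst match.
    query = query.lower()
    suffixes = [query[i:] for i in range(len(query))]
    results = []
    for doc, trie in index:
        score = sum(trie.match_score(s) for s in suffixes) / max(len(suffixes), 1)
        results.append((score, doc))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in results]


if __name__ == "__main__":
    # Made-up product titles; the query is deliberately misspelled.
    index = build_index(["wireless mouse", "wireless keyboard", "usb cable"])
    print(search(index, "keybord"))
```

Building the tries once, before any query arrives, mirrors the preliminary processing step mentioned in the summary: the cost of indexing is paid up front, so each query only has to walk existing tries. A plain trie of all suffixes uses memory quadratic in the document length, which is acceptable for short product titles but would need a compressed suffix tree for longer texts.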

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
