Clustering and Segmentation of Documents in EDM (Electronic Document Management) Systems Using Machine Learning Technologies

Student: Terekhin Daniil

Supervisor: Timofey Shevgunov

Faculty: Graduate School of Business

Educational Programme: Business Informatics (Bachelor)

Year of Graduation: 2021

As computing power has grown, the widespread adoption of machine learning was only a matter of time. Many sectors, including retail, banking, and various industries, are only beginning to benefit from the introduction of machine learning and artificial intelligence technologies. Given the large volumes of information these organizations hold, algorithms for segmentation, classification, clustering, and forecasting can be applied readily, helping a company scale and reach a wider audience. The more monotonous a person's work is, the easier it is to automate with available technologies.

Purpose of the work: to develop an algorithm for document segmentation by clustering each sentence using convolutional neural networks in the field of electronic document management. Research object: segmentation of information documents by classifying each sentence. Subject of research: various methods of classification and clustering in electronic document flow.

The information base of the research consists of scientific works and research by international experts in artificial intelligence and, in particular, machine learning. Notable among them are Prateek Joshi's 2019 book Artificial Intelligence with Examples in Python and François Chollet's 2018 book Deep Learning in Python. These books are the basis for describing most processes related to machine learning and neural networks.

Project objectives. To fulfill the purpose of the work, the following steps are required:
  • Generate a database of sentences for training the model
  • Mark up sentences using lemmatization and tokenization
  • Train a convolutional neural network model
  • Preprocess documents before using the model
  • Generate output for segment definition
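The objectives above can be illustrated with a minimal sketch of the final step, turning per-sentence labels into document segments. This is not the thesis implementation: the function names, the label set, and the sample text are hypothetical, the tokenizer skips lemmatization, and a simple keyword lookup stands in for the trained convolutional neural network so that the example runs with the standard library alone.

```python
import re
from itertools import groupby


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase word tokens; a real pipeline would also lemmatize them."""
    return re.findall(r"\w+", sentence.lower())


def classify(tokens):
    """Hypothetical stand-in for the trained CNN: a keyword lookup."""
    words = set(tokens)
    if words & {"pay", "payment", "invoice"}:
        return "payment"
    if words & {"deliver", "delivery", "shipment"}:
        return "delivery"
    return "other"


def segment(text):
    """Group consecutive sentences that share a predicted label."""
    labelled = [(classify(tokenize(s)), s) for s in split_sentences(text)]
    return [
        (label, [sentence for _, sentence in group])
        for label, group in groupby(labelled, key=lambda pair: pair[0])
    ]


# Illustrative document (invented for this sketch).
doc = (
    "The invoice is due in May. Payment is made by bank transfer. "
    "Delivery takes five days. The shipment is insured. Thank you."
)
segments = segment(doc)
for label, sentences in segments:
    print(label, len(sentences))
```

Grouping only consecutive sentences keeps each segment contiguous in the document, so a label that reappears later starts a new segment instead of being merged with the earlier one. Replacing `classify` with a trained model's prediction would leave the rest of the sketch unchanged.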

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.