An Automatic System for Tagging of Scientific Articles

Student: Baranov Alexander

Faculty: HSE Tikhonov Moscow Institute of Electronics and Mathematics (MIEM HSE)

Educational Programme: Information Science and Computation Technology (Bachelor)

Final Grade: 10

Year of Graduation: 2020

Recognizing the layout of unstructured documents is an important step toward analyzing them in a structured, machine-readable format. PDF is the most widespread format for scientific articles, so the subject of this final qualification work is an automatic system for tagging scientific articles in PDF format. The aim of the work is to create a client-server system that extracts text and non-text information from scientific articles in PDF format and presents it in docx format while preserving the original hierarchy of blocks in the publication. The system is written in Python. It operates as follows: a Mask R-CNN segmentation model selects information blocks from the image of a PDF page and classifies them (text, title, list, figure, table); the blocks are sorted according to the original document hierarchy; then information extraction modules are applied to the blocks. Text is extracted with the pdftotext library or with Tesseract OCR, depending on whether the PDF document has a text layer. Tables are extracted with the PDFPlumber library or kept as images. Formulas are extracted using the ScanSSD model. After extraction, the system creates the final document in docx format and additional folders with the extracted non-text elements, and packs them into a zip archive. The system is divided into client and server parts, implemented using the REST architectural style and the Flask library. Both parts are packaged in Docker containers. The final qualification work consists of 69 pages and contains 12 figures and 6 appendices. The source code is available at https://github.com/owls-nlp
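
The processing order described in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis code: the `Block` structure, the function names, and the stub extractors are hypothetical, and the sort is a plain top-to-bottom, left-to-right ordering that stands in for the hierarchy-aware sorting the system actually performs. No segmentation model or extraction library is called; the sketch only shows how classified blocks might be ordered and routed to per-class extraction modules, and how the text backend might be chosen.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Block classes named in the abstract.
BLOCK_CLASSES = ("text", "title", "list", "figure", "table")


@dataclass
class Block:
    """One classified region of a page image (hypothetical structure)."""
    label: str   # one of BLOCK_CLASSES
    x0: float    # left edge
    y0: float    # top edge (y grows downward)
    x1: float    # right edge
    y1: float    # bottom edge


def sort_reading_order(blocks: List[Block]) -> List[Block]:
    """Order blocks top to bottom, then left to right.

    Assumption: a single-column page. This is a simplified stand-in
    for the hierarchy-aware sorting described in the abstract.
    """
    return sorted(blocks, key=lambda b: (b.y0, b.x0))


def choose_text_backend(has_text_layer: bool) -> str:
    """Pick the text extraction tool based on the presence of a text layer."""
    return "pdftotext" if has_text_layer else "tesseract"


def extract_page(
    blocks: List[Block],
    extractors: Dict[str, Callable[[Block], str]],
) -> List[Tuple[str, str]]:
    """Apply the extractor registered for each block class, in reading order."""
    results = []
    for block in sort_reading_order(blocks):
        if block.label not in extractors:
            raise ValueError(f"no extractor registered for {block.label!r}")
        results.append((block.label, extractors[block.label](block)))
    return results


# Stub extractors (hypothetical): a real module would crop the block from
# the page and call the appropriate text, table, or image extraction tool.
stub_extractors = {
    label: (lambda b, label=label: f"<{label} at y={b.y0}>")
    for label in BLOCK_CLASSES
}

page_blocks = [
    Block("text", 50, 200, 550, 400),
    Block("title", 50, 20, 550, 60),
    Block("table", 50, 80, 550, 180),
]
page_result = extract_page(page_blocks, stub_extractors)
```

Keeping the extractors in a dictionary keyed by block class means a new block type only needs a new entry, without changes to the ordering or dispatch logic.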

Full text (added May 28, 2020)
