Multi-Label Classification of Scientific Papers: Detecting Field of Study from Abstract with Machine Learning

Student: Wolf Elena

Supervisor: Evgeny Sokolov

Faculty: Faculty of Humanities

Educational Programme: Fundamental and Computational Linguistics (Bachelor)

Year of Graduation: 2021

Scientific papers are increasingly published online rather than in printed journals. Assigning topic labels to papers would improve search. Because of the large number of papers already available on the Internet and the number appearing every day, labeling them manually is nearly impossible, so an automatic classification tool is needed. Developing a label-assignment algorithm is also of research interest, since it means solving a classification problem on highly imbalanced data. This study covers about 2 million texts, more than 200 thousand of which belong to the largest class and only about 4 thousand to the smallest one. In contrast to the keyword extraction task, the set of possible labels was limited. There are 200 labels and thousands of their combinations, and one item can have an arbitrary number of labels. This paper considers several methods of text classification. We decided to perform a binary classification for each label and combine the results to get a set of labels. The best strategy turned out to be TF-IDF vectorization followed by a Support Vector Machine classifier. This strategy outperformed even a BERT-like transformer model (a possible explanation of this result is provided). It achieves the following metrics, averaged over all labels: accuracy 0.99, recall 0.61, precision 0.62. Considering the number of classes, their severe imbalance, and markup errors, we believe the results are satisfactory. The trained TF-IDF vectorizer and SVM models were used to build a web service where the labeling algorithm can be tried out.
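The per-label binary classification described above can be illustrated with a minimal scikit-learn sketch. This is not the thesis code: the toy abstracts, the three label names, the helper names `predict_labels` and `macro_scores`, and the `class_weight="balanced"` setting are all assumptions made for illustration. `OneVsRestClassifier` fits one binary linear SVM per label on shared TF-IDF features, and the per-label decisions are combined into a label set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy abstracts and label sets (hypothetical, for illustration only).
abstracts = [
    "We train a neural network for image recognition.",
    "A convolutional neural network classifies images of galaxies.",
    "We measure the spectra of distant galaxies and stars.",
    "Stars in distant galaxies show unusual spectra.",
    "A neural network predicts protein folding in cells.",
    "Protein expression in cells is studied with microscopy.",
]
label_sets = [
    {"computer science"},
    {"computer science", "astronomy"},
    {"astronomy"},
    {"astronomy"},
    {"computer science", "biology"},
    {"biology"},
]

# Turn each label set into a 0/1 row: one column per label.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(label_sets)  # shape: (n_papers, n_labels)

# One binary linear SVM per label on top of shared TF-IDF features.
# class_weight="balanced" is an assumption here, one common way to
# counter class imbalance; the abstract does not state what was used.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LinearSVC(class_weight="balanced")),
)
model.fit(abstracts, Y)


def predict_labels(texts):
    """Combine the per-label binary decisions into one label set per text."""
    indicator = model.predict(texts)
    return [set(labels) for labels in binarizer.inverse_transform(indicator)]


def macro_scores(y_true, y_pred):
    """Average accuracy, precision and recall over labels."""
    return {
        # Every label has the same number of rows, so the mean over all
        # cells equals the mean of the per-label accuracies.
        "accuracy": float((y_true == y_pred).mean()),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
    }


print(predict_labels(["A neural network that classifies spectra of stars."]))
print(macro_scores(Y, model.predict(abstracts)))
```

With many rare labels, most cells of the 0/1 label matrix are true negatives, so per-label accuracy can be close to 1 even when precision and recall are much lower. This is consistent with the gap between the reported accuracy of 0.99 and the precision and recall of about 0.6.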
