
Usage of Embedding-based Topic Modeling for Feature Extraction in Software Engineering Tasks

Student: Bogomolov Egor

Supervisor: Timofey Bryksin

Faculty: St. Petersburg School of Physics, Mathematics, and Computer Science

Educational Programme: Software Development and Data Analysis (Master)

Year of Graduation: 2021

Applying machine learning models to Software Engineering tasks in practice requires high model accuracy. However, according to existing work, even high accuracy does not guarantee practical usefulness when the predictions are shown to real developers. One possible way to address this problem is to use interpretable models. A common approach to extracting information in interpretable form from large corpora of textual data is topic modeling. Previous work on topic modeling in the Software Engineering domain employed topic modeling algorithms transferred directly from the Natural Language Processing field. This work introduces Code2Topic, an algorithm for topic modeling of source code that takes into account characteristics typical of code, such as large vocabulary size. Code2Topic can transform any piece of code into a distribution over the topics it contains. This information can then be used in practical tasks. Based on the Code2Topic algorithm, this work presents Sosed, a tool for finding similar repositories in a database of 9 million open-source projects. Sosed transforms projects into topic distributions and searches for similar distributions, treating them as vectors. Manual labeling of Sosed’s predictions for a dataset of 94 popular GitHub projects showed that its predictions are relevant: the average relevance among the top-5 predictions is 4.2 out of 5. Another contribution of this work is Dev2Topic, an adaptation of the Code2Topic algorithm to the code of individual developers. It represents a developer’s expertise as a distribution over the topics they have worked with. This representation turned out to be useful in the task of developer recommendation: on a dataset of 9700 issues mined from the YouTrack issue tracker, prediction accuracy improved from 68% to 75%. The use of interpretable features made it possible to keep the models’ predictions interpretable in both the developer recommendation and the similar project suggestion tasks.
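
To illustrate what searching for similar topic distributions as vectors might look like, here is a minimal Python sketch. It is not Sosed’s actual implementation: the summary does not say which similarity measure the tool uses, so cosine similarity is an assumption, and the project names, the three topics, and the functions `cosine_similarity` and `most_similar` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, projects, k=5):
    """Return the k (name, score) pairs most similar to `query`, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in projects.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical topic distributions over three made-up topics
# (say: web, networking, linear algebra). Each vector sums to 1.
projects = {
    "web-framework": [0.70, 0.20, 0.10],
    "http-client": [0.60, 0.30, 0.10],
    "matrix-library": [0.05, 0.05, 0.90],
}

query = [0.65, 0.25, 0.10]  # topic distribution of the project being looked up
top = most_similar(query, projects, k=2)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

Because each coordinate corresponds to a topic, a match can be explained by pointing at the topics on which the two vectors both place high weight, which is the kind of interpretability the abstract refers to. A linear scan like this would not scale to millions of projects; a real system would presumably need an index for nearest-neighbor search.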

