Classification of Engineered Open Source Projects through Machine Learning

Student: Kozhevnikov Dmitriy

Supervisor: Dmitry Pantiukhin

Faculty: Faculty of Computer Science

Educational Programme: System and Software Engineering (Master)

Final Grade: 10

Year of Graduation: 2017

Online software forges such as GitHub, Bitbucket, and GitLab are an abundant source of information that can be mined to study code architecture, code longevity, project life cycle and evolution, programming language usage, and related topics. However, a large number of repositories hold student assignments or small software experiments, or simply serve as generic storage. Including such repositories in an analysis may considerably distort the results of a study, so unsuitable projects should be filtered out. In this work we propose an evaluation framework for software repositories that formally expresses the features making up the notion of an engineered software project. The framework uses both meta-information acquired from the repository and data obtained from static analysis of the source code. Feature collection is automated with a tool called Reporanger, which performs extraction for a given repository without human intervention. We select and label a set of 300 repositories to train a number of classifiers based on machine learning algorithms. The best-performing classifier uses a Decision Forest model and achieves Area Under Curve = 0.958 and F-Score = 0.906, which is significantly better than the classifier based on the conventional stargazer criterion (Area Under Curve = 0.821 and F-Score = 0.782). Our research not only produces a software project classifier, but also identifies the most impactful features, those that best highlight the differences between engineered and non-engineered projects.
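As a rough illustration of how the two reported metrics can be computed for a stargazer-based baseline, the sketch below evaluates a simple star-count threshold on a handful of labeled repositories. It is not the thesis's implementation and does not use Reporanger: the function names, the threshold of 20 stars, and the six sample repositories are all hypothetical and chosen only for the example.

```python
def f_score(labels, predictions):
    """F1 score for binary labels (1 = engineered, 0 = not engineered)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def stargazer_baseline(stars, threshold):
    """Hypothetical baseline: a repository counts as engineered
    if it has at least `threshold` stargazers."""
    return [s >= threshold for s in stars]


# Made-up sample: manual labels and star counts for six repositories.
labels = [1, 1, 1, 0, 0, 0]
stars = [120, 15, 40, 3, 30, 0]

predictions = stargazer_baseline(stars, threshold=20)
print("F-Score:", round(f_score(labels, predictions), 3))
print("AUC:", round(auc(labels, stars), 3))
```

The same two functions could be applied to the probability scores of any trained classifier, which is how a comparison against the star-count baseline would be set up under these assumptions.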

Full text (added June 4, 2017)
