Кожевников Дмитрий Денисович
Classification of Engineered Open Source Projects through Machine Learning
System and Software Engineering
Online software forges such as GitHub, Bitbucket, and GitLab are an abundant source of information that can be mined to better understand code architecture, code longevity, project life cycle and evolution, programming language usage, and more. However, a considerable number of repositories hold student assignments or small software experiments, or simply serve as generic storage. Including such repositories in an analysis may considerably distort the results of a study, so inappropriate projects should be filtered out. In this work we propose an evaluation framework for software repositories that formally expresses the features that make up the notion of an engineered software project. Our framework uses both meta-information acquired from the repository and data obtained from static analysis of the source code. Feature collection is automated by a tool called Reporanger, which extracts the features for a given repository without human intervention. We select and label a set of 300 repositories to train a number of classifiers based on machine learning algorithms. The best-performing classifier uses a Decision Forest model that achieves Area Under Curve = 0.958 and F-Score = 0.906, which is significantly better than a classifier based on the conventional stargazer criterion (Area Under Curve = 0.821 and F-Score = 0.782). Our research not only produces a software project classifier but also identifies the most impactful features, those that best highlight the differences between engineered and non-engineered projects.
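The two metrics quoted in the abstract can be illustrated with a minimal sketch of how a stargazer-based baseline might be scored. Everything in it is an assumption made for illustration: the labels and star counts are synthetic, the threshold of 50 stars is hypothetical, and F-Score is taken to mean the balanced F1 measure, which the abstract does not specify. It is not the data or the evaluation code of the thesis.

```python
from typing import Sequence


def f_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Balanced F1: harmonic mean of precision and recall for class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive example gets the
    higher score; ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Synthetic toy data (hypothetical): 1 = engineered, 0 = non-engineered.
labels = [1, 1, 1, 0, 0, 0]
stars = [120, 40, 3, 55, 2, 0]

# Hypothetical stargazer criterion: "engineered" if at least 50 stars.
STAR_THRESHOLD = 50
stargazer_pred = [1 if s >= STAR_THRESHOLD else 0 for s in stars]

print("F-score:", f_score(labels, stargazer_pred))
print("AUC:", auc(labels, stars))
```

The F-score depends on the chosen threshold, while the AUC uses the raw star counts as a ranking score and so summarizes the criterion across all thresholds. This is why the two numbers are reported side by side.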
Full text (added June 4, 2017)