Linguistic Features of Commit Messages Corpus

Student: Glazunov Evgenii

Educational Programme: Fundamental and Computational Linguistics (Bachelor)

Year of Graduation: 2019

Developers today widely use version control systems (VCS), which help them track changes in documents, program files and projects as a whole. Contributors supply a comment (message) for each set of changes (commit), so that all project participants can understand what has been changed. These messages are written in natural language and have specific linguistic features. The only prior research on commit message data has concerned sentiment analysis, so the data remains linguistically unexplored. In this situation, a broad study covering different aspects and levels of language may help to define interesting research questions and directions, since the texts represent a specific language layer.

The first aim is to identify the most obvious features of commit messages and to provide a theoretical interpretation of the observed phenomena from the perspective of context, pragmatics, information structure and discourse. The second aim is to explore the syntax of commit messages and their morphological features. Some of these features can be explained in the theoretical section, but others need to be described more closely. The last aim is to describe the vocabulary. As a domain-specific corpus, commit messages have their own vocabulary, so it is important to explore its size, its frequent words (those that differ from general English texts) and its collocations.

The source of the commits is GitHub, one of the leading software development platforms. The sample in this work contains approximately 80 thousand projects with 75 million commits containing 1 billion tokens. Twitter was chosen as the reference corpus, as an example of short texts and computer-mediated communication. The Twitter corpus contains 400 thousand messages (6 million tokens). The instruments used in this work are Python 3, a Universal Dependencies parser, semantic vector models and a MySQL database.
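As an illustration of how frequent words can be contrasted with a reference corpus, the sketch below scores each word by the log-ratio of its smoothed relative frequency in commit messages versus reference messages. It is not taken from the thesis: the regular-expression tokenizer, the add-0.5 smoothing and the function names are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Assumption: a crude lowercase tokenizer; the thesis does not specify one.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split a message into lowercase word-like tokens."""
    return TOKEN_RE.findall(text.lower())


def count_tokens(messages):
    """Return (token counts, total number of tokens) for a list of messages."""
    counts = Counter(tok for msg in messages for tok in tokenize(msg))
    return counts, sum(counts.values())


def log_ratio(target_msgs, reference_msgs, smoothing=0.5):
    """Score each word by log2 of its smoothed relative frequency in the
    target corpus divided by that in the reference corpus.

    Positive scores mark words that are more typical of the target corpus.
    """
    t_counts, t_total = count_tokens(target_msgs)
    r_counts, r_total = count_tokens(reference_msgs)
    vocab = set(t_counts) | set(r_counts)
    scores = {}
    for word in vocab:
        # Additive smoothing keeps words absent from one corpus finite.
        t_freq = (t_counts[word] + smoothing) / (t_total + smoothing * len(vocab))
        r_freq = (r_counts[word] + smoothing) / (r_total + smoothing * len(vocab))
        scores[word] = math.log2(t_freq / r_freq)
    return scores


if __name__ == "__main__":
    # Toy data, invented for the example.
    commits = ["fix typo in readme", "fix build", "add tests"]
    tweets = ["love this song", "fix my day please", "this is great"]
    ranked = sorted(log_ratio(commits, tweets).items(), key=lambda kv: -kv[1])
    for word, score in ranked[:5]:
        print(f"{word}\t{score:.2f}")
```

On a corpus of the size described above, the counting step would normally be done in streaming fashion or inside the database rather than in memory, but the scoring logic stays the same.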
To achieve a better understanding of the linguistic features of commit messages, this research uses both classic and modern NLP methods in combination with data analysis: automatic syntactic analysis, comparison with a reference corpus, a frequency dictionary, semantic vector models, clustering of semantic vectors, calculation of collocation metrics, sentiment analysis, time-series analysis and others.

Commit messages describe changes so that everyone can track them. They do not have to be well-formed text as long as communication is successful. Studies that explore the features of short texts mainly describe Internet communication (e.g. Twitter), newspaper headlines, etc. The general field relevant to this genre is computer-mediated communication (CMC), since these messages are digital and developers write them on a computer. Along with CMC, commit messages may be compared with newspaper headlines because they share some features.

This exploratory research gives an overview of the linguistic features of commit messages from the perspective of different branches of language studies. Commit messages follow the linguistic profile of computer-mediated communication and newspaper headlines. They demonstrate subject omission, various abbreviation strategies and other kinds of language reduction. These changes may be explained by discursive factors and some general grammatical processes, and they also signal the formation of a genre. The vocabulary shares features with CMC and also has genre-specific ones. This work provides an overview of semantic clusters, abbreviation strategies, collocations and some other features of this specific genre. The study contributes to the description of domain-specific corpora and to short text analysis. In terms of methodology, it provides a pipeline for research on large domain-specific corpora and short texts using automatic natural language processing tools.
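One common collocation metric is pointwise mutual information (PMI). The sketch below computes it for adjacent word pairs; the abstract does not say which metrics the thesis actually uses, so the choice of PMI, the whitespace tokenization, the `min_count` threshold and the function name are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_pmi(messages, min_count=2):
    """Compute PMI for adjacent word pairs that occur at least min_count times.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated from raw unigram and bigram counts.
    """
    unigrams = Counter()
    bigrams = Counter()
    for msg in messages:
        # Assumption: whitespace tokenization is enough for the example.
        tokens = msg.lower().split()
        unigrams.update(tokens)
        # Pair each token with its right neighbour inside the same message.
        bigrams.update(zip(tokens, tokens[1:]))

    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unstable PMI estimates
        p_xy = count / n_bigrams
        p_x = unigrams[w1] / n_unigrams
        p_y = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    # Toy messages, invented for the example.
    sample = [
        "merge pull request from fork",
        "merge pull request again",
        "fix pull up bug",
        "merge branch master",
    ]
    for pair, score in sorted(bigram_pmi(sample).items(), key=lambda kv: -kv[1]):
        print(" ".join(pair), f"{score:.2f}")
```

The frequency threshold matters because PMI tends to overrate pairs of rare words; alternative measures such as the t-score or log-likelihood weigh frequency differently.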
