Student: Glazunov Evgeny Vladimirovich (Глазунов Евгений Владимирович)
Thesis title: Linguistic Features of Commit Messages Corpus
Year of defense: 2019
Developers today routinely use version control systems (VCS), which help them track changes to documents, program files and whole projects. Contributors attach a comment (a commit message) to each set of changes (a commit), so that all project participants can understand what has changed. These messages are written in natural language and have distinctive linguistic features.

The only existing research on commit message data has concerned sentiment analysis, so the data remain linguistically unexplored. In this situation, a broad study covering different aspects and levels of language may help to define interesting research questions and directions, since the texts represent a specific language layer.

The first aim is to identify the most salient features of commit messages and to interpret the observed phenomena theoretically from the perspective of context, pragmatics, information structure and discourse.

The second aim of the research is to explore the syntax of commit messages and their morphological features. Some of these features can be explained in the theoretical section, but others need to be described in more detail.

The last aim of the research is to describe the vocabulary. As a domain-specific corpus, commit messages have their own vocabulary. For such a corpus it is important to explore the size of the vocabulary, the frequent words (those that differ from general English texts) and collocations.
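
One simple way to find words that are frequent in a target corpus but not in a reference corpus is to compare relative frequencies. The sketch below is only an illustration under stated assumptions: the two token lists are invented toy data, and the function names and the smoothing constant are hypothetical choices, not the procedure used in the thesis.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def frequency_ratio(target_tokens, reference_tokens, smoothing=1e-6):
    """Ratio of relative frequency in the target corpus to the reference corpus.

    The small smoothing constant (an assumption of this sketch) avoids
    division by zero for words absent from the reference corpus.
    """
    target = relative_frequencies(target_tokens)
    reference = relative_frequencies(reference_tokens)
    return {
        word: freq / (reference.get(word, 0.0) + smoothing)
        for word, freq in target.items()
    }


# Invented toy data standing in for commit messages and tweets.
commits = "fix typo in readme fix build add tests".split()
tweets = "just had coffee in the park fix my mood".split()

ratios = frequency_ratio(commits, tweets)
# Words with the highest ratio are the most characteristic of the target corpus.
for word in sorted(ratios, key=ratios.get, reverse=True)[:3]:
    print(word, round(ratios[word], 2))
```

Words that never occur in the reference corpus get very large ratios, so on real data a minimum frequency threshold would usually be applied first.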

The source of commits is GitHub, one of the leading software development platforms. The sample in this work contains approximately 80 thousand projects with 75 million commits containing 1 billion tokens. Twitter was chosen as the reference corpus, as an example of short texts and computer-mediated communication. The Twitter corpus has 400 thousand messages (6 million tokens).
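
The rounded corpus sizes above already allow a rough comparison of message length. The following sketch only does that arithmetic; the constant and function names are illustrative, and because the input figures are approximate, the results are approximate too.

```python
# Approximate corpus sizes as stated in the abstract.
COMMIT_MESSAGES = 75_000_000
COMMIT_TOKENS = 1_000_000_000
TWEETS = 400_000
TWEET_TOKENS = 6_000_000


def mean_tokens_per_message(tokens, messages):
    """Average number of tokens per message."""
    return tokens / messages


commit_mean = mean_tokens_per_message(COMMIT_TOKENS, COMMIT_MESSAGES)
tweet_mean = mean_tokens_per_message(TWEET_TOKENS, TWEETS)

print(f"commit messages: {commit_mean:.1f} tokens on average")  # 13.3
print(f"tweets: {tweet_mean:.1f} tokens on average")            # 15.0
```

On these figures the two corpora have messages of similar average length, which is consistent with treating Twitter as a short-text reference.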

The instruments used in this work are Python 3, a Universal Dependencies parser, semantic vector models and a MySQL database.

To gain a better understanding of the linguistic features of commit messages, this research uses both classic and modern NLP methods in combination with data analysis: automatic syntactic analysis, comparison with a reference corpus, a frequency dictionary, semantic vector models, clustering of semantic vectors, calculation of collocation metrics, sentiment analysis, time-series analysis and some others.
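
The abstract does not say which collocation metrics were calculated. As one common example, pointwise mutual information (PMI) for adjacent word pairs can be sketched as follows. The messages are invented toy data, whitespace tokenization is a simplification, and `bigram_pmi` and `min_count` are hypothetical names chosen for this illustration.

```python
import math
from collections import Counter


def bigram_pmi(messages, min_count=2):
    """Compute PMI (log base 2) for adjacent word pairs within messages.

    Pairs seen fewer than min_count times are skipped, because PMI
    is unreliable for rare events.
    """
    unigrams = Counter()
    bigrams = Counter()
    for message in messages:
        tokens = message.lower().split()
        unigrams.update(tokens)
        # Pairs never cross message boundaries.
        bigrams.update(zip(tokens, tokens[1:]))

    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_first = unigrams[first] / n_unigrams
        p_second = unigrams[second] / n_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Invented toy messages, not taken from the thesis corpus.
messages = [
    "merge pull request from dev",
    "merge pull request from fork",
    "fix typo",
    "fix build",
    "merge branch master",
]

scores = bigram_pmi(messages)
for pair in sorted(scores, key=scores.get, reverse=True):
    print(pair, round(scores[pair], 2))
```

A higher score means the two words occur together more often than their individual frequencies would predict.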

Commit messages describe changes so that everyone can track them. They do not have to be well-formed text as long as communication is successful. Studies that explore features of short texts mainly describe Internet communication (e.g. Twitter), newspaper headlines, etc. The general field relevant to this genre is computer-mediated communication (CMC), since these messages are digital and developers write them on a computer. Along with CMC, commit messages may be compared with newspaper articles because they share some features.

This exploratory research creates an overview of linguistic features of commit messages from the perspective of different branches of language studies.

Commit messages follow the linguistic profile of computer-mediated communication and newspaper headlines. They demonstrate omission of the subject, various abbreviation strategies and other kinds of language reduction. These changes may be explained by discursive factors and some general grammatical processes. They also signal the formation of a genre.

The vocabulary shares features with CMC and also has genre-specific ones. This work gives an overview of semantic clusters, abbreviation strategies, collocations and some other features of this specific genre.

This study contributes to the description of domain-specific corpora and to short text analysis. In terms of methodology, it provides a pipeline for research on large domain-specific corpora and short texts using automatic natural language processing tools.
