Character Level Models for Hashtag Segmentation

Student: Glushkova Taisiya

Supervisor: Ekaterina Artemova

Educational Programme: Applied Mathematics and Information Science (Bachelor)

Year of Graduation: 2018

The main purpose of this work is to investigate the problem of hashtag segmentation in Russian. A hashtag is treated as a sequence of Cyrillic letters that forms several words written without white spaces. Although the majority of existing methods use unsupervised language models, I treat the problem in a supervised way, using neural character-level models. I provide a brief overview of the approaches recently used by several research groups and propose applying LSTM recurrent neural networks to this problem. I assume that RNNs, as a baseline NLP technique, are well suited to segmenting multiple words joined together. The technique is tested on real-life data collected from an open source, the social network VK. A post in this social network can usually be divided into the text and the corresponding hashtags. This division yielded a collection of texts in Russian and a list of hashtags. The training data is synthetic and is generated from the collected corpus of texts, while the test data is a collection of widely used real hashtags. The dictionary approach described by Peter Norvig in the chapter "Natural Language Corpus Data" of the book "Beautiful Data" is taken as the baseline method: a probabilistic language model that decides whether to split the hashtag based on the frequencies of unigrams and bigrams. The experiments show that the character-level model outperforms the dictionary method. In addition to the experiments with the model architecture, I consider an active learning approach in order to achieve greater interpretability of the model. The main stages of the work are: literature review, collection of texts, methods, experiments and results, software implementation, and conclusion.
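The two data-related ideas in the summary, synthetic training examples and a dictionary baseline, can be illustrated with a minimal sketch. It is not the thesis implementation. The function names, the labelling scheme (1 marks a character followed by a word boundary), and the toy corpus are assumptions made for illustration. The baseline here uses unigram frequencies only, whereas the thesis baseline also uses bigrams; the penalty for unknown words follows the one in Norvig's chapter.

```python
import math
from collections import Counter
from functools import lru_cache


def make_training_example(text):
    """Join the words of a phrase and label each character.

    Label 1 marks a character followed by a word boundary, 0 otherwise
    (an assumed labelling scheme, chosen for illustration).
    """
    words = text.lower().split()
    joined = "".join(words)
    labels = []
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        labels.extend([0] * (len(word) - 1) + [0 if is_last else 1])
    return joined, labels


def build_unigram_model(corpus_lines):
    """Return a log-probability function estimated from word counts."""
    counts = Counter(w for line in corpus_lines for w in line.lower().split())
    total = sum(counts.values())

    def log_prob(word):
        if word in counts:
            return math.log(counts[word] / total)
        # Unknown words get a penalty that grows with their length.
        return math.log(10.0 / (total * 10 ** len(word)))

    return log_prob


def segment(hashtag, log_prob, max_word_len=20):
    """Split a hashtag into the most probable word sequence (unigram model)."""

    @lru_cache(maxsize=None)
    def best(s):
        if not s:
            return 0.0, ()
        candidates = []
        for i in range(1, min(len(s), max_word_len) + 1):
            tail_score, tail_words = best(s[i:])
            candidates.append((log_prob(s[:i]) + tail_score, (s[:i],) + tail_words))
        return max(candidates)

    return list(best(hashtag.lower())[1])


# Toy corpus, invented for this example.
corpus = ["доброе утро", "доброе утро всем", "всем привет"]
model = build_unigram_model(corpus)
print(make_training_example("Доброе утро"))
print(segment("доброеутро", model))
```

The pairs produced by `make_training_example` are the kind of input a character-level sequence labeller such as an LSTM could be trained on, while `segment` plays the role of the dictionary method it is compared against.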

