Sentence Compressor for Russian

Student: Kuvshinova Tatiana

Supervisor: Dmitry Alexandrov

Faculty: Faculty of Computer Science

Educational Programme: System and Software Engineering (Master)

Final Grade: 9

Year of Graduation: 2020

Sentence compression is the task of removing redundant information from a sentence while preserving its original meaning. The task has been widely studied for English, but to the best of our knowledge no researcher has considered it for Russian. In this paper, we approach deletion-based sentence compression for the Russian language. We use data from the plagiarism detection corpus (ParaPlag) to create a Russian sentence compression corpus of almost 3000 sentence pairs, each of which has at least 30% of its words compressed. We resolve paraphrases in the corpus, align the source sentences with their compressions using the Needleman-Wunsch algorithm, and perform a human evaluation of the corpus for readability and informativeness. We treat sentence compression as a binary classification task on tokens. We first solve the task for Russian with a bidirectional LSTM, which is a typical baseline for the problem. We also experiment with RuBERT and multilingual BERT. For the latter, we use transfer learning, first pretraining the model on English data, which improves performance. We evaluate the models automatically with classification metrics and with human judgments of readability and informativeness, and we carry out an error analysis. On the test data we achieve an F-measure of 74.8%, readability of 3.88, and informativeness of 3.47 (both out of 5). We also implement a post-hoc syntax-based evaluator that can detect some of the wrong compressions, increasing the overall quality of the system. We provide the aligned data and baseline results for future studies. We also include lemmatized and POS-tagged versions of the aligned data so that it can be used directly with word embeddings whose vocabulary is lemmatized or POS-tagged.
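
The alignment and labelling step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `needleman_wunsch` and `keep_labels`, the scoring values (match +1, mismatch -1, gap -1), and the English example sentence are all assumptions made for illustration. The sketch aligns a tokenized source sentence with its compression using Needleman-Wunsch global alignment and then derives one binary label per source token (1 = kept, 0 = deleted), which is the form a token-level classifier would be trained on.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def needleman_wunsch(source: List[str], target: List[str],
                     match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> List[Pair]:
    """Globally align two token lists; None marks a gap.

    The scoring values are illustrative assumptions, not taken from the thesis.
    """
    n, m = len(source), len(target)
    # score[i][j] = best score for aligning source[:i] with target[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        return match if source[i - 1] == target[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((source[i - 1], None))  # source token has no partner
            i -= 1
        else:
            pairs.append((None, target[j - 1]))  # token only in the compression
            j -= 1
    pairs.reverse()
    return pairs


def keep_labels(source: List[str], compression: List[str]) -> List[int]:
    """Return one label per source token: 1 if kept, 0 if deleted."""
    labels = []
    for src_tok, tgt_tok in needleman_wunsch(source, compression):
        if src_tok is None:
            continue  # token appears only in the compression; no source label
        labels.append(1 if src_tok == tgt_tok else 0)
    return labels


source = "the quick brown fox jumps over the lazy dog".split()
compression = "the fox jumps over the dog".split()
print(list(zip(source, keep_labels(source, compression))))
```

In this sketch a source token that is aligned to a different token (a mismatch) is labelled as deleted. That choice only makes sense for deletion-based compression after paraphrases have been resolved, so that every remaining compression token also occurs in the source.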

Full text (added May 22, 2020)

