Automatic Extraction of Information from Grammars

Student: Ermolaeva Natalia

Supervisor: Svetlana Toldova

Faculty: Faculty of Humanities

Educational Programme: Fundamental and Computational Linguistics (Bachelor)

Year of Graduation: 2019

This paper addresses several fundamental problems in natural language processing (NLP). The first is language identification: determining the natural language in which a given document is written. Research on this problem began in the 1970s, but early approaches dealt exclusively with monolingual documents. The main materials of this work, however, are grammar texts, which are inherently multilingual. We therefore conduct a series of experiments applying a well-established language identification method (character n-grams) to materials written in more than one language. We also need to determine which parts of the text are of interest and should be extracted from the grammars. Based on the extracted information, we compiled dictionaries of the source languages as well as corpora of examples (including their glossed versions). As a result, a software tool was created that automatically extracts relevant information from grammar texts by applying a set of methods. The first of these is language identification in a multilingual (here, bilingual) document, followed by extraction of the units whose language differs from the metalanguage of the grammar; this made it possible to automatically create and populate a dictionary of the language. In addition, by applying classification models to the input materials, we extracted the composite examples contained in grammar texts and split each into three components: a sentence in the original language, its glossed version, and its translation. This classification makes it possible to automatically create and populate parallel corpora for various languages of the world.
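The idea of identifying language with character n-grams and then keeping the segments that differ from the metalanguage can be illustrated with a minimal sketch. It is not the tool described in the thesis: the training samples, the trigram size, the cosine-similarity scoring, and the names `char_ngrams`, `identify`, and `extract_foreign` are all assumptions chosen for illustration, and the choice of English as metalanguage and Spanish as source language is hypothetical.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count character n-grams of a lower-cased string padded with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training samples (an assumption): a real system would build
# profiles from much larger texts in each language.
SAMPLES = {
    "english": "the verb agrees with the subject in person and number "
               "and the object takes the accusative case",
    "spanish": "el perro come la carne y la mujer ve al hombre "
               "en la casa del pueblo",
}
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def identify(segment):
    """Return the language whose n-gram profile is closest to the segment."""
    grams = char_ngrams(segment)
    scores = {lang: cosine(grams, profile) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get)


def extract_foreign(lines, metalanguage="english"):
    """Keep only the lines not identified as the metalanguage."""
    return [line for line in lines if identify(line) != metalanguage]


page = [
    "the subject takes the accusative case",
    "la mujer come la carne",
]
print(extract_foreign(page))
```

With profiles this small the classifier is only reliable on segments that share vocabulary with the samples; short or mixed segments would need larger profiles and a confidence threshold.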

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
