Extraction of Polysemic Words from Connectivity Graph of a Text

Student: Okhapkina Anna

Educational Programme: Fundamental and Computational Linguistics (Bachelor)

Year of Graduation: 2020

This thesis studies the concept and structure of distributional models of language semantics and their advantages over theoretical approaches to identifying the components of meaning. Automatic models are purely empirical: they can split word meanings into a very large number of components, determine which words are closer or farther apart in meaning, and do so quickly. In our review of existing approaches, we examine specific distributional models, focusing on how each of them resolves semantic ambiguity. We also examine state-of-the-art (SOTA) models based on the transformer neural architecture and conclude that, although these networks achieve high quality, they require tremendous computing power, which is not always available when working with a corpus. We therefore take the opposite approach: we build a very simple model that uses only statistics on the co-occurrence of words in the corpus, and explore what such a model can do for resolving semantic ambiguity. We implemented an algorithm in Python that builds a distributional model from co-occurrence statistics over a pre-processed custom corpus. We consider three methods for visually interpreting the model's output: 1. scatter plots of word vectors projected onto the plane with the t-SNE dimensionality reduction algorithm; 2. a graph of distributional similarity between words, constructed by converting the distance matrix into an adjacency table of nodes; 3. a heatmap of pairwise cosine similarity between words.
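
As a rough illustration of the kind of pipeline the abstract describes, the sketch below counts word co-occurrences in a tokenized corpus, computes pairwise cosine similarity, and turns the similarity matrix into a list of graph edges. It is not the thesis implementation: the function names, the window size, the similarity threshold, and the toy corpus are all assumptions made for this example.

```python
import numpy as np


def cooccurrence_matrix(sentences, window=2):
    """Count how often each pair of words occurs within `window` tokens."""
    vocab = sorted({word for sent in sentences for word in sent})
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)), dtype=float)
    for sent in sentences:
        for i, word in enumerate(sent):
            start = max(0, i - window)
            end = min(len(sent), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[index[word], index[sent[j]]] += 1
    return vocab, counts


def cosine_similarity(matrix):
    """Pairwise cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    unit = matrix / norms
    return unit @ unit.T


def similarity_edges(vocab, sim, threshold=0.5):
    """Keep word pairs whose similarity reaches the (assumed) threshold."""
    edges = []
    for i in range(len(vocab)):
        for j in range(i + 1, len(vocab)):
            if sim[i, j] >= threshold:
                edges.append((vocab[i], vocab[j], float(sim[i, j])))
    return edges


# Toy pre-processed corpus, invented for this example only.
corpus = [
    ["river", "bank", "water"],
    ["money", "bank", "loan"],
    ["river", "water", "flow"],
    ["money", "loan", "credit"],
]

vocab, counts = cooccurrence_matrix(corpus, window=2)
sim = cosine_similarity(counts)
edges = similarity_edges(vocab, sim, threshold=0.5)

for left, right, weight in edges:
    print(f"{left} -- {right}: {weight:.2f}")
```

In this toy corpus, "bank" shares contexts with both the "river" group and the "money" group, which is the kind of pattern a similarity graph can expose for a polysemous word. The matrix `sim` could also be passed to a heatmap plot, and the rows of `counts` to a t-SNE projection, in line with the three visualizations listed above.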

