HSE Doctoral Student Develops E-thesaurus for the Russian Language

Daniil Alexeevsky, doctoral student in Philology, presented the final part of his thesis on the development of a large electronic lexical database of the Russian language, similar to Princeton’s WordNet.

Resources similar to Princeton’s WordNet are widely used to solve problems in automatic text processing, such as determining the semantic similarity of words, as well as problems in automatic translation. Although these resources are in clear demand, there is currently no open-access Russian-language thesaurus that meets Princeton WordNet’s standards.

Daniil Alexeevsky has developed a chain of programs that process dictionaries in order to encode ‘super-subordinate’ relations between words (also called hyperonymy, hyponymy or the ISA relation); WordNet works in a similar way. The programs correctly identify the general term in a dictionary definition with an accuracy of 85%, significantly higher than in similar published works, although disambiguation (determining which sense of the general term is meant) requires further improvement.

However, for some noun classes disambiguation already works successfully. For example, the terms for musical instruments and for technical tools and devices are correctly extracted and separated in the dictionary.
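To make the idea of extracting ‘super-subordinate’ relations from dictionary definitions more concrete, here is a minimal Python sketch. It is not the method used in the thesis, whose details are not described here. The toy English dictionary, the stop list and the rule "take the first word that is not on the stop list" are all assumptions made purely for illustration.

```python
# Toy dictionary: headword -> definition. Entries are invented for illustration.
TOY_DICTIONARY = {
    "violin": "a stringed instrument played with a bow",
    "flute": "a wind instrument made from a tube with holes",
    "hammer": "a tool with a heavy head used for driving nails",
    "instrument": "a device used to produce music",
}

# Hypothetical stop list: articles and modifiers that precede the general term.
SKIP_WORDS = {"a", "an", "the", "stringed", "wind", "heavy", "small", "large"}


def extract_general_term(definition):
    """Return the first word of the definition that is not in SKIP_WORDS."""
    for word in definition.lower().split():
        if word not in SKIP_WORDS:
            return word
    return None


def build_isa_relations(dictionary):
    """Map each headword to the general term found in its definition."""
    relations = {}
    for headword, definition in dictionary.items():
        general_term = extract_general_term(definition)
        if general_term is not None:
            relations[headword] = general_term
    return relations


isa = build_isa_relations(TOY_DICTIONARY)
for headword, general_term in isa.items():
    print(f"{headword} ISA {general_term}")
```

A real system would need morphological analysis and syntactic cues rather than a fixed stop list, and it would still face the disambiguation problem mentioned above: the sketch links "violin" to the word "instrument", but not to a particular sense of it.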

Daniil plans to improve disambiguation using Word2Vec, and then analyze and compare the results of processing several dictionaries.
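The article does not say how Word2Vec will be applied, so the following Python sketch shows only one plausible approach: choose the sense of the general term whose vector is closest, by cosine similarity, to the vector of the word being defined. The three-dimensional vectors, the sense labels and the function names are hypothetical stand-ins for embeddings that a trained Word2Vec model would supply.

```python
import math

# Hand-made vectors standing in for trained word embeddings (assumption).
WORD_VECTORS = {
    "violin": [0.9, 0.1, 0.0],
    "scalpel": [0.1, 0.9, 0.1],
}

# Hypothetical vectors for two senses of the general term "instrument".
SENSE_VECTORS = {
    "instrument (musical)": [1.0, 0.0, 0.1],
    "instrument (tool)": [0.0, 1.0, 0.2],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_sense(headword):
    """Return the sense label whose vector is most similar to the headword's."""
    vector = WORD_VECTORS[headword]
    return max(SENSE_VECTORS, key=lambda sense: cosine(vector, SENSE_VECTORS[sense]))


for word in WORD_VECTORS:
    print(f"{word} -> {pick_sense(word)}")
```

With real embeddings, the sense vectors might be built by averaging the vectors of the words in each sense's definition, although that too is an assumption rather than something stated in the article.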