
Measuring Gender Bias in Word Embeddings for Russian Language

Student: Pestova Alena

Supervisor: Kirill A. Maslinsky

Faculty: Saint-Petersburg School of Social Sciences

Educational Programme: Sociology and Social Informatics (Bachelor)

Year of Graduation: 2021

The problem of gender bias in Natural Language Processing (NLP) models has become a growing concern in the NLP community in recent years. Various types of NLP models have been shown to exhibit social biases related to gender, race, and religion, which they inherit from their training texts. Word embeddings (WE), a very common framework in NLP, have also been shown to reproduce various prejudices, gender bias in particular. Moreover, word embeddings are often used in the social sciences to study corpora, their authors, or social phenomena in general. For such studies, it is important to understand the nature and causes of bias in the models, as well as the conditions under which models inherit bias from training texts. Existing research on gender bias in word embeddings often focuses on English-language models, and there is no such research on models for Russian. There is also a gap in research on how model parameters influence bias. This paper takes an experimental approach: 36 models were built on 4 corpora, varying both model parameters (algorithm, window size) and corpus parameters (corpus size, genre). GeoWAC (sample for Russian), Russian Wikipedia, the Russian National Corpus, and DetCorpus were used to train the word embeddings. Gender bias in the word embeddings was then analyzed with the Word Embedding Association Test method, and the influence of corpus composition and model parameters on gender bias was investigated. The results and conclusions of this study can be useful for social scientists as well as for researchers in the field of NLP.
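
For readers unfamiliar with the Word Embedding Association Test mentioned above, the following is a minimal sketch of its effect-size statistic in its commonly described form. It is not the thesis's implementation: the function names, the two-dimensional toy vectors, and the choice of the sample standard deviation (ddof=1) are all assumptions made for illustration. In practice the vectors would come from a trained embedding model, and the word sets X, Y (target words) and A, B (attribute words, for example male and female terms) would be chosen by the researcher.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """Mean similarity of w to attribute set A minus mean similarity to B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Difference of mean associations of X and Y, scaled by the
    standard deviation of associations over all target words.
    Assumption: sample standard deviation (ddof=1)."""
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    pooled_std = np.std(s_X + s_Y, ddof=1)
    return float((np.mean(s_X) - np.mean(s_Y)) / pooled_std)

# Hypothetical toy vectors standing in for real word embeddings.
A = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]  # attribute set 1 (e.g. male terms)
B = [np.array([0.0, 1.0]), np.array([0.1, 0.9])]  # attribute set 2 (e.g. female terms)
X = [np.array([0.8, 0.2]), np.array([0.7, 0.3])]  # target set 1
Y = [np.array([0.2, 0.8]), np.array([0.3, 0.7])]  # target set 2

d = weat_effect_size(X, Y, A, B)
print(d)  # positive here: X lies closer to A, Y lies closer to B
```

A positive value indicates that the first target set is more strongly associated with the first attribute set than the second target set is; swapping X and Y flips the sign.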

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
