Representation of Minor Languages of Russia on the Internet: Quantitative Description and Data Analysis

Student: Krylova Irina

Supervisor: Boris Orekhov

Faculty: Faculty of Humanities

Educational Programme: Computational Linguistics (Master)

Final Grade: 10

Year of Graduation: 2016

The study examines how the minority languages of Russia are represented on the Internet. Its aims are to give a quantitative description of the minority-language Internet, to determine the characteristics of the national Internets, and to find parameters that can be used to predict whether a minority language is present on the Internet and to what extent. The study drew on two kinds of data: external language data (the number of native speakers, data on the titular region where the language is spoken, etc.) and data from Internet collections (the number of sites in the minority language, the number of web pages, the number of tokens, and the median number of tokens per web page). Correlation analysis, multivariate linear regression and cluster analysis were used to find relationships and dependencies in the data. A web graph was built to determine the characteristics of the national Internets. The result is a quantitative description of the national Internets of more than 40 languages; the discussion also covers sites in minority languages and their registration data, as well as the ratio of the median number of tokens per web page to the total number of web pages. The languages with the largest number of sites (Bashkir, Tatar, Yakut, Udmurt) were, unfortunately, not included in the further analysis. A linear regression model was built on offline and online data for the minority languages. The model can be used to predict roughly whether a minority language outside the sample will be present on the Internet. Hierarchical clustering showed that the languages in the sample fall into two clusters according to their representation in the online and offline environments. Most national Internets are disassortative, weakly connected graphs; the combined graph of all national Internets, however, is assortative and not weakly connected.
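The two graph properties named in the summary can be illustrated with a minimal sketch. It is not the code used in the thesis: the site names and links below are hypothetical toy data, and the assortativity measure shown is the simple Pearson correlation of total degrees at the two ends of each link, with link direction ignored, which may differ from the variant the author used.

```python
from collections import defaultdict, deque
from math import sqrt


def is_weakly_connected(nodes, edges):
    """Return True if the directed graph is connected once link direction is ignored."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    nodes = list(nodes)
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) == len(set(nodes))


def degree_assortativity(edges):
    """Pearson correlation of degrees at both ends of each link (direction ignored).

    Negative values mean hubs tend to link to low-degree sites (disassortative);
    positive values mean sites link to sites of similar degree (assortative).
    Returns None when all degrees are equal and the correlation is undefined.
    """
    degree = defaultdict(int)
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    xs, ys = [], []
    for src, dst in edges:
        # Count every link in both directions so the measure is symmetric.
        xs += [degree[src], degree[dst]]
        ys += [degree[dst], degree[src]]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Hypothetical toy "national Internet": one hub site linking to four small sites.
toy_sites = ["hub", "a", "b", "c", "d"]
toy_links = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]

connected = is_weakly_connected(toy_sites, toy_links)
r = degree_assortativity(toy_links)
print(connected, r)
```

In this toy star-shaped graph every link joins the high-degree hub to a low-degree site, so the coefficient is strongly negative and the graph is weakly connected, which is the pattern the summary reports for most individual national Internets.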

Full text (added June 7, 2016)

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
