Growth areas in language structure: corpus research and corpus-based simulations

Priority areas of development: humanities
The project has been carried out as part of the HSE Program of Fundamental Studies.

The main Project goals

Growth areas are zones of systematic speech deviation that reflect speech variability and thereby reveal tendencies in language development. The research treats non-accidental violations of recommended norms as the ground from which new speaking trends emerge. The following electronic resources were expanded and upgraded within the Project: corpora of non-standard Russian and English speech production, namely corpora of dialectal variation, corpora of Russian heritage speakers from Russian families living in an English-speaking environment, and corpora of students’ academic Russian and English. These resources have high pedagogical and research potential: they support the analysis and prevention of language errors, inform the reflections of both instructors and learners on their speech, and reveal typological parallels between Russian and other languages.

Research in 2017 aimed at a consistent description of lexical and grammatical variability in modern Russian speech production and in the speech of Russian learners of English, with a focus on deviation areas: erroneous uses of words and constructions, and gaps between the functioning system and the norm, which signal developmental potential. Special emphasis was given, first, to the functions of comparative constructions in the languages under observation; second, to the variability of light verbs as a result of semantic erosion with potential grammaticalization; third, to agreement between elements within the grammatical structure of a sentence; fourth, to changes in labile verbs; fifth, to negation constructions; sixth, to redundant verbalisation of the existence presupposition by means of participles (uses of participles with the semantics of existence or attitudes); and, seventh, to elements of redundancy in academic Russian.


Source texts for the corpora come from three sources: guided discussions (cf. Transcripts of field research of regional varieties of Russian); speech production typed by the authors themselves as learning tasks performed in class or independently at home by speakers of other languages, heritage speakers, and students of academic Russian and English; and similar tasks typed up by the Project participants from hand-written originals.

Corpora provide the tools needed for fast processing of large data collections, for spotting diachronic changes in lexical and grammatical variability, for sorting samples by type and attribution features, and for specifying the lexical and structural contexts under study.

Work carried out within the Project in 2017 included:

  • updating the inventory of linguistic patterns in non-standard speech production;
  • description of shifts in the norm in the nominal and verbal domains;
  • reconstruction and simulation of objectively existing patterns of deviation from the norm regulating speech production;
  • expansion of corpora collections, improvements in annotation practices, and fine-tuning of search options;
  • testing and expanding training exercises and tests;
  • creation of programs for automated annotation of the errors and deviations researched in the Project;
  • defining the principles for teaching new competencies and new pieces of linguistic description that would cover the researched shift in the recommended norms;
  • compiling the features of the heritage speech of non-balanced bilingual speakers;
  • compiling the features of the Russian speech produced by speakers of other languages;
  • compiling the features of the Russian speech produced by speakers of regional varieties of Russian;
  • compiling the features of the academic Russian speech produced by learners of this new register;
  • updating the list of features of the Russian speech produced in communications on the Internet.

It is essential that within the Project an error is regarded as a valuable linguistic feature revealing the ever-changing tendencies in language use.

Russian and English learner corpora

The learners are first- and second-year Bachelor students who write texts in Russian or English as coursework. Detailed information about the author of each text, together with relevant sociolinguistic parameters, makes up the text’s metadata. The texts then undergo morphological analysis with MYSTEM, a program that identifies the morphological categories of each word form, and are annotated in the program Les Crocodiles 2.0 before being added to the corpus.

Heritage corpus

Texts in the corpus come from Russian as a Foreign Language instructors and represent a variety of genres: from a short open-ended answer to a question, to a mini-essay (one paragraph long) on a given topic, to an argumentative essay covering a broader problem, and even an analytic report, comparison, or commentary. The level of proficiency is identified for each text. The corpus can be searched to obtain a sub-corpus by any feature from the metadata (texts by one author, texts by speakers of a certain language, texts of a certain genre, etc.). Expert error annotation is performed according to the typological categorisation developed for CoRST.
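The metadata-based selection of a sub-corpus described above can be illustrated with a minimal sketch. It is not the corpus's actual search engine: the record type `CorpusText`, its field names, the function `select_subcorpus`, and the sample records are all hypothetical and serve only to show the idea of filtering texts by any combination of metadata features.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CorpusText:
    # All field names are hypothetical; they only mirror the examples in the text
    # (author, language of the speaker, genre, proficiency level).
    text_id: str
    author: str
    dominant_language: str
    genre: str
    proficiency: str


def select_subcorpus(texts, **criteria):
    """Return the texts whose metadata equal every given field=value pair."""
    known = {f.name for f in fields(CorpusText)}
    unknown = set(criteria) - known
    if unknown:
        raise ValueError(f"unknown metadata fields: {sorted(unknown)}")
    return [
        t for t in texts
        if all(getattr(t, name) == value for name, value in criteria.items())
    ]


# Invented sample records, for illustration only.
SAMPLE = [
    CorpusText("t1", "author_a", "English", "essay", "intermediate"),
    CorpusText("t2", "author_a", "English", "report", "intermediate"),
    CorpusText("t3", "author_b", "German", "essay", "advanced"),
]

essays = select_subcorpus(SAMPLE, genre="essay")
by_author = select_subcorpus(SAMPLE, author="author_a", genre="report")
print([t.text_id for t in essays])     # ['t1', 't3']
print([t.text_id for t in by_author])  # ['t2']
```

Combining several criteria in one call narrows the sub-corpus, and calling the function with no criteria returns the whole collection.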

Russian regional varieties corpus

The corpus is being expanded with field research materials, both audio and video, collected by the Project participants working in different regions of Russia. In 2017, a field trip organised within the Project and supervised by Yulia Kuvshinskaya took place in remote villages of the Tver region. The recordings were transcribed, and Bachelor students are currently using them as material for their academic theses.

Achievements across corpora

1. Russian Error-Annotated Learner English Corpus http://realec.org/  (REALEC)

The Russian Error-Annotated Learner English Corpus, a collection of papers by Bachelor students of the Research University Higher School of Economics, is the first openly accessible educational corpus in Russia. It was developed by the School of Linguistics of the Research University Higher School of Economics. The core of the Corpus is exam-type essays, of which the Corpus has accumulated several thousand.

The first phase of REALEC-based research conducted in the School of Linguistics of the Research University Higher School of Economics (2012-2016) demonstrated the ample opportunities the Corpus offers to students and instructors of English, as well as for linguistic observation. At the same time, it became obvious that the tools used to work with the Corpus needed improvement. Improving these tools requires a detailed analysis of certain classes of errors, in order to reveal which wrong choices of English words or collocations are the most frequent.

The initial research into errors of that type in the texts of Russian-speaking learners of English demonstrated that there are three different categories of wrong word selection: lexical errors, discourse errors, and morpho-syntactic errors. Accordingly, in the course of error analysis the Project participants develop recommendations and set up training systems that help students recognise and prevent such errors. The Corpus, in turn, incorporates tools for easy visualization of the lexical components of essays, and user-specific interfaces are being developed, for example an annotator interface based on an innovative recognition algorithm for such errors.

The grouping of the essays in the corpus has been redesigned so that it corresponds more closely to the metadata of each essay.

In 2017, 2,481 new texts and 16,566 annotations were added to REALEC.

In 2017, the examination essays written by HSE students in 2015 and 2016 were fully added to the Corpus and annotated, and work on adding the examination essays of 2017 has started. Verification and correction of annotation practices and the development of new program-generated tests are ongoing. In 2017, two tests on verb forms (one with 50 questions, the other with 60) were tried out on large groups of students from two HSE departments, and the development of a third test, with 600 questions from various grammatical and lexical areas, is being completed. Work is also underway on new, more convenient user interfaces that present more information to the different users of REALEC. A new Student Research Group was formed at the beginning of 2017, and all new members have been assigned their own tasks.

2. Russian Learner Corpus  http://www.web-corpora.net/RLC/  (RLC)

The Russian Learner Corpus contains samples of the spoken and written language of two categories of non-standard Russian speakers: learners of Russian and so-called heritage speakers. For the first category, Russian is not a native language; members of the second category began acquiring it as a first language in childhood but, for various reasons (mainly emigration), use another language for most of their communication. The Corpus offers tools for searching by the lexical and grammatical properties of a word or collocation, and by various types of deviation from standard Russian, from orthographic mistakes to the choice of lexical units and grammatical structures. Linguistic analysis and tagging are performed by members of the Project Work Group.

The list of the resource's foreign partners has expanded:

Maria Polinsky (Harvard University)
Olesja Kiseleff (Penn State University) 
Evgenij Dengodub (Middlebury Language School)
Irina Dubinina (Brandeis University) 
Anna Mikhailova (Oregon State  University) 
Alla Smyslova (Columbia University)
Ekaterina Protasova (Helsinki University)
Anna Pavlova (Johannes Gutenberg University of Mainz)
Anna Møhl (Zurich University)
Anka Bermann (Humboldt University of Berlin)
Irina Kor-Shain (Aix-Marseille University)
Suhen Li (Seoul National University)
Svetlana Slavkova (University of Bologna)
Francesca Biagini (University of Bologna)
Monica Perotto (University of Bologna) 
Svetlana Sokolova (University of Tromsø)
Natalia Ringblom (Stockholm University)
Hajashida Rie (Osaka University)
Cuneto Sugo (Osaka University)
Margarita Kazakevich (Osaka University)
Naziya Zhanpeisova (S.Baishev Aktyubinsk University) 
Alexander Krasovitskiy (Oxford University)
Rashida Kasymova (al Farabi Kazakh National University)
Aimgul Kazkenova (Abai Kazakh National Pedagogical University)

Currently, the Corpus contains texts created by non-standard speakers whose dominant languages include American English, German (including Swiss German), French, Italian, Serbian, Japanese, Korean, Kazakh, Finnish, Norwegian, Swedish, and Dutch.

2017 data:

6,067 texts

1,295,278 words

104,823 sentences

46,540 annotations

4. Corpus of Russian Student Texts (CoRST) http://web-corpora.net/learner_corpus/ (CoRST)

The Corpus of Russian Student Texts (CoRST) is a collection of texts in Russian written by students of various higher education institutions. The total volume of the Corpus is about 3.1 million words. Texts are provided with several types of tags (meta-text tags, morphological tags, and error tags) that enable various kinds of Corpus search.

The Corpus of Russian Student Texts is an information and reference system designed for researchers, instructors, students, and anyone interested in problems of modern Russian grammar and in current processes in the lexis, morphology, and syntax of the modern Russian language.

The learner texts have been written by Bachelor's and Master's students of various higher education institutions. The core text types in the Corpus are term papers, diploma theses, graduation papers, essays, abstracts, summaries, notes, autobiographies, and paragraphs (short texts of various origin: homework, answers to questions, etc.).

The Corpus records the academic year, term, and module in which a text was written, and the field of knowledge to which it relates. The field of knowledge may differ from the student’s major. For example, for a linguistics student writing a history essay, we indicate both the major (linguistics) and the subject for which the essay was written (history).

The Corpus contains texts by students of the following specialties: Economics, Sociology, Political Science, Jurisprudence, Psychology, Journalism, Linguistics, History, Philology, Logistics, Mathematics, Philosophy. As a rule, the Corpus has data on the sex and age of the author and on his/her year of study (first year of a Bachelor’s program, second year of a Master’s program, etc.). Some texts have information about the region where the author lived up to the age of 18 and whether he/she is bilingual.

The Corpus collection is growing rapidly, and the tags and the interface are being improved.

Collection Statistics as of December 2017:

3,677 texts

3,115,212 words

301,079 sentences

27,593 tagged elements.

The Corpus contains texts written by students in their first to sixth year of study, covering 15 humanities majors and 14 genres.

Python and its public domain libraries were used for this project.

Initial Processing of Texts

A program was developed to pre-process graduation qualification papers and diploma papers by deleting information that is not needed for further work with the corpora and for text tagging. Taking a full diploma or graduation qualification paper as its input, the program retains only the main part of the text, from the Introduction to the Conclusion inclusive.
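A minimal sketch of such trimming is given below. It is not the Project's actual program: the function name `trim_to_main_part`, the heading lists, and the rule that the Conclusion ends at the first back-matter heading (or at the end of the file) are all assumptions made for illustration. The sketch also assumes that each section heading stands on a line of its own, and it does not handle a table of contents that repeats the word "Introduction".

```python
import re


def _is_heading(line, names):
    """True if the line, minus optional numbering such as '1.' or '2.3', equals one of the names."""
    cleaned = re.sub(r"^\s*\d+(\.\d+)*\.?\s*", "", line).strip().rstrip(".:").casefold()
    return any(cleaned == name.casefold() for name in names)


def trim_to_main_part(
    text,
    start_headings=("Introduction", "Введение"),       # assumed heading names
    end_headings=("Conclusion", "Заключение"),          # assumed heading names
    back_matter=("References", "Bibliography", "Appendix",
                 "Список литературы", "Приложение"),    # assumed heading names
):
    """Keep the text from the Introduction heading to the end of the Conclusion section."""
    lines = text.splitlines()

    # Find the first Introduction heading; without it the text is returned unchanged.
    start = next((i for i, l in enumerate(lines) if _is_heading(l, start_headings)), None)
    if start is None:
        return text

    # Find the Conclusion heading after the Introduction; without it keep everything that follows.
    concl = next(
        (i for i in range(start + 1, len(lines)) if _is_heading(lines[i], end_headings)),
        None,
    )
    if concl is None:
        return "\n".join(lines[start:]).strip()

    # The Conclusion section runs until the first back-matter heading or the end of the text.
    end = next(
        (i for i in range(concl + 1, len(lines)) if _is_heading(lines[i], back_matter)),
        len(lines),
    )
    return "\n".join(lines[start:end]).strip()


paper = "\n".join([
    "Title page", "Abstract",
    "Introduction", "Intro text.",
    "Chapter 1", "Body.",
    "Conclusion", "Final remarks.",
    "References", "[1] Some source",
])
trimmed = trim_to_main_part(paper)
print(trimmed)
```

Returning the input unchanged when no Introduction heading is found is a deliberately conservative choice: a paper with unusual headings is left for manual inspection rather than silently truncated.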

Data Backup

A database backup was created for data safety. Should an emergency occur (for example, a failure of the Corpus technical platform), a non-empty version of the resource will still be available.

A total of 3,677 documents were backed up, of which 1,618 have tags.

The backup is available for download from Google Drive:



Documentation was written for future Corpus taggers, programmers, and administrators to help them understand the technical details and become familiar with the structure of the web-corpora-based Project.

The documentation is available at: https://github.com/acRnR/learner_corpus/wiki

5. Prototype Corpus of Folklore Texts (Tverskaya oblast regional dialect)

During the 2017 field trip, texts produced by village inhabitants were recorded.

Following the summer 2017 field trip, about 18 hours of audio and video recordings were segmented, transcribed, tagged, and entered into database tables, yielding about 300 texts recorded in 8 localities (the villages of Kreva, Maloe Vasilevo, Privolzhskiy, Beliy Gorodok, Pechetovo, Bereslovo, Seltsy, Volodarskoe) from 13 speakers (11 women born between 1927 and 1947, and 2 men born in 1934 and 1947).

  • analytic description of the following nodes of the lexical and grammatical system that generate speech deviations:

Sub-categorization of Russian exclamatives based on the analysis of their structure and speaker intentions (Verbless kakoj-exclamatives in Russian: Evidence from Usage Data; syntactically dependent Russian exclamatives)

Identification of the basic properties of Russian meta-comparatives in a typological perspective

Identification of the specifics of Russian verbs with opening/closing semantics in comparison with languages of other types

Interpretation of differences between the speech practices of the 19th century (on the basis of Zhukovsky's texts) and the modern language using computer data analysis

Explanation of the order of adjectives of different semantic classes in Russian from the perspective of corpus data

Interpretation of verb aspect in teaching Russian as a Foreign Language from the perspective of corpus data

Description and explanation of plural/singular agreement between the predicate and a subject expressed by a noun phrase containing such words as “half” («половина») or “one third” («треть»)

Classification of the uses of adverbs as modifiers of adjectival participles in modern Russian

Typology of standard and non-standard metaphorical transfers

Conceptualization of the specifics of dative subject behavior in Russian speech from a historical perspective

Paradigmatization of standard and occasional consonant alternations in modern Russian.

Improvement of style-improvement exercises

This research is intended for instructors of courses in rhetoric, academic writing, literary editing, speech culture, modern Russian, and Russian as a Foreign Language. Its purpose is to facilitate the preparation of focused tasks that train students in choosing the exact word of the proper register, correcting lexical compatibility, structuring grammatical constructions, and organizing the composition of a well-written text.


Research Implementation Degree, Implementation Recommendations, or Implementation Results

The resources make it possible to obtain credible and statistically representative data on lexical and grammatical variation in modern Russian speech.

Full-fledged corpora of non-standard texts are of interest to ethnologists, anthropologists, philologists, linguists, sociologists, political scientists, historians, culturologists, journalists, translators, psychologists, and area studies specialists: the analysis of non-standard language usage makes it possible to identify changes in the linguistic picture of the world in the mind of a speaker or writer, as well as language shifts that are caused by the current language situation and are becoming the basis for language development.


