Corpus studies of language variation: from deviations to linguistic norm

Priority areas of development: humanities
Department: Linguistic Laboratory of Corpus Technologies

The goal of the Laboratory in 2015 was to extend the results achieved with text collections and search tools in 2012-2014 to the non-standard linguistic corpora that have been set up and are being refined.

The Project included setting up a multilevel speech-error database: a multi-faceted, updatable archive of “negative examples of speech” (L. V. Scherba) that enables analysis of how a language develops in its environment. The focus of activities in 2015 was to model the system of rules that reflect how the language functions in several settings: contemporary communication on the Internet (analysis of blog texts and description of how today's speech standards emerge in Internet communication); the study of Russian as a Foreign Language (RFL, for speakers of other languages); the study of Russian as a heritage language (in families living in a foreign-language environment); regional varieties (Dagestanian Russian); and the educational environment (junior students writing abstracts or research papers in academic English or academic Russian). Project participants concentrated on describing the range of variation in the means used to express a speaker's or writer's intentions under the constraints imposed by the language and its environment.


The methods used were developed within Erratology Theory (the Grammar of Errors), and the research belongs primarily to the field of corpus linguistics. Research into the boundaries of speech variation across the corpora is based on data obtained in the Non-standard Text Collection Project: the collected texts are being tagged to enable automated searches and to capture non-standard word usage and the generation of speech patterns where these are regulated by standards and prescriptions (dictionaries and grammar books).

The updateable corpora set up in the NRU HSE Laboratory for Corpora Research Technologies include the following:

  • Russian Learner Corpus: the corpus is divided into two parts, the Corpus of Russian Heritage Speakers (Russian speakers brought up in families that emigrated from the RF, in a language environment foreign to them, namely that of the language of the state that accommodated the family) and the Corpus of RFL[1] Speakers (texts produced by students of Russian as their second, third, etc. foreign language);
  • Corpus of Russian Student Text (CoRST, academic research papers of students studying Academic Writing in Russian);
  • Russian Learner-Translator Corpus (from Russian into English and from English into Russian);
  • Regional Russian Speakers Corpus (so far, mainly a corpus of texts produced by rural residents of Dagestan);
  • Russian Error-Annotated Learner English Corpus (REALEC, comprises texts written in English by Russian-speaking English learners);
  • Blog Corpus (speech productions of Internet users).

During 2015 the addition of collections (sources that constitute the base of non-standard Russian speech corpora) and the development of tagging to facilitate searching in the corpora continued.

The key categories for the research were errors, speech disfluency, agrammatism, and variance in language usage. For this Project, the main material for the corpora of non-standard Russian, and for studying important lexical and grammatical trends in modern Russian, was deviations from the recommended standard, broadly interpreted: the so-called “negative examples of speech” (L. V. Scherba), which gave rise to a dedicated research area on speech variation and language development, La Grammaire des Fautes (Henri Frei).

Empirical Base for the Research

Data collection component: field studies as a source of texts produced by informants during guided conversations (transcripts of oral interviews conducted by project participants with regional Russian speakers in Dagestan villages and in Makhachkala); academic assignments performed by students in class or as homework (speech productions of foreign speakers, untrained translators, heritage language speakers, and students learning the Russian academic register, submitted electronically); texts typed in by the author; and spontaneous speech productions of bloggers extracted from the Internet.

Russian Learner Corpus consists of two sub-corpora: the Corpus of Russian Heritage Speakers and the Corpus of RFL Speakers. Texts were submitted by instructors of Russian as a Foreign Language. The following genres of speech production are represented: open-ended short answers to a posed question; argumentative essays on a given issue; mini-essays (a single paragraph) on a certain subject; analytical notes; reports on analytical studies; and the results of comparing, summarising, reviewing, or commenting on source texts. The language proficiency level is indicated. Search options for expanding the context are provided. A search query can be restricted to a sub-corpus based on meta-tags: texts from the same informant, texts from non-native speakers separately from heritage speakers, or texts of the same genre. Errors are tagged in accordance with the typology developed for the Russian Academic Writing Corpus (КРУТ).
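The meta-tag restriction described above can be illustrated with a minimal Python sketch. It is not the corpus's actual search interface: the record fields (`informant`, `speaker_type`, `genre`, `level`), the function name `select_subcorpus`, and the sample records are all hypothetical and chosen only to mirror the meta-tags named in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """One corpus text with illustrative (hypothetical) meta-tags."""
    text_id: str
    informant: str
    speaker_type: str  # assumed values: "heritage" or "rfl"
    genre: str
    level: str


def select_subcorpus(records, **criteria):
    """Return the records whose meta-tags equal every given criterion."""
    return [
        record
        for record in records
        if all(getattr(record, key) == value for key, value in criteria.items())
    ]


# Invented sample data, for illustration only.
records = [
    TextRecord("t1", "inf01", "heritage", "essay", "advanced"),
    TextRecord("t2", "inf01", "heritage", "review", "advanced"),
    TextRecord("t3", "inf02", "rfl", "essay", "intermediate"),
]

# Texts of heritage speakers only, restricted to one genre.
heritage_essays = select_subcorpus(records, speaker_type="heritage", genre="essay")
# All texts from the same informant.
same_informant = select_subcorpus(records, informant="inf01")
```

Combining criteria narrows the sub-corpus, so a query for heritage speakers and a single genre returns only texts that satisfy both conditions.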

Russian Academic Writing Corpus (КРУТ/CoRST). Students write texts as assignments in the academic disciplines they study. The texts are meta-tagged (with data about the informant, in accordance with a detailed questionnaire reflecting the relevant sociolinguistic parameters of the speech production) and fed to a morphological analyzer (MYSTEM, software that identifies inflectional classes and classifies the grammatical form of a word); the texts are then tagged with the Les Crocodiles 2.0 software, and the tagged files are added to the corpus.

Russian Learner-Translator Corpus. Texts are produced as translations performed in the study of academic disciplines at the philological departments of several RF universities (various RF regions are represented) and in translation clubs. The corpus includes multiple translations of each source text by different translators, enabling comparison of the choice of expressive means and text generation strategies at the micro (paragraph) and macro levels. Speech disfluency risk zones (systemic, anticipated errors resulting from typologically universal speech generation rules) show high variation in the approach to selecting lexical and grammatical forms.

Regional Russian Speakers Corpus. The corpus grows on the basis of audio and video recordings of conversations with regional Russian speakers, made during linguistic expeditions to Dagestan villages; the recordings are subsequently transcribed and tagged by project participants to enable computer searches of the data.

Russian Error-Annotated Learner English Corpus (REALEC, comprises texts written in English by Russian-speaking English learners). Student texts in the corpus (mainly essays, but also answers to questions in class as well as a few theses) provide the material for statistical processing of large volumes of data, identification of diachronic changes in lexis and grammar, grading of output depending on text type and tags, and further improvement of lexical and grammatical performance in written examinations. Over the 2014-2015 year, REALEC expanded to the size of about 800,000 tokens from about 3200 essays, in which about 38,000 errors were marked through manual expert annotation. Two qualification theses and two research papers were written by students of the School of Linguistics in that period, and the report on annotator agreement was presented by researchers of the Laboratory at the *th Corpus Linguistics Conference at Lancaster University in July 2015.
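As a rough illustration of the scale these figures imply, the short Python sketch below derives average rates from the three approximate totals reported above. The derived values are back-of-the-envelope estimates, not statistics reported by the Laboratory, and they inherit the imprecision of the rounded inputs.

```python
# Approximate totals as reported for REALEC (all three are rounded figures).
tokens = 800_000
essays = 3_200
errors = 38_000

# Derived averages; estimates only, since the inputs are approximate.
tokens_per_essay = tokens / essays              # 250.0
errors_per_essay = errors / essays              # 11.875
errors_per_1000_tokens = errors * 1000 / tokens  # 47.5

print(f"{tokens_per_essay:.0f} tokens per essay")
print(f"{errors_per_essay:.1f} annotated errors per essay")
print(f"{errors_per_1000_tokens:.1f} annotated errors per 1000 tokens")
```

On these figures, an average essay is about 250 tokens long and carries roughly a dozen annotated errors.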

Results of research:

Besides the achievements mentioned above, the 2015 project included the development of templates for non-standard linguistic corpora; the description of shifts in the common usage of nouns and verbs; and the construction of algorithms for, and modelling of, rules, that is, objectively existing systematic regularities that make it possible to describe deviations from the prescriptive codification scheme governing speech generation. The project also included improvement of the linguistic corpora by providing:

  • additions to collections
  • optimization of annotations (tags, tagging instructions and search tools)
  • testing and upgrading stylistic trainers
  • automation of speech generation rules involved in the project research
  • postulating educational principles for students’ work with the corpora
  • development of competence-oriented linguistic tasks based on the description of shifts in the standards identified in comparison with speech generated in accordance with the norms
  • analytical description of speech peculiarities of Russian speakers whose native language is Kazakh
  • description of speech peculiarities of heritage Russian speakers (non-balanced bilinguals) and non-native Russian speakers
  • description of important trends in the development of modern Russian language reflected in speech production on the Internet.

It is important to note that, within the suggested approach, an error is regarded not as a shameful and punishable violation of a rule but as valuable language material revealing important trends in the development of a language.

The role of a search template is to group texts from the collection in a tagged corpus according to a certain combination of context attributes that vary in frequency and that users regard as relevant for their research tasks. Examples of queries: verb – preposition – noun in the accusative case; verb – noun in the dative case – noun in the accusative case. The template is filled in using the ‘grep’ command, based on n-gram output (compatible two-, three-, four-, and five-word combinations) with punctuation, the ‘lex’ tag (lemma), the ‘gr’ tag (part of speech and grammar), and syntactic tagging. The template developer algorithm (the "automated path") obtains a list of standard sketches using SketchEngine; prepares a list of missing sketches for lexicographical tasks; identifies all n-grams that correspond to standard sketches in SketchEngine; assigns a sketch="..." tag to n-grams left untagged; identifies which morphological and syntactic tags are relevant and obtains statistics on their combinations; and selects productive tags based on these statistics, after which the cycle returns to the beginning.
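The idea of matching a template against tagged n-grams can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the project's implementation: the project describes a ‘grep’-based workflow, whereas this sketch uses plain Python; the `Token` structure, the function names, and the tag values (`V`, `PR`, `S`, `SPRO`, `acc`, `dat`, `nom`) are hypothetical stand-ins for the ‘lex’ and ‘gr’ annotations.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str      # word form as it appears in the text
    lex: str       # lemma (the 'lex' tag)
    gr: frozenset  # grammatical tags (the 'gr' tag); values here are assumed


def ngrams(tokens, n):
    """Return all contiguous n-token windows of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def matches(ngram, template):
    """True if each token carries every tag required at its position."""
    return len(ngram) == len(template) and all(
        required <= token.gr for token, required in zip(ngram, template)
    )


def find(tokens, template):
    """Return the word forms of every n-gram that fits the template."""
    return [
        tuple(token.form for token in ngram)
        for ngram in ngrams(tokens, len(template))
        if matches(ngram, template)
    ]


# Invented sample: "я смотрю на дом" followed by "он дал брату книгу".
tokens = [
    Token("я", "я", frozenset({"SPRO", "nom"})),
    Token("смотрю", "смотреть", frozenset({"V"})),
    Token("на", "на", frozenset({"PR"})),
    Token("дом", "дом", frozenset({"S", "acc"})),
    Token("он", "он", frozenset({"SPRO", "nom"})),
    Token("дал", "давать", frozenset({"V"})),
    Token("брату", "брат", frozenset({"S", "dat"})),
    Token("книгу", "книга", frozenset({"S", "acc"})),
]

# verb - preposition - noun in the accusative case
verb_prep_acc = ({"V"}, {"PR"}, {"S", "acc"})
# verb - noun in the dative case - noun in the accusative case
verb_dat_acc = ({"V"}, {"S", "dat"}, {"S", "acc"})

hits_prep = find(tokens, verb_prep_acc)
hits_dat = find(tokens, verb_dat_acc)
```

Each template position lists the tags a token must carry, so a position such as `{"S", "acc"}` accepts any accusative noun regardless of its other tags. A real implementation would also need to respect sentence boundaries, which this sketch ignores.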

Level of implementation, recommendations on implementation or outcomes of the implementation of the results

The following practical results have been achieved during the implementation of the Corpus Studies of Language Variation: from Deviations to Linguistic Norm Project carried out by The Linguistic Laboratory for Corpora Research Technologies (Faculty of Humanities, NRU HSE) in 2015:

  1. templates for non-standard linguistic corpora have been developed and implemented. A template is viewed as a set of software operations that enable a researcher to issue search commands that group the language material output from a corpus in accordance with a stipulated combination of contextual attributes that vary in frequency and that users regard as relevant for their research tasks. Examples of strings that are searchable with templates include: verb – preposition – noun in the accusative case; verb – noun in the dative case – noun in the accusative case. This is the first time that templates have been used with non-standard corpora;
  2. the linguistic corpora have been improved (collections have been expanded; annotation and the search engine have been optimized);
  3. the stylistic trainer for assessing speech competency and for developing academic text editing skills has been thoroughly tested and expanded;
  4. speech rules developed during the project implementation have been partially automated;
  5. basic educational principles have been formulated, and trial competence-oriented linguistic tasks have been developed, based on the description of a deviation from a standard in comparison with standard-compliant speech (the tasks are being used in class at the NRU HSE and at the NRU HSE Lyceum);
  6. a propaedeutic environment has been set up, on the basis of the non-standard Russian corpora and the stylistic trainer, that assists a Russian speaker/writer in developing self-analysis skills with regard to variation in the usage of lexical and grammatical units;
  7. some major rules that subconsciously guide a speaker/writer during speech generation, in line with the increased usage of new standards, have been formulated (growing points for observing the development of a language in its natural environment).


[1] RFL – Russian as a foreign language


Kutuzov A. B., Kuzmenko E. Comparing Neural Lexical Models of a Classic National Corpus and a Web Corpus: The Case for Russian, in: Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science.: Springer International Publishing, 2015. С. 47-58. 
Апресян В. Ю. Cвязь семантических и коммуникативных свойств языковых единиц // Компьютерная лингвистика и интеллектуальные технологии. 2015. Т. 1. C. 2-18. 
Ахапкина Я. Э. Рефлексивные глаголы «убираться» и «играться»: кодификация и узус // Труды института русского языка им. В.В. Виноградова. 2015. № 6. C. 392-412. 
Vinogradova O. I. Learner Corpora Researches Review (trends observed in the 8th conference CORPUS LINGUISTICS - 2015) // Journal of Language and Education. 2015. 
Dobrushina N. The Verbless Subjunctive in Russian // Scando-Slavica. 2015. Vol. 61. No. 1. P. 73-99.
Кутузов А. Б., Кузьменко Е. А. Использование корпусных технологий для изучения ошибок: learner corpora на факультете филологии НИУ ВШЭ // Научно-техническая информация. Серия 2: Информационные процессы и системы. 2015. № 1. C. 21-26. 
Плисецкая А. Д., Филимонов К. В. Фрейминг и рефрейминг в речевых стратегиях американских политических лидеров // Вестник Московского университета. Серия 21: Управление (государство и общество). 2015. № 4. C. 160-176. 
Рахилина Е. В. Степени сравнения в свете русской грамматики ошибок // Труды института русского языка им. В.В. Виноградова. 2015. № 6. C. 310-333. 
Рахилина Е. В. Стилистически маркированные глаголы в русском языке: совать-сунуть // Вестник Томского государственного университета. 2015. 
Corpus Linguistics 2015: Abstract Book. Lancaster : Lancaster University Press, 2015. 
Daniel M. Stem initial alternation in Russian third person pronouns: variation in grammar, in: Компьютерная лингвистика и интеллектуальные технологии. По материалам ежегодной Международной конференции "Диалог" (2015). Moscow : Изд-во РГГУ, 2015. С. 95-103. 
Zevakhina N., Dzhakupova S. Corpus of Russian student texts: design and prospects, in: Материалы 21-й Международной конференции по компьютерной лингвистике "Диалог". Moscow : Изд-во РГГУ, 2015. 
Kutuzov A. B., Kuzmenko E. Semi-automated typical error annotation for learner English essays: Integrating frameworks, in: Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015, Vilnius, 11th May, 2015.: Linköping University Electronic Press, 2015. С. 35-41. 
Жукова С. Ю., Зевахина Н. А., Джакупова С. С. Контаминация конструкций в речи нестандартных русскоговорящих на материале корпуса русских учебных текстов, in: Труды Международной научной конференции "Корпусная лингвистика-2015". Санкт-Петербург : Издательство СПбГУ, 2015. С. 390-397. 
Kuznetsova J. Genitive of cause and cause of genitive, in: Donum semanticum: Opera linguistica et logica in honorem Barbarae Partee a discipulis amicisque Rossicis oblata. Moscow : Языки славянских культур, 2015. С. 135-146. 
Ахапкина Я. Э. Прикладные аспекты эрратологии: грамматика ошибок и речевая практика (конструкция "когда ... то"), in: Психолингвистические аспекты изучения речевой деятельности. Екатеринбург : Издательство Екатеринбургского университета, 2015. С. 196-208. 
Kutuzov A. B., Andreev I. Texts in, meaning out: neural language models in semantic similarity task for Russian, in: Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue” (2015). Moscow : Издательский центр «Российский государственный гуманитарный университет», 2015. С. 143-154. 
Плисецкая А. Д. «Свои» и «чужие» в московской предвыборной кампании 2013: стратегия сегментации аудитории, in: Философия. Язык. Культура. Санкт-Петербург : Алетейя, 2015. С. 449-463. 
Kuznetsova J., Rakhilina E. V. Genitive of cause and cause of genitive, in: Donum semanticum: Opera linguistica et logica in honorem Barbarae Partee a discipulis amicisque Rossicis oblata. Moscow : Языки славянских культур, 2015. С. 137-147. 
Виноградова О. И., Кашкин Е. В. ЧТО ВИДИТ СЛЕПОЙ И СЛЫШИТ ГЛУХОЙ: К ЛЕКСИЧЕСКОЙ ТИПОЛОГИИ СЛОВ ДЛЯ ОТСУТСТВИЯ ЧУВСТВЕННОГО ВОСПРИЯТИЯ // Вестник Воронежского государственного университета. Серия: Лингвистика и межкультурная коммуникация. 2016. № 3. C. 92-98. 
Slioussar N., Magomedova V. Stem-final consonant mutations in modern Russian // Morphology. 2016. 
Plisetskaya A. D. Conceptualization of migration during Moscow mayor campaign in 2013, in: XVI Апрельская международная научная конференция по проблемам развития экономики и общества: в 4 кн.. Moscow : Издательский дом НИУ ВШЭ, 2016. С. 422-430. 
Volkova A. A. Reflexivity in Meadow Mari: Binding and Agree // Studia Linguistica. 2017. Vol. 71. No. 1-2. P. 178-204.
Zevakhina N., Dzhakupova S. Russian metalinguistic comparatives: a functional perspective / ГУ ВШЭ. Series WP BRP "Linguistics". 2015. No. 39. 
Vinogradova O. I., Kashkin E. ..