The dynamics of the standard and the marginal in the Russian language

Priority areas of development: humanities
The project has been carried out as part of the HSE Program of Fundamental Studies.

Goal of research

The following topical areas were in the focus of the 2019 project: the nature of changes in the notion of speech norms; the zones of blurred borderlines between norms in normative and non-normative speech production; and the causes of systematic failures to follow rigidly recommended norms. Speech standards were observed against the background of non-standard speech production: that of speakers of other languages learning a particular language, heritage speakers, speakers of a particular regional variety of a language, and, finally, students mastering an academic register that is new to them in their native language or in a language they are studying. The main goal in 2019 was to interpret, with the help of corpus methods, the grammatical and lexical features of non-standard speech production against the background of traditional standard speech. Among the specific goals were attesting, classifying and interpreting cases of mismatch between usage and codification in the areas of word formation, micro- and macrosyntax, and lexical combinability.


Throughout the years 2012-2019, the source texts for the project corpora have been produced (1) by interviewees in the course of guided interviews, which were recorded and then transcribed by the project participants; and (2) in class and during independent work in preparation for classes. These collections of speech production allow the project participants to register the peculiarities of the speech produced by speakers of other languages, naïve translators, heritage speakers, university students in their courses on the academic register in Russian and English, and school students. Each text is uploaded by its author, who is a non-standard speaker or writer. Thanks to the corpora, large volumes of balanced data can be processed, changes over time in grammar and vocabulary can be registered, and extracted data can be categorised by text type and by the nature of its attribution. In corpus searches, the lexical and grammatical context features in focus are refined. The 2019 project includes reconstructing and modelling the rules, taken as objectively observed systemic regularities, which provide the ground for describing how non-standard speakers deviate from the recommended patterns in their oral and written speech production. It also helps to improve the language corpora by optimising the annotation and search functions during corpus upgrades, as well as by upgrading the stylistic trainer. Within the project, the rules of speech production receive an automated description with the help of computer programs, and they pave the way for study courses and competence-oriented linguistic tasks based on tracing deviations from standardly organised speech. As a result, the functioning of language in non-standard or new conditions can be researched, and new trends in language development can be identified, such as current tendencies in the Russian language used on the Internet.
It is important that within such approaches an error is seen not as a shameful or punishable violation, but as a valuable linguistic phenomenon that allows researchers to reveal which way the language is developing.

Empirical base of research

Russian and English learner text corpora (КРУТ, REALEC). Learners are assigned writing tasks in the course of their studies; when a text is uploaded to the corpus, it is supplied with metadata including sociolinguistic information on the author and the speech event; texts receive automated annotation (MYSTEM and the Les Crocodiles 2.0 program, or TriTagger) and manual annotation for errors. The corpora are supervised by Yulia Kuvshinskaya and Olga Vinogradova, respectively.

Russian Learner Corpus – a corpus of heritage Russian and of Russian produced by speakers of other languages. The texts are submitted by instructors of Russian as a Foreign Language. There is sufficient genre variability: short open-ended answers to a question; argumentative essays; paragraph-long mini-essays in answer to a certain stimulus; analytical notes; reports on analytical research; and descriptions of comparative, referencing, annotating, or commenting work on source texts. The language proficiency level is specified for each text. In carrying out searches, one can choose a subcorpus, extract all texts produced by a particular learner, specify the type of learners among the authors, or ask for texts of a particular genre. The typology of errors is the same as in КРУТ. The corpus is supervised by Anastasiya Vyrenkova and Olga Kultepina.

Regiolect Corpus. Texts are recorded during interviews designed as guided conversations with local residents of the areas visited during field trips. The recordings are transcribed by the research group under the supervision of Boris Orekhov.

Results of research

In theory:

In the area of calques from the dominant language, the research determined how strongly frequency of use influences the transfer of constructions from the dominant language to the studied language or to the heritage language; in the area of common mistakes attested in learner texts, correlations between the dominant language, the proficiency level in the studied language, and the nature of errors were revealed. Similarly, systemic errors in English learner texts were registered and interpreted as L1 interference errors. Research was carried out into the distribution of Russian and Slavic indicators of disjunction and their interaction with other logical operators (no, if, etc.) in different types of texts, learner texts among them. Non-standard uses of prepositions were compared in the Russian speech of speakers of other languages and that of Russian speakers. In regional Russian texts, a cycle of plots concerning the Sacred Lake was attested: the structure of narration was identified, and the motifs and types of story line were related to the historical background and traditional interpretation. The corpus was also supplemented with new singing formulas, new ballads, and new romantic songs. Fixed ritual speech was identified in the wedding rites of the Yuzhsk area of the Ivanovo region. A detailed analysis of errors in English learner texts focused on the use of English verb tenses, especially the aspectual forms: Perfect and Continuous.

In methodology:

For the resources of non-standard speech under observation, a formal interpretation of morphological, syntactic and meta-annotation tags was proposed, so that the tags were clustered and their areas of application were further specified; as a result, annotation approaches were improved. More annotator agreement experiments were set up and carried out. A system of oral annotation was developed for the RLC. A new search function was implemented in КРУТ. A video tutorial was created for the annotation of errors in REALEC. Further editing efforts were undertaken in all learner corpora. A new interface was developed for work with news from the media, and automated processing of texts before uploading was introduced. Uploading of texts from email was developed. For Russian regional studies, a pattern for a folklore songs subcorpus was developed on the basis of Sobolevsky’s collection of songs: a database was set up, annotation was automated, and categorisation of new texts was completed.
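
As an illustration of what an annotator agreement experiment can measure, the sketch below computes Cohen's kappa for two annotators who tagged the same error spans. The report does not say which agreement measure the project used, so the choice of kappa is an assumption, and the tag names and data are hypothetical rather than taken from the REALEC or КРУТ tagsets.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same tag by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's tag proportions, summed over tags.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same tag everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical error tags assigned to six spans by two annotators.
annotator_1 = ["Tense", "Article", "Tense", "Preposition", "Article", "Tense"]
annotator_2 = ["Tense", "Article", "Aspect", "Preposition", "Tense", "Tense"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A value near 1 indicates agreement well above chance, while a value near 0 indicates agreement no better than chance; low values for a particular tag cluster would suggest that its area of application needs further specification.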

In the area of empirical research:

Specific features of narrative in the speech production of bilingual children were revealed. Heritage features were described in the speech of a bilingual Russian-English child. In the RLC, texts totalling 400,000 more words were annotated; in REALEC, new essays totalling about 500,000 words were annotated; the corpus of regional Russian received about 120 hours of recorded texts, both audio and video, from the Kimry area of the Tver region and from the Yuzhsk and Pyestyakov areas of the Ivanovo region.

Level of implementation, recommendations on implementation or outcomes of the implementation of the results

The language corpora of non-standard speech production (КРУТ, RLC and REALEC) are openly accessible; their data can be downloaded and subjected to statistical analysis, and they reveal empirically grounded trends in language development. The results are of use to researchers in different areas of the humanities, teachers, speech therapists, and any specialists working with texts.


