Corpus technologies in linguistic and interdisciplinary research

Priority areas of development: humanities
Department: Linguistic Laboratory of Corpus Technologies

The specific object of the research is the variety of ways in which authors of spoken and written utterances express their intended meanings.

The focus is in particular on deviations from recommended norms in the speech of native speakers of Russian, including non-standard speakers such as residents of the Russian regions, bilingual speakers who have another L1 besides Russian, students beginning to master academic Russian, student writers beginning to master academic English, Russian emigrants who are in the process of losing Russian as their L1, and Russian-language blog writers on the Internet.

A corpus, as a collection of texts equipped with special search tools, gives researchers quick access to any speech acts of interest to them. Non-standard utterances are studied in comparison with samples of the norm and with the conventions imposed by the language itself and by the environment. Another direction of the research is to reveal current trends in the formation of Russian lexical and grammatical usage norms: orthoepic shifts, productive derivational patterns, and variation in inflexion and in sentence generation.

The project is interdisciplinary because its results can be applied in several fields. In political science and regional studies, they supply lists of features typical of a regional variety of Russian in areas where the dominant L1 is not Russian. In psychology, psychopathology and neurology, they help enumerate speech disorders as symptoms typical of specific medical conditions of general etiology. In the social sciences, they give additional evidence for the idiolectal profiles of different social strata. In social anthropology, they contribute to building ethnic speech profiles for cases of linguistic interaction and long-term contact.

The project undertakes research into non-standard varieties of the Russian language in the following directions:

  • Russian mastered by a speaker of other languages;
  • Regional varieties of Russian within the Russian Federation;
  • Russian in registers and genres new to a native speaker or writer;
  • Russian inherited from a family living in a non-Russian environment.

The goals of this long-term research are to develop, introduce into linguistic practice and maintain linguistic corpora of non-standard uses in Russian speech and writing, and to carry out corpus analysis aimed at revealing the current norm-forming trends in the Russian language. The latter includes searching for speech anomalies, ungrammatical uses and system-induced speech faults in comparison with standard uses.

In 2014 the project focused on the design and development of a style trainer, a system for analysing system-induced faults in writing and speaking and for preventing unconscious violations of norms and conventions by learners of the standard variety of Russian.

In the course of the research, the linguists worked out methodological approaches to organizing text collections in the following corpora: the corpus of Russian regional speech samples; the corpus of speech samples produced by descendants of Russian emigrants living outside the RF; the corpus of first-year students’ written works produced in the course Academic Writing in Russian; and the corpus of Russian in Internet blogs. Annotation systems were aligned with one another, search tools were improved to ensure quick and easy search for units and their combinations, and the categorization of system-induced violations of speech norms (specifically, errors at multiple levels, writers’ faults, and diffusion at the borderline between the norm and usage trends) was described in full detail.
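The search for annotated units by category can be pictured with a minimal sketch. It does not reproduce the laboratory's actual tools or tag sets: the record fields, the layer labels and the tag names below are all hypothetical and serve only to show how annotations attributed to language layers might be filtered and counted.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorAnnotation:
    """One manually annotated span (all field names are hypothetical)."""
    text_id: str   # identifier of the source text
    start: int     # character offset where the flagged span begins
    end: int       # character offset just past the flagged span
    layer: str     # assumed layer label, e.g. "grammar" or "lexis"
    tag: str       # assumed error tag, e.g. "case_government"


def find_errors(annotations, layer=None, tag=None):
    """Return annotations matching the given layer and/or tag."""
    return [
        a for a in annotations
        if (layer is None or a.layer == layer)
        and (tag is None or a.tag == tag)
    ]


def count_by_tag(annotations):
    """Count how often each tag occurs in a set of annotations."""
    return Counter(a.tag for a in annotations)


# Invented sample data, used only to exercise the functions above.
sample = [
    ErrorAnnotation("t1", 0, 5, "grammar", "case_government"),
    ErrorAnnotation("t1", 12, 20, "lexis", "word_choice"),
    ErrorAnnotation("t2", 3, 9, "grammar", "number_agreement"),
    ErrorAnnotation("t2", 15, 22, "grammar", "case_government"),
]

grammar_hits = find_errors(sample, layer="grammar")
tag_counts = count_by_tag(sample)
print(len(grammar_hits), tag_counts.most_common(1))
```

Keeping the layer and the tag as separate fields is one possible design choice; it lets a query target a whole layer of the language system or a single error type without changing the data.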

As a result, the corpora of non-standard usage (the Russian academic writing corpus and the heritage speakers corpus) were further developed and updated, and current trends in the restructuring of modern Russian lexical and grammatical standards were listed, categorized and interpreted.

4. The work in 2014 produced the following results:

  • The main mechanisms underlying contextual, pragmatic and purely grammatical deviations in texts produced by non-native speakers, heritage speakers and native speakers of Russian were outlined, such as the deterioration of comparative constructions, case variation in verbal and nominal constructions, and the choice of a singular or plural form of the predicate for a plural subject.
  • A linguistic theory was proposed for analysing language data, one that goes back to Frei’s The Grammar of Faults and to Scherba. Within this theory, deviations from norms are considered to reflect the development of the overall language system towards overcoming the rigid limitations of speech traditions and of a conservative system of norms. Such an interpretation allows researchers to treat these phenomena as lexical and grammatical typological variations of the norm rather than as deviations from it.
  • A style trainer was developed and introduced into teaching practice to prevent typical usage mistakes by making learners concentrate on areas where they are more prone to making mistakes and by drawing their attention to the variability of means while they edit their texts.
  • Trends towards variability were described on the basis of data collected from the corpora and on the Internet, as well as from the results of field research devoted to such areas as the palatalisation of front plosives, syntactic anomalies in search queries, the specifics of the lexicon of naïve poetry, and shifts in the main notions of political and media discourse.
  • Database collections in Russian were set up and described, taking into account multiple uses of lexical items and grammaticalised constructions; they are intended to serve as patterns for lexical typology research across a representative number of larger and smaller languages.
  • A theoretical foundation was laid for categorization nets used to reveal speech anomalies in the utterances of non-standard speakers and writers, and samples of such nets were presented.
  • In the area of methodology, categorization schemes for grammatical and semantic speech faults were introduced and then updated for all the resources developed within the project, and corresponding instruction manuals were offered to annotators working with the corpora.
  • Methods of organising the collections of texts in all corpora were improved and systematised.
  • Patterns of meta-annotation were mutually adjusted.
  • Search tools were made more efficient.
  • A detailed categorization of system-induced faults (multi-layered errors and faults, diffusion at the borderline between the norm and usage trends) was suggested.

Corpora of utterances produced by “non-standard Russian speakers” were initially set up in 2012-2013, and in 2014 they grew in volume thanks to cooperation with the University of Helsinki, Finland; Harvard University, USA; and the Teacher Community in Berlin. The following corpora have been formed as a result:

  • Corpus of Russian texts produced by descendants of Russian emigrants living abroad;
  • Corpus of Russian regional variations in the Caucasus, with a possible extension to the Corpus of Russian regional variations in the former Soviet areas;
  • Corpus of errors in Russian academic texts written by Russian learners;
  • Corpus of errors in the written texts produced by learners of Russian;
  • Russian Learner Translator Corpus (RLTC);
  • Corpus of errors in English academic texts written by Russian learners;
  • Corpus of errors in English essays written by Russian learners (REALEC);
  • Corpus of blogs written in Russian on the Internet.

Below are the data on each of these corpora.

Corpus of errors in Russian academic texts written by Russian learners


Over 2,280,000 tokens. Metadata (on writers and texts, a sign for quotations). Morphological parsing. 300,000 manual error annotations.

Russian Error-Annotated Learner English Corpus (REALEC)


225,000 tokens in 794 student texts (357 texts totalling 93,234 tokens were added in 2014). 10,364 annotated errors (7,800 of them annotated in 2014). 151 error types. Categorisation of errors introduced in 2012-2013 and elaborated in more detail for verb pattern and discourse errors. Instruction manuals for annotators. Experiment on annotator agreement carried out in 2014.
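The report does not state which measure was used in the annotator agreement experiment. As an illustration only, the sketch below computes Cohen's kappa, a common chance-corrected agreement coefficient for two annotators who label the same items. The label names and the sample data are invented for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled independently
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented labels for six error spans (hypothetical tag names).
annotator_1 = ["verb_pattern", "discourse", "verb_pattern",
               "article", "discourse", "article"]
annotator_2 = ["verb_pattern", "discourse", "article",
               "article", "discourse", "verb_pattern"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this invented sample the annotators agree on four of six spans, while agreement expected by chance is one third, so kappa comes out at 0.5. With 151 error types, a real experiment might also report agreement separately for coarser groups of tags.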

Corpus of Russian texts produced by descendants of Russian emigrants living abroad


400,000 tokens (by May 2014).

Russian Learner Translator Corpus (RLTC)



1,305,515 tokens added in 2014. Technical functioning improved.

Corpus of Russian regional variations in the Caucasus

10 more hours of recorded material, with 60,000 tokens processed in Praat and transcribed; about 1,000 errors annotated manually.

Description of the style training system

12 style training tests across the scholarly disciplines (sociology, law, economics, psychology, philosophy), 15 sentences each, with optional variants, of which 18 have been double-tested, and 10 more are still being tested. The system gives students the means to improve their self-editing skills for academic writing and to develop the necessary linguistic intuition, and it gives instructors individual, group and placement diagnostic procedures. Exercises with questions requiring short guided answers and with open-ended questions can be graded automatically. The training system was developed on the basis of established theory in stylistics, discourse analysis and erratology for the Russian language and can be applied in different areas of testing that follow the modern competence-based approach. The corpus data used for creating the exercises in the style training system were categorized according to the same principles that were applied to the collections of learner texts when the corpora were set up.
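How short guided answers might be graded automatically can be sketched as follows. This is an assumption-laden illustration, not the style trainer's actual procedure: the item identifiers, the answer key and the normalisation rules are hypothetical, and the sketch covers only short guided answers, not open-ended questions.

```python
import unicodedata


def normalize(answer):
    """Reduce an answer to a comparable form (assumed normalisation rules)."""
    text = unicodedata.normalize("NFC", answer).casefold()
    text = text.replace("ё", "е")  # treat ё and е as equivalent
    # Collapse runs of whitespace, then trim outer spaces and punctuation.
    return " ".join(text.split()).strip(" .,;:!?")


def grade(responses, answer_key):
    """Count the items whose response matches an accepted variant."""
    score = 0
    for item_id, accepted in answer_key.items():
        given = normalize(responses.get(item_id, ""))
        if given in {normalize(variant) for variant in accepted}:
            score += 1
    return score


# Hypothetical two-item test; each item may accept several variants.
answer_key = {
    "s1": ["в течение года"],
    "s2": ["согласно приказу", "в соответствии с приказом"],
}
responses = {
    "s1": "  В течение  года. ",
    "s2": "согласно приказа",
}

score = grade(responses, answer_key)
print(score, "of", len(answer_key))
```

Accepting a set of variants per item reflects the idea of drawing learners' attention to the variability of means: more than one edit can be correct, while a non-standard form such as the second response above is not matched.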

5-6. Conclusions.

In all corpora, annotation systems were elaborated and developed further. Tags in the different corpora were designed to meet the needs that arise in searching for different anomalous phenomena, whether typical or common problem areas or L1-induced errors. Every annotation tag corresponds to findings in the corpus to which it was applied.

Corpus research carried out in the laboratory contributes to the development of the theory of sociolinguistic change arising from language interference. All the corpora follow a usage-based approach to non-standard uses of a language. The investigation of competence insufficiency, growth and change potential, areas of variability and system-induced choice took regional and social factors into consideration. Universal and particular parameters in the lexicon of a language were described with non-standard registers in mind.

Research carried out on the basis of the corpora described above has become an important complement to field research, and vice versa. As a result, the principles of collecting data on field trips have been elaborated, annotations in different corpora have become more closely aligned, specific features of non-standard uses are now better represented, and there will soon be teaching methods and educational curricula based on knowledge about different groups of imperfect learners (such as corrective additions to the course of Russian for regional Russian speakers).

Another area in which the laboratory’s corpora were of great benefit is the typological study of errors: system-induced types of mistakes were identified as markers of potential growth areas, all mistakes were attributed to the corresponding layer among the language strata, and specialized uses were outlined for electronic communication and for academic writing.




