Tendencies of language change in the mirror of corpora

Priority areas of development: humanities
Department: Linguistic Laboratory of Corpus Technologies

Goal of research. The 2016 project aimed to reveal contemporary and diachronically important tendencies in vocabulary design and in the rules for combining language units at different linguistic levels, taking the speaker’s and writer’s intentions into account. These goals have been achieved on the basis of extensive corpus data, some of which were compiled in the course of the project. The special feature of the 2016 project is its focus on the diachronic component of the research. One direction within this component is the corpus study of Old Russian orthography; the unification of patterns across non-standard linguistic corpora is a significant improvement achieved in this area. Another area of the laboratory’s interests is the description of shifts in grammatical and lexical norms in language use: together with the reconstruction and modeling of systematic functioning, this research has made it possible to describe exceptions to the codified schemes. In addition, corpus optimization (growth of all the corpora in question, improvements in the adopted annotation practices and in the search options) has brought about changes in the stylistic training systems and in the automated systems that introduce the speech rules and regulations researched in the course of the project. This, in turn, has helped to form and improve teaching principles and linguistic tasks for courses of academic writing in Russian and in English, as well as for learners of Russian as a foreign language. A further direction of the laboratory’s research is tracing the tendencies in Russian norms demonstrated in Russian dialects and in Internet varieties of Russian.

Methodology. The data were collected through interviews with speakers of regional Russian, first in remote villages of Dagestan and in its capital city, Makhachkala, and second in the north of Russia. The interviews were designed specifically for this purpose by the project participants. Texts by learners of Russian as a foreign language were submitted as in-class practice and as independent home assignments; the same applies to samples produced by Russian students taking a course of Academic Writing in Russian. Spontaneous speech production of Russian bloggers was extracted from Internet content. Data are processed by corpus methods, namely with the help of an annotator’s workstation: features within the research area are tagged to enable automated search for lexical and grammatical phenomena or for particular speech errors. Corpus tools support valid statistical processing of large volumes of attributed material for both qualitative and quantitative research.
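The tagging-and-search step described above can be illustrated with a minimal Python sketch. It is not the project's annotator workstation: the `Annotation` record, the tag names "Agreement" and "Number", and the sample sentence are all assumptions made up for illustration. The sketch only shows how tagged spans make it possible to retrieve particular errors automatically and to count them.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A hypothetical stand-off annotation over a learner text."""
    start: int  # character offset where the tagged span begins
    end: int    # character offset where the tagged span ends (exclusive)
    tag: str    # illustrative error label, not a tag from the real corpora


def find_spans(text, annotations, tag):
    """Return the text fragments carrying the given tag, in document order."""
    hits = sorted((a for a in annotations if a.tag == tag), key=lambda a: a.start)
    return [text[a.start:a.end] for a in hits]


def tag_counts(annotations):
    """Count how often each tag occurs, as a basis for quantitative analysis."""
    return Counter(a.tag for a in annotations)


# Invented sample sentence with two tagged errors.
sample_text = "She go to school every days."
sample_annotations = [
    Annotation(4, 6, "Agreement"),  # "go"
    Annotation(23, 27, "Number"),   # "days"
]

agreement_errors = find_spans(sample_text, sample_annotations, "Agreement")
counts = tag_counts(sample_annotations)
```

Keeping annotations separate from the text (as offsets plus a label) is one common design; it lets the same text carry several overlapping layers of tags.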

Empirical base of research. The following collections form the basis for the research carried out by the participants of the project:

  • written works by native and non-native speakers of Russian, including heritage speakers, for whom Russian is either not the dominant first language or simply a language being studied;
  • interviews with dialectal Russian speakers;
  • Internet users’ replies to focused enquiries (openly accessible);
  • collections of written texts from different periods (partly from the Russian National Corpus).

Results. Throughout 2016, the following corpora of non-standard Russian (non-standard here implies mastering new language registers) were supplemented with new materials and updated in pursuit of the project’s goals:

  • Russian regional dialects in Dagestan, supervised by Nina Dobrushina;
  • Russian regional dialects in the north of Russia, supervised by Nina Dobrushina;
  • Russian Error-Annotated Learner English Corpus (REALEC), made up of texts written in English by Russian students learning English, supervised by Olga Vinogradova; available at http://www.realec.org/;
  • Russian Learner Text Corpus of academic Russian (КРУТ), supervised by Natalya Zevakhina; available at http://web-corpora.net/learner_corpus;
  • Russian Learner Corpus (RLC) of heritage Russian (Russian acquired as an inherited language in a non-Russian environment) and of learner Russian as a foreign language, supervised by Anastasiya Vyrenkova; available at web-corpora.net/RLC.

The Russian Learner Corpus (RLC) is currently represented by two sub-corpora: the longitudinal heritage sub-corpus RULEC and a sub-corpus of written texts collected as an update to the main corpus. The 2016 collection has been sorted by the dominant language of the contributing learner. The main achievement of this year is neutralizing the differences in the strategies applied by annotators: the causes underlying these differences have been identified, and an algorithm for eliminating multiple approaches has been developed (by Ivan Smirnov). Evgeniya Smolovskaya compiled a new collection of written texts produced by learners of Russian as a foreign language during the 2016 summer school conducted by the project participants.

In the Russian Learner Text Corpus of academic Russian (КРУТ), the update includes student works from the 2015-2016 academic year, produced by 1st- and 2nd-year Bachelor students taking the courses “Rhetoric: oral and written communication practice,” “Academic Writing (in Russian),” “Cultural Speech Competence,” “Literary Text Editing,” and “Foundations of Literary Translation.” A major achievement has been the updated virtual annotator workstation. Search is available at http://web-corpora.net/learner_corpus/search/, and the statistics can be accessed at http://web-corpora.net/learner_corpus/stats/. Anna Levinzon has developed an update to the stylistic training system, which serves to diagnose the level of a student’s speech competence and to give Bachelor students practice in text editing (students involved in this part of the project are responsible for 12 exercises corresponding to 15 rubric tasks, as well as for improvements to the previously collected exercises). The system can be accessed at http://web-corpora.net/acad/accounts/login/

REALEC, like RLC, is represented by two sub-corpora: a larger corpus of anonymised student essays and theses from 2012-2015, tagged with part-of-speech markers and manually annotated by experts, and a smaller collection consisting of two parts, namely non-anonymised essays written by 2nd-year HSE students in their 2016 English examination, which are currently being annotated, and essays written by current students majoring in Linguistics, not yet anonymised but already manually annotated by their English instructors (available at http://realec.org/hse/#/). The annotation approaches adopted in the corpus have undergone many improvements, resulting in increased annotator agreement (the project participants responsible for the updates are Elizaveta Kuzmenko, Alyona Fenogenova, and Olga Vinogradova). Training systems based on automated extraction of test questions from the corpus have been developed and tested (available at http://web-corpora.net/realec/).
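The report does not say how annotator agreement was measured. As an illustration only, the sketch below computes Cohen's kappa, a standard chance-corrected agreement coefficient for two annotators labelling the same items; the function name, the error tags, and the sample labels are assumptions and do not come from REALEC.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same single label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented labels for four error spans (hypothetical tag names).
annotator_1 = ["Agreement", "Number", "Agreement", "Spelling"]
annotator_2 = ["Agreement", "Number", "Spelling", "Spelling"]

kappa = cohen_kappa(annotator_1, annotator_2)
```

Comparing kappa before and after a revision of the annotation guidelines is one way to check whether the revision made annotators more consistent with each other.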

The project’s research into tendencies in the phonetic, lexical, and grammatical systems of Internet usage, and into the influence of pragmatic approaches on the structure of Internet utterances, has been carried out in the following areas:

  • consonant change in suffixes and word endings in contemporary usage;
  • derivation and word-formation strategies, including non-standard but typical ones (such as archaic change in neologisms, deviation from historical change, variation in palatalized/non-palatalized options and in place and manner of articulation);
  • corpus research into Old Russian orthography to confirm the hypothesis that spelling variation in the endings differed from that in the stem and was morphological in nature;
  • the functions of Russian habitual subjunctive constructions (for situations repeated at different times, in different places, in different manners and with different participants without any modality involved, for the construction of temporal semantics by formal means typical of conditional subjunctive constructions, and for the analysis of syntactic constructions that impose transposition of conditional-concessive clauses);
  • a detailed account of the ambitransitive character of Arabic transitivity as an example of a corpus study;
  • introduction of non-connective markers of connectivity instead of “and” in Russian;
  • uses of the verbs "добиться" (‘to achieve’) and "следить" (‘to keep watch’) when the subject of the predicate is or is not coreferential with the subject of the clause or infinitive construction;
  • research into the dynamics of the formation of the “own vs. alien” dichotomy in the discourse of the authorities: the evolution of representation in voicing opinions and the formation of the enemy/ally image (от имени правительства ‘on behalf of the government’, от имени нации ‘on behalf of the nation’), based on an analysis of the lexical, semantic and pragmatic components of political discourse;
  • functions of quantifying expressions in modern usage against the background of historical transformations in their use; their broadened combinatory power and increasingly abstract meaning; the loss of evaluation in their semantics and their grammaticalisation in different registers (such as the lower-register phrasemes "как грязи", "до жути"; the neutral "постольку-поскольку", "ни на шаг"; or the paradoxically productive archaic "ни на йоту", "видимо-невидимо"); linguistic fashion (short-lived quantifiers in rapidly changing slang, such as "децл"); detailed analysis of diachronic changes in the collocational range of phrasemes undergoing determinologisation (for example, "мизер", "мизерный"); focus on reduplication and the development of constructions with reduplication ("едва-едва", "чуть-чуть");
  • new features of verbalising spatial relations against the background of the corresponding English construction (such as uses of "вниз" (‘down’) that are redundant in Russian);
  • correlation between the choice of the singular or plural form of the predicate with "кто" (‘who’) as the subject and the syntactic and/or semantic features of a statement, across the range from marginally standard to ungrammatical uses;
  • analysis of the preserved and strengthened syncretism of semantic roles in the shift from the original to a metaphorical meaning;
  • research into pragmatic functions of the redundant participle.

Achievements in methodology:

  • Analyses of the role and applications of the expert annotation adopted in REALEC for forming appropriate correction strategies in teaching a foreign language, for optimizing learning efficiency through the prevention of speech errors, as well as for research into the phenomenon of language interference, with a focus on appropriate staging in mastering a non-native language;
  • Introduction of deictic categories and frame approach in grammar descriptions;
  • Demonstrated efficiency of describing the grammar of errors as a strategy for generating the grammatical system of the language;
  • Studies of non-native speakers’ and heritage speakers’ productions as leading to the reconstruction of the systemic character of the language and as demonstrating language development tendencies;
  • Improvement of data processing mechanisms through the introduction of a program for filtering search enquiries.

Level of implementation, recommendations on implementation or outcomes of the implementation of the results. Resources developed within this long-term project will be of use to linguists, philologists, historians, anthropologists, culturologists, philosophers, psychologists, journalists, PR specialists, translators and interpreters, teachers of Russian and of Russian as a foreign language, and instructors who set up courses in rhetoric, communication competence, text editing, and language teaching methodology; in other words, to anyone with an interest in the linguistic image of the world. Both researchers in the humanities and those keen on popularizing achievements in the humanities will find something important here as well. An instructor of a humanities course can use the results in designing a set of cases for their learners. A focus on language errors helps to reveal the tendencies of a language’s development.


