Goal of research
The goal of the project was to identify both specific and typical features of the system under observation; to present the range of idiolectal variation in speech strategies and in the means of getting the meaning through to the audience; and to describe systemic tendencies in the development of non-standard language variation against the background of conventional norms.
Throughout the project, source texts for the collections were created by the respondents in two ways: during guided interviews, which were recorded and then transcribed by the project teams, and in classes and as home assignments in language courses. The collections thus present the written and oral speech production of speakers whose native language differs from the language they are learning, inexpert translators, heritage speakers, student learners including secondary school learners, and Russian-speaking learners studying the Russian academic register. In the majority of these texts, it was the author of the speech production (a non-standard speaker) who typed the text in.
The corpora in the project allow researchers to process large volumes of comparable data statistically and to identify diachronic lexical and grammatical changes. They also allow researchers to adjust the search to the required text type, genre, and attribution, and to specify the lexical and grammatical features of the context during the search.
The 2018 project also included setting up and modelling the system of language rules, i.e. the existing systematic regularities that show both the recommended schemes and the deviations from them produced by non-standard speakers and writers. Another direction was to improve the functional features of the corpora within the project: uploading additions to the corpora; optimizing annotation practices and search options; testing and updating training systems for learners; working out algorithms that make the system of rules revealed in the research available in automated mode; developing new teaching approaches, tools and competence-oriented exercises based on the description of new non-standard shifts in comparison with standard speech production; and academic research into the speech production of non-balanced bilinguals and other types of heritage speakers of Russian, and of Russian students mastering foreign languages. The project also continues to trace current trends in the use of Russian on the Internet. It is essential to point out that within the project an error is treated not as a deficiency but as a valuable area of linguistic inquiry which allows linguists to reveal important tendencies in language development.
Empirical base of research
The following non-standard language corpora have been set up, updated and upgraded within the project:
Russian and English learner corpora (REALEC). Student learners write texts as part of their activities in class or in preparation for classes. The texts are uploaded together with metadata (information about the author of the text, including some relevant sociolinguistic data). They are annotated automatically (with MYSTEM, a system for identifying parts of speech and grammatical forms, or with the Stanford POS tagger, which has the same functionality) as well as by expert annotators (on the platforms Les Crocodiles 2.0 or BRAT), and the annotated files are added to the corresponding parts of the corpus.
Russian Learner Corpus (RLC) (Russian inherited from the family and Russian learnt by speakers of other languages). The texts are contributed by instructors of Russian as a foreign language and include the following genres: short answers to open-ended questions; argumentative essays discussing the problem set in the task; paragraph-long mini-essays written in response to a stimulus; analytical essays; research reports; and comparisons, summaries, or commentaries of source texts. For each text the level of language proficiency is identified. A researcher working with the corpus can get extended context for each query, or set up a subcorpus according to the metacharacteristics of the texts or the metadata of the authors. Errors are annotated by experts in accordance with the approach adopted in REALEC.
FolkloreCorpus (folklore texts). All interviews are recorded with the help of and then transcribed, coded and supplied with metadata before being uploaded to the corpus.
Results of research
Theoretical results achieved within the project:
- Within the part of the project devoted to the Russian Learner Corpus (RLC, http://www.web-corpora.net/RLC/), microsyntactic calques from the dominant language incorporated into the grammar system of the acquired language were interpreted, and systemic errors made by learners of Russian as a foreign language at different levels of language acquisition (from acquaintance with idioms at the beginner level to mastery of the lexical means of fluent speech production) were listed;
- Within the part of the project devoted to the Corpus of Russian Student Texts (CoRST, http://web-corpora.net/learner_corpus/), an analysis of agreement between the predicate and a noun phrase governed by a quantifier noun (ряд, половина, часть, множество) revealed a correlation between the morphosyntactic and semantic features of the quantifier noun, the choice of the noun form, and the choice of the verbal form; the influence of contextual features on the choice of the verbal form in the predicate was traced through uses in the 19th and 20th centuries; grammatical limitations and semantic peculiarities of the construction нашел чем удивить were identified, and its position in the typology of subordinate clauses with exclamatory interpretation was defined;
- Within the part of the project devoted to the Russian Error-Annotated English Learner Corpus (REALEC, http://realec.org/), the research identified errors in examination essays that were made under the influence of interference from the learners' native Russian, and examined the correlation between academic writing features and the level of success in the English examination. A new version of the code for automated generation of questions for English tests has been proposed for the convenience of English instructors who do not have specialised programs on their computers, so that they can use the test-maker on their own.
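As an illustration of how a test question might be derived from an error-annotated learner sentence, here is a minimal Python sketch. The annotation format (character offsets plus a correction), the function name `make_gap_question`, and the sample sentence are all hypothetical; the report does not describe the project's actual test-maker code.

```python
import random

def make_gap_question(sentence, start, end, correction, distractors, seed=0):
    """Turn one annotated error into a multiple-choice gap-fill item.

    sentence    -- learner sentence (hypothetical example data)
    start, end  -- character offsets of the erroneous span
    correction  -- the annotator's corrected form (the right answer)
    distractors -- wrong options; the learner's own error is added to them
    """
    error = sentence[start:end]
    stem = sentence[:start] + "_____" + sentence[end:]
    # Deduplicate the options while keeping the correct answer among them.
    options = sorted({correction, error, *distractors})
    random.Random(seed).shuffle(options)
    return {"stem": stem, "options": options, "answer": correction}

# Hypothetical annotated sentence: "depends from" corrected to "depends on".
sentence = "The result depends from the weather."
start = sentence.index("from")
item = make_gap_question(sentence, start, start + len("from"), "on", ["of", "at"])
print(item["stem"])
print(item["options"])
```

Using the learner's own error as one of the distractors is a design choice of this sketch: it makes the item target a confusion that is actually attested in the corpus.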
The main achievements in the area of corpus resources organisation:
- In the Russian Learner Corpus (RLC, http://www.web-corpora.net/RLC/) the system of metadata coding has been updated, so that pseudonyms for the authors of the learner texts are no longer required;
- In the Corpus of Russian Student Texts (CoRST, http://web-corpora.net/learner_corpus/) the algorithm for the devisualisation of many metadata elements from the learner texts has been reorganised (quotations, tables, graphs, examples, title pages, references, and appendices can now be dealt with easily in the corpus);
- In the Russian Error-Annotated English Learner Corpus (REALEC, http://realec.org/) a new approach with automated identification and annotation of several types of errors has been introduced for English examination essays written by 2nd-year Russian-speaking Bachelor students of the Higher School of Economics; the accessibility of metadata has been updated; and the anonymisation of student works has been ensured.
The results in the area of corpus functionality:
In the Russian Learner Corpus (RLC, http://www.web-corpora.net/RLC/) texts continue to be contributed to the collection regularly, annotation practices have been improved, and the range of learners’ dominant languages has expanded.
- The corpus details:
- Texts, total – 8002
- Words, total – 1508277
- Sentences, total – 129342
- Annotations, total – 59898
In the Corpus of Russian Student Texts (CoRST, http://web-corpora.net/learner_corpus/) the database of texts has expanded, and the presentation of metadata has been made more uniform.
- The corpus details:
- Texts, total – 3677
- Words, total – 3115212
- Sentences, total – 301079
- Annotations, total – 31472
In the Russian Error-Annotated English Learner Corpus (REALEC, http://realec.org/) the range of genres in the collection has expanded, and the tools for working with the corpus have been upgraded.
- The corpus details:
- Texts, total – 11482
- Words, total – 2883229
- Annotated texts, total – 10964
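The totals reported above for the three corpora are easier to compare once they are normalised. The following Python sketch derives a few ratios from those totals; the choice of ratios and all names in the code are illustrative assumptions, and only the input numbers come from this report.

```python
# Figures copied from the corpus details reported above.
corpora = {
    "RLC": {"texts": 8002, "words": 1508277, "sentences": 129342, "annotations": 59898},
    "CoRST": {"texts": 3677, "words": 3115212, "sentences": 301079, "annotations": 31472},
    "REALEC": {"texts": 11482, "words": 2883229, "annotated_texts": 10964},
}

def derived_stats(c):
    """Compute ratios only from the totals that are actually reported."""
    out = {"words_per_text": c["words"] / c["texts"]}
    if "sentences" in c:
        out["words_per_sentence"] = c["words"] / c["sentences"]
    if "annotations" in c:
        out["annotations_per_1000_words"] = 1000 * c["annotations"] / c["words"]
    if "annotated_texts" in c:
        out["share_annotated_texts"] = c["annotated_texts"] / c["texts"]
    return out

stats = {name: derived_stats(c) for name, c in corpora.items()}
for name, s in stats.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

Sentence and annotation totals are not reported for REALEC, so the sketch leaves the corresponding ratios out rather than estimating them.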
A field trip, “Folklore traditions of the Sacred Lake”, to the Yuzhny and Pestyakovsky districts of the Ivanovo region took place in July 2018 under the supervision of Yulia Kuvshinskaya. While researching the specific regiolect of the area, the group attested everyday and ritual speech acts of the local people and traced the wandering plots about flooded cultural objects. The group included 9 students and 2 members of staff.
At the annual international research conference of the Higher School of Economics (the “April conference”), the section “Russian language in the multilanguage world” took place on April 12-14, 2018. The reports dealt with methods of teaching Russian as a foreign language, the acquisition of Russian as a second language, and the functioning of Russian in bilingual societies. Results of corpus research in RLC (https://www.hse.ru/ma/foreign/news/217989758.html) were also presented.
The project tested the hypothesis that extralinguistic factors (gender, age, mobility, educational level) influence the identification and use of regionalisms by urban speakers of the regional variety of Russian living in Tver. The regional lexicon of Tver has never been studied from a sociolinguistic perspective before, though similar approaches have been applied to speakers from Novosibirsk, Vyatka, and Pskov. A sociological survey in the form of a written questionnaire with checklists of regionalisms was carried out with two sets of respondents: first, residents of Tver only (younger than 25 and older than 25); and second, three groups of girls living in Moscow, Saint Petersburg, and Tver. The survey data were analysed for correlation between social factors and the identification of regionalisms with statistical code in R (regression being the main method). As a result, the following conclusions have been drawn: a) no extralinguistic factor had any bearing on the use of regionalisms, that is, Tver residents identify and apply their region-specific language uses at the same level; b) the Tver lexicon is not endangered and is a part of the regional standard, which speakers of other regional varieties of Russian (those from Moscow and Saint Petersburg) successfully identify; c) the experimental model with two-level interviews and checklists can be regarded as satisfactory, even if more focused questions about new social parameters may yield higher statistical significance in future.
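The report states only that the survey data were analysed in R, with regression as the main method. The Python sketch below illustrates the general idea with a one-predictor least-squares fit; the function name, the variables, and the toy numbers are invented for illustration and are not the project's code or data.

```python
def ols_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data (hypothetical): respondent age and the number of regionalisms
# recognised on a checklist. A slope near zero would mean that age has
# little bearing on recognition in this toy sample.
ages = [18, 22, 24, 31, 45, 60]
recognised = [14, 15, 14, 15, 14, 15]
slope, intercept = ols_fit(ages, recognised)
print(f"slope={slope:.3f}, intercept={intercept:.2f}")
```

A real analysis would include all four social factors as predictors and would test the coefficients for significance, which this sketch does not attempt.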
As part of the diachronic analysis of speech standards and the comparative study of standardisation in related languages, historical consonant shifts in Ukrainian were traced through corpus research and experimentally. This direction continues the series of studies on the blurring of the system of historical consonant shifts carried out by the linguistic laboratory of Russian corpus technologies. Unlike Russian, Ukrainian has preserved the alternations in noun paradigms related to the second palatalisation (for example, рука – в руцi). The systemic character of these alternations in neologisms and quasi-words has been documented, and they have been compared with the alternations related to the first palatalisation, which are present in both Russian and Ukrainian.
Linguistic factors that influence the choice of gender in coining new words in Russian have been described, again through corpus and experimental research. According to the norms of standard literary Russian, diminutive and augmentative suffixes do not change the gender of the word they form (for example, маленький домишко, огромный домина). Nevertheless, a change in the gender of words formed with these suffixes has been attested in the speech production of native speakers of Russian. The task in both types of research is to identify the factors that may cause gender changes in such cases.
In the area of foreign language acquisition, the specific preferences of speakers of English as opposed to those of learners of English are being compiled. The purpose of the research is to use collocational data to demonstrate how different collocational behaviour is formed at different stages of exam preparation, to find out the collocational differences between learners of English and learners of Russian, and to set up task-specific collocational profiles for two genres of examination essays. Corpus analysis followed by the use of appropriate statistical instruments is essential to this task. The preliminary results achieved with lexical clusterisation show that the level of lexical standardisation is higher for Russian learners of English than for speakers of other languages learning Russian, and that essays in examination formats are closer to one another in REALEC than written texts of similar genres are in RLC. The smallest difference from the features of native speakers of English is attested in long essays written by students of Moscow Pedagogical University, which can be accounted for by the fact that these essays are much less formatted, require the use of recommended vocabulary to a much lower degree, and include reviews of English literature and, correspondingly, lengthy quotations from English fiction. The closest to each other in all cluster analyses were the REALEC subcorpora of essays written by HSE Bachelor students in an IELTS-type examination and essays written by HSE Master’s students taking a MAGOLEGO course in debating in English. The corpus of student texts from the Academic Russian course, on the other hand, is the farthest from any cluster under observation.
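The report does not say which measure underlies the lexical clusterisation. As one possible illustration, the Python sketch below compares subcorpora by the cosine similarity of their word-frequency profiles; the measure, the function names, and the toy texts are assumptions made for this example only.

```python
import math
from collections import Counter

def profile(text):
    """Lower-cased bag-of-words frequency profile of a text."""
    return Counter(text.lower().split())

def cosine_similarity(p, q):
    """Cosine of the angle between two frequency profiles (1.0 = same direction)."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)

# Toy stand-ins for subcorpora (hypothetical, not project data).
subcorpora = {
    "exam_essay_a": "the graph shows that the number of students increased",
    "exam_essay_b": "the chart shows that the number of visitors increased",
    "long_essay": "in the novel the narrator recalls his childhood by the sea",
}
profiles = {name: profile(text) for name, text in subcorpora.items()}
names = sorted(profiles)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(cosine_similarity(profiles[a], profiles[b]), 2))
```

In this toy setting the two formulaic exam essays come out closer to each other than either is to the free-form essay. With real subcorpora, function words would dominate raw counts, so some filtering or weighting would be needed before clustering.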
Within the research on Russian exclamatives, Russian complex sentences in which the pronouns какой, сколько and кто introduce the subordinate clause were analysed in the Russian National Corpus. The frequency distributions of matrix predicates and clause types support the hypothesis that independent exclamations arose from the insubordination of a series of subordinate constructions with interrogative pronouns. Thus, the analysis takes one more step towards clarifying the subordinate nature of exclamatives.
The results achieved in the project served as material for a coursebook for the Academic Writing course designed for non-humanities students, in which the exercises were created on the basis of data from the learner corpora in the project. The exercises will help to prevent errors in all aspects of work with academic texts: reading, analysing and writing. The coursebook is due to be published by the Yurite publishing house in December 2018.
Level of implementation, recommendations on implementation or outcomes of the implementation of the results
All corpora in the project are openly accessible to researchers in the humanities (historians, anthropologists, ethnographers, specialists in regional studies, philologists, linguists, translators and specialists in translation studies, sociologists, psychologists, speech therapists) and for teaching needs (secondary school teachers, university professors, methodologists and other educators). Corpus observations are a significant research area and also an essential component to take into account in the creation of teaching materials.