Child language corpora CHILDES

Longitudinal studies allow us to investigate the natural development of children’s speech. For Russian, such studies have been carried out by A. N. Gvozdev, S. N. Tseytlin, N. V. Gagarina, M. D. Voeykova, Eva Bar-Shalom, Vera Kempe and others, and they have led to important conclusions about children’s acquisition of grammar. However, more reliable conclusions about the early stages of language acquisition require a large amount of data analyzed with computational linguistics techniques. The CHILDES (Child Language Data Exchange System) database contains recordings of children’s speech in more than 40 languages, but the number of recordings in Russian is comparatively small.

The purpose of the CHILDES project for Russian is to create a modern corpus of oral speech transcripts based on video recordings of everyday communication between Russian-speaking children aged 1 to 3 and the adults around them. We study not only the speech produced by the children but also the language input, i.e. the speech that adults address to the child. Language input is known to have a significant impact on language acquisition, and longitudinal records in particular provide access to that input.

In 2016-2019, everyday child-parent interactions in two families were video-recorded and transcribed according to the guidelines of the CHILDES database. Families that participate in the project are asked to video-record the child’s daily interaction with adults for one hour once every two weeks. Linguists then transcribe the recorded speech in the CLAN (Computerized Language Analysis) program. The transcripts are morphologically parsed with the MyStem program, and cases of morphological ambiguity are resolved manually. In 2020-2022, five more families joined the Russian CHILDES project.
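The separation of child speech from adult input described above can be illustrated on a transcript in the CHAT layout used by CHILDES, where header lines start with `@`, main speaker tiers start with `*` plus a speaker code, and dependent tiers start with `%`. The following is a minimal sketch, not the project's actual tooling: the sample utterances are invented, the function names are hypothetical, and the speaker codes `CHI` and `MOT` are assumed to follow the usual CHILDES conventions.

```python
from collections import defaultdict

# Toy transcript in CHAT layout; the utterances are invented for illustration.
SAMPLE = """@Begin
@Participants:\tCHI Target_Child, MOT Mother
*MOT:\tгде мяч ?
*CHI:\tмяч .
%com:\tchild points at the ball
*MOT:\tда , это мяч .
*CHI:\tдай мяч .
@End
"""


def utterances_by_speaker(chat_text):
    """Group main-tier utterances (lines starting with '*') by speaker code."""
    result = defaultdict(list)
    current = None
    for line in chat_text.splitlines():
        if line.startswith("*"):
            speaker, _, utterance = line[1:].partition(":")
            current = speaker.strip()
            result[current].append(utterance.strip())
        elif line.startswith("\t") and current is not None:
            # A tab-initial line continues the preceding main tier.
            result[current][-1] += " " + line.strip()
        else:
            # Headers (@) and dependent tiers (%) end the current utterance.
            current = None
    return dict(result)


def split_child_and_input(chat_text, child_code="CHI"):
    """Return (child utterances, adult utterances), the latter grouped by speaker."""
    by_speaker = utterances_by_speaker(chat_text)
    child = by_speaker.get(child_code, [])
    adult_input = [
        utt
        for code, utts in by_speaker.items()
        if code != child_code
        for utt in utts
    ]
    return child, adult_input


child, adult_input = split_child_and_input(SAMPLE)
print(len(child), "child utterances;", len(adult_input), "input utterances")
```

Morphological parsing and manual disambiguation are not shown here, since they depend on the tools and conventions chosen by the project.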

In 2020 we focused on the acquisition of the grammatical categories of nouns and verbs, on the grammatical characteristics of the input, and on the impact of the input on the acquisition of grammar. Preliminary results for the two corpora indicate that neuter nouns are less frequent in children’s speech than masculine and feminine nouns, and that singular forms predominate over plural forms. The most frequent case in children’s speech is the nominative; among the oblique cases, the accusative and the genitive are the first to be acquired, and the instrumental is the most difficult. Interestingly, our data showed that the statistical properties of the input did not change significantly as the children grew older.

We also found that imperfective verbs prevailed over perfective verbs at almost all stages of acquisition. Singular verb forms were acquired earlier than plural forms. The future tense is the most difficult tense to acquire, and the second person is the last person to be learned. Both children heard a large number of infinitives and imperatives and often used them in their own speech. The study of the input showed that both children heard more imperfective than perfective verbs overall; the present tense prevailed over the past and the future; and 2nd person forms were more frequent than 3rd person and 1st person forms. Singular forms in the input were also more frequent than plural ones. The statistical properties of grammatical categories in child-directed speech did not change as the children grew older.

In 2021 we studied vocabulary acquisition and tested the hypothesis proposed in (Mani & Ackermann, 2018) that children acquire novel words from dense semantic categories faster than from sparse ones. In our work, vector semantic analysis was used for this purpose for the first time. The results showed that the clusters formed during the first period had grown in size by the second period. New clusters also appeared during the second period, but they were comparatively small. This result partly confirms the hypothesis.
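The comparison of cluster sizes across two periods can be sketched as follows. The text does not say which embedding model or clustering algorithm was used, so this example makes its own assumptions: the three-dimensional vectors and the English words are toy data, and words are grouped by simple single-linkage thresholding on cosine similarity as a stand-in for the actual method.

```python
import math

# Toy word vectors, invented for illustration only.
VECTORS = {
    "dog": (1.0, 0.1, 0.0),
    "cat": (0.9, 0.2, 0.0),
    "cow": (0.95, 0.0, 0.1),
    "spoon": (0.0, 1.0, 0.1),
    "cup": (0.1, 0.9, 0.0),
    "car": (0.0, 0.1, 1.0),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def clusters(words, vectors, threshold=0.8):
    """Single-linkage grouping: two words are linked if cosine >= threshold."""
    remaining = list(words)
    groups = []
    while remaining:
        seed = remaining.pop(0)
        group, frontier = [seed], [seed]
        while frontier:
            word = frontier.pop()
            linked = [
                other
                for other in remaining
                if cosine(vectors[word], vectors[other]) >= threshold
            ]
            for other in linked:
                remaining.remove(other)
            group.extend(linked)
            frontier.extend(linked)
        groups.append(sorted(group))
    return groups


# Cumulative vocabulary at two hypothetical observation periods.
period1 = ["dog", "cat", "spoon"]
period2 = period1 + ["cow", "cup", "car"]

print(clusters(period1, VECTORS))
print(clusters(period2, VECTORS))
```

In this toy setup the existing clusters gain members in the second period, while the newly appearing cluster contains a single word, which mirrors the pattern reported above.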

We found that first person verb forms appeared in child language earlier than other forms, while second person forms appeared later. Results on phrase structure in child speech indicate that word order tends to be SVO, but children are more likely than adults to place the object before the verb. At early stages, it is difficult for children to build sentences with two nominal verb arguments. In child sentences, the subject or the object is usually expressed by a personal pronoun.

We continued to study early vocabulary acquisition and tested the hypothesis that a vocabulary spurt exists in children’s speech, using data from two children. The hypothesis was confirmed; however, the age at which the phenomenon emerged differed from that reported in previous literature. In both children in our study, the vocabulary spurt occurred during the third year of life, whereas it is usually reported within the first two years.
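One simple way to look for a vocabulary spurt in longitudinal transcripts is to count, session by session, how many word types appear for the first time. The sketch below is illustrative only: the text does not state which criterion was used, so the threshold (a session whose count of new types is at least twice the mean of all earlier sessions) is a hypothetical choice, and the sessions and placeholder tokens are invented.

```python
# Toy sessions: (age in months, tokens produced by the child). Invented data.
SESSIONS = [
    (24, ["w1", "w2"]),
    (25, ["w1", "w3"]),
    (26, ["w2", "w4"]),
    (27, ["w5", "w6", "w7", "w8", "w1"]),
]


def new_types_per_session(sessions):
    """Return (age, number of word types not seen in any earlier session)."""
    seen = set()
    counts = []
    for age, tokens in sessions:
        types = set(tokens)
        counts.append((age, len(types - seen)))
        seen |= types
    return counts


def first_spurt(sessions, factor=2.0):
    """Age of the first session whose new-type count is at least `factor`
    times the mean count of all earlier sessions, or None if there is none.

    The criterion is a hypothetical one chosen for this sketch.
    """
    counts = new_types_per_session(sessions)
    for i in range(1, len(counts)):
        earlier = [n for _, n in counts[:i]]
        baseline = sum(earlier) / len(earlier)
        if baseline > 0 and counts[i][1] >= factor * baseline:
            return counts[i][0]
    return None


print(new_types_per_session(SESSIONS))
print(first_spurt(SESSIONS))
```

Note that the first session always counts every type as new, which inflates the baseline; with real data one might exclude it or use a fixed number of new words per interval instead.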

In 2022, the Center for Language and Brain decided to join colleagues from other countries in creating a unified database of data from mono- and bilingual Russian-speaking children. These data will be annotated and processed in accordance with the BiRCh protocol. The purpose of the BiRCh project is to create a longitudinal corpus of child language that contains data from mono- and bilingual children covering 5-10 years of their lives. We are currently working with data from three bilingual children and converting the collected monolingual data into the new format.

Data collected in the CHILDES project are now being used to create the Index of Productive Syntax for Russian, a method for evaluating and quantifying the grammatical complexity of young children’s spontaneous language samples. We have also started to work with bilingual data and plan to test the hypothesis of reduced input in bilinguals.


 
