Linguistic and ethnic diversity in Daghestan

Priority areas of development: humanities
Department: Laboratory of the Caucasian Languages

Goal of research: The project investigates parameters of variation at different linguistic levels across the languages of the Caucasus and analyses the structure of genetic diversity in one of the ethnic groups of Daghestan.

One of the studies in the project investigates the use of spatial forms with verbs of speech. Such marking of the addressee is typical of the basic speech verb in most East Caucasian languages (with the exception of several languages of the Lezgic branch). For East Caucasian languages, several realizations of this spatial strategy have been considered, including languages where the addressee of speech is coded by an essive rather than a lative form. Our study calls into question the widespread idea of a universal ditransitive metaphor that treats verbs of speech as verbs of information transfer.

The aim of the second study is a description of nominal inflection systems in some of the northern Dargwa lects, focusing on case forms and spatial forms, spatial and non-spatial uses of the spatial forms, the inventory and use of the peripheral cases as well as peripheral uses of nuclear cases.

Further, we have studied data from Archi and Agul, both of which have a special grammatical category, the verificative, which conveys the meaning of verifying the truth value of the predicate corresponding to the lexical verb that carries the verificative marker (i.e. verbP-Verif = ‘verify whether P is true’). According to the available typological data, the verificative is a unique type of grammaticalization of ‘see’. The question then arises how this typologically unique category may have originated in Archi and Agul, two distantly related Lezgic languages. The scenario according to which it was inherited from Proto-Lezgian may be rejected on various grounds, but a direct contact scenario may be excluded as well, because we have no evidence of historical contacts between the Aguls and the Archis.

We have also processed our field recordings of the Mehweb dictionary into (a) a morphological dictionary of Mehweb verbs (about 150 entries) and (b) an online dictionary of Mehweb verbs, nouns and adjectives (about 450 entries). Audio files from Mehweb and from Shiri, another Dargwa lect, have been annotated to facilitate further access to the acoustic data. These data will be used in a study of cross-linguistic variation in the properties of stops: what Daghestanian languages have in common and where they differ. The following acoustic parameters have been annotated: closure duration, burst time, and creaky voice period.
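The annotated parameters lend themselves to a simple duration computation. The sketch below is a minimal illustration in Python, assuming that each annotated stop is stored as a list of labelled intervals with start and end times in seconds. The `Interval` class, the label names and the time values are hypothetical and are not taken from the project's annotation files.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Interval:
    """One annotated stretch of the signal (hypothetical structure)."""
    label: str    # e.g. "closure", "burst", "creak" (assumed label names)
    start: float  # start time in seconds
    end: float    # end time in seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def stop_durations_ms(intervals: List[Interval]) -> Dict[str, float]:
    """Return the duration of each labelled component in milliseconds."""
    return {iv.label: round(iv.duration * 1000, 1) for iv in intervals}


# Made-up time points for a single stop token, for illustration only.
token = [
    Interval("closure", 0.100, 0.185),
    Interval("burst", 0.185, 0.197),
    Interval("creak", 0.197, 0.230),
]

durations = stop_durations_ms(token)
for label, ms in durations.items():
    print(f"{label}: {ms} ms")
```

Keeping one record per stop token makes it straightforward to compare the same component across lects, for example closure duration in Mehweb and in Shiri.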

Within the molecular anthropological part of the project, we have been working on methods to evaluate the divergence time and common ancestry of relatively closely related patrilineal groups based on an analysis of Y-chromosomal markers. We calculated the frequencies of microsatellite alleles at the Y-chromosome loci characteristic of each sample and thus obtained data on the genetic relatedness and divergence order of the patrilineal groups and on the age of the phylogenetic nodes. We mainly focused on the problem of dating the patrilineal groups and on their absolute and relative ages in order to establish a correspondence between the genetic reality and the social reality of kinship.
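The frequency calculation mentioned above can be sketched as follows. This is a minimal illustration, assuming that each haplotype is stored as a mapping from locus name to repeat count. The locus name `DYS19` and the repeat counts are placeholders chosen for the example; the report does not list the loci or publish the values.

```python
from collections import Counter
from typing import Dict, List

Haplotype = Dict[str, int]  # locus name -> number of repeats (assumed layout)


def allele_frequencies(haplotypes: List[Haplotype], locus: str) -> Dict[int, float]:
    """Relative frequency of each allele (repeat count) at one locus."""
    counts = Counter(h[locus] for h in haplotypes)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Made-up sample of four men from one hypothetical patrilineal group.
group = [
    {"DYS19": 14},
    {"DYS19": 14},
    {"DYS19": 15},
    {"DYS19": 16},
]

freqs = allele_frequencies(group, "DYS19")
print(freqs)
```

Running the same function over every locus and every patrilineal group gives the frequency tables that later comparisons between groups start from.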


The corpus data come with morphological glosses; in some cases, we use corpora that have been processed in the FieldWorks environment (fieldworks.sil.org). Most textual data are normalized narratives of various genres. In some cases, examples from spontaneous recorded texts are used. Direct elicitation involves collecting grammatical information on the basis of a questionnaire. Thus, the set of localizations for each dialect was established using a standard set of translation stimuli covering a wide range of spatial configurations that may be grammaticalized in East Caucasian languages. The stimuli also cover non-spatial uses of spatial forms (the Causee, the unintentional Agent, etc.).

Collecting dictionary and acoustic data requires high-sensitivity microphones and data storage in WAV format. In the online version of the Mehweb dictionary, however, the files are available in MP3; they have been converted to make online use more convenient. Each item is recorded in several consecutive repetitions to level out phrasal intonation and to avoid other acoustic aberrations.

Sample collection and analysis of genetic data deserve a more detailed description. Each donor was asked to answer questions about his patrilineal affiliation and his degree of kinship with other men in his clan. Samples were collected from several donors within the same patrilineal group only when they stated that they were unable to establish a direct kinship relation with one another. DNA was extracted from an aliquot of buccal cell suspension using adsorption on microscopic glass beads in ethanol-based phosphate buffer. Each sample was further processed to obtain amplicons for the mitochondrial D-loop and 13 STR loci of the Y-chromosome. D-loop amplicons were sequenced from both strands (to exclude aberrations) according to standard methods for Sanger sequencing. Using capillary electrophoresis in linear polyacrylamide gel, we measured the fluorescently labeled DNA fragments produced at the previous stage. The data thus obtained served as a basis for reconstructing the sequence of the target mitochondrial chromosome fragments.
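The point of sequencing from both strands can be illustrated with a short sketch: the reverse-strand read is reverse-complemented and compared with the forward read position by position, and any disagreement is flagged for inspection. This is a simplified illustration, assuming two already trimmed reads of equal length covering the same fragment. The function names and the toy sequences are hypothetical, and real reads would first need alignment and quality filtering.

```python
from typing import List

# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def strand_disagreements(forward: str, reverse: str) -> List[int]:
    """0-based positions where the two strand reads disagree.

    `reverse` is the read from the opposite strand, as sequenced;
    it is reverse-complemented before the comparison.
    """
    rc = reverse_complement(reverse)
    if len(rc) != len(forward):
        raise ValueError("reads must cover the same fragment (equal length)")
    return [i for i, (a, b) in enumerate(zip(forward.upper(), rc)) if a != b]


# Toy reads, not real D-loop data.
consistent = strand_disagreements("ACGGTTA", "TAACCGT")
conflicting = strand_disagreements("ACGGTTA", "TAACCGA")
print(consistent, conflicting)
```

Positions confirmed by both reads can be accepted into the reconstructed sequence, while flagged positions are candidates for manual checking or resequencing.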

Empirical base of research:

The project of 2014 is based on several types of empirical data.

First, the project involves using language corpora. The corpus data are used in the study of the verificative in Archi, where the vast majority of the examples come from different text collections that have been morphologically glossed. Corpus data are also used, in part, in the study of addressee-of-speech marking across the East Caucasian family.

Second, a large amount of data has been collected by direct elicitation. This is the source of all dictionary data, of the data on spatial systems in Dargwa lects, and of supplementary data on the verificative and on the marking of the addressee of speech.

Third, the empirical basis of the study includes field audio recordings. These data have been used in the online audio-dictionary of Mehweb and in the annotation of acoustic parameters of stops in the Mehweb and Shiri lects of Dargwa.

Fourth, the analysis of genetic diversity is based on DNA analysis of buccal epithelium samples obtained during fieldwork. Our collection includes samples from eight villages (Juli, Zildik, Gjukhrag, Juljag, Juljniv, Furdag, Urga, Vertil), a total of some 100 samples. At the present stage, about half of the Tabassaran villages where sample collection is planned have been covered.

Results of research:

We have investigated and interpreted the data on the use of spatial cases to express the addressee with verbs of speech. In our interpretation, this is evidence that, at some level of granularity, the addressee is an independent element of the cross-linguistic inventory of semantic roles. We have shown that the dative marking of the addressee in those East Caucasian languages where it is attested can be interpreted as another realization of the spatial strategy. We also raise a theoretical question: whether the data of Standard Average European could be reinterpreted as a realization of the same strategy rather than as a metaphorical extension of the ‘give’ pattern.

The field data collected in Deybuk, Kharbuk, Ginta, Gapshima, Mugi and Aymaumakhi were used to produce a description of the nominal declension in each of the dialects. We have collected both spatial and non-spatial uses of each of the spatial forms. We have also considered uses of peripheral cases (comitative, causal etc.) as well as some peripheral uses of core cases (availability of ergative marking for Instruments etc.).

We carried out an empirical (corpus-based) and theoretical (within the framework of the typology of grammaticalization) analysis of the evidence from Archi and Agul, which feature a special category of verificative. The verificative conveys the meaning of verifying the truth value of the predicate expressed by the lexical verb carrying the verificative morpheme. It is shown that, as in Agul, in Archi it is not an isolated morphological form but a whole inflectional subparadigm. Moreover, the verificative adds a new Agent-like participant, the Verifier, which requires agentivization of the verb ‘see’. This verb is the source of grammaticalization of the verificative in Archi and in at least some dialects of Agul, and in East Caucasian languages it has a dative rather than an ergative case frame. We briefly discuss whether the genesis of the verificative in Agul and Archi is historically independent.

An audio-dictionary of Mehweb has been compiled on the basis of our field recordings and put online (mehwebdict.wc.lt). The source files are stored in WAV, but the online version uses MP3 files to make remote access to and use of the data more feasible. Each entry includes the citation form, the oblique form for nouns and one of the TAM forms for verbs, audio for each of the forms represented, and the translation of the lexical item into Russian and English. The dictionary includes 451 lexical entries; this number will be expanded during future fieldwork, and the existing entries will be checked and corrected phonetically.

A morphological dictionary of non-derived verbs in Mehweb has been compiled. The dictionary includes some 150 lexical items, with forms from both the perfective and imperfective subparadigms: aorist, imperfect, imperative, infinitive. In the future we will have to check the accuracy of the transcription and to elaborate the semantic description of the verbs, but the inflectional properties of the verbs may be considered final.

As a result of our work with the acoustic data recorded in the villages of Shiri and Mehweb, we have recordings annotated word by word and sound by sound, as well as annotations of the components of stops and their immediate context. These results may be used in a cross-linguistic study of the acoustic properties of stops, their common properties and their distinctions across the languages of the Caucasus. The word-by-word annotation also facilitates other cross-linguistic studies of acoustic variation.

The methodology of the present project involves collecting genetic samples together with a record of the social (patrilineal) structure of village kinship. So far, more than 100 samples have been collected and analyzed. Every sample has been documented, including information on the kin relations of the donor to other men of his patrilineal group. DNA has been extracted from all samples. The sample covers villages located in different parts of the Tabassaran area, and the data are analyzed while controlling for the source location of the samples. For each of the samples, we have fluorescently labeled amplicons for 13 segments that contain microsatellite repeats of the Y-chromosome, and amplicons of the control region (D-loop) of the mitochondrial DNA. There is high variation in mitochondrial haplotypes within the Tabassaran population; however, the absence of detailed source location information for most other ethnic groups and samples presented in the literature does not allow us to model DNA exchange between the Tabassarans and their neighbors. Based on the length of the microsatellite repeats, we have calculated the allele frequencies at the microsatellite loci of the Y-chromosome and have quantitatively evaluated their genetic diversity. The degree of relatedness and the relative time of divergence of the patrilineal groups have been calculated using the median-joining method for the haplotype network and using coalescence models. In both cases, our reconstructions are more or less isomorphic to the social models of the village population structure. We have been able to group the patrilineal groups by relatedness and to establish the timeline of their coalescence towards common ancestry.
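One common way to quantify genetic diversity from allele frequencies is Nei's unbiased gene diversity, H = n/(n-1) * (1 - sum of squared allele frequencies), where n is the number of sampled chromosomes. The report does not state which diversity measure was used, so the sketch below is an illustrative assumption rather than the project's procedure, and the repeat counts in it are made up.

```python
from collections import Counter
from typing import Sequence


def gene_diversity(alleles: Sequence[int]) -> float:
    """Nei's unbiased gene diversity for one locus.

    `alleles` holds one repeat count per sampled Y-chromosome.
    Assumed measure: H = n / (n - 1) * (1 - sum(p_i ** 2)).
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("at least two samples are needed")
    counts = Counter(alleles)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Made-up repeat counts at a single locus.
mixed = gene_diversity([14, 14, 15, 16])
uniform = gene_diversity([14, 14, 14, 14])
print(round(mixed, 4), uniform)
```

A value of 0 means that every sampled man carries the same allele at the locus, and values approaching 1 mean that almost every sampled man carries a different one.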

Implementation of the results of the research:

One implementation of the results of the project is the online audio-dictionary of Mehweb (mehwebdict.wc.lt). This is a working version of the dictionary that will serve as the basis for a future open online linguistic resource.


Perspectives on Semantic Roles. Typological Studies in Language 106. Philadelphia: John Benjamins Publishing Company, 2014.
Daniel M. Against the addressee of speech – Recipient metaphor: Evidence from East Caucasian, in: Perspectives on Semantic Roles. Typological Studies in Language 106. Philadelphia: John Benjamins Publishing Company, 2014. pp. 205-239.
Daniel M., Khurshudian V. Temperature terms in modern Eastern Armenian, in: Linguistics of Temperature. Amsterdam: John Benjamins Publishing Company, 2015. pp. 392-439.