Master 2020/2021

Linguistic Data: Quantitative Analysis and Visualisation

Type: Compulsory course (Computational Linguistics)
Area of studies: Fundamental and Applied Linguistics
Delivered by: School of Linguistics
When: year 1, modules 3 and 4
Mode of studies: offline
Open to: students of one campus
Instructors: Olga Lyashevskaya, Ivan Pozdniakov
Master’s programme: Computational Linguistics
Language: English
ECTS credits: 5
Contact hours: 64

Course Syllabus

Abstract

The course covers modern methods of data analysis as applied to linguistic data, including statistical inference and exploratory data analysis with visualizations. We begin with the theoretical background in mathematical statistics and discuss the limitations of statistical methods and their applicability to linguistic problems. On the practical side, we use R to analyze real datasets. We also discuss different visualization techniques using the popular ggplot2 library.
Learning Objectives

Within this course you will:

  • learn about the principal steps of quantitative research in linguistics;
  • learn about the possibilities and limitations of quantitative approaches as applied to different research questions;
  • learn to formulate research questions and develop them into testable hypotheses;
  • explore the possibilities of data collection and different approaches to sampling;
  • learn to evaluate the quality of a quantitative approach;
  • study the most common corpus, experimental, and mixed designs of linguistic studies, and learn to evaluate research plans and to discover and prevent the associated threats to data validity;
  • practice preparing your quantitative data for analysis, evaluating the quality of your data, and treating missing data;
  • learn about the possibilities and limitations of conventional statistical techniques and criteria, as well as some popular contemporary multivariate statistical methods;
  • learn to choose and apply in practice a set of appropriate statistical tests for your research question.
Expected Learning Outcomes

  • Acquire basic skills in working with R
  • Learn to formulate research questions and to develop them into testable hypotheses
  • Know the basic types of data used in linguistic research
  • Learn and apply commonly used methods for answering research questions about language, and critically discuss their limitations
Course Contents

  • Introduction to R
    Types of data. Data frames. Functions and arguments. Pipes and the dplyr style in R. Visualizing data: base graphics and ggplot2.
  • Research design and hypothesis formulation. Descriptive statistics. Basic visualizations.
    Research design: the main stages of a study. Hypothesis testing. Types of distributions. Independent and repeated observations. P-values. Exact binomial test, t-test, ANOVA. Confidence intervals. Chi-squared and Fisher's exact tests for categorical data.
  • Correlation and regression
    Correlation. Regression: linear and polynomial. Logistic regression.
  • Mixed-effects models
    Fixed and random effects. Mixed-effects models.
  • Bootstrap. Decision trees. Random forests.
  • Clustering and dimensionality reduction
    Distance matrices. Clustering. Dimensionality reduction and visualisation using MDS, PCA, CA, and MCA.
  • Bayesian statistics
    Bayes' rule for statistical inference. (Generalized) linear models. Model comparison and selection.
Assessment Elements

  • non-blocking Homework assignments
  • non-blocking Individual research project
  • blocking Oral exam
    The student is expected to prepare the final project in written form as an electronic document. The exam takes the form of an oral defense of the final project. The Exam Score measures the overall quality of the final project and is an integer from 0 to 10. The Final Score is obtained from the following formula: Final Score = 0.6 × (Homework Score) + 0.4 × (Exam Score). The exam is held on Zoom according to a schedule that will be published on the official Telegram channel of the course, together with the link to the Zoom meeting. The first retake follows the same format as the exam. The second retake is a written test. Connectivity problems are not considered violations of the rules, provided they do not prevent the student from completing the examination.
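The weighting in the Final Score formula can be illustrated with a small calculation. The sketch below is only an illustration: the function name final_score is hypothetical, the Homework Score is assumed to be on the same 0 to 10 scale as the Exam Score (the syllabus does not say so), and no rounding is applied because the syllabus does not specify a rounding rule.

```python
HOMEWORK_WEIGHT = 0.6  # weight of the Homework Score, from the syllabus formula
EXAM_WEIGHT = 0.4      # weight of the Exam Score, from the syllabus formula


def final_score(homework_score: float, exam_score: int) -> float:
    """Return 0.6 * homework_score + 0.4 * exam_score, unrounded.

    The exam score must be an integer from 0 to 10, as stated in the syllabus.
    The homework scale is NOT stated there; 0 to 10 is assumed and not checked.
    """
    if isinstance(exam_score, bool) or not isinstance(exam_score, int):
        raise ValueError("Exam Score must be an integer")
    if not 0 <= exam_score <= 10:
        raise ValueError("Exam Score must be between 0 and 10")
    return HOMEWORK_WEIGHT * homework_score + EXAM_WEIGHT * exam_score


if __name__ == "__main__":
    # Hypothetical student: homework 8, exam (project defense) 7.
    # 0.6 * 8 + 0.4 * 7 = 4.8 + 2.8 = 7.6
    print(f"{final_score(8, 7):.2f}")
```

Because homework carries the larger weight, one point of Homework Score moves the Final Score by 0.6, while one point of Exam Score moves it by 0.4.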
Interim Assessment

  • Interim assessment (module 4)
    0.6 * Homework assignments + 0.4 * Individual research project
Bibliography

Recommended Core Bibliography

  • Gries, S. T. (2013). Statistics for Linguistics with R: A Practical Introduction (2nd revised ed.). Berlin: De Gruyter Mouton. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=604318
  • Levshina, N. (2015). How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1093048

Recommended Additional Bibliography

  • Gries, S. T. (2017). Quantitative Corpus Linguistics with R: A Practical Introduction (2nd ed.). Milton Park, Abingdon, Oxon: Routledge. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1386645
  • Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. New York, NY: Springer. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1175341
  • Harney, H. L. (2016). Bayesian Inference: Data Evaluation and Decisions (2nd ed.). Switzerland: Springer. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1301176
  • McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Boca Raton: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1338291