Linguistic Data Analysis: Quantitative Methods and Visualization

Master's programme 2020/2021

Status: Compulsory course (Computational Linguistics)

Field of study: 45.04.03. Fundamental and Applied Linguistics

Delivered by: School of Linguistics

Where: Faculty of Humanities

When: Year 1, modules 3 and 4

Mode of study: without an online course

Open to: students of the home campus

Instructors: Ляшевская Ольга Николаевна, Поздняков Иван Сергеевич

Programme: Computational Linguistics

Language: English

Credits: 5

Contact hours: 64


Abstract

The course is devoted to modern methods of data analysis as applied to linguistic data, including methods of statistical inference and exploratory data analysis with visualizations. We begin with the theoretical background in mathematical statistics and discuss the limitations of statistical methods and their applicability to linguistic problems. On the practical side, we use R to analyse real datasets. We also discuss different visualization techniques using the popular ggplot2 library.

Learning Objectives

Within this course you will:
● learn about the principal steps of quantitative research in linguistics;
● learn about the possibilities and limitations of quantitative approaches as applied to different research questions;
● learn to formulate research questions and develop them into testable hypotheses;
● explore the possibilities of data collection and different approaches to sampling;
● learn to evaluate the quality of a quantitative approach;
● study the most common corpus, experimental, and mixed designs in linguistic studies, and learn to evaluate research plans and to discover and prevent the associated threats to data validity;
● practice preparing your quantitative data for analysis, evaluating the quality of your data, and treating missing data;
● learn about the possibilities and limitations of conventional statistical techniques and criteria, as well as some popular contemporary multivariate statistical methods;
● learn to choose and apply in practice a set of appropriate statistical tests for your research question.

Expected Learning Outcomes

Master the basic skills of working in R
Learn to formulate research questions and develop them into testable hypotheses
Know the basic types of data used in linguistic research
Learn and apply commonly used methods for answering research questions about language, and critically discuss their limitations

Course Contents

Introduction to R
Types of data. Data frames. Functions and arguments. The dplyr style and pipes. Visualizing data: base graphics and ggplot2.
Research design and hypothesis testing. Descriptive statistics. Basic visualizations.
Main stages of a research study. Hypothesis testing. Types of distributions. Independent and repeated observations. P-values. Exact binomial test, t-test, ANOVA. Confidence intervals. Chi-squared test and Fisher's exact test for categorical data.
Correlation and regression
Correlation. Linear and polynomial regression. Logistic regression.
Mixed-effects models
Fixed and random effects. Mixed-effects models.
Bootstrap. Decision trees. Random forests.
Clustering and dimensionality reduction
Distance matrices. Clustering. Dimensionality reduction and visualizations using MDS, PCA, CA, and MCA.
Bayesian statistics
Bayes' rule for statistical inference. (Generalized) linear models. Model comparison and selection.

Assessment Elements

Homework assignments
Individual research project
Oral exam
The student is expected to prepare the final project in written form as an electronic document. The exam takes the form of an oral defense of the final project. The Exam Score measures the overall quality of the final project and is an integer from 0 to 10. The Final Score is computed with the following formula: Final Score = 0.6 × (Homework Score) + 0.4 × (Exam Score). The exam is held on Zoom according to a schedule that will be published in the official Telegram channel of the course; the link to the Zoom meeting will be published there as well. The first retake follows the same format as the exam. The second retake is a written test. Connectivity problems are not considered violations of the rules, provided they still allow the examination to be completed.
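As an illustration of the weighting in the Final Score formula, here is a minimal sketch of the arithmetic. It is written in Python purely for illustration (the course itself works in R). The function name `final_score` and the 0 to 10 range assumed for the Homework Score are assumptions, not part of the syllabus. No rounding is applied, because the syllabus does not state a rounding rule.

```python
def final_score(homework_score: float, exam_score: float) -> float:
    """Return 0.6 * homework_score + 0.4 * exam_score.

    Assumption (not stated in the syllabus): both scores lie on a 0-10 scale.
    The syllabus only says that the Exam Score is an integer from 0 to 10.
    """
    for name, value in (("homework_score", homework_score),
                        ("exam_score", exam_score)):
        if not 0 <= value <= 10:
            raise ValueError(f"{name} must be between 0 and 10, got {value}")
    # Weights taken directly from the formula in the syllabus.
    return 0.6 * homework_score + 0.4 * exam_score


if __name__ == "__main__":
    # Hypothetical student: homework 8, project defense 7.
    print(f"{final_score(8, 7):.1f}")  # prints 7.6
```

Because homework carries the larger weight, a one-point change in the Homework Score moves the result by 0.6, while a one-point change in the Exam Score moves it by 0.4.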

Interim Assessment

Interim assessment (module 4)
0.6 * Homework assignments + 0.4 * Individual research project

Bibliography

Recommended Core Bibliography

Gries, S. T. (2013). Statistics for Linguistics with R: A Practical Introduction (2nd revised ed.). Berlin: De Gruyter Mouton. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=604318
Levshina, N. (2015). How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1093048

Recommended Additional Bibliography

Gries, S. T. (2017). Quantitative Corpus Linguistics with R: A Practical Introduction (2nd ed.). Milton Park, Abingdon, Oxon: Routledge. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1386645
Harney, H. L. (2016). Bayesian Inference: Data Evaluation and Decisions (2nd ed.). Switzerland: Springer. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1301176
McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Boca Raton: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1338291
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. New York, NY: Springer. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1175341
