Master's programme 2020/2021

## Linguistic Data Analysis: Quantitative Methods and Visualization

Status: Compulsory course (Computational Linguistics)
Field of study: 45.04.03. Fundamental and Applied Linguistics
When: 1st year, modules 3 and 4
Mode of study: without an online course
Audience: own campus only
Programme: Computational Linguistics
Language: English
Credits: 5

### Course Syllabus

#### Abstract

The course is devoted to modern methods of data analysis as applied to linguistic data, including methods of statistical inference and exploratory data analysis with visualizations. We begin with the theoretical background in mathematical statistics and discuss the limitations of statistical methods and their applicability to linguistic problems. On the practical side, we use R to analyze real datasets. We also discuss different visualization techniques using the popular ggplot2 library.

#### Learning Objectives

Within this course you will:
• learn about the principal steps of quantitative research in linguistics;
• learn about the possibilities and limitations of quantitative approaches as applied to different research questions;
• learn to formulate research questions and develop them into testable hypotheses;
• explore the possibilities of data collection and different approaches to sampling;
• learn to evaluate the quality of a quantitative approach;
• study the most common corpus, experimental, and mixed designs of linguistic studies, and learn to evaluate research plans and to discover and prevent the associated threats to data validity;
• practice preparing your quantitative data for analysis, evaluating the quality of your data, and treating missing data;
• learn about the possibilities and limitations of conventional statistical techniques and criteria, as well as some popular contemporary multivariate statistical methods;
• learn to choose and apply in practice a set of appropriate statistical tests for your research question.

#### Expected Learning Outcomes

• Acquire basic skills for working in R
• Learn to formulate research questions and develop them into testable hypotheses
• Account for basic types of data used in linguistic research
• Learn and apply commonly used methods for answering research questions about language, and critically discuss their limitations

#### Course Contents

• Introduction to R
Types of data. Data frames. Functions and arguments. Pipes (dplyr). Visualizing data: base graphics and ggplot2.
• Research design and hypothesis formulation. Descriptive statistics. Basic visualizations.
Main stages of a research study. Hypothesis testing. Types of distributions. Independent and repeated observations. P-values. Exact binomial test, t-test, ANOVA. Confidence intervals. Chi-squared test and Fisher's exact test for categorical data.
• Correlation and regression
Correlation. Linear and polynomial regression. Logistic regression.
• Mixed-effects models
Fixed and random effects. Mixed-effects models.
• Bootstrap. Decision trees. Random forest.
• Clustering and dimensionality reduction
Distance matrices. Clustering. Dimensionality reduction and visualizations using MDS, PCA, CA, and MCA.
• Bayesian statistics
Bayes' rule for statistical inference. (Generalized) linear models. Model comparison and selection.

#### Assessment Elements

• Homework assignments
• Individual research project
• Oral exam
The student is expected to prepare the final project in written form as an electronic document. The exam is conducted as an oral defense of the final project. The Exam Score measures the overall quality of the final project and is an integer from 0 to 10. The Final Score is obtained from the following formula: Final Score = 0.6 × (Homework Score) + 0.4 × (Exam Score). The exam is held on Zoom according to a schedule that will be published in the official Telegram channel of the course; the link to the Zoom meeting will be published there as well. The first retake is conducted in the same way as the exam. The second retake is conducted as a written test. Connectivity problems are not considered violations of the rules, as long as they still allow the examination to be completed.
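
As an illustration of the Final Score formula, here is a minimal Python sketch. The function name `final_score` and the sample scores are hypothetical. The sketch also assumes that the Homework Score is on the same 0 to 10 scale as the Exam Score, which the syllabus states only for the exam. No rounding rule is given in the syllabus, so the function returns the unrounded weighted sum.

```python
def final_score(homework_score: float, exam_score: float) -> float:
    """Weighted sum from the syllabus: 0.6 * homework + 0.4 * exam."""
    # Assumption: both scores lie on a 0-10 scale (stated explicitly only for the exam).
    for name, value in (("homework_score", homework_score), ("exam_score", exam_score)):
        if not 0 <= value <= 10:
            raise ValueError(f"{name} must be between 0 and 10, got {value}")
    # No rounding is applied here because the syllabus does not specify a rounding rule.
    return 0.6 * homework_score + 0.4 * exam_score


# Hypothetical student: homework score 8, exam score 7.
example = final_score(homework_score=8, exam_score=7)
print(round(example, 2))  # rounding only for display
```

With these hypothetical inputs the weighted sum is 0.6 × 8 + 0.4 × 7 = 4.8 + 2.8 = 7.6.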

#### Interim Assessment

• Interim assessment (module 4)
0.6 * Homework assignments + 0.4 * Individual research project

#### Recommended Core Bibliography

• Gries, S. T. (2013). Statistics for Linguistics with R: A Practical Introduction (2nd revised ed.). Berlin: De Gruyter Mouton. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=604318
• Levshina, N. (2015). How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1093048