Bachelor's programme
2020/2021
Advanced Data Analysis
Best course by the criterion "Usefulness of the course for your future career"
Best course by the criterion "Usefulness of the course for broadening horizons and well-rounded development"
Status:
Elective course (Sociology and Social Informatics)
Field of study:
39.03.01. Sociology
Delivered by:
Department of Sociology
Delivered at:
St. Petersburg School of Social Sciences
When:
Year 4, modules 1 and 2
Mode of study:
without an online course
Instructors:
Anna Alexandrovna Shirokanova
Language:
English
Credits:
6
Contact hours:
48
Course Syllabus
Abstract
Advanced Data Analysis in Sociology focuses on categorical data analysis and covers special types of prediction and classification models (logistic regression and cluster analysis). The course continues with a discussion of data culture, from data management to narrating with data. This course is also the starting point for students interested in pursuing advanced training in research methods or planning to use quantitative methods with categorical outcomes in their own research.
Learning Objectives
- The course covers the foundations and popular techniques of categorical data analysis with the goal of training students to be informed producers and consumers of quantitative research.
Expected Learning Outcomes
- Students can apply classification techniques, propose hypotheses, and choose methods of categorical data analysis in R, including supervised classification with a binary outcome and unsupervised classification with clustering techniques for mixed data types.
- Students interpret outputs correctly, give reasons for their choice of techniques, and assess the quality of proposed analytical and visualization solutions, models, and data stories.
Course Contents
- Binary Logistic Regression: Models for categorical outcome variables. Variety of goals of analysis with categorical data. Typical goals of analysis and interpretation of results. Binary logistic regression. Objectives of logistic regression. The logistic curve. Maximum likelihood estimation. Assumptions of logistic regression. Perfect separation. Transforming a probability into odds and logit values. Goodness-of-fit measures for logistic regression. Out-of-sample validation. Classification matrix. Interpretation of results with linear and dichotomous predictors. Stepwise model building. Model diagnostics. Binary logistic regression in R. A comparison of binary logistic regression with decision trees.
- Cluster Analysis: Objectives of cluster analysis: segmentation, taxonomy description, data simplification, and relationship identification. A conceptual framework for cluster analysis. Distance between objects. Similarity measures. Distance measures for various types of variables. Proximity matrix. Assumptions of cluster analysis. Hierarchical and non-hierarchical clustering algorithms. K-means clustering, PAM, DBSCAN clustering. Dendrograms. Measures of overall fit. Cluster profiles. Between- and within-cluster variation. Determining the number of clusters. Interpretation of clusters. Cross-classification from several solutions. Cluster analysis in R. A comparison of cluster analysis with data reduction by principal components analysis and factor analysis.
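The Binary Logistic Regression topic above includes transforming a probability into odds and logit values. The course itself works in R; the following is only a minimal Python sketch of that transformation, and the function names are illustrative assumptions rather than anything from the course materials.

```python
import math


def probability_to_odds(p: float) -> float:
    """Convert a probability into odds: p / (1 - p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return p / (1.0 - p)


def probability_to_logit(p: float) -> float:
    """Convert a probability into a logit, i.e. the natural log of the odds."""
    return math.log(probability_to_odds(p))


def logit_to_probability(z: float) -> float:
    """Inverse transformation: the logistic curve maps a logit back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    for p in (0.2, 0.5, 0.8):
        print(p, probability_to_odds(p), probability_to_logit(p))
```

A probability of 0.5 corresponds to odds of 1 and a logit of 0, which is why logistic regression coefficients are read as shifts away from even odds.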
Assessment Elements
- Binary Outcome project
- Dimension Reduction project
- Cluster Analysis project
- Coding reflection paper: If you fail to submit this paper on time, you can make up for it by contributing to the 'R Gems' seminar.
- Rmd Customization
- Internet datasets
- Bayes reaction paper
- Viz Quiz (The Tidy Tuesday: https://github.com/rfordatascience/tidytuesday)
- Final Exam
Interim Assessment
- Interim assessment (module 2): 0.05 * Bayes reaction paper + 0.2 * Binary Outcome project + 0.2 * Cluster Analysis project + 0.1 * Coding reflection paper + 0.15 * Dimension Reduction project + 0.1 * Final Exam + 0.05 * Internet datasets + 0.05 * Rmd Customization + 0.1 * Viz Quiz
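The interim assessment formula above is a weighted sum whose weights add up to 1. As a rough sketch of the arithmetic, assuming every element is scored on the same scale (the syllabus does not state the scale here), the grade could be computed as follows. The names `WEIGHTS` and `interim_grade` are hypothetical.

```python
# Weights copied from the interim assessment formula in the syllabus.
WEIGHTS = {
    "Bayes reaction paper": 0.05,
    "Binary Outcome project": 0.20,
    "Cluster Analysis project": 0.20,
    "Coding reflection paper": 0.10,
    "Dimension Reduction project": 0.15,
    "Final Exam": 0.10,
    "Internet datasets": 0.05,
    "Rmd Customization": 0.05,
    "Viz Quiz": 0.10,
}


def interim_grade(scores: dict[str, float]) -> float:
    """Return the weighted sum of element scores (all on one common scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


if __name__ == "__main__":
    # Hypothetical student with the same score on every element.
    example = {name: 8.0 for name in WEIGHTS}
    print(round(interim_grade(example), 2))
```

The two projects on binary outcomes and cluster analysis carry the largest weights (0.2 each), so together they determine 40% of the interim grade.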
Bibliography
Recommended Core Bibliography
- Ledolter, J. (2013). Data Mining and Business Analytics with R. Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=587979
- Upton, G. J. G. (2016). Categorical Data Analysis by Example. Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1402878
Recommended Additional Bibliography
- Mood, C. (2010). Logistic Regression: Why We Cannot Do What We Think We Can Do, and What We Can Do About It. European Sociological Review, 26(1), 67–82. https://doi.org/10.1093/esr/jcp006