Fundamentals of Data Analysis in R

Bachelor's programme 2019/2020

Best course by the criterion "Usefulness of the course for broadening horizons and well-rounded development"

Best course by the criterion "Novelty of the knowledge gained"

Status: Compulsory course (Economics)

Field of study: 38.03.01 Economics

Delivered by: Department of Economics and Finance

Where: Faculty of Economics, Management and Business Informatics

When: 2nd year, modules 2 and 3

Mode of study: with an online course

Instructors: Paklina Sofia Nikolaevna, Shenkman (Popova) Evgeniya Andreevna

Language: English

Credits: 4

Contact hours: 44

Full Syllabus

Abstract

The course introduces methods of data manipulation and analysis for economics. Its main aim is to give students the theoretical knowledge and practical skills needed to analyze economic data with a range of techniques and to make data-driven decisions. The course begins with an introduction to data analysis in R, a free software environment for statistical computing and graphics. This part builds basic computing competencies: finding, importing, exploring, manipulating, and visualizing data. The second part develops research and analytical skills and covers hypothesis testing (parametric, nonparametric, and bootstrap-based), principal component analysis, and clustering. The course is based on real data on Russian and European public companies collected by the International Laboratory of Intangible-driven Economy, NRU HSE, and on sales and customer analytics data provided by the GAMES laboratory, NRU HSE. After completing the course, students will be able to locate data, wrangle and manipulate it, and provide meaningful economic analysis of it. The course is blended: the online video lectures are provided by DataCamp (www.datacamp.com), an online education platform, and the seminars are taught by lecturers of the National Research University Higher School of Economics.

Learning Objectives

By the end of the course, students should be able to: work comfortably in R and know the fundamentals of R syntax; import data into R, perform basic manipulations to prepare it for calculations, and export the results; visualize data; apply basic methods of preliminary data analysis; understand the limitations and relevance of these methods.

Expected Learning Outcomes

Knows basic data types and R syntax. Is able to transform datasets. Has data visualization skills.
Knows the basics of parametric and nonparametric hypothesis testing. Is able to resample data. Has skills in hypothesis testing with the bootstrap.
Knows the methodology of PCA and clustering. Is able to perform PCA and clustering in R. Has skills in evaluating clustering quality.

Course Contents

Introduction to data analysis with R
1. Importing and cleaning data. Fundamentals of R syntax. Ways to import data. Introduction to and exploration of raw data. Tidying data. Preparing data for analysis.
2. Data manipulation. Data wrangling. The select and mutate functions. The filter and arrange functions. Summarise and the pipe operator. Joining data. Intermediate operations in R.
3. Data visualization. Exploring ggplot2. Plot aesthetics. Plot geometries. Applying statistical methods. Themes. Plots for specific data types.
Hypothesis testing
4. Hypothesis testing: parametric vs nonparametric. Main advantages and limitations of parametric hypothesis testing. t-test for comparing means in independent and dependent samples. z-test for comparing proportions in independent and dependent samples. Main advantages and limitations of nonparametric hypothesis testing. Distribution-free tests: the sign test, the Wilcoxon signed-rank test, the Mann-Whitney U test. ANOVA tests.
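To make the contrast between a parametric and a distribution-free test concrete, here is a minimal sketch that computes only the test statistics (no p-values) for two independent samples: Welch's t statistic and the Mann-Whitney U statistic. It is written in Python purely for illustration, whereas the course itself works in R; the function names and the sample values are hypothetical and are not course data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (variances not assumed equal)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def mann_whitney_u(a, b):
    """U statistic for sample a: count of pairs where a's value exceeds b's (ties count 0.5)."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in a
        for y in b
    )


# Hypothetical measurements for two independent groups (illustration only).
group_a = [12.1, 14.3, 13.8, 15.2, 12.9]
group_b = [10.4, 11.9, 12.5, 10.8, 11.1]

t_stat = welch_t(group_a, group_b)
u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)
print(t_stat, u_a, u_b)
```

The t statistic depends on the actual values through means and variances, while U depends only on the ordering of the observations, which is why the rank-based test needs no distributional assumption. The two U values always sum to the number of pairs, here 5 * 5 = 25.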
PCA and clustering
5. Principal component analysis. Main objectives of principal component analysis (PCA). Mathematical model of component discovery. Algorithms for implementing PCA. Latent variables; criteria for choosing the number of components. Rotation and interpretation of the results.
6. Clustering. Main objectives of clustering; geometrical interpretation. Measures of distance between objects and between clusters. Hierarchical clustering: objective, algorithm, interpretation of results, dendrogram. k-means and k-median clustering: objective, algorithm, interpretation of results. Criteria for choosing the number of clusters and assessing clustering quality.
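As an illustration of the k-means topic, the following is a minimal sketch of Lloyd's algorithm together with the total within-cluster sum of squares, one common quality criterion. It is written in Python for illustration only (the course itself uses R). The function names, the toy data, and the deterministic initialization from the first k points are all assumptions made for this sketch; practical implementations usually use random or smarter initialization.

```python
from math import dist


def k_means(points, k, max_iter=100):
    """Lloyd's algorithm; the first k points serve as initial centroids (a simplification)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid (Euclidean distance).
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


def within_cluster_ss(points, labels, centroids):
    """Total within-cluster sum of squared distances to the assigned centroid."""
    return sum(dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))


# Hypothetical 2-D observations forming two visibly separate groups.
data = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (0.8, 1.1), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = k_means(data, k=2)
wss = within_cluster_ss(data, labels, centroids)
print(labels, centroids, wss)
```

Running the same procedure for several values of k and comparing the within-cluster sum of squares is one simple way to approach the choice of the number of clusters.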

Assessment Elements

Test
Microtests
Self-study (DataCamp)
Reports
Exam
The exam is held on the platform http://trajectory.hse.perm.ru/. During the exam, students must be in a Zoom session with a working camera. Requirements for the exam: a laptop, a webcam, and a stable internet connection.

Interim Assessment

Interim assessment (module 3)
0.3 * Exam + 0.15 * Microtests + 0.15 * Reports + 0.1 * Self-study (DataCamp) + 0.3 * Test
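The weights in the formula above sum to 1, so the result is a weighted average of the five assessment elements. A minimal sketch of the arithmetic, written in Python for illustration only: the component scores and the 10-point scale are assumptions, and the rounding rule applied to the final grade is not specified here.

```python
# Weights copied from the interim assessment formula.
WEIGHTS = {
    "exam": 0.3,
    "microtests": 0.15,
    "reports": 0.15,
    "self_study": 0.1,
    "test": 0.3,
}


def interim_grade(scores):
    """Return the unrounded weighted sum of the component scores."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores on an assumed 10-point scale.
example_scores = {"exam": 8, "microtests": 6, "reports": 7, "self_study": 10, "test": 9}
grade = interim_grade(example_scores)
print(round(grade, 2))
```

Because the exam and the test each carry a weight of 0.3, together they determine 60% of the grade.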

Bibliography

Recommended Core Bibliography

Spector, P. (2008). Data Manipulation with R. New York: Springer. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=229058

Recommended Additional Bibliography

Corder, G. W., & Foreman, D. I. (2014). Nonparametric Statistics: A Step-by-Step Approach (2nd ed.). Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=798830
Gatignon, H. (2013). Statistical Analysis of Management Data (3rd ed.). New York: Springer. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1073815
Govaert, G. (2009). Data Analysis. London: Wiley-ISTE. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=310759
Rahlf, T. (2017). Data Visualisation with R: 100 Examples. Cham, Switzerland: Springer. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1377904
