Bachelor
2020/2021

# Data Science for Economics

Type:
Elective course (HSE University and University of London Double Degree Programme in Economics)

Area of studies:
Economics

Delivered by:
International College of Economics and Finance

When:
Year 3, modules 1 and 2

Mode of studies:
offline

Language:
English

ECTS credits:
4

### Course Syllabus

#### Abstract

a) Course pre-requisites

- Statistics;
- Mathematics for Economists.

In the second part of the course I will present and derive the statistical properties of various estimators. To follow this part of the course, students should have a certain level of mathematical maturity. In practice, this means that students should have done some simple mathematical proofs before taking this class (for example, they should understand “epsilon-delta” arguments in the context of limits of sequences).

b) Abstract

The objective of this course is to provide students with a hands-on introduction to data science in economics (or, more broadly, to data science in the social sciences). The course consists of three parts:

1. Introduction to programming;
2. Overview of the most commonly used machine learning algorithms;
3. Time permitting, an introduction to causal inference and applications of machine learning algorithms to causal inference.

In the first part of the course students will learn basic programming in the R language. These skills will allow students to implement all the methods taught subsequently. Additionally, students will learn how to explore and analyse structured and unstructured data sets. This introduction to programming will also be useful in subsequent courses in econometrics and economics. At first, to gain intuition, we will study how to solve a problem in a brute-force manner, and then explore R packages and built-in functions that solve the same problem more efficiently.

In the second part of the course we will focus on the most commonly used machine learning algorithms. We will cover regression techniques (parametric, nonparametric and high-dimensional), classification methods, resampling methods, model selection, unsupervised learning and text analysis (time permitting).

Finally, in the last part of the course we will cover research papers that have recently applied machine learning methods to causal inference in economics.
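
The "brute force first, built-in functions second" approach mentioned in the abstract might look like the following minimal sketch. The function name `mean_loop` and the data are illustrative assumptions, not course material.

```r
# Brute-force sample mean: accumulate the sum in an explicit loop.
# `mean_loop` is a hypothetical helper written for this illustration.
mean_loop <- function(x) {
  total <- 0
  for (value in x) {
    total <- total + value
  }
  total / length(x)
}

x <- c(2, 4, 6, 8)        # illustrative data, not from the course

result_loop <- mean_loop(x)   # explicit loop
result_builtin <- mean(x)     # built-in function doing the same job

print(result_loop)
print(result_builtin)
```

Both calls return the same value. The loop version makes the computation explicit, while the built-in is shorter and usually faster.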

#### Learning Objectives

At the end of the course students should have developed the following skills:

- Write simple computer programs in the R language;
- Implement basic machine learning algorithms;
- Understand assumptions and statistical properties of machine learning algorithms;
- Be able to use machine learning algorithms to solve real world business problems.

#### Expected Learning Outcomes

- Explain where and how data science is used in industry and in academia. Additionally, after this class students should have a working knowledge of the basic data structures used in R, i.e., they should understand what kind of data each data structure can store and how to manipulate it.
- Understand and implement basic control structures: if, if-else, for loop, while loop. Understand and implement functions in R. Solve data science problems using control structures and functions.
- Understand and use the “apply” family of functions: lapply, sapply, tapply. Understand and use the data.table framework. Additionally, after this class students should understand the vectorized nature of R and be able to avoid unnecessary control structures.
- Write basic regular expressions and know the main functions of the stringr package. Additionally, students should be able to extract and summarize data from .html and .xml files.
- Explain and apply the basic ideas of statistical learning: supervised learning, unsupervised learning, regression, classification, clustering, bias, variance, curse of dimensionality, parametric estimators, nonparametric estimators, Bayes risk, Bayes decision boundary, in-sample MSE, out-of-sample MSE.
- Derive the OLS estimator in both the univariate and the multivariate case. Additionally, students should be able to analyse the unbiasedness and consistency of these estimators and obtain their asymptotic distribution. Finally, after this class students should be able to implement the OLS estimator in R, analyse its properties using Monte Carlo simulations, and apply OLS estimation techniques to real data.
- Distinguish two classification techniques: logit and linear (quadratic) discriminant analysis. Students should understand the theoretical properties of these methods (including all derivations) and implement them in R.
- Perform model selection in a very broad class of models. Students should be able to explain the main tradeoffs involved in each model selection technique. Additionally, after this class students should be able to do inference on complicated statistical functionals. Finally, after this class students should be able to implement both model selection techniques and bootstrap methods in R.
- Work with the basic ideas of linear model selection: tuning parameter, L0 norm, L1 norm, L2 norm, Lasso, Ridge. Additionally, students should be able to derive the simplest properties of the resulting estimators. Finally, they should be able to implement all the studied estimators in R.
- Be familiar with a very broad class of nonparametric techniques (polynomial regression, regression splines, smoothing splines, kernel methods, generalized additive models). Additionally, students should be able to derive the simplest properties of kernel estimators. Finally, they should be able to study the properties of these estimators using Monte Carlo techniques and apply them to real datasets in R.
- Outline theoretical properties of the following algorithms: regression trees, classification trees, bagging, random forest, boosting. Additionally, students should be able to implement all of the above algorithms using the computing language R.
- Outline theoretical properties of the following algorithms: maximal margin classifier, support vector classifier, support vector machine. Additionally, students should be able to implement all of the above algorithms using the computing language R.
- Outline theoretical properties of the following algorithms: PCA, K-means clustering, hierarchical clustering. Additionally, students should be able to implement all of the above algorithms using the computing language R.
- Explain how machine learning methods are used for causal inference in the “conditional on observables” framework.
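
As a rough illustration of the outcome on the “apply” family and vectorization, the sketch below computes the same squares three ways and then a grouped mean with `tapply`. All variable names and data are made up for the example, and only base R is used (no data.table).

```r
x <- 1:5                                  # illustrative input

# 1. Explicit for loop (the control structure one would often avoid in R)
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# 2. sapply: apply a function to each element and simplify to a vector
squares_sapply <- sapply(x, function(v) v^2)

# 3. Vectorized arithmetic: the operator works on the whole vector at once
squares_vec <- x^2

# tapply: apply a function within groups (hypothetical wage data)
wage  <- c(10, 12, 20, 22)
group <- c("a", "a", "b", "b")
group_means <- tapply(wage, group, mean)

print(squares_vec)
print(group_means)
```

All three versions of the squares agree. The vectorized form is the shortest and is typically the idiomatic choice in R.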

#### Course Contents

- Introduction to Data Science in Economics: Overview of the course. Data science process. Overview of the usage of machine learning algorithms in economics. General introduction to R and RStudio. Data types and classes in R. R objects. Subsetting of R objects.
- Control Structures and Functions: Control structures in R: if-else, for loops, nested for loops, while loops, repeat loops, next, break. Functions in R.
- Vectorized Computation and Data Aggregation: Usage of the “apply” family in R. Usage of the data.table package in R.
- Working With Text and the Web: Usage of the stringr package in R. Regular expressions in R. The structure of HTML and XML documents. Writing simple XML and HTML parsers.
- Introduction to Statistical Learning: Classification and regression, bias-variance tradeoff, mean squared error, in-sample and out-of-sample mean squared error, parametric and nonparametric statistical models. First look at the ordinary least squares (OLS) estimator.
- Large Sample Properties of OLS: Review of “asymptotic tools”. “Linear plus noise representation” of OLS. Derivation of asymptotic properties of OLS. OLS in matrix form.
- Classification: Introduction to logistic regression. Estimation of logistic regression via MLE. Discussion of the statistical properties of MLE. Linear discriminant analysis, quadratic discriminant analysis.
- Resampling Methods: Cross-validation, jackknife, bootstrap, subsampling.
- Linear Model Selection and Regularization: Best subset selection, forward stepwise selection, backward stepwise selection, dimension reduction methods. Ridge regression. Lasso regression.
- Nonparametric Estimation: Polynomial regression, step functions, basis functions, regression splines, smoothing splines, kernel methods, generalized additive models.
- Tree Based Methods: Decision trees, bagging, random forests, boosting.
- Support Vector Machines
- Unsupervised Learning
- Topics in Causal Inference
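
To give a flavour of the “OLS in matrix form” topic listed above, here is a minimal sketch that computes the OLS estimator from simulated data by solving the normal equations (X'X)b = X'y and compares it with R's built-in `lm`. The data-generating process (intercept 1, slope 2, standard normal noise), the seed and the sample size are assumptions chosen only for illustration.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 200                         # illustrative sample size

x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)        # assumed true model: intercept 1, slope 2

X <- cbind(1, x)                 # design matrix with an intercept column

# OLS in matrix form: solve the normal equations (X'X) b = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Built-in linear model fit, for comparison
beta_lm <- coef(lm(y ~ x))

print(as.vector(beta_hat))
print(unname(beta_lm))
```

The two sets of coefficients coincide up to numerical precision. Repeating the simulation many times with different seeds is one way to study the estimator's sampling distribution by Monte Carlo.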

#### Assessment Elements

- Problem sets I: Done in groups of two students
- Final exam
- Problem sets II: Done in groups of two students
- Problem sets III: Done in groups of two students
- Problem sets IV: Done in groups of two students

#### Interim Assessment

- Interim assessment (2 module): 0.4 * final exam + 0.15 * Problem sets I + 0.15 * Problem sets II + 0.15 * Problem sets III + 0.15 * Problem sets IV
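
The weighting formula can be checked with a short calculation. The scores below are hypothetical and assume a 10-point scale. The syllabus does not state the scale or any rounding rule, so none is applied here.

```r
# Weights taken from the interim assessment formula
weights <- c(final_exam = 0.40, ps1 = 0.15, ps2 = 0.15, ps3 = 0.15, ps4 = 0.15)

# Hypothetical scores for one student (assumed 10-point scale)
scores <- c(final_exam = 7, ps1 = 8, ps2 = 6, ps3 = 9, ps4 = 10)

# Weighted sum; indexing by name keeps weights and scores aligned
interim_grade <- sum(weights * scores[names(weights)])

print(sum(weights))      # the weights add up to 1
print(interim_grade)     # 0.4 * 7 + 0.15 * (8 + 6 + 9 + 10) = 7.75
```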

#### Bibliography

#### Recommended Core Bibliography

- Belloni, A., Chernozhukov, V., & Hansen, C. (2014). High-Dimensional Methods and Inference on Structural and Treatment Effects. Journal of Economic Perspectives, 28(2), 29–50. https://doi.org/10.1257/jep.28.2.29
- Einav, L., & Levin, J. (2014). Economics in the age of big data. Science, 346(6210), 1–6. https://doi.org/10.1126/science.1243089
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). ISLR: Data for An Introduction to Statistical Learning with Applications in R (R package version 1.0). Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsbas&AN=edsbas.28D80286
- Grolemund, G. (2014). Hands-On Programming with R.
- Mullainathan, S., & Spiess, J. (2017). Machine Learning: An Applied Econometric Approach. Journal of Economic Perspectives, 31(2), 87–106. https://doi.org/10.1257/jep.31.2.87
- Athey, S. (2018). The Impact of Machine Learning on Economics. NBER Chapters, 507. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsrep&AN=edsrep.h.nbr.nberch.14009
- Wickham, H., & Grolemund, G. (2016). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (1st ed.). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1440131

#### Recommended Additional Bibliography

- Efron, B. (2017). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science.
- McKinney, W. (2012). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=495822
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. Cambridge, Mass: The MIT Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=480968
- Wickham, H. (2015). Advanced R, Second Edition. Boca Raton, FL: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=934735