Introduction into R

Master 2019/2020

Category 'Best Course for Broadening Horizons and Diversity of Knowledge and Skills'

Category 'Best Course for New Knowledge and Skills'

Type: Bridging course (Population and Development)

Area of studies: Public Administration

Delivered by: Department of Demography

Where: Faculty of Social Sciences

When: 1st year, 2nd module

Mode of studies: offline

Instructors: Dmitry Malakhov

Master’s programme: Population and Development

Language: English

ECTS credits: 3

Contact hours: 24

Full Syllabus

Abstract

This syllabus was designed for lecturers, faculty members, teaching assistants, and students of the 38.04.04 ‘Public Administration’ and ‘Population and Development’ programmes. It meets the standards established by the National Research University ‘Higher School of Economics’ for the educational programmes 38.04.04 ‘Public Administration’ and ‘Population and Development’, and it satisfies the standards of the Master’s programme curriculum as of 2019.

This course offers a versatile introduction to R programming. R is one of the most actively developed languages currently used in academia. It has many advantages: it is open-source, it has an active and helpful community, and it enables reproducible research. From the very beginning, R was developed by statisticians for statisticians, so it has versatile built-in functions for statistical computations and visualisations. The students will be provided with a thorough introduction to R that combines basic theory with a particular focus on practical applications. The course also aims to elaborate on the essentials of reproducible research in academia.

This crash course covers the essentials of R, progressing gradually from basic to more advanced topics. It relies heavily on the principles of learning by doing and moving from easy to hard. During the course, the students will learn how to acquire, manipulate, and visualise data. We shall also cover the most common basic statistical models.

The course was designed for students with no prior knowledge of R or complex mathematical techniques, but we expect the participants to have some basic knowledge of statistics and basic computer skills. Students who have some prior experience of working with data (in R or any other environment such as Stata, SPSS, SAS, Excel, Python, or gretl) will benefit greatly from the course.

The entire learning process is platform-independent. The students will get identical results on Windows, Mac, or Linux, and they do not need to rely on any proprietary closed-source commercial products. The total cost of all software used in this course is zero, so there are no entry barriers for students and no dependence on academic subscriptions or costly licences for commercial products.

The course is divided into 4 main parts. The first two parts are devoted to basic programming skills; the last two are tightly connected with applied statistical analysis. Because the later parts build on the earlier ones, students should work consistently from the very beginning in order to pass the course.

Learning Objectives

Upon completion of this course, the students are expected to be able to manipulate and clean data using the basic built-in functions of R, even in a Big Data context.

Expected Learning Outcomes

Know how to obtain, install, and configure free and open-source statistical software from the Internet on the platform they are using.
Know how to write and run R scripts, edit them in RStudio, view objects in the user space, and invoke help.
Distinguish between data types, convert between numeric values, factors, and character strings.
Check for the presence of invalid values in the data, treat them accordingly, and realise the potential consequences of having erroneously encoded data in their analysis.
Make subsets of data, and conduct analyses on sub-samples based on a certain criterion.
Know how to import data from Stata, SPSS, Excel, or a CSV file into R, and check the imported data for coding errors.
Know how to prepare source code and data for publication as a publicly accessible online appendix for the purposes of reproducibility.
Be able to choose the quickest way to read and save a data set while using the least CPU time and RAM.
Understand the basic concepts of programming universal to all Turing-complete programming languages and be able to implement them in R.
Substitute inefficient serial computations with native low-level R functions and know which function from the apply family to choose in various scenarios, utilising all available cores of a machine.
Process collections of text files and extract only the necessary information from unstructured data based on regular expressions.
Display relationships between pairs of variables using scatter and line plots.
Visualise descriptive statistics and distributions while avoiding the most common plotting artefacts.
Be able to choose between vector and raster image export formats and import the exported image into their report.
Estimate an ordinary linear model and report robust significance statistics.
Estimate a probit model, a logit model, an ordered probit model, and an ordered logit model; report robust significance statistics; and estimate the average effect of variables on the probability.
Know which reported indicators can be used for diagnostics and which cannot.
Visualise regression residuals, check the presence of influential observations and determine whether they can be attributed to coding errors or non-typical observations.
Correct their results for the presence of within-cluster correlation of errors.
Report concise tables with regression estimates and readable labels that can be pasted into MS Word or LaTeX.

Course Contents

Course introduction. Basics of R, RStudio, reproducible research. Data: types, structures.
Informing the students about the pros and cons of the R programming language. Demonstrating the benefits of R over existing modern commercial products. Explaining the principles of reproducible research. Familiarising the students with the console interface, scripts, and the working environment. Demonstrating multiple commonly used data formats and reading data sets exported from popular statistical software. Introducing the basic structures of R: vectors, matrices, data frames, data classes. Demonstrating the power and versatility of user-written functions that allow the students to save time on repeated operations.
Numerical data: manipulation, import. Functional programming.
Explaining the differences between data formats from the various software packages mentioned in lecture 1 and cautioning about pitfalls when importing them into R. Demonstrating a proper way to write data sets for reproducible research without loss of hidden attributes. Showing how to speed up data import using custom-written packages. Showing how to merge multiple data sets into one and how to check the merged result for integrity. Familiarising the students with the most generic programming principles: various types of conditions and loops. Demonstrating how to speed up computations in R by avoiding loops in large data sets (Big Data). Showing the scope of various by-element apply functions and the conversion of the resulting lists into vectors or matrices. Introducing the students to the basics of parallel computing.
Working with text data. Descriptive analysis, visualisations.
Working with text data in R: loops with text data, concatenating, trimming, duplication, etc. Summary, pivot tables, histograms, correlation tables. Basic plots: scatter plots, line plots, density plots, box plots, bar charts, heat maps, etc. Saving graphs as PDF and PNG, multi-row and multi-column plotting, basic 3D plotting.
Applied statistical analysis.
Introducing the students to the most commonly used generalised linear models: simple linear regression and binary choice models. Testing the significance of coefficients and explaining whether the results can be deemed reliable or not. Demonstrating how to extract intuitively interpretable marginal effects from non-linear models. Showing the basic tools of regression diagnostics and cautioning against outdated techniques and statistics that are still reported by default. Warning about the error of applying naïve methods to data where the errors might not be independent, and correcting for potential within-cluster correlation.
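
The second topic above mentions replacing explicit loops with functions from the apply family and converting the resulting lists into vectors or matrices. A minimal base-R sketch of that idea follows; it is not taken from the course materials, and the data and object names (samples, means_loop, and so on) are hypothetical.

```r
# Hypothetical data: three numeric vectors of different lengths
samples <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

# Explicit loop: preallocate the result, then fill it element by element
means_loop <- numeric(length(samples))
names(means_loop) <- names(samples)
for (i in seq_along(samples)) {
  means_loop[i] <- mean(samples[[i]])
}

# lapply() applies the function to every element and always returns a list
means_list <- lapply(samples, mean)

# vapply() returns a vector and checks that each result is a single number
means_vec <- vapply(samples, mean, numeric(1))

# sapply() simplifies the results; range() returns two values per element,
# so the result here is a matrix with one column per element of the list
ranges <- sapply(samples, range)
```

Of the three, vapply() is usually the safest choice when the shape of the result is known in advance, because it stops with an error if any element returns something unexpected.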

Assessment Elements

Participation
Homework
Exam

Interim Assessment

Interim assessment (2nd module)
0.4 * Exam + 0.5 * Homework + 0.1 * Participation
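
As an illustration of how the formula above combines the three elements, here is a short sketch in R. The marks are hypothetical, and the 10-point scale is an assumption; the syllabus states neither the grading scale nor any rounding rule.

```r
# Weights from the interim assessment formula
weights <- c(exam = 0.4, homework = 0.5, participation = 0.1)

# Hypothetical marks, assuming a 10-point scale (illustrative only)
marks <- c(exam = 7, homework = 8, participation = 10)

# Weighted sum; indexing by name keeps marks and weights aligned
final_mark <- sum(weights * marks[names(weights)])
print(final_mark)
```

With these marks the result is 0.4 * 7 + 0.5 * 8 + 0.1 * 10 = 7.8.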

Bibliography

Recommended Core Bibliography

Adler, J. (2012). R in a Nutshell: A Desktop Quick Reference (2nd ed.). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=488356
Teetor, P., & Loukides, M. K. (2011). R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics (1st ed.). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=414829

Recommended Additional Bibliography

Han Lin Shang. (2012). Graphics for statistics and data analysis with R. Journal of Applied Statistics, (8), 1843. https://doi.org/10.1080/02664763.2012.679355
Michael Friendly. (2011). Graphics for Statistics and Data Analysis with R by KEEN, K. J. Biometrics, (3), 1177. https://doi.org/10.1111/j.1541-0420.2011.01658_1.x
R. Allan Reese. (2018). Graphics for Statistics and Data Analysis with R, 2nd edn. Journal of the Royal Statistical Society Series A, (4), 1261. https://doi.org/10.1111/rssa.12399
