- The course reviews major data analysis techniques and aims to develop practical skills in data acquisition, processing, and interpretation
- Be able to install R and write programs in the R language
- Be able to collect and pre-process raw data using methods of descriptive statistics
- Be able to compute and interpret point and interval estimates. Be able to perform hypothesis tests.
- Be able to create and interpret linear regression models
- Be able to use dimension reduction methods
- Be able to use classification methods
- Introduction to R. Data objects in R: data frames, matrices and arrays, factors, lists. Indexing of data frames, conditional selection. Installing and using packages. Loading data from local files and online databases. Handling missing values. The R environment: session management, the graphics subsystem. Plotting data in R. Time series objects in R. Overview of basic statistical functions in R. Major programming constructs: conditional operators, loops, functions.
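A minimal sketch of the data-frame operations listed above, using a small toy data frame (the column names and values are illustrative, not from the course materials):

```r
# Toy data frame illustrating indexing, conditional selection,
# and missing-value handling (values are hypothetical).
df <- data.frame(
  name  = c("a", "b", "c", "d"),
  score = c(3.5, NA, 7.2, 5.1)
)

df[2, "score"]         # single element by row and column name
df[df$score > 4, ]     # conditional selection; the NA row propagates
subset(df, score > 4)  # subset() silently drops the NA row

mean(df$score)               # NA unless missing values are removed
mean(df$score, na.rm = TRUE) # mean of the observed values

complete.cases(df)  # logical vector flagging rows without NAs
na.omit(df)         # drop incomplete rows
```

Note the difference between bracket indexing and `subset()` when missing values are present: a logical index containing `NA` keeps an all-`NA` row, which is a common source of confusion when pre-processing raw data.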
- Descriptive Statistics. Types of data. Graphical presentation of univariate data: bar chart, histogram, stem-and-leaf display. Kernel density estimators. Measures of central tendency: mean, median. Assessing the shape of a distribution. Measures of variability: range, sample standard deviation, interquartile range. Quartiles and quantiles. Numerical summaries of the data in R. Boxplots. Bivariate data: side-by-side boxplots, quantile-quantile (Q-Q) plots, scatter plots. Pearson’s and Spearman’s coefficients of correlation. Kendall’s τ measure. Bivariate boxplots. The convex hull of bivariate data. Bubble and glyph plots for multivariate data. Scatter plot matrix.
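The numerical summaries and correlation measures above can be sketched in base R on simulated bivariate data (the seed and distribution parameters are arbitrary choices for illustration):

```r
# Simulated bivariate data: y depends linearly on x plus noise.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)

mean(x); median(x)                       # central tendency
sd(x); IQR(x); range(x)                  # variability
quantile(x, probs = c(0.25, 0.5, 0.75))  # quartiles

cor(x, y)                       # Pearson's r
cor(x, y, method = "spearman")  # Spearman's rho
cor(x, y, method = "kendall")   # Kendall's tau

# Graphical summaries: hist(x); boxplot(x, y); plot(x, y)
```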
- Inferential Statistics. Interval estimates. The idea behind confidence intervals (CI). CI for a parameter of a single population (mean, proportion, variance). CI for the difference between means and proportions. Hypothesis testing. The general concept of hypothesis testing. Type I and Type II errors. Test statistic. Decision rules: rejection regions, p-values. One-sample tests: z- and t-test for a single mean, test for a single proportion. Two-sample tests: test for the difference between means and proportions. Paired samples in hypothesis testing. Wilcoxon rank-sum test for two independent samples. Chi-squared test for association (independence) between categorical variables. The chi-squared test for homogeneity. Goodness-of-fit tests for continuous distributions. Kolmogorov-Smirnov test and the Shapiro-Wilk test for normality.
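The tests listed above all have one-line counterparts in base R. A minimal sketch on simulated samples (means, sample sizes, and table counts are arbitrary illustrative choices):

```r
# Two simulated samples with slightly different means.
set.seed(2)
a <- rnorm(30, mean = 5)
b <- rnorm(30, mean = 5.5)

t.test(a, mu = 5)            # one-sample t-test (also reports a CI)
t.test(a, b)                 # Welch two-sample t-test
t.test(a, b, paired = TRUE)  # paired-sample version
wilcox.test(a, b)            # Wilcoxon rank-sum test
prop.test(42, 100, p = 0.5)  # test for a single proportion

shapiro.test(a)                # Shapiro-Wilk test for normality
ks.test(a, "pnorm", mean = 5)  # Kolmogorov-Smirnov goodness-of-fit

# Chi-squared test of association on a hypothetical 2x2 table
chisq.test(matrix(c(30, 20, 15, 35), nrow = 2))
```

Each call returns an `htest` object; the p-value and confidence interval are in its `$p.value` and `$conf.int` components.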
- Linear Regression. The simple linear regression model. Estimating the parameters in simple linear regression. Linear regression in R. Statistical inference for simple linear regression. R-squared and adjusted R-squared coefficients. Model diagnostics: assessing normality of the residuals. The model goodness-of-fit statistics: F-test. t-test for model coefficients, confidence intervals for model coefficients. Prediction intervals. Confidence intervals for the correlation coefficient. Introduction to multiple linear regression. Multiple regression assumptions, diagnostics, and efficacy measures. Fitting the multiple regression model in R. Interpreting the regression parameters. Model selection: partial F-test, the Akaike information criterion. Problems with many explanatory variables: multicollinearity. Remedies for multicollinearity. Confusion between explanatory and response variables.
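A sketch of this workflow on the built-in `cars` and `mtcars` datasets (the choice of datasets and predictors is illustrative, not prescribed by the course):

```r
# Simple linear regression: stopping distance vs. speed.
fit <- lm(dist ~ speed, data = cars)
summary(fit)   # coefficient t-tests, R-squared, overall F-test
confint(fit)   # confidence intervals for the coefficients
predict(fit, newdata = data.frame(speed = 15),
        interval = "prediction")  # prediction interval

# Multiple regression and nested-model selection.
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp, data = mtcars)
anova(small, large)  # partial F-test for the extra predictor
AIC(small, large)    # Akaike information criterion

# plot(fit) draws diagnostic plots, including a Q-Q plot
# of the residuals for assessing normality.
```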
- Dimension Reduction Methods. Geometrical view of data. Optimal projection onto a low-dimensional space. Principal component analysis (PCA). Data for PCA. Different approaches to PCA. PCA through the singular value decomposition. PCA via diagonalization of the covariance matrix. Coordinates of individuals and variables in the reduced basis. The transition formulae. Quality of projection and contributions of individuals to the construction of the new dimensions. Interpretation of PCA output. Simultaneous analysis of individuals and variables. PCA implementation in R. Correspondence analysis (CA). Data for CA, the χ² metric. Profiles of rows and columns. CA implementation in R. Quality of projection and individual contributions to the new dimensions. Link between row and column representations. Multiple CA. Indicator matrix. Multidimensional scaling (MDS). Dissimilarity matrices. Goals of multidimensional scaling. Computing dissimilarities: Euclidean and non-Euclidean distances. Classical multidimensional scaling. Metric and non-metric MDS. Goodness-of-fit measures for metric MDS. Shepard’s diagrams. Distance scaling. Issues of non-metric MDS. Interpretation of the MDS analysis. Embedding external variables.
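PCA via the singular value decomposition and classical MDS are both available in base R; a minimal sketch on the built-in `USArrests` data (a conventional example dataset, not one mandated by the course):

```r
# PCA on standardized data; prcomp() uses the SVD internally.
pca <- prcomp(USArrests, scale. = TRUE)
summary(pca)   # proportion of variance per component
pca$rotation   # variable loadings in the reduced basis
head(pca$x)    # coordinates of individuals (principal components)
biplot(pca)    # simultaneous display of individuals and variables

# Classical (metric) MDS from a Euclidean dissimilarity matrix.
d   <- dist(scale(USArrests))
mds <- cmdscale(d, k = 2, eig = TRUE)
head(mds$points)  # 2-D configuration of the individuals
```

For Euclidean distances on standardized data, the classical MDS configuration reproduces the PCA scores up to sign and rotation, which makes `USArrests` a convenient dataset for comparing the two methods.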
- Classification Methods. Logistic regression. Estimating the regression coefficients and making predictions. Logistic regression with several variables. Case-control sampling and logistic regression. Logistic regression with more than two classes. Linear discriminant analysis (LDA). Using Bayes’ theorem for classification. Discriminant functions. Fisher’s discriminant plots. Advantages and downsides of LDA. Naive Bayes approach. Quadratic discriminant analysis. K-nearest neighbor algorithm. Clustering algorithms. Distances between clusters (linkage). Agglomerative hierarchical clustering (AHC). Constructing an indexed hierarchy. Ward’s algorithm. Quality of partition. Agglomeration according to inertia. Properties of the agglomeration criterion. Impact of different linkage types on the performance of AHC. Direct search for partitions: K-means and K-medoids approaches. Probabilistic clustering: Gaussian mixture model (GMM). Expectation maximization algorithm. Clustering and principal component methods.
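A sketch of logistic regression, LDA, and the two clustering families named above, using base R plus the recommended MASS package on built-in datasets (the particular datasets and predictors are illustrative assumptions):

```r
# Logistic regression: transmission type (am) from weight and power.
logit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
head(predict(logit, type = "response"))  # fitted class probabilities

# Linear discriminant analysis on the iris data (3 classes).
library(MASS)
ld <- lda(Species ~ ., data = iris)
mean(predict(ld)$class == iris$Species)  # training accuracy

# Agglomerative hierarchical clustering with Ward's criterion,
# and a direct search for a partition via K-means.
d      <- dist(scale(USArrests))
hc     <- hclust(d, method = "ward.D2")
groups <- cutree(hc, k = 4)  # cut the indexed hierarchy at 4 clusters
km     <- kmeans(scale(USArrests), centers = 4, nstart = 25)
```

Comparing `groups` with `km$cluster` (e.g. via `table(groups, km$cluster)`) illustrates how the agglomerative and direct-search partitions can differ on the same data.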
- Interim assessment (1 module): 0.4 × Final exam + 0.36 × Home assignments + 0.24 × In-class test
- Husson, F., Lê, S., & Pagès, J. (2017). Exploratory Multivariate Analysis by Example Using R (2nd ed.). Boca Raton: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1516055
- Mailund, T. (2017). Beginning Data Science in R: Data Analysis, Visualization, and Modelling for the Data Scientist. New York: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1484645
- Shmueli, G., Bruce, P. C., Yahav, I., Patel, N. R., & Lichtendahl, K. C. (2017). Data Mining for Business Analytics: Concepts, Techniques, and Applications in R. Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1585613