Master
2019/2020

## Data Analysis

Type:
Bridging course (Big Data Systems)

Area of studies:
Business Informatics

Delivered by:
Department of Information Systems and Digital Infrastructure Management

When:
1 year, 1 module

Mode of studies:
offline

Instructors:
Sergey Petropavlovsky

Master’s programme:
Big Data Systems

Language:
English

ECTS credits:
3

### Course Syllabus

#### Abstract

"Data Analysis" is taught in the first module of the Master’s programme "Big Data Systems". The course refreshes basic statistical and data analysis techniques and thereby prepares students for the more advanced courses that follow. The first part reviews basic methods of descriptive and inferential statistics: data visualization, point and interval estimation, hypothesis testing, linear regression (simple and multiple) and analysis of variance. The second part covers selected modern approaches: dimension reduction methods (principal component analysis and its extensions), classification algorithms (cluster analysis, linear discriminant analysis, logistic regression) and the foundations of statistical learning. The emphasis is on practical implementation of the algorithms, so a brief theoretical treatment of each topic is accompanied by multiple examples. Students are expected to use primarily the R language for analysis throughout the course, although other tools are not excluded, so the course begins with a brief introduction to R. The course lasts one module, is taught in English and is worth 3 ECTS credits.

#### Learning Objectives

- The course reviews major data analysis techniques and aims to develop practical skills in data acquisition, processing and interpretation.

#### Expected Learning Outcomes

- Be able to install R and write programs in the R programming language
- Be able to collect and pre-process raw data using methods of descriptive statistics
- Be able to compute and interpret point and interval estimates and to perform hypothesis tests
- Be able to create and interpret linear regression models
- Be able to use dimension reduction methods
- Be able to use classification methods

#### Course Contents

- Introduction to R. Data objects in R: data frames, matrices and arrays, factors, lists. Indexing of data frames, conditional selection. Installing and using packages. Loading data from local files and online databases. Handling missing values. The R environment: session management, the graphics subsystem. Plotting data in R. Time series objects in R. Overview of basic statistical functions in R. Major programming constructs: conditional operators, loops, functions.
- Descriptive Statistics. Types of data. Graphical presentation of univariate data: bar chart, histogram, stem-and-leaf display. Kernel density estimators. Measures of central tendency: mean, median. Assessing the shape of a distribution. Measures of variability: range, sample standard deviation, interquartile range. Quartiles and quantiles. Numerical summaries of data in R. Boxplots. Bivariate data: side-by-side boxplots, quantile-quantile (Q-Q) plots, scatter plots. Pearson’s and Spearman’s coefficients of correlation. Kendall’s τ measure. Bivariate boxplots. The convex hull of bivariate data. Bubble and glyph plots for multivariate data. Scatter plot matrix.
- Inferential Statistics. Interval estimates. The idea behind confidence intervals (CI). CI for a parameter of a single population (mean, proportion, variance). CI for the difference between means and between proportions. Hypothesis testing. The general concept of hypothesis testing. Type I and Type II errors. Test statistic. Decision rules: rejection regions, p-values. One-sample tests: z- and t-test for a single mean, test for a single proportion. Two-sample tests: tests for the difference between means and between proportions. Paired samples in hypothesis testing. Wilcoxon rank-sum test for two independent samples. Chi-squared goodness-of-fit test for association between categorical variables. The chi-squared test for homogeneity. Goodness-of-fit tests for continuous distributions. Kolmogorov-Smirnov test and the Shapiro-Wilk test for normality.
- Linear Regression. The simple linear regression model. Estimating the parameters in simple linear regression. Linear regression in R. Statistical inference for simple linear regression. R-squared and adjusted R-squared coefficients. Model diagnostics: assessing normality of the residuals. The model goodness-of-fit statistics: F-test. t-test for model coefficients, confidence intervals for model coefficients. Prediction intervals. Confidence intervals for the correlation coefficient. Introduction to multiple linear regression. Multiple regression assumptions, diagnostics, and efficacy measures. Fitting the multiple regression model in R. Interpreting the regression parameters. Model selection: partial F-test, the Akaike information criterion. Problems with many explanatory variables: multicollinearity. Remedies for multicollinearity. Confusion between explanatory and response variables.
- Dimension Reduction Methods. Geometrical view of data. Optimal projection onto a low-dimensional space. Principal component analysis (PCA). Data for PCA. Different approaches to PCA. PCA through the singular value decomposition. PCA via diagonalization of the covariance matrix. Coordinates of individuals and variables in the reduced basis. The transition formulae. Quality of projection and individual contributions to the construction of new dimensions. Interpretation of PCA output. Simultaneous analysis of individuals and variables. PCA implementation in R. Correspondence analysis (CA). Data for CA, chi-squared metric. Profiles of rows and columns. CA implementation in R. Quality of projection and individual contributions to the construction of new dimensions. Link between row and column representations. Multiple CA. Indicator matrix. Multidimensional scaling (MDS). Dissimilarity matrices. Goals of multidimensional scaling. Computing dissimilarities: Euclidean and non-Euclidean distances. Classical multidimensional scaling. Metric and non-metric MDS. Goodness-of-fit measures for metric MDS. Shepard’s diagrams. Distance scaling. Issues of non-metric MDS. Interpretation of the MDS analysis. Embedding external variables.
- Classification Methods. Logistic regression. Estimating the regression coefficients and making predictions. Logistic regression with several variables. Case-control sampling and logistic regression. Logistic regression with more than two classes. Linear discriminant analysis (LDA). Using Bayes’ theorem for classification. Discriminant functions. Fisher’s discriminant plots. Advantages and downsides of LDA. Naive Bayes approach. Quadratic discriminant analysis. K-nearest neighbors algorithm. Cluster algorithms. Distances between clusters (linkage). Agglomerative hierarchical clustering (AHC). Constructing an indexed hierarchy. Ward’s algorithm. Quality of partition. Agglomeration according to inertia. Properties of the agglomeration criterion. Impact of the linkage type on the performance of AHC. Direct search for partitions: K-means and K-medoids approaches. Probabilistic clustering: Gaussian mixture model (GMM). Expectation-maximization algorithm. Clustering and principal component methods.

#### Interim Assessment

- Interim assessment (1 module): 0.4 * Final exam + 0.36 * Home assignments + 0.24 * In-class test
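
As a rough illustration of how the weights combine, the following minimal sketch computes the interim grade from the three components. It is written in Python purely for illustration (the course itself works in R). The function name `interim_grade`, the dictionary keys, the example scores and the 10-point scale are all assumptions that do not come from the syllabus, and any rounding rule is left out because the syllabus does not state one.

```python
# Weights taken from the interim assessment formula above.
WEIGHTS = {
    "final_exam": 0.40,
    "home_assignments": 0.36,
    "in_class_test": 0.24,
}


def interim_grade(scores):
    """Return the weighted sum of the component scores (hypothetical helper)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores on an assumed 10-point scale.
example_scores = {"final_exam": 8, "home_assignments": 7, "in_class_test": 6}
example_grade = interim_grade(example_scores)

# 0.4 * 8 + 0.36 * 7 + 0.24 * 6 = 3.2 + 2.52 + 1.44 = 7.16
print(round(example_grade, 2))
```

The three weights sum to 1, so the result stays on the same scale as the component scores.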

#### Bibliography

#### Recommended Core Bibliography

- Husson, F., Lê, S., & Pagès, J. (2017). Exploratory Multivariate Analysis by Example Using R (2nd ed.). Boca Raton: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1516055

#### Recommended Additional Bibliography

- Mailund, T. (2017). Beginning Data Science in R: Data Analysis, Visualization, and Modelling for the Data Scientist. New York: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1484645
- Shmueli, G., Bruce, P. C., Yahav, I., Patel, N. R., & Lichtendahl, K. C. (2017). Data Mining for Business Analytics: Concepts, Techniques, and Applications in R. Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1585613