Bachelor's Programme 2019/2020

Multidimensional Data Analysis

Status: Elective course (Business Informatics)
Area of studies: 38.03.05 Business Informatics
Delivered by: Department of Information Systems Management and Digital Infrastructure
When: 3rd year, modules 1-2
Language: English
Credits: 4

Course Syllabus

Abstract

"Multidimensional Data Analysis" is an elective course designed for 3rd-year Bachelor students. Nowadays most raw data are multidimensional, which necessitates efficient tools for handling such data. The major challenge here is the so-called "curse of dimensionality", i.e., the rapid growth of the effort required for data manipulation as the number of dimensions increases. The course presents a detailed exposition of basic as well as more advanced methods for dealing with such data. Starting with descriptive techniques (methods of summarizing and visualizing multivariate data), we quickly switch to dimension reduction methods and consider them in detail. Of key importance are principal component analysis and its extensions (correspondence analysis, multidimensional scaling, principal component regression, etc.). Another group of methods facilitating multivariate data processing is cluster analysis. Here we review some basic algorithms (hierarchical, k-means) and end with probabilistic clustering. A considerable part of the course is devoted to Markov Chain Monte Carlo methods. This technique is very well suited to multivariate data, has become an industry standard, and is a must for data scientists. A related topic is dynamic linear models, among which the Kalman filter is the most prominent approach. After that, we consider multivariate econometrics, i.e., multidimensional time series. Cointegration and its use in applications is of primary interest here. A separate section of the course is devoted to different aspects of model tuning and validation in the context of multivariate data. This part overlaps partially with standard machine learning courses and focuses on selecting and regularizing multivariate linear statistical models using cross-validation, bootstrapping, ridge and Lasso regression, etc.
We also discuss some non-linear models for classification and regression, such as multivariate adaptive regression splines and generalized additive models. If time permits, some other machine learning algorithms will be introduced and discussed, again in the context of multivariate data. The course has a practical bias, so the theoretical concepts are illustrated by applications in computer science, engineering, economics, finance, marketing, social sciences, etc. Students are expected to use the R programming language (though not exclusively) for implementing the algorithms throughout the course, so a brief introduction to R is given at the beginning. The course is taught in English.
Learning Objectives

  • The course provides a theoretical background in multivariate data analysis and aims at developing practical skills in data mining, processing, and interpretation
Expected Learning Outcomes

  • Be able to use basic constructs of the R programming language
  • Be able to load, process, visualize and interpret the multivariate data
  • Be able to apply methods of dimensionality reduction for efficient data processing
  • Be able to implement methods of cluster analysis and interpret the results
  • Understand the principle of the Bayesian approach to statistics. Be able to use the Markov Chain Monte Carlo methods in the Bayesian framework
  • Be able to use the regularized versions of the regression model, regression splines and generalized additive models in data analysis. Understand the principles of bootstrapping and cross-validation
  • Understand the principles of machine learning (bias-variance trade-off, overfitting, etc.). Be able to assess the performance of machine learning algorithms.
Course Contents

  • Introduction to R
    Data objects in R, installing and using packages. Loading data from local files and on-line databases. Plotting data in R. Advanced graphics. Time series objects. Overview of basic statistics in R. Major programming constructs: conditional operators, loops, functions.
  • Multivariate Data Handling and Visualization
    Multivariate normal distribution. Testing multivariate normality (chi^2 QQ-plots). Scatter plots, imposing marginal distributions. Bivariate boxplots. The convex hull of the bivariate data. Removing outliers. The bubble and glyph plots and their interpretation. Analysis of the scatter plot matrix.
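The chi^2 QQ-plot idea above rests on one fact: for p-dimensional normal data, the squared Mahalanobis distances of the observations from their mean follow a chi^2 distribution with p degrees of freedom. The course uses R, but the mechanics are language-agnostic; a minimal NumPy sketch (simulated data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated 2-D normal data; for truly multivariate-normal data the squared
# Mahalanobis distances behave like chi^2 with p = 2 degrees of freedom.
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, 0.5], [0.5, 2.0]], size=500)

mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
# Squared Mahalanobis distance of each observation from the centre
d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)

# A chi^2 QQ-plot would pair the sorted d2 against chi^2_2 quantiles;
# as a sanity check, their mean is (n-1)*p/n, close to p = 2 here.
print(round(d2.mean(), 1))  # → 2.0
```

With the sample mean and sample covariance plugged in, the sum of the d2 equals (n-1)p exactly, which is why the printed mean is 2.0 regardless of the draw.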
  • Dimension Reduction Methods
    Principal component analysis (PCA). Geometrical view on data. Cloud of individuals and cloud of variables. Rotating the frame and optimal projecting. PCA through diagonalization of the covariance matrix. PCA through the singular value decomposition. Coordinates of individuals and variables in the reduced basis. Quality of projecting. Interpretation. Simultaneous analysis of individuals and variables. Demonstrations in R. Correspondence analysis (CA). Data for the CA. chi^2 tests for association between categorical variables. Geometrical view: chi^2 metric. Row and column profiles. Implementation of the CA. Quality of dimension reduction. Link between row and column representations. Demonstrations in R. Multiple CA (MCA). Data for the MCA. Indicator matrix. Distances between individuals and categories. Implementation of the MCA. Numerical indicators of representation quality. Demonstrations in R. Multidimensional scaling (MDS). Data for the MDS: dissimilarity matrices. Goals of multidimensional scaling. Computing dissimilarities: Euclidean versus non-Euclidean distances. Classical multidimensional scaling. Metric and non-metric MDS. Goodness-of-fit measures for the metric MDS. Shepard’s diagrams. Distance scaling. Issues of the non-metric MDS. Interpretation of the MDS analysis. Embedding external variables.
  • Cluster Algorithms
    Cluster algorithms. Distances between clusters of observations (linkage). Agglomerative hierarchical clustering (AHC). Constructing an indexed hierarchy. Ward’s algorithm. Quality of partition. Agglomeration according to inertia. Properties of the agglomeration criterion. Impact of different linkage types on the performance of the AHC. Direct search for partitions. K-means and K-medoids approaches. Probabilistic clustering. Gaussian mixture model (GMM). Expectation maximization algorithm. Clustering and principal component methods.
  • Markov Chain Monte Carlo Methods
    Goals of Markov Chain Monte Carlo (MCMC). Markov processes. Properties of Markov chains (finiteness, aperiodicity, irreducibility, ergodicity, mixing, etc.). The stationary state of the chain. Monte Carlo simulations of distributions. Inverse CDF method. Rejection sampling. The Gibbs sampler. The Metropolis-Hastings algorithm. Issues in chain efficiency. MCMC implementation in R and examples. Applications of MCMC: modeling the S&P 500 index.
  • Cross Validation and Model Selection
    Cross validation and bootstrapping. The idea and applications. The validation set approach. Leave-One-Out cross validation. k-fold cross validation. Bias-variance trade-off for k-fold cross validation. Cross validation on classification problems. Bootstrapping. Linear model selection and regularization. Best subset selection. Stepwise selection. Choosing the optimal model. Shrinkage methods: ridge regression, the Lasso, selecting the tuning parameter. Dimension reduction method in regression. Principal components regression, partial least squares. Regression splines. Piecewise polynomials. Constraints and splines. The spline basis representation. Choosing the number and locations of knots. Comparison to polynomial regression. Smoothing splines. Choosing the smoothing parameter. Generalized additive models. GAMs for regression and classification problems.
  • Machine Learning Algorithms in Practice
    Types of machine learning algorithms. The limits of machine learning. Classification using Nearest Neighbors algorithm: measuring similarity with distance, choosing an appropriate number of neighbors, preparing data for use with k-NN. Examples of k-NN algorithm. Probabilistic learning using naïve Bayes approach: the basic idea, the Laplace estimator, numerical features of the naïve Bayes approach. Examples (filtering out the spam). Classification using decision trees and rules. Advantages and disadvantages of trees. Tree-based classification and regression. Application area of tree-based methods. Trees versus linear models. Divide and conquer algorithm. The 1R algorithm. The RIPPER algorithm. Boosting the accuracy of decision trees, pruning the trees. Bagging classification. Random forests. The Gini index. Fitting the decision trees. Black box methods. Neural networks. Activation functions. Network topology. Training a model on the data. Evaluating and improving model performance. Support vector machines. Classification with hyperplanes (linearly and non-linearly separable data). Using kernels for non-linear spaces.
Assessment Elements

  • non-blocking Home assignments
  • non-blocking In-class tests
  • non-blocking Attendance and in-class activity
  • non-blocking Final exam
Interim Assessment

  • Interim assessment (2 module)
    0.03 * Attendance and in-class activity + 0.4 * Final exam + 0.21 * Home assignments + 0.36 * In-class tests
Bibliography

Recommended Core Bibliography

  • Husson, F., Lê, S., & Pagès, J. (2017). Exploratory Multivariate Analysis by Example Using R (Vol. Second edition). Boca Raton: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1516055

Recommended Additional Bibliography

  • Mailund, T. (2017). Beginning Data Science in R : Data Analysis, Visualization, and Modelling for the Data Scientist. New York: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1484645
  • Shmueli, G., Bruce, P. C., Gedeck, P., & Patel, N. R. (2020). Data Mining for Business Analytics : Concepts, Techniques and Applications in Python. Newark: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=2273611
  • Shmueli, G., Bruce, P. C., Yahav, I., Patel, N. R., & Lichtendahl, K. C. (2017). Data Mining for Business Analytics : Concepts, Techniques, and Applications in R. Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1585613