Bachelor's Programme 2019/2020

Multidimensional Data Analysis

Status: Elective course (Business Informatics)
Area of studies: 38.03.05 Business Informatics
Delivered by: Department of Information Systems Management and Digital Infrastructure
When: 3rd year, modules 1-2
Language: English
Credits: 4

Course Syllabus

Abstract

"Multidimensional Data Analysis" is an elective course designed for 3rd-year Bachelor students. Nowadays most raw data are multidimensional, which necessitates efficient tools for handling such data. The major challenge here is the so-called "curse of dimensionality", i.e., the rapid growth of the effort required for data manipulation as the number of dimensions increases. The course presents a detailed exposition of basic as well as more advanced methods for dealing with such data. Starting with descriptive techniques (methods of summarizing and visualizing multivariate data), we quickly switch to dimension reduction methods and consider them in detail. Of key importance are principal component analysis and its extensions (correspondence analysis, multidimensional scaling, principal component regression, etc.). Another group of methods facilitating multivariate data processing is cluster analysis. Here we review some basic algorithms (hierarchical, k-means) and end with probabilistic clustering. A considerable part of the course is devoted to Markov Chain Monte Carlo methods. This technique is very well suited to multivariate data, has become an industry standard, and is a must for data scientists. A related topic is dynamic linear models, among which the Kalman filter is the most prominent approach. After that, we consider multivariate econometrics, i.e., multidimensional time series. Cointegration and its use in applications is of primary interest here. A separate section of the course is devoted to different aspects of model tuning and validation in the context of multivariate data. This part overlaps partially with standard machine learning courses and focuses on selecting and regularizing multivariate linear statistical models using cross-validation, bootstrapping, ridge and Lasso regression, etc.
We also discuss some non-linear models for classification and regression, such as multivariate adaptive regression splines and generalized additive models. If time permits, some other machine learning algorithms will be introduced and discussed, again in the context of multivariate data. The course has a practical bias, so the theoretical concepts are illustrated by applications in computer science, engineering, economics, finance, marketing, social sciences, etc. Students are expected to use the R programming language (though not exclusively) for implementing the algorithms throughout the course, so a brief introduction to R is given at the beginning. The course is taught in English.
Learning Objectives

  • The course provides a theoretical background in multivariate data analysis and aims at developing practical skills in data mining, processing, and interpretation
Expected Learning Outcomes

  • Be able to use basic constructs of the R programming language
  • Be able to load, process, visualize and interpret the multivariate data
  • Be able to apply methods of dimensionality reduction for efficient data processing
  • Be able to implement methods of cluster analysis and interpret the results
  • Understand the principle of the Bayesian approach to statistics. Be able to use the Markov Chain Monte Carlo methods in the Bayesian framework
  • Be able to use the regularized versions of the regression model, regression splines and generalized additive models in data analysis. Understand the principles of bootstrapping and cross-validation
  • Understand the principles of machine learning (bias-variance trade-off, overfitting, etc.). Be able to assess the performance of machine learning algorithms.
Course Contents

  • Introduction to R
    Data objects in R, installing and using packages. Loading data from local files and on-line databases. Plotting data in R. Advanced graphics. Time series objects. Overview of basic statistics in R. Major programming constructs: conditional operators, loops, functions.
  • Multivariate Data Handling and Visualization
    Multivariate normal distribution. Testing multivariate normality (chi^2 QQ-plots). Scatter plots, imposing marginal distributions. Bivariate boxplots. The convex hull of the bivariate data. Removing outliers. The bubble and glyph plots and their interpretation. Analysis of the scatter plot matrix.
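The chi^2 QQ-plot idea above rests on one fact: for p-dimensional normal data, the squared Mahalanobis distances of the observations from their mean follow a chi^2 distribution with p degrees of freedom. The course uses R, but the mechanics are language-agnostic; a minimal NumPy sketch (simulated data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated 2-D normal data; for truly multivariate-normal data the squared
# Mahalanobis distances behave like chi^2 with p = 2 degrees of freedom.
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, 0.5], [0.5, 2.0]], size=500)

mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
# Squared Mahalanobis distance of each observation from the centre
d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)

# A chi^2 QQ-plot would pair the sorted d2 against chi^2_2 quantiles;
# as a sanity check, their mean is (n-1)*p/n, close to p = 2 here.
print(round(d2.mean(), 1))  # → 2.0
```

With the sample mean and sample covariance plugged in, the sum of the d2 equals (n-1)p exactly, which is why the printed mean is 2.0 regardless of the draw.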
  • Dimension Reduction Methods
    Principal component analysis (PCA). Geometrical view on data. Cloud of individuals and cloud of variables. Rotating the frame and optimal projecting. PCA through diagonalization of the covariance matrix. PCA through the singular value decomposition. Coordinates of individuals and variables in the reduced basis. Quality of projecting. Interpretation. Simultaneous analysis of individuals and variables. Demonstrations in R. Correspondence analysis (CA). Data for the CA. chi^2 tests for association between categorical variables. Geometrical view: chi^2 metric. Row and column profiles. Implementation of the CA. Quality of dimension reduction. Link between row and column representations. Demonstrations in R. Multiple CA (MCA). Data for the MCA. Indicator matrix. Distances between individuals and categories. Implementation of the MCA. Numerical indicators of representation quality. Demonstrations in R. Multidimensional scaling (MDS). Data for the MDS: dissimilarity matrices. Goals of multidimensional scaling. Computing dissimilarities: Euclidean versus non-Euclidean distances. Classical multidimensional scaling. Metric and non-metric MDS. Goodness-of-fit measures for the metric MDS. Shepard’s diagrams. Distance scaling. Issues of the non-metric MDS. Interpretation of the MDS analysis. Embedding external variables.
  • Cluster Algorithms
    Cluster algorithms. Distances between clusters of observations (linkage). Agglomerative hierarchical clustering (AHC). Constructing an indexed hierarchy. Ward’s algorithm. Quality of partition. Agglomeration according to inertia. Properties of the agglomeration criterion. Impact of different linkage types on the performance of the AHC. Direct search for partitions. K-means and K-medoids approaches. Probabilistic clustering. Gaussian mixture model (GMM). Expectation maximization algorithm. Clustering and principal component methods.
  • Markov Chain Monte Carlo Methods
    Goals of Markov Chain Monte Carlo (MCMC). Markov processes. Properties of Markov chains (finiteness, aperiodicity, irreducibility, ergodicity, mixing, etc.). The stationary state of the chain. Monte Carlo simulations of distributions. Inverse CDF method. Rejection sampling. The Gibbs sampler. The Metropolis-Hastings algorithm. Issues in chain efficiency. MCMC implementation in R and examples. Applications of MCMC: modeling the S&P 500 index.
  • Cross Validation and Model Selection
    Cross validation and bootstrapping. The idea and applications. The validation set approach. Leave-One-Out cross validation. k-fold cross validation. Bias-variance trade-off for k-fold cross validation. Cross validation on classification problems. Bootstrapping. Linear model selection and regularization. Best subset selection. Stepwise selection. Choosing the optimal model. Shrinkage methods: ridge regression, the Lasso, selecting the tuning parameter. Dimension reduction method in regression. Principal components regression, partial least squares. Regression splines. Piecewise polynomials. Constraints and splines. The spline basis representation. Choosing the number and locations of knots. Comparison to polynomial regression. Smoothing splines. Choosing the smoothing parameter. Generalized additive models. GAMs for regression and classification problems.
  • Machine Learning Algorithms in Practice
    Types of machine learning algorithms. The limits of machine learning. Classification using Nearest Neighbors algorithm: measuring similarity with distance, choosing an appropriate number of neighbors, preparing data for use with k-NN. Examples of k-NN algorithm. Probabilistic learning using naïve Bayes approach: the basic idea, the Laplace estimator, numerical features of the naïve Bayes approach. Examples (filtering out the spam). Classification using decision trees and rules. Advantages and disadvantages of trees. Tree-based classification and regression. Application area of tree-based methods. Trees versus linear models. Divide and conquer algorithm. The 1R algorithm. The RIPPER algorithm. Boosting the accuracy of decision trees, pruning the trees. Bagging classification. Random forests. The Gini index. Fitting the decision trees. Black box methods. Neural networks. Activation functions. Network topology. Training a model on the data. Evaluating and improving model performance. Support vector machines. Classification with hyperplanes (linearly and non-linearly separable data). Using kernels for non-linear spaces.
Assessment Elements

  • non-blocking Home assignments
  • non-blocking In-class tests
  • non-blocking Attendance and in-class activity
  • non-blocking Final exam
Interim Assessment

  • Interim assessment (2 module)
    0.03 * Attendance and in-class activity + 0.4 * Final exam + 0.21 * Home assignments + 0.36 * In-class tests
Bibliography

Recommended Core Bibliography

  • Husson, F., Lê, S., & Pagès, J. (2017). Exploratory Multivariate Analysis by Example Using R (Vol. Second edition). Boca Raton: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1516055

Recommended Additional Bibliography

  • Mailund, T. (2017). Beginning Data Science in R : Data Analysis, Visualization, and Modelling for the Data Scientist. New York: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1484645
  • Shmueli, G., Bruce, P. C., Gedeck, P., & Patel, N. R. (2020). Data Mining for Business Analytics : Concepts, Techniques and Applications in Python. Newark: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=2273611
  • Shmueli, G., Bruce, P. C., Yahav, I., Patel, N. R., & Lichtendahl, K. C. (2017). Data Mining for Business Analytics : Concepts, Techniques, and Applications in R. Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1585613