Multidimensional Data Analysis
- The course provides a theoretical background in multivariate data analysis and aims to develop practical skills in data mining, processing, and interpretation
- Be able to use basic constructs of the R programming language
- Be able to load, process, visualize and interpret multivariate data
- Be able to apply methods of dimensionality reduction for efficient data processing
- Be able to implement methods of cluster analysis and interpret the results
- Understand the principle of the Bayesian approach to statistics. Be able to use the Markov Chain Monte Carlo methods in the Bayesian framework
- Be able to use the regularized versions of the regression model, regression splines and generalized additive models in data analysis. Understand the principles of bootstrapping and cross-validation
- Understand the principles of machine learning (bias-variance trade-off, overfitting, etc.). Be able to assess the performance of machine learning algorithms.
- Introduction to R: Data objects in R, installing and using packages. Loading data from local files and online databases. Plotting data in R. Advanced graphics. Time series objects. Overview of basic statistics in R. Major programming constructs: conditional operators, loops, functions.
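The programming constructs listed above (conditionals, loops, functions) can be illustrated with a minimal base-R sketch; the function name `describe_column` is ours, chosen for illustration:

```r
# A small function combining a conditional with two kinds of summaries:
# numeric columns get mean/sd, categorical columns get a frequency table.
describe_column <- function(x) {
  if (is.numeric(x)) {
    c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
  } else {
    table(x)
  }
}

# A loop over the columns of a built-in data set:
for (name in names(iris)) {
  cat(name, ":\n")
  print(describe_column(iris[[name]]))
}
```

The same pattern applies to data loaded with `read.csv()` or similar functions once the file path is known.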
- Multivariate Data Handling and Visualization: Multivariate normal distribution. Testing multivariate normality (chi^2 QQ-plots). Scatter plots, imposing marginal distributions. Bivariate boxplots. The convex hull of bivariate data. Removing outliers. Bubble and glyph plots and their interpretation. Analysis of the scatter plot matrix.
- Dimension Reduction Methods: Principal component analysis (PCA). Geometrical view on data. Cloud of individuals and cloud of variables. Rotating the frame and optimal projecting. PCA through diagonalization of the covariance matrix. PCA through the singular value decomposition. Coordinates of individuals and variables in the reduced basis. Quality of projecting. Interpretation. Simultaneous analysis of individuals and variables. Demonstrations in R. Correspondence analysis (CA). Data for the CA. chi^2 tests for association between categorical variables. Geometrical view: chi^2 metric. Row and column profiles. Implementation of the CA. Quality of dimension reduction. Link between row and column representations. Demonstrations in R. Multiple CA (MCA). Data for the MCA. Indicator matrix. Distances between individuals and categories. Implementation of the MCA. Numerical indicators of quality of representation. Demonstrations in R. Multidimensional scaling (MDS). Data for the MDS: dissimilarity matrices. Goals of multidimensional scaling. Computing dissimilarities: Euclidean versus non-Euclidean distances. Classical multidimensional scaling. Metric and non-metric MDS. Goodness-of-fit measures for the metric MDS. Shepard's diagrams. Distance scaling. Issues of the non-metric MDS. Interpretation of the MDS analysis. Embedding external variables.
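The two routes to PCA listed above, diagonalizing the covariance matrix and taking the SVD of the centered data, give the same principal variances, which a short base-R check makes concrete (again using `iris` as a stand-in data set):

```r
# PCA two ways on centered data X (n x p):
X  <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
ev <- eigen(cov(X))   # diagonalization of the covariance matrix
sv <- svd(X)          # singular value decomposition of X

# The eigenvalues of cov(X) equal d_i^2 / (n - 1), where d_i are the
# singular values of X; the eigenvectors agree up to sign.
ev$values
sv$d^2 / (nrow(X) - 1)

# Coordinates of individuals in the reduced basis (principal scores):
scores <- X %*% ev$vectors
```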
- Cluster Algorithms: Distances between clusters of observations (linkage). Agglomerative hierarchical clustering (AHC). Constructing an indexed hierarchy. Ward's algorithm. Quality of partition. Agglomeration according to inertia. Properties of the agglomeration criterion. Impact of different linkage types on the performance of the AHC. Direct search for partitions. K-means and K-medoids approaches. Probabilistic clustering. Gaussian mixture model (GMM). Expectation maximization algorithm. Clustering and principal component methods.
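Both families of methods above, agglomerative hierarchical clustering with Ward's linkage and the direct search for partitions via K-means, are available in base R; a minimal sketch (the choice of three clusters for `iris` is illustrative):

```r
X <- scale(iris[, 1:4])                      # standardize variables first

# Agglomerative hierarchical clustering with Ward's criterion:
hc   <- hclust(dist(X), method = "ward.D2")
part <- cutree(hc, k = 3)                    # cut the dendrogram at 3 groups

# Direct search for a partition: K-means with multiple random restarts.
set.seed(1)
km <- kmeans(X, centers = 3, nstart = 25)

table(part, km$cluster)                      # compare the two partitions
```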
- Markov Chain Monte Carlo Methods: Goals of Markov Chain Monte Carlo (MCMC). Markov processes. Properties of Markov chains (finiteness, aperiodicity, irreducibility, ergodicity, mixing, etc.). The stationary state of the chain. Monte Carlo simulations of distributions. Inverse CDF method. Rejection sampling. The Gibbs sampler. The Metropolis-Hastings algorithm. Issues in chain efficacy. MCMC implementation in R and examples. Applications of MCMC: modeling the S&P 500 index.
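The Metropolis-Hastings algorithm listed above fits in a few lines of R; this sketch uses a symmetric random-walk proposal to sample from a standard normal target (the target and proposal scale are our illustrative choices):

```r
set.seed(1)
target <- function(x) dnorm(x)   # target density: standard normal

n <- 10000
chain <- numeric(n)
chain[1] <- 0
for (t in 2:n) {
  prop  <- chain[t - 1] + rnorm(1, sd = 1)   # symmetric proposal
  # Acceptance ratio; the proposal terms cancel for a symmetric kernel:
  alpha <- min(1, target(prop) / target(chain[t - 1]))
  chain[t] <- if (runif(1) < alpha) prop else chain[t - 1]
}

mean(chain); sd(chain)   # should be close to 0 and 1 for a well-mixed chain
```

In practice one discards an initial burn-in segment and checks mixing diagnostics before trusting the draws.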
- Cross Validation and Model Selection: Cross-validation and bootstrapping. The idea and applications. The validation set approach. Leave-one-out cross-validation. k-fold cross-validation. Bias-variance trade-off for k-fold cross-validation. Cross-validation on classification problems. Bootstrapping. Linear model selection and regularization. Best subset selection. Stepwise selection. Choosing the optimal model. Shrinkage methods: ridge regression, the Lasso, selecting the tuning parameter. Dimension reduction methods in regression. Principal components regression, partial least squares. Regression splines. Piecewise polynomials. Constraints and splines. The spline basis representation. Choosing the number and locations of knots. Comparison to polynomial regression. Smoothing splines. Choosing the smoothing parameter. Generalized additive models. GAMs for regression and classification problems.
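The k-fold cross-validation procedure above can be written out by hand in base R, which makes the mechanics explicit before turning to packaged helpers; the use of `mtcars` and a single predictor is purely illustrative:

```r
# 5-fold cross-validation estimate of test MSE for a linear model.
set.seed(42)
df <- data.frame(x = mtcars$hp, y = mtcars$mpg)
k  <- 5
# Randomly assign each observation to one of k folds:
folds <- sample(rep(1:k, length.out = nrow(df)))

mse <- sapply(1:k, function(i) {
  fit <- lm(y ~ x, data = df[folds != i, ])        # train on k-1 folds
  mean((df$y[folds == i] -
        predict(fit, df[folds == i, ]))^2)          # test on held-out fold
})
mean(mse)   # cross-validated estimate of prediction error
```

Setting `k` to the number of observations recovers leave-one-out cross-validation.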
- Machine Learning Algorithms in Practice: Types of machine learning algorithms. The limits of machine learning. Classification using the nearest neighbors algorithm: measuring similarity with distance, choosing an appropriate number of neighbors, preparing data for use with k-NN. Examples of the k-NN algorithm. Probabilistic learning using the naïve Bayes approach: the basic idea, the Laplace estimator, numerical features of the naïve Bayes approach. Examples (filtering out spam). Classification using decision trees and rules. Advantages and disadvantages of trees. Tree-based classification and regression. Application area of tree-based methods. Trees versus linear models. Divide-and-conquer algorithm. The 1R algorithm. The RIPPER algorithm. Boosting the accuracy of decision trees, pruning the trees. Bagging classification. Random forests. The Gini index. Fitting the decision trees. Black box methods. Neural networks. Activation functions. Network topology. Training a model on the data. Evaluating and improving model performance. Support vector machines. Classification with hyperplanes (linearly and non-linearly separable data). Using kernels for non-linear spaces.
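The k-NN workflow above (scale the data, pick k, classify by distance to labeled neighbors) can be sketched with the `class` package, which ships with standard R installations; the train/test split and k = 5 are illustrative choices:

```r
library(class)   # provides knn(); part of R's recommended packages

set.seed(7)
idx <- sample(nrow(iris), 100)               # 100 training, 50 test rows

# Scale the training data, then apply the SAME centering and scaling
# to the test data, since k-NN is sensitive to feature scales:
train <- scale(iris[idx, 1:4])
test  <- scale(iris[-idx, 1:4],
               center = attr(train, "scaled:center"),
               scale  = attr(train, "scaled:scale"))

pred <- knn(train, test, cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])             # accuracy on held-out data
```

Choosing `k` is itself a model-selection problem, typically handled with the cross-validation techniques from the previous topic.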
- Interim assessment (Module 2): 0.03 * Attendance and in-class activity + 0.4 * Final exam + 0.21 * Home assignments + 0.36 * In-class tests
- Husson, F., Lê, S., & Pagès, J. (2017). Exploratory Multivariate Analysis by Example Using R (2nd ed.). Boca Raton: Chapman and Hall/CRC. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1516055
- Mailund, T. (2017). Beginning Data Science in R : Data Analysis, Visualization, and Modelling for the Data Scientist. New York: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1484645
- Shmueli, G., Bruce, P. C., Gedeck, P., & Patel, N. R. (2020). Data Mining for Business Analytics : Concepts, Techniques and Applications in Python. Newark: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=2273611
- Shmueli, G., Bruce, P. C., Yahav, I., Patel, N. R., & Lichtendahl, K. C. (2017). Data Mining for Business Analytics : Concepts, Techniques, and Applications in R. Hoboken, New Jersey: Wiley. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1585613