Modern Methods of Data Analysis

Master 2019/2020

Type: Compulsory course (Data Science)

Area of studies: Applied Mathematics and Informatics

Delivered by: School of Data Analysis and Artificial Intelligence

Where: Faculty of Computer Science

When: Year 1, modules 1 and 2

Mode of studies: offline

Instructors: Boris Mirkin, Quentin Paris

Master’s programme: Data Science

Language: English

ECTS credits: 4

Contact hours: 56

Abstract

This is a course in basic methods for modern Data Analysis. Its contents are heavily influenced by the idea that data analysis should help in enhancing and augmenting knowledge of the domain as represented by the concepts and statements of relation between them. This view distinguishes the subject from related courses such as applied statistics, machine learning, data mining, etc. Two main pathways for data analysis are: (1) summarization, for developing and augmenting concepts, and (2) correlation, for enhancing and establishing relations. Visualization, in this context, is a way of presenting results in a cognitively comfortable way. The term summarization is understood quite broadly here to embrace not only simple summaries like totals and means, but also more complex summaries: the principal components of a set of features and cluster structures in a set of entities. Similarly, correlation here covers both bivariate and multivariate relations between input and target features including classification trees and Bayes classifiers. Another feature of the class is that its main thrust is to give an in-depth understanding of a few basic techniques rather than to cover a broad spectrum of approaches developed so far. Most of the described methods fall under the same least-squares paradigm for mapping an “idealized” structure to the data. This allows me to bring forward a number of mathematically derived relations between methods that are usually overlooked.

Learning Objectives

To give a student basic knowledge and competence in modern English language and style for technical discussions of data analysis and data mining problems on the international scene.
To provide a unified framework and system for capturing the mainstream of numerous data analysis approaches and methods developed so far.
To teach modern methods of data analysis, including cutting-edge techniques such as intelligent clustering, spectral clustering, consensus clustering, community detection, SVD and principal component analysis, bootstrapping for validation and comparison of averages, and evolutionary optimization.
To give hands-on experience in real-world data analysis.
To provide experience in using modern computational tools.

Expected Learning Outcomes

Students know methods and their theoretical underpinnings for matrix and probabilistic data models.
Students know methods and their theoretical underpinnings for mixed scales data, quantification, pre-processing, standardization.
Students know methods and their theoretical underpinnings for K-Means clustering, including rules for its initialization and interpretation.
Students know methods and their theoretical underpinnings for comparing cluster means with computational validation techniques such as bootstrapping.
Students know methods and their theoretical underpinnings for interpreting clusters in nominal scales, Quetelet indexes and Pearson’s Chi-squared.
Students know methods and their theoretical underpinnings for clustering similarity and network data including community detection, spectral clustering, and consensus clustering.
Students know methods and their theoretical underpinnings for hierarchical clustering.
Students know methods and their theoretical underpinnings for principal component analysis (PCA), SVD and data visualization.
Students know methods and their theoretical underpinnings for matrices of covariance and correlation indexes; conventional formulation for PCA.
Students know methods and their theoretical underpinnings for correlation and determinacy indexes from different perspectives.
Students know methods and their theoretical underpinnings for spectral clustering.

Course Contents

Intro: course contents and administration
Data table. Feature modeling. Feature as mapping. Probability feature model. Categorical data: probability and frequency. Conditional probability; independence; Bayes theorem. Continuous distribution and density function. Mean and variance. Random sample. Distribution of the sample mean. Central limit theorem.
Quantitative coding for mixed scales. Elements of matrix theory: linear subspaces and principal directions. Least-squares approximation. Full system of events: nominal feature. Bivariate distribution and contingency table. Popular distributions (Gaussian, Power law, Poisson, Bernoulli). Chi-squared distribution. Distribution of the sample variance.
K-Means clustering: method and properties
Clustering criterion and its reformulations. K-Means clustering as alternating minimization; nature-inspired algorithms for K-Means; partitioning around medoids (PAM); choosing the number of clusters; initialization of K-Means; anomalous patterns and Intelligent K-Means.
Cluster interpretation: comparison of means, bootstrap for confidence intervals
Cluster interpretation aids.
Cluster interpretation for categorical features: Pearson chi-squared, Quetelet indexes
Principal component analysis (PCA), Singular value decomposition (SVD), using PCA for data visualization
PCA: covariance and correlation matrices, meaning and properties of correlation coefficient in three perspectives; conventional formulation of PCA
Clustering similarity and network data; k-means converted criterion and algorithms
Consensus clustering; two criteria; reduction to network clustering
Spectral clustering
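
The contents above describe K-Means as alternating minimization of a least-squares clustering criterion. A minimal sketch of that idea follows. It assumes squared Euclidean distance and random initial centers drawn from the data; the initialization rules covered in the course (such as anomalous patterns) are not shown. The names `squared_distance` and `k_means` are illustrative only and are not taken from the course materials.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Alternate two steps until the assignment stops changing.

    Assumption: initial centers are k distinct data points chosen at random.
    Returns (labels, centers, criterion), where criterion is the summed
    squared distance from each point to the center of its cluster.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 1: with centers fixed, assign each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the criterion cannot decrease further
        labels = new_labels
        # Step 2: with assignments fixed, move each center to its cluster mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    criterion = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, criterion


# Toy data (made up for illustration): two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, criterion = k_means(points, k=2)
print(labels, centers, criterion)
```

Each step can only lower the criterion or leave it unchanged, which is why the loop terminates; the result is a local minimum that in general depends on the initial centers.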

Assessment Elements

Home Project
Exam

Interim Assessment

Interim assessment (2 module)
0.7 * Exam + 0.3 * Home Project
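
As a worked illustration of the formula above, the weighted sum can be computed as follows. The scores used here are hypothetical, and the grading scale and any rounding rule are not stated in this syllabus, so the result is left unrounded. The function name `interim_grade` is illustrative only.

```python
def interim_grade(exam, home_project):
    """Weighted sum from the syllabus: 0.7 * Exam + 0.3 * Home Project."""
    return 0.7 * exam + 0.3 * home_project


# Hypothetical scores: 8 for the exam and 6 for the home project.
grade = interim_grade(exam=8, home_project=6)
print(round(grade, 2))  # approximately 7.4 (0.7 * 8 = 5.6 plus 0.3 * 6 = 1.8)
```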

Bibliography

Recommended Core Bibliography

Mirkin, B. Core concepts in data analysis: summarization, correlation and visualization. – Springer Science & Business Media, 2011. – 388 pp.

Recommended Additional Bibliography

Grünwald, P. D. The minimum description length principle. – MIT Press, 2007. – 736 pp.
Hall, M., Witten, Ian H., Frank, E. Data Mining: practical machine learning tools and techniques. – 2011. – 664 pp.
Han, J., Kamber, M., Pei, J. Data Mining: Concepts and Techniques, Third Edition. – Morgan Kaufmann Publishers, 2011. – 740 pp.
Larose, D. T., Larose, C. D. Discovering knowledge in data: an introduction to data mining. – John Wiley & Sons, 2014. – 336 pp.
Mazza, R. Introduction to information visualization. – Springer, 2009. – 139 pp.
Schölkopf, B., Smola, A. J. Learning with kernels: support vector machines, regularization, optimization, and beyond. – MIT Press, 2001. – 648 pp.
Webb, A. R. Statistical pattern recognition. – John Wiley & Sons, 2011. – 668 pp.
Witten, I. H. et al. Data Mining: Practical machine learning tools and techniques. – Morgan Kaufmann, 2017. – 654 pp.
