Master 2021/2022

Modern Methods of Data Analysis

Type: Compulsory course (Data Science)
Area of studies: Applied Mathematics and Informatics
When: year 1, modules 1 and 2
Mode of studies: offline
Open to: students of all HSE University campuses
Master’s programme: Data Science
Language: English
ECTS credits: 4
Contact hours: 54

Course Syllabus

Abstract

This is a course in basic methods for modern Data Analysis. Its contents are heavily influenced by the idea that data analysis should help in enhancing and augmenting knowledge of the domain as represented by the concepts and statements of relation between them. This view distinguishes the subject from related courses such as applied statistics, machine learning, data mining, etc. Two main pathways for data analysis are: (1) summarization, for developing and augmenting concepts, and (2) correlation, for enhancing and establishing relations. Visualization, in this context, is a way of presenting results in a cognitively comfortable way. The term summarization is understood quite broadly here to embrace not only simple summaries like totals and means, but also more complex summaries: the principal components of a set of features and cluster structures in a set of entities. Similarly, correlation here covers both bivariate and multivariate relations between input and target features including classification trees and Bayes classifiers. Another feature of the class is that its main thrust is to give an in-depth understanding of a few basic techniques rather than to cover a broad spectrum of approaches developed so far. Most of the described methods fall under the same least-squares paradigm for mapping an “idealized” structure to the data. This allows me to bring forward a number of mathematically derived relations between methods that are usually overlooked.
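As one illustration of the least-squares paradigm mentioned above, the standard K-Means criterion can be written as fitting an idealized structure (a partition with one center per cluster) to the data. The notation here is an assumption made for this sketch and is not taken from the syllabus: $y_{iv}$ is the value of feature $v$ on entity $i$, $S_1, \dots, S_K$ are the clusters, $c_{kv}$ are the cluster centers, and $e_{iv}$ are the residuals.

```latex
% Data model: each entity equals its cluster center plus a residual
y_{iv} = c_{kv} + e_{iv}, \qquad i \in S_k,\; k = 1, \dots, K

% Least-squares criterion minimized over partitions S and centers c
L(S, c) = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v} \left( y_{iv} - c_{kv} \right)^2

% For a fixed partition, the optimal centers are the within-cluster means
c_{kv} = \frac{1}{|S_k|} \sum_{i \in S_k} y_{iv}
```

Principal component analysis admits a similar reading, with a low-rank product of scores and loadings playing the role of the idealized structure.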
Learning Objectives

  • To give a student basic knowledge and competence in modern English language and style for technical discussions of data analysis and data mining problems on the international scene.
  • To provide a unified framework and system for capturing the mainstream of numerous data analysis approaches and methods developed so far.
  • To teach modern methods of data analysis, including cutting-edge techniques such as intelligent clustering, spectral clustering, consensus clustering, community detection, SVD and principal component analysis, bootstrapping for validation and comparison of averages, and evolutionary optimization.
  • To give hands-on experience in real-world data analysis.
  • To provide experience in using modern computational tools.
Expected Learning Outcomes

  • Students know methods and their theoretical underpinnings for clustering similarity and network data including community detection, spectral clustering, and consensus clustering.
  • Students know methods and their theoretical underpinnings for comparing cluster means with computational validation techniques such as bootstrapping.
  • Students know methods and their theoretical underpinnings for correlation and determinacy indexes from different perspectives.
  • Students know methods and their theoretical underpinnings for hierarchical clustering.
  • Students know methods and their theoretical underpinnings for interpreting clusters in nominal scales, Quetelet indexes and Pearson’s Chi-squared.
  • Students know methods and their theoretical underpinnings for K-Means clustering, including rules for its initialization and interpretation.
  • Students know methods and their theoretical underpinnings for matrices of covariance and correlation indexes; conventional formulation for PCA.
  • Students know methods and their theoretical underpinnings for matrix and probabilistic data models.
  • Students know methods and their theoretical underpinnings for mixed scales data, quantification, pre-processing, standardization.
  • Students know methods and their theoretical underpinnings for principal component analysis (PCA), SVD and data visualization.
  • Students know methods and their theoretical underpinnings for spectral clustering.
Course Contents

  • Intro: course contents and administration
  • Data table. Feature modeling. Feature as mapping. Probability feature model. Categorical data: probability and frequency. Conditional probability; independence; Bayes theorem. Continuous distribution and density function. Mean and variance. Random sample. Distribution of the sample mean. Central limit theorem.
  • K-Means clustering: method and properties
  • Cluster interpretation: comparison of means, bootstrap for confidence intervals
  • Cluster interpretation with categorical features: Pearson chi-squared, Quetelet indexes
  • Principal component analysis (PCA), Singular value decomposition (SVD), using PCA for data visualization
  • PCA: covariance and correlation matrices, meaning and properties of correlation coefficient in three perspectives; conventional formulation of PCA
  • Clustering similarity and network data; k-means converted criterion and algorithms
  • Consensus clustering; two criteria; reduction to network clustering
  • Spectral clustering
Assessment Elements

  • non-blocking Home Project
  • non-blocking Exam
Interim Assessment

  • 2021/2022 2nd module
    0.7 * Exam + 0.3 * Home Project
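
The weighting above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `final_grade` and the sample scores are assumptions, and any rounding rule applied by the programme is not stated in the syllabus and is not modeled here.

```python
def final_grade(exam: float, home_project: float) -> float:
    """Weighted interim grade: 0.7 * Exam + 0.3 * Home Project."""
    return 0.7 * exam + 0.3 * home_project


if __name__ == "__main__":
    # Hypothetical scores, used only to illustrate the weighting.
    exam_score = 8.0
    project_score = 6.0
    print(f"Interim grade: {final_grade(exam_score, project_score):.2f}")
```

Because the exam carries more than twice the weight of the home project, a one-point change in the exam score moves the result by 0.7 points, while a one-point change in the project score moves it by 0.3.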
Bibliography

Recommended Core Bibliography

  • Mirkin, B. Core concepts in data analysis: summarization, correlation and visualization. – Springer Science & Business Media, 2011. – 388 pp.

Recommended Additional Bibliography

  • Grünwald, Peter D. The minimum description length principle. – MIT Press, 2007. – 736 pp.
  • Hall, M., Witten, Ian H., Frank, E. Data Mining: practical machine learning tools and techniques. – 2011. – 664 pp.
  • Han, J., Kamber, M., Pei, J. Data Mining: Concepts and Techniques, Third Edition. – Morgan Kaufmann Publishers, 2011. – 740 pp.
  • Larose, D. T., Larose, C. D. Discovering knowledge in data: an introduction to data mining. – John Wiley & Sons, 2014. – 336 pp.
  • Mazza, R. Introduction to information visualization. – Springer, 2009. – 139 pp.
  • Schölkopf, B., Smola, A. J. Learning with kernels: support vector machines, regularization, optimization, and beyond. – MIT Press, 2001. – 648 pp.
  • Webb, A. R. Statistical pattern recognition. – John Wiley & Sons, 2011. – 668 pp.
  • Witten, I. H. et al. Data Mining: Practical machine learning tools and techniques. – Morgan Kaufmann, 2017. – 654 pp.