Современные методы анализа данных
Status: Compulsory course (Data Science)
Field of study: 01.04.02 Applied Mathematics and Informatics
Delivered by: Faculty of Computer Science
When: 1st year, modules 1-2
Mode of study: Full time
Degree programme: Data Science
This is a course in basic methods for modern Data Analysis. Its contents are heavily influenced by the idea that data analysis should help in enhancing and augmenting knowledge of the domain as represented by the concepts and statements of relation between them. This view distinguishes the subject from related courses such as applied statistics, machine learning, data mining, etc. Two main pathways for data analysis are: (1) summarization, for developing and augmenting concepts, and (2) correlation, for enhancing and establishing relations. Visualization, in this context, is a way of presenting results in a cognitively comfortable way. The term summarization is understood quite broadly here to embrace not only simple summaries like totals and means, but also more complex summaries: the principal components of a set of features and cluster structures in a set of entities. Similarly, correlation here covers both bivariate and multivariate relations between input and target features including classification trees and Bayes classifiers. Another feature of the class is that its main thrust is to give an in-depth understanding of a few basic techniques rather than to cover a broad spectrum of approaches developed so far. Most of the described methods fall under the same least-squares paradigm for mapping an “idealized” structure to the data. This allows me to bring forward a number of mathematically derived relations between methods that are usually overlooked.
- To give a student basic knowledge and competence in modern English language and style for technical discussions of data analysis and data mining problems on the international scene.
- To provide a unified framework and system for capturing the mainstream of numerous data analysis approaches and methods developed so far.
- To teach modern methods of data analysis, including cutting-edge techniques such as intelligent clustering, spectral clustering, consensus clustering, community detection, SVD and principal component analysis, bootstrapping for validation and comparison of averages, and evolutionary optimization techniques.
- To give hands-on experience in real-world data analysis.
- To provide experience in using modern computational tools.
- Students know methods and their theoretical underpinnings for matrix and probabilistic data models.
- Students know methods and their theoretical underpinnings for mixed scales data, quantification, pre-processing, standardization.
- Students know methods and their theoretical underpinnings for K-Means clustering, including rules for its initialization and interpretation.
- Students know methods and their theoretical underpinnings for comparing cluster means with computational validation techniques such as bootstrapping.
- Students know methods and their theoretical underpinnings for interpreting clusters in nominal scales, Quetelet indexes and Pearson’s Chi-squared.
- Students know methods and their theoretical underpinnings for clustering similarity and network data including community detection, spectral clustering, and consensus clustering.
- Students know methods and their theoretical underpinnings for hierarchical clustering.
- Students know methods and their theoretical underpinnings for principal component analysis (PCA), SVD and data visualization.
- Students know methods and their theoretical underpinnings for matrices of covariance and correlation indexes; conventional formulation for PCA.
- Students know methods and their theoretical underpinnings for correlation and determinacy indexes from different perspectives.
- Students know methods and their theoretical underpinnings for spectral clustering.
- Intro: course contents and administration
- Data table. Feature modeling. Feature as mapping. Probability feature model. Categorical data: probability and frequency. Conditional probability; independence; Bayes theorem. Continuous distribution and density function. Mean and variance. Random sample. Distribution of the sample mean. Central limit theorem. Quantitative coding for mixed scales. Elements of matrix theory: linear subspaces and principal directions. Least-squares approximation. Full system of events: nominal feature. Bivariate distribution and contingency table. Popular distributions (Gaussian, power law, Poisson, Bernoulli). Chi-squared distribution. Distribution of the sample variance.
- K-Means clustering: method and properties. Clustering criterion and its reformulations. K-Means clustering as alternating minimization; nature-inspired algorithms for K-Means; partition around medoids (PAM); choosing the number of clusters; initialization of K-Means; anomalous pattern and Intelligent K-Means.
- Cluster interpretation: comparison of means, bootstrap for confidence intervals; cluster interpretation aids.
- Cluster interpretation over categorical features: Pearson chi-squared, Quetelet indexes.
- Principal component analysis (PCA), Singular value decomposition (SVD), using PCA for data visualization
- PCA: covariance and correlation matrices, meaning and properties of correlation coefficient in three perspectives; conventional formulation of PCA
- Clustering similarity and network data; the K-Means criterion converted for similarity data, and corresponding algorithms.
- Consensus clustering; two criteria; reduction to network clustering
- Spectral clustering
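As a minimal illustration of the K-Means alternating minimization named in the topic list, here is a plain Lloyd's-algorithm sketch in NumPy; the two-blob test data and all parameter choices are made up for illustration, not taken from the course materials.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate between assigning points to the
    nearest center and moving each center to its cluster's mean, which
    monotonically decreases the least-squares clustering criterion."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center moves to the mean of its cluster.
        centers_new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(centers_new, centers):
            break
        centers = centers_new
    return labels, centers

# Two well-separated blobs: K-Means recovers them as the two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centers = k_means(X, k=2)
```

A production version would also guard against empty clusters and run several random restarts, keeping the partition with the smallest criterion value.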
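The topic on comparing cluster means with bootstrap confidence intervals can be sketched as a percentile bootstrap for the difference of two group means; the groups here are synthetic stand-ins for "a cluster" and "the rest of the entities".

```python
import numpy as np

def bootstrap_ci_diff(a, b, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b): resample each group
    with replacement and take empirical quantiles of the resampled
    differences of means."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(a, size=len(a)).mean()
                    - rng.choice(b, size=len(b)).mean())
    return tuple(np.quantile(diffs, [alpha / 2, 1 - alpha / 2]))

rng = np.random.default_rng(2)
cluster = rng.normal(1.0, 1.0, 80)   # cluster with a shifted mean
rest = rng.normal(0.0, 1.0, 120)     # remaining entities
lo, hi = bootstrap_ci_diff(cluster, rest)
# If the whole interval lies above 0, the cluster mean is significantly
# higher than the rest at the chosen confidence level.
```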
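For the topic on interpreting clusters over categorical features, a short sketch of Quetelet indexes and Pearson's chi-squared on a contingency table, including the identity relating the two; the counts are invented for illustration.

```python
import numpy as np

# Contingency table: rows = clusters, columns = categories of a nominal
# feature (co-occurrence counts; values here are made up).
C = np.array([[30, 10],
              [ 5, 55]], dtype=float)
N = C.sum()
P = C / N                            # joint frequencies p(i, j)
pr = P.sum(axis=1, keepdims=True)    # row marginals p(i)
pc = P.sum(axis=0, keepdims=True)    # column marginals p(j)

# Quetelet index: relative change of the probability of category j
# when the entity is known to belong to cluster i.
Q = P / (pr * pc) - 1

# Pearson chi-squared, and the same value obtained as the
# frequency-weighted sum of Quetelet indexes: X^2 = N * sum p(i,j) q(i,j).
chi2 = N * ((P - pr * pc) ** 2 / (pr * pc)).sum()
chi2_via_quetelet = N * (P * Q).sum()
```

The second formula shows chi-squared as a summary of the Quetelet indexes, which is why it can serve as a cluster-interpretation aid rather than only a test statistic.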
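The PCA/SVD topic can be illustrated by computing principal components directly from the SVD of the centered data matrix; the synthetic data, with one dominant direction of variance, is invented for the example.

```python
import numpy as np

def pca_via_svd(X, n_comp=2):
    """Principal components via SVD of the centered data matrix X = U S V^T.
    Rows of V^T are principal directions; projections give the scores
    used for a 2-D visualization scatter plot."""
    Xc = X - X.mean(axis=0)              # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_comp].T          # coordinates on the first components
    explained = s ** 2 / (s ** 2).sum()  # share of variance per component
    return scores, explained[:n_comp]

rng = np.random.default_rng(3)
t = rng.normal(0, 3, 200)                # one dominant latent direction
X = np.column_stack([t,
                     0.5 * t + rng.normal(0, 0.2, 200),
                     rng.normal(0, 0.2, 200)])
scores, explained = pca_via_svd(X)       # explained[0] is close to 1 here
```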
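Finally, a minimal sketch of spectral clustering for similarity data: a two-way partition by the sign of the Fiedler vector of the graph Laplacian. The 6-node similarity matrix, with two dense blocks joined by one weak edge, is a toy example, not course data.

```python
import numpy as np

def fiedler_partition(W):
    """Two-way spectral partition: split nodes by the sign of the
    eigenvector of the Laplacian L = D - W corresponding to the
    second-smallest eigenvalue (the Fiedler vector)."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)

# Similarity matrix: two tightly connected blocks, one weak bridge.
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
W[2, 3] = W[3, 2] = 0.1              # the only inter-block similarity
labels = fiedler_partition(W)        # recovers the two blocks
```

For more than two clusters one would instead take several leading eigenvectors (typically of a normalized Laplacian) and run K-Means on the resulting embedding.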
- Home Project (non-blocking). The home project's goal is to give students practical experience with the data analysis methods under study, applied to a real-world data table taken from the internet or any other data source; the data should include at least 100 entities over at least 10 features and must be approved by the instructor. A project may involve a team of two or three individuals, if approved by the instructor. A project includes a number of tasks to be carried out by the team after the corresponding method has been explained in a lecture.
- Exam (non-blocking). The exam paper is an in-class test of approximately 80-100 minutes. It includes about 6-7 questions, two of which are theoretical and four to five practical (examples of practical questions are given in the next section). One more question relates to a task in the individual home project. To make cheating more difficult, there are 6-7 versions of parameter settings in the paper. Each question is assigned a maximum mark; the total of marks is 100. In marking, each answer is evaluated according to how fully it covers the related material, within the maximum mark assigned to the question. The sum of marks is the exam mark, in per cent, which is rounded to the conventional 10-grade scale. The rounding follows fairness criteria: say, 52 and 53 are rounded to 5; 58 and 59, to 6. The rounding of borderline marks takes into account the student's discipline, including absence from lectures and seminars. A student who missed more than half of the sessions should not be surprised if their 56% mark is rounded down to 5. Some questions may involve simple calculations that can be done with electronic calculators; no more complex devices are permitted in the exam (neither notebooks, nor smartphones, nor tablets): exact numeric solutions are not necessary for a successful answer.
- Mirkin, B. Core concepts in data analysis: summarization, correlation and visualization. – Springer Science & Business Media, 2011. – 388 pp.
- Grünwald, P. D. The minimum description length principle. – MIT Press, 2007. – 736 pp.
- Hall, M., Witten, I. H., Frank, E. Data Mining: practical machine learning tools and techniques. – Morgan Kaufmann, 2011. – 664 pp.
- Han, J., Kamber, M., Pei, J. Data Mining: Concepts and Techniques, Third Edition. – Morgan Kaufmann Publishers, 2011. – 740 pp.
- Larose, D. T., Larose, C. D. Discovering knowledge in data: an introduction to data mining. – John Wiley & Sons, 2014. – 336 pp.
- Mazza, R. Introduction to information visualization. – Springer, 2009. – 139 pp.
- Schölkopf, B., Smola, A. J. Learning with kernels: support vector machines, regularization, optimization, and beyond. – MIT Press, 2001. – 648 pp.
- Webb, A. R. Statistical pattern recognition. – John Wiley & Sons, 2011. – 668 pp.
- Witten, I. H. et al. Data Mining: Practical machine learning tools and techniques. – Morgan Kaufmann, 2017. – 654 pp.