
Master's Programme "Data Science"

Modern Methods of Data Analysis

This is an unconventional course in modern Data Analysis and Mining. Its contents are heavily influenced by the idea that data analysis should help to enhance and augment knowledge of the domain, as represented by concepts and statements of relations between them. According to this view, the two main pathways for data analysis are summarization, for developing and augmenting concepts, and correlation, for enhancing and establishing relations. Visualization, in this context, is a way of presenting results in a cognitively comfortable way. The term summarization is understood quite broadly here to embrace not only simple summaries like totals and means, but also more complex summaries: the principal component of a set of features and cluster structures in a set of entities. Similarly, correlation here covers both bivariate and multivariate relations between input and target features, including classification trees and Bayes classifiers. The course topics are: 1D analysis (histograms, centrality and spread values, bootstrap), 2D analysis (scatter plots and linear regression, contingency tables and chi-squared visualized), the naïve Bayes classifier and decision trees, the principal component for scoring a hidden factor, and k-means and related methods for clustering.

Unlike most other subjects in Computer Science, Data Analysis looks at data from the inside rather than the outside. In this course, summarization extends to the principal components of a set of features and to cluster structures in a set of entities, and correlation covers both bivariate and multivariate relations between input and target features, including neural networks, classification trees and Bayes classifiers.

The material presented from this perspective forms a unique mix of subjects from statistical data analysis, data mining, and computational intelligence, fields that follow different systems of presentation.

Another feature of the module is that its main thrust is to give an in-depth understanding of a few basic techniques rather than to cover the broad spectrum of approaches developed so far. Most of the described methods fall under the same least-squares paradigm for mapping an “idealized” structure to the data. This allows me to bring forward a number of relations between methods that are usually overlooked. Although the in-depth approach involves a great deal of technical detail, this detail is encapsulated in specific fragments termed “formulation” parts. The main, “presentation”, part is delivered with no mathematical formulas and explains a method by actually applying it to a small real-world dataset; this part can be read and studied with no concern for the formulation at all. There is one more part, “computation”, which addresses computational data processing issues using the MATLAB computing environment. This three-way narrative style targets a typical student in Computer Science or Engineering.
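As a rough illustration of the least-squares paradigm mentioned above, the sketch below treats both the mean and k-means as ways of fitting an idealized structure (one center, or several centers) to data by minimizing the summed squared error. It is written in Python rather than the MATLAB environment used in the course; the function names `squared_error`, `assign` and `kmeans_1d` and the toy data are assumptions made for this example and are not taken from the course materials.

```python
def squared_error(values, center):
    """Sum of squared deviations of the values from a single center."""
    return sum((v - center) ** 2 for v in values)


def assign(values, centers):
    """Split the values into clusters by nearest center (squared distance)."""
    clusters = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda k: (v - centers[k]) ** 2)
        clusters[nearest].append(v)
    return clusters


def kmeans_1d(values, initial_centers, iterations=10):
    """Plain alternating-minimization k-means on 1D data (illustrative only)."""
    centers = list(initial_centers)
    for _ in range(iterations):
        clusters = assign(values, centers)
        # The least-squares center of a cluster is its mean;
        # an empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[k]
            for k, c in enumerate(clusters)
        ]
    return centers, assign(values, centers)


# Hypothetical toy data, chosen only to make the arithmetic easy to follow.
data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

# One-center summary: the mean minimizes the squared error.
grand_mean = sum(data) / len(data)
one_center_error = squared_error(data, grand_mean)

# Two-center summary: k-means applies the same criterion within clusters.
centers, clusters = kmeans_1d(data, initial_centers=[1.0, 12.0])
two_center_error = sum(
    squared_error(c, m) for c, m in zip(clusters, centers)
)

print(grand_mean, one_center_error)   # 6.5 125.5
print(centers, two_center_error)      # [2.0, 11.0] 4.0
```

Both summaries are scored by the same criterion, so the drop in squared error from one center to two gives a direct measure of how much the cluster structure adds over the plain mean.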

Academic year: 2014/2015
Language of instruction: English
Credits: 4

Course author

Boris Grigorievich Mirkin

Technologies for Modeling Complex Systems

Status: Compulsory course
When taught: Year 1, modules 1 and 2

Prerequisites

Basics of calculus, including the concepts of function, derivative and the first-order optimality condition; basic linear algebra, including vectors, inner products, Euclidean distances, matrices, and singular value and eigenvalue decompositions; basic probability, including conditional probability, stochastic independence and the Gaussian density function; and basic set theory notation.