Упорядоченные множества в анализе данных

Магистратура 2019/2020

Лучший по критерию «Новизна полученных знаний»

Статус: Курс по выбору (Науки о данных)

Направление: 01.04.02. Прикладная математика и информатика

Кто читает: Департамент анализа данных и искусственного интеллекта

Где читается: Факультет компьютерных наук

Когда читается: 1-й курс, 1, 2 модуль

Формат изучения: без онлайн-курса

Преподаватели: Кузнецов Сергей Олегович, Строк Федор Владимирович

Прогр. обучения: Науки о данных

Язык: английский

Кредиты: 5

Контактные часы: 56

Full Syllabus

Abstract

“Ordered sets for data analysis” is a first-semester master course within “Data Science” master program. Order relations are ubiquitous, we meet them when we consider numbers, Boolean algebras, partitions, multisets, graphs, logical formulas, and many other mathematical entities. On the one hand, order relations are used for representing data and knowledge, on the other hand they serve as important tools for describing models and methods of data analysis, such as decision trees, random forests, version spaces, association rules, etc. We start from the basic notions of binary relations, their representations with matrices and graphs, consider properties of relations and operations on them, giving examples on problems of data analysis. We come then to the notions of order, semilattice, and lattice. Very important data models like numbers, vectors of numbers, set algebras, partitions make lattices. We consider an important type of lattices, called concept lattices, which constitute the main subject of Formal Concept Analysis (FCA), a modern domain of applied lattice theory. Concept lattices provide very useful tools for data analysis and pattern mining. On the one hand, formal concepts, the elements of a concept lattice, describe sets of objects sharing sets of common attributes, thus presenting a nice model of (bi)clustering. Ordered sets of concepts give a hierarchy of classes of a subject domain, which makes a backbone of domain ontology. On the other hand, concept lattices represent dependencies on attributes, both precise (called implications), and imprecise, known in data mining as association rules. We consider most important algorithms and algorithmic problems related to applications of FCA. Concept lattices allow us to describe some fundamental models of data analysis and pattern mining such as decision trees, random forests, version spaces, and concept-based hypotheses. Since a serious limitation of many methods of pattern mining is computational complexity, complexity analysis makes an important leitmotif of the course. Throughout the course we consider applications of the introduced notions in various applied domains, such as text analysis, medicine, pharmacology, etc. Homework consists of both theoretical problems and practical problems in designing particular.

Learning Objectives

To help students to synthesize knowledge from undergraduate courses on discrete mathematics, algorithmic theory, data analysis to get a stereoscopic picture of the foundations of pattern mining.
Providing students with some important theoretical knowledge and practical skills that would help them to see the relationship of various methods of pattern mining and knowledge discovery.

Expected Learning Outcomes

Know basic notions and terminology of relations, graphs, and matrices as data structures.
Know basic notions and terminology of algorithmic complexity.
Be familiar with complexity classes, (polynomial) reduction between problems, problem hardness and completeness.
Know basic notions and terminology of order theory and symbolic data analysis.
Be familiar with semilattices and lattices, and know their properties.
Develop mathematical models based on order theory and Galois connections.
Understand Formal Concept Analysis (FCA) and its Basic theorem.
Understand basic notions of dependencies in data.
Be familiar with implications, functional dependencies, Armstrong rules, and implication bases.
Be familiar with Association rules in Data Mining, their support and confidence.
Understand fundamental principles of symbolic pattern mining.
Understand the main algorithms for Pattern Mining with concept lattices.
Understand clustering and biclustering, and usage of Formal concepts as biclusters.
Understand Interestingness measures of patterns.
Understand fundamental principles of learning and classification through Concept Lattices.
Be familiar with concept-based (JSM) hypotheses.
Be familiar with version space, minimal generators and minimal hypotheses.
Know differense between Decision trees and hypothesis.
Be capable of realizing and testing basic learning and classification model.
Understand fundamental principles of working with complex data using pattern structures and pattern setups.

Course Contents

Introduction. Relations, Graphs, and Matrices as Data Structures
Goal of the course. The importance and ubiquity of ordered sets in data science. Set algebra, an example of classification and learning model based on set algebra. Binary relations. Graphs, subgraphs, parts, cycles, cliques, trees, bipartite graphs. Graphs of binary relations. Properties of binary relations (reflexivity, symmetricity, antisymmetricity, transitivity, linearity), operations on binary relations (complement, dualization, product, degree, transitive closure) and their graph-theoretic interpretations. Important types of binary relations (equivalence, tolerance, orders). Similarity in data as tolerance. Tolerance classes as clusters.
Algorithmic Complexity for Data Science
Computational model. Turing machine, random access machine (RAM). Worst-case estimates of time and space complexity. Classes P, NP, co-NP and #P, illustrations on data analysis and pattern mining problems (community detection on the web etc.). (Polynomial) reduction between problems, problem hardness and completeness. Delay and polynomial delay. Cumulative (polynomial) delay. Amortized complexity. Anytime algorithms. Average complexity, experimental complexity. Algorithmic complexity of testing properties and performing operations on relations. Standard solvers (SAT, ASP, CSP) for pattern mining.
Orders, Their Graphs, and Diagrams for Data and Knowledge Representation
Partial order, strict order, quasiorder (preorder), linear order. Graph of the order, topological sorting, covering relation, order (Hasse) diagram. Partial order as transitive reflexive closure of the covering relation. Matrix interpretation and algorithmic complexity. Preorder (quasiorder), its incomparability relation and respective partial order. Examples of partial orders in mathematics and data science: multisets, partitions, sequences and graphs. Dimensions of partial order.
Semilattices and Lattices
Upper (lower) bounds, infimum, supremum, semilattice, lattice. Two definitions of lattices and their equivalence. Diagrams of semilattices and lattices. Orders on multisets and partitions make lattices. Data analysis based on lattices. Types of lattices (complete, modular, matroids, distributive and Boolean lattices) and their diagrams. Order filters and ideals. Order completions.
Galois Connections and Formal Concept Analysis (FCA)
(Antitone) Galois connections and their properties. Galois connection on a binary relation (formal context). Closure operators and closure systems (Moore families). Closed sets of objects and attributes. Formal concept as a “conjunctive cluster”, generality order on formal concepts, concept lattice. Monotone Galois connections. Kernel operator. Dual concept as “disjunctive cluster”. <br /> Meet- and join-irreducible lattice elements. Basic theorem of FCA. The size of a concept lattice. Many-valued contexts, scaling.
Dependencies in Data: Implications and Functional Dependencies
Implications, Armstrong rules. Implication bases: minimal (Duquenne-Guigues) base, direct base. Pseudo-intents, minimal generators, and proper premises. Implications and functional dependencies. Form of base and lattice type. Worst-case exponential size of bases, intractability of computing a minimal base (no polynomial-delay), approximations.
Association Rules and Concept Lattices for Data Mining
Association rules in Data Mining, their support and confidence. Closure via counting inference. Frequent closed itemsets. Condensed representation of association rules with concept lattices. Luxenburger basis of association rules. Other condensed representations. Rule interestingness measures (pattern constraints). Antimonotonic pattern constraints.
Algorithms for Pattern Mining with Concept Lattices
#P-completeness of determining the size of a concept lattice. Polynomial-delay algorithms for computing concepts: NextClosure, Close-by-One (CbO), CbO family (FCbO, InClose). Algorithms for computing the covering relation. Polynomial delay of computing covering relation and condensed representation of association rules. ConExp Software.
Biclustering and Interestingness Measures of Patterns
Clustering and biclustering. Formal concepts as biclusters. Biclusters of similar values as triadic concepts. Generating most interesting biclusters (concepts). Antimonotonicity of interestingness measures. Order antimonotonicity, projection antimonotonicity. Stability, level stability, integral stability. Intractability of stability indices, tractable approximations of stability. Probability of a concept. Basic level and other interestingness measures.
Models of Learning and Classification Through Concept Lattices
Concept-based (JSM) hypotheses. Implications and hypotheses. Version space, its most general elements (minimal generators) and most specific elements (minimal hypotheses). Decision trees vs. hypotheses. Version space through Galois connection. Lazy classification with hypotheses, implications, and association rules.
Working with Complex Data: Pattern Structures and Pattern Setups
Scaling complex data: complexity and artifacts. Pattern structures and pattern setups. Pattern structures vs. scaling. Pattern structures on graphs, strings, and interval vectors. Functional dependencies as object implications in a pattern structure. Biclusters as pattern concepts. Applications in bioinformatics and text mining.

Assessment Elements

Homework
Research project
Exam
Students have to demonstrate knowledge of theory, but the most of tasks would evaluate their ability to solve practical examples, present straight operation, and recognition skills to solve them.

Interim Assessment

Interim assessment (2 module)
0.2 * Exam + 0.48 * Homework + 0.32 * Research project

Bibliography

Recommended Core Bibliography

Cormen, T. H., Leiserson, C. E., Rivest, R. L., Stein, C. Introduction to Algorithms (3rd edition). – MIT Press, 2009. – 1292 pp.
Kuznetsov, S. O. Fitting pattern structures to knowledge discovery in big data // International conference on formal concept analysis. – Springer, Berlin, Heidelberg, 2013. – PP. 254-266.

Recommended Additional Bibliography

Kuznetsov, S. O. Pattern structures for analyzing complex data // International Workshop on Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft Computing. – Springer, Berlin, Heidelberg, 2009. – P. 33-44.
Kuznetsov, S. O. Scalable knowledge discovery in complex data with pattern structures // International Conference on Pattern Recognition and Machine Intelligence. – Springer, Berlin, Heidelberg, 2013. – P. 30-39.

Course Syllabus