Explanation-oriented Methods of Data Analysis for Semantically Rich Data and Their Applications

Priority areas of development: mathematics
The project has been carried out as part of the HSE Program of Fundamental Studies.

Goal of research:

The research aims to develop new mathematical models, algorithms and software tools for data mining and knowledge discovery in data with complex structure, including text mining, graph mining, and machine learning algorithms for the classification of complex objects. The developed methods, algorithms and software tools will be applied to practical tasks.

Thus, the object of the research comprises methods, algorithms and software tools for data mining and visualization, ontology modelling, automatic text processing, etc. The subject of the research is the properties of these methods and algorithms, such as scope of application, precision and performance, with special interest in interpretability (explainability).


Methodology:

The research is based on methods of discrete mathematics, computer science, computational linguistics and software engineering. First, we consider fundamental mathematical models based on Formal Concept Analysis (FCA), clustering, machine learning and applied graph theory. Second, we use methods of automatic text processing and ontology modelling. We then implement original methods and algorithms as components of intelligent software systems. Such implementations can be tested on synthetic tasks and adopted in practical applications.

Empirical base of research:

For testing purposes, we use synthetic data as well as data gathered from electronic scientific libraries, social media services, collections of healthcare records, NRU HSE students’ works, and open data repositories such as the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml).

Results of research:

Twenty scientific papers reporting the results of the research were prepared from December 2016 to November 2017. The main results are:

1. New algorithms for text fragments classification and similarity analysis based on syntactic and discourse structures of fragments.

2. Advances in relevance analysis of texts based on annotated suffix trees.

3. Advances in original lazy classification methods applied to clinical informatics tasks including oncology therapy optimization.

4. Advances in models for prediction of natural history of breast cancer.

5. Implementation of new approaches to interpretation and analysis of frequent closed sets of attributes.

6. In-depth study of educational data mining in adaptive learning.

7. New methods of automated assessment of mind maps.

8. New strategies and technologies for the deployment of container nodes to gather data from external data sources.
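
Result 5 above concerns frequent closed sets of attributes, a notion from FCA. As an illustration only, the following minimal sketch shows what such sets are on a toy formal context. It is not the project's implementation: the context `CONTEXT`, the function names and the brute-force enumeration are all assumptions made for this example. An attribute set is closed if it equals the set of attributes shared by all objects that have it, and frequent if at least `min_support` objects have it.

```python
from itertools import combinations

# Hypothetical toy formal context: object -> set of attributes it has.
CONTEXT = {
    "o1": {"a", "b", "c"},
    "o2": {"a", "b"},
    "o3": {"a", "c"},
    "o4": {"b", "c"},
}


def extent(attrs, context):
    """Return the objects that have every attribute in attrs."""
    return {obj for obj, has in context.items() if attrs <= has}


def closure(attrs, context):
    """Return the attributes shared by all objects in the extent of attrs."""
    objs = extent(attrs, context)
    if not objs:
        # No object has all of attrs: the closure is the full attribute set.
        return set().union(*context.values())
    return set.intersection(*(context[o] for o in objs))


def frequent_closed_sets(context, min_support):
    """Brute-force enumeration of frequent closed sets (toy data only)."""
    all_attrs = sorted(set().union(*context.values()))
    found = {}
    for size in range(len(all_attrs) + 1):
        for combo in combinations(all_attrs, size):
            attrs = set(combo)
            support = len(extent(attrs, context))
            if support >= min_support and closure(attrs, context) == attrs:
                found[frozenset(attrs)] = support
    return found


result = frequent_closed_sets(CONTEXT, min_support=2)
for attrs, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(attrs), support)
```

The enumeration is exponential in the number of attributes, so it serves only to make the definition concrete; practical algorithms avoid visiting every subset.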

Level of implementation, recommendations on implementation or outcomes of the implementation of the results:

The obtained results are applicable across a spectrum of disciplines in which the analysis of datasets with complex structure is in high demand and inevitably requires the participation of a domain expert (medical informatics, education, sociology, logistics, criminology, etc.).

The effectiveness, efficiency and correctness of the proposed models and methods are supported by comparative studies, testing and practical adoption. The level of implementation varies across methods and software tools. The new theoretical results concern FCA, machine learning and text processing, fields that underlie almost all modern semantic technologies. Domain experts found the practical implementations of the proposed data analysis methods to be well explainable.

The research produced a synergy effect across several international collaborative projects of ISSA Lab and made it possible to apply the models and methods of data analysis to practical tasks in collaboration with Gemotest Laboratory, Dmitry Rogachev National Research Center of Pediatric Hematology, Oncology and Immunology (Russia), LORIA and LIRIS (France), TU-Dresden (Germany), etc.


