Goal of research:
The research aims to develop new mathematical models, algorithms and software tools for data mining and knowledge discovery in data with complex structure, including text mining, graph mining, and machine learning algorithms for the classification of complex objects. The developed methods, algorithms and software tools will be applied to practical tasks.
The object of the research is the methods, algorithms and software tools for data mining and visualization, ontology modelling, automatic text processing, etc. The subject of the research is the properties of these methods and algorithms, such as scope of application, precision and performance.
Methodology:
The research is based on methods of discrete mathematics, computer science, computational linguistics and software engineering. First, we consider fundamental mathematical models based on Formal Concept Analysis (FCA), clustering, machine learning and applied graph theory. Second, we use methods of automatic text processing and ontology modelling. We then implement the original methods and algorithms in various components of intelligent software. These implementations can be tested on synthetic tasks and adopted in practical applications.
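As a brief illustration of the FCA foundation mentioned above, the following minimal sketch enumerates the formal concepts of a small binary context using the two derivation operators. The toy context and all names (`CONTEXT`, `intent`, `extent`, `concepts`) are illustrative assumptions and are not taken from the software described in this report; the brute-force enumeration is suitable only for very small examples.

```python
from itertools import combinations

# Hypothetical toy context: object -> set of attributes it has.
CONTEXT = {
    "duck": {"flies", "swims"},
    "eagle": {"flies", "hunts"},
    "shark": {"swims", "hunts"},
}
ATTRIBUTES = set().union(*CONTEXT.values())


def intent(objects):
    """Attributes shared by all given objects (all attributes if none are given)."""
    result = set(ATTRIBUTES)
    for obj in objects:
        result &= CONTEXT[obj]
    return result


def extent(attributes):
    """Objects that have every given attribute."""
    return {obj for obj, attrs in CONTEXT.items() if attributes <= attrs}


def concepts():
    """Enumerate all formal concepts (extent, intent) by closing every attribute subset."""
    found = set()
    attrs = sorted(ATTRIBUTES)
    for size in range(len(attrs) + 1):
        for subset in combinations(attrs, size):
            ext = extent(set(subset))
            # Applying intent() to the extent yields the closed attribute set.
            found.add((frozenset(ext), frozenset(intent(ext))))
    return found
```

Each returned pair is a maximal set of objects together with exactly the attributes they share; practical FCA tools use far more efficient algorithms than this exhaustive closure.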
Empirical base of research:
For testing we use synthetic data and widely used datasets from open data sources such as the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml), the MovieLens service (https://movielens.org), the ImhoNet service (http://imhonet.ru), collected social network datasets, etc.
Results of research:
The results obtained during December 2015 – November 2016 are reported in 26 scientific papers. The main results are:
- New machine learning methods for lazy classification of objects with complex structure based on pattern structures.
- New collaborative filtering methods with aggregated similarity measure.
- A new algorithm for text fragment classification and similarity analysis based on the syntactic and discourse structures of fragments.
- Application of the original lazy classification and attribute exploration methods to tasks of clinical informatics, including treatment optimization in oncology.
- New models and algorithms for predicting the natural history of breast cancer.
- An in-depth study of learning analytics and educational data mining tasks and methods.
- Implementation of new complex data preprocessing subsystems of the Formal Concept Analysis Research Toolbox (FCART) that target natural language text processing tasks.
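The lazy classification idea named in the first result above can be sketched for the simplest kind of pattern structure, where descriptions are sets of features and similarity is set intersection. This is a minimal sketch of one commonly described voting scheme, not necessarily the method developed in the project; the data and the names `POSITIVES`, `NEGATIVES` and `lazy_classify` are hypothetical.

```python
# Hypothetical training examples, each described by a set of features.
POSITIVES = [{"a", "b", "c"}, {"a", "b", "d"}]
NEGATIVES = [{"c", "d", "e"}, {"d", "e", "f"}]


def lazy_classify(test, positives, negatives):
    """Classify `test` (a set of features) by voting with intersections.

    The intersection of `test` with a training example counts as a vote for
    that example's class if it is non-empty and is not contained in any
    example of the opposite class. No model is built in advance.
    """
    def votes(own, other):
        count = 0
        for example in own:
            common = test & example
            if common and not any(common <= counter for counter in other):
                count += 1
        return count

    pos_votes = votes(positives, negatives)
    neg_votes = votes(negatives, positives)
    if pos_votes > neg_votes:
        return "positive"
    if neg_votes > pos_votes:
        return "negative"
    return "undetermined"
```

For objects with richer structure (graphs, intervals, parse trees), the set intersection would be replaced by the similarity operation of the corresponding pattern structure, while the voting logic stays the same.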
Level of implementation, recommendations on implementation or outcomes of the implementation of the results:
The obtained results apply to a range of disciplines in which the analysis of datasets with complex structure is in high demand and inevitably requires the participation of a domain expert (medical informatics, education, sociology, logistics, criminology, etc.).
The effectiveness, efficiency and correctness of the proposed models and methods are supported by comparative studies, testing and practical adoption. The level of implementation varies across methods and software tools. The new theoretical results concern FCA, machine learning and text processing, fields that underlie almost all semantic technologies. The new functionality of FCART is actively used in an electronic library analysis project.
The research produced a synergy effect across several international collaborative projects of ISSA Lab and made it possible to apply models and methods of data analysis to practical tasks in cooperation with the D. Rogachev Federal Scientific and Clinical Centre of Pediatric Hematology, Oncology and Immunology (Russia), LORIA and LIRIS (France), and TU Dresden (Germany).