Mathematical Models, Algorithms, and Software Tools for the Intelligent Analysis of Big Textual and Structural Data

2013
Department: Scientific-Educational Laboratory for Intelligent Systems and Structural Analysis

In 2013, the Laboratory of Intelligent Systems and Structural Analysis continued its planned research, building on the preceding year's results. The relevance of the research is defined by the need to develop methods for analysing complex (textual and structured) distributed information in different areas, ranging from physics, chemistry, and the life sciences to economics, sociology, and political science.

A related task is the training of HSE master's and bachelor's students specializing in these fields. Data-mining research has recently become highly important for developing a new generation of intelligent systems. Web data mining, which includes social network analysis, recommender systems, and distributed databases of textual documents, is an increasingly prominent part of this work.

Data-mining research ultimately must deal with large datasets. Distributed large databases in particular require new approaches and highly efficient algorithms. Employing Formal Concept Analysis (FCA) models simplifies the discovery and mining of hidden knowledge in such large datasets.
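As a toy illustration of the FCA machinery underlying this work (the context, object names, and attribute names below are invented for this sketch and are not part of the laboratory's software), the derivation operators that pair object sets with attribute sets can be written in a few lines of Python:

```python
# Minimal sketch of FCA derivation operators on a toy binary context.
# The context below (objects and their attributes) is invented for illustration.
context = {
    "duck": {"flies", "swims"},
    "swan": {"flies", "swims"},
    "dog":  {"runs"},
}

def intent(objects):
    """Attributes shared by all given objects (the derivation of an object set)."""
    sets = [context[o] for o in objects]
    # The empty object set derives to the set of all attributes.
    return set.intersection(*sets) if sets else set().union(*context.values())

def extent(attributes):
    """Objects possessing all given attributes (the derivation of an attribute set)."""
    return {o for o, attrs in context.items() if attributes <= attrs}

# A formal concept is a pair (A, B) with extent A and intent B such that
# A derives to B and B derives to A.
A = extent({"swims"})
B = intent(A)
print(sorted(A), sorted(B))  # ['duck', 'swan'] ['flies', 'swims']
```

Applying the two operators in turn (a closure operator) is the basic step that concept-lattice-building algorithms repeat over a whole context.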

The main goal of the research is to develop and implement new methods and algorithms for analysing structured and unstructured data, to develop software tools for distributed data processing, and to apply these tools to solving practical tasks. Thus, the object of the research consists of methods, algorithms, and software tools for data mining of structured and unstructured data. The subject of the research is the performance and efficiency of these methods. We focused first on methods based on FCA, multimodal clustering, and computational linguistics.

The main results include the following:

1) Collecting a large number of information sources and test datasets within the framework of theoretical studies in FCA, clustering and biclustering, and text processing (more than 80 new sources and more than 60 GB of new collections of synthetic and real data, in collaboration with our partners: Yandex, the Dmitry Rogachev Clinical Center, Digital Society Lab, LORIA and LIRIS (France), and others; datasets totalling more than 230 TB have been prepared for further analysis);

2) Increasing the efficiency of implementing basic FCA algorithms, specifically building concept lattices and calculating stability indices;

3) Creating a prototype of an original software module for managing pattern structures, and testing this component on objects represented by sequences and graphs;

4) Creating several versions of methods and algorithms for clustering and classification on triadic datasets, and testing their implementations in Web recommender services, contextual advertising, and crowdsourcing;

5) Developing a DOD-DMS (Dynamical Ontology-Driven Data Mining System) that provides preprocessing of data from external sources, local storage for data with a complex structure, and efficient text indexing in natural languages;

6) Developing the Formal Concept Analysis Research Toolbox (FCART, based on DOD-DMS): releasing version 0.8 and transitioning to the development of version 0.9; finalizing the software tools for analysing various kinds of formal concept indices; and developing the software tools for processing pattern structures, a report editor, and an embedded scripting language.
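To illustrate the stability indices mentioned in item 2 above, here is a naive brute-force sketch over an invented toy context (not the laboratory's optimized algorithm, which must avoid this exponential enumeration): the intensional stability of a formal concept (A, B) is the fraction of subsets of the extent A whose derived intent equals B.

```python
from itertools import chain, combinations

# Toy binary context, invented for illustration.
context = {
    "g1": {"a", "b"},
    "g2": {"a", "b"},
    "g3": {"a", "c"},
}
ALL_ATTRS = set().union(*context.values())

def intent(objects):
    """Attributes common to all given objects; the empty set derives to all attributes."""
    sets = [context[o] for o in objects]
    return set.intersection(*sets) if sets else set(ALL_ATTRS)

def stability(extent_, intent_):
    """Naive intensional stability: share of subsets of the extent whose intent
    equals the concept's intent. Exponential in |extent|, so only for tiny examples."""
    objs = sorted(extent_)
    subsets = chain.from_iterable(combinations(objs, r) for r in range(len(objs) + 1))
    hits = sum(1 for c in subsets if intent(c) == intent_)
    return hits / 2 ** len(objs)

# Concept ({g1, g2}, {a, b}): of the four subsets of the extent,
# {g1}, {g2}, and {g1, g2} derive to {a, b}, but the empty set does not.
print(stability({"g1", "g2"}, {"a", "b"}))  # 0.75
```

A concept whose stability is close to 1 survives the removal of a few objects from its extent, which is why stability serves as a noise-robustness index when selecting concepts from large datasets.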

The field of application for the obtained results spans disciplines where the analysis of large datasets is in high demand and inevitably requires the participation of expert analysts: medical informatics, bioinformatics, sociology, logistics, criminology, and others.

The effectiveness, efficiency, and correctness of the proposed models and methods are confirmed by comparative studies, testing, and practical usage. The achieved level of integration varies across methods and software tools. New theoretical results in FCA are implemented fairly comprehensively in FCART. Preliminary versions of FCART are actively used in teaching at the Department of Applied Mathematics and Computer Science, in the laboratory's scientific studies, at the Dmitry Rogachev Clinical Center, and at universities in Dresden, Nancy, Clermont-Ferrand, and Nicosia.

The conducted research had a synergistic effect: it allowed several models and methods of data analysis to be integrated within the framework of a unified intelligent information system. Further developing this platform to increase the efficiency of our scientific research is a basic task for our future work and for constructing software tools.

Publications:

Ignatov D. I., Kuznetsov S., Zhukov L. E., Poelmans J. Can triconcepts become triclusters? // International Journal of General Systems. 2013. Vol. 42. No. 6. P. 572-593. doi
Kuznetsov S., Babin M. A. Computing premises of a minimal cover of functional dependencies is intractable // Discrete Applied Mathematics. 2013. Vol. 161. No. 6. P. 742-749. 
Kuznetsov S., Poelmans J. Knowledge representation and processing with formal concept analysis // Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2013. Vol. 3. No. 3. P. 200-215. doi
Buzmakov A., Neznanov A. Practical Computing with Pattern Structures in FCART Environment, in: Proceedings of the International Workshop "What can FCA do for Artificial Intelligence?" (FCA4AI at IJCAI 2013). Beijing : CEUR Workshop Proceedings, 2013. P. 49-56.
Ilvovsky D., Klimushkin M. A. FCA-based Search for Duplicate Objects in Ontologies, in: Proceedings of the Workshop Formal Concept Analysis Meets Information Retrieval. Moscow : CEUR Workshop Proceedings, 2013. P. 44-54.
Galitsky B., Ilvovsky D., Kuznetsov S., Strok F. V. Parse thicket representations of text paragraphs, in: Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference "Dialogue" (Bekasovo, May 29 - June 2, 2013). Moscow : RGGU, 2013. P. 239-255.
Galitsky B., Kuznetsov S. A Web Mining Tool for Assistance with Creative Writing, in: Proc. 35th European Conference on Information Retrieval (ECIR 2013): Advances in Information Retrieval. Springer, 2013. P. 828-831.
Kuznetsov S. Fitting Pattern Structures for Knowledge Discovery in Big Data, in: Proc. 11th International Conference on Formal Concept Analysis (ICFCA 2013). Springer, 2013. P. 254-266.
Kuznetsov S., Strok F. V., Ilvovsky D., Galitsky B. Improving Text Retrieval Efficiency with Pattern Structures on Parse Thickets, in: Proceedings of the Workshop Formal Concept Analysis Meets Information Retrieval. Moscow : CEUR Workshop Proceedings, 2013. P. 6-21.
Kuznetsov S., Neznanov A. Information Retrieval and Knowledge Discovery with FCART, in: Proceedings of the Workshop Formal Concept Analysis Meets Information Retrieval. Moscow : CEUR Workshop Proceedings, 2013. P. 74-82.
Galitsky B., Kuznetsov S., Usikov D. Parse Thicket Representation for Multi-sentence Search, in: Conceptual Structures for STEM Research and Education, 20th International Conference on Conceptual Structures. Berlin : Springer, 2013. P. 153-172.
Galitsky B., Ilvovsky D., Kuznetsov S., Strok F. V. Matching sets of parse trees for answering multi-sentence questions, in: Proceedings of the Recent Advances in Natural Language Processing. Hissar : INCOMA Ltd, 2013. P. 285-293.
Obiedkov S. Modeling ceteris paribus preferences in formal concept analysis, in: Formal Concept Analysis. Berlin : Springer, 2013. P. 188-202.
Bazhanov K., Obiedkov S. Optimizations in computing the Duquenne-Guigues basis of implications // Annals of Mathematics and Artificial Intelligence. 2014. Vol. 70. No. 1-2. P. 5-24. doi