
Mathematical Models, Algorithms, and Software Tools for the Intelligent Analysis of Big Textual and Structural Data

2013
Department: Scientific-Educational Laboratory for Intelligent Systems and Structural Analysis
The project has been carried out as part of the HSE Program of Fundamental Studies.

In 2013, the Laboratory of Intelligent Systems and Structural Analysis continued its planned research, building on the preceding year’s results. The research is motivated by the need to develop methods for analysing complex (textual and structured) distributed information in areas ranging from physics, chemistry, and the life sciences to economics, sociology, and political science.

A related task is the training of HSE master's and bachelor's students specializing in the aforementioned fields. Data mining research has recently become highly important for developing a new generation of intelligent systems. Web data mining in particular is growing in importance; it includes social network analysis, recommender systems, distributed databases of textual documents, etc.

Data mining research must ultimately deal with large datasets. Large distributed databases in particular require new approaches and highly efficient algorithms. Formal Concept Analysis (FCA) models simplify the discovery and mining of hidden knowledge in such large datasets.

The main goal of the research is to develop and implement new methods and algorithms for analysing structured and unstructured data, to develop software tools for distributed data processing, and to apply these tools to practical tasks. Thus, the object of the research comprises methods, algorithms, and software tools for mining structured and unstructured data. The subject of the research is the effectiveness and efficiency of these methods. We focused primarily on methods based on FCA, multimodal clustering, and computational linguistics.

The main results include the following:

1) Collecting a large number of information sources and test datasets within the framework of theoretical studies in FCA, clustering and biclustering, and text processing (more than 80 new sources and more than 60 GB of new collections of synthetic and real data; in collaboration with our partners, including Yandex, Dmitry Rogachev Clinical Center, Digital Society Lab, and LORIA and LIRIS (France), datasets of more than 230 TB have been prepared for further analysis);

2) Increasing the efficiency of implementations of basic FCA algorithms, specifically those for building concept lattices and calculating stability indices;

3) Creating a prototype of an original programming module for managing pattern structures, and testing this component on objects represented by sequences and graphs;

4) Creating several versions of methods and algorithms for clustering and classification on triadic datasets, and testing their implementations in Web recommender services, contextual advertising, and crowdsourcing;

5) Developing DOD-DMS (Dynamical Ontology-Driven Data Mining System) for preprocessing data from external sources, local storage of data with a complex structure, and efficient indexing of natural-language texts;

6) Bringing the Formal Concept Analysis Research Toolbox (FCART, based on DOD-DMS) to release version 0.8 and starting development of version 0.9; finalizing the software tools for analysing various kinds of formal concept indices; developing software tools for processing pattern structures, a report editor, and an embedded scripting language.
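
The results above refer to concept lattices and stability indices. As a rough illustration only (not the laboratory's or FCART's implementation, whose algorithms are not described here), the following Python sketch enumerates the formal concepts of a tiny made-up context by brute force and computes the intensional stability of a concept, taken here as the fraction of subsets of the extent whose shared attributes equal the concept's intent. The context data and all function names are hypothetical.

```python
from itertools import combinations

# Toy formal context (hypothetical data): object -> set of its attributes.
CONTEXT = {
    "g1": {"a", "b"},
    "g2": {"a", "c"},
    "g3": {"a", "b", "c"},
}
ATTRIBUTES = frozenset().union(*CONTEXT.values())


def intent(objects):
    """Attributes shared by all given objects (all attributes for the empty set)."""
    result = set(ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return frozenset(result)


def extent(attributes):
    """Objects that have every given attribute."""
    return frozenset(g for g, attrs in CONTEXT.items() if attributes <= attrs)


def concepts():
    """Enumerate all formal concepts by closing every subset of objects.

    Brute force, exponential in the number of objects: fine for a toy
    context, not a substitute for an efficient lattice-building algorithm.
    """
    found = set()
    objs = sorted(CONTEXT)
    for r in range(len(objs) + 1):
        for subset in combinations(objs, r):
            b = intent(subset)
            found.add((extent(b), b))
    return found


def stability(ext, itt):
    """Share of subsets of the extent whose intent equals the concept's intent."""
    members = sorted(ext)
    hits = sum(
        1
        for r in range(len(members) + 1)
        for subset in combinations(members, r)
        if intent(subset) == itt
    )
    return hits / 2 ** len(members)


if __name__ == "__main__":
    for ext, itt in sorted(concepts(), key=lambda c: len(c[0])):
        print(sorted(ext), sorted(itt), stability(ext, itt))
```

On this toy context the sketch finds four concepts. The top concept (all three objects, intent {a}) has stability 0.25, because only two of the eight subsets of its extent share exactly the attribute set {a}.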

The results apply to a range of disciplines in which the analysis of large datasets is in high demand and inevitably requires the participation of expert analysts (medical informatics, bioinformatics, sociology, logistics, criminology, etc.).

The effectiveness, efficiency, and correctness of the proposed models and methods are confirmed by comparative studies, testing, and practical use. The degree of integration achieved varies across methods and software tools. New theoretical results in FCA are rather comprehensively implemented in FCART. Preliminary versions of FCART are actively used in teaching at the Department of Applied Mathematics and Computer Science, in the laboratory's own research, at the Dmitry Rogachev Clinical Center, and at universities in Dresden, Nancy, Clermont-Ferrand, and Nicosia.

The research had a synergistic effect: it made it possible to pose the task of integrating several data analysis models and methods within a unified intelligent information system. Further development of a platform that makes our scientific research more efficient is a core task for our future work and for building software tools.

Publications:


Kuznetsov S. Fitting Pattern Structures for Knowledge Discovery in Big Data, in: Proc. 11th International Conference on Formal Concept Analysis (ICFCA 2013) / Ed. by P. Cellier, F. Distel, B. Ganter. Vol. 7880. Springer, 2013. P. 254-266.
Galitsky B., Kuznetsov S., Usikov D. Parse Thicket Representation for Multi-sentence Search, in: Conceptual Structures for STEM Research and Education, 20th International Conference on Conceptual Structures / Ed. by H. Pfeiffer, D. I. Ignatov, J. Poelmans, G. Nagarjuna. Vol. 7735. Berlin, Heidelberg : Springer, 2013. P. 153-172.
Kuznetsov S., Babin M. A. Computing premises of a minimal cover of functional dependencies is intractable // Discrete Applied Mathematics. 2013. Vol. 161. No. 6. P. 742-749.
Ilvovsky D., Klimushkin M. A. FCA-based Search for Duplicate Objects in Ontologies, in: Proceedings of the Workshop Formal Concept Analysis Meets Information Retrieval / Ed. by S. Kuznetsov, C. Carpineto, A. Napoli. Vol. 977. Moscow : CEUR Workshop Proceedings, 2013. P. 44-54.
Kuznetsov S., Poelmans J. Knowledge representation and processing with formal concept analysis // Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2013. Vol. 3. No. 3. P. 200-215.
Kuznetsov S., Strok F. V., Ilvovsky D., Galitsky B. Improving Text Retrieval Efficiency with Pattern Structures on Parse Thickets, in: Proceedings of the Workshop Formal Concept Analysis Meets Information Retrieval / Ed. by S. Kuznetsov, C. Carpineto, A. Napoli. Vol. 977. Moscow : CEUR Workshop Proceedings, 2013. P. 6-21.
Galitsky B., Kuznetsov S. A Web Mining Tool for Assistance with Creative Writing, in: Proc. 35th European Conference on Information Retrieval (ECIR 2013): Advances in Information Retrieval / Ed. by P. Serdyukov, P. Braslavski, S. Kuznetsov, J. Kamps, S. Rüger, E. Agichtein, I. Segalovich, E. Yilmaz. Vol. 7814. Springer, 2013. P. 828-831.
Kuznetsov S., Neznanov A. Information Retrieval and Knowledge Discovery with FCART, in: Proceedings of the Workshop Formal Concept Analysis Meets Information Retrieval / Ed. by S. Kuznetsov, C. Carpineto, A. Napoli. Vol. 977. Moscow : CEUR Workshop Proceedings, 2013. P. 74-82.
Ignatov D. I., Kuznetsov S., Zhukov L. E., Poelmans J. Can triconcepts become triclusters? // International Journal of General Systems. 2013. Vol. 42. No. 6. P. 572-593.
Bazhanov K., Obiedkov S. Optimizations in computing the Duquenne–Guigues basis of implications // Annals of Mathematics and Artificial Intelligence. 2014. Vol. 70. No. 1-2. P. 5-24.
Obiedkov S. Modeling ceteris paribus preferences in formal concept analysis, in: Formal Concept Analysis / Ed. by P. Cellier, F. Distel, B. Ganter. Vol. 7880. Berlin, Heidelberg : Springer, 2013. P. 188-202.
Buzmakov A., Neznanov A. Practical Computing with Pattern Structures in FCART Environment, in: Proceedings of the International Workshop "What can FCA do for Artificial Intelligence?" (FCA4AI at IJCAI 2013) / Ed. by S. Kuznetsov, A. Napoli, S. Rudolph. Issue 1058. Beijing : CEUR Workshop Proceedings, 2013. Ch. 7. P. 49-56.
Galitsky B., Ilvovsky D., Kuznetsov S., Strok F. V. Matching sets of parse trees for answering multi-sentence questions, in: Proceedings of the Recent Advances in Natural Language Processing. Hissar : INCOMA Ltd, 2013. P. 285-293.
Galitsky B., Ilvovsky D., Kuznetsov S., Strok F. V. Parse thicket representations of text paragraphs, in: Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference "Dialogue" (Bekasovo, May 29 - June 2, 2013). In 2 vols. Vol. 1: Main conference program. Issue 12 (19). Moscow : RGGU, 2013. P. 239-255.