
Development of Tools for Automatic Search of Structured Information in a Heterogeneous Environment

Student: Krysanov Pavel

Supervisor: Lyudmila N. Lyadova

Faculty: Faculty of Economics, Management, and Business Informatics

Educational Programme: Bachelor

Final Grade: 7

Year of Graduation: 2014

<p>This thesis addresses the problem of automated search for structured data on the Internet and on a local machine. The aim is to develop an application that searches for tables on the local machine and on the Internet using a user-defined pattern (a reference table): each table found is compared with the reference table to verify that its data are relevant to the search criteria. A reference table is a table created by the user against which the tables found are compared. The main search parameters are the degree of correspondence between tables required for a table to be considered relevant, the search depth in Yandex, and the search depth within a site.</p>
<p>The search is divided into several stages. When data are retrieved from the Internet, the first stage is a request to the search engine: the application passes the user's search query to it and receives a set of links from the search results. When the search is performed on the local computer, the first stage obtains the list of files stored in the folder selected for the search. The second stage processes the links obtained in the previous stage and downloads the data stored in the Word, Excel, and HTML documents found through them. The third stage examines the tabular data extracted from the Word, Excel, and HTML documents and compares them with the reference table. If a table found is relevant to the reference table, the application records the corresponding file in the search results.</p>
<p>The paper presents a survey of existing solutions for searching structured data and of methods for solving the individual tasks required to develop the application. The survey identified a DLL for stemming (finding the stem of a given word), which was used in the third stage of the algorithm to compare two tables. The paper also describes the algorithms for extracting tables from Word, Excel, and HTML documents (the second stage), the algorithm for comparing two tables for relevance (the third stage), and the algorithms for working with Internet search engines and with the local machine (the first stage).</p>
<p>The paper describes the developed application, which uses the reference table to automatically find the tabular data the user needs in Word, Excel, and HTML documents. When searching the Internet, Yandex search is used to obtain links to pages that may contain the requested data. The application saves the user time by automating the time-consuming, routine tasks of browsing and downloading data.</p>
<p>The application is developed in Visual Studio as a Windows Forms application.</p>
<p>The paper presents the results of running the program on several search queries and describes all the developed algorithms, including the algorithm for retrieving structured data from the Internet, the algorithms for extracting tables from Word, Excel, and HTML documents, and the algorithm for comparing tables. The source code of the program is given at the end of the paper.</p>
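
The table extraction and comparison steps summarised in the abstract can be illustrated with a minimal Python sketch. This is not the thesis code (the application itself is a Windows Forms program), and the abstract does not specify how tables are compared, so several assumptions are made: only HTML input is handled, only header rows are compared, nested tables are ignored, and `crude_stem` is a hypothetical stand-in for the stemming DLL mentioned above. All names (`TableExtractor`, `crude_stem`, `compliance`, `relevant_tables`, `threshold`) are illustrative.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> of an HTML document as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # rows of the table currently being read
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = self._cell = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = self._row = self._cell = None


def crude_stem(word):
    """Hypothetical stemmer: lowercase the word and strip one common English suffix."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def header_stems(table):
    """Stems of all words in the first (header) row of a table."""
    if not table:
        return set()
    return {crude_stem(w) for cell in table[0] for w in cell.split()}


def compliance(reference, candidate):
    """Share of reference header stems also present in the candidate header (0.0 to 1.0)."""
    ref = header_stems(reference)
    if not ref:
        return 0.0
    return len(ref & header_stems(candidate)) / len(ref)


def relevant_tables(html, reference, threshold=0.5):
    """Return the tables in `html` whose compliance with `reference` reaches `threshold`."""
    parser = TableExtractor()
    parser.feed(html)
    return [t for t in parser.tables if compliance(reference, t) >= threshold]


# Example: only the first table shares header stems with the reference table.
reference = [["Product", "Price"]]
page = """
<table><tr><th>Products</th><th>Prices</th></tr><tr><td>Tea</td><td>3</td></tr></table>
<table><tr><th>Name</th><th>Age</th></tr><tr><td>Ann</td><td>30</td></tr></table>
"""
print(relevant_tables(page, reference))
```

Here `threshold` plays the role of the required degree of correspondence between tables. A fuller comparison would presumably also look at cell contents and column types, and a real stemmer would replace `crude_stem`.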

Full text (added June 11, 2014) (2.35 Kb)

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
