Extraction of Structured Information from HTML Pages using Convolutional Neural Networks

Student: Meinster Dmitrii

Educational Programme: Data Science (Master)

Year of Graduation: 2018

The problem of web data extraction is becoming increasingly important as the Internet grows rapidly and advances technologically. Web data extraction may be applied either directly (e.g., for news aggregation) or as part of intelligent data analysis. Most automated extraction methods were developed in the era of the static web and do not work well with the modern, dynamic web, which is populated with CSS tables and client-side scripts. Data extraction based on the visual representation of web pages is a promising approach, but existing methods perform extraction using rule-based techniques. The present work investigates methods of extracting information using convolutional neural networks; we hypothesize that this approach makes it possible to create new methods or to qualitatively change existing ones. The web data extraction task is formulated as a classification problem in which the elements of the HTML document's DOM tree are classified. Based on the HTML code of the page and its visual representation, we build a regular or irregular graphical grid. A new meta-image is then created from that grid: each pixel of the meta-image contains information about the web page elements that fall into the corresponding grid cell. Next, a convolutional neural network solves a semantic segmentation problem to determine the class of each pixel of this image. Finally, the per-pixel predictions are aggregated back into predictions for the DOM tree elements. The proposed model was tested on a data set of approximately 30,000 news pages taken from more than 150 news sources. It handles both binary classification (simple extraction of information) and multiclass classification (extraction of the date, title, and other parameters in addition to the main text). A comparison with several existing extraction methods shows that the proposed method is comparable in effectiveness to popular commercial tools on the cleaned data.
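The grid and aggregation steps described above can be illustrated with a minimal Python sketch. It is not the thesis implementation, and everything in it is an assumption made for illustration: the function names `rasterize` and `aggregate` are hypothetical, only a regular grid is covered, each cell stores just the index of the element that covers it rather than richer element features, the convolutional network is replaced by a hand-written array of per-cell classes, and majority voting is one plausible way to collect pixel predictions back into element predictions.

```python
import math
import numpy as np

def rasterize(boxes, page_w, page_h, grid_w, grid_h):
    """Map element bounding boxes (x0, y0, x1, y1) onto a regular grid.

    Each cell stores the index of the last element covering it
    (later boxes overwrite earlier ones), or -1 if no element covers it.
    """
    owner = np.full((grid_h, grid_w), -1, dtype=int)
    for idx, (x0, y0, x1, y1) in enumerate(boxes):
        c0 = int(x0 * grid_w / page_w)
        c1 = math.ceil(x1 * grid_w / page_w)
        r0 = int(y0 * grid_h / page_h)
        r1 = math.ceil(y1 * grid_h / page_h)
        owner[r0:r1, c0:c1] = idx
    return owner

def aggregate(owner, cell_classes, n_elements, n_classes):
    """Assign each element the most frequent class among its grid cells.

    Elements that own no cells fall back to class 0.
    """
    votes = np.zeros((n_elements, n_classes), dtype=int)
    covered = owner >= 0
    # Unbuffered add, so repeated (element, class) pairs are all counted.
    np.add.at(votes, (owner[covered], cell_classes[covered]), 1)
    return votes.argmax(axis=1)

# Hypothetical page: a header block on top of a main-text block.
boxes = [(0, 0, 100, 20), (0, 20, 100, 100)]
owner = rasterize(boxes, page_w=100, page_h=100, grid_w=10, grid_h=10)

# Stand-in for the segmentation output: class 1 ("main text") below
# the header, with one deliberately wrong cell to mimic noise.
cell_classes = np.zeros((10, 10), dtype=int)
cell_classes[2:, :] = 1
cell_classes[3, 4] = 0

labels = aggregate(owner, cell_classes, n_elements=2, n_classes=2)
print(labels.tolist())  # [0, 1]
```

In this sketch the single misclassified cell does not change the label of the second element, because the remaining cells outvote it; this is the motivation for aggregating over all cells of an element instead of reading a single pixel.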

