Unsupervised Web Page Information Extraction Using CNN

Student: Iuliia Koroleva

Supervisor: Valentina Kuskova

Faculty: International laboratory for Applied Network Research

Educational Programme: Applied Statistics with Network Analysis (Master)

Year of Graduation: 2021

This master's thesis addresses the problem of extracting structured content from HTML documents. The topic is relevant because, as of today, no open-source project on this problem gives acceptable results, given the ongoing rapid evolution of web pages. Classical methods lose their effectiveness because of dynamic loading, log-in requirements, CAPTCHAs, IP blocking, and honeypots. Most existing methods were developed before neural networks came into wide use. Since then, both hardware and machine learning algorithms have improved drastically, enabling better solutions to problems in related areas. Specifically, machine learning algorithms based on multilayer neural networks have recently led to significant improvements in areas such as computer vision and natural language processing. Given this, the thesis assumes that applying these methods to structured information extraction ought to yield good results. The aim of this work is to explore whether convolutional neural network architectures can be applied to the task of structured information extraction. A comprehensive review of existing methods was conducted, on the basis of which a new content extraction method was proposed and implemented on the collected dataset.

Keywords: convolutional neural networks, HTML, content extraction.

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
