
Classification and Analysis of Events from News Feeds in Order to Highlight the Main Characteristics of the Russian Regions and their Performance

Student: Susmanova Sofya

Supervisor: Rostislav Yavorskiy

Faculty: Faculty of Computer Science

Educational Programme: Data Science (Master)

Year of Graduation: 2016

The Russian Federation occupies an area of about 17 million square kilometers and consists of 89 regions, which differ across a range of parameters. At the same time, the regions do not stand still, and studying the changing nature of a region and its most pressing problems is a rather difficult task, especially under tight time constraints. None of the modern data-handling tools addresses this problem, and none of them offers a way to extract structure from the regional information field. This work is based on the assumption that the necessary information about the regions can be obtained from local news feeds.

The purpose of this work is to develop a composition of algorithms that determines keywords in regional news corpora, and to apply it to the following cases:
  • the entire news feed corpus, to understand the common regional discourse;
  • a range of news categories, to understand the prevailing concepts and key association rules in each category.

The tasks of this study are:
  • development of an RSS parser and creation of the news feed corpora;
  • data conversion;
  • development of a keyword extraction algorithm based on the Random Forest approach;
  • comparison of its results with those of a frequency analysis approach (TF-IDF);
  • presentation of the results for the entire news feed corpus and for a sample by news category.

The result of the study is an algorithm that highlights keywords of the regional context with good quality. Applying the analyzer to news categories makes it possible to identify the prevalent association rules for basic social concepts in a region. The novelty of this work is that it does not just look for keywords but combines keyword extraction with classification: basic regional concepts should not only be representative within the region but should also distinguish that region from other regions.
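The abstract names TF-IDF as the frequency-analysis baseline but gives no implementation details. The following is a minimal sketch of how such a baseline might look, not the thesis's own code. It assumes each region's news is pooled into one "document", so a word scores high when it is frequent in that region and absent from others. The function names, the input shape, and the toy corpus are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of Latin or Cyrillic letters (a deliberately naive tokenizer)."""
    return re.findall(r"[a-zа-яё]+", text.lower())


def region_keywords(corpora, top_n=3):
    """Rank words per region by TF-IDF, treating each region's news as one document.

    corpora: dict mapping a region name to a list of news texts (assumed input shape).
    Returns a dict mapping each region to its top_n (word, score) pairs.
    """
    # Term counts per region.
    counts = {
        region: Counter(w for text in texts for w in tokenize(text))
        for region, texts in corpora.items()
    }
    n_regions = len(counts)
    # Document frequency: the number of regions in which a word occurs at all.
    doc_freq = Counter(w for c in counts.values() for w in c)

    result = {}
    for region, c in counts.items():
        total = sum(c.values())
        scores = {
            w: (n / total) * math.log(n_regions / doc_freq[w])
            for w, n in c.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[region] = ranked[:top_n]
    return result


# Made-up toy corpus, for illustration only.
toy_corpora = {
    "Region A": ["The port expands cargo traffic", "New port terminal opens"],
    "Region B": ["The harvest of wheat grows", "Wheat exports rise"],
}
toy_keywords = region_keywords(toy_corpora, top_n=1)
```

A word that appears in every region (here "the") gets an inverse document frequency of zero and so never ranks as a keyword. This is the same intuition the abstract describes: a keyword should be representative of its region and also set it apart from the others. The Random Forest approach mentioned in the abstract is not shown here.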
