Special Topics Detection in Short Natural Languages Texts

Student: Samelyuk Kirill

Supervisor: Gennady Osipov

Faculty: Faculty of Computer Science

Educational Programme: Data Science (Master)

Year of Graduation: 2016

The rapid development of information technologies has led to a significant increase in the number of text documents stored in electronic form. At current rates of data growth, neither a single expert nor a group of specialists can cope with text processing problems. This determines the relevance of developing techniques and approaches for the automatic collection, storage, and processing of data. This paper discusses methods for automatic language recognition in short text messages and for identifying specific topics associated with various violations of the law. The aim of the work is to analyze search engine queries in order to determine their language and detect various violations. The work consists of five chapters: an introduction; a description of methods for automatic language detection in short texts; a description of an approach to identifying topics on the basis of specialized dictionaries; an analysis of search queries using the described methods; and the results of processing the search engine queries. In preparing this paper, over one billion search engine queries, collected from 1 to 11 April 2016, were analyzed. Queries were grouped by region, and the 100 most active regions were selected. The main result is that an overwhelming number of user queries are in Russian (ranging from 47% to 53% across regions). In addition, the distribution of query languages does not correlate with the ethnic composition of the region. The overall distribution of violations persists across all regions: the most common violations relate to drugs, followed by violations relating to extremism and nationalism.
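
The dictionary-based topic identification mentioned in the summary can be illustrated with a minimal sketch. This is not the thesis's implementation: the dictionary contents, the names `TOPIC_TERMS`, `detect_topics` and `topic_counts`, and the simple whole-word matching rule are all assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical topic dictionaries. The actual specialized dictionaries
# used in the thesis are not given in the summary.
TOPIC_TERMS = {
    "drugs": {"heroin", "cocaine", "amphetamine"},
    "extremism": {"extremist", "terrorist"},
}

# \w matches Unicode letters for str patterns, so Cyrillic words tokenize too.
TOKEN_RE = re.compile(r"\w+")


def tokenize(query):
    """Lowercase a query and split it into word tokens."""
    return TOKEN_RE.findall(query.lower())


def detect_topics(query, topic_terms=TOPIC_TERMS):
    """Return the set of topics whose dictionary shares a token with the query."""
    tokens = set(tokenize(query))
    return {topic for topic, terms in topic_terms.items() if tokens & terms}


def topic_counts(queries, topic_terms=TOPIC_TERMS):
    """Count, per topic, how many queries matched that topic's dictionary."""
    counts = Counter()
    for query in queries:
        counts.update(detect_topics(query, topic_terms))
    return counts
```

Under these assumptions, `topic_counts` applied to the queries of one region would give a per-topic distribution that could then be compared across regions. Exact whole-word matching is the simplest possible rule; inflected languages such as Russian would likely need stemming or lemmatization before lookup.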

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
