Student: Betty jeanne Fabre
Title: Using Active Learning in Text Classification Tasks
Educational Programme: Data Science (Master’s programme)
Final Grade: 9
Year of Graduation: 2018
As more data is collected and stored, huge datasets accumulate that must be annotated or labeled before they can be used. Text data such as documents takes time and expertise to label, which can be costly. Methods for choosing wisely which data to label are therefore becoming a major topic of interest.

This thesis studies the active learning paradigm, which aims to include and optimize the human labeling task within the learning process of a classifier. The goal is to study different active learning strategies applied to text classification tasks.

The work followed three main topics: the active learning strategies, and the influence on the active learning process of both the text or document representation and the classifier. Several experiments, each training a classifier on the 20newsgroup dataset, were carried out to study Bag-of-words-based, Word2Vec-based and Doc2Vec-based text representations combined with Random Forest, Decision Tree and KNN classifiers in the framework of active learning. The active learning strategies used single learners with an uncertainty-based query strategy and committees of learners. A semi-active learning strategy, which runs an automatic learning process in parallel with the active learning queries, was also tested.
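The two query strategies named above can be illustrated with a minimal sketch. The abstract does not say which uncertainty or disagreement measures were used, so least-confidence scoring and vote entropy are assumptions here, and all probabilities, votes and function names are hypothetical.

```python
import math
from collections import Counter


def least_confidence(probabilities):
    """Uncertainty of one prediction: 1 minus the highest class probability."""
    return 1.0 - max(probabilities)


def vote_entropy(votes):
    """Committee disagreement: entropy of the distribution of the members' votes."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def most_informative(scores):
    """Index of the pool item with the highest score (ties go to the first)."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical class probabilities from a single learner for three unlabeled documents.
pool_probabilities = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.60, 0.30, 0.10],
]
uncertainty_scores = [least_confidence(p) for p in pool_probabilities]
single_learner_query = most_informative(uncertainty_scores)

# Hypothetical labels voted by a committee of three learners for the same documents.
pool_votes = [
    ["sport", "sport", "sport"],
    ["sport", "politics", "science"],
    ["sport", "sport", "politics"],
]
disagreement_scores = [vote_entropy(v) for v in pool_votes]
committee_query = most_informative(disagreement_scores)
```

In both cases the document sent to the human annotator is the one with the highest score: the prediction the single learner is least sure of, or the document the committee disagrees on most.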

All the experiments have been implemented in Python using the common data science libraries scikit-learn and pandas, the gensim library and the modAL framework.
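As a rough illustration of what such an experiment might look like, the sketch below runs a pool-based active learning loop with scikit-learn only. It is not the thesis code: the tiny corpus, the labels, the number of queries and the least-confidence scoring are all assumptions made for the example, and the modAL framework is deliberately left out to keep the sketch small.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the newsgroup documents.
texts = [
    "the team won the hockey game",
    "the goalie saved every shot",
    "players skate fast on the ice",
    "the season ends with a final match",
    "the rocket reached orbit today",
    "astronauts repaired the space station",
    "the telescope observed a distant galaxy",
    "the probe landed on the moon",
]
# The true labels play the role of the human annotator (0 = sport, 1 = space).
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Bag-of-words representation of every document.
X = CountVectorizer().fit_transform(texts)

labeled = [0, 4]  # start with one labeled document per class
pool = [i for i in range(len(texts)) if i not in labeled]

for _ in range(3):  # three annotation queries
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[labeled], labels[labeled])

    # Least-confidence uncertainty for each document still in the pool.
    probabilities = model.predict_proba(X[pool])
    uncertainty = 1.0 - probabilities.max(axis=1)

    # Ask the "annotator" for the label of the most uncertain document.
    query = pool[int(np.argmax(uncertainty))]
    labeled.append(query)
    pool.remove(query)
```

Each pass retrains the classifier on the documents labeled so far, which is the step that lets the choice of the next query depend on what the model has already learned.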

KEY-WORDS: text data, active learning, classification tasks, representation of a text, classifier, newsgroup dataset, Bag-of-words, Word2Vec, Doc2Vec, Random Forest, Decision Trees, KNN, uncertainty, query, committees, semi-active learning, Python, scikit-learn, pandas, gensim, modAL

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
