Text mining: Advanced Level

Master 2022/2023

Category 'Best Course for Career Development'

Category 'Best Course for Broadening Horizons and Diversity of Knowledge and Skills'

Category 'Best Course for New Knowledge and Skills'

Type: Compulsory course (Data Analytics for Politics and Society)

Area of studies: Political Science

Delivered by: Department of Sociology

Where: Saint-Petersburg School of Social Sciences

When: 2nd year, 1st module

Mode of studies: offline

Open to: students of one campus

Instructors: Sergei Koltsov

Master’s programme: Data Analytics for Politics and Society

Language: English

ECTS credits: 3

Contact hours: 24

Abstract

This course covers a wide range of machine learning algorithms for textual data analysis. The first part deals with preprocessing procedures for text data, including lemmatizers for different languages and procedures for vectorizing text. It also covers classifiers for textual analysis and measures of classifier quality. The second part focuses on flat and hierarchical topic models and their quality measures: coherence, perplexity, log-likelihood, stability, and Rényi entropy. In addition, the course explores the use of word embeddings in textual analysis, particularly in topic modeling. The third part covers neural networks for textual data analysis, built with the TensorFlow framework and its Keras API. Python scripts are provided for all the models discussed. At the end of the course, students present their data analysis work in the form of a presentation and scripts.
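
As an illustration of the preprocessing and vectorization steps named in the abstract, the following is a minimal sketch using only the Python standard library. It is not taken from the course scripts: the function names (`tokenize`, `build_vocabulary`, `vectorize`) and the two sample sentences are assumptions made for this example, and lemmatization is deliberately left out because it would need a language-specific tool.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens.

    The character class covers Latin and Cyrillic letters; this is an
    assumption for illustration, not the course's own tokenizer.
    """
    return re.findall(r"[a-zа-яё]+", text.lower())

def build_vocabulary(docs):
    """Map every distinct token in the corpus to a column index."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}

def vectorize(doc, vocab):
    """Return a bag-of-words count vector for one document."""
    counts = Counter(tokenize(doc))
    # Iterating over the dict follows insertion order, i.e. column order.
    return [counts.get(tok, 0) for tok in vocab]

# Hypothetical two-document corpus.
docs = ["The film was good", "The film was not good at all"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
```

Each document becomes a fixed-length list of token counts, which is the kind of input classical classifiers expect.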

Learning Objectives

Learn algorithms and their main advantages and limitations in terms of text data analysis
Obtain skills to work with machine learning software / code
Be able to work with text data

Expected Learning Outcomes

Have skills to analyze textual data
Analyze data with machine learning tools
Do textual preprocessing (lemmatization and tokenization)
Present the resulting project in terms of machine learning
Visualize results of the analysis

Course Contents

Objectives of text analysis: preprocessing, lemmatization, vectorization.
Overview of classical classifiers such as KNN, Random Forest, and SVM.
Bayesian classification for sentiment analysis or topic definition.
Topic modeling (flat models), quality metrics (coherence, perplexity, log-likelihood, stability, Rényi entropy), review of some libraries.
Topic modeling (hierarchical models, discussion of problems).
Embeddings (gensim): what word embeddings are and how to work with them.
Topic models with embeddings (ETM, GLDAW).
Introduction to neural networks (TensorFlow, Keras): the basics of working with Keras and an overview of some neural networks.
Preprocessing of text data for neural networks.
Working with recurrent neural networks for textual analysis.
Working with LSTM neural networks for textual analysis.
Model with multiple outputs (heads).
Presentation of student work.
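
To give a sense of the Bayesian classification topic listed above, here is a minimal sketch of a multinomial naive Bayes sentiment classifier with add-alpha (Laplace) smoothing, written in plain Python. It is an assumption-laden illustration, not the course's own implementation: the function names, the `alpha` parameter, and the tiny training set are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples, alpha=1.0):
    """Fit a multinomial naive Bayes model on (tokens, label) pairs."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # token counts per class
    vocab = set()
    for tokens, label in samples:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)

    total_docs = sum(class_docs.values())
    model = {}
    for label in class_docs:
        # Add-alpha smoothing keeps unseen (word, class) pairs above zero.
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(class_docs[label] / total_docs),
            "log_likelihood": {
                w: math.log((word_counts[label][w] + alpha) / denom)
                for w in vocab
            },
        }
    return model

def predict(model, tokens):
    """Return the label with the highest posterior log-probability."""
    def score(label):
        params = model[label]
        # Words never seen in training are skipped.
        return params["log_prior"] + sum(
            params["log_likelihood"][w]
            for w in tokens
            if w in params["log_likelihood"]
        )
    return max(model, key=score)

# Hypothetical toy training data: (tokens, sentiment label).
train = [
    ("good great film".split(), "pos"),
    ("great acting good plot".split(), "pos"),
    ("bad boring film".split(), "neg"),
    ("boring plot bad acting".split(), "neg"),
]
model = train_naive_bayes(train)
label = predict(model, "good film".split())
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together.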

Assessment Elements

Exam
The exam is a competition (hackathon) to develop the best sentiment analysis model for Russian-language text. At the end of the first part of the course, students receive a Russian-language dataset with sentiment scores and must train their classification models on it. A week before the exam, students receive the second part of the dataset, which they must use to test their trained models. At the exam, students give a presentation on their models. The grade depends, first, on the quality of the presentation and, second, on the results obtained (the level of model training and the number of models).
Homework

Interim Assessment

2022/2023 1st module
0.4 * Exam + 0.6 * Homework
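
The weighting above can be checked with a short calculation. The sketch below assumes hypothetical scores of 8 for the exam and 7 for the homework; the grading scale and any rounding rule are not stated here, so none is applied.

```python
def interim_grade(exam, homework):
    """Weighted interim grade: 0.4 * Exam + 0.6 * Homework."""
    return 0.4 * exam + 0.6 * homework

# Hypothetical scores, chosen only to illustrate the weighting.
grade = interim_grade(exam=8, homework=7)
```

With these inputs the exam contributes 3.2 and the homework 4.2, so homework carries the larger share of the result.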

Bibliography

Recommended Core Bibliography

Sebastian Raschka, & Vahid Mirjalili. (2019). Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2 (3rd ed.). Packt Publishing.

Recommended Additional Bibliography

Miroslav Kubat. (2017). An Introduction to Machine Learning (2nd ed.). Springer.
