Unstructured Data Analysis

Master 2020/2021

Category 'Best Course for Career Development'

Category 'Best Course for Broadening Horizons and Diversity of Knowledge and Skills'

Category 'Best Course for New Knowledge and Skills'

Type: Elective course (Applied Statistics with Network Analysis)

Area of studies: Applied Mathematics and Informatics

Delivered by: International Laboratory for Applied Network Research

Where: International Laboratory for Applied Network Research

When: 2nd year, modules 1 and 2

Mode of studies: offline

Instructors: Ilia Karpov

Master’s programme: Applied Statistics with Network Analysis

Language: English

ECTS credits: 4

Contact hours: 48

Abstract

This course focuses on applied methods and existing tools for information retrieval (IR): web scraping, data preprocessing, and natural language processing (NLP). All methods considered in this course require basic knowledge of discrete mathematics and probability theory. For instance, most NLP and IR methods use conditional probability. In this course, we show how contemporary approaches are implemented in existing software packages (preferably Python frameworks) and demonstrate how these methods can be used to solve real-world problems.
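
To illustrate the role of conditional probability mentioned above, here is a minimal sketch that estimates a bigram probability P(word | previous word) by maximum likelihood from raw counts. The function name `bigram_probability` and the toy sentence are assumptions made for illustration only; they are not taken from the course materials.

```python
from collections import Counter


def bigram_probability(tokens, prev_word, word):
    """Estimate P(word | prev_word) as count(prev_word, word) / count(prev_word)."""
    # Count adjacent token pairs.
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Count prev_word only in positions that have a successor.
    prev_count = Counter(tokens[:-1])[prev_word]
    if prev_count == 0:
        return 0.0  # prev_word never seen as a context
    return bigrams[(prev_word, word)] / prev_count


# Toy corpus (hypothetical example sentence).
tokens = "the cat sat on the mat".split()
print(bigram_probability(tokens, "the", "cat"))  # 0.5
```

In this toy corpus "the" occurs twice as a context and is followed by "cat" once, so the estimate is 0.5. Practical language models add smoothing so that unseen pairs do not receive zero probability.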

Learning Objectives

To show how contemporary approaches are implemented in existing software packages (preferably Python frameworks) and to demonstrate how these methods can be used to solve real-world problems.

Expected Learning Outcomes

know the basic principles behind existing deep learning approaches
know the advantages of existing natural language processing packages
be able to obtain the data needed for research and applied projects
be able to perform basic ETL operations with datasets and unstructured data
be able to criticize constructively and identify existing issues in applied NLP tasks
have an understanding of the basic principles of information retrieval
have the skill to meaningfully develop an appropriate data analysis pipeline
have the skill to work with unstructured text data

Course Contents

IR tasks overview, Python dive in
Lecture: The first session discusses key IR tasks and shows simple examples. We will also handle several issues with acquiring data from databases, files, and the web. Practical: Getting and serializing data from databases and files.
Web information extraction
Lecture: Web scraping techniques and tools. APIs and response formats. Practical: Creating a simple web extraction script.
Word embeddings
Lecture: Word ambiguity problem, traditional and contemporary approaches to text representation. Distributed semantics, autoencoder architecture, word2vec, fastText, BERT. The notion of global and local optimization. Practical: word2vec and BERT model training and fitting, basic text classification.
Text normalization
Lecture: Text normalization problem, finite automata, conditional random fields. Practical: Text processing tools for Russian and English.
Syntax parsing, fact extraction
Lecture & Practical: Syntax parsing, text augmentation and generation
Language modelling, text classification and clustering
Lecture: Noisy channel model, spellchecking, language modelling, text classification and clustering, cross-validation for classification estimation. Practical: Language modelling, text classification and clustering.
Sentiment detection
Lecture: Sentiment detection with dictionaries, CNNs, RNNs. Sentiment detection as a classification problem. Practical: Sentiment classifier development.
Text visualization methods and interfaces
Practical: Histograms, multidimensional scaling, word graphs, highlight problem.
Machine translation, question answering
Lecture: Machine translation with Markov models and recurrent neural networks. Seminar: Seq2seq training, self-attention, Transformer. Analysis of attention heads in Transformer.
Summarization and Domain adaptation
Lecture: Transfer learning in text analysis, knowledge distillation. Abstractive summarization and simplification; ROUGE, SARI, BLEU, and METEOR metrics.
Semantic search and indexing
Lecture: Elasticsearch queries, morphology parameters, cosine similarity, index density.
Additional topics and course projects defense
Lecture: Additional topics and course projects defense

Assessment Elements

cumulative mark for the work during the module
The cumulative mark for the work during the module is based on the homework marks and on activity during the seminars.
final exam
The final exam can be replaced with a course project. The grade for the course project must be set before the final exam.

Interim Assessment

Interim assessment (module 2)
0.4 * cumulative mark for the work during the module + 0.6 * final exam
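
As a worked example of the weighting formula above, the following sketch computes the interim grade from the two components. The marks used (8 and 6) are hypothetical, and the syllabus does not state the marking scale or a rounding rule, so the result is rounded to two decimals for display only.

```python
def interim_grade(cumulative_mark, final_exam_mark):
    """Weighted interim grade: 0.4 * cumulative mark + 0.6 * final exam."""
    return 0.4 * cumulative_mark + 0.6 * final_exam_mark


# Hypothetical marks, chosen only for illustration.
grade = interim_grade(cumulative_mark=8, final_exam_mark=6)
print(round(grade, 2))  # 0.4 * 8 + 0.6 * 6 = 3.2 + 3.6 = 6.8
```

Because the exam carries the larger weight, a one-point change in the final exam mark moves the interim grade by 0.6, against 0.4 for the cumulative mark.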

Bibliography

Recommended Core Bibliography

Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, Mass: The MIT Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=24399

Recommended Additional Bibliography

Cohen, S. (2016). Bayesian Analysis in Natural Language Processing. Morgan & Claypool Publishers.
