Master 2020/2021

Unstructured Data Analysis

Category 'Best Course for Career Development'
Category 'Best Course for Broadening Horizons and Diversity of Knowledge and Skills'
Category 'Best Course for New Knowledge and Skills'
Area of studies: Applied Mathematics and Informatics
When: 2nd year, modules 1 and 2
Mode of studies: offline
Open to: students of all HSE University campuses
Instructors: Ilia Karpov
Master’s programme: Applied Statistics with Network Analysis
Language: English
ECTS credits: 4

Course Syllabus

Abstract

This course focuses on applied methods and existing tools for information retrieval: web scraping, data preprocessing, and natural language processing. All methods considered in this course require basic knowledge of discrete mathematics and probability theory. For instance, most NLP and IR methods use conditional probability. In this course, we show how contemporary approaches are implemented in existing software packages (preferably Python frameworks) and demonstrate how these methods can be used to solve real-world problems.
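As a small illustration of the conditional-probability point above, the following sketch estimates P(word | previous word) from bigram counts. The toy corpus and the function name `bigram_prob` are illustrative assumptions, not course material.

```python
from collections import Counter

# Toy corpus; purely illustrative, not course data.
corpus = "the cat sat on the mat the cat ran".split()

# Count each word as a left context, and each adjacent word pair.
context_counts = Counter(corpus[:-1])
bigram_counts = Counter(zip(corpus, corpus[1:]))


def bigram_prob(prev: str, word: str) -> float:
    """Maximum-likelihood estimate of P(word | prev) = count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0  # unseen context: no estimate available without smoothing
    return bigram_counts[(prev, word)] / context_counts[prev]


# 2 of the 3 occurrences of "the" are followed by "cat".
print(bigram_prob("the", "cat"))
```

A practical language model would add smoothing so that unseen word pairs do not get probability zero.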
Learning Objectives

  • to show how contemporary approaches are implemented in existing software packages (preferably Python frameworks) and to demonstrate how these methods can be used to solve real-world problems.
Expected Learning Outcomes

  • know the basic principles behind existing deep learning approaches
  • know advantages of existing natural language processing packages
  • be able to get necessary data for research and applied projects
  • be able to perform basic ETL operations with datasets and unstructured data
  • be able to constructively critique applied NLP tasks and identify their existing issues
  • have an understanding of the basic principles of information retrieval
  • have the skill to meaningfully develop an appropriate data analysis pipeline
  • have the skill to work with unstructured text data
Course Contents

  • IR tasks overview, Python dive in
    Lecture: The first session will discuss key IR tasks and show simple examples. We will also handle several issues with acquiring data from databases, files and the web. Practical: Getting and serializing data from databases and files.
  • Web information extraction
    Lecture: Web scraping techniques and tools. APIs and response formats. Practical: Creating a simple web extraction script.
  • Word embeddings
    Lecture: The word ambiguity problem, traditional and contemporary approaches to text representation. Distributed semantics, autoencoder architecture, word2vec, fastText, BERT. The notion of global and local optimization. Practical: word2vec and BERT model training and fitting, basic text classification.
  • Text normalisation
    Lecture: The text normalization problem, finite automata, conditional random fields. Practical: Text processing tools for Russian and English.
  • Syntax parsing, fact extraction
    Lecture & Practical: Syntax parsing, text augmentation and generation
  • Language modelling, text classification and clustering
    Lecture: Noisy channel model, spellchecking, language modelling, text classification and clustering, cross-validation for classification estimation. Practical: Language modelling, text classification and clustering.
  • Sentiment detection
    Lecture: Sentiment detection with dictionaries, CNNs, RNNs. Sentiment detection as a classification problem. Practical: Sentiment classifier development.
  • Text visualization methods and interfaces
    Practical: Histograms, multidimensional scaling, word graphs, highlight problem.
  • Machine translation, question answering
    Lecture: Machine translation with Markov models and recurrent neural networks. Seminar: Seq2seq training, self-attention, Transformer. Analysis of attention heads in the Transformer.
  • Summarization and Domain adaptation
    Lecture: Transfer learning in text analysis, knowledge distillation. Abstractive summarization and simplification; ROUGE, SARI, BLEU, METEOR metrics.
  • Semantic search and indexing
    Lecture: Elasticsearch queries, morphology parameters, cosine similarity, index density.
  • Additional topics and course projects defense
    Lecture: Additional topics and course projects defense
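To make one of the listed topics concrete, here is a minimal sketch of cosine similarity over bag-of-words vectors, the measure named under "Semantic search and indexing". It is a plain-Python illustration of the measure only, not Elasticsearch's API or scoring; the function names and example documents are assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the term-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def rank(query: str, docs: list[str]) -> list[str]:
    """Return the documents ordered from most to least similar to the query."""
    return sorted(docs, key=lambda d: cosine_similarity(query, d), reverse=True)


# Hypothetical mini-collection for demonstration.
docs = ["web scraping tools", "word embeddings and bert"]
print(rank("web scraping", docs))
```

Whitespace tokenization is used here for brevity; a real index would normally apply the morphological normalization mentioned in the same topic before comparing terms.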
Assessment Elements

  • non-blocking cumulative mark for the work during the module
    The cumulative mark for the work during the module is based on the marks for the home tasks and on activity during the seminars.
  • non-blocking final exam
    The final exam can be replaced with a course project. The grade for the course project must be set before the final exam.
Interim Assessment

  • Interim assessment (2 module)
    0.4 * cumulative mark for the work during the module + 0.6 * final exam
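The weighting above can be worked through in a short sketch. The function name `interim_grade` and the sample marks are hypothetical; the syllabus does not state the grading scale or a rounding rule, so none is applied here beyond rounding for display.

```python
def interim_grade(cumulative_mark: float, final_exam: float) -> float:
    """Weighted interim grade: 0.4 * cumulative mark + 0.6 * final exam."""
    return 0.4 * cumulative_mark + 0.6 * final_exam


# Hypothetical marks: 8 for work during the module, 6 for the final exam.
# 0.4 * 8 + 0.6 * 6 = 3.2 + 3.6 = 6.8
print(round(interim_grade(8, 6), 2))
```

Because the exam carries the larger weight, a one-point change in the exam mark moves the result by 0.6, against 0.4 for the cumulative mark.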
Bibliography

Recommended Core Bibliography

  • Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, Mass: The MIT Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=24399

Recommended Additional Bibliography

  • Cohen, S. (2016). Bayesian Analysis in Natural Language Processing. Morgan & Claypool Publishers.