Unstructured Data Analysis
- To show how contemporary approaches are implemented in existing software packages (preferably Python frameworks) and to demonstrate how these methods can be used to solve real-world problems.
- know the basic principles behind existing deep learning approaches
- know the advantages of existing natural language processing packages
- be able to get necessary data for research and applied projects
- be able to perform basic ETL operations with datasets and unstructured data
- be able to give constructive criticism and identify issues in applied NLP tasks
- have an understanding of the basic principles of information retrieval
- have the skill to meaningfully develop an appropriate data analysis pipeline
- have the skill to work with unstructured text data
- IR tasks overview, Python dive-in. Lecture: The first session will discuss key IR tasks and show simple examples. We will also handle several issues with acquiring data from databases, files and the web. Practical: Getting and serializing data from databases and files.
- Web information extraction. Lecture: Web scraping techniques and tools. APIs and response formats. Practical: Creating a simple web extraction script.
- Word embeddings. Lecture: Word ambiguity problem, traditional and contemporary approaches to text representation. Distributed semantics, autoencoder architecture, word2vec, fastText, BERT. The notion of global and local optimization. Practical: word2vec and BERT model training and fitting, basic text classification.
- Text normalization. Lecture: Text normalization problem, finite automata, conditional random fields. Practical: Text processing tools for Russian and English.
- Syntax parsing, fact extraction. Lecture & Practical: Syntax parsing, text augmentation and generation.
- Language modelling, text classification and clustering. Lecture: Noisy channel model, spellchecking, language modelling, text classification and clustering, cross-validation for classification estimation. Practical: Language modelling, text classification and clustering.
- Sentiment detection. Lecture: Sentiment detection with dictionaries, CNNs, RNNs. Sentiment detection as a classification problem. Practical: Sentiment classifier development.
- Text visualization methods and interfaces. Practical: Histograms, multidimensional scaling, word graphs, highlight problem.
- Machine translation, question answering. Lecture: Machine translation with Markov models and recurrent neural networks. Seminar: Seq2seq training, self-attention, Transformer. Analysis of attention heads in Transformer.
- Summarization and domain adaptation. Lecture: Transfer learning in text analysis, knowledge distillation. Abstractive summarization and simplification; ROUGE, SARI, BLEU and METEOR metrics.
- Semantic search and indexing. Lecture: Elasticsearch queries, morphology parameters, cosine similarity, index density.
- Additional topics and course project defense. Lecture: Additional topics and course project defense.
- Cumulative mark for the work during the module. The cumulative mark for the work during the module is based on the marks for the home tasks and on activity during the seminars.
- Final exam. The final exam can be replaced with a course project. The grade for the course project must be set before the final exam.
- Interim assessment (module 2): 0.4 * cumulative mark for the work during the module + 0.6 * final exam
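As an illustration of the interim assessment formula, the following minimal Python sketch computes the weighted grade. The function name `interim_grade` and the sample marks are hypothetical, chosen only for the example; the sketch applies no rounding, since no rounding rule is stated here.

```python
def interim_grade(cumulative_mark: float, final_exam: float) -> float:
    """Weighted interim grade: 0.4 * cumulative mark + 0.6 * final exam."""
    return 0.4 * cumulative_mark + 0.6 * final_exam


# Hypothetical marks, for illustration only.
example = interim_grade(cumulative_mark=8, final_exam=6)
print(f"{example:.1f}")  # 0.4 * 8 + 0.6 * 6 = 6.8
```

Because the final exam carries the larger weight, a one-point change in the exam (or in the course project that replaces it) moves the interim grade by 0.6, while a one-point change in the cumulative mark moves it by 0.4.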
- Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, Mass: The MIT Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=24399
- Cohen, S. (2016). Bayesian Analysis in Natural Language Processing. Morgan & Claypool Publishers.