Master
2020/2021

# What is Data Science?

Category 'Best Course for New Knowledge and Skills'

Type:
Compulsory course (Applied Linguistics and Text Analytics)

Area of studies:
Fundamental and Applied Linguistics

Delivered by:
School of Literature and Intercultural Communication

When:
Year 1, modules 2-4

Mode of studies:
offline

Instructors:
Pavel Vladimirovich Kuptsov (Купцов Павел Владимирович)

Master’s programme:
Applied Linguistics and Text Analytics

Language:
English

ECTS credits:
9

Contact hours:
96

### Course Syllabus

#### Abstract

In this course, students become acquainted with the mathematical concepts underlying data science and master basic methods of collecting, processing, and transforming data using Python. The course covers data preprocessing, visualization, classical machine learning methods, and deep neural networks. It includes the basics of classification, regression, image recognition, and natural language processing.

#### Learning Objectives

- To develop an understanding of the various ways of working with data.
- To become familiar with methods of data preprocessing and visualization.
- To develop the ability to create data analysis software that employs machine learning methods and deep neural networks.

#### Expected Learning Outcomes

- Students know the basics of Python: variables, expressions, control structures, functions, classes, exceptions, and files.
- Students can visualize data using Python, including drawing bar charts, line charts, and scatter plots with matplotlib.
- Students are familiar with the basic concepts of linear algebra: matrices, vectors, matrix algebra, and matrix decompositions.
- Students are familiar with the basics of statistics: mean, variance and standard deviation, and correlation.
- Students are familiar with the basic concepts of probability theory: dependence and independence, conditional probability, Bayes's theorem, and the normal distribution.
- Students are familiar with statistical hypothesis testing, confidence intervals, and Bayesian inference.
- Students are familiar with the gradient descent method.
- Students can write Python programs to read data files and scrape the Web.
- Students can write Python programs to explore, represent, clean, manipulate, and rescale data.
- Students can write Python programs that work with regular expressions and with dates and times, and that perform one-hot encoding.
- Students understand the concept of principal component analysis (PCA) and can write a Python program that performs it.
- Students understand the concepts of nonlinear dimension reduction and manifold learning and can write Python programs that apply them.
- Students are familiar with the basic concepts of machine learning: models, problems of ML, overfitting and underfitting, correctness, the bias-variance tradeoff, and feature extraction and selection.
- Students understand k-nearest neighbors classification and can write a Python program for it.
- Students understand naive Bayes classification and can write a Python program for it.
- Students understand regression analysis and can write a Python program for it.
- Students understand decision trees and can write a Python program for them.
- Students understand cluster analysis and can write a Python program for it.
- Students are familiar with the basic concepts of neural networks: artificial neuron, activation function, perceptron, and backpropagation.
- Students are familiar with the basic concepts of deep learning: tensors, model weights, layers, various activation functions, loss functions and metrics, optimization methods, softmax and cross-entropy, dropout, batches, stochastic gradient descent, epochs, and batch normalization.
- Students can use Python to create deep networks for image recognition using dense layers, convolutional layers, pooling layers, and augmentation.
- Students are familiar with the basic concepts of natural language processing and its applications.
- Students can use Python to perform text preprocessing: word normalization (spelling correction, stemming, lemmatization, stopword removal, case folding), tokenization, and creation of n-grams.
- Students can use Python to create a bag-of-words text representation and to weight terms with TF-IDF.
- Students can use Python to create word embeddings.
- Students are familiar with the basic concepts of deep learning for NLP and can apply them in Python: recurrent neural networks, convolutional networks, pooling, the attention mechanism, and the transformer.
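
As an illustration of the text-processing outcomes above (case folding, tokenization, n-grams, bag of words, TF-IDF), here is a minimal sketch that uses only the Python standard library. The function names, the sample sentences, and the particular TF-IDF variant (relative term frequency times an unsmoothed logarithmic inverse document frequency) are assumptions made for this example; the course itself may use a library implementation with a different weighting scheme.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Case-fold the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length,
    idf = log(number of documents / number of documents containing the term).
    """
    bags = [Counter(tokenize(doc)) for doc in documents]  # bag of words per document
    n_docs = len(bags)
    # Iterating over a Counter yields each distinct term once per document.
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights


# Hypothetical two-document corpus.
docs = ["The cat sat on the mat", "The dog sat on the log"]
weights = tf_idf(docs)
bigrams = ngrams(tokenize(docs[0]), 2)
```

With this variant, a term that occurs in every document (such as "the" here) receives weight zero, while a term unique to one document (such as "cat") receives a positive weight.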

#### Course Contents

- Python: Brief review of Python.
- Visualization of data: Various ways of visualizing data using Python. Matplotlib. Bar charts, line charts, scatter plots.
- Linear algebra: Basic concepts of linear algebra. Matrices, vectors, matrix algebra. Matrix decompositions.
- Statistics: Basic ideas of statistics. Mean, variance and standard deviation. Correlation.
- Probability: Basic concepts of probability theory. Dependence and independence. Conditional probability. Bayes's theorem. Normal distribution.
- Hypothesis and inference: Statistical hypothesis testing. Confidence intervals. Bayesian inference.
- Gradient descent: Finding extrema of a function of multiple variables using gradient descent.
- Getting data: Working with data files. Scraping the Web.
- Working with data: Exploring, representing, cleaning, manipulating, and rescaling data.
- Data pre-processing: Regular expressions. Working with dates and times. One-hot encoding.
- Linear dimension reduction: Dimension reduction using PCA.
- Nonlinear dimension reduction: Nonlinear methods of dimension reduction. Manifold learning.
- Machine learning: Basic concepts, including models, problems of ML, overfitting and underfitting, correctness, the bias-variance tradeoff, and feature extraction and selection.
- k-nearest neighbors: The k-nearest neighbors method for classification.
- Naive Bayes classification: The naive Bayes classification method.
- Regression: Methods of regression analysis.
- Decision trees: Tree-like models that make predictions by learning decision rules inferred from the data features.
- Clustering: Finding groups of similar objects in multidimensional data.
- Neural networks: Basic concepts, including artificial neuron, activation function, perceptron, and backpropagation.
- Deep learning: Core concepts, including tensors, model weights, layers, various activation functions, loss functions and metrics, optimization methods, softmax and cross-entropy, dropout, batches, stochastic gradient descent, epochs, and batch normalization.
- Image recognition: Deep networks for image recognition, including dense layers, convolutional layers, pooling layers, and augmentation.
- Natural language processing: Basic concepts of natural language processing. Applications of NLP.
- Text preprocessing and building vocabulary: Word normalization (spelling correction, stemming, lemmatization, stopword removal, case folding). Tokenization and n-grams.
- Bag of words: Representing text data as a bag of words. TF-IDF for weighting terms.
- Word embedding: Word embeddings. Calculating similarity between document vectors using cosine similarity and word mover's distance.
- Deep network architecture for NLP: Recurrent neural networks (plain RNN, LSTM, GRU). Convolutional networks. Pooling. Attention mechanism and transformer.
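
To make the gradient descent topic in the list above concrete, the following minimal sketch finds the minimum of a function of two variables. The objective function, the starting point, the learning rate, and the number of steps are all hypothetical values chosen for illustration and are not taken from the course materials.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient, starting from `start`."""
    point = list(start)
    for _ in range(steps):
        g = grad(point)
        point = [p - learning_rate * g_i for p, g_i in zip(point, g)]
    return point


def f(v):
    """Example objective (an assumption): f(x, y) = (x - 1)^2 + 2 * (y + 2)^2."""
    x, y = v
    return (x - 1) ** 2 + 2 * (y + 2) ** 2


def grad_f(v):
    """Analytic gradient of f: (2 * (x - 1), 4 * (y + 2))."""
    x, y = v
    return [2 * (x - 1), 4 * (y + 2)]


# The true minimum of f is at (1, -2), where f equals 0.
minimum = gradient_descent(grad_f, start=[5.0, 5.0])
```

For this convex quadratic, a fixed learning rate of 0.1 shrinks the distance to the minimum by a constant factor at every step. For other functions the step size may need tuning, and the method may converge only to a local extremum.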

#### Assessment Elements

- Homework, module 1
- Homework, module 2
- Homework, module 3
- Independent work

#### Interim Assessment

- Interim assessment (module 4): 0.1 * Homework, module 1 + 0.3 * Homework, module 2 + 0.2 * Homework, module 3 + 0.4 * Independent work
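
The weighting above can be checked with a short calculation. The sketch below applies the four weights from the formula to a set of hypothetical scores. The dictionary keys and the sample scores are invented for illustration, and because the syllabus does not state a grading scale or a rounding rule, none is applied here.

```python
# Weights copied from the interim assessment formula; the key names are illustrative.
WEIGHTS = {
    "homework_module_1": 0.1,
    "homework_module_2": 0.3,
    "homework_module_3": 0.2,
    "independent_work": 0.4,
}


def interim_grade(scores):
    """Return the weighted sum of the four assessment elements."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for one student.
example_scores = {
    "homework_module_1": 8,
    "homework_module_2": 6,
    "homework_module_3": 7,
    "independent_work": 9,
}
grade = interim_grade(example_scores)  # 0.8 + 1.8 + 1.4 + 3.6, approximately 7.6
```

The weights sum to 1, so the result stays on the same scale as the individual scores; the independent work carries the largest share of the grade.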

#### Bibliography

#### Recommended Core Bibliography

- Kedia, A., & Rasu, M. (2020). Hands-On Python Natural Language Processing: Explore Tools and Techniques to Analyze and Process Text with a View to Building Real-world NLP Applications. Packt Publishing.
- Stephenson, B. (2019). The Python Workbook: A Brief Introduction with Exercises and Solutions (2nd ed.). Springer.
- Lubanovic, B. (2019). Introducing Python: Modern Computing in Simple Packages. O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=2291494
- Grus, J. (2019). Data Science From Scratch: First Principles with Python (2nd ed.). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=2102311

#### Recommended Additional Bibliography

- Döbler, M., & Grössmann, T. (2019). Data Visualization with Python: Create an Impact with Meaningful Data Insights Using Interactive and Engaging Visuals. Packt Publishing.
- Embarak, O. (2018). Data Analysis and Visualization Using Python: Analyze Data to Create Visualizations for BI Systems. Apress.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. Retrieved from http://www.deeplearningbook.org
- Idris, I. (2015). NumPy: Beginner’s Guide (3rd ed.). Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1018109
- Müller, A. C., & Guido, S. (2017). Introduction to Machine Learning with Python: A Guide for Data Scientists. O’Reilly Media. (HSE access: http://ebookcentral.proquest.com/lib/hselibrary-ebooks/detail.action?docID=4698164)
- Vanderplas, J. T. (2016). Python Data Science Handbook: Essential Tools for Working with Data. Sebastopol, CA: O’Reilly Media. Retrieved from https://proxylibrary.hse.ru:2119/login.aspx?direct=true&db=nlebk&AN=1425081
- Williams, G. (2019). Linear Algebra with Applications (9th ed.). Burlington, MA: Jones & Bartlett Learning. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1708709