Master 2020/2021

What is Data Science?

Category 'Best Course for New Knowledge and Skills'
Type: Compulsory course (Applied Linguistics and Text Analytics)
Area of studies: Fundamental and Applied Linguistics
Delivered by: School of Literature and Intercultural Communication
When: 1st year, modules 2-4
Mode of studies: offline
Master’s programme: Applied Linguistics and Text Analytics
Language: English
ECTS credits: 9
Contact hours: 96

Course Syllabus

Abstract

During this course, students become acquainted with the mathematical concepts underlying data science and master basic methods of collecting, processing, and transforming data using Python. The course covers data preprocessing, visualization, classical machine learning methods, and deep neural networks. It includes the basics of classification, regression, image recognition, and natural language processing.
Learning Objectives

  • To form an understanding of the various ways of working with data.
  • To become acquainted with methods of data preprocessing and visualization.
  • To develop the ability to create data analysis software that employs machine learning methods and deep neural networks.
Expected Learning Outcomes

  • Students are aware of the basics of Python: variables, expressions, control structures, functions, classes, exceptions, files.
  • Students can visualize data using Python. They can use matplotlib to draw bar charts, line charts, and scatter plots.
  • Students are aware of the basic concepts of linear algebra: matrices, vectors, matrix algebra, matrix decompositions.
  • Students are aware of the basics of statistics: average, variance and standard deviation, correlation.
  • Students are aware of the basic concepts of probability theory: dependence and independence, conditional probability, Bayes's theorem, normal distribution.
  • Students are aware of statistical hypothesis testing, confidence intervals, and Bayesian inference.
  • Students are aware of the gradient descent method.
  • Students can write Python programs to read data files and scrape the Web.
  • Students can write Python programs to explore, represent, clean, manipulate, and rescale data.
  • Students can write Python programs that work with regular expressions and with time and date, and that perform one-hot vectorization.
  • Students are aware of the concept of PCA and can write a Python program that performs it.
  • Students are aware of the concepts of nonlinear dimension reduction and manifold learning and can write Python programs that apply them.
  • Students are aware of basic concepts of machine learning: models, problems of ML, overfitting and underfitting, correctness, bias-variance tradeoff, feature extraction and selection.
  • Students are aware of the concept of k-nearest neighbors classification and can write a Python program for it.
  • Students are aware of the concept of naive Bayes classification and can write a Python program for it.
  • Students are aware of the concept of regression analysis and can write a Python program for it.
  • Students are aware of the concept of decision trees and can write a Python program for them.
  • Students are aware of the concept of cluster analysis and can write a Python program for it.
  • Students are aware of basic concepts of neural networks: artificial neuron, activation function, perceptron, backpropagation
  • Students are aware of basic concepts of deep learning: tensor, model weights, layers, various activation functions, loss function and metrics, optimization methods, softmax and cross-entropy, dropout, batches, stochastic gradient descent, epoch, batch normalization.
  • Students can use Python to create deep networks for image recognition: dense layers, convolutional layers, pooling layers, augmentation.
  • Students are aware of basic concepts of natural language processing and its applications
  • Students can use Python to perform text preprocessing: word normalization (spelling correction, stemming, lemmatization, stopword removal, case folding), tokenization, and creation of n-grams.
  • Students can use Python to create a bag-of-words text representation and use TF-IDF for weighting terms.
  • Students can use Python to create word embeddings.
  • Students are aware of basic concepts of deep learning for NLP and can apply them in Python: recurrent neural networks, convolutional networks, pooling, attention mechanism, transformer.
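As a rough illustration of the gradient descent outcome listed above, the sketch below minimizes a simple function of two variables. The function f(x, y) = (x - 3)^2 + (y + 1)^2, the learning rate, and the number of steps are assumptions chosen for the example; they are not taken from the course materials.

```python
# Minimal gradient descent sketch (illustrative assumptions only).
# Target function: f(x, y) = (x - 3)**2 + (y + 1)**2, minimum at (3, -1).

def gradient(point):
    """Analytic gradient of f at the given (x, y) point."""
    x, y = point
    return (2 * (x - 3), 2 * (y + 1))


def gradient_descent(start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    point = start
    for _ in range(steps):
        grad = gradient(point)
        point = tuple(p - learning_rate * g for p, g in zip(point, grad))
    return point


minimum = gradient_descent((0.0, 0.0))
print(minimum)  # close to (3, -1)
```

With a learning rate of 0.1, each step shrinks the distance to the minimum by a constant factor for this particular function; other functions may need a different step size or a stopping criterion.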
Course Contents

  • Python
    Brief review of Python
  • Visualization of data
    Various ways of data visualization using Python. Matplotlib. Bar charts, line charts, scatter plots.
  • Linear algebra
    Basic concepts of linear algebra. Matrix, vectors, matrix algebra. Matrix decompositions.
  • Statistics
    Basic ideas of statistics. Average, variance and standard deviation. Correlation.
  • Probability
    Basic concepts of probability theory. Dependence and independence. Conditional probability. Bayes's theorem. Normal distribution.
  • Hypothesis and inference
    Statistical hypothesis testing. Confidence interval. Bayesian inference.
  • Gradient descent
    Finding extrema of a function of multiple variables using gradient descent.
  • Getting data
    Working with data files. Scraping the Web.
  • Working with data
    Exploring data, representing, cleaning, manipulating, rescaling.
  • Data pre-processing
    Regular expressions, working with time and date. One-hot vectorization.
  • Linear dimension reduction
    Dimension reduction using PCA
  • Nonlinear dimension reduction
    Nonlinear methods of dimension reduction. Manifold learning
  • Machine learning
    Basic concepts of machine learning: models, problems of ML, overfitting and underfitting, correctness, bias-variance tradeoff, feature extraction and selection.
  • k-nearest neighbors
    Method of k-nearest neighbors for classification
  • Naive Bayes classification
    Method of Naive Bayes classification
  • Regression
    Methods of regression analysis
  • Decision trees
    Tree-like models that make predictions by learning decision rules inferred from the data features
  • Clustering
    Finding groups of similar objects in multidimensional data
  • Neural networks
    Basic concepts of neural networks: artificial neuron, activation function, perceptron, backpropagation
  • Deep learning
    Deep learning concepts: tensor, model weights, layers, various activation functions, loss function and metrics, optimization methods, softmax and cross-entropy, dropout, batches, stochastic gradient descent, epoch, batch normalization.
  • Image recognition
    Deep networks for image recognition: dense layers, convolutional layers, pooling layers, augmentation.
  • Natural language processing
    Basic concepts of natural language processing. Applications of NLP.
  • Text preprocessing and building a vocabulary
    Word normalization: spelling correction, stemming, lemmatization, stopword removal, case folding. Tokenization and n-grams.
  • Bag of words
    Representing text data as bag of words. TF-IDF for weighting terms
  • Word embedding
    Word embedding. Similarity calculation between document vectors: cosine similarity, word mover's distance
  • Deep network architecture for NLP
    Recurrent neural networks: plain RNN, LSTM, GRU. Convolutional networks. Pooling. Attention mechanism and transformer.
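The bag-of-words, TF-IDF, and cosine similarity topics listed above can be sketched in plain Python as follows. The three sample documents, the whitespace tokenizer, and the unsmoothed IDF formula log(N / df) are assumptions made for this example; libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "neural networks learn word embeddings",
]


def tokenize(text):
    """Case-fold and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({term for doc in tokenized for term in doc})


def idf(term):
    """Unsmoothed inverse document frequency: log(N / df)."""
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / df)


def tfidf_vector(tokens):
    """Bag-of-words counts turned into TF-IDF weights over the vocabulary."""
    counts = Counter(tokens)
    return [counts[term] / len(tokens) * idf(term) for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = [tfidf_vector(doc) for doc in tokenized]
sim_cat_dog = cosine_similarity(vectors[0], vectors[1])
sim_cat_nn = cosine_similarity(vectors[0], vectors[2])
print(sim_cat_dog, sim_cat_nn)
```

The first two documents share several terms, so their similarity is positive, while the third shares none with the first and scores zero.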
Assessment Elements

  • non-blocking Homework, module 1
  • non-blocking Homework, module 2
  • non-blocking Homework, module 3
  • non-blocking Independent work
Interim Assessment

  • Interim assessment (module 4)
    0.1 * Homework, module 1 + 0.3 * Homework, module 2 + 0.2 * Homework, module 3 + 0.4 * Independent work
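
The weighted sum above can be checked with a short calculation. The weights come from the formula; the scores and the dictionary keys are hypothetical, and no rounding rule is assumed because the syllabus does not state one.

```python
# Weights of the four assessment elements, as given in the formula above.
WEIGHTS = {
    "homework_module_1": 0.1,
    "homework_module_2": 0.3,
    "homework_module_3": 0.2,
    "independent_work": 0.4,
}


def interim_grade(scores):
    """Weighted sum of the assessment scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for one student (not real data).
example_scores = {
    "homework_module_1": 8,
    "homework_module_2": 7,
    "homework_module_3": 9,
    "independent_work": 6,
}

grade = interim_grade(example_scores)
print(round(grade, 2))  # 0.8 + 2.1 + 1.8 + 2.4 = 7.1
```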
Bibliography

Recommended Core Bibliography

  • Aman Kedia, & Mayank Rasu. (2020). Hands-On Python Natural Language Processing : Explore Tools and Techniques to Analyze and Process Text with a View to Building Real-world NLP Applications. Packt Publishing.
  • Ben Stephenson. (2019). The Python Workbook : A Brief Introduction with Exercises and Solutions (Vol. 2nd ed. 2019). Springer.
  • Bill Lubanovic. (2019). Introducing Python : Modern Computing in Simple Packages. [N.p.]: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=2291494
  • Grus, J. (2019). Data Science From Scratch : First Principles with Python (Vol. Second edition). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=2102311

Recommended Additional Bibliography

  • Döbler, M., & Grössmann, T. (2019). Data Visualization with Python : Create an Impact with Meaningful Data Insights Using Interactive and Engaging Visuals. Packt Publishing.
  • Embarak, O. (2018). Data Analysis and Visualization Using Python: Analyze Data to Create Visualizations for BI Systems. Apress.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. Retrieved from http://www.deeplearningbook.org
  • Idris, I. (2015). NumPy: Beginner’s Guide - Third Edition (Vol. 3rd edition). Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1018109
  • Muller, A. C., & Guido, S. (2017). Introduction to machine learning with Python: a guide for data scientists. O’Reilly Media. (HSE access: http://ebookcentral.proquest.com/lib/hselibrary-ebooks/detail.action?docID=4698164)
  • Vanderplas, J.T. (2016). Python data science handbook: Essential tools for working with data. Sebastopol, CA: O’Reilly Media, Inc. https://proxylibrary.hse.ru:2119/login.aspx?direct=true&db=nlebk&AN=1425081.
  • Williams, G. (2019). Linear Algebra with Applications (Vol. Ninth edition). Burlington, MA: Jones & Bartlett Learning. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1708709