
Introduction to collection and analysis of 'Big data'

Academic year
Taught in English
Elective course
When taught:
1st year, modules 1 and 2


Course Syllabus


The growth of Internet penetration and the ability to collect and analyze big data have created new challenges and opened new opportunities for researchers and official statistics. Within several years, nonreactive and big data have become the main trend in the social sciences. Nonreactive methods include nonparticipant observation and the analysis of digital footprints such as likes or shares, of private documents such as blogs, social media profiles, and comments, and of public online documents such as mass media materials.

This course introduces key quantitative approaches to collecting nonreactive data in the social sciences. It is taught through lectures, seminars, and individual work in Jupyter Notebook. All teaching is conducted in English. The goal of the course is to introduce social scientists to the opportunities of nonreactive and big data and to teach basic methods and tools for collecting nonreactive data. Several Python packages are used for data analysis during the course.

Basic knowledge of quantitative sociological methods is required. Familiarity with Python is very helpful but not required. To run Jupyter Notebook, install Anaconda (freely available at https://www.anaconda.com/products/individual#Downloads). This course uses some materials and tasks from https://www.datacamp.com/ (free access is provided to all students).
Learning Objectives

  • Know basic methods of collecting nonreactive data in social sciences
  • Know different types of big data in social sciences
  • Use skills to collect online data (Wikipedia, YouTube, etc.)
  • Use skills to analyze textual data
Expected Learning Outcomes

  • Know basic concepts of the Python programming language
  • Have skills to write Python code for basic data analysis tasks
  • Have skills to analyze textual data
  • Have skills to scrape online data through various APIs, browser automation, etc.
  • Know basic concepts of Big data, its opportunities, limitations, and relevance to social sciences
Course Contents

  • Introduction to Python
    Anaconda. Virtual environments. Jupyter notebook. Basic data types and structures. Basic functions and operators, methods and packages.
  • Basic data manipulation in Python
    Basic dataframe manipulations in Python: filtering rows, selecting columns, slicing rows, creating new variables, arranging columns, joins, aggregation and grouping. Exploratory data analysis: descriptive statistics and visualization. Competitive data science: Kaggle competitions.
  • Basic Text Processing
    Text preprocessing: cleaning raw data (lowercasing, removal of special characters and stopwords, etc.); tokenization and segmentation; word normalization (stemming, lemmatization). Text processing: N-grams, TF-IDF. Frequency-based keyword extraction.
  • Web scraping
    The json module in Python. HTML structure of a web page. The requests package. Request blocking and workarounds using the fake_useragent package. Working with dynamic pages (imitating user behaviour) using the Selenium package. Extracting information from tags with the BeautifulSoup package.
  • Client-server architecture and the request-response model: working with APIs
    Public and private APIs. The YouTube API. Quotas.
  • Distributional semantics and topic modeling
    GloVe, Word2vec (CBOW and Skip-gram model architectures) and other word embedding methods. Topic Mining and Analysis: Motivation and Task Definition. Latent Dirichlet Allocation (LDA).
  • Distributional semantics and topic modeling
    Flexible and interpretable NLP models (using LDA2vec as an example). Evaluation metrics in NLP.
  • Introduction to Deep Learning in Python
    Introduction to neural networks: artificial neural networks (ANNs), model weights, loss function, activation functions, hyperparameters, forward and back propagation, gradient boosting, stochastic gradient boosting, model performance (overfitting and underfitting, main performance metrics). Coding neural networks in Keras: stacking layers, using momentum and Adam optimization, applying early stopping and a learning rate scheduler.
  • Sequence modeling
    Recurrent neural networks. The exploding/vanishing gradient problem. Recurrent architectures designed to mitigate this issue: LSTM.
  • Introduction to Transformers
    Attention model intuition. Transformer network architecture: self-attention, multi-head attention.
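
The text processing steps listed under "Basic Text Processing" (lowercasing, removal of special characters and stopwords, tokenization, N-grams, TF-IDF) can be illustrated with a minimal sketch that uses only the Python standard library. The toy corpus, the stopword list, and the function names below are hypothetical and chosen for illustration; the TF-IDF variant shown (term frequency as count divided by document length, inverse document frequency as log(N/df)) is one common definition and may differ from the one used in class or in a particular package.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stopword list, for illustration only.
DOCS = [
    "Big data and the social sciences!",
    "Collecting data from the web: APIs and scraping.",
    "Text analysis of social media comments.",
]
STOPWORDS = {"and", "the", "of", "from"}


def preprocess(text):
    """Lowercase, replace non-letter characters with spaces, tokenize, drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOPWORDS]


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs_tokens):
    """Return one {token: weight} dict per document.

    Assumed definition: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs_tokens)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for tokens in docs_tokens for tok in set(tokens))
    scores = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        scores.append({
            tok: (cnt / len(tokens)) * math.log(n_docs / df[tok])
            for tok, cnt in counts.items()
        })
    return scores


tokens = [preprocess(doc) for doc in DOCS]
bigrams = ngrams(tokens[0], 2)
weights = tf_idf(tokens)

# Frequency-based keyword extraction: top-weighted tokens of the first document.
top = sorted(weights[0], key=weights[0].get, reverse=True)[:2]
print(tokens[0], bigrams, top)
```

In this sketch a word that occurs in several documents (here "data") receives a lower weight than a word unique to one document (here "big"), which is the intuition behind using TF-IDF for keyword extraction.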
Assessment Elements

  • non-blocking Quizzes
  • non-blocking Homework 1
  • non-blocking Homework 2
  • non-blocking Homework 3
  • non-blocking Homework 4
  • non-blocking Homework 5
Interim Assessment

  • Interim assessment (module 2)
    0.15 * Homework 1 + 0.3 * Homework 2 + 0.15 * Homework 3 + 0.15 * Homework 4 + 0.15 * Homework 5 + 0.1 * Quizzes
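
As a worked example of the weighted formula above, the following sketch computes an interim grade in Python. The weights are taken from the formula and sum to 1.0; the scores are hypothetical, and the assumption that all elements are graded on the same scale is not stated in the syllabus. No rounding rule is implied.

```python
# Weights copied from the interim assessment formula.
WEIGHTS = {
    "Homework 1": 0.15,
    "Homework 2": 0.30,
    "Homework 3": 0.15,
    "Homework 4": 0.15,
    "Homework 5": 0.15,
    "Quizzes": 0.10,
}


def interim_grade(scores):
    """Weighted sum of element scores (all assumed to be on the same scale)."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


# Hypothetical scores, for illustration only.
example_scores = {
    "Homework 1": 8,
    "Homework 2": 6,
    "Homework 3": 10,
    "Homework 4": 7,
    "Homework 5": 9,
    "Quizzes": 8,
}

grade = interim_grade(example_scores)
print(round(grade, 2))  # 7.7
```

Because Homework 2 carries twice the weight of any other homework, a low score on it lowers the result more than the same score on another element would.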


Recommended Core Bibliography

  • Bengfort, B., Bilbro, R., & Ojeda, T. (2018). Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning. Beijing: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1827695
  • Beysolow, T. (2018). Applied Natural Language Processing with Python: Implementing Machine Learning and Deep Learning Algorithms for Natural Language Processing. Berkeley, CA: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1892182
  • Hajba, G. L. (2018). Website Scraping with Python: Using BeautifulSoup and Scrapy. Berkeley, CA: Apress.
  • Howard, J., & Gugger, S. (2020). Deep Learning for Coders with Fastai and PyTorch. O’Reilly Media.
  • Bhattacharyya, S., Snasel, V., Hassanien, A. E., Saha, S., & Tripathy, B. K. (2020). Deep Learning: Research and Applications. De Gruyter.
  • Vanderplas, J. T. (2016). Python Data Science Handbook: Essential Tools for Working with Data (1st ed.). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1425081

Recommended Additional Bibliography

  • Matthes, E. (2019). Python Crash Course: A Hands-On, Project-Based Introduction to Programming (2nd ed.). No Starch Press.