Master's programme 2020/2021

Introduction to Methods of Collecting and Analyzing Big Data

Status: Compulsory course (Comprehensive Social Analysis)
Field of study: 39.04.01 Sociology
Delivered in: Year 1, modules 1 and 2
Mode of study: with an online course
Audience: home campus only
Programme: Comprehensive Social Analysis
Language of instruction: English
Credits: 4

Course Syllabus

Abstract

This is an introductory course on gathering and analyzing Internet data. It covers two broad topics: data scraping and the analysis of textual data. The course is taught through trainings and practical work, and all teaching is conducted in English. Data analysis is carried out in R with a number of R packages (R is freely available at https://www.r-project.org).

The course builds on the following subjects:
  • Probability Theory and Mathematical Statistics
  • Methodology and Methods for Sociological Research

The course requires the following knowledge and skills:
  • knowledge of the basic components of sociological research
  • knowledge of various sampling techniques, their opportunities and limitations

The main ideas of the course are applicable in the following course:
  • Theory and Practice of Online Research

The following online courses may be helpful for this course:
  • Shah C. Social Media Data Analytics. URL: https://www.coursera.org/learn/social-media-data-analytics (retrieved: 20.06.2018)
  • Leek J., Peng R. D., Caffo B. Getting and Cleaning Data. URL: https://www.coursera.org/learn/data-cleaning (retrieved: 20.06.2018)
  • Potapenko A., Zobnin A., Kozlova A., Yudin S., Zimovnov A. Natural Language Processing. URL: https://www.coursera.org/learn/language-processing (retrieved: 20.06.2018)
Learning Objectives

  • Study the basic notions of big data research
  • Use basic techniques to gather and analyze big data
Expected Learning Outcomes

  • Know the basic concepts of big data, its opportunities, limitations, and relevance to the social sciences
  • Know the basic concepts of the R programming language
  • Be able to write R code for basic data analysis tasks
  • Be able to scrape online data through various APIs, browser automation, etc.
  • Be able to analyze textual data
Course Contents

  • Analysis of textual data in R
    Basic concepts of text mining. Types of text mining. Packages (qdap, stringi, stringr, tm, quanteda, NLP, etc.). Text preprocessing. Term frequency analysis. Keyword analysis. Sentiment analysis. Topic analysis. Document clustering and classification. Introduction to advanced models (text2vec, etc.). Visualization.
  • Introduction to R
    What is R. Comparisons between R and SPSS, R and Stata, R and Python. Packages. Files. Variables. Data storage in R (vectors, lists, data frames etc.). Regular expressions. Conditions. Loops. Functions. Tidyverse in R. Limitations of R.
  • Introduction to Big data
    What is Big data. Different understandings of the notion, its opportunities and limitations. Big data applications in various types of social studies. Cases. Biases. Ethical concerns.
  • Data scraping in R
    Basic information on web data (HTML, XML, HTTP, AJAX, etc.). Data retrieval via APIs. R packages for social media APIs (Twitter, Facebook, VKontakte, etc.). Limitations of APIs. Various scenarios for data retrieval without APIs. R packages for data retrieval without APIs (rvest, httr, etc.). Browser automation for scraping dynamic pages (with the RSelenium package). Data cleaning.
Assessment Elements

  • non-blocking Class Attendance
  • non-blocking Class Participation
  • non-blocking Home assignment 1
    Each student must complete this home assignment individually. Students must submit a PDF file with their answers and an R script. The assignment is graded from 1 (fail) to 10 (excellent).
  • non-blocking Home assignment 2
    Each student must complete this home assignment individually. Students must submit a PDF file with their answers and an R script. The assignment is graded from 1 (fail) to 10 (excellent).
  • non-blocking Essay
    For the essay, a group of up to 4 students should scrape and analyze online data from various sources on a chosen topic (for instance, news coverage of an event) and report the results in a coherent text with an introduction (research question, short literature review, and main hypotheses), a main body (analysis), a conclusion, a list of references, and the R script in an appendix. The essay should be at least 8,000 characters long, excluding the appendix.
Interim Assessment

  • Interim assessment (module 2)
    0.12 * Class Attendance + 0.13 * Class Participation + 0.45 * Essay + 0.15 * Home assignment 1 + 0.15 * Home assignment 2
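
As a worked illustration of the formula above, the following minimal Python sketch computes the interim grade as a weighted sum. The function name `interim_grade`, the dictionary keys, and the sample scores are hypothetical. The sketch also assumes that every element is scored on the same 1 to 10 scale; the syllabus states this scale only for the home assignments. No rounding rule is given in the syllabus, so the function returns the unrounded value.

```python
# Weights copied from the interim assessment formula in the syllabus.
WEIGHTS = {
    "class_attendance": 0.12,
    "class_participation": 0.13,
    "essay": 0.45,
    "home_assignment_1": 0.15,
    "home_assignment_2": 0.15,
}


def interim_grade(scores):
    """Return the weighted sum of the assessment element scores.

    Assumption: every score is on the same 1-10 scale.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for one student (not taken from the syllabus).
example_scores = {
    "class_attendance": 10,
    "class_participation": 8,
    "essay": 7,
    "home_assignment_1": 9,
    "home_assignment_2": 6,
}

# 0.12*10 + 0.13*8 + 0.45*7 + 0.15*9 + 0.15*6
print(round(interim_grade(example_scores), 2))
```

Because the essay carries a weight of 0.45, it contributes almost half of the final value, more than both home assignments combined (0.30).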
Bibliography

Recommended Core Bibliography

  • Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Boston: Eamon Dolan/Houghton Mifflin Harcourt. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1872664
  • Kabacoff, R. I. (2014). R в действии. Анализ и визуализация данных в программе R [R in Action: Data Analysis and Visualization in R]. DMK Press. 588 pp. ISBN: 978-5-97060-077-1. Electronic text, Lan electronic library system. Retrieved from https://e.lanbook.com/book/58703

Recommended Additional Bibliography

  • Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. New York, NY: Springer. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1175341