Master's programme 2022/2023

Distributed Data Processing

Field of study: 38.04.05. Business Informatics
When taught: 1st year, modules 3 and 4
Mode of study: with an online course
Online hours: 16
Audience: open to all HSE University campuses
Programme: Business Analytics and Big Data Systems
Language: English
Credits: 6
Contact hours: 32

Course Syllabus

Abstract

The main goal of this course is to give students the conceptual background and practical tools needed for Big Data analytics and real-time computation. The course briefly reviews the specific challenges of Big Data analytics, such as extracting, unifying, updating, and merging information, and the need for data processing to be highly parallel and distributed. We will see how different mathematical approaches can be adapted to meet Big Data demands.

Learning Objectives

  • Learn the fundamentals of data analysis using distributed data processing frameworks, laying the foundation for combining data with advanced analytics at scale and in production environments

Expected Learning Outcomes

  • Identify when a big data problem needs data integration
  • Describe the connections between data management operations and the big data processing patterns needed to utilize them in large-scale analytical applications
  • Retrieve data from example database and big data management systems
  • Execute simple big data integration and processing on Hadoop and Spark platforms
  • Write scalable Spark SQL code that executes against a cluster of machines
  • Inspect the Spark UI to analyze query performance and identify bottlenecks

Course Contents

  • Introduction to Distributed Data Processing
  • Retrieving Big Data
  • Big Data Integration
  • Processing Big Data
  • Big Data Analytics using Spark

Assessment Elements

  • non-blocking Homework
    After each seminar, students receive homework on the material covered
  • non-blocking Project
    The final course project, demonstrating all the skills acquired
  • non-blocking Exam
    Final test

Interim Assessment

  • 2022/2023 4th module
    0.3 * Exam + 0.25 * Homework + 0.45 * Project
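
The formula above is a weighted sum of the three assessment elements. A minimal sketch of the arithmetic follows; the function name `interim_grade`, the 10-point scale, and the sample scores are assumptions made for illustration, and the syllabus does not state a rounding rule, so none is applied inside the function.

```python
# Weights taken from the interim assessment formula in the syllabus.
WEIGHTS = {"Exam": 0.3, "Homework": 0.25, "Project": 0.45}


def interim_grade(scores):
    """Return the weighted sum of the Exam, Homework, and Project scores."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores on an assumed 10-point scale (not from the syllabus).
example = {"Exam": 8, "Homework": 9, "Project": 7}
print(round(interim_grade(example), 2))
```

Because the Project carries the largest weight (0.45), a one-point change in the project score moves the result more than a one-point change in the exam or the homework.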

Bibliography

Recommended Core Bibliography

  • Hoger Khayrolla Omar, & Alaa Khalil Jumaa. (2019). Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java. https://doi.org/10.24017/science.2019.1.2
  • Jules S. Damji, Brooke Wenig, Tathagata Das, & Denny Lee. (2020). Learning Spark. O’Reilly Media.
  • Luu, H. (2018). Beginning Apache Spark 2: With Resilient Distributed Datasets, Spark SQL, Structured Streaming and Spark Machine Learning Library. Berkeley: Apress.
  • Romeo Kienzler, Md. Rezaul Karim, Sridhar Alla, Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, & Shuen Mei. (2018). Apache Spark 2: Data Processing and Real-Time Analytics : Master Complex Big Data Processing, Stream Analytics, and Machine Learning with Apache Spark. Birmingham: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1991793

Recommended Additional Bibliography

  • Brajesh Mishra. (2020). Big Data Analysis Using Hadoop Map Reduce. https://doi.org/10.26562/irjcs.2020.v0705.005
  • Field, L., & Newcomb, O. (2012). Distributed Computing : Concepts, Architecture and Applications. Delhi: Academic Studio. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=446466
  • Kienzler, R. (2017). Mastering Apache Spark 2.x (2nd ed.). Birmingham: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1562681
  • Langewisch, R. P. (2016). A performance study of an implementation of the push-relabel maximum flow algorithm in Apache Spark’s GraphX.
  • Ryza, S., Laserson, U., Owen, S., & Wills, J. (2017). Advanced Analytics with Spark: Patterns for Learning From Data at Scale (2nd ed.). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1533378