Distributed Data Processing

Master's programme 2022/2023

Status: Compulsory course (Business Analytics and Big Data Systems)

Field of study: 38.04.05. Business Informatics

Delivered by: Department of Business Informatics

Where: Graduate School of Business

When: Year 1, modules 3 and 4

Mode of study: with an online course

Online hours: 16

Open to: all HSE University campuses

Instructor: Gomenyuk Kirill Sergeevich

Programme: Business Analytics and Big Data Systems

Language: English

Credits: 6

Contact hours: 32


Abstract

Creating and implementing trading strategies for the stock market with the Financial Exchange API covers several areas: programming, financial analysis, the theory of trading strategies, and their practical implementation.

Learning Objectives

Learn the fundamentals of data analysis with distributed data processing frameworks, laying the foundation for combining data with advanced analytics at scale and in production environments.

Expected Learning Outcomes

Identify when a big data problem needs data integration
Describe the connections between data management operations and the big data processing patterns needed to utilize them in large-scale analytical applications
Retrieve data from example database and big data management systems
Execute simple big data integration and processing on Hadoop and Spark platforms
Write scalable Spark SQL code that executes against a cluster of machines
Inspect the Spark UI to analyze query performance and identify bottlenecks

Course Contents

Introduction to Distributed Data Processing
Retrieving Big Data
Big Data Integration
Processing Big Data
Big Data Analytics using Spark

Assessment Elements

Homework
After each seminar, students receive homework on the material covered.
Project
The final course project, which demonstrates all the skills mastered.
Exam
Final test

Interim Assessment

2022/2023 4th module
0.25 * Homework + 0.45 * Project + 0.3 * Exam
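
The weighting above can be worked through with a short Python sketch. The weights come from the formula; the function name, the sample scores, and the assumption that all three elements are graded on the same scale are illustrative only, and no rounding rule is implied.

```python
# Weights taken from the interim assessment formula.
WEIGHTS = {"homework": 0.25, "project": 0.45, "exam": 0.30}


def interim_grade(scores):
    """Return the weighted sum of the three assessment elements."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, all assumed to be on the same scale.
example = {"homework": 8, "project": 7, "exam": 6}
print(round(interim_grade(example), 2))  # prints 6.95
```

With these sample scores the project contributes the most (0.45 * 7 = 3.15), which reflects its weight as the largest single element.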

Bibliography

Recommended Core Bibliography

Hoger Khayrolla Omar, & Alaa Khalil Jumaa. (2019). Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java. https://doi.org/10.24017/science.2019.1.2
Jules S. Damji, Brooke Wenig, Tathagata Das, & Denny Lee. (2020). Learning Spark. O’Reilly Media.
Luu, H. (2018). Beginning Apache Spark 2: With Resilient Distributed Datasets, Spark SQL, Structured Streaming and Spark Machine Learning Library. Berkeley: Apress.
Romeo Kienzler, Md. Rezaul Karim, Sridhar Alla, Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, & Shuen Mei. (2018). Apache Spark 2: Data Processing and Real-Time Analytics: Master Complex Big Data Processing, Stream Analytics, and Machine Learning with Apache Spark. Birmingham: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1991793

Recommended Additional Bibliography

Brajesh Mishra. (2020). Big Data Analysis Using Hadoop Map Reduce. https://doi.org/10.26562/irjcs.2020.v0705.005
Field, L., & Newcomb, O. (2012). Distributed Computing: Concepts, Architecture and Applications. Delhi: Academic Studio. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=446466
Kienzler, R. (2017). Mastering Apache Spark 2.x - Second Edition (Vol. 2nd ed). Birmingham: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1562681
Langewisch, R. P. (2016). A performance study of an implementation of the push-relabel maximum flow algorithm in Apache Spark’s GraphX.
Ryza, S., Laserson, U., Owen, S., & Wills, J. (2017). Advanced Analytics with Spark: Patterns for Learning From Data at Scale (2nd ed.). Sebastopol, CA: O’Reilly Media. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1533378

Authors

Beklarian Armen Levonovich
