Master's programme
2020/2021
Theoretical Foundations of Distributed Information Processing in Big Data Systems
Best by the criterion "Usefulness of the course for broadening horizons and all-round development"
Best by the criterion "Novelty of the knowledge gained"
Status:
Elective course
Field of study:
38.04.05. Business Informatics
Delivered by:
Department of Business Informatics
Where taught:
Graduate School of Business
When taught:
Year 1, modules 2 and 3
Mode of study:
without an online course
Instructors:
Petr Viktorovich Golubtsov
Programme:
Big Data Systems
Language:
English
Credits:
5
Contact hours:
40
Course Syllabus
Abstract
The main goal of this course is to give students the conceptual background and mathematical tools applicable to Big Data analytics and real-time computation. The course briefly reviews the specific challenges of Big Data analytics, such as extracting, unifying, updating, and merging information, and the need for data processing to be highly parallel and distributed. With these features in mind, we then study more closely a number of mathematical tools for Big Data analytics: regression analysis, linear estimation, calibration problems, and real-time processing of incoming (potentially infinite) data. We will see how these approaches can be transformed to meet Big Data demands.
Learning Objectives
- To form theoretical knowledge and basic practical skills in the collection, storage, processing, and analysis of large data.
- To develop theoretical and practical skills in analyzing large data across a wide range of applications.
- To understand the main principles of approaching big data problems for large-scale distributed systems.
- To design an efficient representation of intermediate information for various data processing problems.
- To apply optimal linear estimation methods in the big data context.
- To design and use calibration techniques in cases where the measurement process is unknown.
- To be able to apply the essential tools and techniques of distributed data processing in practice.
Expected Learning Outcomes
- Understands the main principles of approaching big data problems for large-scale distributed systems
- Develops efficient representations of intermediate information for various simple data processing problems
- Designs and implements simple distributed algorithms in MapReduce style
- Develops linear regression methods in the context of distributed and emerging data
- Implements a linear regression algorithm for distributed and emerging data
- Masters optimal linear estimation approaches
- Develops an optimal representation of intermediate information in simple linear estimation problems
- Studies general approaches to parallelizing linear estimation and determines the best form of canonical information
- Learns the optimal linear estimation technique
- Studies various forms of information representation and transformations between them
- Develops a program for efficiently accumulating information and constructing an optimal estimate with or without prior information
- Learns how to formulate a data processing problem with uncertainty in the model of the data source
- Studies general approaches to parallelizing the corresponding algorithm and determines the best form of canonical information
- Learns how to formulate a data processing problem when the model of the data source is unknown
- Studies the general approach to determining the data source model through calibration
- Develops a program for efficiently accumulating calibration information and constructing an optimal estimate
Course Contents
- Introduction to information processing in distributed systems: Simple examples and properties of various forms of information representation. MapReduce approach.
- Distributed information management for linear regression problems: Redesigning linear regression methods for efficient parallelization in big data problems.
- Optimal linear estimation in Big Data context: Linear experiment, optimal estimation problem. Combining linear experiments in “raw” form. Canonical information for linear experiments.
- Optimal estimation with prior information: Updating prior information. Manipulating information in various forms.
- Imperfect information about experiment: Optimal estimation with uncertainty in measurement transformation.
- Calibration problem: Unknown measurement transformation. Canonical calibration information. Various types of canonical information.
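As an illustration of the kind of technique these topics refer to, the following is a minimal sketch (not taken from the course materials) of simple linear regression in MapReduce style. It assumes that the intermediate "information" for each data chunk can be stored as a small tuple of additive sums, so that chunks can be summarized independently and merged in any order. The function names `summarize`, `merge`, and `fit`, and the sample data, are hypothetical and chosen only for this example.

```python
from functools import reduce

def summarize(chunk):
    """Map step: compress a chunk of (x, y) pairs into additive sums."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)

def merge(a, b):
    """Reduce step: two summaries combine by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))

def fit(summary):
    """Recover least-squares slope and intercept from the merged summary."""
    n, sx, sy, sxx, sxy = summary
    denom = n * sxx - sx * sx  # zero if all x values coincide
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical data split across three "nodes"; the points lie on y = 2x + 1.
chunks = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]

total = reduce(merge, map(summarize, chunks))
slope, intercept = fit(total)
print(slope, intercept)
```

Because `merge` is associative and commutative, the same summary can also be updated one observation at a time, which is one way to handle incoming data without storing the raw points.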
Assessment Elements
- Home Works: Homework assignments take the form of theoretical or practical mini-projects.
- Seminar activity
- Exam: The exam is oral and is held in MS Teams. During the exam you will have to share your screen and present your results.
Interim Assessment
- Interim assessment (module 3): 0.4 * Exam + 0.48 * Home Works + 0.12 * Seminar activity
Bibliography
Recommended Core Bibliography
- Beck, V. L. (2017). Linear Regression : Models, Analysis, and Applications. Hauppauge, New York: Nova Science Publishers, Inc. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1562876
- Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107–113. https://doi.org/10.1145/1327452.1327492
Recommended Additional Bibliography
- Barbier, J., Dia, M., Macris, N., & Krzakala, F. (2016). The Mutual Information in Random Linear Estimation. https://doi.org/10.1109/ALLERTON.2016.7852290
- Manoel, A., Krzakala, F., Mézard, M., & Zdeborová, L. (2017). Multi-Layer Generalized Linear Estimation. https://doi.org/10.1109/ISIT.2017.8006899
- Sani, A., & Vosoughi, A. (2017). On Distributed Linear Estimation With Observation Model Uncertainties. https://doi.org/10.1109/TSP.2018.2824279