Master's Program 2019/2020

Technologies for Extracting Knowledge from Large Volumes of Data

Best course by the criterion "Novelty of the knowledge gained"
Status: Elective course (Big Data Systems)
Field of study: 38.04.05. Business Informatics
Taught by: Department of Innovation and Business in Information Technologies
When taught: 1st year, modules 3 and 4
Mode of study: without an online course
Instructors: Avdeeva Zinaida Konstantinovna, Golubtsov Petr Viktorovich
Program: Big Data Systems
Language: English
Credits: 4
Contact hours: 48

Course Syllabus

Abstract

The main goal of this course is to give students the conceptual background and mathematical tools needed for Big Data analytics and real-time computation. The course briefly reviews the specific challenges of Big Data analytics: extracting, unifying, updating, and merging information, and the need for data processing that is highly parallel and distributed. With these features in mind, we then study more closely a number of mathematical tools for Big Data analytics, such as regression analysis, linear estimation, calibration problems, and real-time processing of incoming (potentially infinite) data. We will see how these approaches can be transformed to meet the demands of Big Data.
Learning Objectives

  • Build theoretical knowledge and basic practical skills in collecting, storing, processing, and analyzing large data sets.
  • Develop theoretical and practical skills in analyzing large data sets across a wide range of applications.
  • Understand the main principles of approaching big data problems in large-scale distributed systems.
  • Design efficient representations of intermediate information for various data processing problems.
  • Apply optimal linear estimation methods in a big data context.
  • Design and use calibration techniques for cases where the measurement process is unknown.
  • Apply the essential tools and techniques of distributed data processing in practice.
Expected Learning Outcomes

  • Understands the main principles of approaching big data problems in large-scale distributed systems.
  • Develops efficient representations of intermediate information for various simple data processing problems.
  • Designs and implements simple distributed algorithms in MapReduce style.
  • Develops linear regression methods in the context of distributed and emerging data.
  • Implements a linear regression algorithm for distributed and emerging data.
  • Masters optimal linear estimation approaches.
  • Develops an optimal representation of intermediate information in simple linear estimation problems.
  • Studies general approaches to parallelizing linear estimation and determines the best form of canonical information.
  • Learns the optimal linear estimation technique.
  • Studies various forms of information representation and the transformations between them.
  • Develops a program that efficiently accumulates information and constructs an optimal estimate with or without prior information.
  • Learns how to formulate a data processing problem when there is uncertainty in the model of the data source.
  • Studies general approaches to parallelizing the corresponding algorithm and determines the best form of canonical information.
  • Learns how to formulate a data processing problem when the model of the data source is unknown.
  • Studies the general approach to determining the data source model through calibration.
  • Develops a program that efficiently accumulates calibration information and constructs an optimal estimate.
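
As an illustration of the MapReduce-style regression outcomes above, here is a minimal sketch, not the course's own implementation. It assumes simple linear regression y = a + b*x fitted by ordinary least squares, with each data chunk compressed into five sums that can be merged by addition. The names RegressionStats, map_chunk, and fit_distributed are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class RegressionStats:
    """Mergeable sufficient statistics for y = a + b*x (hypothetical name)."""
    n: int = 0        # number of points
    sx: float = 0.0   # sum of x
    sy: float = 0.0   # sum of y
    sxx: float = 0.0  # sum of x*x
    sxy: float = 0.0  # sum of x*y

    def merge(self, other):
        # Component-wise addition: associative and commutative, so chunks
        # can be combined in any order and on any worker.
        return RegressionStats(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.sxy + other.sxy,
        )

    def fit(self):
        # Closed-form least-squares solution from the accumulated sums.
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        b = (self.n * self.sxy - self.sx * self.sy) / denom
        a = (self.sy - b * self.sx) / self.n
        return a, b


def map_chunk(chunk):
    """Map step: compress one chunk of (x, y) pairs into statistics."""
    stats = RegressionStats()
    for x, y in chunk:
        stats = stats.merge(RegressionStats(1, x, y, x * x, x * y))
    return stats


def fit_distributed(chunks):
    """Reduce step: merge per-chunk statistics, then solve once."""
    total = reduce(RegressionStats.merge, map(map_chunk, chunks), RegressionStats())
    return total.fit()


if __name__ == "__main__":
    # Made-up data lying on the line y = 1 + 2x, split into three chunks.
    chunks = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0)], [(3.0, 7.0), (4.0, 9.0), (5.0, 11.0)]]
    print(fit_distributed(chunks))
```

The point of the design is that raw data never has to be gathered in one place: only the fixed-size statistics travel between workers, and newly arriving data can be merged into an existing total in the same way.
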
Course Contents

  • Introduction to information processing in distributed systems
    Simple examples and properties of various forms of information representation. MapReduce approach.
  • Distributed information management for linear regression problems
    Redesigning linear regression methods for efficient parallelization in big data problems.
  • Optimal linear estimation in Big Data context
    Linear experiment, optimal estimation problem. Combining linear experiments in “raw” form. Canonical information for linear experiments.
  • Optimal estimation with prior information
    Updating prior information. Manipulating information in various forms.
  • Imperfect information about experiment
    Optimal estimation with uncertainty in measurement transformation.
  • Calibration problem
    Unknown measurement transformation. Calibration problem. Canonical calibration information. Various types of canonical information.
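
The idea of canonical information for linear experiments, with and without prior information, can be sketched in a deliberately simplified scalar setting. This is an illustration under stated assumptions, not the course's notation: a single unknown x is measured as y_i = a_i * x + noise_i, with independent zero-mean noise of known variance, and the prior is independent of the noise. The class name CanonicalInfo and its methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalInfo:
    """Accumulated information about a scalar unknown x (hypothetical name).

    For measurements y_i = a_i * x + noise_i with noise variance s2_i:
      T = sum(a_i**2 / s2_i)      (total precision)
      v = sum(a_i * y_i / s2_i)   (precision-weighted data)
    """
    T: float = 0.0
    v: float = 0.0

    @staticmethod
    def from_measurement(a, y, s2):
        # Information carried by one linear measurement.
        return CanonicalInfo(a * a / s2, a * y / s2)

    @staticmethod
    def from_prior(mean, variance):
        # A prior behaves like a direct measurement of x (a = 1).
        return CanonicalInfo(1.0 / variance, mean / variance)

    def merge(self, other):
        # Combining independent pieces of information is plain addition.
        return CanonicalInfo(self.T + other.T, self.v + other.v)

    def estimate(self):
        # Weighted least-squares estimate and its variance.
        if self.T == 0:
            raise ValueError("no information accumulated")
        return self.v / self.T, 1.0 / self.T


if __name__ == "__main__":
    # Made-up measurements: (a, y, noise variance).
    info = CanonicalInfo()
    for a, y, s2 in [(1.0, 2.0, 1.0), (2.0, 6.0, 4.0)]:
        info = info.merge(CanonicalInfo.from_measurement(a, y, s2))
    print(info.estimate())
    # Adding prior information uses the same merge operation.
    print(CanonicalInfo.from_prior(0.0, 0.5).merge(info).estimate())
```

Because merge is associative and commutative, the measurements can be split across machines or arrive over time, and the estimate depends only on the accumulated pair (T, v). In the vector case the same roles would be played by a matrix and a vector, at the cost of a linear solve in the final step.
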
Assessment Elements

  • non-blocking Home Works
    Home Works take the form of theoretical or practical mini-projects.
  • non-blocking seminar activity
  • non-blocking Oral exam
    Examination format: The exam is oral.
    Platform: The exam is held on Zoom. Students must join the session 15 minutes before it begins.
    Before the exam, a student must:
      • Check the computer for compliance with the technical requirements no later than 7 days before the exam.
      • Sign in with the corporate account (@edu.hse.ru).
      • Check the microphone, speakers or headphones, webcam, and Internet connection (connecting the computer to the network by cable is recommended, if possible).
      • Prepare the necessary writing materials, such as pens, pencils, and paper.
      • Close all applications on the computer other than the Zoom application or the browser used to log in to Zoom.
    If any of the requirements for participation cannot be met, the student must inform the professor and the program manager 2 weeks before the exam date so that a decision on the student's participation can be made.
    Students are not allowed to:
      • Turn off the video camera.
      • Use smart gadgets (smartphone, tablet, etc.).
      • Involve outsiders for help or talk to outsiders during the examination tasks.
      • Read tasks out loud.
    Students are allowed to:
      • Write on paper and use a pen for notes and calculations.
      • Turn on the microphone to answer the teacher's questions.
      • Ask the teacher for additional information needed to understand the exam task.
    Connection failures: A short-term communication failure is the loss of the student's network connection to Zoom for no longer than 1 minute. A long-term communication failure is the loss of the connection for longer than 1 minute. A student cannot continue the exam after a long-term communication failure. In that case, the student must notify the teacher, record the fact of the lost connection (a screenshot or a response from the Internet provider), and then contact the program manager with an explanatory note about the incident so that a decision on retaking the exam can be made.
    The retake procedure is the same as the exam procedure.
Interim Assessment

  • Interim assessment (module 4)
    0.12 * Home Works + 0.4 * Oral exam + 0.48 * seminar activity
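
The weighting above can be checked with a short sketch. It assumes all three components are graded on the same scale and says nothing about rounding, which the syllabus does not specify. The names WEIGHTS and interim_grade are hypothetical.

```python
# Weights taken from the interim assessment formula; they sum to 1.
WEIGHTS = {"home_works": 0.12, "oral_exam": 0.40, "seminar_activity": 0.48}


def interim_grade(home_works, oral_exam, seminar_activity):
    """Weighted sum of the three assessment elements (no rounding applied)."""
    return (
        WEIGHTS["home_works"] * home_works
        + WEIGHTS["oral_exam"] * oral_exam
        + WEIGHTS["seminar_activity"] * seminar_activity
    )


if __name__ == "__main__":
    # Made-up component grades, for illustration only.
    print(interim_grade(home_works=10, oral_exam=5, seminar_activity=8))
```

Seminar activity carries the largest weight, so it affects the result more than the oral exam or the homework.
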
Bibliography

Recommended Core Bibliography

  • Beck, V. L. (2017). Linear Regression : Models, Analysis, and Applications. Hauppauge, New York: Nova Science Publishers, Inc. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1562876
  • Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107–113. https://doi.org/10.1145/1327452.1327492

Recommended Additional Bibliography

  • Barbier, J., Dia, M., Macris, N., & Krzakala, F. (2016). The Mutual Information in Random Linear Estimation. https://doi.org/10.1109/ALLERTON.2016.7852290
  • Manoel, A., Krzakala, F., Mézard, M., & Zdeborová, L. (2017). Multi-Layer Generalized Linear Estimation. https://doi.org/10.1109/ISIT.2017.8006899
  • Sani, A., & Vosoughi, A. (2017). On Distributed Linear Estimation With Observation Model Uncertainties. https://doi.org/10.1109/TSP.2018.2824279