Master 2018/2019

Knowledge Discovery in Data at Scale Technologies

Type: Elective course (Big Data Systems)
Area of studies: Business Informatics
Delivered by: Department of Innovation and Business in Information Technologies
When: Year 1, modules 3 and 4
Mode of studies: Full time
Master’s programme: Big Data Systems
Language: English
ECTS credits: 3

Course Syllabus

Abstract

The main goal of this course is to give students the conceptual background and mathematical tools needed for Big Data analytics and real-time computation. The course briefly reviews the specific challenges of Big Data analytics, such as extracting, unifying, updating, and merging information, and the requirement that data processing be highly parallel and distributed. With these features in mind, we then study more closely a number of mathematical tools for Big Data analytics: regression analysis, linear estimation, calibration problems, and real-time processing of incoming (potentially infinite) data. We will see how these approaches can be adapted to meet Big Data demands.
Learning Objectives

  • Build theoretical knowledge and basic practical skills in collecting, storing, processing, and analyzing large data sets.
  • Develop theoretical and practical skills in analyzing large data sets across a wide range of applications.
  • Understand the main principles of approaching big data problems in large-scale distributed systems.
  • Design efficient representations of intermediate information for various data processing problems.
  • Apply optimal linear estimation methods in a big data context.
  • Design and use calibration techniques for cases where the measurement process is unknown.
  • Apply the essential tools and techniques of distributed data processing in practice.
Expected Learning Outcomes

  • Understands the main principles of approaching big data problems in large-scale distributed systems.
  • Develops efficient representations of intermediate information for various simple data processing problems.
  • Designs and implements simple distributed algorithms in MapReduce style.
  • Develops linear regression methods for distributed and emerging data.
  • Implements a linear regression algorithm for distributed and emerging data.
  • Masters optimal linear estimation approaches.
  • Develops optimal representations of intermediate information in simple linear estimation problems.
  • Studies general approaches to parallelizing linear estimation and determines the best form of canonical information.
  • Learns the optimal linear estimation technique.
  • Studies various forms of information representation and the transformations between them.
  • Develops a program that efficiently accumulates information and constructs the optimal estimate with or without prior information.
  • Learns how to formulate a data processing problem when the model of the data source is uncertain.
  • Studies general approaches to parallelizing the corresponding algorithm and determines the best form of canonical information.
  • Learns how to formulate a data processing problem when the model of the data source is unknown.
  • Studies the general approach to determining the data source model through calibration.
  • Develops a program that efficiently accumulates calibration information and constructs the optimal estimate.
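As an illustration of what "efficient representation of intermediate information" and "MapReduce style" can mean in the simplest case, here is a minimal Python sketch for computing a mean and variance over data split into chunks. It is not taken from the course materials: the function names (summarize, merge, finalize) and the sample data are assumptions made for this example. Each chunk is reduced to a small summary (count, sum, sum of squares), and summaries combine by addition, so chunks can be processed in any order or on different machines.

```python
from functools import reduce

def summarize(chunk):
    """Map step: reduce one chunk to (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(v * v for v in chunk))

def merge(a, b):
    """Reduce step: summaries combine by elementwise addition."""
    return tuple(x + y for x, y in zip(a, b))

def finalize(summary):
    """Turn the accumulated summary into (mean, population variance)."""
    n, s, q = summary
    mean = s / n
    return mean, q / n - mean ** 2

# Hypothetical data, already split into chunks of unequal size.
chunks = [[1, 2, 3], [4, 5], [6, 7, 8]]
total = reduce(merge, (summarize(c) for c in chunks))
print(finalize(total))  # prints (4.5, 5.25)
```

The summary has a fixed size regardless of how much data it represents, and merging is associative and commutative, which is what makes the computation easy to distribute. The sum-of-squares form is chosen here for brevity; it can lose precision when the mean is large relative to the spread.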
Course Contents

  • Introduction to information processing in distributed systems.
    Simple examples and properties of various forms of information representation. The MapReduce approach.
  • Distributed information management for linear regression problems.
    Redesigning linear regression methods for efficient parallelization in big data problems.
  • Optimal linear estimation in a Big Data context.
    Linear experiment, optimal estimation problem. Combining linear experiments in “raw” form. Canonical information for linear experiments.
  • Optimal estimation with prior information.
    Updating prior information. Manipulating information in various forms.
  • Imperfect information about the experiment.
    Optimal estimation with uncertainty in the measurement transformation.
  • Calibration problem.
    Unknown measurement transformation. Canonical calibration information. Various types of canonical information.
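The idea behind redesigning linear regression for parallel processing can be sketched as follows. This is a minimal Python example under stated assumptions, not the course's own formulation: it assumes measurements y = A x + noise with uncorrelated noise of known, equal variance, and it summarizes each chunk as the pair (T, v) = (AᵀA / σ², Aᵀy / σ²), which is one standard choice of additive summary. The names chunk_information, merge, and estimate, and the synthetic data, are hypothetical. Prior information with a given mean and variance is assumed to enter as one more summary of the same form.

```python
from functools import reduce

import numpy as np

def chunk_information(A, y, noise_var):
    """Map step: summarize one chunk of measurements y = A x + noise as (T, v)."""
    return A.T @ A / noise_var, A.T @ y / noise_var

def merge(info_a, info_b):
    """Reduce step: summaries of independent chunks combine by addition."""
    return info_a[0] + info_b[0], info_a[1] + info_b[1]

def estimate(info):
    """Recover the estimate by solving T x = v."""
    T, v = info
    return np.linalg.solve(T, v)

# Synthetic data (assumption for illustration): 600 measurements of 3 parameters.
rng = np.random.default_rng(0)
x_true = np.array([2.0, -1.0, 0.5])
noise_std = 0.1
A = rng.normal(size=(600, 3))
y = A @ x_true + noise_std * rng.normal(size=600)

# Split into three chunks, summarize each independently, then merge.
chunks = [(A[i:i + 200], y[i:i + 200]) for i in range(0, 600, 200)]
total = reduce(merge, (chunk_information(a, b, noise_std ** 2) for a, b in chunks))
x_hat = estimate(total)

# Prior information (assumed zero mean, variance 100) is merged like any other chunk.
prior_mean = np.zeros(3)
prior_var = 100.0
prior_info = (np.eye(3) / prior_var, prior_mean / prior_var)
x_hat_prior = estimate(merge(prior_info, total))
```

With equal noise variance and no prior, the merged estimate coincides with the ordinary least-squares solution computed on the full data set, yet no single worker ever needs to hold all the raw measurements. The size of (T, v) depends only on the number of parameters, not on the number of measurements.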
Assessment Elements

  • Homework (non-blocking)
    Homework assignments take the form of theoretical or practical mini-projects.
  • Seminar activity (non-blocking)
  • Oral exam (non-blocking)
Interim Assessment

  • Interim assessment (module 4)
    0.12 * Homework + 0.4 * Oral exam + 0.48 * Seminar activity
Bibliography

Recommended Core Bibliography

  • Beck, V. L. (2017). Linear Regression : Models, Analysis, and Applications. Hauppauge, New York: Nova Science Publishers, Inc. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1562876
  • Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107–113. https://doi.org/10.1145/1327452.1327492

Recommended Additional Bibliography

  • Barbier, J., Dia, M., Macris, N., & Krzakala, F. (2016). The Mutual Information in Random Linear Estimation. https://doi.org/10.1109/ALLERTON.2016.7852290
  • Manoel, A., Krzakala, F., Mézard, M., & Zdeborová, L. (2017). Multi-Layer Generalized Linear Estimation. https://doi.org/10.1109/ISIT.2017.8006899
  • Sani, A., & Vosoughi, A. (2017). On Distributed Linear Estimation With Observation Model Uncertainties. https://doi.org/10.1109/TSP.2018.2824279