Master's programme 2020/2021

Data and Service Engineering for Business Process Automation

Status: Elective course (Data Science)
Field of study: 01.04.02. Applied Mathematics and Informatics
When taught: 2nd year, modules 1 and 2
Mode of study: without an online course
Degree programme: Data Science
Language: English
Credits: 8
Contact hours: 56

Course Syllabus

Abstract

Machine learning is changing the world rapidly and dramatically, and every modern enterprise is now eyeing it as one of the top instruments for improving business KPIs. Yet behind any successful application of machine learning lies a large amount of engineering work, including Data Engineering functions such as data cleaning, wrangling, and integration. The models must also be deployed in production as reliable services. Finally, advanced analytics is needed to understand how the service is operating. In this course you will learn the basics of these engineering and analytic disciplines. We won’t focus on machine learning algorithms in this course; knowledge of them is a prerequisite.
Learning Objectives

  • To gain basic proficiency in data engineering, understand the key concepts, technologies and challenges of this subject area.
Expected Learning Outcomes

  • Understanding of course content.
  • Understanding of relational model, SQL, its power and its limitations.
  • Understanding of non-relational databases, when they should be used, and their strengths and weaknesses.
  • Understanding of different Enterprise Architectures for real-time online businesses, various trade-offs of using each type of architecture.
  • Understanding of basic reliability and durability mechanisms used in database and streaming systems.
  • Understanding of query processing and optimisation in relational systems, ability to reason about and optimise query plans.
  • Understanding of Big Data technologies, including Hadoop and Spark stack and massively parallel DBMSs.
  • Basic understanding of problems in data integration and data cleaning, familiarity with ETL processes and data warehouses.
  • Understanding of the key aspects of reliability of ML services and the key technologies for building a reliable machine learning service.
  • Understanding of advanced anomaly detection and collective learning techniques and their applications in building machine learning services.
Course Contents

  • Introduction
    Here we’ll learn why it’s hard to train a machine learning model, quickly put it into production, and move on to another project. What extra problems creep up during this process? What extra risks appear when the model is transferred to production? We’ll give an overview of general decision systems based on Data Science. We’ll also dive into a specific business scenario that will be the guiding example in our course: an online credit business. We will go over the business model, the major KPIs, the constraints the business places on possible machine learning solutions, and some fundamental problems.
  • Relational Data Model and Databases
    Data in modern businesses comes in a variety of types, from basic textual and numeric data to geographical data, images, videos, time series, etc. We will go over the basic data types and show how they are best used in machine learning tasks. Then we’ll look in detail at relational data models.
  • Non-relational Databases
    We’ll look in detail at non-relational data models.
  • Event-based data models. Kappa and Lambda architectures. Process mining.
    A typical business can be described as a set of business processes, and the event-based data model captures all the important events generated by these processes. A log of such events is at the core of modern real-time architectures such as Lambda and Kappa. We’ll study how to recover all the needed data from the event log and how to test hypotheses on top of such a log. We’ll create usable data marts on top of event logs for analytics. We’ll study advanced analytics techniques such as process mining and cohort analysis.
  • Durability and Reliability of Databases and Streaming Systems
    We'll learn the basic reliability and durability mechanisms used in database and streaming systems.
  • Query Processing in Relational Systems
    We'll learn query processing and optimisation in relational systems, including how to reason about and optimise query plans.
  • Big Data
    Big Data technologies, Hadoop and Spark stack and massively parallel DBMSs.
  • Data Integration and cleaning
    What are the typical problems with data quality? How can we increase data quality? Data integration problem: semantic data integration, virtual data integration.
  • Building a reliable ML service
    Key aspects of reliability of ML services and key technologies for building a reliable machine learning service.
  • Anomaly detection and collective learning
    Advanced anomaly detection and collective learning techniques and their applications in building machine learning services.
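
To make the event-based data model topic above more concrete, here is a minimal sketch of how current state and a simple cohort view might be recovered from an event log. It is an illustration only, not course material: the event tuples, the event types `loan_issued` and `payment`, and the functions `replay_balances` and `cohort_repayment` are all hypothetical names invented for this example, loosely modelled on the online credit scenario.

```python
from collections import defaultdict
from datetime import date

# Hypothetical event log: (event_date, customer_id, event_type, amount).
# All values are made up for illustration.
EVENTS = [
    (date(2020, 9, 1), "c1", "loan_issued", 100.0),
    (date(2020, 9, 3), "c2", "loan_issued", 200.0),
    (date(2020, 9, 20), "c1", "payment", 40.0),
    (date(2020, 10, 2), "c3", "loan_issued", 150.0),
    (date(2020, 10, 5), "c1", "payment", 60.0),
    (date(2020, 10, 9), "c2", "payment", 50.0),
]


def replay_balances(events):
    """Rebuild each customer's outstanding balance by replaying the log in date order."""
    balances = defaultdict(float)
    for _, customer, kind, amount in sorted(events):
        if kind == "loan_issued":
            balances[customer] += amount
        elif kind == "payment":
            balances[customer] -= amount
    return dict(balances)


def cohort_repayment(events):
    """Group customers by the month of their first loan; sum issued and repaid amounts."""
    cohort_of = {}
    totals = defaultdict(lambda: {"issued": 0.0, "repaid": 0.0})
    for day, customer, kind, amount in sorted(events):
        if kind == "loan_issued":
            # The cohort is fixed by the first loan and never changes afterwards.
            cohort_of.setdefault(customer, day.strftime("%Y-%m"))
            totals[cohort_of[customer]]["issued"] += amount
        elif kind == "payment" and customer in cohort_of:
            totals[cohort_of[customer]]["repaid"] += amount
    return {cohort: dict(t) for cohort, t in totals.items()}


if __name__ == "__main__":
    print(replay_balances(EVENTS))
    print(cohort_repayment(EVENTS))
```

The point of the sketch is that both views are derived purely from the log, so a new data mart can be added later by writing another replay function rather than by changing how events are recorded.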
Assessment Elements

  • non-blocking Programming task 1
  • non-blocking Programming task 2
  • non-blocking Exam
    You can receive full credit for the final automatically, if you do well on all the assignments.
Interim Assessment

  • Interim assessment (module 2)
    0.2 * Exam + 0.4 * Programming task 1 + 0.4 * Programming task 2
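
As a worked illustration of the weighting above, the following sketch computes the interim grade from three component scores. The example scores and the 10-point scale are assumptions made for this example; the syllabus does not state the scale or any rounding rule, so the function `interim_grade` (a hypothetical name) returns the unrounded weighted sum.

```python
# Weights taken from the interim assessment formula in the syllabus.
WEIGHTS = {"exam": 0.2, "task1": 0.4, "task2": 0.4}


def interim_grade(exam, task1, task2):
    """Return the weighted sum 0.2 * Exam + 0.4 * Task 1 + 0.4 * Task 2 (no rounding)."""
    scores = {"exam": exam, "task1": task1, "task2": task2}
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical scores on an assumed 10-point scale.
    print(interim_grade(exam=8, task1=10, task2=9))
```

With these hypothetical scores the sum is 0.2 * 8 + 0.4 * 10 + 0.4 * 9 = 1.6 + 4.0 + 3.6 = 9.2, which shows that the two programming tasks together account for 80% of the grade.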
Bibliography

Recommended Core Bibliography

  • Harrington, J. L. Relational database design and implementation. – Morgan Kaufmann, 2016. – 441 pp.

Recommended Additional Bibliography

  • Xu Z. et al. (ed.). Big Data: 6th CCF Conference, Big Data 2018, Xi'an, China, October 11-13, 2018, Proceedings. – Springer, 2018. – Vol. 945.