Master's Programme 2022/2023

Machine Learning on Big Data, Part 1

Status: Elective course
Field of study: 01.04.02. Applied Mathematics and Informatics
When taught: 2nd year, module 2
Mode of study: with an online course
Online hours: 82
Audience: home campus only
Programme: Master of Data Science
Language: English
Credits: 4
Contact hours: 8

Course Syllabus

Abstract

Large-scale machine learning requires fundamental knowledge of data storage and processing. You need to work with data that is too large for a single machine with standard hardware. Examples of such data include user logs for a particular service, a collection of media files, or Wikipedia articles. This 6-week course covers the main concepts and frameworks that are actively used in companies for which it is critical to analyze large amounts of data in the shortest possible time. These can be companies that own:
  • search engines (for example, Google, Yandex, Microsoft, Yahoo!)
  • social networks and blogs (Twitter, LinkedIn, etc.)
  • recommendation services (for example, Kinopoisk from Yandex)
The average time to complete this course depends on your background; you might spend 10 to 20 hours per week. To complete the course, students are expected to have skills in classical algorithms and data structures, the main concepts of machine learning, and Python programming.
Learning Objectives

After taking this course, students should be able to:
  • use a distributed file system
  • run tasks on a Hadoop cluster
  • write code to run on a Hadoop cluster using Hadoop streaming tools
  • use a high-level programming language to process large data on a computational cluster
  • solve search, indexing, and machine learning problems on a Hadoop cluster
Expected Learning Outcomes

  • Be able to use a distributed file system
  • Be able to run tasks on a Hadoop cluster
  • Be able to write code to run on a Hadoop cluster using Hadoop streaming tools
  • Be able to use a high-level programming language to process large data on a computational cluster
  • Be able to solve search, indexing, and machine learning problems on a Hadoop cluster
Course Contents

  • Big Data introduction
  • MapReduce paradigm and Hadoop framework
  • SQL over Big Data
  • Apache Spark
  • Apache Spark 2
  • Machine Learning on Spark
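
For orientation, the MapReduce paradigm named in the contents can be sketched on a single machine as a word count. This is only an illustration under stated assumptions, not course material: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and the in-memory `shuffle` stands in for the grouping that a framework such as Hadoop performs between the map and reduce phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all values by key (done by the framework on a real cluster)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over an in-memory collection of lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data on Spark"]))
```

On a cluster, the map and reduce steps would run as separate processes over partitions of the input, which is why each step here takes only its own input and keeps no shared state.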
Assessment Elements

  • non-blocking Quizzes
  • non-blocking Staff Graded Assignment: User routes on the site
  • non-blocking Programming Assignments
Interim Assessment

  • 2022/2023 2nd module
    0.15 * Staff Graded Assignment: User routes on the site + 0.7 * Programming Assignments + 0.15 * Quizzes
Bibliography

Recommended Core Bibliography

  • Kienzler, R. (2017). Mastering Apache Spark 2.x (2nd ed.). Birmingham: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1562681
  • White, T. (2015). Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale (4th ed.). O’Reilly Media.

Recommended Additional Bibliography

  • Damji, J. S., Wenig, B., Das, T., & Lee, D. (2020). Learning Spark. O’Reilly Media.