Machine Learning on Big Data, Part 1

Master's programme 2022/2023

Status: Elective course

Field of study: 01.04.02. Applied Mathematics and Informatics

Delivered by: Department of Big Data and Information Retrieval

Where: Faculty of Computer Science

When: 2nd year, module 2

Mode of study: with an online course

Online hours: 82

Open to: students of the home campus

Instructor: Anatoly Andreevich Bardukov

Programme: Master of Data Science

Language: English

Credits: 4

Contact hours: 8

Full Syllabus

Abstract

Large-scale machine learning requires a solid foundation in data storage and processing, because the data involved do not fit on a single machine with standard hardware. Examples of such data include user logs for a particular service, a collection of media files, or Wikipedia articles. This 6-week course covers the main concepts and frameworks used in companies for which it is critical to analyze large amounts of data in the shortest possible time. These can be companies that own:
- search engines (for example, Google, Yandex, Microsoft, Yahoo!),
- social networks and blogs (Twitter, LinkedIn, etc.),
- recommendation services (for example, Kinopoisk from Yandex).
The time needed to complete the course depends on your background; expect to spend 10 to 20 hours per week. Students are expected to have skills in classical algorithms and data structures, the main concepts of machine learning, and Python programming.

Learning Objectives

After taking this course, students should be able to:
● use a distributed file system
● run tasks on a Hadoop cluster
● write code to run on a Hadoop cluster using Hadoop streaming tools
● use a high-level programming language to process large data on a computational cluster
● solve search, index and machine learning problems on a Hadoop cluster

Expected Learning Outcomes

Be able to use a distributed file system
Be able to run tasks on a Hadoop cluster
Be able to write code to run on a Hadoop cluster using Hadoop streaming tools
Be able to use a high-level programming language to process large data on a computational cluster
Be able to solve search, index and machine learning problems on a Hadoop cluster
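
To make the Hadoop streaming outcome concrete, here is a minimal word-count sketch in Python. It is an illustration only, not taken from the course materials: the function names (map_lines, reduce_lines, run_locally) are hypothetical, and the shuffle step is simulated with a local sort. It assumes the usual streaming convention of tab-separated key/value text records.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper stage: emit one tab-separated (word, 1) record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer stage: sum the counts per key.

    The input must already be sorted by key, which is what the
    framework does between the map and reduce stages.
    """
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(records, key=lambda record: record[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Hypothetical helper: simulate map -> sort -> reduce on one machine."""
    shuffled = sorted(map_lines(lines))  # stands in for the cluster's sort
    return list(reduce_lines(shuffled))


for record in run_locally(["to be or not to be", "To do"]):
    print(record)
```

On a cluster, each stage would instead be a separate script that reads records from standard input and prints records to standard output; the framework handles splitting the input and sorting the mapper output by key.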

Course Contents

Big Data introduction
MapReduce paradigm and Hadoop framework
SQL over Big Data
Apache Spark
Apache Spark 2
Machine Learning on Spark

Assessment Elements

Quizzes
Staff Graded Assignment: User routes on the site
Programming Assignments

Interim Assessment

2022/2023 2nd module
0.15 * Staff Graded Assignment: User routes on the site + 0.7 * Programming Assignments + 0.15 * Quizzes
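
As a worked example of the formula above, the sketch below computes the interim grade for one set of component scores. The scores and the 10-point scale are assumptions made for illustration, and the syllabus does not state a rounding rule, so the result is only rounded for display.

```python
# Weights copied from the interim assessment formula.
WEIGHTS = {
    "staff_graded_assignment": 0.15,
    "programming_assignments": 0.70,
    "quizzes": 0.15,
}


def interim_grade(scores):
    """Return the weighted sum of the component scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores on an assumed 10-point scale.
example_scores = {
    "staff_graded_assignment": 8,
    "programming_assignments": 9,
    "quizzes": 6,
}

# 0.15 * 8 + 0.7 * 9 + 0.15 * 6 = 1.2 + 6.3 + 0.9 = 8.4
print(round(interim_grade(example_scores), 2))
```

Because the programming assignments carry 70% of the weight, a one-point change in that component moves the result by 0.7, compared with 0.15 for either of the other two.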

Bibliography

Recommended Core Bibliography

Kienzler, R. (2017). Mastering Apache Spark 2.x (2nd ed.). Birmingham: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1562681
White, T. (2015). Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale (4th ed.). O’Reilly Media.

Recommended Additional Bibliography

Damji, J. S., Wenig, B., Das, T., & Lee, D. (2020). Learning Spark. O’Reilly Media.
