
How to Win a Data Science Competition: Learn from Top Kagglers

2019/2020
Academic Year
RUS
Instruction in Russian
2
ECTS credits
Course type:
Elective course
When:
Year 3, Module 3

Course Syllabus

Abstract

Participating in predictive modelling competitions can help you gain practical experience and improve and harness your data modelling skills in domains such as credit, insurance, marketing, natural language processing, sales forecasting and computer vision, to name a few. At the same time, you do so in a competitive setting against thousands of participants, each trying to build the most predictive algorithm. Pushing each other to the limit can result in better performance and smaller prediction errors. Achieving high ranks consistently can help accelerate your career in data science. In this course, you will learn to analyse and solve such predictive modelling tasks competitively. This is not a machine learning course in the general sense: it teaches you how to reach high-ranking solutions against thousands of competitors, with a focus on the practical use of machine learning methods rather than their theoretical underpinnings.
Learning Objectives

  • Understand how to solve predictive modelling competitions efficiently and learn which of the skills obtained are applicable to real-world tasks.
  • Learn how to preprocess data and generate new features from various sources such as text and images.
  • Learn advanced feature engineering techniques such as generating mean encodings, using aggregated statistical measures, or finding nearest neighbors as a means to improve your predictions.
  • Be able to form reliable cross-validation methodologies that help you benchmark your solutions and avoid overfitting or underfitting when tested on unobserved (test) data.
  • Gain experience in analysing and interpreting data. You will become aware of inconsistencies, high noise levels, errors and other data-related issues such as leakages, and you will learn how to overcome them.
  • Acquire knowledge of different algorithms and learn how to tune their hyperparameters efficiently to achieve top performance.
  • Master the art of combining different machine learning models and learn how to ensemble.
  • Get exposed to past (winning) solutions and code, and learn how to read them.
Expected Learning Outcomes

  • The student knows competition mechanics, the difference between competitions and real-life data science, and the hardware and software typically used in competitions, and can recap the major ML models frequently used in competitions.
  • The student knows how to extract features from text with bag of words and Word2vec, and from images with convolutional neural networks.
  • The student can walk through the EDA process for the Springleaf competition data and has studied an example of EDA for the Numerai competition.
  • The student understands that the right strategy depends on the competition setup and that a correct validation scheme is a building block of any winning solution.
  • The student knows how to efficiently optimize the metric given in a competition.
  • The student constructs, regularizes and extends mean encodings.
  • The student knows the hyperparameter optimization process, several more advanced feature engineering techniques, and how best to ensemble models in practice.
Course Contents

  • Introduction & Recap. Feature Preprocessing and Generation with Respect to Models. Final Project Description.
    Introduction. Meet your lecturers. Course overview. Competition Mechanics. Kaggle Overview [screencast]. Real World Application vs Competitions. Recap of main ML algorithms. Software/Hardware Requirements. Overview. Numeric features. Categorical and ordinal features. Datetime and coordinates. Handling missing values. Bag of words. Word2vec, CNN. Final project overview.
  • Exploratory Data Analysis. Validation. Data Leakages.
    Exploratory data analysis. Building intuition about the data. Exploring anonymized data. Visualizations. Dataset cleaning and other things to check. Springleaf competition EDA. Numerai competition EDA. Validation and overfitting. Validation strategies. Data splitting strategies. Problems occurring during validation. Basic data leaks. Leaderboard probing and examples of rare data leaks. Expedia challenge.
  • Metrics Optimization. Advanced Feature Engineering I.
    Regression metrics review. Classification metrics review. General approaches for metrics optimization. Regression metrics optimization. Classification metrics optimization. Concept of mean encoding. Regularization. Extensions and generalizations.
  • Hyperparameter Optimization. Advanced feature engineering II. Ensembling.
    Hyperparameter tuning. Practical guide. KazAnova's competition pipeline. Statistics and distance based features. Matrix factorizations. Feature Interactions. Bagging. Boosting. Stacking. StackNet. Ensembling Tips and Tricks. CatBoost.
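
To illustrate the "Concept of mean encoding" and "Regularization" topics listed above, here is a minimal sketch of out-of-fold mean (target) encoding with additive smoothing. It is not taken from the course materials: the function name `mean_encode_oof`, the `alpha` smoothing parameter and the simple index-based fold assignment are all assumptions made for this illustration.

```python
from collections import defaultdict


def mean_encode_oof(categories, targets, n_folds=2, alpha=1.0):
    """Out-of-fold mean encoding with additive smoothing (illustrative sketch).

    Each row is encoded using target statistics computed only on the other
    folds, so a row's own label never leaks into its encoded feature.
    """
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    if alpha <= 0:
        raise ValueError("alpha must be positive")

    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Hypothetical fold assignment: row i belongs to fold i % n_folds.
        sums = defaultdict(float)
        counts = defaultdict(int)
        train_sum, train_count = 0.0, 0
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                train_sum += targets[i]
                train_count += 1
        # Prior: mean target over the training folds only.
        prior = train_sum / train_count
        for i in range(n):
            if i % n_folds == fold:
                c = categories[i]
                # Smoothing pulls rare or unseen categories toward the prior.
                encoded[i] = (sums[c] + alpha * prior) / (counts[c] + alpha)
    return encoded


if __name__ == "__main__":
    print(mean_encode_oof(["a", "a", "b", "b"], [1, 0, 1, 1]))
```

Larger values of `alpha` give stronger regularization: categories with few training rows are encoded closer to the prior, which reduces overfitting to noisy per-category means.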
Assessment Elements

  • non-blocking Completion of assignments on topic 1
  • non-blocking Completion of assignments on topic 2
  • non-blocking Completion of assignments on topic 3
  • non-blocking Completion of assignments on topic 4
  • non-blocking Test
Interim Assessment

  • Interim assessment (module 3)
    0.2 * Completion of assignments on topic 1 + 0.2 * Completion of assignments on topic 2 + 0.2 * Completion of assignments on topic 3 + 0.2 * Completion of assignments on topic 4 + 0.2 * Test
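
The weighting formula above can be sketched as a short calculation. The component keys and the `interim_grade` function are hypothetical names chosen for this example, and the sample scores are made up. The syllabus does not state the grading scale or a rounding rule, so none is applied here.

```python
# Weights taken from the formula above: five components at 0.2 each.
# The dictionary keys are hypothetical labels, not official names.
WEIGHTS = {
    "topic_1": 0.2,
    "topic_2": 0.2,
    "topic_3": 0.2,
    "topic_4": 0.2,
    "test": 0.2,
}


def interim_grade(scores):
    """Return the weighted sum of the five assessment components."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Made-up example scores for illustration only.
    example = {"topic_1": 8, "topic_2": 6, "topic_3": 10, "topic_4": 7, "test": 9}
    print(interim_grade(example))
```

Because all five weights are equal, the result is simply the arithmetic mean of the five component scores.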
Bibliography

Recommended Core Bibliography

  • Witten, I. H. et al. Data Mining: Practical machine learning tools and techniques. – Morgan Kaufmann, 2017. – 654 pp.

Recommended Additional Bibliography

  • Lantz, B. (2013). Machine Learning with R : Learn How to Use R to Apply Powerful Machine Learning Methods and Gain an Insight Into Real-world Applications. Birmingham, UK: Packt Publishing. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=656222
  • Mathur, P. (2019). Machine Learning Applications Using Python : Cases Studies From Healthcare, Retail, and Finance. [Berkeley, California]: Apress. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1982259