Statistical Learning Theory
- Understanding basic concepts from statistical learning theory.
- Understanding the connection between these models and many machine learning algorithms.
- Training mathematical skills such as abstract thinking, formal reasoning, and problem solving, with an emphasis on statistics.
- Knowledge of several model-selection paradigms in statistical learning theory (structural risk minimization, maximum likelihood, minimum description length, etc.).
- Calculating the sizes of training sets for several machine learning tasks in the context of PAC learning (and hence calculating VC-dimensions).
- Understanding the link between cryptography and the computational limitations of statistical learning.
- Deeper understanding of boosting algorithms and support vector machines.
- Theoretical understanding of several online learning algorithms and learning with expert advice.
- Probably approximately correct learning: As an introduction, we study binary classification of points in the plane using nearest neighbors and axis-aligned rectangles. Then the general problem is formulated: for most of the course we study classification and assume that the training and test data are obtained by independent sampling from the same probability distribution. We show that in this setting we need to make some assumptions on the set of classifiers being fitted; the corresponding theorems are known as "no-free-lunch theorems". We define sample complexity, PAC-learnability, and some equivalent characterizations of a learnable set of classifiers. (A sample-complexity sketch follows this list.)
- VC-dimension: The VC-dimension of a set of classifiers is a measure of the richness of the set. Roughly stated, the higher this dimension, the more regularities the set can model. The fundamental theorem of statistical learning states that a set of classifiers is PAC-learnable (i.e., we can learn a correct classifier to any precision using finitely many samples) if and only if its VC-dimension is finite. A quantitative version allows us to interpret the VC-dimension as a measure of the risk of overfitting. We study two versions (the realizable and the agnostic one). (A brute-force shattering check follows this list.)
- Structural risk minimization and variants: The bias-variance decomposition concerns the difference between the error of a learned classifier and that of the best possible classifier (also known as the "Bayes optimal classifier"). This difference is the sum of a bias, due to the limited size of the set of classifiers, and a variance, due to overfitting, which typically increases as the set of classifiers becomes larger. For several learning algorithms, this trade-off allows us to assign values to hyperparameters that would otherwise have to be set using cross-validation. For example, we can use it to determine a good size for decision trees and neural networks. (A toy model-selection example follows this list.)
- The time complexity of learning and cryptography: A typical cryptographic encoding of a string is an example of an object with a clear structure that no learning algorithm can recover in a reasonable amount of time. In this part we discuss some learning tasks that are infeasible for computational reasons. Under a plausible computational-complexity assumption (one that is required for secure RSA encryption), one can show that neural networks of small depth and regular languages cannot be learned efficiently.
- Boosting: A weak learning algorithm generates from a training set a model that is only slightly better than random guessing. We study AdaBoost and DeepBoost, two algorithms that are currently very popular in machine learning. Furthermore, we prove performance guarantees and conclude that weak learnability is equivalent to PAC-learnability. However, these performance bounds do not explain the observed data well; this motivates the approach of the following topic. (A compact AdaBoost implementation follows this list.)
- Rademacher complexities: Similar to the VC-dimension, the Rademacher complexity of a set of classifiers quantifies its richness. Unlike the VC-dimension, however, the quantity does not change if we add signs of convex combinations of classifiers to the set. Using Rademacher complexities, we develop a theory in which the risk bounds are more closely related to experimentally observed generalization errors. We use the theory to determine risk bounds for neural networks and argue that dropout regularization is very effective for deep neural networks. (A Monte Carlo estimate of an empirical Rademacher complexity follows this list.)
- Support vector machines and margin theory: We review the theory of support vector machines and kernels. We argue that they optimize risk bounds under the large-margin assumption. We then use the theory to derive a polynomial-time algorithm that learns improper representations of L1-regularized neural networks of bounded depth. (A minimal hinge-loss SVM sketch follows this list.)
- Multiclass classification and DeepBoost: Risk bounds based on Rademacher complexities are generalized to multiclass classification, and we study the algorithms from the paper by Kuznetsov, Mohri, and Syed in the references below.
- Online learning: In online learning there is initially no training set. After each prediction, the true label is revealed, and the learning algorithm can use this information to improve its prediction model. We study 1) prediction with expert advice and 2) linear classification algorithms such as the perceptron algorithm. (A weighted-majority sketch follows this list.)
- Reinforcement learning: In the unlikely case that there is time left, we study reinforcement learning. This part will not be part of the exam material (unless at least one student asks for a bonus question). We are interested in designing an agent that successfully collects rewards while navigating an environment. We model the environment as a Markov decision process and use Bellman equations to define optimal behaviour. We then discuss several planning algorithms, both for the case where the agent knows the model of the environment and for the case where it does not. (A value-iteration sketch follows this list.)
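
The sketches below are illustrative only and are not part of the official course materials. First, a minimal Python sketch of the textbook sample-complexity bounds for a finite hypothesis class; the function name and the toy numbers are ours, and the constants follow the usual Hoeffding-plus-union-bound derivation.

```python
import math

def pac_sample_size(hypothesis_count: int, epsilon: float, delta: float,
                    realizable: bool = True) -> int:
    """Sample size sufficient for PAC learning a finite hypothesis class H.

    Realizable case:  m >= (1/eps) * (ln|H| + ln(1/delta))
    Agnostic case:    m >= (1/(2*eps^2)) * (ln|H| + ln(2/delta))
    """
    if realizable:
        m = (math.log(hypothesis_count) + math.log(1 / delta)) / epsilon
    else:
        m = (math.log(hypothesis_count) + math.log(2 / delta)) / (2 * epsilon ** 2)
    return math.ceil(m)

# Example: 2^20 classifiers, 5% error, 99% confidence.
print(pac_sample_size(2 ** 20, epsilon=0.05, delta=0.01))                    # realizable
print(pac_sample_size(2 ** 20, epsilon=0.05, delta=0.01, realizable=False))  # agnostic
```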
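Next, a brute-force shattering check, assuming the class of indicator functions of closed intervals on the real line (which has VC-dimension 2); the helper names are ours.

```python
from itertools import product  # only stdlib needed

def interval_labelings(points):
    """All labelings of `points` achievable by indicators of closed
    intervals [a, b] (plus the empty interval, which labels everything 0)."""
    labelings = {tuple(0 for _ in points)}
    for a in points:
        for b in points:
            labelings.add(tuple(int(a <= x <= b) for x in points))
    return labelings

def is_shattered(points):
    """True iff intervals realize all 2^n labelings of `points`."""
    return len(interval_labelings(points)) == 2 ** len(points)

print(is_shattered([1.0, 2.0]))       # True  -> VC-dimension of intervals is >= 2
print(is_shattered([1.0, 2.0, 3.0]))  # False -> the labeling (1, 0, 1) is impossible
```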
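A toy structural-risk-minimization example, assuming scikit-learn is available; the penalty is only a schematic VC-style term with constants dropped, and using 2^depth as a complexity proxy for depth-bounded trees is our simplification, not the course's.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)  # circular concept
y = np.where(rng.random(200) < 0.1, 1 - y, y)      # 10% label noise

m, delta = len(X), 0.05
best = None
for depth in range(1, 8):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    train_err = 1.0 - tree.score(X, y)
    complexity = 2 ** depth                        # crude stand-in for the VC term
    bound = train_err + np.sqrt((complexity + np.log(1 / delta)) / (2 * m))
    if best is None or bound < best[0]:
        best = (bound, depth)

print("SRM selects depth %d (bound %.3f)" % (best[1], best[0]))
```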
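A compact AdaBoost implementation over decision stumps, following the standard weight-update rule; the synthetic data and the restriction of thresholds to quantiles are our choices.

```python
import numpy as np

def best_stump(X, y, w):
    """Weighted-error-minimizing stump; returns (feature, threshold, sign, error)."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0, 1, 20)):
            for s in (1, -1):
                pred = s * np.where(X[:, j] <= t, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, t, s, err)
    return best

def adaboost(X, y, rounds=30):
    m = len(y)
    w = np.full(m, 1.0 / m)                       # uniform initial weights
    ensemble = []
    for _ in range(rounds):
        j, t, s, err = best_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)     # weight of this weak hypothesis
        pred = s * np.where(X[:, j] <= t, 1, -1)
        w *= np.exp(-alpha * y * pred)            # up-weight misclassified points
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] <= t, 1, -1) for a, j, t, s in ensemble)
    return np.sign(score)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1, 1, -1)  # stumps are weak but boostable here
model = adaboost(X, y)
print("training accuracy:", (predict(model, X) == y).mean())
```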
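A Monte Carlo estimate of the empirical Rademacher complexity of a finite class, here 1-D threshold classifiers on a fixed sample; the estimator simply averages the supremum of the correlations with random sign vectors.

```python
import numpy as np

def empirical_rademacher(preds, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity
    R_S(H) = E_sigma[ sup_{h in H} (1/m) sum_i sigma_i h(x_i) ]
    for a finite class given as a (|H|, m) matrix of +-1 predictions."""
    rng = np.random.default_rng(seed)
    m = preds.shape[1]
    sigma = rng.choice([-1, 1], size=(n_draws, m))   # Rademacher sign vectors
    return ((preds @ sigma.T).max(axis=0) / m).mean()

# Finite class: 1-D threshold classifiers h_t(x) = sign(x - t) on a fixed sample.
rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=50))
thresholds = np.concatenate(([-np.inf], x))
preds = np.array([np.where(x > t, 1, -1) for t in thresholds])
print("estimated empirical Rademacher complexity: %.3f" % empirical_rademacher(preds))
```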
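A minimal linear soft-margin SVM trained by stochastic subgradient descent on the regularized hinge loss (a Pegasos-style scheme); a kernelized or dual formulation, as treated in the course, would replace the inner products.

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, epochs=50, seed=0):
    """Linear soft-margin SVM via stochastic subgradient descent on
    lam/2 * ||w||^2 + (1/m) * sum_i max(0, 1 - y_i <w, x_i>), labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, t = np.zeros(n), 0
    for _ in range(epochs):
        for i in rng.permutation(m):
            t += 1
            eta = 1.0 / (lam * t)                 # standard Pegasos step size
            violated = y[i] * (X[i] @ w) < 1      # check margin before the update
            w *= (1 - eta * lam)                  # shrink: gradient of the regularizer
            if violated:
                w += eta * y[i] * X[i]            # hinge subgradient step
    return w

rng = np.random.default_rng(2)
y = rng.choice([-1, 1], size=200)
X = rng.normal(size=(200, 2))
X[:, 0] += 2.5 * y                                # separate the classes along coordinate 0
w = svm_sgd(X, y)
print("training accuracy:", (np.sign(X @ w) == y).mean())
```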
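A sketch of the deterministic Weighted Majority algorithm for prediction with expert advice, whose mistake bound is O(log n + m*) with m* the mistake count of the best expert; the simulated experts are our toy setup.

```python
import numpy as np

def weighted_majority(expert_preds, outcomes, beta=0.5):
    """Deterministic Weighted Majority over n experts.

    expert_preds: (T, n) array of 0/1 predictions; outcomes: (T,) array of 0/1.
    Each round: predict the weighted majority vote, then multiply the weight
    of every wrong expert by beta."""
    T, n = expert_preds.shape
    w = np.ones(n)
    mistakes = 0
    for t in range(T):
        vote1 = w[expert_preds[t] == 1].sum()
        vote0 = w[expert_preds[t] == 0].sum()
        pred = int(vote1 >= vote0)
        mistakes += int(pred != outcomes[t])
        w[expert_preds[t] != outcomes[t]] *= beta   # penalize wrong experts
    return mistakes

rng = np.random.default_rng(3)
T, n = 500, 10
outcomes = rng.integers(0, 2, size=T)
expert_preds = rng.integers(0, 2, size=(T, n))
expert_preds[:, 0] = np.where(rng.random(T) < 0.9, outcomes, 1 - outcomes)  # one good expert
print("learner mistakes:", weighted_majority(expert_preds, outcomes))
print("best expert mistakes:", (expert_preds[:, 0] != outcomes).sum())
```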
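Finally, value iteration for a known Markov decision process, repeatedly applying the Bellman optimality operator; the two-state MDP is purely illustrative.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a known MDP.

    P: (A, S, S) transition probabilities; R: (S, A) expected rewards.
    Repeatedly applies V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    S = R.shape[0]
    V = np.zeros(S)
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)  # action values under current V
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)            # optimal values, greedy policy
        V = V_new

# Tiny 2-state, 2-action MDP (numbers are illustrative only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # transitions under action 0
              [[0.1, 0.9], [0.8, 0.2]]])  # transitions under action 1
R = np.array([[1.0, 0.0],                 # rewards R[s, a]
              [0.0, 2.0]])
V, pi = value_iteration(P, R)
print("optimal values:", V, "greedy policy:", pi)
```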
- Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2012). Foundations of Machine Learning. Cambridge, MA: The MIT Press. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=478737
- Kuznetsov, V., Mohri, M., & Syed, U. (2014). Multi-Class Deep Boosting. In Advances in Neural Information Processing Systems 27 (NIPS 2014).