# 'Machine Learning Algorithm Able to Find Data Patterns a Human Could Not'

In December 2016, five new international laboratories opened up at the Higher School of Economics, one of which was the International Laboratory of Deep Learning and Bayesian Methods. This lab focuses on combined neural Bayesian models that bring together two of the most successful paradigms in modern-day machine learning – the neural network paradigm and the Bayesian paradigm.

#### Dmitry Vetrov, Laboratory Head

### Nothing Is Impossible

I got involved with machine learning back in 2000 after I saw how a machine-learning algorithm was capable of finding data patterns that a human doesn’t see. I thought this was really cool. With an algorithm like this, it’s like gaining access to hidden knowledge that no one else has a key to. For the ambitious young man that I was, this was enough to get me to devote my life to this field. And over the course of the last 17 years, nothing has really changed (except that there are now more tasks, and they’re harder and more interesting). And machine-learning algorithms are still finding patterns a person doesn’t see. (The algorithms, of course, also find patterns that a human does see, but that’s not as interesting.) Another important thing was, and still is, the realisation that what I do does not concern spherical horses in a vacuum, but technology that helps with practical and important tasks, such as machine translation, self-driving cars, weather forecasting, the fight against bank fraud, lowering costs at mining companies, energy consumption optimization, and more.

When I was a graduate student, I had a clear understanding of what kind of tasks machine learning could and could not be used to solve. I am very glad that in the years that have passed since, I’ve come to learn that nothing is impossible. There are two things left from the ‘list of impossible tasks for machine learning’ that I thought up 14 years ago: using human language to talk to a computer (this problem has almost been solved as of April 2017) and developing full-fledged artificial intelligence (scientists are of the opinion, and I agree, that this will happen within the next 10 years, and no later). It’s great to see when science surpasses even the boldest of expectations. Along with my own modest knowledge, this is exactly the feeling I try to pass on to my undergraduate and graduate students.

Our Bayesian methods research group was created nearly 10 years ago, after I defended my PhD thesis and was allowed to take a team of students under my wing. The first group was successful, and one of the graduates, Anton Osokin, became a well-known researcher and is currently returning to Russia to work as an Assistant Professor in the HSE Faculty of Computer Science. The group developed gradually, more slowly at first but then rather quickly.

### Machine Learning’s Scientific Revolution

Over the course of 10 years, machine learning has undergone a scientific revolution. To the surprise of many scientists, though not all, the long-known and ineffective neural network model became highly effective once it was trained on large volumes of data. The results that neural networks have demonstrated when solving a number of cognitive tasks often surpass human capabilities. This technology has gotten a special name – deep learning. Currently, machine learning with small data is practically not developing at all. The main results in that area have already been obtained, and effective methods are well known. But the results that they are yielding are very modest.

Ten years ago, it was clear that machine learning in Russia was very much behind global trends. But the ‘deep revolution’ gave a second chance to the countries that were lagging. In this sense, it can be compared to the dreadnought revolution of the early 20th century, when the emergence of dreadnoughts in Great Britain reduced the importance of the country’s own naval fleet, and other countries were able to catch up to the ‘Mistress of the Seas’ by starting construction on their own dreadnoughts. It can be said that, overall, Russia took advantage of this, albeit not right away, thanks to large IT companies, and not universities or scientific organisations.

It is precisely the deep learning successes that were achieved over the last two years that allow us to assume that artificial intelligence will come about soon.

### Bayesian Methods: Past and Future

If the first neural networks can be said to have emerged in the 1950s, then the first Bayesian methods date back to the 18th century, when Reverend Thomas Bayes proved his famous theorem. Similar to how the formula E=mc² became a symbol of scientific laconicism and elegance, Bayes’ theorem has every chance of becoming the same symbol for the 21st century.

This theorem, which every student who has taken statistics is familiar with, sets out how information about an unknown quantity should be updated once certain related characteristics have been observed. In the 20th century, statisticians viewed Bayes’ theorem as a funny trifle that was useful in everyday life, but could rarely be applied to statistical estimation tasks, where what dominated was the classical mathematical statistics that had formed fully by the 1930s. Interest in the construction of complex probability models in machine learning began growing in 1992, when David MacKay’s famous book Information Theory, Inference and Learning Algorithms came out. In it, MacKay builds the foundation for the Bayesian approach to machine learning and related disciplines.
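For reference, the theorem can be written as follows. The symbols are chosen here for illustration and do not come from the article: θ stands for the unknown quantity and x for the observed data.

$$
p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}
$$

Here p(θ) is what was believed about θ before seeing the data (the prior), p(x | θ) says how likely the observation is for each possible value of θ, and p(θ | x) is the updated belief (the posterior).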

The results of Bayes’ theorem allow for probabilistic models to take on interesting characteristics when applied to machine learning. Firstly, we are able to take into account the details of a specific task being solved and adapt a basic machine-learning model to the task. Secondly, Bayesian models are modular, and a complex model can always be presented in the form of a combination of many simple probability models. Thirdly, in working with such models, we are able to extract as much information as possible from incomplete, noisy, contradictory data; that is, the models become ‘omnivorous.’ The ability to learn using incomplete data is also interesting because during this type of learning, the model is able to learn about what was initially not included in the model. The downside of Bayesian methods is a relatively complex mathematical apparatus and a great computational complexity, which made it impossible for Bayesian methods to be applied to the processing of big data.

### The 21st Century: Neural Bayesian Models

A sort of compromise had come about by the early 2010s. Everyone used neural networks to work with big data, while Bayesian models were used with small data of a poorer quality and/or when complex models had to be created to process the data (for example, random fields used in computer vision tasks). To work with small data of a better quality, the entire remaining arsenal of machine-learning methods was used.

Since around 2012, a number of studies have come about suggesting a new mathematical apparatus that allows for Bayesian methods to be scaled for use on big data. An interesting idea lay at the foundation of this apparatus. First, the task of Bayesian inference (that is, the process of applying Bayes’ theorem to data) was formulated as an optimization task, and then modern techniques of stochastic optimization were applied, allowing for super-large optimization problems to be solved. This allowed Bayesian methods to enter the field of neural networks. The results of this became apparent rather quickly. Over the last five years, an entire class of neural Bayesian models has been developed that is able to solve a broader array of tasks than ordinary deep neural networks can.
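The article does not name the formulation, but one widely used way of turning Bayesian inference into an optimization task is variational inference. As a sketch under that assumption, with notation introduced here for illustration (θ for the model's unknowns, x for the data, and q_φ for an approximating distribution with adjustable parameters φ), the quantity to maximize over φ is the right-hand side of:

$$
\log p(x) \;\ge\; \mathbb{E}_{q_\phi(\theta)}\big[\log p(x \mid \theta)\big] - \mathrm{KL}\big(q_\phi(\theta)\,\|\,p(\theta)\big)
$$

The first term rewards explaining the data and the second keeps q_φ close to the prior. Because the first term is an expectation, it can be estimated from random samples and small batches of data, which is what makes stochastic optimization applicable.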

Such models include new methods for building so-called representations (in the form of vectors) of complex data structures, as well as attention mechanisms, chatbot and machine translation models, certain deep reinforcement learning algorithms, new techniques to regularize base neural network models, and more.

It is important to note that the technique of neural Bayesian inference is developing rapidly, and new mathematical tools are coming about that help make inference more accurate and applicable to more complex models. Our international laboratory will focus specifically on developing this type of mathematical apparatus, as well as developing new Bayesian neural models. One of our first events in this area will be a Summer School on Bayesian neural methods in August 2017. At the school, we will discuss some of the most recent achievements that have been made in this field. We will also share our own experience with developing Bayesian neural networks, as well as carry out a number of hands-on training sessions.

### Today’s PhD Students Will Announce the Discovery of AI

The Bayesian methods research group is currently comprised of more than 30 individuals, and the international laboratory is just a small part of it. Aside from the head and the academic supervisor, the laboratory’s staff also includes two researchers, two junior researchers, and a manager who also carries out active scientific research when she’s not occupied with administrative tasks. We recently took on several other people for non-budgeted positions, which was made possible thanks to a contract we signed with Samsung to conduct research in the field of Bayesian neural modelling. Our researchers, Mikhail Figurnov and Alexander Novikov, are already established young scholars whose names are well known among leading AI development centres around the world. I am proud to have participated in helping them become the scientists that they are, and it is an honour to work alongside the two.

I was fortunate enough to become the head of a research group in which nearly all of the graduate students are smarter than their academic supervisor. These young specialists in machine learning of the deep revolution era are people who use blogs, social networks, and Twitter to look for and exchange academic articles. They know about the latest research results earlier than the professors do, and every other researcher is subscribed to the e-prints database arxiv.org. They meet every Wednesday to discuss the article that was released on Monday (sometimes they even invite their academic supervisor). They aren’t that interested in listening to plenary presentations at top machine-learning conferences because they have usually already read the pre-prints several months in advance. It is the younger researchers in particular who keep the field developing at an incredible pace, and because of this more has been done in machine learning over the last 10 years than had been done in the preceding 50. And they are the ones who will be making the announcement about AI in 10 years.

#### Michael Figurnov, Senior Research Fellow

I am currently working on a project to speed up convolutional neural networks, which are one of the most successful models in deep learning and contemporary computer vision. They are used to identify people in photographs, turn images into text, steer self-driving cars, and for hundreds of other things. Unfortunately, convolutional neural networks are very expensive computationally. Processing a single low-resolution image requires tens of millions of operations, and a high-resolution one requires trillions of floating-point operations! This is too expensive even for powerful servers at datacenters, not to mention mobile devices where every milliwatt counts.

Convolutional neural networks apply the same computations to each portion of an image. Clearly, it is excessive to spend as much time processing a patch of sky as processing the objects in the picture.

Around two years ago, Dmitry Vetrov, Pushmeet Kohli, and I were able to find a way to efficiently compute the outputs of the convolutional layer (the basic ‘building block’ of convolutional neural networks) for just a portion of an image. These results were published at the 2016 NIPS conference (‘Perforated CNNs: Acceleration through Elimination of Redundant Convolutions’). There was just one thing left to do – learn how to determine where these outputs should be calculated. This turned out to be a very difficult task mathematically. It requires optimizing a nonlinear function over hundreds of thousands of binary variables, which no one yet knows how to do.
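To make the idea of computing a convolutional layer only for part of an image concrete, here is a minimal sketch in plain Python. It is an illustration, not the published implementation: the function names, the tiny 4×4 image, the hand-picked positions, and the nearest-neighbour fill for skipped positions are all assumptions made for this example.

```python
def conv2d_at(image, kernel, positions):
    """Compute 'valid' 2D cross-correlation outputs only at the given (row, col) positions."""
    kh, kw = len(kernel), len(kernel[0])
    out = {}
    for r, c in positions:
        total = 0.0
        for i in range(kh):
            for j in range(kw):
                total += image[r + i][c + j] * kernel[i][j]
        out[(r, c)] = total
    return out


def fill_from_nearest(out_h, out_w, computed):
    """Fill every output position with the value of the nearest computed position."""
    filled = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            nearest = min(computed, key=lambda p: (p[0] - r) ** 2 + (p[1] - c) ** 2)
            row.append(computed[nearest])
        filled.append(row)
    return filled


# Toy 4x4 image and 2x2 kernel (hypothetical values chosen for illustration).
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
kernel = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical mask: evaluate only 2 of the 9 output positions.
positions = [(0, 0), (2, 2)]
computed = conv2d_at(image, kernel, positions)

# Skipped positions borrow the value of the closest evaluated one.
approx = fill_from_nearest(3, 3, computed)
```

Only two of the nine outputs are actually computed here. The hard question raised in the text, which positions to choose, is exactly what this sketch leaves to a hand-written list.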

*Left: an image with several specific objects; Right: a map of computations for different regions of the image*

Last summer, while I was interning at Google, I shared my ideas with my manager, Li Zhang. He sent me an article by Alex Graves from Google DeepMind, which focuses on adaptive computation time (ACT) for recurrent neural networks (another successful method of machine learning). I realized that, with small changes, this method could be applied to a special kind of convolutional neural network called a residual network. We were all surprised when this modified approach worked even on very small networks.

The result was a convolutional neural network that automatically determines how many computations will be ‘spent’ on each portion of an image. In order to carry this out, a convolutional layer was needed that could be computed for only a portion of the image. I had thought of one a year and a half before that. The article on this work (‘Spatially Adaptive Computation Time for Residual Networks’) was accepted to the top conference on computer vision, CVPR 2017.

#### Alexander Novikov, Senior Research Fellow

I came to Professor Vetrov’s laboratory five years ago when I was a sophomore. One of the projects I participated in involved tensor-train decomposition (applied to Bayesian models), and we had to figure it out along the way.

The main project I’m currently working on tries to shift the search for adequate data transformations from the person to a computer. The thing is, in spite of the progress of deep learning, the main goal of which is to do away with manually working out the details of a machine-learning algorithm and instead learn these details from data, all successful models still use the same trick – the researcher looks at the data by hand, estimates which transformations would not ‘ruin’ them (such as small shifts or rotations), and applies these transformations to artificially increase the amount of available data. We are trying to build a probability model that will automatically learn which transformations can be applied to an existing dataset.
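The manual trick described above can be sketched in a few lines of plain Python. Everything here is illustrative: the function names, the toy 3×3 ‘image’, and the hand-picked list of one-pixel shifts are assumptions, and the list of offsets is precisely the part that currently has to be chosen by a person.

```python
def shift(image, dx, dy, fill=0):
    """Return a copy of image shifted dx pixels right and dy pixels down, padded with fill."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            nr, nc = r + dy, c + dx
            if 0 <= nr < h and 0 <= nc < w:
                out[nr][nc] = image[r][c]
    return out


def augment(dataset, offsets):
    """Expand a list of (image, label) pairs with shifted copies that keep the same label."""
    augmented = []
    for image, label in dataset:
        for dx, dy in offsets:
            augmented.append((shift(image, dx, dy), label))
    return augmented


# Hand-picked offsets: the researcher's guess at shifts that do not 'ruin' the data.
OFFSETS = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]

# Toy dataset with a single labelled 3x3 image (hypothetical example).
dataset = [([[0, 1, 0], [0, 1, 0], [0, 1, 0]], "vertical bar")]
bigger = augment(dataset, OFFSETS)
```

One labelled example becomes five. A model that learned the admissible transformations from data would, in effect, replace the hard-coded `OFFSETS` list.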

Another one of my current projects is the creation of an applied Bayesian methods course (with an emphasis on Neural Bayesian approaches) on Coursera. I hope that this will allow us to grow the circle of people who work in this field, all while speeding up progress.

#### Kirill Struminsky, Research Assistant

In life, people rely on their sense organs, while a computer’s perception is limited by sequences of zeros and ones. This results in a surprising disparity in the capabilities of humans and computers. For example, decades of work by some of the best engineering minds at Boston Dynamics were spent making signals from cameras, gyroscopes, and accelerometers allow robots to move around in nature with the dexterity of a six-year-old child – a child who, at birth, was blessed with sight and a vestibular apparatus, and who easily learned to walk. On the other hand, not a single researcher, whether they are a researcher who studies elementary particles or a molecular biologist, has been able to process the signals of experimental installations with the speed and accuracy of a computer.

At the lab, I research unsupervised learning models that will help address the aforementioned disparity. The model I study, the variational autoencoder, learns something like two dictionaries just by looking at raw data – the first translates data from a representation that is understandable to us into one that is concise and understandable for a computer, while the second does the reverse. Researchers now believe that simple and straightforward data representations will allow the efficiency of machine-learning algorithms to be improved.

#### Anton Rodomanov, Research Assistant

My academic interests centre on stochastic optimization. The majority of researchers in our lab are focused on thinking up mathematical models (and very complex ones at that), the main goal of which is to translate a specific practical task from human language into the language of mathematics. The resulting mathematical problem can then be solved with the help of purely mathematical methods. In most cases, the mathematical problem is a problem of optimisation, and this problem requires new and effective methods in order to be solved. The development of these new methods is exactly what I work on.
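For readers unfamiliar with the term, the simplest stochastic optimization method is stochastic gradient descent, which improves a solution using one randomly chosen data point at a time instead of the whole dataset. The sketch below is a textbook toy example, not one of the lab's methods; the function name, the data, and the step-size schedule are assumptions made for illustration. It finds the number w that minimises the average of (w − x)² over the data, which is the mean of the data.

```python
import random


def sgd_minimise_mean_square(data, steps=2000, seed=0):
    """Minimise the average of (w - x)^2 over data using stochastic gradient descent."""
    rng = random.Random(seed)
    w = 0.0
    for t in range(steps):
        x = rng.choice(data)        # look at one random data point per step
        grad = 2.0 * (w - x)        # gradient of (w - x)^2 at that point
        step_size = 0.5 / (t + 1)   # decaying step size, so the noise averages out
        w -= step_size * grad
    return w


data = [1.0, 2.0, 3.0, 4.0, 5.0]
w_hat = sgd_minimise_mean_square(data)  # ends up close to 3.0, the mean of the data
```

Each step is cheap because it touches a single data point, which is why methods of this family scale to the very large problems mentioned earlier in the article.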

Actually, the development of mathematical models and the development of corresponding optimisation methods are closely related. On the one hand, without an effective optimisation method, the resulting mathematical model would be more or less useless. On the other hand, the new mathematical models that need to be optimised are in turn a unique ‘engine of progress’ in the optimisation. New models motivate optimisation researchers to look for new tasks and to think up new methods that can effectively solve these new problems (or, conversely, to prove that it is impossible to think up such methods).

#### Nadezhda Chirkova, Manager

Everything we do serves as building blocks in the larger task of creating AI technologies. Any significant progress, whether it’s high-quality algorithms for image recognition or a computer that can beat a human at a game of Go, stems from the achievements of researchers all over the world. Every article that is published relies on a stack of previous works. For example, I am currently working on a project to automatically select hyperparameters in machine-learning language models. The idea behind the project is to further decrease the amount of work a person has to do to build a high-quality model for his or her body of texts. You could say that I’m bridging two different developments – the results our own research group has achieved and the approach that researchers use at Cambridge. Such solutions, if they are successful when tested on various problems, are usually integrated into popular machine-learning libraries in order to make it easier for developers and researchers to use these tools.