Tensorizing Deep Neural Networks

The article ‘Tensorizing Neural Networks’, prepared by the Bayesian Methods Research Group under the supervision of Associate Professor of HSE’s Computer Science Faculty Dmitry Vetrov, has been accepted by the NIPS conference – the largest global forum on cognitive research, artificial intelligence, and machine learning, rated A* by the international CORE ranking. This year it is being held December 7-12 in Montreal. Here Dmitry Vetrov talks about the research he presented and about why delivering reports at conferences is better than getting published in the academic press. 


Dmitry Vetrov

The last few years are often referred to as the era of ‘big data’, and in part this is justified. For the first time in history, the volume of data available for analysis has started to grow faster than the computing technology used to analyze it. Experiments have established two interesting facts. First, classical approaches to data processing (including machine learning) do not scale to this volume of data: applying them to all the data available would take months or years on today’s computers. Second, a sample of a trillion objects contains much more useful information than a sample of a million objects. This is significant: it means we cannot limit data processing to datasets that are small relative to the overall volume of initial data without losing vital information. This became clear thanks to the development of deep neural networks and scalable methods for analyzing big data in the late 2000s.

Scalable approaches to optimization (stochastic optimization) work wonders: for example, optimizing a function that depends on millions of parameters can take less time than computing the value of this function at a single (sic!) point. Stochastic optimization is used to train modern neural networks, which in many tasks (speech recognition, image classification, etc.) surpass what humans can accomplish. Experiments have shown that the more layers a neural net has, and the wider they are, the higher the network’s accuracy, which is why the latest generation of neural nets takes up all of a computer’s working memory (RAM). Does this mean that we have reached the limit of extensive growth in the breadth and depth of neural nets? And if so, what can we do?
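
The point about optimizing faster than evaluating can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the group's work: the one-parameter, noise-free least-squares problem, the helper names make_data, full_loss and sgd, and the step size. Each stochastic step looks at a single data point, so 200 steps touch far fewer points than one exact evaluation of the loss over all 10,000.

```python
import random


def make_data(n, true_w=3.0, seed=0):
    """Toy, noise-free data: y = true_w * x with x drawn from [0, 1)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [true_w * x for x in xs]
    return xs, ys


def full_loss(w, xs, ys):
    """Exact mean squared error: one evaluation reads every data point."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sgd(xs, ys, steps=200, lr=0.5, seed=1):
    """Stochastic gradient descent: each step reads one random data point."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))
        # Gradient of (w * x_i - y_i)^2 with respect to w.
        grad = 2.0 * (w * xs[i] - ys[i]) * xs[i]
        w -= lr * grad
    return w


xs, ys = make_data(10_000)
w = sgd(xs, ys)  # reads 200 points in total
print("fitted w:", w)
print("loss over all 10,000 points:", full_loss(w, xs, ys))
```

In this toy setting the 200 cheap steps already recover the true parameter, while a single call to full_loss has to read every point. Real networks are noisy and high-dimensional, so the gap is less clean, but the mechanism is the same.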

The latest tensor decomposition techniques from multilinear algebra are of help here. A tensor is a multidimensional array, a generalization of vectors and matrices that is also used to describe physical properties. Any vector or matrix can be turned into a tensor via reshaping operations. Tensors can contain a vast number of elements: for example, a 200-dimensional tensor in which every dimension has length 2 contains 2^200 (about 1.6 × 10^60) elements, far more than any computer could store explicitly. Under certain conditions, tensor decompositions make it possible to convert a tensor into a more compact format, ridding it of redundant data, much as files are compressed when they are archived.
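
To get a feel for the numbers, here is a small counting sketch. It assumes the compact format is a chain of small three-way 'cores', one per dimension, in the style of a tensor-train decomposition. The text above does not name the format, so the chain structure, the rank value of 4 and the helper names full_size and tt_size are illustrative assumptions.

```python
def full_size(shape):
    """Number of elements needed to store the tensor explicitly."""
    total = 1
    for n in shape:
        total *= n
    return total


def tt_size(shape, rank):
    """Number of elements in a chain of cores, one core per dimension.

    Core k has shape (r_k, n_k, r_{k+1}); the outer ranks are 1 and all
    inner ranks are set to `rank` (an assumption made for illustration).
    """
    d = len(shape)
    ranks = [1] + [rank] * (d - 1) + [1]
    return sum(ranks[k] * shape[k] * ranks[k + 1] for k in range(d))


shape = [2] * 200  # 200 dimensions, each of length 2
print("explicit storage:", full_size(shape))      # equals 2**200
print("compact storage: ", tt_size(shape, rank=4))
```

With these assumed ranks the compact form needs only a few thousand numbers. Whether a given tensor can actually be approximated well at such low ranks is exactly the 'certain conditions' caveat mentioned above.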

This compact format can require many times less memory than storing the tensor itself. It turns out that the algebraic operations that need to be carried out on a neural net when training it can be carried out directly on the compact form of the tensor. That means that the neural net’s weights (tens or hundreds of millions of parameters) can be represented as a tensor, which in turn can be compressed, so that the net is trained not via its explicit weights but via their compact tensor form.
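
As a minimal sketch of what 'operating on the compact form' can mean, consider a weight matrix stored as just two small cores and multiplied by a vector without ever being built in full. The two-core factorization, the index layout and all function names here are assumptions chosen for illustration; the group's actual format and algorithms may differ. The dense matrix is reconstructed only to check the answer.

```python
import random


def random_cores(m1, n1, m2, n2, rank, seed=0):
    """Two small integer cores: g1[i1][j1][a] and g2[a][i2][j2]."""
    rng = random.Random(seed)
    g1 = [[[rng.randint(-3, 3) for _ in range(rank)]
           for _ in range(n1)] for _ in range(m1)]
    g2 = [[[rng.randint(-3, 3) for _ in range(n2)]
           for _ in range(m2)] for _ in range(rank)]
    return g1, g2


def dense_from_cores(g1, g2):
    """Build the full (m1*m2) x (n1*n2) matrix; used only for checking."""
    m1, n1, rank = len(g1), len(g1[0]), len(g2)
    m2, n2 = len(g2[0]), len(g2[0][0])
    w = [[0] * (n1 * n2) for _ in range(m1 * m2)]
    for i1 in range(m1):
        for i2 in range(m2):
            for j1 in range(n1):
                for j2 in range(n2):
                    w[i1 * m2 + i2][j1 * n2 + j2] = sum(
                        g1[i1][j1][a] * g2[a][i2][j2] for a in range(rank))
    return w


def dense_matvec(w, x):
    """Ordinary matrix-vector product on the explicit matrix."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def compact_matvec(g1, g2, x):
    """Matrix-vector product using only the cores, never the full matrix."""
    m1, n1, rank = len(g1), len(g1[0]), len(g2)
    m2, n2 = len(g2[0]), len(g2[0][0])
    # Step 1: contract the second core with x viewed as an n1 x n2 table.
    t = [[[sum(g2[a][i2][j2] * x[j1 * n2 + j2] for j2 in range(n2))
           for j1 in range(n1)] for i2 in range(m2)] for a in range(rank)]
    # Step 2: contract the first core with the intermediate t[a][i2][j1].
    y = [0] * (m1 * m2)
    for i1 in range(m1):
        for i2 in range(m2):
            y[i1 * m2 + i2] = sum(g1[i1][j1][a] * t[a][i2][j1]
                                  for j1 in range(n1) for a in range(rank))
    return y


g1, g2 = random_cores(m1=4, n1=5, m2=3, n2=6, rank=2)
x = list(range(5 * 6))
y_compact = compact_matvec(g1, g2, x)
y_dense = dense_matvec(dense_from_cores(g1, g2), x)
print(y_compact == y_dense)
```

Here the two cores hold 76 numbers in place of the 360 entries of the full 12 × 30 matrix. The saving is modest at this toy size but grows quickly as more dimensions and cores are used.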

To achieve this, the Bayesian Methods Research Group developed procedures that reduce the amount of memory needed to store one layer of a neural net by a factor of 700,000 without any loss in the quality of the net itself. The resulting network can be called a tensornet.

These results open up new opportunities, such as storing deep neural networks on mobile devices, which makes it possible to process data (e.g. for speech recognition) on the device itself, without sending a signal to a neural net server and back, as is currently the case. In addition, this development makes it possible to design new neural net architectures that are several times wider while taking up the same amount of memory.

The results of our research were accepted at Neural Information Processing Systems (NIPS), a leading conference on machine learning rated A* by the international CORE ranking.