Evgenii Golikov
Why Does Pre-training Actually Work with Convolutional Neural Networks?
Educational Programme: Data Science (Master’s programme)
In transfer learning, the goal is to transfer knowledge from a source dataset with a large number of examples to a target dataset, which is typically small.

The subject of the present work is one of the classical techniques of transfer learning, namely supervised pre-training [Yosinski et al., 2014], which is typically used in image classification problems. Since these problems are usually solved with convolutional architectures, there are two possible reasons why pre-training works well:

1 Pre-training on a source dataset provides good kernels for the convolutional layers;

2 These kernels come in the "right" order.

It is commonly assumed that both factors matter. However, recent work by [Atanov et al., 2019] indirectly suggests that the second factor may not be essential. The goal of the present work is to check how reordering pre-trained kernels affects performance on a target dataset. We compare two setups: the usual pre-training, and the same pre-training with the kernels of the convolutional layers shuffled before the network is fine-tuned on the target dataset.
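The shuffling step can be illustrated with a minimal sketch. It assumes that a "kernel" is a single 2D spatial filter, stored per (output channel, input channel) pair, and that shuffling means a uniformly random permutation of these filters within one layer. The abstract does not spell out the exact procedure, so the function name `shuffle_kernels` and the nested-list layout are hypothetical choices made for illustration only.

```python
import random


def shuffle_kernels(layer, rng):
    """Return a layer of the same shape with its 2D kernels randomly permuted.

    `layer` is a nested list indexed as [out_channel][in_channel]; each entry
    is one 2D kernel (a list of rows). This layout is an assumption made for
    the sketch, not something stated in the abstract.
    """
    n_out, n_in = len(layer), len(layer[0])
    # Flatten the (out_channel, in_channel) grid into one list of kernels.
    kernels = [layer[o][i] for o in range(n_out) for i in range(n_in)]
    # Permute the kernels; the values inside each kernel are left untouched.
    rng.shuffle(kernels)
    # Rebuild a grid with the original shape from the permuted list.
    return [kernels[o * n_in:(o + 1) * n_in] for o in range(n_out)]


# Toy "pre-trained" layer: 2 output channels, 3 input channels, 2x2 kernels.
# Each kernel stores its original (o, i) position so the permutation is visible.
layer = [[[[o, i], [o, i]] for i in range(3)] for o in range(2)]

shuffled = shuffle_kernels(layer, random.Random(0))
```

After this step the layer holds exactly the same set of kernels as before, only in different positions. In the second setup described above, fine-tuning would start from `shuffled` rather than from `layer`.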

The main results are the following:

For shallow architectures:

1 Fine-tuning shuffled kernels performs at least as well as the same setup without shuffling when the source and target datasets are different;

2 Fine-tuning shuffled kernels leads to a similar evolution of the kernels as the same setup without shuffling. This evolution is measured as the cosine similarity between the kernels after fine-tuning and the kernels before fine-tuning.
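The cosine similarity measure mentioned above can be sketched as follows. This is an illustration under assumptions, not the thesis code: each kernel is flattened into a vector before comparison, and an all-zero kernel is given similarity 0.0 by convention. The function name `cosine_similarity` and the toy kernels are hypothetical.

```python
import math


def cosine_similarity(kernel_a, kernel_b):
    """Cosine similarity between two 2D kernels of the same shape."""
    # Flatten both kernels into plain vectors.
    a = [x for row in kernel_a for x in row]
    b = [x for row in kernel_b for x in row]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: zero kernels have similarity 0.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy kernel before fine-tuning and three hypothetical outcomes after it.
before = [[1.0, 0.0], [0.0, 1.0]]
after_scaled = [[2.0, 0.0], [0.0, 2.0]]       # same direction, larger norm
after_flipped = [[-1.0, 0.0], [0.0, -1.0]]    # opposite direction
after_orthogonal = [[0.0, 1.0], [1.0, 0.0]]   # no overlap with `before`

sim_scaled = cosine_similarity(before, after_scaled)
sim_flipped = cosine_similarity(before, after_flipped)
sim_orthogonal = cosine_similarity(before, after_orthogonal)
```

A value close to 1 means fine-tuning changed the direction of a kernel very little, while values near 0 or below indicate that the kernel was substantially rewritten.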

For deep architectures:

1 Shuffling the kernels of several top convolutional layers before fine-tuning helps generalization; if more layers are shuffled, performance degrades.

Therefore, we make the following conclusions:

1 The order of kernels after pre-training on a source dataset is not necessarily optimal for fine-tuning on a target dataset;

2 Shuffling pre-trained kernels can help generalization, and hence can be seen as a method of regularization.
