
Stability Improvement and Knowledge Transfer in Deep Reinforcement Learning
Evgenii Nikishin
Educational Programme: Statistical Learning Theory (Master’s programme)
Deep reinforcement learning (RL) methods have demonstrated many successes in a variety of applications in recent years. Nevertheless, deep RL methods still require solving many engineering problems, lack robustness to hyperparameter selection, and struggle to generalize between similar environments. In this thesis, we focus on two important issues that impede the wide practical use of deep RL: training instability and poor transferability of learned policies.

We observe that the average cumulative rewards are unstable throughout the learning process and do not increase monotonically given more training steps. Furthermore, a highly rewarded policy, once learned, is often forgotten by an agent, leading to performance deterioration. These problems are partly caused by the fundamental presence of noise in gradient estimators in RL. In order to reduce the effect of noise on training, we propose to apply stochastic weight averaging (SWA), a recent method that averages weights along the optimization trajectory. We show that SWA stabilizes the model solutions, alleviates the problem of forgetting the highly rewarded policy during training, and improves the average rewards on several Atari and MuJoCo environments.
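As a rough illustration of the weight-averaging idea described above, the following is a minimal sketch, not the thesis implementation. It runs noisy gradient descent on a toy quadratic objective and keeps a running mean of weight snapshots taken along the trajectory. The function names, the toy objective, and all hyperparameters (`lr`, `noise`, `swa_start`, `swa_every`) are assumptions made for illustration only.

```python
import numpy as np


def swa_update(w_avg, w_new, n_averaged):
    """Running mean: fold one more snapshot into an average of n_averaged snapshots."""
    return w_avg + (w_new - w_avg) / (n_averaged + 1)


def noisy_sgd_with_swa(steps=2000, lr=0.1, noise=1.0,
                       swa_start=1000, swa_every=10, seed=0):
    """Noisy gradient descent on 0.5 * ||w||^2 (optimum at 0) with weight averaging.

    The toy objective and all hyperparameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    w = np.array([5.0, -3.0])            # toy parameters standing in for network weights
    w_avg, n_avg = np.zeros_like(w), 0
    for t in range(steps):
        # Exact gradient is w; the added Gaussian term mimics a noisy gradient estimator.
        grad = w + noise * rng.standard_normal(w.shape)
        w = w - lr * grad
        # After a warm-up phase, average a snapshot of the weights every few steps.
        if t >= swa_start and (t - swa_start) % swa_every == 0:
            w_avg = swa_update(w_avg, w, n_avg)
            n_avg += 1
    return w, w_avg, n_avg


w_last, w_swa, n_snapshots = noisy_sgd_with_swa()
print("distance of last iterate to optimum:", np.linalg.norm(w_last))
print("distance of averaged weights to optimum:", np.linalg.norm(w_swa))
```

In this toy setting the last iterate keeps fluctuating around the optimum because of gradient noise, while the averaged weights smooth out those fluctuations. In a deep RL agent the same running mean would be applied to the network parameters, with the details depending on the algorithm used.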

We further note that the learned representations of observations often overspecialize to a particular environment and stop being useful in other environments, even if the environments have the same underlying dynamics.

In order to leverage a trained policy network in another environment, we propose to train a modified variational autoencoder (VAE) for each of the two environments. Each VAE produces latent representations of an observation and of the corresponding next observation by encoding the observation and modeling the dynamics in the latent space. For environments that have the same underlying dynamics, the weights of the dynamics model can be shared, which makes it possible to obtain the same latent space for the two environments. Given the shared latent space, we propose to imitate a policy trained in the source environment, resulting in a policy that can be applied in both the source and target environments.
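The structure described above can be sketched as follows. This is a hedged, forward-pass-only illustration under several assumptions that the summary does not state: linear layers instead of deep networks, one-hot actions fed into the dynamics model, a squared-error reconstruction term, and a policy head acting on the latent mean. The names `EnvVAE`, `shared_dynamics`, `policy_head`, `step_loss`, and `policy_logits` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT, ACTIONS = 4, 3   # illustrative sizes, not taken from the thesis


def linear(n_in, n_out):
    """Small random weight matrix standing in for a trainable layer."""
    return rng.standard_normal((n_in, n_out)) * 0.1


class EnvVAE:
    """Environment-specific encoder and decoder (hypothetical linear layers)."""

    def __init__(self, obs_dim):
        self.enc_mu = linear(obs_dim, LATENT)
        self.enc_logvar = linear(obs_dim, LATENT)
        self.dec = linear(LATENT, obs_dim)

    def encode(self, obs):
        mu, logvar = obs @ self.enc_mu, obs @ self.enc_logvar
        # Reparameterization trick: sample z from N(mu, exp(logvar)).
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
        return z, mu, logvar

    def decode(self, z):
        return z @ self.dec


# Shared across environments: latent dynamics and a policy head on latents.
shared_dynamics = linear(LATENT + ACTIONS, LATENT)
policy_head = linear(LATENT, ACTIONS)


def step_loss(vae, obs, action_onehot, next_obs):
    """Reconstruct obs and next_obs through the shared latent dynamics, plus a KL term."""
    z, mu, logvar = vae.encode(obs)
    z_next = np.concatenate([z, action_onehot], axis=-1) @ shared_dynamics
    recon = np.mean((vae.decode(z) - obs) ** 2)
    recon_next = np.mean((vae.decode(z_next) - next_obs) ** 2)
    kl = -0.5 * np.mean(1 + logvar - mu ** 2 - np.exp(logvar))
    return recon + recon_next + kl


def policy_logits(vae, obs):
    """Action logits computed from the latent mean, usable in either environment."""
    _, mu, _ = vae.encode(obs)
    return mu @ policy_head


# Two environments with different observation sizes but one dynamics network.
source, target = EnvVAE(obs_dim=8), EnvVAE(obs_dim=12)
batch = 5
actions = np.eye(ACTIONS)[rng.integers(0, ACTIONS, size=batch)]
for vae, dim in ((source, 8), (target, 12)):
    obs, next_obs = rng.standard_normal((batch, dim)), rng.standard_normal((batch, dim))
    print(dim, step_loss(vae, obs, actions, next_obs), policy_logits(vae, obs).shape)
```

Only the encoders and decoders differ between the two environments here; the dynamics weights and the policy head are the same objects for both, which is what would let a policy imitated on source latents be evaluated on target latents. Training, including the imitation loss against the source policy, is omitted from this sketch.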

In our preliminary experiments, we demonstrate that the proposed model is capable of imitating a trained policy as well as reconstructing observations and next observations for both source and target environments using the same latent dynamics network.

