```mermaid
timeline
title Major Reinforcement Learning Milestones
section Early Foundations
1957 : Perceptron : First neural network model
1989 : Q-Learning : Watkins introduces Q-Learning
section Modern Era
2013 : DQN : DeepMind's Deep Q-Network for Atari
2015 : TRPO : Trust Region Policy Optimization
2016 : AlphaGo : Defeats Lee Sedol
2017 : PPO : Proximal Policy Optimization
2018 : AlphaZero : Masters Chess, Shogi, and Go
2019 : AlphaStar : Grandmaster level in StarCraft II
```
Lately, I have been going through many of the original reinforcement learning research papers that shaped the RL research taking place today. With the popularity gained by deep neural networks in the early 2010s, RL got a breath of fresh air, as research progress in the field had largely stagnated. RL research came to be dominated by two particularly prominent methods:

- Deep Q-Learning (value-based methods)
- Policy Gradients (policy-based methods)
I will be going through both of these methods in detail, as these ideas have shaped the RL research we are experiencing today. The newfound popularity of neural networks brought obsolete and forgotten methods such as Policy Gradients back to the forefront: neural networks addressed some of the key issues with policy gradients, turning this once-dormant method into a hot new research area.

The most important breakthrough came from DeepMind, which not only brought RL to the forefront but also set DeepMind on the path to becoming the AI giant it is today, eventually leading to its acquisition by Google. Their seminal paper "Playing Atari with Deep Reinforcement Learning" presented the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. It used techniques like experience replay and target networks, enabling the agent to achieve human-level performance on various Atari games without any game-specific tuning. Building on this foundation, DeepMind released AlphaGo, which defeated Lee Sedol, and later AlphaZero, which learned entirely from self-play without any human data, achieving superhuman performance in Go, Chess, and Shogi.
While going through these papers, one paper absolutely perplexed me. I have a strong background in ML and mathematics, so I was genuinely surprised when I couldn't fully grasp all the details in "Trust Region Policy Optimization" by John Schulman and others. This paper describes an iterative procedure for optimizing policies with guaranteed monotonic improvement. The reason I found it tough to grasp is that it primarily builds on top of three already existing ideas:

- Minorize-Maximization (MM) algorithms
- Importance Sampling
- KL-divergence constraints (trust regions)
In order to thoroughly understand the intricate details of this paper, I will have to work through each of these ideas carefully and in great detail. This blog post explores the intricacies of Importance Sampling, a seemingly simple concept that can take time to fully understand.
Importance Sampling
Importance sampling is a statistical technique used to estimate properties of a particular distribution, while only having samples generated from a different distribution than the one of interest. It's especially useful when dealing with rare events or when the probability distribution of interest is difficult to sample from directly. In the context of Monte Carlo methods, it allows us to compute expectations with respect to one distribution using samples from another, more tractable distribution. In reinforcement learning (RL), importance sampling is particularly useful for off-policy evaluation and learning, where we aim to estimate the performance of a new policy using data generated from a different policy.
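As a concrete sketch of the off-policy evaluation idea, consider a hypothetical two-armed bandit: we act with a behavior policy `b` but want the expected reward under a target policy `pi`, re-weighting each observed reward by the ratio `pi(a)/b(a)`. All names and numbers here are illustrative, not from the TRPO paper:

```python
import random

random.seed(0)

# Hypothetical two-armed bandit: arm 0 pays 1.0, arm 1 pays 0.0
# (deterministic rewards keep the example easy to verify).
rewards = {0: 1.0, 1: 0.0}

# Behavior policy b samples each arm uniformly; target policy pi
# prefers arm 0. Both are just action -> probability tables.
b = {0: 0.5, 1: 0.5}
pi = {0: 0.9, 1: 0.1}

# Collect data under b, then re-weight each sample by pi(a)/b(a)
# to estimate the expected reward under pi (true value: 0.9 * 1.0 = 0.9).
n = 50_000
total = 0.0
for _ in range(n):
    a = 0 if random.random() < b[0] else 1   # act with the behavior policy
    total += (pi[a] / b[a]) * rewards[a]     # importance-weighted reward

estimate = total / n
print(estimate)  # close to 0.9
```

Note that we never had to execute `pi` in the environment: the importance ratios alone correct for the mismatch between the two policies, which is exactly what makes this useful for off-policy evaluation.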

Suppose we are interested in computing the expected value of a function $f(x)$ under
a probability distribution $P$:
$\mu = E_P[f(x)] = \int f(x)P(x)\,dx$
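As a minimal Monte Carlo sketch of this setup (all distributions chosen purely for illustration): take $f(x) = x^2$ and let $P$ be a standard normal, so the true value is $\mu = 1$. Instead of sampling from $P$ directly, we draw from a wider normal $Q$ and re-weight each sample by the density ratio $P(x)/Q(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2          # function whose expectation we want under P
n = 100_000
x = rng.normal(0.0, 2.0, n)   # samples from the proposal Q = N(0, 2^2)

# Importance weights w(x) = P(x) / Q(x), with P = N(0, 1).
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)

# Importance-sampling estimate of E_P[f(x)]; the true value is Var_P(x) = 1.
estimate = np.mean(w * f(x))
print(estimate)  # close to 1.0
```

Choosing a proposal $Q$ with heavier tails than $P$ keeps the weights bounded here; a poorly matched proposal can blow up the variance of the estimator, which is one of the subtleties the rest of this post digs into.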