```mermaid
timeline
title Major Reinforcement Learning Milestones
section Early Foundations
1957 : Perceptron : First neural network model
1989 : Q-Learning : Watkins introduces Q-Learning
section Modern Era
2013 : DQN : DeepMind's Deep Q-Network for Atari
2015 : TRPO : Trust Region Policy Optimization
2016 : AlphaGo : Defeats Lee Sedol
2017 : PPO : Proximal Policy Optimization
2018 : AlphaZero : Masters Chess, Shogi, and Go
2019 : AlphaStar : Grandmaster level in StarCraft II
```
Lately, I have been going through many of the original reinforcement learning research papers that shaped the RL research taking place today. With the popularity gained by deep neural networks in the early 2010s, RL got a breath of fresh air, as research progress in the field had largely stagnated. RL research came to be dominated by two particularly prominent methods:

- Deep Q-Learning (value-based methods)
- Policy Gradients (policy-based methods)
I will be going through both of these methods in detail, as these ideas have shaped the RL research we are experiencing today. The newfound popularity of neural networks brought obsolete and forgotten methods such as Policy Gradients back to the forefront: neural networks addressed some of the key issues with policy gradients, turning this once-dormant method into a hot new research area.

The most important breakthrough came from DeepMind, which not only brought RL to the forefront but also set DeepMind on the path to becoming the AI giant it is today, eventually leading to its acquisition by Google. Their seminal paper "Playing Atari with Deep Reinforcement Learning" presented the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. It used techniques like experience replay and target networks, enabling the agent to achieve human-level performance on various Atari games without any game-specific tuning. Building on this foundation, DeepMind released AlphaGo, which defeated Lee Sedol, and later AlphaZero, which learned entirely from self-play without any human data, achieving superhuman performance in Go, Chess, and Shogi.
While going through these papers, one paper absolutely perplexed me. I have a strong background in ML and mathematics, so I was genuinely surprised when I couldn't fully grasp all the details in "Trust Region Policy Optimization" by John Schulman and others. This paper describes an iterative procedure for optimizing policies with guaranteed monotonic improvement. The reason I found it tough to grasp is that it primarily builds on top of three already existing ideas:

- Minorize-Maximization (MM) algorithms
- Importance Sampling
- KL-divergence constraints (trust regions)
In order to thoroughly understand the intricate details of this paper, I will have to work through each of these ideas carefully and in great detail. This blog post explores the intricacies of Importance Sampling, a seemingly simple concept that can take time to fully understand.
Importance Sampling
Importance sampling is a statistical technique used to estimate properties of a particular distribution, while only having samples generated from a different distribution than the one of interest. It's especially useful when dealing with rare events or when the probability distribution of interest is difficult to sample from directly. In the context of Monte Carlo methods, it allows us to compute expectations with respect to one distribution using samples from another, more tractable distribution. In reinforcement learning (RL), importance sampling is particularly useful for off-policy evaluation and learning, where we aim to estimate the performance of a new policy using data generated from a different policy.
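As a concrete sketch of the off-policy evaluation idea, consider a hypothetical two-armed bandit: we act with a behavior policy `b` but want the expected reward under a target policy `pi`, re-weighting each observed reward by the ratio `pi(a)/b(a)`. All names and numbers here are illustrative, not from the TRPO paper:

```python
import random

random.seed(0)

# Hypothetical two-armed bandit: arm 0 pays 1.0, arm 1 pays 0.0
# (deterministic rewards keep the example easy to verify).
rewards = {0: 1.0, 1: 0.0}

# Behavior policy b samples each arm uniformly; target policy pi
# prefers arm 0. Both are just action -> probability tables.
b = {0: 0.5, 1: 0.5}
pi = {0: 0.9, 1: 0.1}

# Collect data under b, then re-weight each sample by pi(a)/b(a)
# to estimate the expected reward under pi (true value: 0.9 * 1.0 = 0.9).
n = 50_000
total = 0.0
for _ in range(n):
    a = 0 if random.random() < b[0] else 1   # act with the behavior policy
    total += (pi[a] / b[a]) * rewards[a]     # importance-weighted reward

estimate = total / n
print(estimate)  # close to 0.9
```

Note that we never had to execute `pi` in the environment: the importance ratios alone correct for the mismatch between the two policies, which is exactly what makes this useful for off-policy evaluation.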

Suppose we are interested in computing the expected value of a function $f(x)$ under
a probability distribution $P$:
$\mu = E_P[f(x)] = \int f(x)P(x)\,dx$
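As a minimal Monte Carlo sketch of this setup (all distributions chosen purely for illustration): take $f(x) = x^2$ and let $P$ be a standard normal, so the true value is $\mu = 1$. Instead of sampling from $P$ directly, we draw from a wider normal $Q$ and re-weight each sample by the density ratio $P(x)/Q(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2          # function whose expectation we want under P
n = 100_000
x = rng.normal(0.0, 2.0, n)   # samples from the proposal Q = N(0, 2^2)

# Importance weights w(x) = P(x) / Q(x), with P = N(0, 1).
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)

# Importance-sampling estimate of E_P[f(x)]; the true value is Var_P(x) = 1.
estimate = np.mean(w * f(x))
print(estimate)  # close to 1.0
```

Choosing a proposal $Q$ with heavier tails than $P$ keeps the weights bounded here; a poorly matched proposal can blow up the variance of the estimator, which is one of the subtleties the rest of this post digs into.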