Come with me now on a journey through code and data...

Reinforcement Learning: Finding The Optimal Policy

Recap

Last time I described the definition of Reinforcement Learning as:

Given a particular environment (or game)

Find the optimal function for choosing actions in each state.

By "optimal" we mean that these actions return the greatest rewards

Also known as the optimal policy $\pi^*$, where $$\pi^* = \arg\max_{\pi}\mathbb{E}\left[\sum\limits_{t \ge 0} \gamma^t r_t|\pi\right]$$

Next we'll look at how to find this optimal policy.

Definitions

Sample Trajectories or Paths

A "sample trajectory" or "path" is a route taken through the game-space. Given some initial state $s_t$, it describes every action taken, every reward given, and each new state reached, up until the terminal state.

Following the actions determined by a policy gives a set of trajectories. A trajectory is something like:

$$s_0, a_0, r_0, s_1, a_1, r_1, ...$$
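To make this concrete, here's a minimal sketch (with made-up states, actions, and rewards) of how a trajectory might be represented in code, along with the cumulative discounted reward from the recap formula above:

```python
# A tiny, hypothetical trajectory represented as (state, action, reward) steps.
trajectory = [
    ("s0", "right", 0.0),
    ("s1", "right", 0.0),
    ("s2", "jump",  1.0),  # this step reaches the terminal state
]

# Cumulative discounted reward along this trajectory, with discount factor gamma.
gamma = 0.9
total = sum(gamma**t * r for t, (_, _, r) in enumerate(trajectory))
print(total)  # 0.0 + 0.9*0.0 + 0.81*1.0 = 0.81
```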

Value Function

We can think of the value function as a way of evaluating how good a state is. It looks at all possible trajectories from this state until the terminal state and calculates the expected cumulative reward given by following the policy.
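In the same notation as the recap formula, the value function of a state $s$ under a policy $\pi$ is: $$V^{\pi}(s)=\mathbb{E}\left[\sum_{t\ge0}\gamma^{t}r_{t}|s_{0}=s,\pi\right]$$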

Q-Value Function

The Q-value function is a way of evaluating how good a state-action pair is. As in, how good is taking a particular action in a given state? A Q-value function is the expected cumulative reward from taking a specific action in a particular state, and then continuing to follow the policy.

So the Q-value function is the expected reward for this action in this state, plus the discounted value of whatever state we end up in.
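Written out, the Q-value function under a policy $\pi$ is: $$Q^{\pi}(s,a)=\mathbb{E}\left[\sum_{t\ge0}\gamma^{t}r_{t}|s_{0}=s,a_{0}=a,\pi\right]$$ which, as described above, can also be expressed as the immediate reward plus the discounted value of the next state: $$Q^{\pi}(s,a)=\mathbb{E}_{s'}\left[r+\gamma V^{\pi}(s')|s,a\right]$$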

The reason for separating out this step for calculating the value of an action will become clear below.

Q*

$Q^*$ is the optimal Q-value function: it returns the maximum expected cumulative reward that we can get from a given state-action pair. In other words, the value of taking the best trajectory from the given state, after taking the given action.

$$Q^*(s,a)=\max_{\pi}\mathbb{E}\left[\sum_{t\ge0}\gamma^{t}r_{t}|s_{0}=s,a_{0}=a,\pi\right]$$

As you can see, this is very similar to the definition of the optimal policy above. But instead of returning the best policy, it returns the best expected reward available from a particular point in the game.

Bellman Equation

What we are trying to solve is a Bellman Equation. A Bellman Equation writes the value of a decision problem at a certain point in time in terms of the payoff from some initial choices, and the value of the remaining decision problem that results from those initial choices.

$Q^*$ satisfies the Bellman Equation: $$Q^*(s,a)=\mathbb{E}_{s' \sim \mathcal{E}}\left[r+\gamma\max_{a'}Q^*(s',a') | s,a\right]$$

To reiterate the Q-value definition, the value of any state-action pair can be calculated as

the reward that you will get by taking this action in this state + the discounted value of the state that you end up in (given by the value function)

Walking through the above equation:

  • Say we are in state $s$ and we take action $a$.
  • And say action $a$ puts us in the state $s'$ ("s prime")
  • If we are calculating the optimal Q-value we know that, by definition, as we continue we're going to play the best action that we can for state $s'$ in the next timestep ($t+1$)
  • So the Q-value of the next state-action pair, $(s', a')$, is going to have the maximum value of all possible actions, given state $s'$

Note: The only way to know we are following the best possible path is to know every path available and have the ability to compare the possible rewards of different trajectories. Obviously as the state-space increases this quickly becomes infeasible. We'll go into this later. For now let's continue with this definition.

This optimal Q-value function is a recursive sum: the reward for the current action in the current state, plus the optimal Q-value for the following state-action pair (which is, in turn, the reward for the next action taken in the next state, chosen by following an optimal policy, plus the Q-value for the state-action pair after that).
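As a toy illustration (with made-up numbers): suppose taking action $a$ in state $s$ gives reward $r=1$ and lands us deterministically in $s'$, where the best available action has $\max_{a'}Q^*(s',a')=5$. With $\gamma=0.9$, the Bellman equation gives $Q^*(s,a)=1+0.9\times5=5.5$.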

The definition includes an expected value ($\mathbb{E}$) because there is randomness over which state $s'$ we'll end up in (the environment $\mathcal{E}$ may be stochastic).

So what about the policy?

We've been talking about ways of calculating the best path or trajectory through a game from any given state, which requires knowing every path available to us. But what we actually need is a policy (a function that specifies what action to take given a state) that picks, in any state, the best action as specified by $Q^*$.
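In other words, once we have $Q^*$, the optimal policy falls out by picking, in each state, the action with the highest optimal Q-value: $$\pi^*(s)=\arg\max_{a}Q^*(s,a)$$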

Value iteration algorithm

One solution is a value iteration algorithm, where we use the Bellman equation as an iterative update: $$Q_{i+1}(s,a) = \mathbb{E}\left[r+\gamma\max_{a'}Q_{i}(s',a')|s,a\right]$$ $Q_{i} \to Q^{*}$ as $i \to \infty$

At each step, we refine our approximation of $Q^{*}$ by enforcing the Bellman equation. Under some mathematical conditions, we know that this sequence of Q-functions $Q_{i}$ will converge to the optimal $Q^{*}$ as $i$ approaches infinity.
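Here's a minimal sketch of this update on a tiny made-up MDP (3 states, 2 actions, deterministic transitions, with state 2 terminal); the environment itself is purely illustrative, the point is the Bellman update:

```python
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9

# next_state[s, a] and reward[s, a] for each state-action pair.
# State 2 is terminal: it transitions to itself with zero reward.
next_state = np.array([[1, 0],
                       [2, 0],
                       [2, 2]])
reward = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 0.0]])

Q = np.zeros((n_states, n_actions))
for i in range(500):
    # Bellman update: Q_{i+1}(s,a) = r(s,a) + gamma * max_a' Q_i(s',a')
    Q_new = reward + gamma * Q[next_state].max(axis=2)
    if np.abs(Q_new - Q).max() < 1e-6:  # stop once the updates have converged
        Q = Q_new
        break
    Q = Q_new

print(Q)                 # approximate Q*(s,a) for every state-action pair
print(Q.argmax(axis=1))  # the greedy action in each state
```

The `argmax` at the end is exactly the $\pi^*(s)=\arg\max_{a}Q^*(s,a)$ relation from the previous section: once the Q-table has converged, reading off the greedy action in each state gives the optimal policy.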

As mentioned above, this approach is not feasible when the state-space is too large, as in the example of an Atari game, where the state space is every possible configuration of screen pixels. It's not really possible to make these computations for the entire state-space, so this solution is not scalable.

How can we make the solution scalable?

Deep Learning! Yay!

Neural networks are really good at approximating answers to really complex calculations. Using a function approximator to estimate the Q-value function $Q(s,a)$ is called Q-Learning, and using a deep neural network as that function approximator is called deep Q-Learning.
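As a teaser, here's a minimal sketch (assuming PyTorch) of what that function approximator could look like: just a small network mapping a state vector to one Q-value per action, with made-up layer sizes and none of the training machinery.

```python
import torch
import torch.nn as nn

# A small neural network that estimates Q(s, a) for every action at once.
class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.rand(1, 4)                # a dummy state vector
q_values = q_net(state)                 # estimated Q(s, a) for each action
greedy_action = q_values.argmax(dim=1)  # the action the current network prefers
print(q_values, greedy_action)
```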

More on that to come 🤗