
Bayes Theorem, maximum likelihood estimation and TensorFlow Probability

A growing trend in deep learning (and machine learning in general) is a probabilistic or Bayesian approach to the problem. Why is this? Simply put – a standard deep learning model produces a prediction, but with no statistically robust understanding of how confident the model is in that prediction. This is important for understanding the limitations of model predictions, and also if one wants to do probabilistic modeling of any kind. There are also other applications, such as probabilistic programming and being able to use domain knowledge, but more on that in another post. The TensorFlow developers have addressed this problem by creating TensorFlow Probability. This post will introduce some basic Bayesian concepts, specifically the likelihood function and maximum likelihood estimation, and how these can be used in TensorFlow Probability for the modeling of a simple function.

The code contained in this tutorial can be found on this site’s Github repository.


Eager to build deep learning systems in TensorFlow 2? Get the book here

 

Bayes theorem and maximum likelihood estimation

Bayes theorem is one of the most important statistical concepts a machine learning practitioner or data scientist needs to know. In the machine learning context, it can be used to estimate the model parameters (e.g. the weights in a neural network) in a statistically robust way. It can also be used in model selection, e.g. choosing which machine learning model is best suited to a given problem. I won’t be going in-depth into all the possible uses of Bayes theorem here, but I will be introducing the main components of the theorem.

Bayes theorem can be shown in a fairly simple equation involving conditional probabilities as follows:

$$P(\theta \vert D) = \frac{P(D \vert \theta) P(\theta)}{P(D)}$$

In this representation, the variable $\theta$ corresponds to the model parameters (i.e. the values of the weights in a neural network), and the variable $D$ corresponds to the data that we are using to estimate the $\theta$ values. Before I talk about what conditional probabilities are, I’ll just quickly point out three terms in this formula which are very important to familiarise yourself with, as they come up in the literature all the time. It is worthwhile memorizing what these terms refer to:

$P(\theta \vert D)$ – this is called the posterior

$P(D \vert \theta)$ – this is called the likelihood

$P(\theta)$ – this is called the prior

I’m going to explain what all these terms refer to shortly, but first I’ll make a quick detour to discuss conditional probability for those who may not be familiar. If you are already familiar with conditional probability, feel free to skip this section.

Conditional probability

Conditional probability is an important statistical concept that is thankfully easy to understand, as it forms a part of our everyday reasoning. Let’s say we have a random variable called RT which represents whether it will rain today – it is a discrete variable and can take on the value of either 1 or 0, denoting whether it will rain today or not. Let’s say we are in a fairly dry environment, and by consulting some long-term rainfall records we know that RT=1 about 10% of the time, and therefore RT=0 about 90% of the time. This fully represents the probability function for RT which can be written as P(RT). Therefore, we have some prior knowledge of what P(RT) is in the absence of any other determining factors.

Ok, so what does P(RT) look like if we know it rained yesterday? Is it the same or is it different? Well, let’s say the region we are in gets most of its rainfall due to big weather systems that can last for days or weeks – in this case, we have good reason to believe that P(RT) will be different given the fact that it rained yesterday. Therefore, the probability P(RT) is now conditioned on our understanding of another random variable RY, which represents whether it rained yesterday. The way of showing this conditional probability is by using the vertical slash symbol $\vert$ – so the conditional probability that it will rain today given it rained yesterday looks like the following: $P(RT=1 \vert RY = 1)$. Given this extra information, the probability that it will rain today is no longer 10%, but might rise to, say, 30%, so $P(RT=1 \vert RY = 1) = 0.3$

We could also look at other probabilities, such as $P(RT=1 \vert RY = 0)$ or $P(RT=0 \vert RY = 1)$ and so on. To generalize this relationship we would just write $P(RT \vert RY)$.
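The rain example can also be checked numerically. Below is a minimal Python sketch (the joint probabilities are hypothetical numbers, chosen so the marginal and conditional probabilities match the values used above) showing that a conditional probability is just a joint probability divided by a marginal:

```python
# A minimal sketch of conditional probability, using hypothetical joint
# probabilities chosen to match the rain example in the text:
# P(RT=1) = 0.1 and P(RT=1 | RY=1) = 0.3.

# joint[(rt, ry)] = P(RT=rt, RY=ry)
joint = {
    (1, 1): 0.03,  # rain today and rain yesterday
    (0, 1): 0.07,  # dry today, rain yesterday
    (1, 0): 0.07,  # rain today, dry yesterday
    (0, 0): 0.83,  # dry both days
}

# marginal P(RY=1): sum the joint probabilities over RT
p_ry1 = joint[(1, 1)] + joint[(0, 1)]

# conditional probability: P(RT=1 | RY=1) = P(RT=1, RY=1) / P(RY=1)
p_rt1_given_ry1 = joint[(1, 1)] / p_ry1
print(p_rt1_given_ry1)  # approximately 0.3
```

Note that summing the joint over RY also recovers the 10% marginal P(RT=1) quoted earlier.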

Now that you have an understanding of conditional probabilities, let’s move on to explaining Bayes Theorem (which contains two conditional probability functions) in more detail.

Bayes theorem in more detail

The posterior

Ok, so as I stated above, it is time to delve into the meaning of the individual terms of Bayes theorem. Let’s first look at the posterior term – $P(\theta \vert D)$. This term can be read as: given we have a certain dataset $D$, what is the probability of our parameters $\theta$? This is the term we want to maximize when varying the parameters of a model according to a dataset – by doing so, we find those parameters $\theta$ which are most probable given the model we are using and the training data supplied. The posterior is on the left-hand side of the equation of Bayes Theorem, so if we want to maximize the posterior we can do this by maximizing the right-hand side of the equation.

Let’s have a look at the terms on the right-hand side.

The likelihood

The likelihood is expressed as $P(D \vert \theta)$ and can be read as: given this parameter $\theta$, which defines some process of generating data, what is the probability we would see this given set of data $D$? Let’s say we have a scattering of data-points – a good example might be the heights of all the members of a classroom full of kids. We can define a model that we assume is able to generate or represent this data – in this case, the Normal distribution is a good choice. The parameters that we are trying to determine in the Normal distribution are the tuple ($\mu$, $\sigma$) – the mean and standard deviation of the Normal distribution.

So the likelihood $P(D \vert \theta)$ in this example is the probability of seeing this sample of measured heights given different values of the mean and standard deviation of the Normal distribution function. There is some more mathematical precision needed here (such as the difference between a probability distribution and a probability density function, discrete samples etc.) but this is ok for our purposes of coming to a conceptual understanding.

I’ll come back to the concept of the likelihood shortly when we discuss maximum likelihood estimation, but for now, let’s move on to the prior.

The prior

The prior probability $P(\theta)$, as can be observed, is not a conditioned probability distribution. It is simply a representation of the probability of the parameters prior to any consideration of the data or evidence. You may be puzzled as to what the point of this probability is. In the context of machine learning or probabilistic programming, its purpose is to let us specify some prior understanding of what the parameters should actually be, and the probability distribution they should be drawn from.

Returning to the example of the heights of kids in a classroom: let’s say the teacher is a pretty good judge of heights, and can therefore come to the problem with a rough prior estimate of what the mean height would be. Let’s say the teacher guesses that the average height is around 130cm. The teacher can then put a prior around the mean parameter $\mu$ of, say, a Normal distribution with a mean of 130cm.

The presence of the prior in Bayes theorem allows us to introduce expert knowledge or prior beliefs into the problem, which aids the finding of the optimal parameters $\theta$. These prior beliefs are then updated by the collected data $D$ – with the updating occurring through the action of the likelihood function.

The graph below is an example of the evolution of a prior distribution function exposed to some set of data:


The evolution of the prior distribution towards the evidence / data

Here we can see that, through the application of Bayes Theorem, we can start out with a certain set of prior beliefs in the form of a prior distribution function, but by applying the evidence or data through the likelihood $P(D \vert \theta)$, the posterior estimate $P(\theta \vert D)$ moves closer to “reality”.

The data

The final term in Bayes Theorem is the unconditioned probability distribution of the process that generated the data $P(D)$. In machine learning applications, this distribution is often unknown – but thankfully, it doesn’t matter. This distribution acts as a normalization constant and has nothing to say about the parameters we are trying to estimate, $\theta$. Therefore, because we are trying to simply maximize the right-hand side of the equation, it drops out of any derivative calculation that is made in order to find the maximum. So in the context of machine learning and estimating parameters, this term can be safely ignored. Given this understanding, the form of Bayes Theorem that we are mostly interested in for machine learning purposes is as follows: $$P(\theta \vert D) \propto P(D \vert \theta) P(\theta)$$

Given this formulation, all we are concerned with is either maximizing the right-hand side of the equation or simulating sampling from the posterior itself (not covered in this post).

How to estimate the posterior

Now that we have reviewed conditional probability concepts and Bayes Theorem, it is time to consider how to apply Bayes Theorem in practice to estimate the best parameters in a machine learning problem. There are a number of ways of estimating the posterior of the parameters in a machine learning problem. These include maximum likelihood estimation, maximum a posteriori (MAP) estimation, simulating sampling from the posterior using Markov Chain Monte Carlo (MCMC) methods such as Gibbs sampling, and so on. In this post, I will just be considering maximum likelihood estimation (MLE), with other methods being considered in future content on this site.

Maximum likelihood estimation (MLE)

What happens if we just throw our hands up in the air with regards to the prior $P(\theta)$ and say we don’t know anything about the best parameters to describe the data? In that case, the prior becomes a uniform or un-informative prior: $P(\theta)$ becomes a constant (the same probability no matter what the parameter values are), and Bayes Theorem reduces to:

$$P(\theta \vert D) \propto P(D \vert \theta)$$

If this is the case, all we have to do is maximize the likelihood $P(D \vert \theta)$, and by doing so we will also find the maximum of the posterior – i.e. the parameters with the highest probability given our model and data – or, in short, an estimate of the optimal parameters. If we have a way of calculating $P(D \vert \theta)$ while varying the parameters $\theta$, we can then feed this into some sort of optimizer to calculate:

$$\underset{\theta}{\operatorname{argmax}} P(D \vert \theta)$$
 
Nearly always, instead of maximizing $P(D \vert \theta)$, the log of $P(D \vert \theta)$ is maximized. Why? If we were doing the calculations by hand, we would need to calculate the derivative of the product of multiple exponential functions (as probability functions like the Normal distribution have exponentials in them) which is tricky. Because logs are monotonically increasing functions, they have maxima at the same point as the non-log function. So in other words, the maximum likelihood will occur at the same parameter value as the maximum of the log likelihood. By taking the log of the likelihood, products turn into sums and this makes derivative calculations a whole lot easier.
 
Finally, some optimizers in machine learning packages such as TensorFlow only minimize loss functions, so if we want to maximize the likelihood, we need to invert the sign of the objective and minimize it instead. In the case of maximum likelihood estimation, we minimize the negative log likelihood, or NLL, and get the same result.
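To make these ideas concrete before bringing in TensorFlow Probability, here is a small pure-Python sketch (the heights and the fixed $\sigma$ are made-up values, not from this post’s repository) that minimizes the negative log likelihood of a Normal model over a grid of candidate means. The minimizing $\mu$ lands on the sample mean, as theory says it should:

```python
import math

# A small pure-Python sketch of MLE: the heights below are hypothetical
# classroom measurements (in cm), and sigma is assumed known and fixed.
heights = [128.0, 131.5, 127.0, 133.0, 130.5, 129.0]
sigma = 5.0

def neg_log_likelihood(mu):
    # NLL of the data under a Normal(mu, sigma) model: the sum over samples
    # of -log N(x | mu, sigma), using the standard Normal density formula
    return sum(
        math.log(sigma) + 0.5 * math.log(2 * math.pi)
        + 0.5 * ((x - mu) / sigma) ** 2
        for x in heights
    )

# brute-force "optimizer": try candidate means from 120.0 to 139.9
candidates = [120 + 0.1 * i for i in range(200)]
best_mu = min(candidates, key=neg_log_likelihood)

sample_mean = sum(heights) / len(heights)
print(best_mu, sample_mean)  # best_mu sits within grid resolution of the mean
```

A real optimizer would use gradients rather than a grid search, but the principle – minimize the NLL to find the most likely parameters – is exactly what the TensorFlow Probability example below does.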
 
Let’s look at a simple example of maximum likelihood estimation by using TensorFlow Probability.

TensorFlow Probability and maximum likelihood estimation

For the simple example of maximum likelihood estimation that is to follow, TensorFlow Probability is overkill – however, TensorFlow Probability is a great extension of TensorFlow into the statistical domain, so it is worthwhile introducing MLE by utilizing it. The Jupyter Notebook containing this example can be found at this site’s Github repository. Note this example is loosely based on the TensorFlow tutorial found here. In this example, we will be estimating linear regression parameters based on noisy data. These parameters could, of course, be found using analytical techniques, but that isn’t as interesting. First, we import some libraries and generate the noisy data:

import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np
import matplotlib.pylab as plt
tfd = tfp.distributions

x_range = np.arange(0, 10, 0.1)
grad = 2.0
intercept = 3.0
lin_reg = x_range * grad + np.random.normal(0, 3.0, len(x_range)) + intercept

Plotting our noisy regression line looks like the following:


Noisy regression line

Next, let’s set up our little model to predict the underlying regression function from the noisy data:

model = tf.keras.Sequential([
  tf.keras.layers.Dense(1),
  tfp.layers.DistributionLambda(lambda x: tfd.Normal(loc=x, scale=1)),
])

So here we have a simple Keras sequential model (for more detail on Keras and TensorFlow, see this post). The first layer is a Dense layer with one node. Given that each Dense layer has one bias input by default, this layer equates to generating a simple line with a gradient and intercept: $xW + b$, where x is the input data, W is the single input weight and b is the bias weight. So the first Dense layer produces a line with a trainable gradient and y-intercept value.

The next layer is where TensorFlow Probability comes in. This layer allows you to create a parameterized probability distribution, with the parameter being “fed in” from the output of previous layers. In this case, you can observe that the lambda x, which is the output from the previous layer, is defining the mean of a Normal distribution. The scale (i.e. the standard deviation) is fixed to 1.0. So, using TensorFlow Probability, our model no longer predicts just a single value for each input (as in a non-probabilistic neural network) – instead, the output is actually a Normal distribution. In that case, to actually predict values we need to call statistical functions on the output of the model. For instance:

  • model(np.array([[1.0]])).sample(10) will produce a random sample of 10 outputs from the Normal distribution, parameterized by the input value 1.0 fed through the first Dense layer
  • model(np.array([[1.0]])).mean() will produce the mean of the distribution, given the input
  • model(np.array([[1.0]])).stddev() will produce the standard deviation of the distribution, given the input

and so on. We can also calculate the log probability of the output distribution, as will be discussed shortly. Next, we need to set up our “loss” function – in this case, our “loss” function is actually just the negative log likelihood (NLL):

def neg_log_likelihood(y_actual, y_predict):
  return -y_predict.log_prob(y_actual)

In the above, the y_actual values are the actual noisy training samples. The values y_predict are actually a tensor of parameterized Normal probability distributions – one for each different training input. So, for instance, if one training input is 5.0, the corresponding y_predict value will be a Normal distribution with a mean value of, say, 12. Another training input may have a value 10.0, and the corresponding y_predict will be a Normal distribution with a mean value of, say, 20, and so on. Therefore, for each y_predict and y_actual pair, it is possible to calculate the log probability of that actual value occurring given the predicted Normal distribution.

To make this more concrete – let’s say for a training input value 5.0, the corresponding actual noisy regression value is 8.0. However, let’s say the predicted Normal distribution has a mean of 10.0 (and a fixed variance of 1.0). Using the formula for the log probability / log likelihood of a Normal distribution:

$$\ell_x(\mu,\sigma^2) = -\ln \sigma - \frac{1}{2} \ln (2 \pi) - \frac{1}{2} \Big( \frac{x-\mu}{\sigma} \Big)^2$$

Substituting in the example values mentioned above:

$$\ell_x(10.0,1.0) = -\ln 1.0 - \frac{1}{2} \ln (2 \pi) - \frac{1}{2} \Big( \frac{8.0-10.0}{1.0} \Big)^2$$

We can calculate the log likelihood from the y_predict distribution and the y_actual values. Of course, TensorFlow Probability does this for us by calling the log_prob method on the y_predict distribution. Taking the negative of this calculation, as I have done in the function above, gives us the negative log likelihood value that we need to minimize to perform MLE.
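As a sanity check, the worked example above can be computed directly from the log probability formula with a few lines of plain Python, independent of TensorFlow Probability:

```python
import math

# Checking the worked example directly: the log likelihood of observing
# y_actual = 8.0 under a Normal with mean 10.0 and standard deviation 1.0.
def normal_log_prob(x, mu, sigma):
    # the log probability density formula quoted above
    return (-math.log(sigma) - 0.5 * math.log(2 * math.pi)
            - 0.5 * ((x - mu) / sigma) ** 2)

log_lik = normal_log_prob(8.0, 10.0, 1.0)
print(log_lik)   # approximately -2.919
print(-log_lik)  # the corresponding NLL contribution: approximately 2.919
```

This is the same number `y_predict.log_prob(y_actual)` would return for that prediction/observation pair.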

After the loss function, it is now time to compile the model, train it, and make some predictions:

model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.05), loss=neg_log_likelihood)
# Keras expects 2D inputs, so add a trailing feature axis to the 1D x values
model.fit(x_range[:, np.newaxis], lin_reg, epochs=500, verbose=False)

yhat = model(x_range[:, np.newaxis])
mean = yhat.mean()

As can be observed, the model is compiled using our custom neg_log_likelihood function as the loss. Because this is just a toy example, I am using the full dataset as both the train and test set. The estimated regression line is simply the mean of all the predicted distributions, and plotting it produces the following:

plt.close("all")
plt.scatter(x_range, lin_reg)
plt.plot(x_range, mean, label='predicted')
plt.plot(x_range, x_range * grad + intercept, label='ground truth')
plt.legend(loc="upper left")
plt.show()

TensorFlow Probability based regression using maximum likelihood estimation

Another example with changing variance

Another, more interesting, example is to use the model to predict not only the mean but also the changing variance of a dataset. In this example, the dataset follows the same trend, but the standard deviation of the noise increases along with the x values:

def sample_noise(x, grad=0.5, const=2.0):
  # noise standard deviation grows linearly with x
  return np.random.normal(0, grad * x + const)

x_range = np.arange(0, 10, 0.1)
noise = np.array([sample_noise(x) for x in x_range])
grad = 2.0
intercept = 3.0
lin_reg = x_range * grad + intercept + noise

plt.scatter(x_range, lin_reg)
plt.show()

Linear regression with increasing noise variance

The new model looks like the following:

model = tf.keras.Sequential([
  tf.keras.layers.Dense(2),
  tfp.layers.DistributionLambda(lambda x: tfd.Normal(loc=x[:, 0], scale=1e-3 + tf.math.softplus(0.3 * x[:, 1]))),
])

In this case, we have two nodes in the first layer, to predict both the mean and standard deviation of the Normal distribution, instead of just the mean as in the last example. The mean of the distribution is assigned to the output of the first node (x[:, 0]) and the standard deviation / scale is set equal to a softplus function of the output of the second node (x[:, 1]) – the softplus ensures the scale is always positive. After compiling and training this model on the new dataset, using the same NLL loss as the previous example, we can predict both the mean and standard deviation of the model like so:

# after recompiling and retraining the model on the new data, as before:
yhat = model(x_range[:, np.newaxis])
mean = yhat.mean()
upper = mean + 2 * yhat.stddev()
lower = mean - 2 * yhat.stddev()

In this case, the upper and lower variables are the 2-standard deviation upper and lower bounds of the predicted distributions. Plotting this produces:

plt.close("all")
plt.scatter(x_range, lin_reg)
plt.plot(x_range, mean, label='predicted')
plt.fill_between(x_range, lower, upper, alpha=0.1)
plt.plot(x_range, x_range * grad + intercept, label='ground truth')
plt.legend(loc="upper left")
plt.show()

Regression prediction with increasing variance

As can be observed, the model is successfully predicting the increasing variance of the dataset, along with the mean of the trend. This is a limited example of the power of TensorFlow Probability, but in future posts I plan to show how to develop more complicated applications like Bayesian Neural Networks. I hope this post has been useful for you in getting up to speed in topics such as conditional probability, Bayes Theorem, the prior, posterior and likelihood function, maximum likelihood estimation and a quick introduction to TensorFlow Probability. Look out for future posts expanding on the increasingly important probabilistic side of machine learning.




 

The post Bayes Theorem, maximum likelihood estimation and TensorFlow Probability appeared first on Adventures in Machine Learning.


Policy Gradient Reinforcement Learning in TensorFlow 2

In a series of recent posts, I have been reviewing the various Q based methods of deep reinforcement learning (see here, here, here, here and so on). Deep Q based reinforcement learning operates by training a neural network to learn the Q value for each action of an agent which resides in a certain state s of the environment. The policy which guides the actions of the agent in this paradigm operates by a random selection of actions at the beginning of training (the epsilon greedy method), but then the agent will select actions based on the highest Q value predicted in each state s. The Q value is simply an estimation of future rewards which will result from taking action a. An alternative to the deep Q based reinforcement learning is to forget about the Q value and instead have the neural network estimate the optimal policy directly. Reinforcement learning methods based on this idea are often called Policy Gradient methods.

This post will review the REINFORCE or Monte-Carlo version of the Policy Gradient methodology. This methodology will be used in the OpenAI Gym CartPole environment. All code used and explained in this post can be found on this site’s Github repository.




Policy Gradients and their theoretical foundation

This section will review the theory of Policy Gradients, and how we can use them to train our neural network for deep reinforcement learning. This section will feature a fair bit of mathematics, but I will try to explain each step and idea carefully for those who aren’t as familiar with the mathematical ideas. We’ll also skip over a step at the end of the analysis for the sake of brevity.  

In Policy Gradient based reinforcement learning, the objective function which we are trying to maximise is the following:

$$J(\theta) = \mathbb{E}_{\pi_\theta} \left[\sum_{t=0}^{T-1} \gamma^t r_t \right]$$
 
The function above means that we are attempting to find a policy ($\pi$) with parameters ($\theta$) which maximises the expected value of the sum of the discounted rewards of an agent in an environment. Therefore, we need to find a way of varying the parameters of the policy $\theta$ such that the expected value of the discounted rewards is maximised. In the deep reinforcement learning case, the parameters $\theta$ are the parameters of the neural network.
 
Note the difference to the deep Q learning case – in deep Q based learning, the parameters we are trying to find are those that minimise the difference between the actual Q values (drawn from experiences) and the Q values predicted by the network. However, in Policy Gradient methods, the neural network directly determines the actions of the agent – usually by using a softmax output and sampling from this. 
 
Also note that, because environments are usually non-deterministic, under any given policy ($\pi_\theta$) we are not always going to get the same reward. Rather, we are going to be sampling from some probability function as the agent operates in the environment, and therefore we are trying to maximise the expected sum of rewards, not the 100% certain, we-will-get-this-every-time reward sum.
 
Ok, so we want to learn the optimal $\theta$. The way we generally learn parameters in deep learning is by performing some sort of gradient based search of $\theta$. So we want to iteratively execute the following:
 
$$\theta \leftarrow \theta + \alpha \nabla J(\theta)$$
 
So the question is, how do we find $\nabla J(\theta)$?

Finding the Policy Gradient

First, let’s make the expectation a little more explicit. Remember, the expectation of the value of a function $f(x)$ is the summation of all the possible values due to variations in x multiplied by the probability of x, like so:

$$\mathbb{E}[f(x)] = \sum_x P(x)f(x)$$
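As a quick numeric illustration of this definition (using a made-up stand-in example: a fair six-sided die with $f(x) = x^2$):

```python
# A quick numeric illustration of E[f(x)] = sum_x P(x) f(x), using a fair
# six-sided die and f(x) = x**2 (made-up stand-ins for P and f).
p = {x: 1 / 6 for x in range(1, 7)}

def f(x):
    return x ** 2

expectation = sum(p[x] * f(x) for x in p)
print(expectation)  # (1 + 4 + 9 + 16 + 25 + 36) / 6, approximately 15.17
```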
 
Ok, so what does the cashing out of the expectation in $J(\theta)$ look like? First, we have to define the function which produces the rewards, i.e. the rewards equivalent of $f(x)$ above. Let’s call this $R(\tau)$ (where
$R(\tau) = \sum_{t=0}^{T-1}r_t$, ignoring discounting for the moment). The value $\tau$ is the trajectory of the agent “moving” through the environment. It can be defined as:
 
$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$$
 
The trajectory, as can be seen, is the progress of the agent through an episode of a game of length T. This trajectory is the fundamental factor which determines the sum of the rewards – hence $R(\tau)$. This covers the $f(x)$ component of the expectation definition. What about the $P(x)$ component? In this case, it is equivalent to $P(\tau)$ – but what does this actually look like in a reinforcement learning environment?
 
It consists of the two components – the probabilistic policy function which yields an action $a_t$ from states $s_t$ with a certain probability, and a probability that state $s_{t+1}$ will result from taking action $a_t$ from state $s_t$. The latter probabilistic component is uncertain due to the random nature of many environments. These two components operating together will “roll out” the trajectory of the agent $tau$. 
 
$P(\tau)$ looks like:
 
$$P(\tau) = \prod_{t=0}^{T-1} P_{\pi_\theta}(a_t|s_t)P(s_{t+1}|s_t,a_t)$$
 
If we take the first step, starting in state $s_0$ – our neural network will produce a softmax output with each action assigned a certain probability. The action is then selected by weighted random sampling subject to these probabilities – therefore, we have a probability of action $a_0$ being selected according to $P_{\pi_\theta}(a_t|s_t)$. This probability is determined by the policy $\pi$, which in turn is parameterised according to $\theta$ (i.e. a neural network with weights $\theta$). The next term will be $P(s_1|s_0,a_0)$, which expresses any non-determinism in the environment.
 
(Note: the vertical line in the probability functions above denotes a conditional probability. $P_{\pi_\theta}(a_t|s_t)$ refers to the probability of action $a_t$ being selected, given the agent is in state $s_t$.)
 
These probabilities are multiplied out over all the steps in the episode of length T to produce the trajectory $\tau$. (Note, the probability of being in the first state, $s_0$, has been excluded from this analysis for simplicity). Now we can substitute $P(\tau)$ and $R(\tau)$ into the original expectation and take the derivative to get to $\nabla J(\theta)$, which is what we need to do the gradient based optimisation. However, to get there, we first need to apply a trick or two.
 
First, let’s take the log derivative of $P(\tau)$ with respect to $\theta$, i.e. $\nabla_\theta$, and work out what we get:
 
 
$$\nabla_\theta \log P(\tau) = \nabla_\theta \log \left(\prod_{t=0}^{T-1} P_{\pi_\theta}(a_t|s_t)P(s_{t+1}|s_t,a_t)\right) $$
$$ =\nabla_\theta \left[\sum_{t=0}^{T-1} \left(\log P_{\pi_\theta}(a_t|s_t) + \log P(s_{t+1}|s_t,a_t)\right) \right]$$
$$ =\nabla_\theta \sum_{t=0}^{T-1}\log P_{\pi_\theta}(a_t|s_t)$$
 
The reason we are taking the log will be made clear shortly. As can be observed, when the log is taken of the multiplicative operator ($\prod$) this is converted to a summation (as multiplying terms within a log function is equivalent to adding them separately). In the final line, it can be seen that taking the derivative with respect to the parameters ($\theta$) removes the dynamics of the environment ($\log P(s_{t+1}|s_t,a_t)$), as these are independent of the neural network parameters / $\theta$.
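The product-to-sum step can be verified numerically. Here is a toy sketch (the policy and transition probabilities are made-up numbers) confirming that the log of the trajectory product equals the sum of the individual log terms:

```python
import math

# A toy numeric check (made-up probabilities) that the log of the trajectory
# product P(tau) = prod_t pi(a_t|s_t) * P(s_{t+1}|s_t, a_t) equals the sum
# of the individual log terms.
policy_probs = [0.9, 0.7, 0.8]       # pi(a_t | s_t) at each step
transition_probs = [0.95, 0.9, 1.0]  # P(s_{t+1} | s_t, a_t) at each step

# multiply the per-step probabilities to get P(tau)
p_tau = 1.0
for pa, pt in zip(policy_probs, transition_probs):
    p_tau *= pa * pt

log_of_product = math.log(p_tau)
log_p_tau = sum(math.log(pa) + math.log(pt)
                for pa, pt in zip(policy_probs, transition_probs))
print(log_of_product, log_p_tau)  # the two values agree
```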
 
Let’s go back to our original expectation function, substituting in our new trajectory based functions, and apply the derivative (again ignoring discounting for simplicity):
 
$$J(\theta) = \mathbb{E}[R(\tau)]$$
$$ = \int P(\tau) R(\tau) \, d\tau$$
$$\nabla_\theta J(\theta) = \nabla_\theta \int P(\tau) R(\tau) \, d\tau$$
 
So far so good. Now, we are going to utilise the following rule which is sometimes called the “log-derivative” trick:
 
$$\frac{\nabla_\theta p(X,\theta)}{p(X, \theta)} = \nabla_\theta \log p(X,\theta)$$
 
We can then apply the $\nabla_\theta$ operator within the integral, and cajole our equation so that we get the $\frac{\nabla_\theta P(\tau)}{P(\tau)}$ expression, like so:
 
$$\nabla_\theta J(\theta)=\int P(\tau) \frac{\nabla_\theta P(\tau)}{P(\tau)} R(\tau) \, d\tau$$
 
Then, using the log-derivative trick and applying the definition of expectation, we arrive at:
 
$$\nabla_\theta J(\theta)=\mathbb{E}\left[R(\tau) \nabla_\theta \log P(\tau)\right]$$
 
We can then substitute our previous derivation of $\nabla_\theta \log P(\tau)$ into the above to arrive at:
 
$$\nabla_\theta J(\theta) =\mathbb{E}\left[R(\tau) \nabla_\theta \sum_{t=0}^{T-1} \log P_{\pi_\theta}(a_t|s_t)\right]$$
 
This is now close to being something we can work with in our learning algorithm. Let’s take it one step further by recognising that, during our learning process, we are randomly sampling trajectories from the environment and hoping to make informed training steps. Therefore, to maximise the expectation above, we maximise it with respect to its argument, i.e. we maximise:
 
$$\nabla_\theta J(\theta) \sim R(\tau) \nabla_\theta \sum_{t=0}^{T-1} \log P_{\pi_\theta}(a_t|s_t)$$
 
Recall that $R(\tau)$ is equal to $R(\tau) = \sum_{t=0}^{T-1}r_t$ (ignoring discounting). Therefore, we have two summations that need to be multiplied out, element by element. It turns out that after doing this, we arrive at an expression like so:
 
$$\nabla_\theta J(\theta) \sim \sum_{t=0}^{T-1} \nabla_\theta \log P_{\pi_\theta}(a_t|s_t) \left(\sum_{t'= t + 1}^{T} \gamma^{t'-t-1} r_{t'} \right)$$
 
As can be observed, there are two main components that need to be multiplied. However, one should note the differences in the bounds of the summation terms in the equation above – these will be explained in the next section.
 

Calculating the Policy Gradient

The way we compute the gradient as expressed above in the REINFORCE method of the Policy Gradient algorithm involves sampling trajectories through the environment to estimate the expectation, as discussed previously. This REINFORCE method is therefore a kind of Monte-Carlo algorithm. Let’s consider this a bit more concretely. 
 
Let’s say we initialise the agent and let it play a trajectory $\tau$ through the environment. The actions of the agent will be selected by performing weighted sampling from the softmax output of the neural network – in other words, we’ll be sampling the action according to $P_{\pi_\theta}(a_t|s_t)$. At each step in the trajectory, we can easily calculate $\log P_{\pi_\theta}(a_t|s_t)$ by simply taking the log of the softmax output values from the neural network. So, for the first step in the trajectory, the neural network would take the initial state $s_0$ as input, and it would produce a vector of action pseudo-probabilities generated by the softmax operation in the final layer.
 
What about the second part of the $\nabla_\theta J(\theta)$ equation – $\sum_{t'=t+1}^{T} \gamma^{t'-t-1} r_{t'}$? For the first step ($t = 0$), the summation starts at $t' = t + 1 = 1$ and runs to the end of the episode. Let’s say the episode was 4 steps long – this term would then look like $\gamma^0 r_1 + \gamma^1 r_2 + \gamma^2 r_3$, where $\gamma$ is the discounting factor and is < 1.
 
Straight-forward enough. However, you may have realised that, in order to calculate the gradient $\nabla_\theta J(\theta)$ at the first step in the trajectory/episode, we need to know the reward values of every subsequent state in the episode. Therefore, we can only take gradient learning steps after the full episode has been played to completion.
 
We are almost ready to move onto the code part of this tutorial. However, this is a good place for a quick discussion about how we would actually implement the calculation of $\nabla_\theta J(\theta)$ in TensorFlow 2 / Keras. It turns out we can just use the standard cross entropy loss function to execute these calculations. Recall that cross entropy is defined as (for a deeper explanation of entropy, cross entropy, information and KL divergence, see this post):
 
$$CE = -\sum_x p(x) \log(q(x))$$
 
This is just the summation of one function $p(x)$ multiplied by the log of another function $q(x)$ over the possible values of the argument $x$. If we look at the source code of the Keras implementation of cross entropy, we can see the following:
 
Keras output of cross-entropy loss function

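The screenshot itself is not reproduced here, but the computation it shows is approximately the following – a numpy sketch of the Keras backend logic, not the exact source:

```python
import numpy as np

def categorical_crossentropy(target, output, epsilon=1e-7):
    # normalise the output so each row sums to one (a softmax output
    # should already satisfy this) and clip to avoid log(0)
    output = output / np.sum(output, axis=-1, keepdims=True)
    output = np.clip(output, epsilon, 1.0 - epsilon)
    # cross entropy: -sum(target * log(output)) over the last axis
    return -np.sum(target * np.log(output), axis=-1)

# example: a one-hot target against a uniform two-action softmax output
ce = categorical_crossentropy(np.array([[1.0, 0.0]]),
                              np.array([[0.5, 0.5]]))
```

Note that the one-hot target picks out a single log-probability from the output row – exactly the $\log$ term we need in the policy gradient.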

The output tensor here is simply the softmax output of the neural network, which, for our purposes, will be a tensor of size (num_steps_in_episode, num_actions). Note that the log of output is calculated in the above. The target value, for our purposes, can be built from the discounted rewards calculated at each step in the trajectory. The summation of the multiplication of these terms is then calculated (reduce_sum). Gradient-based training in TensorFlow 2 is generally a minimisation of the loss function; however, we want to maximise the calculation as discussed above. The good thing is that the sign of the cross entropy calculation shown above is inverted – so we are good to go.

To call this training step utilising Keras, all we have to do is execute something like the following:

network.train_on_batch(states, discounted_rewards)

Here, we supply all the states gathered over the length of the episode, and the discounted rewards at each of those steps. The Keras backend will pass the states through the network, apply the softmax function, and this will become the output variable in the Keras source code snippet above. Likewise, discounted_rewards plays the role of target in the source code snippet above.

Now that we have covered all the pre-requisite knowledge required to build a REINFORCE-type method of Policy Gradient reinforcement learning, let’s have a look at how this can be coded and applied to the Cartpole environment.

Policy Gradient reinforcement learning in TensorFlow 2 and Keras

In this section, I will detail how to code a Policy Gradient reinforcement learning algorithm in TensorFlow 2 applied to the Cartpole environment. As always, the code for this tutorial can be found on this site’s Github repository.

First, we define the network which we will use to produce $P_{\pi_\theta}(a_t|s_t)$ with the state as the input:

import datetime as dt

import gym
import numpy as np
import tensorflow as tf
from tensorflow import keras

GAMMA = 0.95

env = gym.make("CartPole-v0")
state_size = 4
num_actions = env.action_space.n

network = keras.Sequential([
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(num_actions, activation='softmax')
])
network.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam())

As can be observed, first the environment is initialised. Next, the network is defined using the Keras Sequential API. The network consists of 3 densely connected layers. The first 2 layers have ReLU activations, and the final layer has a softmax activation to produce the pseudo-probabilities that approximate $P_{\pi_\theta}(a_t|s_t)$. Finally, the network is compiled with a cross entropy loss function and an Adam optimiser.

The next part of the code chooses the action from the output of the model:

def get_action(network, state, num_actions):
    softmax_out = network(state.reshape((1, -1)))
    selected_action = np.random.choice(num_actions, p=softmax_out.numpy()[0])
    return selected_action

As can be seen, first the softmax output is extracted from the network by inputting the current state. The action is then selected by making a random choice from the number of possible actions, with the probabilities weighted according to the softmax values.
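The weighted sampling can be illustrated in isolation – a standalone sketch with made-up pseudo-probabilities, not part of the post’s code:

```python
import numpy as np

num_actions = 2
softmax_out = np.array([0.9, 0.1])  # made-up pseudo-probabilities

# sample many actions and check the empirical frequency of action 0
# roughly matches its softmax weight
np.random.seed(0)
samples = np.random.choice(num_actions, size=10000, p=softmax_out)
freq_action_0 = np.mean(samples == 0)
```

Over many draws, action 0 is chosen about 90% of the time, as expected.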

The next function is the main function involved in executing the training step:

def update_network(network, rewards, states, actions, num_actions):
    reward_sum = 0
    discounted_rewards = []
    for reward in rewards[::-1]:  # reverse buffer r
        reward_sum = reward + GAMMA * reward_sum
        discounted_rewards.append(reward_sum)
    discounted_rewards.reverse()
    discounted_rewards = np.array(discounted_rewards)
    # standardise the rewards
    discounted_rewards -= np.mean(discounted_rewards)
    discounted_rewards /= np.std(discounted_rewards)
    states = np.vstack(states)
    # weight the one-hot encoded actions by the discounted rewards, so that
    # the cross entropy loss reduces to -sum_t G_t * log P(a_t|s_t)
    target_actions = keras.utils.to_categorical(np.array(actions), num_actions)
    target_actions = target_actions * discounted_rewards.reshape((-1, 1))
    loss = network.train_on_batch(states, target_actions)
    return loss

First, the discounted rewards list is created: this is a list where each element corresponds to the summation from $t + 1$ to $T$ according to $\sum_{t'=t+1}^{T} \gamma^{t'-t-1} r_{t'}$. The input argument rewards is a list of all the rewards achieved at each step in the episode. The rewards[::-1] operation reverses the order of the rewards list, so the first run through the for loop deals with the last reward recorded in the episode. As can be observed, a reward sum is accumulated each time the for loop is executed. Let’s say that the episode length is equal to 4 – $r_3$ will refer to the last reward recorded in the episode. In this case, the discounted_rewards list would look like:

[$r_3$, $r_2 + \gamma r_3$, $r_1 + \gamma r_2 + \gamma^2 r_3$, $r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3$]

 

This list is in reverse order relative to the actual state list (i.e. [$s_0$, $s_1$, $s_2$, $s_3$]), so the next line after the for loop reverses it (discounted_rewards.reverse()).

Next, the list is converted into a numpy array, and the rewards are normalised to reduce the variance in the training. Finally, the states list is stacked into a numpy array and both this array and the discounted rewards array are passed to the Keras train_on_batch function, which was detailed earlier. 
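The reversal logic can be checked in isolation – a self-contained sketch with made-up reward values:

```python
import numpy as np

GAMMA = 0.95
rewards = [1.0, 1.0, 1.0, 1.0]   # made-up rewards r_0..r_3

reward_sum = 0.0
discounted_rewards = []
for reward in rewards[::-1]:     # iterate from r_3 back to r_0
    reward_sum = reward + GAMMA * reward_sum
    discounted_rewards.append(reward_sum)
discounted_rewards.reverse()     # restore original time ordering

# discounted_rewards[0] = r_0 + g*r_1 + g^2*r_2 + g^3*r_3
# discounted_rewards[-1] = r_3
```

With all rewards equal to 1, the first element is $1 + 0.95 + 0.95^2 + 0.95^3 = 3.709875$ and the last element is simply 1, matching the list shown above.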

The next part of the code is the main episode and training loop:

num_episodes = 10000000
train_writer = tf.summary.create_file_writer(STORE_PATH + f"/PGCartPole_{dt.datetime.now().strftime('%d%m%Y%H%M')}")
for episode in range(num_episodes):
    state = env.reset()
    rewards = []
    states = []
    actions = []
    while True:
        action = get_action(network, state, num_actions)
        new_state, reward, done, _ = env.step(action)
        states.append(state)
        rewards.append(reward)
        actions.append(action)

        if done:
            loss = update_network(network, rewards, states, actions, num_actions)
            tot_reward = sum(rewards)
            print(f"Episode: {episode}, Reward: {tot_reward}, avg loss: {loss:.5f}")
            with train_writer.as_default():
                tf.summary.scalar('reward', tot_reward, step=episode)
                tf.summary.scalar('avg loss', loss, step=episode)
            break

        state = new_state

As can be observed, at the beginning of each episode, three lists are created which will contain the state, reward and action values for each step in the episode / trajectory. These lists are appended to until the done flag is returned from the environment signifying that the episode is complete. At the end of the episode, the training step is performed on the network by running update_network. Finally, the rewards and loss are logged in the train_writer for viewing in TensorBoard. 

The training results can be observed below:

Training progress of Policy Gradient RL on Cartpole environment


As can be observed, the rewards steadily increase until they “top out” at the maximum possible reward summation for the Cartpole environment, which is equal to 200. However, the user can verify that repeated runs of this version of Policy Gradient training have a high variance in their outcomes. Therefore, improvements to the Policy Gradient REINFORCE algorithm are required and available – these improvements will be detailed in future posts.

 




 

 

The post Policy Gradient Reinforcement Learning in TensorFlow 2 appeared first on Adventures in Machine Learning.


Atari Space Invaders and Dueling Q RL in TensorFlow 2

In previous posts (here and here) I introduced Double Q learning and the Dueling Q architecture. These followed on from posts about deep Q learning, and showed how Double Q and Dueling Q learning are superior to vanilla deep Q learning. However, those posts only included examples of simplistic environments like the OpenAI Cartpole environment. These types of environments are good to learn on, but more complicated environments are both more interesting and fun. They also better demonstrate the complexities of implementing deep reinforcement learning in realistic cases. In this post, I’ll use similar code to that shown in my Dueling Q TensorFlow 2 post, but in this case apply it to the OpenAI Atari Space Invaders environment. All code for this post can be found on this site’s Github repository. Also, as mentioned in the title, the example code for this post is written using TensorFlow 2. TensorFlow 2 is now released and installation instructions can be found here.




Double and Dueling Q learning recap

Double Q recap

Double Q learning was created to address two problems with vanilla deep Q learning. These are:

  1. Using the same network to both choose the best action and evaluate the quality of that action is a source of feedback / learning instability.
  2. The max function used in calculating the target Q value (see formula below), which the neural network is to learn, tends to bias the network towards high, noisy rewards. This again hampers learning and makes it more erratic.

The problematic Bellman equation is shown below:

$$Q_{target} = r_{t+1} + \gamma \max_{a}Q(s_{t+1}, a;\theta_t)$$

The Double Q solution to the two problems above involves creating another network, the target network, which is initially created with weights equal to the primary network. However, during training the primary network and the target network are allowed to “drift” apart. The primary network is trained as per usual, but the target network is not. Instead, the target network weights are either periodically (but not frequently) set equal to the primary network weights, or they are gradually “blended” with the primary network weights in a weighted average fashion. The benefit then comes from the fact that, in Double Q learning, the Q value of the best action in the next state ($s_{t+1}$) is extracted from the target network, not the primary network. The primary network is still used to evaluate what the best action, $a^*$, will be, by taking an argmax of the outputs from the primary network, but the Q value for this action is evaluated from the target network. This can be observed in the formulation below:

$$a^* = \operatorname{argmax}_{a} Q(s_{t+1}, a; \theta_t)$$

$$Q_{target} = r_{t+1} + \gamma Q(s_{t+1}, a^*; \theta^-_t)$$

Notice the different weights involved in the formulas above – the best action, $a^*$, is calculated from the network with $\theta_t$ weights – these are the primary network weights. However, the $Q_{target}$ calculation uses the target network, with weights $\theta^-_t$, to estimate the Q value for this chosen action. This Double Q methodology decouples the choosing of an action from the evaluation of the Q value of such an action. This provides more stability to the learning – for more details and a demonstration of the superiority of the Double Q methodology over vanilla Deep Q learning, see this post.
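The Double Q target calculation can be sketched in numpy for a single transition – the Q values, reward and discount below are illustrative, not taken from the post:

```python
import numpy as np

gamma = 0.95
reward = 1.0

# Q values for the next state from each network (one state, two actions)
q_primary_next = np.array([1.0, 3.0])  # primary network, theta_t
q_target_next = np.array([0.5, 2.0])   # target network, theta^-_t

# choose a* with the primary network, but evaluate its Q value
# with the target network
a_star = np.argmax(q_primary_next)
q_target_value = reward + gamma * q_target_next[a_star]
```

Note that the primary network’s own (possibly over-optimistic) value of 3.0 for the chosen action never enters the target – only the target network’s estimate of 2.0 does.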

Dueling Q recap

The Dueling Q architecture, discussed in detail in this post, is an improvement to the Double Q network. It uses the same methodology of a target and a primary network, with periodic updates or blending of the target network weights to the primary network weights. However, it builds two important concepts into the architecture of the network. These are the advantage and value functions:

  • Advantage function A(s, a): The advantage function is the relative benefit of choosing a certain action in state s over the other possible actions in state s
  • Value function V(s): The value function is the value of being in state s, independent of the relative benefits of the actions within that state

The Q function is the simple addition of these two functions:

$$Q(s, a) = V(s) + A(s, a)$$

The motivation for splitting these two functions explicitly in the architecture is that there can be inherently good or bad states for the agent to be in, regardless of the relative benefit of any actions within that state. For instance, in a certain state, all actions may lead to the agent “dying” in a game – this is an inherently bad state to be in, and there is no need to waste computational resources trying to determine the best action in this state. The converse can also be true. Ideally, this “splitting” into the advantage function and value function should be learnt implicitly during training. However, the Dueling Q architecture makes this split explicit, which acts to improve training. The Dueling Q architecture can be observed in the figure below:

Dueling Q architecture


It can be observed that in the Dueling Q architecture, there are common convolutional neural network layers which perform image processing. The output from these layers is then flattened, and the network bifurcates into a value function stream V(s) and an advantage function stream A(s, a). The outputs of these separate streams are then aggregated in a special layer, before the network finally outputs Q values. The aggregation layer does not perform a simple addition of the value and advantage streams – this would result in problems of identifiability (for more details on this, see the original Dueling Q post). Instead, the following aggregation function is performed:

$$Q(s,a) = V(s) + A(s,a) - \frac{1}{|a|}\sum_{a'}A(s,a')$$

In this post, I’ll demonstrate how to use the Dueling Q architecture to train an agent in TensorFlow 2 to play Atari Space Invaders. However, I will concentrate on the extra considerations required to train the agent via an image stream from an Atari game. For more details, again, refer to the original Dueling Q post.
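The aggregation can be sketched in numpy for a single state – the value and advantage numbers below are illustrative. Note how adding a constant offset to the advantage stream leaves Q unchanged, which is how the mean subtraction resolves the identifiability problem:

```python
import numpy as np

# illustrative value and advantage outputs for one state, two actions
v = 2.0
adv = np.array([1.0, 3.0])

# aggregation layer: subtract the mean advantage before adding V(s)
q = v + (adv - np.mean(adv))

# a constant shift of the advantage stream is cancelled by the
# mean subtraction, so the Q values are unaffected
q_shifted = v + ((adv + 5.0) - np.mean(adv + 5.0))
```

With these numbers, the mean advantage is 2.0, so the aggregated Q values are [1.0, 3.0] whether or not the advantage stream is offset.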

Considerations for training in an Atari environment

Training reinforcement learning agents on Atari environments is hard – it can be a very time consuming process as the environment complexity is high, especially when the agent needs to visually interpret objects direct from images. As such, each environment needs to be considered to determine legitimate ways of reducing the training burden and improving the performance. Three methods will be used in this post:

  1. Converting images to greyscale
  2. Reducing the image size
  3. Stacking frames

Converting Atari images to greyscale and reducing the image size

The first, relatively easy, step in reducing the computational training burden is to convert all the incoming Atari images from depth-3 RGB colour images to depth-1 greyscale images. This reduces the depth of the input to the first CNN layer by a factor of 3. Another step which can be performed to reduce the computational burden is to resize the image inputs to make them smaller. There is obviously a limit to how far the image sizes can be reduced before learning performance is affected; however, in this case, a halving of the image size by rescaling is possible without affecting performance too much. The original image size from the Atari Space Invaders game is (210, 160, 3) – after converting to greyscale and resizing by half, the new image size is (105, 80, 1). Both of these operations are easy enough to implement in TensorFlow 2:

def image_preprocess(image, new_size=(105, 80)):
    # convert to greyscale, resize and normalize the image
    image = tf.image.rgb_to_grayscale(image)
    image = tf.image.resize(image, new_size)
    image = image / 255
    return image
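The resulting shapes can be checked with a numpy-only approximation of the same operations – luminance weighting and naive striding stand in for the TensorFlow ops above:

```python
import numpy as np

def image_preprocess_np(image, scale=2):
    # greyscale via standard luminance weights (approximates rgb_to_grayscale)
    grey = image @ np.array([0.2989, 0.5870, 0.1140])
    # naive 2x downsample by striding (approximates tf.image.resize)
    small = grey[::scale, ::scale]
    # normalise to [0, 1] and restore the channel dimension
    return (small / 255.0)[..., np.newaxis]

frame = np.zeros((210, 160, 3))          # raw Atari frame shape
processed = image_preprocess_np(frame)   # shape (105, 80, 1)
```

The raw (210, 160, 3) frame becomes a (105, 80, 1) greyscale image, as described above.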

Stacking image frames

The next step that is commonly performed when training agents on Atari games is the practice of stacking image frames, and feeding all these frames into the input CNN layers. The purpose of this is to allow the neural network to get some sense of the direction of the objects moving within the image. Consider a single, static image – examining such an image on its own gives no information about which direction any of the objects within it are travelling (or their respective speeds). Therefore, for each sample fed into the neural network, a stack of frames is presented to the input – this gives the neural network both temporal and spatial information to work with. The input dimensions to the network are not, then, (105, 80, 1) but rather (105, 80, NUM_FRAMES). In this case, we’ll use 3 frames to feed into the network, i.e. NUM_FRAMES = 3. The specifics of how these stacked frames are stored, extracted and updated will be revealed as we step through the code in the next section. Additional steps can be taken to improve performance in complex Atari environments and similar cases. These include the skipping of frames and prioritised experience replay (PER). However, these have not been implemented in this example. A future post will discuss the benefits of PER and how to implement it.

Atari Space Invaders TensorFlow 2 implementation

The section below details the TensorFlow 2 implementation of training an agent on the Atari Space Invaders environment. In this post, comprehensive details of the Dueling Q architecture and training implementation will not be given – for a step by step discussion on these details, see my Dueling Q introductory post. However, detailed information will be given about the specific new steps required to train in the Atari environment. As stated at the beginning of the post, all code can be found on this site’s Github repository.

Model definition

First we define the Double/Dueling Q model class with its structure:

env = gym.make("SpaceInvaders-v0")
num_actions = env.action_space.n


class DQModel(keras.Model):
    def __init__(self, hidden_size: int, num_actions: int, dueling: bool):
        super(DQModel, self).__init__()
        self.dueling = dueling
        self.conv1 = keras.layers.Conv2D(16, (8, 8), (4, 4), activation='relu')
        self.conv2 = keras.layers.Conv2D(32, (4, 4), (2, 2), activation='relu')
        self.flatten = keras.layers.Flatten()
        self.adv_dense = keras.layers.Dense(hidden_size, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.adv_out = keras.layers.Dense(num_actions,
                                          kernel_initializer=keras.initializers.he_normal())
        if dueling:
            self.v_dense = keras.layers.Dense(hidden_size, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
            self.v_out = keras.layers.Dense(1, kernel_initializer=keras.initializers.he_normal())
            self.lambda_layer = keras.layers.Lambda(lambda x: x - tf.reduce_mean(x))
            self.combine = keras.layers.Add()

    def call(self, input):
        x = self.conv1(input)
        x = self.conv2(x)
        x = self.flatten(x)
        adv = self.adv_dense(x)
        adv = self.adv_out(adv)
        if self.dueling:
            v = self.v_dense(x)
            v = self.v_out(v)
            norm_adv = self.lambda_layer(adv)
            combined = self.combine([v, norm_adv])
            return combined
        return adv

primary_network = DQModel(256, num_actions, True)
target_network = DQModel(256, num_actions, True)
# make target_network = primary_network
for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
    t.assign(e)

primary_network.compile(optimizer=keras.optimizers.Adam(), loss=tf.keras.losses.Huber())

In the code above, first the Space Invaders environment is created. After this, the DQModel class is defined as a keras.Model base class. In this model, you can observe that first a number of convolutional layers are created, then a flatten layer and dedicated fully connected layers to enact the value and advantage streams. This structure is then implemented in the model call function. After this model class has been defined, two versions of it are implemented corresponding to the primary_network and the target_network – as discussed above, both of these will be utilised in the Double Q component of the learning. The target_network weights are then set to be initially equal to the primary_network weights. Finally the primary_network is compiled for training using an Adam optimizer and a Huber loss function. As stated previously, for more details see this post.

The Memory class

Next we will look at the Memory class, which is to hold all the previous experiences of the agent. This class is a little more complicated in the Atari environment case, due to the necessity of dealing with stacked frames:

class Memory:
    def __init__(self, max_memory):
        self._max_memory = max_memory
        self._actions = np.zeros(max_memory, dtype=np.int32)
        self._rewards = np.zeros(max_memory, dtype=np.float32)
        self._frames = np.zeros((POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1], max_memory), dtype=np.float32)
        self._terminal = np.zeros(max_memory, dtype=bool)
        self._i = 0

In the class __init__ function, it can be observed that all the various memory buffers (for actions, rewards etc.) are initialized according to max_memory at the get-go. This is in opposition to a memory approach which involves appending to lists. This is performed so that it can be determined whether there will be a memory problem during training from the very beginning (as opposed to the code falling over after you’ve already been running it for 3 days!). It also increases the efficiency of the memory allocation process (as appending / growing memory dynamically is an inefficient process). You’ll also observe the creation of a counter variable, self._i. This is to record the present location of stored samples in the memory buffer, and will ensure that the memory is not overflowed. The next function within the class shows how samples are stored within the class:

def add_sample(self, frame, action, reward, terminal):
    self._actions[self._i] = action
    self._rewards[self._i] = reward
    self._frames[:, :, self._i] = frame[:, :, 0]
    self._terminal[self._i] = terminal
    if self._i % (self._max_memory - 1) == 0 and self._i != 0:
        self._i = BATCH_SIZE + NUM_FRAMES + 1
    else:
        self._i += 1

As will be shown shortly, for every step in the Atari environment, the current image frame, the action taken, the reward received and whether the state is terminal (i.e. the agent ran out of lives and the game ends) is stored in memory. Notice that nothing special as yet is being done with the stored frames – they are simply stored in order as the game progresses. The frame stacking process occurs during the sample extraction method, to be covered next. One thing to notice is that once self._i reaches max_memory, the index is reset back to the beginning of the memory buffer (but offset by the batch size and the number of frames). This reset means that, once the memory buffer reaches its maximum size, it will begin to overwrite the older samples. The next method in the class governs how random sampling from the memory buffer occurs:

def sample(self):
    if self._i < BATCH_SIZE + NUM_FRAMES + 1:
        raise ValueError("Not enough memory to extract a batch")
    else:
        rand_idxs = np.random.randint(NUM_FRAMES + 1, self._i, size=BATCH_SIZE)
        states = np.zeros((BATCH_SIZE, POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES),
                         dtype=np.float32)
        next_states = np.zeros((BATCH_SIZE, POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES),
                         dtype=np.float32)
        for i, idx in enumerate(rand_idxs):
            states[i] = self._frames[:, :, idx - 1 - NUM_FRAMES:idx - 1]
            next_states[i] = self._frames[:, :, idx - NUM_FRAMES:idx]
        return states, self._actions[rand_idxs], self._rewards[rand_idxs], next_states, self._terminal[rand_idxs]

First, a simple check is performed to ensure there are enough samples in the memory to actually extract a batch. If so, a set of random indices rand_idxs is selected. These random integers are selected from a range with a lower bound of NUM_FRAMES + 1 and an upper bound of self._i. In other words, it is possible to select any indices from the start of the memory buffer to the current filled location of the buffer – however, because NUM_FRAMES of images prior to the selected indices is extracted, indices less than NUM_FRAMES are not allowed. The number of random indices selected is equal to the batch size.

Next, some numpy arrays are initialised which will hold the current states and the next states – in this example, these are of size (32, 105, 80, 3) where 3 is the number of frames to be stacked (NUM_FRAMES). A loop is then entered into for each of the randomly selected memory indices. As can be observed, the states batch row is populated by the stored frames ranging from idx – 1 – NUM_FRAMES to idx – 1. In other words, it is the 3 frames including and prior to the randomly selected index idx – 1. Alternatively, the batch row for next_states is the 3 frames including and prior to the randomly selected index idx (think of a window of 3 frames shifted along by 1 position). These variables states and next_states are then returned from this function, along with the corresponding actions, rewards and terminal flags. The terminal flags communicate whether the game finished during the randomly selected states. Finally, the memory class is instantiated with the memory size as the argument:

memory = Memory(200000)

The memory size should ideally be as large as possible, but considerations must be given to the amount of memory available on whatever computing platform is being used to run the training.
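The one-frame offset between the extracted states and next_states windows can be verified with a toy buffer – a standalone sketch where frame i is filled entirely with the value i:

```python
import numpy as np

NUM_FRAMES = 3
# toy frame buffer of ten 4x4 "frames" stacked along the last axis
frames = np.stack([np.full((4, 4), i, dtype=np.float32) for i in range(10)],
                  axis=-1)

idx = 5
# same slicing as the sample() method above
state = frames[:, :, idx - 1 - NUM_FRAMES:idx - 1]  # frames 1, 2, 3
next_state = frames[:, :, idx - NUM_FRAMES:idx]     # frames 2, 3, 4
```

The next_state window is the state window shifted along by exactly one frame, as described above.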

Miscellaneous functions

The following two functions are standard functions to choose the actions and update the target network:

def choose_action(state, primary_network, eps, step):
    if step < DELAY_TRAINING:
        return random.randint(0, num_actions - 1)
    else:
        if random.random() < eps:
            return random.randint(0, num_actions - 1)
        else:
            return np.argmax(primary_network(tf.reshape(state, (1, POST_PROCESS_IMAGE_SIZE[0],
                                                           POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES)).numpy()))


def update_network(primary_network, target_network):
    # update target network parameters slowly from primary network
    for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
        t.assign(t * (1 - TAU) + e * TAU)

The choose_action function performs the epsilon-greedy action selection policy, where a random action is selected if a random value falls below eps; otherwise the action with the highest Q value from the network is chosen. The update_network function slowly shifts the target network weights towards the primary network weights, in accordance with the Double Q learning methodology. The next function deals with the “state stack”, which is an array that holds the last NUM_FRAMES frames of the episode:

def process_state_stack(state_stack, state):
    for i in range(1, state_stack.shape[-1]):
        state_stack[:, :, i - 1].assign(state_stack[:, :, i])
    state_stack[:, :, -1].assign(state[:, :, 0])
    return state_stack

This function takes the existing state stack array and the newest state to be added. It then shuffles all the existing frames within the state stack “back” one position. In other words, the most recent state, in this case sitting in position 2 of the state stack, is shuffled back to position 1. The frame / state in position 1 is shuffled to position 0. Finally, the newest state or frame is stored in the newly vacated position 2 of the state stack. The state stack is required so that it can be fed into the neural network in order to choose actions, and its updating can be observed in the main training loop, as will be reviewed shortly.
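A numpy-only sketch of this shuffling – illustrative shapes, whereas the real function above operates on a tf.Variable:

```python
import numpy as np

NUM_FRAMES = 3
# toy stack: slot i along the last axis holds a frame filled with value i
state_stack = np.stack([np.full((2, 2), i, dtype=np.float32)
                        for i in range(NUM_FRAMES)], axis=-1)
new_frame = np.full((2, 2), 3, dtype=np.float32)

# shuffle existing frames back one position, then store the newest frame
state_stack[:, :, :-1] = state_stack[:, :, 1:]
state_stack[:, :, -1] = new_frame
```

The oldest frame (value 0) is discarded and the stack now holds frames 1, 2 and 3, newest last.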

The Dueling Q / Double Q training function

Next up is the training function:

def train(primary_network, memory, target_network=None):
    states, actions, rewards, next_states, terminal = memory.sample()
    # predict Q(s,a) given the batch of states
    prim_qt = primary_network(states)
    # predict Q(s',a') from the evaluation network
    prim_qtp1 = primary_network(next_states)
    # copy the prim_qt tensor into the target_q tensor - we then will update one index corresponding to the max action
    target_q = prim_qt.numpy()
    updates = rewards
    valid_idxs = terminal != True
    batch_idxs = np.arange(BATCH_SIZE)
    if target_network is None:
        updates[valid_idxs] += GAMMA * np.amax(prim_qtp1.numpy()[valid_idxs, :], axis=1)
    else:
        prim_action_tp1 = np.argmax(prim_qtp1.numpy(), axis=1)
        q_from_target = target_network(next_states)
        updates[valid_idxs] += GAMMA * q_from_target.numpy()[batch_idxs[valid_idxs], prim_action_tp1[valid_idxs]]
    target_q[batch_idxs, actions] = updates
    loss = primary_network.train_on_batch(states, target_q)
    return loss

This train function is very similar to the train function reviewed in my first Dueling Q tutorial. Essentially, it first extracts batches of data from the memory buffer. Next the Q values from the current state (states) and the following states (next_states) are extracted from the primary network – these values are returned in prim_qt and prim_qtp1 respectively (where qtp1 refers to the Q values for the time t + 1). Next, the target Q values are initialized from the prim_qt values. After this, the updates variable is created – this holds the target Q values for the actions. These target values will be the Q values which the network will “step towards” during the optimization step – hence the name “target” Q values. 

The variable valid_idxs specifies those indices which don’t include terminal states – obviously for terminal states (states where the game ended), there are no future rewards to discount, so the target value for these states is just the reward. For other states, which do have future rewards, these need to be discounted and added to the current reward for the target Q values. If no target_network is provided, vanilla Q learning is used to provide the discounted target Q values. Otherwise, Double Q learning is implemented.

Under double Q learning, the a* actions are selected first – these are the actions with the highest Q values in the next state (t + 1) according to the primary network, obtained using the numpy argmax function. Next, the Q values for the next state (t + 1) are extracted from the target network. The updates value is then incremented for valid indices by adding the discounted future Q values from the target network, evaluated at the actions a* selected from the primary network. Finally, the network is trained using the Keras train_on_batch function.
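The double Q target computed in the else branch of the train function can be written as:

$$y_t = r_t + \gamma \, Q_{target}\left(s_{t+1}, \underset{a}{\operatorname{argmax}} \, Q_{primary}(s_{t+1}, a)\right)$$

Selecting a* with the primary network while evaluating it with the target network decouples action selection from action evaluation, which reduces the over-estimation bias present in vanilla Q learning.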

The main Atari training loop

Now it is time to review the main training loop:

num_episodes = 1000000
eps = MAX_EPSILON
render = False
train_writer = tf.summary.create_file_writer(STORE_PATH + f"/DuelingQSI_{dt.datetime.now().strftime('%d%m%Y%H%M')}")
double_q = True
steps = 0
for i in range(num_episodes):
    state = env.reset()
    state = image_preprocess(state)
    state_stack = tf.Variable(np.repeat(state.numpy(), NUM_FRAMES).reshape((POST_PROCESS_IMAGE_SIZE[0],
                                                                            POST_PROCESS_IMAGE_SIZE[1],
                                                                            NUM_FRAMES)))
    cnt = 1
    avg_loss = 0
    tot_reward = 0
    if i % GIF_RECORDING_FREQ == 0:
        frame_list = []
    while True:
        if render:
            env.render()
        action = choose_action(state_stack, primary_network, eps, steps)
        next_state, reward, done, info = env.step(action)
        tot_reward += reward
        if i % GIF_RECORDING_FREQ == 0:
            frame_list.append(tf.cast(tf.image.resize(next_state, (480, 320)), tf.uint8).numpy())
        next_state = image_preprocess(next_state)
        state_stack = process_state_stack(state_stack, next_state)
        # store in memory
        memory.add_sample(next_state, action, reward, done)

        if steps > DELAY_TRAINING:
            loss = train(primary_network, memory, target_network if double_q else None)
            update_network(primary_network, target_network)
        else:
            loss = -1
        avg_loss += loss

        # linearly decay the eps value
        if steps > DELAY_TRAINING:
            eps = MAX_EPSILON - ((steps - DELAY_TRAINING) / EPSILON_MIN_ITER) * \
                  (MAX_EPSILON - MIN_EPSILON) if steps < EPSILON_MIN_ITER else \
                  MIN_EPSILON
        steps += 1

        if done:
            if steps > DELAY_TRAINING:
                avg_loss /= cnt
                print(f"Episode: {i}, Reward: {tot_reward}, avg loss: {avg_loss:.5f}, eps: {eps:.3f}")
                with train_writer.as_default():
                    tf.summary.scalar('reward', tot_reward, step=i)
                    tf.summary.scalar('avg loss', avg_loss, step=i)
            else:
                print(f"Pre-training...Episode: {i}")
            if i % GIF_RECORDING_FREQ == 0:
                record_gif(frame_list, i)
            break

        cnt += 1

This training loop is very similar to the training loop in my Dueling Q tutorial, so for a detailed review, please see that post. The main differences relate to how the frame stacking is handled. First, you’ll notice at the start of the loop that the environment is reset and the first state / image is extracted. This state or image is pre-processed, then repeated NUM_FRAMES times and reshaped to create the first frame stack, of size (105, 80, 3) in this example. Another point to note is that a gif recording function has been created which is called every GIF_RECORDING_FREQ episodes. This function simply outputs every frame to a gif so that the training progress can be monitored by observing actual gameplay. As such, there is a frame list which is filled whenever each GIF_RECORDING_FREQ episode comes around, and this frame list is passed to the gif recording function. Check out the code for this tutorial for more details. Finally, it can be observed that after every step, the state stack is updated by shifting each recorded frame / state along the stack – the oldest frame is dropped and the newest frame is appended.
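The frame-shifting step described above can be sketched with NumPy as follows. Note that this is a minimal illustrative version – the actual process_state_stack function in the tutorial code operates on a tf.Variable rather than a plain array:

```python
import numpy as np

NUM_FRAMES = 3  # number of frames stacked along the last axis

def process_state_stack(state_stack, next_state):
    # shift every frame one slot toward the front of the stack
    # (discarding the oldest frame), then write the newest frame last
    new_stack = np.roll(state_stack, shift=-1, axis=-1)
    new_stack[..., -1] = next_state
    return new_stack

# usage: the first stack repeats the initial frame NUM_FRAMES times
frame0 = np.zeros((105, 80))
stack = np.repeat(frame0[..., np.newaxis], NUM_FRAMES, axis=-1)
stack = process_state_stack(stack, np.ones((105, 80)))
```

Because np.roll returns a copy, the incoming stack is not mutated in place; after the call, the last slice of the stack holds the newest frame and the earlier slices hold progressively older frames.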

Space Invader Atari training results

The image below shows how the training progresses through each episode with respect to the total reward received for each episode:    

Atari Space Invaders - Dueling Q training reward

As can be observed from the plot above, the reward steadily increases over 1500 episodes of game play. Note – if you wish to replicate this training on your own, you will need GPU processing support in order to reduce the training timeframes to a reasonable level. In this case, I utilised the Google Cloud Compute Engine and a single GPU. The gifs below show the progress of the agent in gameplay between episode 50 and episode 1450:

Atari Space Invaders - gameplay episode 50


Atari Space Invaders - gameplay episode 1450

As can be observed, after 50 episodes the agent still moves around randomly and is quickly killed, achieving a score of only 60 points. However, after 1450 episodes, the agent can be seen to be playing the game much more effectively, even having learnt to destroy the occasional purple “master ship” flying overhead to gain extra points.

This post has demonstrated how to effectively train agents to operate in Atari environments such as Space Invaders. In particular it has demonstrated how to use the Dueling Q reinforcement learning algorithm to train the agent. A future post will demonstrate how to make the training even more efficient using the Prioritised Experience Replay (PER) approach. 



The post Atari Space Invaders and Dueling Q RL in TensorFlow 2 appeared first on Adventures in Machine Learning.