
In a previous post on Q-learning, I demonstrated how to train an agent to play tic-tac-toe. One of the challenges with Q-learning is that it doesn't scale well to environments with a large state space. Even after experiencing 200,000 tic-tac-toe games, the agent had visited only a fraction of the total state space. When the trained agent then played against an opponent and encountered a board it hadn't seen during training, the Q-values for all available moves were 0, so the agent played randomly. This becomes even more problematic if we try to apply Q-learning to a game like chess or Go. We need a method that doesn't rely on a training agent visiting every state. Neural networks have been very successful in solving this problem. DeepMind, for example, has not only mastered 49 different Atari games, but has also defeated the reigning Go champion. I'm not that ambitious. I'll just stick with tic-tac-toe. The most basic application of neural networks to this problem is a deep Q-network, which I'll demonstrate here.

Tic-Tac-Toe Environment

As with Q-learning, I needed to define an environment that contains the game fundamentals. I started with the same environment I set up for Q-learning and then stripped out the Q-learning algorithm so I just have the basics of the game. The env class below does just three things:

  1. It resets the game board to begin a new game. I define the board state as a list with nine elements to represent each of the nine game board positions. A 0 represents an empty position, 1 is an X, and -1 is an O.
  2. In the step function, it makes a move for either player X or O.
  3. In the game_over function, it checks the state of the board and returns a reward. The reward system is the same as before. An X win gets a reward of 1, an O win gets a reward of -1, and everything else gets a reward of 0.

import tensorflow as tf
import numpy as np
import random

class env:

    def __init__(self):
        self.state = self.reset()

    def reset(self):
        return [0 for i in range(9)]

    def game_over(self, s):
        done = False
        reward = 0
        if (s[0] + s[1] + s[2]  == 3 or s[3] + s[4] + s[5]  == 3 or s[6] + s[7] + s[8]  == 3 or
            s[0] + s[3] + s[6]  == 3 or s[1] + s[4] + s[7]  == 3 or s[2] + s[5] + s[8]  == 3 or
            s[0] + s[4] + s[8]  == 3 or s[2] + s[4] + s[6]  == 3):
            done = True
            reward = 1
        if (s[0] + s[1] + s[2]  == -3 or s[3] + s[4] + s[5]  == -3 or s[6] + s[7] + s[8]  == -3 or
            s[0] + s[3] + s[6]  == -3 or s[1] + s[4] + s[7]  == -3 or s[2] + s[5] + s[8]  == -3 or
            s[0] + s[4] + s[8]  == -3 or s[2] + s[4] + s[6]  == -3):
            done = True
            reward = -1
        if sum(1 for i in s if i != 0)==9 and not done:
            done = True
        return done, reward

    def step(self, state, action, player):
        next_state = state.copy()
        if player == 0: next_state[action] = 1
        else: next_state[action] = -1
        done, reward = self.game_over(next_state)
        return next_state, done, reward

Deep Q-Learning

Next, I create a class DQNagent where the real work is done. I’ll go through the three main functions in detail.

build_model Function. This is where I create the neural network itself. It's a simple sequential model with two hidden layers, and I built some flexibility into it: the input shape, the number of nodes, and the output size are all variables, because I intend to apply this basic setup to other problems. In the tic-tac-toe problem, the neural network takes the board state as input, so in this case the input_shape is nine. The two hidden layers have 81 nodes each with the relu activation function. The output also has length nine: the Q-values for each of the nine actions (i.e., moves on the game board). In other words, the network takes the game board as input and outputs a Q-value for each possible move, and the best move is the one with the highest Q-value.
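
As a quick sanity check on the shapes (a minimal sketch, assuming the DQNagent class defined below), the untrained network maps a length-nine board vector to nine Q-values:

agent = DQNagent(state_size=9, action_size=9, iterations=1000)   # hypothetical instance, for illustration only
q_values = agent.model.predict(np.zeros(9)[np.newaxis])          # an empty board as input
print(q_values.shape)                                            # (1, 9): one Q-value per board position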

play_ttt Function. It might make more sense to explain this function next. Basically, there’s an inner loop to play one game of tic-tac-toe, and an outer loop to play the game multiple times. Most of the function just controls the mechanics of the game, but there are two aspects I want to point out.

  1. The function uses an epsilon-greedy policy to decide whether player X's move comes from the neural network's Q-values or is chosen at random. The probability of playing randomly is high early in training to encourage the algorithm to explore the state space; as training progresses, player X's moves are more and more likely to come from the neural network (see the short sketch after this list). Note that the neural network learns the game from player X's perspective. Player O always plays randomly.
  2. During the course of playing one game, the game board states, player moves (actions), and rewards are recorded. These are passed to the train_model function at the end of a game.
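
Here is a minimal sketch of that epsilon schedule, using the same formula as in play_ttt below with 1,000 training games (the printed values are just for illustration):

iterations = 1000
for iteration in (0, 200, 400, 600, 800):
    epsilon = max(1 - iteration / (iterations * 0.8), 0.01)
    print(iteration, round(epsilon, 2))
# epsilon falls linearly from 1.0 toward the 0.01 floor, which is reached around iteration 800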

train_model Function. This is where the neural network gets trained and things get complicated. I’m going to let Aurélien Géron do the ‘splainin from his book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow.

Consider the approximate Q-Value computed by the DQN for a given state-action pair (s, a). Thanks to Bellman, we know we want this approximate Q-Value to be as close as possible to the reward r that we actually observe after playing action a in state s, plus the discounted value of playing optimally from then on. To estimate this sum of future discounted rewards, we can simply execute the DQN on the next state s′ and for all possible actions a′. We get an approximate future Q-Value for each possible action. We then pick the highest (since we assume we will be playing optimally) and discount it, and this gives us an estimate of the sum of future discounted rewards. By summing the reward r and the future discounted value estimate, we get a target Q-Value y(s, a) for the state-action pair (s, a), as shown in the equation:
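
y(s, a) = r + γ · max_a′ Q(s′, a′)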

With this target Q-Value, we can run a training step using any Gradient Descent algorithm. Specifically, we generally try to minimize the squared error between the estimated Q-Value Q(s, a) and the target Q-Value (or the Huber loss to reduce the algorithm’s sensitivity to large errors). And that’s all for the basic Deep Q-Learning algorithm!

Right. So that's what's happening in the train_model function.
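
To make the target computation concrete before reading the code, here is a toy sketch with made-up numbers (it mirrors the arithmetic in train_model below; it is not output from the actual model):

import numpy as np

gamma = 0.95
rewards = np.array([0.0, 0.0, 1.0])            # made-up rewards: X wins on its third move
dones = np.array([False, False, True])         # only the last step is terminal
max_next_Q_values = np.array([0.8, 0.6, 0.3])  # made-up max Q-values for the next states
targets = rewards + (1 - 1*dones) * gamma * max_next_Q_values
print(targets)                                 # [0.76 0.57 1.  ] - the terminal step keeps only the raw reward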

class DQNagent:

    def __init__(self, state_size, action_size, iterations):
        self.alpha0 = 0.05           # learning rate
        self.decay = 0.005           # learning rate decay
        self.gamma = 0.95            # discount factor
        self.state_size = state_size
        self.action_size = action_size
        self.iterations = iterations
        self.model = self.build_model()
        self.optimizer = tf.keras.optimizers.SGD(learning_rate=0.02)
        self.loss_fn = tf.keras.losses.mean_squared_error

    def build_model(self):
        model = tf.keras.models.Sequential([
            tf.keras.layers.Dense(self.state_size**2, activation="relu", input_shape=[self.state_size]),
            tf.keras.layers.Dense(self.state_size**2, activation="relu"),
            tf.keras.layers.Dense(self.action_size)
        ])
        return model

    def train_model(self, state_history, action_history, next_state_history, rewards, dones):
        # Q-values of the next states, used to build the training targets
        next_Q_values = self.model.predict(np.array(next_state_history))
        max_next_Q_values = np.max(next_Q_values, axis=1)
        # target = r + gamma * max_a' Q(s', a'), with the future term zeroed out on terminal steps
        target_Q_values = rewards + (1 - 1*np.array(dones)) * self.gamma * max_next_Q_values
        target_Q_values = tf.reshape(target_Q_values, [len(rewards), 1])
        # one-hot mask picks out the Q-value of the action actually taken in each state
        mask = tf.one_hot(action_history, self.action_size)
        with tf.GradientTape() as tape:
            all_Q_values = self.model(np.array(state_history))
            Q_values = tf.reduce_sum(all_Q_values * mask, axis=1, keepdims=True)
            loss = tf.reduce_mean(self.loss_fn(target_Q_values, Q_values))
        grads = tape.gradient(loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))

    def play_ttt(self):
        for iteration in range(self.iterations):    # outer loop to play the game a bunch of times
            state = env().reset()
            next_state = state.copy()
            done = False
            dones = []
            state_history = []
            state_history.append(state)
            action_history = []
            rewards = []
            epsilon = max(1 - iteration/(self.iterations*0.8), 0.01)
            while not done:                          # inner loop to play one game
                if random.random() < epsilon:        # epsilon-greedy policy: explore with a random legal move
                    action = random.choice([i for i in range(len(state)) if state[i] == 0])
                else:                                # exploit: highest-Q move among the empty positions
                    q_values = self.model.predict(np.array(state)[np.newaxis])[0]
                    q_values[np.where(np.array(state) != 0)[0]] = -np.inf   # mask occupied positions
                    action = np.argmax(q_values)
                action_history.append(action)
                next_state, done, reward = env().step(state, action, 0)
                if done:
                    state_history.append(next_state)
                    dones.append(done)
                    rewards.append(reward)
                if not done:
                    # player O replies with a random legal move
                    omove = random.choice([i for i in range(len(next_state)) if next_state[i] == 0])
                    next_state, done, reward = env().step(next_state, omove, 1)
                    state = next_state.copy()
                    state_history.append(next_state)
                    dones.append(done)
                    rewards.append(reward)
            next_state_history = state_history[1:]                   # board after each X move (and O's reply, if any)
            state_history = state_history[0:len(action_history)]     # board before each X move
            self.train_model(state_history, action_history, next_state_history, rewards, dones)
        return self.model

Training

Here I just provide the DQNagent with the number of input and output nodes for the neural network (9 each) and how many games to play (1000). Compare the number of games here with the 200,000 games I used for the Q-learning method - a huge difference!

m = DQNagent(9,9,1000).play_ttt()

Results

I want to see real quick if the neural net has learned to play in the center of the board on the opening move, so I have the model give me the Q-values for an empty board. The fifth number returned represents the center of the board, and it is the highest Q-value, so the neural net has already learned it!

m.predict(np.zeros(9)[np.newaxis])
array([[0.6415714 , 0.54576135, 0.69200796, 0.58051527, 0.81977105,
        0.5372652 , 0.4971155 , 0.38079423, 0.53279865]], dtype=float32)

Next, the play_v_random function pits the AI-enabled player against an opponent that plays randomly.

def play_v_random(games):
    results = [0 for i in range(games)]
    for i in range(games):
        board = env().reset()
        done = False
        while not done:
            #print(board[0:3])
            #print(board[3:6])
            #print(board[6:9])
            xmoves = m.predict(np.array(board)[np.newaxis])[0]
            xmoves[np.where(np.array(board) != 0)[0]] = -np.inf   # mask occupied positions so only legal moves are considered
            xmove = np.argmax(xmoves)
            #print("move", xmove)
            board[xmove] = 1
            done, reward = env().game_over(board)
            if not done:
                omove = random.choice([i for i in range(len(board)) if board[i] == 0])
                board[omove] = -1
                done, reward = env().game_over(board)
        #print(board[0:3])
        #print(board[3:6])
        #print(board[6:9])
        results[i] = reward
    return results

So here we have it. The AI-enabled player won 90.0% of the games, lost 7.2%, and tied 2.8%. For comparison, the best performance using Q-learning was 81.2% AI wins, 2.3% losses, and 16.5% ties. The deep Q-network won a noticeably higher share of games after being trained on just 1,000 games than Q-learning did after being trained on 200,000 games. Remarkable!

results = play_v_random(1000)
sum(1 for i in results if i == 1), sum(1 for i in results if i == -1), sum(1 for i in results if i == 0)
(900, 72, 28)

As I mentioned at the beginning of the post, a deep Q-network is the most basic application of neural networks to reinforcement learning. More advanced variants include double deep Q-networks and dueling deep Q-networks, and there are further techniques such as actor-critic algorithms, curiosity-based exploration, and proximal policy optimization. There's a lot left to explore.
