4 minute read

Why PyTorch?

A couple of weeks ago, I attempted to install the GPU version of TensorFlow and failed miserably. I should have set up a new virtual environment for it, but I threw caution to the wind and installed it in my base environment. Bad idea. As a result, when I import the TensorFlow module, I get pages and pages of error messages. I spent an entire morning Googling for solutions and eventually gave up. So, I decided to see if I would have any better luck with PyTorch. Within minutes, I was up and running on my GPU - nice! What follows is me just working out how to implement a deep Q-network (DQN) in PyTorch. I’m not going to re-hash how DQNs work since I already covered that ground in a previous post. I also found a better way to render the environments in a Jupyter notebook using HTML from IPython.display.

import gym
from gym import wrappers
import random
import torch
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
import io
import base64
from IPython.display import HTML
%matplotlib inline

Lunar Lander Environment

The state of a Lunar Lander environment has eight continuous values that represent the lander’s x and y position, its velocity, orientation, angular speed, and so on. With DQN, it doesn’t really matter what they all are; I just need to know how many there are. There are also four discrete actions: do nothing, fire the left rocket, fire the right rocket, and fire the main rocket. The reward system is more complicated, too. A small negative reward (-0.3) is given for each frame the main rocket is fired, and a large negative reward (-100) if the lander crashes. If the lander comes to rest on the ground, a positive reward of 100 is given, and a reward between 100 and 140 is given depending on how close to the center of the pad (marked by two flags) it lands.

env = gym.make('LunarLander-v2')
env.seed(0)
print('State shape: ', env.observation_space.shape)
print('Number of actions: ', env.action_space.n)
State shape:  (8,)
Number of actions:  4
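
Just to get a feel for what the agent will see, here is a quick peek at one observation and the reward from a single random step. This isn’t part of the DQN setup, just a sanity check:

state = env.reset()
print('Example state: ', state)                  # eight floats describing the lander
action = env.action_space.sample()               # random action in {0, 1, 2, 3}
next_state, reward, done, info = env.step(action)
print('Reward for one random step: ', reward)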

The PyTorch Model

I set up a neural net with three hidden layers of 128 nodes each, with 60% dropout after each hidden layer. The net uses the ReLU activation function. All of this is in a separate file, model.py, which I import in another script, dqn_agent.py. The agent script sets up the replay buffer, epsilon-greedy policy, the training hyperparameters, etc., as I described in previous posts.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    def __init__(self, state_size, action_size, seed, fc1_units=128, fc2_units=128, fc3_units=128):
        super(QNetwork, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.dropout1 = nn.Dropout(p=0.6)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.dropout2 = nn.Dropout(p=0.6)
        self.fc3 = nn.Linear(fc2_units, fc3_units)
        self.dropout3 = nn.Dropout(p=0.6)
        self.fc4 = nn.Linear(fc3_units, action_size)

    def forward(self, state):
        x = self.fc1(state)
        x = self.dropout1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = self.dropout2(x)
        x = F.relu(x)
        x = self.fc3(x)
        x = self.dropout3(x)
        x = F.relu(x)
        return self.fc4(x)
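
As a quick sanity check (not part of the original notebook flow), I can push a dummy batch of states through the network and confirm it returns one Q-value per action:

net = QNetwork(state_size=8, action_size=4, seed=0)
dummy_states = torch.rand(5, 8)     # a fake batch of five states
print(net(dummy_states).shape)      # torch.Size([5, 4]) -- one Q-value per action
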
from dqn_agent import Agent
agent = Agent(state_size=8, action_size=4, seed=3)
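
I’m not reproducing dqn_agent.py here, but for context, this is a rough sketch of what an agent like that usually contains: a uniform replay buffer, an epsilon-greedy act() method, and a soft-updated target network. The hyperparameter values below are common defaults for this kind of setup, not necessarily what I ended up with after tuning:

import random
from collections import deque

import numpy as np
import torch
import torch.nn.functional as F
import torch.optim as optim

from model import QNetwork

BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 64         # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # soft-update rate for the target network
LR = 5e-4               # learning rate
UPDATE_EVERY = 4        # how often (in environment steps) to run a learning step

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class Agent:
    """DQN agent: epsilon-greedy policy, uniform replay buffer, local/target Q-networks."""

    def __init__(self, state_size, action_size, seed):
        random.seed(seed)
        self.action_size = action_size
        self.qnetwork_local = QNetwork(state_size, action_size, seed).to(device)
        self.qnetwork_target = QNetwork(state_size, action_size, seed).to(device)
        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=LR)
        self.memory = deque(maxlen=BUFFER_SIZE)  # (state, action, reward, next_state, done)
        self.t_step = 0

    def act(self, state, eps=0.0):
        """Epsilon-greedy action selection using the local Q-network."""
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        self.qnetwork_local.eval()   # turn dropout off while choosing an action
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        self.qnetwork_local.train()
        if random.random() > eps:
            return int(action_values.argmax(dim=1).item())
        return random.randrange(self.action_size)

    def step(self, state, action, reward, next_state, done):
        """Store a transition and learn from a random minibatch every few steps."""
        self.memory.append((state, action, reward, next_state, done))
        self.t_step = (self.t_step + 1) % UPDATE_EVERY
        if self.t_step == 0 and len(self.memory) > BATCH_SIZE:
            self.learn(random.sample(self.memory, BATCH_SIZE))

    def learn(self, experiences):
        states, actions, rewards, next_states, dones = zip(*experiences)
        states = torch.from_numpy(np.vstack(states)).float().to(device)
        actions = torch.tensor(actions).long().unsqueeze(1).to(device)
        rewards = torch.tensor(rewards).float().unsqueeze(1).to(device)
        next_states = torch.from_numpy(np.vstack(next_states)).float().to(device)
        dones = torch.tensor(dones).float().unsqueeze(1).to(device)

        # TD target: r + gamma * max_a' Q_target(s', a'), zeroed at terminal states
        q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
        q_targets = rewards + GAMMA * q_targets_next * (1 - dones)
        # Q-value of the action actually taken, according to the local network
        q_expected = self.qnetwork_local(states).gather(1, actions)

        loss = F.mse_loss(q_expected, q_targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Nudge the target network toward the local network
        for t_p, l_p in zip(self.qnetwork_target.parameters(), self.qnetwork_local.parameters()):
            t_p.data.copy_(TAU * l_p.data + (1.0 - TAU) * t_p.data)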

Untrained Agent

The untrained agent is terrible, of course, and immediately crashes (although disappointingly not in a ball of fire).

env = wrappers.Monitor(env, "./gym-results", force=True)
state = env.reset()
for _ in range(1000):
    action = agent.act(state)
    state, reward, done, info = env.step(action)
    if done: break
env.close()

video = io.open('./gym-results/openaigym.video.%s.video000000.mp4' % env.file_infix, 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''
    <video width="360" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''
.format(encoded.decode('ascii')))

Train the Agent

Training proceeds as usual. The environment is considered solved when the average reward over 100 episodes exceeds 200, which was achieved after 723 episodes.

def dqn(n_episodes=2000, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    scores = []                        # list containing scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores
    eps = eps_start                    # initialize epsilon
    for i_episode in range(1, n_episodes+1):
        state = env.reset()
        score = 0
        for t in range(max_t):
            action = agent.act(state, eps)
            next_state, reward, done, _ = env.step(action)
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                break
        scores_window.append(score)       # save most recent score
        scores.append(score)              # save most recent score
        eps = max(eps_end, eps_decay*eps) # decrease epsilon
        print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)), end="")
        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
        if np.mean(scores_window)>=200.0:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_window)))
            torch.save(agent.qnetwork_local.state_dict(), 'checkpoint.pth')
            break
    return scores

scores = dqn()

# plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()
Episode 100	Average Score: -177.97
Episode 200	Average Score: -124.90
Episode 300	Average Score: -37.34
Episode 400	Average Score: 15.48
Episode 500	Average Score: 54.92
Episode 600	Average Score: 98.79
Episode 700	Average Score: 120.36
Episode 800	Average Score: 173.05
Episode 823	Average Score: 200.43
Environment solved in 723 episodes!	Average Score: 200.43

[Plot of score vs. episode number during training]

Watch the Trained Agent

It took a bit of hyperparameter tuning, but the trained agent looks like it does a pretty good job!

env = gym.make('LunarLander-v2')
env.seed(3)
env = wrappers.Monitor(env, "./gym-results", force=True)

# load the weights from file
agent.qnetwork_local.load_state_dict(torch.load('checkpoint.pth'))

state = env.reset()
for j in range(1000):
    action = agent.act(state)
    state, reward, done, _ = env.step(action)
    if done: break

env.close()

video = io.open('./gym-results/openaigym.video.%s.video000000.mp4' % env.file_infix, 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''
    <video width="360" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''
.format(encoded.decode('ascii')))
