PyTorch DQN Solves LunarLander-v2
Why PyTorch?
A couple of weeks ago, I attempted to install the GPU version of TensorFlow and failed miserably. I should have set up a new virtual environment for it, but I threw caution to the wind and installed it in my base environment. Bad idea. As a result, when I import the TensorFlow module, I get pages and pages of error messages. I spent an entire morning Googling for solutions and eventually gave up. So, I decided to see if I would have any better luck with PyTorch. Within minutes, I was up and running on my GPU - nice! What follows is me working out how to implement a deep Q-network (DQN) in PyTorch. I’m not going to rehash how DQNs work, since I already covered that ground in a previous post. I also found a better way to render the environments in a Jupyter notebook using HTML from IPython.display.
import gym
from gym import wrappers
import random
import torch
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
import io
import base64
from IPython.display import HTML
%matplotlib inline
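Before going any further, a quick check that PyTorch can actually see the GPU (just the standard CUDA availability calls):
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))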
Lunar Lander Environment
The state of the Lunar Lander environment has eight values that represent the lander’s x and y position, its x and y velocity, its angle and angular velocity, and whether each leg is in contact with the ground. With DQN it doesn’t really matter what they all are; I just need to know how many there are. There are also four discrete actions: do nothing, fire the left rocket, fire the right rocket, and fire the main rocket. The reward system is more complicated, too. A small negative reward of -0.3 is given each frame the main rocket is fired, and a large negative reward (-100) if the lander crashes. If the lander comes to rest on the ground, a positive reward of 100 is given, and a reward between 100 and 140 is given depending on how close to the center of the pad (marked by two flags) it lands.
env = gym.make('LunarLander-v2')
env.seed(0)
print('State shape: ', env.observation_space.shape)
print('Number of actions: ', env.action_space.n)
State shape: (8,)
Number of actions: 4
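To get a feel for those eight numbers, resetting the environment and taking one random action gives something like this:
state = env.reset()
print(state)                 # eight floats: position, velocities, angle, angular velocity, leg contacts
state, reward, done, info = env.step(env.action_space.sample())   # one random action
print(reward, done)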
The PyTorch Model
I set up a neural net with three hidden layers of 128 nodes each, 60% dropout between layers, and the ReLU activation function. All of this is in a separate file, model.py, which I import in another script, dqn_agent.py. The agent script sets up the replay buffer, the epsilon-greedy policy, the training hyperparameters, etc., as I described in previous posts (a rough sketch of it follows the model code below).
import torch
import torch.nn as nn
import torch.nn.functional as F
class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""

    def __init__(self, state_size, action_size, seed, fc1_units=128, fc2_units=128, fc3_units=128):
        super(QNetwork, self).__init__()
        self.seed = torch.manual_seed(seed)
        # three hidden layers of 128 units, each followed by 60% dropout
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.dropout1 = nn.Dropout(p=0.6)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.dropout2 = nn.Dropout(p=0.6)
        self.fc3 = nn.Linear(fc2_units, fc3_units)
        self.dropout3 = nn.Dropout(p=0.6)
        # output layer
        self.fc4 = nn.Linear(fc3_units, action_size)

    def forward(self, state):
        # each hidden layer: linear -> dropout -> ReLU
        x = F.relu(self.dropout1(self.fc1(state)))
        x = F.relu(self.dropout2(self.fc2(x)))
        x = F.relu(self.dropout3(self.fc3(x)))
        # raw Q-values for the four actions
        return self.fc4(x)
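I won’t reproduce dqn_agent.py in full here since it follows the standard DQN recipe from my earlier posts, but the sketch below shows roughly what an agent with the same interface (Agent, act, step, qnetwork_local) looks like. Treat the hyperparameter values and the learn() internals as illustrative rather than the exact file:
# Rough sketch of a DQN agent with the same interface as dqn_agent.Agent.
# Hyperparameter values here are illustrative, not necessarily the ones used above.
import random
from collections import deque

import numpy as np
import torch
import torch.nn.functional as F
import torch.optim as optim

from model import QNetwork

BUFFER_SIZE = int(1e5)   # replay buffer size
BATCH_SIZE = 64          # minibatch size
GAMMA = 0.99             # discount factor
TAU = 1e-3               # soft-update rate for the target network
LR = 5e-4                # learning rate
UPDATE_EVERY = 4         # learn from a minibatch every few environment steps

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class Agent:
    def __init__(self, state_size, action_size, seed):
        self.action_size = action_size
        random.seed(seed)
        # local network is trained; target network slowly tracks it via soft updates
        self.qnetwork_local = QNetwork(state_size, action_size, seed).to(device)
        self.qnetwork_target = QNetwork(state_size, action_size, seed).to(device)
        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=LR)
        self.memory = deque(maxlen=BUFFER_SIZE)   # replay buffer of (s, a, r, s', done)
        self.t_step = 0

    def act(self, state, eps=0.0):
        """Epsilon-greedy action selection."""
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        self.qnetwork_local.eval()    # eval mode matters here because of the dropout layers
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        self.qnetwork_local.train()
        if random.random() > eps:
            return int(np.argmax(action_values.cpu().numpy()))
        return random.randrange(self.action_size)

    def step(self, state, action, reward, next_state, done):
        """Store a transition and learn from a random minibatch every few steps."""
        self.memory.append((state, action, reward, next_state, done))
        self.t_step = (self.t_step + 1) % UPDATE_EVERY
        if self.t_step == 0 and len(self.memory) >= BATCH_SIZE:
            self.learn(random.sample(self.memory, BATCH_SIZE))

    def learn(self, batch):
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        states = torch.from_numpy(states).float().to(device)
        actions = torch.from_numpy(actions).long().unsqueeze(1).to(device)
        rewards = torch.from_numpy(rewards).float().unsqueeze(1).to(device)
        next_states = torch.from_numpy(next_states).float().to(device)
        dones = torch.from_numpy(dones.astype(np.uint8)).float().unsqueeze(1).to(device)
        # TD target: r + gamma * max_a' Q_target(s', a') for non-terminal states
        q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
        q_targets = rewards + GAMMA * q_targets_next * (1 - dones)
        q_expected = self.qnetwork_local(states).gather(1, actions)
        loss = F.mse_loss(q_expected, q_targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        # soft-update the target network toward the local network
        for t_param, l_param in zip(self.qnetwork_target.parameters(), self.qnetwork_local.parameters()):
            t_param.data.copy_(TAU * l_param.data + (1.0 - TAU) * t_param.data)
One detail worth noting: act() flips the network into eval mode before computing action values and back afterwards, which matters more than usual here because of the dropout layers.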
from dqn_agent import Agent
agent = Agent(state_size=8, action_size=4, seed=3)
Untrained Agent
The untrained agent is terrible, of course, and immediately crashes (although disappointingly not in a ball of fire).
env = wrappers.Monitor(env, "./gym-results", force=True)
state = env.reset()
for _ in range(1000):
    action = agent.act(state)                     # greedy action from the untrained network
    state, reward, done, info = env.step(action)  # feed the new observation back into the agent
    if done: break
env.close()
video = io.open('./gym-results/openaigym.video.%s.video000000.mp4' % env.file_infix, 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''
<video width="360" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''
.format(encoded.decode('ascii')))
Train the Agent
Training proceeds as usual. The environment is considered solved when the average reward over 100 episodes exceeds 200, which was achieved after 723 episodes.
def dqn(n_episodes=2000, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
scores = [] # list containing scores from each episode
scores_window = deque(maxlen=100) # last 100 scores
eps = eps_start # initialize epsilon
for i_episode in range(1, n_episodes+1):
state = env.reset()
score = 0
for t in range(max_t):
action = agent.act(state, eps)
next_state, reward, done, _ = env.step(action)
agent.step(state, action, reward, next_state, done)
state = next_state
score += reward
if done:
break
scores_window.append(score) # save most recent score
scores.append(score) # save most recent score
eps = max(eps_end, eps_decay*eps) # decrease epsilon
print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)), end="")
if i_episode % 100 == 0:
print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
if np.mean(scores_window)>=200.0:
print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_window)))
torch.save(agent.qnetwork_local.state_dict(), 'checkpoint.pth')
break
return scores
scores = dqn()
# plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()
Episode 100 Average Score: -177.97
Episode 200 Average Score: -124.90
Episode 300 Average Score: -37.34
Episode 400 Average Score: 15.48
Episode 500 Average Score: 54.92
Episode 600 Average Score: 98.79
Episode 700 Average Score: 120.36
Episode 800 Average Score: 173.05
Episode 823 Average Score: 200.43
Environment solved in 723 episodes! Average Score: 200.43
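One note on the exploration schedule: with eps_decay=0.995 applied once per episode, epsilon is still slightly above the eps_end floor of 0.01 when training stops at episode 823. A quick check:
eps_decay, eps_end = 0.995, 0.01
print(max(eps_end, eps_decay ** 823))         # epsilon at the last training episode, about 0.016
print(np.log(eps_end) / np.log(eps_decay))    # ~919 episodes of decay needed to hit the 0.01 floor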

Watch the Trained Agent
It took a bit of hyperparameter tuning, but the trained agent looks like it does a pretty good job!
env = gym.make('LunarLander-v2')
env.seed(3)
env = wrappers.Monitor(env, "./gym-results", force=True)
# load the weights from file
agent.qnetwork_local.load_state_dict(torch.load('checkpoint.pth'))
state = env.reset()
for j in range(1000):
action = agent.act(state)
state, reward, done, _ = env.step(action)
if done: break
env.close()
video = io.open('./gym-results/openaigym.video.%s.video000000.mp4' % env.file_infix, 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''
<video width="360" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''
.format(encoded.decode('ascii')))