Tzeny's demesne - Engineering and travelling

Posted by : Tzeny at Apr 25, 2019

Last week a couple of friends and I had an idea: let’s use PySC2 (a Python wrapper for the StarCraft 2 API) to build a reinforcement learning agent that can teach itself how to play StarCraft 2.

This is no easy task, and many people have attempted it.

Still, we thought it would be an interesting way to get into the world of reinforcement learning.

Our agent is an A2C network that, when trained in its current state, converges to outputting only 0s.

Below is our first version of the AI, taking random actions. The map is called CollectMineralShards; the aim is to move your marines (the two green circles) to as many Mineral Shards (blue circles) as possible in the allotted time.

Getting it to run

If you want to try out our project yourself, head over to https://github.com/deepmind/pysc2 and follow their instructions to set up the StarCraft II environment.

For our algorithm we used PyTorch, so make sure you have that installed and running.

After you install PyTorch, head over to GitHub, clone our repository, and run it according to the instructions there: https://github.com/Tzeny/deepstellar

PySC2 – StarCraft II Learning Environment

PySC2 is DeepMind’s Python component of the StarCraft II Learning Environment (SC2LE). It exposes Blizzard Entertainment’s StarCraft II Machine Learning API as a Python RL Environment.

It can run one or two agents per game, and many games in parallel.

Agents receive an observation of the game state every N in-game time steps. N is configurable; we used N = 16, which is equivalent to roughly 90 APM (actions per minute).
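As a rough check of that figure, here is a minimal sketch of the arithmetic. It assumes the game advances about 22.4 game steps per second (StarCraft II's "faster" speed), which is an assumption on my part, and the helper name `approx_apm` is hypothetical and not part of PySC2.

```python
# Assumption: about 22.4 game steps per second at "faster" game speed.
GAME_STEPS_PER_SECOND = 22.4


def approx_apm(step_mul, steps_per_second=GAME_STEPS_PER_SECOND):
    """Approximate actions per minute when acting once every `step_mul` game steps."""
    actions_per_second = steps_per_second / step_mul
    return actions_per_second * 60


if __name__ == "__main__":
    for n in (8, 16):
        print(f"N = {n}: about {approx_apm(n):.0f} APM")
```

Under this assumption N = 16 comes out at about 84 APM, in the same ballpark as the number quoted above.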

Below is the code for an agent that takes a random action at each time step.

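import numpy
from pysc2.agents import base_agent
from pysc2.lib import actions

class RandomAgent(base_agent.BaseAgent):
    """A random agent for starcraft."""

    def step(self, obs):
        super(RandomAgent, self).step(obs)
        # Pick one of the actions available in the current game state.
        function_id = numpy.random.choice(obs.observation.available_actions)
        # Sample each argument uniformly from its allowed range [0, size).
        args = [[numpy.random.randint(0, size) for size in arg.sizes]
                for arg in self.action_spec.functions[function_id].args]
        return actions.FunctionCall(function_id, args)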

The obs object contains valuable information about the game state:

Screen (you can select any combination of the 2 items below)
- Features (shown in Figure 1. above)
- RGB pixels
Minimap (you can select any combination of the 2 items below)
- Features (shown in Figure 1. above)
- RGB pixels
Player information
Control groups
Single select
Multi select
Cargo
Build queue
Alerts
Available actions
Last actions (only for successful actions)
Action result

Reinforcement Learning – A2C

Our agent has to look at the observations for the current step, choose an action that would best further its goals, and predict a value for the current state.

A state’s value is the sum of all the rewards the agent would collect if it started in that state and played on from there.
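To make this definition concrete, here is a minimal sketch that sums the rewards collected from each step onward. The discount factor `gamma` is an assumption (it is not mentioned above; `gamma = 1.0` gives the plain sum), and `rewards_to_go` is a hypothetical helper name, not something from our repository.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return, for each step t, the (discounted) sum of rewards from t onward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the sum of the steps after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Illustrative numbers only: a reward of 1 whenever a shard is collected.
    episode_rewards = [0, 1, 0, 0, 1]
    print(rewards_to_go(episode_rewards))
    print(rewards_to_go(episode_rewards, gamma=0.9))
```

The critic never sees these future rewards while playing; its job is to predict this quantity from the current observation alone.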

Intuitive explanation of A2C: https://hackernoon.com/intuitive-rl-intro-to-advantage-actor-critic-a2c-4ff545978752

All A2C architectures have 2 heads:

Actor – responsible for choosing an action. In our case the actor outputs both a distribution over action_ids and 4 continuous values used as arguments for some of the actions (for example, the 331/Move_screen command requires a point (x, y) on the screen as an argument)
Critic – responsible for deciding how good the actor’s actions are
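The two heads described above can be sketched as a shared body followed by separate outputs. This is a framework-free illustration, not our PyTorch model: the flat observation vector, the layer sizes, the ReLU and sigmoid activations, and the name `TwoHeadNet` are all assumptions made for the example.

```python
import math
import random


def linear(inputs, weights, biases):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_out, n_in, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class TwoHeadNet:
    """Shared body with an actor head and a critic head (hypothetical sizes)."""

    def __init__(self, n_inputs, n_hidden, n_actions, n_args=4, seed=0):
        rng = random.Random(seed)
        self.body = random_layer(n_hidden, n_inputs, rng)
        self.action_head = random_layer(n_actions, n_hidden, rng)  # actor: action_id scores
        self.arg_head = random_layer(n_args, n_hidden, rng)        # actor: continuous arguments
        self.value_head = random_layer(1, n_hidden, rng)           # critic: state value

    def forward(self, observation):
        hidden = [max(0.0, h) for h in linear(observation, *self.body)]  # ReLU
        action_probs = softmax(linear(hidden, *self.action_head))
        # Sigmoid squashes each argument into (0, 1); scaling to screen
        # coordinates is left out of this sketch.
        action_args = [1.0 / (1.0 + math.exp(-a))
                       for a in linear(hidden, *self.arg_head)]
        value = linear(hidden, *self.value_head)[0]
        return action_probs, action_args, value


if __name__ == "__main__":
    net = TwoHeadNet(n_inputs=8, n_hidden=16, n_actions=5)
    probs, args, value = net.forward([0.5] * 8)
    print(len(probs), round(sum(probs), 6), len(args), value)
```

Sharing the body means both heads learn from the same features, which is the usual motivation for this layout.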

Some inputs have variable sizes, so we are not feeding them to the network just yet.

Figure 2. Architecture of our 2 head neural network

The idea is to run the model for a number of time steps (10 in our case), and then train both the actor and critic.
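One common way to turn such a short rollout into training targets is to compute n-step returns, bootstrapping from the critic's estimate of the state reached after the last step. The sketch below assumes that scheme and a discount factor `gamma`; neither is confirmed by our current code, and the function name `n_step_targets` is hypothetical.

```python
def n_step_targets(rewards, values, bootstrap_value, gamma=0.99):
    """Compute n-step returns and advantages for one rollout.

    rewards[t] and values[t] come from step t of the rollout;
    bootstrap_value is the critic's estimate for the state after the last step.
    """
    returns = []
    running = bootstrap_value
    # Walk backwards from the end of the rollout, discounting as we go.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    # Advantage: how much better the rollout went than the critic predicted.
    advantages = [ret - val for ret, val in zip(returns, values)]
    return returns, advantages


if __name__ == "__main__":
    rollout_length = 10  # time steps between updates, as in the text
    rewards = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]  # illustrative values only
    values = [0.5] * rollout_length           # pretend critic predictions
    returns, advantages = n_step_targets(rewards, values, bootstrap_value=0.5)
    print([round(r, 3) for r in returns])
    print([round(a, 3) for a in advantages])
```

In this scheme the returns would serve as targets for the critic, while the advantages would weight the actor's update.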

I will explain the loss function in the next post, as I don’t have a good understanding of it yet.