Sunday, March 31, 2019

How I made my AlphaZero experiment 8x faster.

To learn AlphaZero, I decided to apply it to one of my favorite board games.

Along the way, I discovered many great examples that helped me learn about reinforcement learning.


Once my first implementation proved to be functional, I did a quick bit of math to predict how long training my neural network would take:

  • 1000 training iterations, each consisting of:
    • 100 learning games
    • 1 neural network training pass
    • 50 arena games to measure improvements
  • A game consists of:
    • ~400 decisions per game
    • 1000 Monte Carlo simulations per decision
    • 1 neural network call per simulation
    • 1 millisecond per neural network call
  • Neural network training is done over:
    • 200,000 training examples
    • Repeated over 10 epochs
    • 10 milliseconds per training example

Which adds up to: 2.54 years!
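
To sanity-check that total, here is the same arithmetic spelled out as a tiny Java snippet (my own back-of-the-envelope breakdown, not code from the project):

    // Back-of-the-envelope check of the estimate above.
    public class TrainingTimeEstimate {
        public static void main(String[] args) {
            // Self-play + arena: 1000 iterations x (100 learning + 50 arena) games,
            // each game 400 decisions x 1000 simulations x 1 ms per network call
            double selfPlayMs = 1000.0 * (100 + 50) * 400 * 1000 * 1;
            // Training: 1000 iterations x 200,000 examples x 10 epochs x 10 ms per example
            double trainingMs = 1000.0 * 200_000 * 10 * 10;
            double totalYears = (selfPlayMs + trainingMs) / 1000 / 3600 / 24 / 365;
            System.out.printf("Estimated training time: %.2f years%n", totalYears); // ~2.54
        }
    }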


Batching to the rescue

Training batch size

The first improvement I made was to increase the neural network training batch size. Going from a batch size of 1 to 5000 made training about 100 times faster for the same job. That's 230 days saved.
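
In Deeplearning4j, the training batch size comes from the DataSetIterator that feeds fit(). A minimal sketch of the idea (the class name and the hard-coded epoch count are just illustrative, not my actual training loop):

    import org.deeplearning4j.datasets.iterator.impl.ListDataSetIterator;
    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.dataset.DataSet;
    import java.util.List;

    public class BatchedTraining {
        // Train on the self-play examples 5000 at a time instead of one by one.
        public static void train(MultiLayerNetwork network, List<DataSet> examples) {
            ListDataSetIterator<DataSet> iterator = new ListDataSetIterator<>(examples, 5000);
            for (int epoch = 0; epoch < 10; epoch++) {
                network.fit(iterator); // one pass over all examples, 5000 per mini-batch
                iterator.reset();
            }
        }
    }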

Neural network call batching

With training batch size having such a huge impact on performance, I wondered: could neural network calls be batched as well?

A few experiments later, it turned out that batching neural network calls also yields large benefits:
  • Batch sizes of 100 are 20 times faster than individual calls
  • Batch sizes of 1000 are 40 times faster than individual calls
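
Concretely, in Deeplearning4j this means stacking the individual position encodings into one matrix and making a single output() call. A sketch, assuming each position is encoded as a one-row INDArray:

    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;
    import java.util.List;

    public class BatchedInference {
        // Evaluate many board positions in one forward pass instead of one call per position.
        public static INDArray evaluateBatch(MultiLayerNetwork network, List<INDArray> positions) {
            // Stack the one-row encodings into a single (batchSize x features) matrix
            INDArray batch = Nd4j.vstack(positions.toArray(new INDArray[0]));
            // One network call for the whole batch; row i of the result belongs to positions.get(i)
            return network.output(batch);
        }
    }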

The first problem: sequential algorithms

Monte Carlo simulations do not lend themselves very well to batching. Starting from the current state, they explore the decision tree until a new state is encountered, then query the neural network to analyze that new state. This is a problem because every call to the neural network depends on the result of previous calls.

The solution? By running many games in parallel, it's possible to advance each game up to its next neural network call and then batch all the calls together.
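
Here is a rough sketch of that loop (the Game interface and method names are illustrative, not my actual code): each game advances until it needs an evaluation, all pending positions go through the network in one batch, and the results are handed back to their games.

    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;
    import java.util.ArrayList;
    import java.util.List;

    public class ParallelSelfPlay {
        interface Game {
            boolean isFinished();
            INDArray nextPositionToEvaluate();      // run MCTS until a state needs the network
            void applyEvaluation(INDArray result);  // resume MCTS with the network's answer
        }

        public static void play(MultiLayerNetwork network, List<Game> games) {
            while (games.stream().anyMatch(g -> !g.isFinished())) {
                List<Game> pending = new ArrayList<>();
                List<INDArray> positions = new ArrayList<>();
                for (Game game : games) {
                    if (!game.isFinished()) {
                        pending.add(game);
                        positions.add(game.nextPositionToEvaluate());
                    }
                }
                // One batched call for every game currently waiting on an evaluation
                INDArray results = network.output(Nd4j.vstack(positions.toArray(new INDArray[0])));
                for (int i = 0; i < pending.size(); i++) {
                    pending.get(i).applyEvaluation(results.getRow(i));
                }
            }
        }
    }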

The second problem: recursive algorithms

The Monte Carlo algorithm I had was recursive and didn't lend itself to pausing in the middle of its execution.

The solution: refactor the algorithm from a recursive call into a loop.
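
In practice, that means the tree descent returns the node that needs an evaluation instead of recursing into it. Something along these lines, where the node fields are simplified placeholders for the real statistics:

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class IterativeSelection {
        static class Node {
            Node bestChild;   // child with the highest selection score, null at a leaf
            boolean expanded; // true once the network has evaluated this node
        }

        // Walk down the tree with a loop, remembering the path so the evaluation
        // can be backed up later. The returned leaf is what gets batched.
        static Deque<Node> selectLeaf(Node root) {
            Deque<Node> path = new ArrayDeque<>();
            Node current = root;
            while (current.expanded && current.bestChild != null) {
                path.push(current);
                current = current.bestChild;
            }
            path.push(current);
            return path;
        }
    }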

The third problem: memory

Most Monte Carlo implementations I saw keep a history of all the moves they encountered during a game, allowing them to effectively reuse states they've seen before. When running 100 games in parallel, each leading to about 400 decisions, each doing 1000 simulations, the program needs to hold 40 million game state statistics. At 1000 bytes per game state, that adds up to 40 gigabytes of required space.

The solution: purge old data. While it's convenient to remember all the game states, the older states are rarely reused in most games. Keeping a counter of the last time each game state was used makes this possible: once a game state gets too old, it gets purged.
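
A simple way to implement that counter (a sketch; the string key and the age threshold are assumptions to keep the example short):

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    public class StateCache {
        static class Stats {
            int visits;
            double totalValue;
            int lastUsedAtMove; // updated every time the state is looked up
        }

        private static final int MAX_AGE_IN_MOVES = 10; // illustrative threshold
        private final Map<String, Stats> statsByState = new HashMap<>();

        Stats lookup(String stateKey, int currentMove) {
            Stats stats = statsByState.computeIfAbsent(stateKey, k -> new Stats());
            stats.lastUsedAtMove = currentMove;
            return stats;
        }

        // Called once per move: drop every state that hasn't been touched recently.
        void purge(int currentMove) {
            Iterator<Map.Entry<String, Stats>> it = statsByState.entrySet().iterator();
            while (it.hasNext()) {
                if (currentMove - it.next().getValue().lastUsedAtMove > MAX_AGE_IN_MOVES) {
                    it.remove();
                }
            }
        }
    }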

The results: 8x faster

While running a single game is now significantly slower, running 100 games in parallel is 8x faster. The average time to run a neural network evaluation went from 800 microseconds to 80.

Both before and after the change, my 4 CPU cores were running at 95%.

Total execution time is now 58 days instead of 1.9 years. Not too bad for running the same job on the same hardware. And all the multi-threaded complexity is contained in Deeplearning4j.

Next steps

To speed things up further, there are a few things to explore. So far, switching to CUDA cores on my GPU actually slowed things down, as I'm mostly running small-scale dense layers. CUDA cores work wonders for neural network convolutions that operate on a fixed grid pattern, but not all games fit that pattern. In some board games, a single tile may have eight neighbors, while another has only one.

Running across multiple computers in parallel would definitely help as well.

If you have any other tips to speed things up, let me know in the comments!