Note: All images below are used with permission from Chalk Dust Magazine. Please see their article here.
Using machine learning to play games has always been an excellent way to understand and evolve learning models. In the last ten years, Google DeepMind managed to train a convolutional neural network to play Go, and IBM’s chess computer Deep Blue beat the best chess player in the world.
This all started with Donald Michie, who designed a model of matchboxes in 1960.
Before the implementation of sophisticated learning models, any ideas were designed mechanically. Michie’s design, called MENACE, was a large pile of matchboxes that contained a number of beads and learned to play tic-tac-toe.
MENACE works a little like a neural network. It is randomly optimized at the beginning, but after playing a few games it adjusts itself to favour moves that are supposedly more successful in each situation. Depending on the model’s success, it will either be punished or rewarded.
Each matchbox represents a specific board layout of tic-tac-toe, which explains why there are so many of them. It doesn’t consist of a box for each unique layout, though — if it did, there would actually be many more boxes. To make the model feasible, Michie took a few steps to simplify it: Firstly, he represented all layouts that were rotated versions of the same thing or that were symmetrical to each other with a single box.
For example, one box would represent all the layouts below:
When training begins, all boxes contain colour-coded beads, where each colour represents a move (or position) on a board.
MENACE makes a move when the human player randomly picks a bead out of the box that represents the game’s current state. The colour of the bead determines where MENACE will move. In some versions of MENACE, there were beads that only represented more blatant moves such as the side, centre, or corner.
The human player chooses the beads at random, just like a neural network’s weights are random at the start. Also like weights, the beads are adjusted when there is failure or success. At the end of each game, if MENACE loses, each bead MENACE used is removed from each box. If MENACE wins, three beads the same as the colour used during each individual turn are added to their respective box. If if the game resulted in a draw, one bead is added.
After 200 games play out, the matchboxes will have more of some beads than of others. This makes it more (or less) likely for a given bead to be picked. If there are 10 green beads in a matchbox, you can assume for that specific board layout that you should move to the green position. In some cases, a specific colour may no longer be in a matchbox because it was never a move.
Bear in mind that a draw is considered positive, because it suggests MENACE has learned. MENACE could never win against the perfect algorithm, but ended up drawing every time after about 90 games, making it equally as perfect.
When MENACE played a with a random picking opponent, the result is a near-perfect positive correlation:
You can try out a simulation of MENACE yourself here. It is a very simple example of how machine learning works, but acts as a great demonstration.
My name is Caspar Wylie, and I have been passionately computer programming for as long as I can remember. I am currently a teenager, 17, and have taught myself to write code with initial help from an employee at Google in Mountain View California, who truly motivated me. I program everyday and am always putting new ideas into perspective. I try to keep a good balance between jobs and personal projects in order to advance my research and understanding. My interest in computers started with very basic electronic engineering when I was only 6, before I then moved on to software development at the age of about 8. Since, I have experimented with many different areas of computing, from web security to computer vision.