StarCraft II, mpyq and adjutant
30 August 2010
StarCraft II was easily my most anticipated game of 2010. I had to pre-order it from the US to get into the beta, and I have probably sunk a couple of hundred hours into playing it already. The game is excellent and definitely a worthy successor to the venerable StarCraft: Brood War.
The idea of creating some kind of library to help me analyze the heaps of replay files I'd generate through playing on ladder crossed my mind pretty early in the beta. Ideally it would later form the backbone of some sort of web application with pretty graphs and everything, but the library would have to exist first — so I started writing it. I chose Python for this project because it's the language I'm most familiar with and I was eager to get fast results.
I think Blizzard made a mistake in not making the StarCraft II replay format completely public from the get-go. Writing both of these tools required considerable reverse engineering effort to get anywhere. There are still a LOT of unknown details in the replay files, and I feel I've barely scratched the surface so far. Making the replay format open and documenting it would help tremendously.
mpyq
The first issue I encountered was MPQ: Blizzard has been using MPQ (MoPaQ) as a fast-to-read binary archive format for their game assets for ages, and StarCraft II replays are no exception. After briefly getting familiar with the various C-based MPQ libraries I decided to roll my own, because I did not want a C-based external dependency in my project and I wanted to get more familiar with the MPQ format myself.
Luckily I found a couple of great resources on the inner workings of MPQ and wrote a preliminary version of my MPQ library in pure Python over the following weekend.
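To give a feel for what parsing MPQ in pure Python involves, here is a minimal sketch of reading the fixed-size archive header with the standard struct module. It assumes the 32-byte header layout of the original MPQ format as described in community documentation; the field names are my own labels, and parse_mpq_header is a hypothetical helper for illustration, not necessarily how mpyq does it. Note that in a replay the archive header may not sit at the very start of the file, which this sketch does not handle.

```python
import struct
from collections import namedtuple

# Assumed layout of the 32-byte MPQ header (original format version).
# Field names are illustrative labels, not taken from any official spec.
MPQHeader = namedtuple("MPQHeader", [
    "magic", "header_size", "archive_size", "format_version",
    "sector_size_shift", "hash_table_offset", "block_table_offset",
    "hash_table_entries", "block_table_entries",
])

# Little-endian: 4-byte magic, two uint32, two uint16, four uint32.
HEADER_FORMAT = "<4s2I2H4I"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def parse_mpq_header(data):
    """Parse an MPQ archive header from the start of a bytes object."""
    if len(data) < HEADER_SIZE:
        raise ValueError("not enough data for an MPQ header")
    header = MPQHeader._make(struct.unpack(HEADER_FORMAT, data[:HEADER_SIZE]))
    if header.magic != b"MPQ\x1a":
        raise ValueError("missing MPQ magic bytes")
    return header
```

The hash table and block table offsets found here are what a reader would follow next to locate individual files inside the archive.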
The result is mpyq, a Python library and command line tool for extracting MPQ archives. I was intrigued by the idea of having an "executable library" and to my delight this approach worked nicely.
The library is not the fastest MPQ library around by any means and it only handles a couple of the various compression schemes MPQ supports, so its general purpose utility is kind of limited at this point. However, it works great for smaller MPQ archives like replays. It is also great for taking a peek inside the larger MPQ archives — full extraction is not required to get a decent idea of what's inside, and with mpyq you can easily extract only the files you are interested in. It's great for exploring Blizzard's game assets.
In the future I'm going to add support for more compression schemes for increased backwards compatibility, add a decent test suite and turn the project into a "real" Python project indexed in PyPI and installable as an egg.
adjutant
That brings me to the actual StarCraft II replay parser, adjutant. After the initial version of mpyq was done, using it to parse the key details of replay files was pretty easy. I chose to separate the MPQ library from StarCraft II specific code pretty early on to keep the design clean and to make it easier to create other MPQ-related tools. The first independent version of my replay analyzer came together slightly before StarCraft II was released, and it was able to extract the map name, game duration, client version, and the players with their races and colors from the replay file.
After playing a couple of games and wanting to review them I noticed the lack of sophistication in the automatically generated replay names: first only a timestamp and later only the map name. I thought that this minor annoyance would be fixed before the game was released, but sadly Blizzard did not deliver on this front. Many players want to study and archive their replay files for future reference, and names like “Toxic Slums (434).SC2Replay” are not very illuminating. The replay browser also doesn't display the relevant data like players and map name in a tabular form, so players are forced to hunt for the right replay manually.
I tried to persuade one of my programmer friends to tackle this relatively simple problem using my MPQ library, but apparently the documentation is still lacking, as he found it a little too intimidating. I was pretty confident it was a 10 minute job and said so, and he challenged me to prove it. I added the renaming feature in about an hour of continuous development time. Not quite the 10 minutes, but definitely not too hard either.
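The renaming step itself is mostly string formatting once the details are parsed. Below is a minimal sketch, assuming the map name, the players with their races, and a date have already been extracted. The function build_replay_name, its arguments, and the naming scheme are hypothetical illustrations, not adjutant's actual interface.

```python
import re


def build_replay_name(map_name, players, played_on):
    """Build a descriptive replay file name.

    players is a list of (name, race) tuples and played_on is a
    'YYYY-MM-DD' string; both are assumed to come from a parsed replay.
    """
    # Abbreviate each race to its first letter, e.g. "Terran" -> "T".
    matchup = " vs ".join(
        "%s (%s)" % (name, race[:1].upper()) for name, race in players
    )
    raw = "%s %s - %s" % (played_on, matchup, map_name)
    # Replace characters that are not allowed in file names on common platforms.
    safe = re.sub(r'[\\/:*?"<>|]', "_", raw)
    return safe + ".SC2Replay"
```

With made-up player names, build_replay_name("Toxic Slums", [("Alice", "Terran"), ("Bob", "Zerg")], "2010-08-30") gives "2010-08-30 Alice (T) vs Bob (Z) - Toxic Slums.SC2Replay", which sorts by date and says at a glance who played what.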
The tool is currently far from what it can become — as I said before, I've barely scratched the surface. There is a lot of data to mine from the replay files, starting from winning percentages in each matchup all the way to statistics of APM and build orders in each game. I'm definitely going to create some sort of batch analysis mode where you feed the tool a directory of replays and it will aggregate the data inside. I'm also thinking about packaging this library inside a dedicated tool for only renaming the replay files. It would watch a given directory for new files and automatically rename them using the library. Stay tuned for version 0.2!
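As a rough idea of what the batch mode could compute, here is a minimal sketch of aggregating winning percentages per matchup. It assumes each replay has already been reduced to a (my race, opponent race, won) tuple; whether the winner can be read reliably from a replay file is itself an assumption here, and win_rates is a hypothetical helper rather than part of adjutant.

```python
from collections import defaultdict


def win_rates(games):
    """Return the win percentage per matchup, e.g. {"TvZ": 50.0}.

    games is an iterable of (my_race, opponent_race, won) tuples,
    assumed to have been extracted from a directory of replays.
    """
    totals = defaultdict(lambda: [0, 0])  # matchup -> [wins, games played]
    for my_race, opponent_race, won in games:
        matchup = "%sv%s" % (my_race[:1].upper(), opponent_race[:1].upper())
        totals[matchup][1] += 1
        if won:
            totals[matchup][0] += 1
    return {
        matchup: 100.0 * wins / played
        for matchup, (wins, played) in totals.items()
    }
```

Feeding it the results of parsing every replay in a directory would give a quick overview of which matchups need the most practice.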