taw's blog: Adventures with reverse engineering

Tuesday, April 28, 2015

Adventures with reverse engineering - Total War DB tables

Sailor Cat by dongato from flickr (CC-NC-ND)

Nearly five years after it all started, every single DB table for all Total War games is now decoded. I thought it would be a good idea to write about the process as reverse engineering case study.

If you find this interesting, you should probably also check my post about general principles of data archeology.

What are DB tables anyway?

Like most games these days, Total War games (Empire, Napoleon, Shogun 2, Rome 2, and various expansions for them) use virtual file system to store their data.

Within this virtual file system data is mostly of those kinds:

standard formats for things like textures, sounds, text - no need to reverse engineer those
ESF format - it's sort of like binary XML, and I documented it previously here
Various one-off formats, usually very simple
UI files - still not fully decoded, they control user interface
DB tables - sort of SQL-like tables

There can be multiple files for same DB table - they are merged by virtual filesystem level. Generally one of the fields in a row is primary key, to decide if DB tables in later patch override or append to original patch's DB table.

DB table structure

DB tables contain bulk of the data. The format itself is as simple as possible:

(optional) GUID
schema version number
number of rows
first row
second row
...

Every row within same table version has same schema. Whenever schema changed - usually to add extra rows - version number would be increased.

That sounds like pretty easy reverse engineering job. There are only a few problems:

schema is not stored with the data, or anywhere else I'm aware of - the game code knows it, but we need to figure it out
fields are stored as raw data, without any kind of type prefix
rows might have same schema, but many fields are variable size, so we don't know where one row ends and another begins

What types are we working with anyway?

As is usually the case for data formats created for one format, they just go for the simplest thing that could possibly work, so data is mostly just a few things:

int32
float32
boolean
length-prefixed string
nullable length-prefixed string

That's a very small set, and a few simpler DB tables can be solved entirely by eyeballing them in hex editor.

Sadly, the following 00 00 00 00, can mean any of:

1x int32 - 0
1x float32 - 0
4x boolean - false, false, false, false
2x unicode string - "", ""
4x nullable unicode string - NULL, NULL, NULL, NULL

That gets exponentially worse when there's a block of tons of NULLs. 12 consecutive 00 bytes can be any of 50559 schemas using just those 5 types, and we don't even know if we're still in the same row or another one started.

Is type T likely at offset N?

It was time to reduce it to a simpler question - can type T be at offset N? That's a stupid question, as the answer is generally yes. For example if next 4 bytes are: 01 00 59 00, that can mean any of:

int32 - number 5832705
float32 - number 8.173360559359682e-39
booleans - true, (and 3 more bytes)
unicode string "Y"
some 22784-character nullable unicode string (first one some kind of +Uxx00)

Most of them are pretty unlikely. So let's instead ask a question - is type T likely at offset N?

integers over 100000 or any negative integers other than -1 are unlikely
floats above 10000.0 or below 0.001 are unlikely
all possible boolean values - 00 and 01 - are considered likely
unicode strings with any non-ISO-8859-1 characters or above 256 characters are unlikely
nullable string is likely if it's NULL (00) or 01followed by likely string.

That's a decent way to quickly evaluate a schema I guess. If I think the file may possibly be "int32, float32, nullable unicode string, boolean, boolean", and I know number of records from the header, I can quickly check if it parses, and contains sensible values.

Brute force time!

Now that I have a way to answer the question if a schema matches, why not just brute force every schema shorter than 10 columns or so? And so I did. And it decoded a lot of tables.

Unfortunately monstrosities with 20+ are not uncommon. The largest one had literally 184 columns (and there was no way to know how many it would end up with just raw data). That's definitely beyond brute force range.

It was time for a few more advanced techniques:

the most common ambiguity was 00 00 00 00 which was either float 0.0 or integer 0. It was of course important, but it drastically reduces schema search space to treat them both as one type (int32_or_float32) - and guess in postprocessing which one it actually was.
sometimes values I considered unlikely happened - like very large integers (for some IDs) and very long strings (help messages) - so for many tables I had to adjust likely heuristics based on eyeballing what was inside

The biggest problem with all that was that I pretty much had to get it all right. If I had row boundaries I would be able to make good guess for column 1, then column 2, and so on. Without row boundaries, I know where column 1 of row 1 starts, but column 1 of row 2 could be pretty much anywhere.

In a fortunate accident, a lot of tables started with primary key. Which was usually a non-nullable string. Which was really easy to spot. Even the 184-column monstrosity (models_artillery_tables) just so happened to start with a string. Followed by another string, and then constant number of bytes for rest of the row. To figure out exact mix of floats, integers, and booleans in rest of the row I had to write special script, but row boundaries solves most of the problem.

Quite often I had more complex situation - I could figure out where row starts (with non-null string primary key), but the rest of the row was not fixed size. In such cases I'd just guess the first few column types, and get my program to brute force the rest.

models_buildings

With this mix of eyeballing hexes and autodetection script I managed to figure out most schemas, but two really didn't work. As it turned out, both used schemas completely unlike any other DB table.

Table for building models was unique, but it relatively easy to figure out manually with hex editor, each row being:

string
string
int32
number of elements (as int32), each of them being:

string
int32
9x float32 (3x XYZ vectors)

models_naval

That left just one table which was both unique and insanely complicated. I tried a bunch of times, but it was just really difficult.

My entire approach was based on figuring out whole schema at once - so it was not useful that I could see bits and pieces of structure.

I tried a bit, but eventually I gave up.

Many years later, Creative Assembly released modding kit for Shogun 2, and with it an example ship as XML. It was fairly frustrating as it was just a single model and it didn't have any cannons, but it contained a lot of hints as to structure of the data.

Another great thing was that for some weird reason Shogun 2 contained main table with a lot of ship models, but also 11 DB files with one model each. That's amazing - instead of trying to figure out 2.2MB file Empire had, I could try matching individual 80kB or so models.

Even then, my approach of full schema match wouldn't work - I knew the format was too complex for it from my previous failed attempts.

Instead I started by fishing for strings, converting the file to output like:

bytes 100..107 are string "foo"
bytes 108..119 are something that's not a string
bytes 120..127 are string "bar"
...

Well, that's a good start. Then I applied similar fishing to float32, int32, and everything that remained. This decoding contained a lot of errors (00 00 00 00 00 was sometimes "0, false", and something "false, 0" and the script completely messed that up), but it a lot of structure was visible, and there were loads of strings to work with.

Then came the breakthrough. I added custom matchers depending on value of a string. If a string started with word "cannon", try matching it with whatever I guessed to be cannon model. If it was "efline", go for guessed efline model. And so on.

Soon I'd get to:

int32 4
cannon: {cannon 1 data}
cannon: {cannon 2 data}
cannon: {cannon 3 data}
cannon: {cannon 4 data}
int32 15
efline: {efline 1 data}
...

Important thing was that those custom matchers operated on raw bytes (and falled back to lower level matchers if their match failed) - so they weren't at mercy of autodetector for raw 00s and such ambiguous data.

There were bits of weird stuff leftover, but surprisingly little.

Then I reran autodetector on data from Empire, Napoleon, and Rome 2, and it 90% fit, with fairly easy adjustments to make it fit completely.

And so it ends

With hindsight, naval models weren't quite as bad as I thought during my first attempt, and I probably could have figured them out even back then if I kept trying.

This still leaves a few things undecoded - mostly UI files for Shogun 2 and Rome 2. Perhaps I should give them another go as well.