The best kittens, technology, and video games blog in the world.

Saturday, July 20, 2013

Introduction to Data Archeology

A lot of data is in open or otherwise maintained formats - which you can access directly or use one of existing converters to work with. But sooner or later you're going to run into some completely undocumented format created just for a single no longer maintained software for which you have no source code, documentation, or tools of any kind.

Especially when you want to mod video games - which are pretty much all one-off no longer maintained programs with no source code, documentation, or tools of any kind.

If that ever happens - it's time to put your Indiana Jones hat and start some good old Data Archeology!

The most important fact about Data Archeology

The task might seem insurmountable at first - you're given a huge stash of undocumented binary (usually, I'll deal with text files or executable code later) files and task to make sense out of them, and at least extract data - or even better write tools to modify them.

Fortunately there's one thing helping you:
Data formats weren't designed to make your job harder - they were designed to make original developers' jobs easier.
Hypothetically the data could be completely crazy - it's not that much more difficult to use 22-bit middle-endian integers, strings encoded in backwards UTF-7, and interleaving members of the same struct one byte at a type. In practice you'll be seeing the simplest possible encodings 99% of the time because they get things done just as well.

How to get started

The best way to start is just getting some hex editor - 0xED for OSX is pretty good, but then so is plenty of others (feel free to recommend them in comments). Then open some data files and look inside.

There are two ways to look at files - bottom up and top down. Usually it's easier to start with a bottom up approach first. At this point you don't want to decode anything, just identify some patterns how data is encoded, and get an idea what the file is for in case that's not yet obvious.

Going bottom up

Here are some of the things you could look for:
  • Are there obvious strings in the file? Usually they're either ASCII (or derived encoding), or UTF-16. Usually they are either terminated by a 0 byte like in C, prefixed by number of characters or bytes (with length encoded in 8 or 16 bits), or padded with 0 bytes or spaces to some predefined size, but they could be something else.
  • Do you see any small integers? Are they 8, 16, or 32 bits? Little or big endian?
  • Can you see any floating point numbers? For PC games they're usually 32 bit (float), less commonly 64 bit (double). You'll probably need help from your hex editor to see them. Are there any obvious pairs or triples which could be 2D or 3D coordinates?
  • Does the file has obvious header at the beginning? Some kind of section headers within the file?
  • Do you see embedded text files, XMLs, or other formats?
  • Do you see anything which could be a file offset, or data size? These will take a little practice to recognize - and it helps to have file offsets and data decoding in your hex editor set to the same setting (either both hex or both decimal). File offsets are usually 32 bits, data sizes can be 16 or 32 bits, but it could be something else.
  • Are there any repetitive patterns like common data structures? Strings or file offsets in the file make it relatively easy to notice patterns, otherwise it's usually a lot trickier.
  • Does the file contain a lot of data that's not obviously decodable as something already mentioned?
You're not trying to decode the entire file at this point - just get a sense of what kind of data is inside, and how it's encoded.

For bottom up analysis it's best to pick a variety of highly complex samples of the format - the more complex and bigger the data, the more likely you'll be to find something useful there.

Bottom up - example

Let's take a look at groupformations.bin file from Empire: Total War.

We open the file - and from abundance of 00s everywhere it's clear they didn't bother encrypting (or more like obfuscating, usually), compressing, or optimizing it too much. We can get straight to the good stuff.

The first thing that should be immediately apparent are UTF-16 strings - "Multiple Selection Drag Out Line" and "Single Line Standard" (highlighted red). Just before each of them there seems to be a small little-endian int16 number (highlighted green) - it doesn't take much guessing that it's probably number of characters in the string. We already have something.

There are a few things that look very much like small little endian int32 numbers (highlighted blue) - FF FF FF FF block is pretty clearly -1 (unless it's two int16 -1s, but that's much less likely). The one at the beginning of file just before the string looks like int32 too. Could that be number of records in the file, size of something, section id? We'll figure that out later.

It seems everything is aligned to 2 bytes. There's no guarantee we won't find something that breaks the alignment, but that can save us some guesswork. It also seems that things are aligned to 4 bytes, but presence of dynamically-sized UTF-16 strings suggests any string with odd number of characters will violate 4-bytes alignment. You can scroll down the file to check if alignment is consistent everywhere - if it is, it will save you some time.

There are a few stranger things. What's that "00 00 80 3F" (highlighted yellow)? If you move your cursor there with your hex editor it will tell you it's a 32-bit floating point number for 1.0. Once we establish there are floats there we can guess that "00 00 00 40" a bit further down is probably floating point for 2.0, not alignment-violating integer 64 (hex 0x40).

We can also take advantage of negative spaces - the "01 00 00 00" between "FF FF FF FF" (-1) and "00 00 80 3F" (1.0) we already decoded could be a few things - but since it's exactly 4 bytes and everything else seems to be 4 bytes long int32 number 1 seems like the most obvious interpretation.

This is a really good start. The file seems to consist of only:
  • UTF-16 strings (with 16-bit character count prefix)
  • 32-bit little endian integers
  • 32-bit floats
We don't quite know how they're structured yet, and what they mean, but we're up for a good start.

Top down

The second approach is going top down. Instead of trying to figure out bits and pieces of data we'll try to figure out how the data is organized.

For top down analysis it's best to start with a few very simple files - the simpler the better. Hopefully you'll see all the structural elements like headers, timestamps, checksums, and whatnot without actual data getting in the way.

Once you're comfortable decoding the simplest files try your understanding on the next slightly more complex one - it probably will need a few tweaks to work. Then keep adding more files to your test set.

You can either test your understanding manually with pen and paper - or by writing a converter program. I'll cover writing converters (it's much easier than it sounds) in some future post, for now let's do good old pen and paper conversion.

Top down - example

Let's try decoding much more complex target - Empire: Total War db files. It's actually not a single format, but a group of somewhat related formats.

Looking for the simplest case we found a db file with just 5 bytes. Let's look inside.
5 is a strange number and there are two reasonable way to interpret this data - it's either "01" (byte 1 - highlighted green) followed by "00 00 00 00" (int32 0 - highlighted red), or "01 00 00 00" (int32 1) followed by "00" (byte 0).

The answer is not obvious yet, so let's look at some slightly more complex file. We found one with just 27 bytes.
First, the mystery from the previous file is solved - "1 (green), 2 (red)" is much more likely combination that "513, 0". The rest of the file is two length-prefixed UTF-16 strings - first in yellow, second in blue.

That gives us a pretty decent theory of how db files are organized:

  • mysterious byte 01
  • number of records as int32
  • record 1
  • record 2
  • ...
We still have a lot to discover:
  • What's the mysterious byte 01? (it turns out it's a schema version number - but with funny encoding)
  • How can we know what kind of data is in records (it turns out every db table has different record types)
  • What these records actually mean? (that's a lot of detective work, but it's much easier after you decode the data into spreadsheet or text file)

Writing converters

The final step is writing a converter. If you managed to completely figure the format out with pen and paper this is really straightforward.

Unfortunately most formats are too complex to decode them by hand, so learning about the format and writing the converter will have to be done simultaneously.

And again, there are two ways to go about writing converters - bottom up or top down.

In top down approach, you write a converter that tries to decode the file completely, and throw increasingly complex files at it, modifying it as you go. This is relatively easy, since you have a working converter (for simplest of the files) very early, and you can be sure you're not breaking things by rerunning it on the same simple example even once it gets support for more complex formats. The downside is that it's an all-or-nothing approach - if the format is too complicated to decode fully you won't get even partial information out of it.

In bottom up approach you teach converter how to detect partial data structures (like strings, floats, etc.) and then to skip parts of the file which it doesn't understand. This can be very powerful technique - especially for very complex formats where full decoding is not yet possible. Unfortunately it's usually much more difficult to write such converters, so it's an advanced technique to try only after you get some practice with data archeology.

Coming next

In future blog posts, I'll show you how to easily write such converters to decode a few file formats.

Or if you can't wait, you can check etwng repository, with a lot of converters for various formats from Empire: Total War and other games from Total War series.


Anonymous said...

How 'bout HexFiend?

taw said...

Anonymous: I just gave HexFiend a try, and I'd not recommend it.

The main problem is that it will only show you a few data decodings (instead of all reasonable decodings) and only when you highlight the data.

0xED displays it right away (you might need to disable "Require selection for visualizer") - this is many many times faster.

Anonymous said...

As a tester, this is pure GOLD! Thanks mate!

Prozacgod said...

I used to do data conversions for AVImark (veterinarian medicine) we had an in-house tool we hacked on called 'Schema' that allowed us to examine and decode data from a wide variety of systems, so that we could export the records into our proprietary format.

I wish I could get a copy of that and open source parts of it, it was honestly a dream tool for record based file decoding.

After decoding data for a while, it really does feel like that scene from the matrix, except its more like "oh that's obviously a column of integers, and thats a IEEE float... oh and that looks like a floating point mantissa, but the length is only 3 bytes??"

Thanks for the article!

taw said...

Prozacgod: If you're working with fairly constrained data, it's possible to write automatic analysis tools. I wrote some for decoding Empire Total War db tables - they were horrible mess, but they could do wonders.