taw's blog: March 2012

Thursday, March 29, 2012

ESF - Empire Total War Object Serialization Format

Princess Sophia (R.I.P. babygirl!) by Shandi-lee from flickr (CC-NC-ND)

ESF is the primary data format used by Total War series games starting from Empire, and then Napoleon, Shogun 2, and whatever comes next.

It was reverse engineered from partial documentation by various people including your truly, but so far there's no complete documentation available anywhere. This post is meant to be such documentation.

Obvious data types

Most of data types used in ESFs are straightforward - almost everything is little endian unless noted otherwise, data sizes are standard powers of two, negative numbers are two's-complement, character set is Unicode, floating point numbers are IEEE 754 etc.

int8, int16, int32, int64 - little-endian signed integers
uint8, uint16, uint32, uint64 - little endian unsigned integers
float32, float64 - single and double precission IEEE 754 floating point numbers
char8 - ASCII character
char16 - little-endian UTF-18 codepoint
bool8 - boolean (00 false, 01 true, other values never observed)

Strings

There are two kinds of strings used in ESF files - ASCII strings (used mostly for internal identifiers), and Unicode strings (used mostly for displayable text).

Both are prefixed by character count as uint16, for ASCII strings (ca_ascii):

uint16 count, char8[count] characters

For Unicode strings (ca_unicode):

uint16 count, char16[count] codepoints

There are no observed instances of ASCII strings having high bit set - so it's not possible to tell if that's meant as ISO-Latin-1, UTF-8, an error, or something else.

There are no observed instances of Unicode planes other than Basic Multilingual Plane being used.

There are no observed instances of 0 character/codepoint being present in either of them.

Neither string format is ever zero-terminated.

Exotic encodings for numbers

Before I get to the format itself, there are a few exotic encodings used in ESFs.

int24be and uint24be are big-endian (that is reversed) 24 bit (3 byte) signed and unsigned integer. So sequence of bytes 01 00 00 means 65536.

uintvar is a variable length encoding for unsigned integers, with following rules:

result = 0;
  while(data[i] & 0x80) {
    result = (result << 7) | (data[i] & 0x7f);
    i += 1;
  }

Or in diagrams:

0XXXXXXX → 0bXXXXXXX
1YYYYYYY 0XXXXXXX → 0bYYYYYYYXXXXXXX
1ZZZZZZZ 1YYYYYYY 0XXXXXXX → 0bZZZZZZZYYYYYYYXXXXXXX
1WWWWWWW 1ZZZZZZZ 1YYYYYYY 0XXXXXXX → 0bWWWWWWWZZZZZZZYYYYYYYXXXXXXX
1VVVVVVV 1WWWWWWW 1ZZZZZZZ 1YYYYYYY 0XXXXXXX → 0bVVVVVVVWWWWWWWZZZZZZZYYYYYYYXXXXXXX

Or in examples:

0 is encoded as 00
1 is encoded as 01
127 is encoded as 7f
128 is encoded as 81 00
255 is encoded as 81 7f

This format is somewhat analogous to UTF-8. In theory you could encode arbitrarily long numbers this way, but it's never longer than 5 bytes. And for that matter encoded numbers are never longer than 32 bits (even though 5 bytes would fit 35 bits in this encoding). There are many observed cases where encoding different than the shortest possible encoding is used - as far as I know it's typically 5 bytes used when 4 would suffice, but I haven't investigated it any deeper than that. For example - 80 80 80 80 01, 80 80 80 01, 80 80 01, 80 01, and 01 are all valid encodings for number 1.

Format variants

There are 4 variants of ESF format, usually referred to by their magic number:

ABCD - used in Empire Total War and Napoleon Total War
ABCE - used in Empire Total War and Napoleon Total War
ABCF - used in Shogun 2 Total War
ABCA - used in Shogun 2 Total War

Header

ESF files start with the following header:

uint32 - magic number (0xABCD, 0xABCE, 0xABCF, 0xABCA)
uint32 - 4 bytes, always zeros - not present in ABCD format
uint32 - 4 bytes, look like Unix timestamp - not present in ABCD format
uint32 - offset where footer starts

Lack of these two fields in the header is the only difference between ABCD and ABCE formats. Following header is a single root node (usually with more nodes nested within). Following that is footer, and possibly zero padding.

Footer

ESF format consists of nodes similar to XML tags. The first thing in the footer is a lookup table for names of these tags:

uint16 number of tag types
ca_ascii name of tag 0
ca_ascii name of tag 1
ca_ascii name of tag 2
...

For example tag table containing nodes "kittens" and "pandas" would be:

02 00 - size of tag name table as uint16
07 00 - size of string "kittens" as uint16
6b 69 74 74 65 6e 73 - ASCII encoding of "kittens"
06 00 - size of string "pandas" as uint16
70 61 6e 64 61 73 - ASCII encoding of "pandas"

For formats ABCD and ABCE that's the entire footer. Formats ABCF and ABCA contain two more data structures - lookup tables for Unicode strings and ASCII strings. ABCD/ABCE formats kept all strings other than node names within main data. The main change in ABCF/ABCA formats was moving the strings to the footer, and only using indexes to footer tables in main data, resulting in much smaller files. Footer for ABCF/ABCA continues as:

uint32 size of Unicode string lookup table
ca_unicode string A
uint32 index of string A
ca_unicode string B
uint32 index of string B
...
uint32 size of ASCII string lookup table
ca_ascii string A
uint32 index of string A
ca_ascii string B
uint32 index of string B
...

The main difference between tag name lookup table and these tables is that tag name lookup table simply used positions on the list as indexes, and these tables use explicit indexes (it's like Array vs Hash). There have been no observed cases of multiple entries having the same ID, or any other such irregularities. Following footer in some ABCA format files there are some 00 bytes. These as far as I know do nothing at all and are ignored.

Data nodes - numbers

Presenting His Royal Highness King Milo the First! by Malingering from flickr (CC-NC-ND)

Between header and footer there are data nodes - or to be more precise exactly one data node, which may contain nested nodes - very much like XML has one root element. Node type can be determined by its first byte. Numbers nodes have very simple correspondence between encoding and encoded data:

01 bool8 - boolean
02 int8 - int8 (never observed in practice)
03 int16 - int16
04 int32 - int32
05 int64 - int64 (rarely observed in practice)
06 uint8 - uint8
07 uint16 - uint16
08 uint32 - uint32
09 uint64 - uint64 (rarely observed in practice)
0a float32 - float32
0b float64 - float64 (never observed in practice)

A few further node types are only a bit more complicated:

0c float32 float32 - XY coordinates
0d float32 float32 float32 - XYZ coordinates
10 uint16 - angle from 0 to almost-360 degrees

Choice of numerical node types is not data dependent - if certain object has uint32 with value 1, it will always be encoded as 08 01 00 00 00 in ABCD/ABCE/ABCF formats. Most numerical data in ESF files is uint32, int32, and float32 - so there's plenty of zeroes there. As an optimization ABCA format introduced further node types not present in ABCD/ABCE/ABCF formats:

12 - boolean true
13 - boolean false
14 - uint32 zero
15 - uint32 one
16 uint8 - uint32
17 uint16 - uint32
18 uint24be - uint32
19 - int32 zero
1a int8 - int32
1b int16 - int32
1c int24be - int32
1d - float32 zero

So uint32 node with value 5 might be< encoded as 08 05 00 00 00, 16 05, 17 05 00, or 18 00 00 05. As far as I can tell the most compact possible representation is always used (16 05 in this case), but I haven't strictly verified that.

String nodes

In formats ABCD/ABCE string nodes are encoded as follows:

0e ca_unicode - unicode string node
0f ca_ascii - ASCII string node

In formats ABCF/ABCA instead of string they only contain an index to appropriate lookup table in the footer

0e uint32 - unicode string node
0f uint32 - ASCII string node

Array nodes

There are array nodes types corresponding to basic node types, their code is 40 + code of basic data type:

41 boolean array
42 int8 array
43 int16 array
etc.

In ABCD/ABCE/ABCF formats arrays are encoded as:

uint8 node-type-code (40..5f)
uint32 offset of first byte after end of array (this is a weird way of encoding size)
element 0
element 1
...

In ABCA format it's:

uint8 node-type-code (40..5f)
uintvar number of bytes in the array
element 0
element 1
...

So for example if we have array of two uint32s - [100, 200] starting at offset 0x6000 in file, it will be encoded like this in ABCD/ABCE/ABCF:

[offset 0x6000] 48 - since 08 is node type code for uint32, 48 is node type code for array of uint32s
[offset 0x6001..0x6004] 0c 60 00 00 - offset of first byte after end of array
[offset 0x6005..0x6008] 64 00 00 00 - 100 encoded as uint32
[offset 0x6009..0x600b] c8 00 00 00 - 200 encoded as uint32

In ABCA format array encoding doesn't depend on offset, so it's:

48 - code for array of uint32s
08 - size of the array as uintvar
64 00 00 00 - 100 encoded as uint32
c8 00 00 00 - 200 encoded as uint32

However the same way numbers in ABCA format can be encoded with shorter encodings if they fit, entire arrays can as well, if every one of their elements does:

56 - since 16 is code of uint32-encoded-as-uint8
02 - size of the array as uintvar
64 - 100 encoded as uint8
c8 - 200 encoded as uint8

There's no way to mix encodings within the array. Array of [0, 1, 1000] would be encoded as:

57 - since 17 is code of uint32-encoded-as-uint16
06 - size of array as uintvar
00 00 - 0 encoded as uint16
01 00 - 1 encoded as uint16
e8 03 - 1000 encoded as uint16

Now since arrays encode their size as either byte count (ABCA) or file offset (ABCD/ABCE/ABCF) and not as number of elements, they never use 0-size encodings like 52, 53, 54, 55, 59, 5d. In ABCF/ABCA formats there are arrays of ASCII and Unicode strings, and they cause no complications since "string" there is just uint32 index to a separate lookup table. There are no such arrays in ABCD/ABCE formats, where strings are encoded directly in the data using variable number of bytes.

Record node

Prince Levi by geckoam from flickr (CC-NC-ND)

The most important node type in ESF is record type, vaguely corresponding to tags in XML, or non-basic object types in Java. The root of ESF files is always a record type, and since records can contain any number of nodes in them, they provide basic structure to ESF files. For ABCD/ABCE/ABCF format record nodes are encoded as follows:

uint8 80 - record node code
uint16 tag name - it's index to table of tags in the footer
uint8 version - version number - starts with 0, updated every time object format changes
uint32 offset of first byte after end of record
node 0
node 1
...

In ABCA situation is much more complicated, since there are actually three formats of records. Root node - and root node only - is encoded as:

uint8 80
uint16 tag name
uint8 version
uintvar size of data within record in bytes
node 0
node 1
...

Other nodes can either use traditional encoding, except with different node type code:

uint8 a0
uint16 tag name
uint8 version
uintvar size of data within record in bytes
node 0
node 1
...

Or a new compact encoding. In compact encoding first two bytes are better considered as a bitfield:

100vvvvt tttttttt - That's 4 bits for version number and 9 bits for node type.
uintvar size of data within record in bytes
node 0
node 1
...

So node type byte can be anything between 80 and 9f.

Array of records node

The last type is arrays of records. All records in the array are the same type and version. For ABCD/ABCE/ABCF format it's encoded as follows:

uint8 81 - array of records node code
uint16 tag name - it's index to table of tags in the footer
uint8 version - version number
uint32 offset of first byte after end of array
uint32 number of elements
uint32 offset of first byte after end of record #0
contents of record 0
uint32 offset of first byte after end of record #1
contents of record 1
...

ABCA format has two encodings. Traditional encoding:

uint8 e0 - array of records node code
uint16 tag name - it's index to table of tags in the footer
uint8 version - version number
uintvar size of array contents in bytes
uintvar number of elements
uintvar size of record 0 in bytes
contents of record 0
uintvar size of record 1 in bytes
contents of record 1
...

And compact encoding very similar to one used for records:

110vvvvt tttttttt - That's 4 bits for version number and 9 bits for node type.
uintvar size of array contents in bytes
uintvar number of elements
uintvar size of record 0 in bytes
contents of record 0
uintvar size of record 1 in bytes
contents of record 1
...

So array node type is either c0..df for compact encoding, or e0 for traditional encoding.

Putting it all together

So to put it all together ESF files consist of:

uint32 - magic number (0xABCD, 0xABCE, 0xABCF, 0xABCA)
uint32 - 4 bytes, always zeros - not present in ABCD format
uint32 - 4 bytes, look like Unix timestamp - not present in ABCD format
uint32 - offset where footer starts
root node (and more nodes within it)
array table of tag name
hash table for Unicode string lookup
hash table for ASCII string lookup
optional and completely ignored zero padding

If you want to play with ESF files, the best way is to download esf-xml converter from etwng repository.

Monday, March 19, 2012

Software Triage

Muffin // Our new kitten by Merlijn Hoek from flickr (CC-NC-ND)

Hi!

For variety of complicated personal reasons I haven't been updating my blog in ages.

If you want to follow me, I've been a lot more active on Twitter and Google+. And on soup, but that's very different kind of activity. If social networking nonsense ever stabilizes, we might end up in a flow where things start as a single sentence on something Twitter-like, then some of them get elaborated into a couple of paragraphs on something Google+-like, and eventually a few become full blog posts on something - uhm - blog-like. And the stupid masses are kept away from it all on something Facebook-like, and recruiters keep spamming people on something LinkedIn-like etc.

Anyway, I'm sort of back, I hopefully your deliveries of kittens and technology will resume on a more frequent basis.

Since my software (at least the kind I published) has seen very little activity, I want to start with triage of everything I published as Open Source (and still remember), and quick decisions what I'm planning to do about it all. And by the way I'll probably be switching from OSX to Ubuntu seriously this time, which is also somewhat relevant to these decisions.

The Morgue

First, some truly ancient software I mention mostly to make sure I don't miss anything important.

I have 4 projects on sourceforge. Three of them - FreeTable,
Gtk+/CLI IDP Interface, and
Joystick Mouse Emulation (funny story that, I fried mouse port on my computer and had to do this as a hack, then it turned out it got some legitimate users) are like 12 years old, back from good old days when people still believed in Linux on desktop. They're of course dead, have been dead for ages, and won't see any development ever. I vaguely recall that FreeTable actually got some uses, and people even packaged it for a few distros, and if they did so, good for them!

My fourth sourceforge project is MediaWiki, which is doing perfectly well without me. texvc was one of the big features that let Wikipedia become what it is, but it's pretty much done and doesn't need any extra work.

Another project which is pretty much dead is iPod-last.fm bridge, which died together with my iPod. Sansa Clip is an MP3 player superior is every possible way, except it doesn't keep logs of what you've played, but then I can live without last.fm knowing everything about me (especially since its recommendation-based radio is a pile of fail anyway). I don't play to get any new iPods, so it's unlikely it will get resurrected ever. Feel free to take over the project if you need it.

Art Projects

I apparently published quite a few programs which are entirely useless for anything other than providing 15 minutes of amusement tops. They won't see any more development not because they're "dead" - they're just "complete". Death of the Author and such pomo stuff (a small fraction of pomo stuff is actually not total bullshit, and that's one of these rare things).

Amethyst - Ruby syntax for Perl - was full of win, but it doesn't need or want any more work on it. Successor to Perl 5 arrived some time ago and it's called Ruby.

My attempts at building a CPU out of TTLs are also unlikely to be resurrected. I sort of vaguely plan to go back to hardware hacking some day, but if it happens, it will more likely be Arduino or something else higher level, not TTLs. Raspberry Pi looks interesting as well, other than the horrible name.

Does anybody use these things?

Back when I was on Wikipedia I wrote tawbot, which was a small program for doing various admin tasks on Wikipedia. It used to have quite a few users, but I haven't heard from them ever since. Is it still in use? I don't think so, and since I'm not really editing Wikipedia past fixing typos and dead links and such trivial things, there's probably no reason to unabandon it. If I'm wrong, definitely tell me.

XSS Shield for Rails was a revolutionary system (the revolutionary part is that anybody cared at all about security in the Rails world) for protecting Rails against XSS attacks. Rails 1.x. These days Rails supposedly has builtin XSS protection system (and ton of other security screwups, hi there github!), so unless you're forced to use some ancient version of Rails, my XSS Shield can be allowed to rest in peace.

I have a bunch of cool local patches for acts_as_solr, but I had trouble even finding upstream. Is anybody interested? I could put the patched version on github or something.

Update urgently needed

One thing I really need to do something with, and sooner rather than later, is jrpg. I don't plan to do any serious development, but it's suffering from software rot - every new version of Python, Windows, or graphics drivers causes more and more problems. Upgrading it to some new Python (not sure if 3.x would be necessary or if 2.7 would be enough), and building new packages for Windows and OSX should be about enough. My bad that I left it not done for so long.

I have a bunch of Ruby libraries - magic/help, magic/xml, and libgmp-ruby which are not in gems for a simple reason that they're older than gems. All three could still see some use, except of course nobody will ever bother to manually build them in 2012, we're too lazy these days. So an evening or two of modernizing them, putting them on github, packaging them into gems, and possibly making them Ruby 1.9.x (or 1.8.7 for all I know, they still remember Ruby 1.6 era) compatible would be a huge improvement.

Total War

One thing I have been doing a lot - and I never wrote one word about it on this blog - are development tools for Total War series. A big problem with video game modding is that people lack good Open Source practices - they don't put their software in publicly accessible repositories, often get into drama on who can do what (I wrote a mod but don't you dare reuse my files in your mod!). I've been trying to change that and all my Total War related stuff is on github in etwng repository.

I have tons of converters for various formats, scripts for common modding tasks, even Linux and OSX filesystem drivers (FUSE FTW! I just feel dirty for using C++ for it) for their custom package format. This is still active and well, and people have even written third party GUI frontends for my tools etc.

It might be a good idea to take some time and gather links to everything Total War related that's not there, find all abandoned tools and give them new home (like I did with alpaca's ui coverter), and maybe put my old Rome Total War / Medieval 2 Total war stuff there just for completeness, but none of these are really high priority issues.

RLisp

RLisp was something between an art project and a "serious" Ruby/Lisp hybrid language - where "serious" means something like CoffeeScript-serious or HAML-serious, which isn't really all that serious when you think about it.

The thing is - VM for Ruby 1.8 really didn't let me do everything I wanted to do with it for just boring practical reasons. I even tried to compile it to C linked with libruby. I'd like to at least take a look at Ruby 1.9 and JRuby VMs (VM inception) if they are any better. If not, tough luck, it was loads of fun anyway.

Did I mention I'm switching to Linux?

You won't hear me bitch about OSX any more, how awesome is that? Unfortunately, there are two huge things missing on Linux - TextMate and OmniFocus.

TextMate has a sort of replacement in redcar, which I'm not really too happy with. I've complained about it, and even made some minor patches, the truth is I will pretty much have to spend a lot of time on redcar to make it more like TextMate if I want to stay OSX-free for long. I just hope I won't even reach the desperation necessary to switch to Emacs.

The other thing that Linux misses is OmniFocus - pretty much the only GTD app which isn't made of suck. This doesn't sound too complicated (compared with let's say a text editor or packaging system - things people keep failing at all the time), so I might end up writing one if I'm unhappy with everything. If you know of a good offline GTD app I can try, do tell me.

And that's about it

That's it for today. I'm surprised my kitten-pic-finding scripts even worked after all that time. Did I miss any software I wrote that you care about?

Hopefully it will take less than a year for the next post.