Thursday, March 29, 2012

ESF - Empire Total War Object Serialization Format

Princess Sophia (R.I.P. babygirl!) by Shandi-lee from flickr (CC-NC-ND)
ESF is the primary data format used by Total War series games starting from Empire, and then Napoleon, Shogun 2, and whatever comes next.

It was reverse engineered from partial documentation by various people including your truly, but so far there's no complete documentation available anywhere. This post is meant to be such documentation.

Obvious data types


Most of data types used in ESFs are straightforward - almost everything is little endian unless noted otherwise, data sizes are standard powers of two, negative numbers are two's-complement, character set is Unicode, floating point numbers are IEEE 754 etc.

  • int8, int16, int32, int64 - little-endian signed integers
  • uint8, uint16, uint32, uint64 - little endian unsigned integers
  • float32, float64 - single and double precission IEEE 754 floating point numbers
  • char8 - ASCII character
  • char16 - little-endian UTF-18 codepoint
  • bool8 - boolean (00 false, 01 true, other values never observed)

Strings


There are two kinds of strings used in ESF files - ASCII strings (used mostly for internal identifiers), and Unicode strings (used mostly for displayable text).

Both are prefixed by character count as uint16, for ASCII strings (ca_ascii):

  • uint16 count, char8[count] characters

For Unicode strings (ca_unicode):

  • uint16 count, char16[count] codepoints

There are no observed instances of ASCII strings having high bit set - so it's not possible to tell if that's meant as ISO-Latin-1, UTF-8, an error, or something else.

There are no observed instances of Unicode planes other than Basic Multilingual Plane being used.

There are no observed instances of 0 character/codepoint being present in either of them.

Neither string format is ever zero-terminated.

Exotic encodings for numbers


Before I get to the format itself, there are a few exotic encodings used in ESFs.

int24be and uint24be are big-endian (that is reversed) 24 bit (3 byte) signed and unsigned integer. So sequence of bytes 01 00 00 means 65536.

uintvar is a variable length encoding for unsigned integers, with following rules:

result = 0;
  while(data[i] & 0x80) {
    result = (result << 7) | (data[i] & 0x7f);
    i += 1;
  }
Or in diagrams:
  • 0XXXXXXX → 0bXXXXXXX
  • 1YYYYYYY 0XXXXXXX → 0bYYYYYYYXXXXXXX
  • 1ZZZZZZZ 1YYYYYYY 0XXXXXXX → 0bZZZZZZZYYYYYYYXXXXXXX
  • 1WWWWWWW 1ZZZZZZZ 1YYYYYYY 0XXXXXXX → 0bWWWWWWWZZZZZZZYYYYYYYXXXXXXX
  • 1VVVVVVV 1WWWWWWW 1ZZZZZZZ 1YYYYYYY 0XXXXXXX → 0bVVVVVVVWWWWWWWZZZZZZZYYYYYYYXXXXXXX
Or in examples:
  • 0 is encoded as 00
  • 1 is encoded as 01
  • 127 is encoded as 7f
  • 128 is encoded as 81 00
  • 255 is encoded as 81 7f
This format is somewhat analogous to UTF-8. In theory you could encode arbitrarily long numbers this way, but it's never longer than 5 bytes. And for that matter encoded numbers are never longer than 32 bits (even though 5 bytes would fit 35 bits in this encoding). There are many observed cases where encoding different than the shortest possible encoding is used - as far as I know it's typically 5 bytes used when 4 would suffice, but I haven't investigated it any deeper than that. For example - 80 80 80 80 01, 80 80 80 01, 80 80 01, 80 01, and 01 are all valid encodings for number 1.

Format variants

There are 4 variants of ESF format, usually referred to by their magic number:
  • ABCD - used in Empire Total War and Napoleon Total War
  • ABCE - used in Empire Total War and Napoleon Total War
  • ABCF - used in Shogun 2 Total War
  • ABCA - used in Shogun 2 Total War

Header

ESF files start with the following header:
  • uint32 - magic number (0xABCD, 0xABCE, 0xABCF, 0xABCA)
  • uint32 - 4 bytes, always zeros - not present in ABCD format
  • uint32 - 4 bytes, look like Unix timestamp - not present in ABCD format
  • uint32 - offset where footer starts
Lack of these two fields in the header is the only difference between ABCD and ABCE formats. Following header is a single root node (usually with more nodes nested within). Following that is footer, and possibly zero padding.

Footer

ESF format consists of nodes similar to XML tags. The first thing in the footer is a lookup table for names of these tags:
  • uint16 number of tag types
  • ca_ascii name of tag 0
  • ca_ascii name of tag 1
  • ca_ascii name of tag 2
  • ...
For example tag table containing nodes "kittens" and "pandas" would be:
  • 02 00 - size of tag name table as uint16
  • 07 00 - size of string "kittens" as uint16
  • 6b 69 74 74 65 6e 73 - ASCII encoding of "kittens"
  • 06 00 - size of string "pandas" as uint16
  • 70 61 6e 64 61 73 - ASCII encoding of "pandas"
For formats ABCD and ABCE that's the entire footer. Formats ABCF and ABCA contain two more data structures - lookup tables for Unicode strings and ASCII strings. ABCD/ABCE formats kept all strings other than node names within main data. The main change in ABCF/ABCA formats was moving the strings to the footer, and only using indexes to footer tables in main data, resulting in much smaller files. Footer for ABCF/ABCA continues as:
  • uint32 size of Unicode string lookup table
  • ca_unicode string A
  • uint32 index of string A
  • ca_unicode string B
  • uint32 index of string B
  • ...
  • uint32 size of ASCII string lookup table
  • ca_ascii string A
  • uint32 index of string A
  • ca_ascii string B
  • uint32 index of string B
  • ...
The main difference between tag name lookup table and these tables is that tag name lookup table simply used positions on the list as indexes, and these tables use explicit indexes (it's like Array vs Hash). There have been no observed cases of multiple entries having the same ID, or any other such irregularities. Following footer in some ABCA format files there are some 00 bytes. These as far as I know do nothing at all and are ignored.

Data nodes - numbers

Presenting His Royal Highness King Milo the First! by Malingering from flickr (CC-NC-ND)
Between header and footer there are data nodes - or to be more precise exactly one data node, which may contain nested nodes - very much like XML has one root element. Node type can be determined by its first byte. Numbers nodes have very simple correspondence between encoding and encoded data:
  • 01 bool8 - boolean
  • 02 int8 - int8 (never observed in practice)
  • 03 int16 - int16
  • 04 int32 - int32
  • 05 int64 - int64 (rarely observed in practice)
  • 06 uint8 - uint8
  • 07 uint16 - uint16
  • 08 uint32 - uint32
  • 09 uint64 - uint64 (rarely observed in practice)
  • 0a float32 - float32
  • 0b float64 - float64 (never observed in practice)
A few further node types are only a bit more complicated:
  • 0c float32 float32 - XY coordinates
  • 0d float32 float32 float32 - XYZ coordinates
  • 10 uint16 - angle from 0 to almost-360 degrees
Choice of numerical node types is not data dependent - if certain object has uint32 with value 1, it will always be encoded as 08 01 00 00 00 in ABCD/ABCE/ABCF formats. Most numerical data in ESF files is uint32, int32, and float32 - so there's plenty of zeroes there. As an optimization ABCA format introduced further node types not present in ABCD/ABCE/ABCF formats:
  • 12 - boolean true
  • 13 - boolean false
  • 14 - uint32 zero
  • 15 - uint32 one
  • 16 uint8 - uint32
  • 17 uint16 - uint32
  • 18 uint24be - uint32
  • 19 - int32 zero
  • 1a int8 - int32
  • 1b int16 - int32
  • 1c int24be - int32
  • 1d - float32 zero
So uint32 node with value 5 might be< encoded as 08 05 00 00 00, 16 05, 17 05 00, or 18 00 00 05. As far as I can tell the most compact possible representation is always used (16 05 in this case), but I haven't strictly verified that.

String nodes

In formats ABCD/ABCE string nodes are encoded as follows:
  • 0e ca_unicode - unicode string node
  • 0f ca_ascii - ASCII string node
In formats ABCF/ABCA instead of string they only contain an index to appropriate lookup table in the footer
  • 0e uint32 - unicode string node
  • 0f uint32 - ASCII string node

Array nodes

There are array nodes types corresponding to basic node types, their code is 40 + code of basic data type:
  • 41 boolean array
  • 42 int8 array
  • 43 int16 array
  • etc.
In ABCD/ABCE/ABCF formats arrays are encoded as:
  • uint8 node-type-code (40..5f)
  • uint32 offset of first byte after end of array (this is a weird way of encoding size)
  • element 0
  • element 1
  • ...
In ABCA format it's:
  • uint8 node-type-code (40..5f)
  • uintvar number of bytes in the array
  • element 0
  • element 1
  • ...
So for example if we have array of two uint32s - [100, 200] starting at offset 0x6000 in file, it will be encoded like this in ABCD/ABCE/ABCF:
  • [offset 0x6000] 48 - since 08 is node type code for uint32, 48 is node type code for array of uint32s
  • [offset 0x6001..0x6004] 0c 60 00 00 - offset of first byte after end of array
  • [offset 0x6005..0x6008] 64 00 00 00 - 100 encoded as uint32
  • [offset 0x6009..0x600b] c8 00 00 00 - 200 encoded as uint32
In ABCA format array encoding doesn't depend on offset, so it's:
  • 48 - code for array of uint32s
  • 08 - size of the array as uintvar
  • 64 00 00 00 - 100 encoded as uint32
  • c8 00 00 00 - 200 encoded as uint32
However the same way numbers in ABCA format can be encoded with shorter encodings if they fit, entire arrays can as well, if every one of their elements does:
  • 56 - since 16 is code of uint32-encoded-as-uint8
  • 02 - size of the array as uintvar
  • 64 - 100 encoded as uint8
  • c8 - 200 encoded as uint8
There's no way to mix encodings within the array. Array of [0, 1, 1000] would be encoded as:
  • 57 - since 17 is code of uint32-encoded-as-uint16
  • 06 - size of array as uintvar
  • 00 00 - 0 encoded as uint16
  • 01 00 - 1 encoded as uint16
  • e8 03 - 1000 encoded as uint16
Now since arrays encode their size as either byte count (ABCA) or file offset (ABCD/ABCE/ABCF) and not as number of elements, they never use 0-size encodings like 52, 53, 54, 55, 59, 5d. In ABCF/ABCA formats there are arrays of ASCII and Unicode strings, and they cause no complications since "string" there is just uint32 index to a separate lookup table. There are no such arrays in ABCD/ABCE formats, where strings are encoded directly in the data using variable number of bytes.

Record node

Prince Levi by geckoam from flickr (CC-NC-ND)
The most important node type in ESF is record type, vaguely corresponding to tags in XML, or non-basic object types in Java. The root of ESF files is always a record type, and since records can contain any number of nodes in them, they provide basic structure to ESF files. For ABCD/ABCE/ABCF format record nodes are encoded as follows:
  • uint8 80 - record node code
  • uint16 tag name - it's index to table of tags in the footer
  • uint8 version - version number - starts with 0, updated every time object format changes
  • uint32 offset of first byte after end of record
  • node 0
  • node 1
  • ...
In ABCA situation is much more complicated, since there are actually three formats of records. Root node - and root node only - is encoded as:
  • uint8 80
  • uint16 tag name
  • uint8 version
  • uintvar size of data within record in bytes
  • node 0
  • node 1
  • ...
Other nodes can either use traditional encoding, except with different node type code:
  • uint8 a0
  • uint16 tag name
  • uint8 version
  • uintvar size of data within record in bytes
  • node 0
  • node 1
  • ...
Or a new compact encoding. In compact encoding first two bytes are better considered as a bitfield:
  • 100vvvvt tttttttt - That's 4 bits for version number and 9 bits for node type.
  • uintvar size of data within record in bytes
  • node 0
  • node 1
  • ...
So node type byte can be anything between 80 and 9f.

Array of records node

The last type is arrays of records. All records in the array are the same type and version. For ABCD/ABCE/ABCF format it's encoded as follows:
  • uint8 81 - array of records node code
  • uint16 tag name - it's index to table of tags in the footer
  • uint8 version - version number
  • uint32 offset of first byte after end of array
  • uint32 number of elements
  • uint32 offset of first byte after end of record #0
  • contents of record 0
  • uint32 offset of first byte after end of record #1
  • contents of record 1
  • ...
ABCA format has two encodings. Traditional encoding:
  • uint8 e0 - array of records node code
  • uint16 tag name - it's index to table of tags in the footer
  • uint8 version - version number
  • uintvar size of array contents in bytes
  • uintvar number of elements
  • uintvar size of record 0 in bytes
  • contents of record 0
  • uintvar size of record 1 in bytes
  • contents of record 1
  • ...
And compact encoding very similar to one used for records:
  • 110vvvvt tttttttt - That's 4 bits for version number and 9 bits for node type.
  • uintvar size of array contents in bytes
  • uintvar number of elements
  • uintvar size of record 0 in bytes
  • contents of record 0
  • uintvar size of record 1 in bytes
  • contents of record 1
  • ...
So array node type is either c0..df for compact encoding, or e0 for traditional encoding.

Putting it all together

So to put it all together ESF files consist of:
  • uint32 - magic number (0xABCD, 0xABCE, 0xABCF, 0xABCA)
  • uint32 - 4 bytes, always zeros - not present in ABCD format
  • uint32 - 4 bytes, look like Unix timestamp - not present in ABCD format
  • uint32 - offset where footer starts
  • root node (and more nodes within it)
  • array table of tag name
  • hash table for Unicode string lookup
  • hash table for ASCII string lookup
  • optional and completely ignored zero padding
If you want to play with ESF files, the best way is to download esf-xml converter from etwng repository.

No comments:

Post a Comment