My most recent coding project was decoding UI layout files for all 10 Total War games from Empire to Three Kingdoms and writing a converter that translates them to XML and back.
Here's a quick writeup of what I did, and how that went.
UI Layout files
Layouts in the games are controlled by UI layout files. They all helpfully start with a version number header - currently from Version025 to Version129. After that follows the top-level UI element, and nested within it are child UI elements and many other things like UI states, transitions, events, and so on.
Basic building blocks
Basic building blocks of the format were fairly easy to understand, mainly:
- booleans as 00 or 01
- integers as int32
- floats as float32
- colors as BGRA32 (that is - one byte per component, in this order)
- ASCII strings as int16 character count, followed by that many characters
- Unicode strings as int16 character count, followed by that many UTF16 characters
- various data structures had their fields in specific order, without any headers, or delimiters
- for arrays of data structures there was generally int32 element count, then followed by each element in succession, without any headers or delimiters
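To make the list above concrete, here is a minimal sketch of a reader for those primitives. The class name LayoutReader and its method names are made up for illustration, and it assumes that every multi-byte value is little-endian and that int32s are signed.

```ruby
require "stringio"

# Hypothetical reader for the primitive types listed above.
# Assumes every multi-byte value is little-endian.
class LayoutReader
  def initialize(data)
    @io = StringIO.new(data.b)
  end

  # Read exactly n bytes, failing loudly rather than returning short data.
  def get(n)
    bytes = @io.read(n).to_s
    raise "Wanted #{n} bytes, got #{bytes.bytesize}" unless bytes.bytesize == n
    bytes
  end

  def get_bool
    byte = get(1).unpack1("C")
    raise "Invalid boolean value: got #{byte}" unless byte <= 1
    byte == 1
  end

  def get_int
    get(4).unpack1("l<") # int32, assumed signed
  end

  def get_float
    get(4).unpack1("e") # float32
  end

  def get_color
    b, g, r, a = get(4).unpack("C4") # one byte per component, BGRA order
    { b: b, g: g, r: r, a: a }
  end

  def get_ascii
    get(get(2).unpack1("v")) # int16 character count, then the characters
  end

  def get_unicode
    count = get(2).unpack1("v") # count is in characters, two bytes each
    get(count * 2).force_encoding("UTF-16LE").encode("UTF-8")
  end
end

reader = LayoutReader.new("\x02\x00h\x00i\x00\x00\x00\x80\x3f".b)
puts reader.get_unicode # hi
puts reader.get_float   # 1.0
```

Because nothing in the file marks where a value starts, each reader only makes sense when called at exactly the right offset, which is why a strict get that refuses short reads is useful.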
There were also a few other patterns used less often, like:
- optional fields - either 01 followed by some data structure, or just a 00
- 128-bit UUIDs (weirdly no specific version, but still marked as a UUID in the variant bits)
- occasional int8s and int16s
- arrays of elements repeating until some special value like events_end
- 2D arrays of elements prefixed by xsize and ysize
- and so on
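The less common patterns compose naturally out of the primitives. The following is a rough sketch only: the helper names are invented, the sentinel is assumed to be a plain string, and the row-major order of the 2D array is a guess.

```ruby
require "stringio"

# Hypothetical helpers; each reads from an IO positioned at the start of a value.
def read_bool(io)
  byte = io.read(1).unpack1("C")
  raise "Invalid boolean value: got #{byte}" unless byte <= 1
  byte == 1
end

def read_int32(io)
  io.read(4).unpack1("l<")
end

def read_ascii(io)
  io.read(io.read(2).unpack1("v"))
end

# Optional field: 00 means absent, 01 means the structure follows.
def read_optional(io)
  read_bool(io) ? yield(io) : nil
end

# Counted array: int32 element count, then the elements back to back.
def read_array(io)
  Array.new(read_int32(io)) { yield(io) }
end

# Sentinel-terminated array: keep reading elements until the special value.
def read_until(io, sentinel)
  items = []
  loop do
    item = yield(io)
    break if item == sentinel
    items << item
  end
  items
end

# 2D array: xsize, ysize, then the cells (row-major order is an assumption).
def read_grid(io)
  xsize = read_int32(io)
  ysize = read_int32(io)
  Array.new(ysize) { Array.new(xsize) { yield(io) } }
end

io = StringIO.new([2, 7, 9].pack("l<3") + "\x00\x01".b + [42].pack("l<"))
p read_array(io) { |i| read_int32(i) }    # [7, 9]
p read_optional(io) { |i| read_int32(i) } # nil
p read_optional(io) { |i| read_int32(i) } # 42
```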
Manual decoding with hex editor
Most formats are quite easy to decode with a hex editor. This one wasn't - there were far too many versions, no data structure headers, no separators between data structures, and as pretty much everything was optional, there were huge blocks of zeroes.
For example a block of 20 zero bytes could be any of:
- 20 booleans false
- 5 floats 0.0
- 5 ints 0
- 10 empty ASCII strings
- 10 empty Unicode strings
- 5 empty nested arrays of some child elements
- or most likely some combinations of all of them
And there were such huge blocks of zeroes everywhere.
Decoding it without tool assistance would be just too difficult, especially doing it over and over for every single version.
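The ambiguity is easy to demonstrate. In this small sketch (the method name is made up), the same 20 zero bytes unpack without complaint under every reading from the list:

```ruby
# The same bytes decode cleanly under every one of these readings.
def interpretations(bytes)
  {
    booleans: bytes.unpack("C*").map { |b| b == 1 },
    ints: bytes.unpack("l<*"),
    floats: bytes.unpack("e*"),
    # An empty string is just an int16 character count of zero.
    empty_strings: bytes.unpack("v*").count(0),
    # An empty array is just an int32 element count of zero.
    empty_arrays: bytes.unpack("l<*").count(0),
  }
end

zeros = "\x00".b * 20
interpretations(zeros).each { |name, value| puts "#{name}: #{value.inspect}" }
```

None of these readings can be ruled out from the bytes alone, so the only way to tell them apart is to find another file where that region is not all zeroes.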
Original converter
Once upon a time alpaca wrote a Python converter for Napoleon Total War (the second game on the engine). I inherited that, and extended it backwards to Empire and forwards to Shogun 2.
Even with all the fixes it had only maybe 90% support for those three games.
The most obvious approach would be fixing the remaining issues and extending it further.
Unfortunately that would be a very difficult approach.
Internal Representation Pattern
The converter was based on the principle of Internal Representation. Every structure has a class. That class basically has five methods:
- initialize empty data structure with default values
- read from binary file
- write to XML
- read from XML
- write to binary file
This works well enough when there's one version of every structure, and it's fully understood. Unfortunately we have 62 different versions (some numbers between 25 and 129 were skipped), and we have a very limited idea of how things are represented.
The old converter tried to ignore many of those issues. For example writing to XML was just one hardcoded template string per data structure, so if the layout file's version lacked some fields, it would just write default values anyway. Then on converting back it would read them and throw them away. This specific issue was partly a limitation of Python, which is bad at DSLs, and this XML output really wanted a DSL.
A bigger problem was that if it didn't work for any reason, I got nothing. I'd get some "reading past end of file" error without any context whatsoever, and the actual point where parsing derailed was located long before that crash.
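For reference, the five-method shape described above looks roughly like this. It is an illustrative sketch only: it is written in Ruby rather than the old converter's Python to match the rest of this post, and the class and its XML layout are invented rather than taken from that converter.

```ruby
require "stringio"

# Hypothetical structure in the Internal Representation style.
class Color
  attr_accessor :b, :g, :r, :a

  # 1. initialize with default values
  def initialize
    @b = @g = @r = @a = 0
  end

  # 2. read from binary file
  def read_binary(io)
    @b, @g, @r, @a = io.read(4).unpack("C4")
    self
  end

  # 3. write to XML
  def to_xml
    %(<color b="#{b}" g="#{g}" r="#{r}" a="#{a}" />)
  end

  # 4. read from XML
  def read_xml(xml)
    @b, @g, @r, @a = %w[b g r a].map { |k| xml[/\b#{k}="(\d+)"/, 1].to_i }
    self
  end

  # 5. write to binary file
  def to_binary
    [b, g, r, a].pack("C4")
  end
end

color = Color.new.read_binary(StringIO.new("\x10\x20\x30\xff".b))
puts color.to_xml
```

Even for a four-byte structure, the field list is repeated in all five methods, and each repetition would need its own version checks.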
Data gathering
Before I even started, I took the latest versions of all 10 Total War games using the current engine, extracted all UI layout files, and put them together as a test set.
Analysis tool
Then I wrote an analysis tool. The formats were really complicated, but there were some obvious things in them. Especially strings. Basically the analysis tool went over the file and identified every ASCII or Unicode string. Then it printed any undecoded data in a nice ASCII + hex format.
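The string-spotting part can be approximated with a simple scan. This is a guess at the general technique, not the actual tool: the function name and the minimum-length threshold are assumptions, and it only handles ASCII strings.

```ruby
# Assumed threshold to cut down on false positives from random bytes.
MIN_LENGTH = 3

# Find byte ranges that look like an int16 count followed by printable ASCII.
# Returns [first_offset, last_offset, text] triples.
def find_ascii_strings(data)
  data = data.b
  found = []
  pos = 0
  while pos + 2 <= data.bytesize
    count = data.byteslice(pos, 2).unpack1("v")
    text = data.byteslice(pos + 2, count)
    if count >= MIN_LENGTH && text.bytesize == count && text.match?(/\A[\x20-\x7e]+\z/)
      found << [pos, pos + 2 + count - 1, text]
      pos += 2 + count # skip past the string we just accepted
    else
      pos += 1 # no string here, try the next byte
    end
  end
  found
end

sample = "\x00\x00\x00\x00".b + [9].pack("v") + "normal_t0" + "\x00\x00".b
find_ascii_strings(sample).each do |first, last, text|
  puts format("%06d-%06d StringBlock %s", first, last, text.inspect)
end
# prints: 000004-000014 StringBlock "normal_t0"
```

Everything between two recognized strings can then be dumped as an undecoded block.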
That was a good starting point, but there was something I could do next. Not only could I see the strings, it was also really easy to guess which string meant what. A string with a font name was always followed by some ints controlling text display. A string with a shader name was followed by shader variables. Strings with image names were used in a few ways, but some simple heuristics could guess which was which.
So I soon had listings along these lines:
000129-000147 FontNameBlock "Ingame 12, Normal"
000148-000151 LineHeightBlock 2
000152-000155 FontLeadingBlock 1
000156-000159 FontTrailingBlock 255
000160-000174 DataBlock
............... 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00
000175-000185 ShaderNameBlock "normal_t0"
000186-000189 ShaderVariableBlock 0.0 (0)
000190-000193 ShaderVariableBlock 0.0 (0)
000194-000197 ShaderVariableBlock 0.0 (0)
000198-000201 ShaderVariableBlock 0.0 (0)
000202-000270 DataBlock
........0....... 00 00 00 00 01 00 00 00 30 12 00 09 00 00 00 00
................ 00 00 00 00 00 05 00 00 00 04 00 00 bb ff be ff
................ 00 00 00 00 00 00 00 01 01 00 00 00 00 00 00 00
................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
..... 00 00 00 00 00
000271-000282 EventListBlock []
000283-000294 DataBlock
........ .<. 00 00 00 00 01 00 00 00 20 b3 3c 0b
000295-000314 StringBlock "government_screens"
000315-000346 DataBlock
H............... 48 01 00 00 8e 00 00 00 01 01 00 01 00 00 00 00
................ 00 ff ff ff ff 00 00 00 00 05 00 00 00 00 00 00
000347-000421 ImageListBlock 1 elements:
000351-000421 ImageBlockGen1 id=163829448 xsize=256 ysize=256 path="data\\UI\\Campaign UI\\Skins\\fill 2 leather 256 tile.tga" unknown=4294967295
000422-000433 DataBlock
............ 00 00 00 00 00 00 00 00 01 00 00 00
000434-000437 StateIDBlock 162986096
000438-000447 StateNameBlock "NewState"
000448-000451 XSizeBlock 624
000452-000455 YSizeBlock 720
000456-000484 DataBlock
................ 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00
............. 01 00 00 00 01 00 00 00 00 00 00 00 00
000485-000503 FontNameBlock "Ingame 12, Normal"
000504-000507 LineHeightBlock 2
000508-000511 FontLeadingBlock 1
000512-000515 FontTrailingBlock -16777216
000516-000530 DataBlock
............... 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00
000531-000541 ShaderNameBlock "normal_t0"
000542-000545 ShaderVariableBlock 0.0 (0)
000546-000549 ShaderVariableBlock 0.0 (0)
000550-000553 ShaderVariableBlock 0.0 (0)
000554-000557 ShaderVariableBlock 0.0 (0)
000558-000565 DataBlock
........ 00 00 00 00 01 00 00 00
000566-000589 ImageUseBlock id=163829448 xofs=0 yofs=0 xsize=624 ysize=720 bgra=bgra(255,255,255,255)
000590-000626 DataBlock
................ 01 00 00 00 00 00 00 01 01 00 00 00 00 00 00 00
................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
..... 00 00 00 00 00
000627-000693 EventListBlock ["OnUpdatePulse", "OnUpdatePulse", "OnDock", "DockHudRelative"]
000694-000705 DataBlock
............ 0a 00 00 00 0e 00 00 00 e8 db b8 09
And you can probably already notice the huge blocks of zeros I mentioned before - and that's even with some zeros not shown here because they were already decoded from context.
Direct Conversion Pattern
Now that I wasn't going in completely blind, I started writing a converter. In Ruby, as there was a lot of DSLing to do. But mostly it was based on a completely different principle - Direct Conversion.
Direct Conversion doesn't bother with any classes, or internal representations. It has methods such as (not actual code, just the general idea):
def convert_int
  value = get(4).unpack1("V")
  puts "<i>#{ value }</i>"
end

def convert_string
  size = get(2).unpack1("v")
  str = get(size)
  puts "<s>#{ str.xml_escape }</s>"
end

def convert_color
  b, g, r, a = get(4).unpack("CCCC")
  puts "<color>"
  puts " <byte>#{b}</byte><!-- blue -->"
  puts " <byte>#{g}</byte><!-- green -->"
  puts " <byte>#{r}</byte><!-- red -->"
  puts " <byte>#{a}</byte><!-- alpha -->"
  puts "</color>"
end
But bigger methods can be composed from smaller ones (also not actual code):
def output(str, comment=nil)
  print " " * indent
  print str
  print "<!-- #{comment} -->" if comment
  print "\n"
end

def convert_int(comment=nil)
  output "<i>#{ get_int }</i>", comment
end

def convert_color
  tag "color" do
    convert_byte "blue"
    convert_byte "green"
    convert_byte "red"
    convert_byte "alpha"
  end
end
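For anyone who wants to run the idea, here is one way the missing pieces could be filled in. The class name DirectConverter, the two-space indent, and the bodies of get, tag, and convert_byte are all assumptions made for this sketch, not the real converter's code.

```ruby
require "stringio"

# Minimal runnable sketch of the composition idea; all names are illustrative.
class DirectConverter
  def initialize(data, out = $stdout)
    @io = StringIO.new(data.b)
    @out = out
    @indent = 0
  end

  # Read exactly n bytes from the binary input.
  def get(n)
    bytes = @io.read(n).to_s
    raise "Unexpected end of file" unless bytes.bytesize == n
    bytes
  end

  # Write one indented line of XML, with an optional trailing comment.
  def output(str, comment = nil)
    line = "  " * @indent + str
    line += "<!-- #{comment} -->" if comment
    @out.puts line
  end

  # Emit an opening tag, run the block one level deeper, then close the tag.
  def tag(name)
    output "<#{name}>"
    @indent += 1
    yield
    @indent -= 1
    output "</#{name}>"
  end

  def convert_byte(comment = nil)
    output "<byte>#{get(1).unpack1("C")}</byte>", comment
  end

  def convert_color
    tag "color" do
      convert_byte "blue"
      convert_byte "green"
      convert_byte "red"
      convert_byte "alpha"
    end
  end
end

DirectConverter.new("\x10\x20\x30\xff".b).convert_color
```

Note that tag deliberately has no ensure clause: if a nested conversion raises, the output stops exactly where decoding failed instead of being tidied up with closing tags.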
Advantages of Direct Conversion
The nice thing about this is that conversion back doesn't need to have any idea whatsoever what tags like color even are - other than the most basic data types like strings, ints, floats, and booleans, the converter from XML back to binary needs nearly zero awareness of what those formats are.
So instead of describing every data structure 5 times, we do it just once. And any version specific logic can be handled by a single if @version >= 74 or such.
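A sketch of the reverse direction shows how little it needs to know. This is an assumption-laden illustration: the tags i, s, byte, data, and no appear in the examples in this post, while f and yes are guessed names; it handles only ASCII strings, skips XML unescaping, and uses a regex where a real converter would more likely use an XML parser.

```ruby
# Only primitive tags carry data; every other tag and comment is ignored.
PRIMITIVE = %r{<(i|f|byte|s|data)\b[^>]*>(.*?)</\1>|<(yes|no) />}m

def encode_primitive(kind, body)
  case kind
  when "i"    then [body.to_i].pack("l<")
  when "f"    then [body.to_f].pack("e")
  when "byte" then [body.to_i].pack("C")
  when "s"    then [body.bytesize].pack("v") + body.b
  when "data" then [body.delete("^0-9a-fA-F")].pack("H*") # raw hex bytes
  when "yes"  then "\x01".b
  when "no"   then "\x00".b
  end
end

def xml_to_binary(xml)
  without_comments = xml.gsub(/<!--.*?-->/m, "")
  chunks = without_comments.scan(PRIMITIVE).map do |kind, body, flag|
    encode_primitive(kind || flag, body)
  end
  chunks.join.b
end

xml = <<~XML
  <model>
    <s>standard_advisor</s><!-- mesh name? -->
    <data size="1">
      01
    </data>
    <i>0</i><!-- anim count or something? -->
    <no />
  </model>
XML
puts xml_to_binary(xml).unpack1("H*")
```

The model tag never appears in the code, yet its contents round-trip, which is the whole point: structure lives in the binary-to-XML direction only.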
But there's more. Since we never need to construct any internal representation, if conversion crashes, the converter will give us full context of the error!
<model>
<s>composite_scene/porthole/troy_advisor_test.csc</s><!-- mesh path? -->
<s>standard_advisor</s><!-- mesh name? -->
<!-- some model data or anim header or sth -->
<data size="1">
01
</data>
<i>0</i><!-- 00:00:00:00 --><!-- anim count or something? -->
<s></s><!-- anim name? -->
<s></s><!-- anim path? -->
<!-- rest of anim stuff or sth -->
<data size="4">
00 80 3f 00
</data>
<!-- 2900 - end of model data -->
</model>
</models>
<no /><!-- end of uientry flag 5B? -->
<no /><!-- end of uientry flag 6B? -->
<error msg="Invalid boolean value: got 63" version="121">
Data before fail:
ne/porthole/troy 6e 65 2f 70 6f 72 74 68 6f 6c 65 2f 74 72 6f 79
_advisor_test.cs 5f 61 64 76 69 73 6f 72 5f 74 65 73 74 2e 63 73
c..standard_advi 63 10 00 73 74 61 6e 64 61 72 64 5f 61 64 76 69
sor...........?. 73 6f 72 01 00 00 00 00 00 00 00 00 00 80 3f 00
Data from fail 2900:
..?...?.....9..p 00 00 3f 00 00 00 3f 00 00 98 d1 bd 39 10 00 70
ortrait_minspec. 6f 72 74 72 61 69 74 5f 6d 69 6e 73 70 65 63 00
................ 00 00 00 00 00 00 00 00 00 00 00 00 00 01 01 00
................ 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00
Then all I need to do is look back from the point of the crash to the last definitely correctly decoded part (in this case those two strings look perfectly fine). Then I find the first definitely incorrectly decoded part (in this case 00 80 3f is clearly the last 3 bytes of a float, so it was already off by one at this point).
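Producing that kind of report takes very little code. The sketch below is one plausible way to do it; the function names, the context parameter, and the closing error tag are assumptions, and the message is not XML-escaped.

```ruby
# Format bytes as rows of 16: printable ASCII (dots otherwise), then hex.
def hexdump(bytes)
  bytes.bytes.each_slice(16).map do |row|
    ascii = row.map { |b| (0x20..0x7e).cover?(b) ? b.chr : "." }.join
    hex = row.map { |b| format("%02x", b) }.join(" ")
    "#{ascii} #{hex}"
  end
end

# Build an <error> element showing the bytes on both sides of the failure.
# `context` (how many bytes to show each way) is an assumed parameter.
def error_report(data, offset, msg, version, context = 64)
  start = [offset - context, 0].max
  lines = [%(<error msg="#{msg}" version="#{version}">)]
  lines << "Data before fail:"
  lines.concat(hexdump(data.byteslice(start, offset - start)))
  lines << "Data from fail #{offset}:"
  lines.concat(hexdump(data.byteslice(offset, context).to_s))
  lines << "</error>"
  lines.join("\n")
end

data = "standard_advisor\x01\x00\x00\x00\x00\x00\x80\x3f\x00".b
puts error_report(data, 24, "Invalid boolean value: got 63", 121)
```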
Then I can adjust that specific data structure's method. I don't even need to guess what that extra data is. If I see five zeroes I don't have decoding for, I just tell the converter to expect five zero bytes.
Then if some other file has non-zeros at that position, I'll get a nice exception like "Zero data expected, got 05 00 00 00 00", and I can pretty clearly see that the first four bytes are an int32 - and the last remaining one is likely a boolean (but I'd still leave it as an undecoded zero for now).
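A helper along these lines is enough to support that workflow. The names expect_zeros and ZeroDataError are hypothetical; only the error message format comes from the example above.

```ruby
require "stringio"

class ZeroDataError < StandardError; end

# Consume `count` bytes that are not understood yet but have always been zero.
def expect_zeros(io, count)
  bytes = io.read(count).to_s
  return if bytes.bytesize == count && bytes.bytes.all?(&:zero?)

  hex = bytes.bytes.map { |b| format("%02x", b) }.join(" ")
  raise ZeroDataError, "Zero data expected, got #{hex}"
end

expect_zeros(StringIO.new("\x00".b * 5), 5) # passes silently

begin
  expect_zeros(StringIO.new("\x05\x00\x00\x00\x00".b), 5)
rescue ZeroDataError => e
  puts e.message # Zero data expected, got 05 00 00 00 00
end
```

The assertion costs nothing when the guess holds, and the first file that violates it hands over exactly the bytes needed to refine the guess.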
Debug mode
At some point I implemented a small modification to the direct conversion process. There's a debug flag to control printing of various extra information like structure offsets, hex values of ints and floats, and so on.
The converter first converts binary to XML with the debug flag off. If that process crashes - it turns the debug flag on, and starts all over. This way normal XML isn't polluted by too much extra information useful only for debugging the converter, but in case of a crash I get tons of extra information.
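The two-pass idea can be sketched with a toy converter that only understands a stream of int32s. Everything here (convert_ints, convert_with_fallback, the comment format) is invented for illustration; the real converter handles far more than ints.

```ruby
# Toy converter: decodes a stream of int32s, annotating each one in debug mode.
def convert_ints(data, debug:)
  lines = []
  offset = 0
  begin
    while offset < data.bytesize
      chunk = data.byteslice(offset, 4)
      raise "Truncated int at #{offset}" if chunk.bytesize < 4
      line = "<i>#{chunk.unpack1("l<")}</i>"
      line += "<!-- #{offset}: #{chunk.unpack1("H*")} -->" if debug
      lines << line
      offset += 4
    end
  rescue StandardError => e
    raise unless debug # the clean pass just gives up
    lines << %(<error msg="#{e.message}" />) # the debug pass keeps its partial output
  end
  lines.join("\n")
end

# Try a clean conversion first; on any crash, start over with debug output on.
def convert_with_fallback(data)
  convert_ints(data, debug: false)
rescue StandardError
  convert_ints(data, debug: true)
end

puts convert_with_fallback([1, 2].pack("l<2"))
puts convert_with_fallback([1].pack("l<") + "\x00".b)
```

Running the conversion twice is wasteful in theory, but crashes are the rare case and the clean output stays free of debugging noise.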
First three games
The first three games were easy enough. I already had a mostly working decoder, so I used it as a starting point, and used the procedure described here to fix any issues.
Initially I thought about backporting fixes to the old converter, but I quickly gave up on this idea when I discovered just how extensive the changes would need to be.
In any case I got the converter working far better than the old one without any major difficulty.
Next seven games
This is where my plan ran into its first problems. Starting from a working converter for version X and adding support for version X+1 is easy:
- run conversion anyway, ignoring that version is wrong
- identify where exactly it crashes (based on <error> tags and my analysis tool)
- try to fix those crashes, gated by some if @version >= x+1 checks
Unfortunately the first three games used versions 25 to 54, then the next seven games used versions 74 to 129. So I had a 20 version gap with nothing in between, and it really looked like I'd need to decode pretty much from scratch.
Cpecific's decoder
I'm sure I'd be able to figure out the decoding, but I found unexpected help. It turns out Cpecific wrote a PHP-based UI layout decoder. It doesn't actually convert anything - just prints JSON-style output describing contents of various UI layout files.
I tried to run it on a bunch of files, and it seemed to have 80%-ish support for the newer 7 games, similar to how well the old decoder supported the older 3 games.
The main weakness of Cpecific's decoder is that it doesn't actually convert anything - so you're expected to do hex editing, and then check in the decoder that results are what you expected. Not exactly an ideal workflow, but it super beats hex editing blind.
I also couldn't fully trust its decoding, and it crashed on many files, but it was definitely a huge help at crossing the gap between Version054 and Version074, and once I crossed it, it was easy going to do one version at a time.
I also used it to annotate some fields with comments on what they could likely be.
I don't plan to do any further development of the old converter, but in case Cpecific wants to continue with his, at some point I should write down a list of issues I found and their fixes.
Warhammer III
A new Total War game is coming out soon, so the converter will likely need an update. I don't expect this to be difficult - Rome 2 was the last time they did a major format update.