Friday, October 14, 2016

Design patterns for parsing binary files in Ruby

Sand Cat by Tanya Durrant from flickr (CC-ND)

A while ago I wrote about how to analyze binary file formats.

Video games have a ton of custom binary file formats, and to mod them I kept writing parsers, which were in a way very similar.

All parsers were very similar in form, but they varied in too many details to make it possible to extract any viable gem from that.

So let's write a parser.

Just read the whole file

Even if file is fairly big, it's probably still far easier to just read it all in one go, and save yourself from the pointless file I/O.

require "pathname"
class FooParser
  def initialize(path)
    @path = Pathname(path)
    @data = @path.open("rb", &:read)
    @ofs = 0
  end
end

We're saving the @path mostly to make exception messages we'll throw more meaningful.

@data is all file's contents - and we read it in binary mode ("rb") to deal with binary vs Unicode and Windows silliness.

@ofs is current offset, as we'll be processing data mostly in linear way.

Some common helper functions

These aren't necessary for all parsers, but it's fairly common to see code like this:

class FooParser
  def bytes_left
    @data.size - @ofs
  end

  def eof?
    @data.size == @ofs
  end

  def fail!(message)
    raise "#{message} at #{@path}:#{@ofs}"
  end
end

Such functions aren't really doing parsing work, they're mostly there for debugging and for sanity checks.

Get some data

Now let's get some data:

class FooParser
  def get(n)
    fail! "Trying to read past end of file" if bytes_left < n
    result = @data[@ofs, n]
    @ofs += n
    result
  end
end

Most files are byte-oriented, and going through a helper function to get the next few bytes and manage offset and end of file will save us complexity elsewhere.

This kind of function is also a great place for cases where file format is unusual somehow. Vast majority of formats are byte-oriented and fit in memory just fine, but if your format in too big, or uses individual bits, or does anything else weird, you can generally encapsulate such logic in get method - as long as @ofs uniquely describes current location in the file.

Get integers

Most formats are full of numbers and strings, and it's useful to write get methods for all of them. Let's start with Integers as they're easiest.

I generally use concise byte count convention (so get_i4 means get signed integer with 4 bytes), but more verbose bit-based naming (like get_i32 or even get_int32) might be more to your liking.

For most formats there's standard #unpack for them, so their unpackers are very simple - usually the only difference is right unpack string based on big vs small endianness of your format:

class FooParser
  def get_u1
    get(1).unpack("C")[0]
  end

  def get_u2
    get(2).unpack("v")[0]
  end

  def get_u4
    get(4).unpack("V")[0]
  end

  def get_i4
    get(4).unpack("l")[0]
  end
end

Very rarely file format uses more unusual number like 3-byte or variable length integer. Their get_* methods will look different from format to format, but you should generally be able to encapsulate them.

Here's an examples - for 24-bit numbers we can just pad them with extra byte (0 for unsigned, either 0 or 255 for signed), and use proper unpack string for 32-bit numbers:

class FooParser
  def get_u3
    ("\x00".b+get(3)).unpack("N")[0]
  end

  def get_i3
    s = get(3)
    prefix = (s[0].ord >= 128 ? "\xFF".b : "\x00".b)
    (prefix+s).unpack("l>")[0]
  end
end

Usually you can either add some padding, or just read them byte at a time and do some math to get the right number.

Get floats

This is slightly more complicated for two reasons - first the most common format in files is single precision, but Ruby uses double precision. Unpacking floats generates questionable extra precision:

[0.1].pack("f").unpack("f")[0] #=> 0.10000000149011612

Especially if you plan to present the result to users, we often don't want that - we want the "nicest looking" double precision number which still decodes to the same single precision number.

Now I'm sure there's some more "precise" algorithm, but this prettification method worked well:
class Float
  def pretty_single
    result = round(6)
    return result if [self].pack("f") == [result].pack("f")
    self
  end
end

What we do here is round it to 6 decimal places, check if it equals self when converted to single precision, and if so, return rounded result, otherwise return self. It works with infinities, NaN, -0 and all other weirdness.

With that it's easy to write single precision float getter:
class FooParser
  def get_flt
    get(4).unpack("f")[0].pretty_single
  end
end

Second problem is that Ruby unpack still lacks half-precision floats, which are extremely popular in GPU world, so any data destined to go to GPU often uses it - or just formats trying to save on space.

In such case, just get_u2 and do bit manipulation yourself.

Get strings

Strings are just as ubiquitous as numbers, but there's a few more formats, including:
  • character count (ofter u2, 1 byte characters), followed by ASCII characters - get(get_u2)
  • character count (ofter u2, 2 byte characters), followed by UTF-16 characters - get(get_u2*2).unpack("v*").pack("U*")
  • NUL-terminated ASCII / UTF-8 - result = ""; result << c while (c = get(1)) != "\x00"; result
  • Constant size ASCII field (here 256 1-byte characters), padded by NULs - get(256).sub(/\x00*\z/, "")
  • Constant size UTF-16 field (here 256 2-byte characters), padded by NULs - get(512).pack("v*").unpack("U*").sub(/\x00*\z/, "")
Just create appropriate get_str method based on whichever encoding your format uses.

It's not uncommon for a single file format to have more than one string encoding - in such case create multiple such methods, or make it accept some argument to switch between them.

Get individual data structures

The generic part of the parser is done, so now for any data structure in your format you need to write appropriate parsing method, returning some Hash, Array, or proper object depending on the structure:

class FooParser
  def get_vector3d
    x = get_flt
    y = get_flt
    z = get_flt
    {x: x, y: y, z: z}
  end
end

For more concise code you can often shortcut this thank to ruby evaluating arguments left to right (like most languages nowadays, but there were dark times when undefined or backwards evaluation conventions existed too):

class FooParser
  def get_vector3d
    {x: get_flt, y: get_flt, z: get_flt}
  end
end

Complete the parser

And the last part is completing the parser. If the whole file consists of same data structures every time, just write some kind of get_everything following the same pattern and you're done.

Even then I'd recommend adding something like fail! "Parsing done, but #{bytes_left} bytes left" unless eof? sanity check just to be sure format is what we expected it to be - extra garbage bytes generally indicate that file is not quite in formats expected.

Of course things are rarely quite as simple. It's very common for file formats to have structure like:
  • header starting with some magic
  • header then includes some global data and counts for data structures to follow
  • after that individual data structures follow, based on data in the header

Example parser

To avoid complexity of real formats, let's try a fictional format with kittens and puppies:
  • magic marker "CUTE"
  • u4 number of kittens
  • u4 number of puppies
  • each kitten has star rating (float), number of lives left (u4) and photo url (str)
  • each puppy has star rating (float) and photo url (str)
class FooParser
  def get_kitten
    { rating: get_flt, lives_left: get_u4, url: get_str}
  end

  def get_puppy
    { rating: get_flt, url: get_str}
  end
  def parse!     fail! "Wrong magic number" unless get(4) == "CUTE"     kitten_count = get_u4     puppy_count = get_u4     result = {       kittens: kitten_count.times.map{ get_kitten },       puppies: puppy_count.times.map{ get_puppy },     }     fail! "Parsing done, but #{bytes_left} bytes left" unless eof?     result   end end

Offset-based formats

OK, that's one easy file format, but what if instead of data structures following each other, header contains a list of offsets instead, and data structures are at positions indexed.

That's going to take some jumping around, so let's write a helper for that:

class FooParser
  def at_ofs(ofs)
    saved_ofs, @ofs = @ofs, ofs
    yield
  ensure
    @ofs = saved_ofs
  end
end

Then it's fairly straightforward. In formats like that you usually won't end at end of file when you finish parsing, so that step is not necessary. Let's say our format is:
  • magic marker "CUTE"
  • u4 number of kittens
  • that many times u4 offset to kitten data structure (relative to beginning of file)
  • (rest of the file, presumably containing kitten data)

class FooParser
  def parse!
    fail! "Wrong magic number" unless get(4) == "CUTE"
    kitten_count = get_u4
    kittens = kitten_count.times.map do
      at_offset(get_u4) { get_kitten }
    end
    {kittens: kittens}
  end
end

There are weirder data structures, like offsets relative to current location etc. You can generally keep all the offset mangling in helper functions and keep your parsing code high level.

Lookahead

This is more a debugging thing than something you'll need to do often in actual parsing, but sometimes in your pry session you'd like to see if what's following is a kitten without affecting current parser state. It's very easy:

class FooParser
  def lookahead
    saved_ofs = @ofs
    yield
  ensure
    @ofs = saved_ofs
  end
end

Then in your pry session just try lookahead{ get_kitten } and it will give you a kitten (or throw an exception).

Limitations

Writing Ruby-based parsers following patterns described in the post have been far more convenient and flexible than anything else I tried, but there are limitations:
  • you need to have some clue about the format, or at least data structures at its beginning - there's some debugging helpers, but no autodetection, and the only "partial parsing" you can conveniently do is for beginning of the file - if you know some data structures in the middle of the file, it will still be hard to find them and partially decode them without knowing the rest of the format
  • it's inherently one-way - you'll need to write a separate encoder, even though some clever DSL might have given you 90% of it for free. I found that overhead of highly abstracting DSLs is generally much higher than just writing it twice, and it's relatively easy to test that they match by round trip testing - just grab some samples, and check if decoding followed by encoding gets you the sample back.
  • its performance is OK, but not amazing - we're working with very high level language, and doing byte manipulation with a lot of layers of abstraction above it - it's actually quite surprising how fast it works, but it's still nowhere near as fast as a parser written in very low level language
  • it's designed for full parsing in one go, not random access - this is mostly performance issue, but if you want to just read (let alone change) a single kitten in file with thousands, this pattern won't support you
  • if the format is crazy complex, you'll probably need a lot more sophisticated parser - this pattern is great for simple to moderately complex formats

4 comments:

  1. I'm author of bin_utils gem. It lacks support for floats (could be added), but faster a bit and has some interesting features.

    ReplyDelete
  2. funny_falcon: bin_utils has definitely more intuitive names than unpack's "l<" and friends, but has a lot less functionality.

    Like I don't see how I can do str.unpack("v*").pack("U*") in it, or unpack("f3") or any such things.
    As for extra functionality, slicing would be an antipattern, and appending works fine with <<.

    As for BinUtils.append_int32size_string_le! etc., I end up writing something like that every time, but there's a string encoding per format.

    So I don't think it would be of too much use. Maybe if BinUtils.get_* was at least as complete as unpack (+half-prec floats).

    ReplyDelete
  3. The bindata gem is pretty awesome, especially if you need to write as well as read the binary data. Let's you write purely descriptive code that does both. https://github.com/dmendel/bindata

    It doesn't work for everything, it can get hard to use for some kinds of formats, but it's definitely worth knowing about and investigating.

    ReplyDelete
  4. Jonathan Rochkind: This kind of declarative code definitely looks nice, but I'd be surprised if it was able to handle more complex formats. It seems more applicable to situations where you're in control of formats, not when you need to parse preexisting format with all the quirks.

    The patterns I show in this post is basically binary equivalent of recursive descent parser, which is generally the nicest way to write parsers for complex formats.

    ReplyDelete