The best kittens, technology, and video games blog in the world.

Thursday, July 15, 2010

Synchronized compressed logging the Unix way

tiny tiny kitten 4 weeks old by GeorgeH23 from flickr (CC-NC-ND)
In good Unix tradition if a program generates some data, in general it should write it to STDOUT, and you'll redirect it to the right file yourself.

There are two problems with that, both easily solvable in separation:
  • If it's a lot of data, you want to store it compressed. It would be bad Unix to put compression directly in the program - the right way is to pipe its output through gzip with program | gzip >logfile.gz. gzip is really fast, and usually adequate.
  • You want to be able to see what were the last lines written out by the program at any time. Especially if it appears frozen. Sounds trivial, but thanks to a horrible misdesign of libc, and everything else based on it, data you write gets buffered before being actually written - a totally reasonable thing - and there are no limits whatsoever how long it can stay in buffers! Fortunately it is possible to turn this misfeature off with a single line of STDOUT.sync=true or equivalent in other languages.
Unfortunately while both fixes involve a single line of obvious code - there's no easy way to solve them together. Even if you flushed all data from the program to gzip, gzip can hold onto it indefinitely. Now unlike libc which is simply broken, gzip has a good reason - compression doesn't work on one byte at a time - it takes a big chunk, compresses it, and only then writes it all out.

Still, even if it has good reasons not to flush data as soon as possible, it can and very much should flush it every now and then - with flushing every few seconds reduction in compression ratio will be insignificant, and it will be possible to find out why the program frozen almost right away. The underlying zlib library totally has this feature - unfortunately command line gzip utility doesn't expose it.

So I wrote this:

#!/usr/bin/env ruby

require 'thread'
require 'zlib'

def gzip_stream(io_in, io_out, flush_freq)
  fh = Zlib::GzipWriter.wrap(io_out)
  lock ={
    while true
        return if fh.closed?
        fh.flush if fh.pos > 0
      sleep flush_freq

gzip_stream(STDIN, STDOUT, 5)

It reads lines on stdin, writes them to stdout, and flushes every 5 seconds (or whatever you configure) in a separate Ruby thread. Ruby green threads are little more than a wrapper over select() in case you're wondering. The check that fh.pos is non-zero is required as flushing before you write something seems to result in invalid output.

Now you can program | gzip_stream >logfile.gz without worrying about data getting stuck on the way (if you flush in your program that is).


Anonymous said...

How about doing it the Unix way like so:

script -q /dev/null program | tee /dev/stderr | gzip >logfile.gz


taw said...

Anonymous: This doesn't work at all.

What your thing does ensures data is flushed from program to gzip process - this is trivial part you can get with STDOUT.sync=true or calling flush at relevant points or with equivalent technique in other language.

The hard part I'm solving here is getting data from gzip process to logfile.gz.

There is no way you can do that other than from within gzipping process. This has absolutely nothing to do with stdio buffering.

Anonymous said...

Sure, this shell one-liner does not modify the (internal) stdout buffering mechanism of gzip to logfile.gz at all.

However, tee'ing a copy of the line-buffered stdout stream of "command" to /dev/stderr will print all data to the console
which makes it possible to continually track the progress of the data flow at this point without using a temporary file or such. This way you just get an additional watchpoint. Printing to stderr is always unbuffered.

Of course, the same can be done within gzip_stream as well:



$stderr.puts(line) # added



How do you continually view the last lines written to logfile.gz ? Is there something like ztail?


taw said...

Anonymous: Program is not attached to any tty, it just runs an daemon.

There is no ztail, you can just zcat|tail when you see a problem.

Implementing ztail smarter than that would be possible if gzip_stream did a full flush every now and then, not just partial flush.

Trent said...

This was the perfect solution to my problem (not log file but database related). Thanks for the code. (I have never used Ruby before, but it works nice).