There are two problems with that, both easily solved separately:
- If it's a lot of data, you want to store it compressed. Putting compression directly in the program would be bad Unix - the right way is to pipe its output through gzip with program | gzip >logfile.gz. gzip is really fast, and usually adequate.
- You want to be able to see the last lines the program wrote at any time - especially if it appears frozen. Sounds trivial, but thanks to a horrible misdesign of libc, and everything else based on it, data you write gets buffered before actually being written - a totally reasonable thing - except there are no limits whatsoever on how long it can stay in those buffers! Fortunately it is possible to turn this misfeature off with a single line - STDOUT.sync = true in Ruby, or the equivalent in other languages.
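For example, a hypothetical long-running Ruby program with buffering turned off - the name producer.rb and its output are purely illustrative:
#!/usr/bin/env ruby
# producer.rb - stand-in for any long-running program
STDOUT.sync = true   # flush after every write, nothing lingers in buffers
loop do
  puts "heartbeat #{Time.now}"
  sleep 1
end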
Still, even if the compressor has good reasons not to flush data as soon as possible, it can and very much should flush every now and then - with a flush every few seconds, the loss in compression ratio is insignificant, and it becomes possible to find out why the program froze almost right away. The underlying zlib library has exactly this feature - unfortunately the command line gzip utility doesn't expose it.
So I wrote this:
#!/usr/bin/env ruby
require 'thread'
require 'zlib'

def gzip_stream(io_in, io_out, flush_freq)
  fh = Zlib::GzipWriter.wrap(io_out)
  lock = Mutex.new
  # Background flusher: every flush_freq seconds, push whatever zlib
  # has buffered so far out to io_out, so the log always contains
  # recent lines.
  Thread.new{
    loop{
      lock.synchronize{
        # quit once the main thread has closed the writer
        Thread.exit if fh.closed?
        # flushing before anything was written produces invalid output
        fh.flush if fh.pos > 0
      }
      sleep flush_freq
    }
  }
  io_in.each{|line|
    lock.synchronize{
      fh.print(line)
    }
  }
  # take the lock so we never close mid-flush
  lock.synchronize{ fh.close }
end

gzip_stream(STDIN, STDOUT, 5)
It reads lines on stdin, writes them to stdout, and flushes every 5 seconds (or whatever you configure) in a separate Ruby thread. Ruby green threads are little more than a wrapper over select(), in case you're wondering. The check that fh.pos is non-zero is required because flushing before anything has been written seems to result in invalid output.
Now you can program | gzip_stream >logfile.gz without worrying about data getting stuck on the way (as long as you flush in your program, that is).
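To sanity-check the whole pipeline - producer.rb here is just the hypothetical sync'd producer sketched above, nothing more:
ruby producer.rb | gzip_stream >logfile.gz &
sleep 10
zcat logfile.gz | tail
The last few lines show up even though the stream is still open. zcat will complain about an unexpected end of file, since a live log has no gzip footer yet, but everything before that point decompresses fine.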
5 comments:
How about doing it the Unix way like so:
script -q /dev/null program | tee /dev/stderr | gzip >logfile.gz
See:
http://stackoverflow.com/questions/3332045/bash-force-execd-process-to-have-unbuffered-stdout
Anonymous: This doesn't work at all.
What your approach ensures is that data gets flushed from the program to the gzip process - that's the trivial part, which you can get with STDOUT.sync=true, by calling flush at the relevant points, or with an equivalent technique in another language.
The hard part I'm solving here is getting data from the gzip process to logfile.gz.
There is no way to do that other than from within the gzipping process itself. It has absolutely nothing to do with stdio buffering.
Sure, this shell one-liner does not modify the (internal) stdout buffering of gzip writing to logfile.gz at all.
However, tee'ing a copy of the line-buffered stdout stream of "command" to /dev/stderr prints all the data to the console, which makes it possible to continually track the progress of the data flow at that point without a temporary file or such. You just get an additional watchpoint. Printing to stderr is always unbuffered.
Of course, the same can be done within gzip_stream as well:
io_in.each{|line|
  lock.synchronize{
    $stderr.puts(line) # added
    fh.print(line)
  }
}
How do you continually view the last lines written to logfile.gz? Is there something like ztail?
Thanks.
Anonymous: The program is not attached to any tty, it just runs as a daemon.
There is no ztail, but you can just zcat | tail when you see a problem.
Implementing a smarter ztail than that would be possible if gzip_stream did a full flush every now and then, not just a partial flush.
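For reference, a minimal Ruby sketch of that zcat | tail approach - ztail.rb is a hypothetical name, and the rescue is needed because a live log has no gzip footer yet:
#!/usr/bin/env ruby
# ztail.rb - decompress the whole file, keep only the last 10 lines
require 'zlib'

last = []
gz = Zlib::GzipReader.open(ARGV[0])
begin
  gz.each_line{|line|
    last << line
    last.shift if last.size > 10
  }
rescue Zlib::Error
  # the stream was only sync-flushed, never finished, so zlib
  # complains at EOF - everything read up to that point is valid
end
puts last
Run it as ruby ztail.rb logfile.gz.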
This was the perfect solution to my problem (not log files but database related). Thanks for the code. (I have never used Ruby before, but it works nicely.)
Trent