The best kittens, technology, and video games blog in the world.

Wednesday, June 27, 2007

Regular expressions and strings with embedded objects

Minta by melisdramatic from flickr (CC-NC-SA)

Regular expressions are among the most powerful programming techniques ever invented. Real-world "regular expressions" are only loosely related to Computer Science "regular expressions", which can only provide yes/no answers to questions like "does this string match that regular expression?". We are usually interested in much more than that - we want to extract data from strings, convert strings to other strings, and so on. For historical reasons we call the expressions used for these purposes "regular" too.

Regular expressions are extremely concise, but sometimes they don't suffice and we need to write a "real parser". Unfortunately, even with the best parser-generating tools, parsers tend to be many times more complex and error-prone than equivalent regular expressions. And if a problem is too complex for regular expressions, very often it is also too complex for whichever parser generator you're using, and requires a lot of nontrivial hacking around its limitations, or even writing a parser by hand.

Fortunately, for many problems there's an alternative to parsers and parser-generating tools - regular expressions plus a few tricks. This blog post is about one such trick.

For my bot which verifies links on Wikipedia I needed to extract data from SQL dumps. SQL dumps look something like this:

INSERT INTO `page` VALUES (1,0,'Astronomia','',1800,0,0,0.600461925007833,'20070601091320',8076762,8584,0), (2,0,'AWK','',329,0,0,0.487812640599732,'20070530195555',8058046,4265,0), (4,0,'Alergologia','',108,0,0,0.580574716050713,'20070520093413',7912844,292,0), ...
INSERT INTO `page` VALUES (14880,0,'Dźwignica_linotorowa','',26,0,0,0.597327036408081,'20060814072401',4282357,727,0), (14881,0,'Urządzenia_transportowe','',91,0,0,0.176666489966834,'20070527090143',2976610,1041,0), ...

Basically, it's a bunch of INSERT statements with multiple tuples each. I wanted to iterate over the tuples. Extracting the tuple data from the SQL dump is a simple next unless line =~ /\AINSERT INTO `page` VALUES (.*)\Z/. Splitting this blob into tuples and tuple fields is almost trivial - /\),\(/ separates tuples and , separates tuple fields. Unfortunately there are many strings inside, and some of them might contain commas.
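
For example, here is what naive splitting does to a made-up tuple whose title contains a comma:
line = "(1,0,'Foo, bar','xyz'),(2,0,'Baz','')"
p line.split(/\),\(/).map{|tuple| tuple.split(/,/)}
# => [["(1", "0", "'Foo", " bar'", "'xyz'"], ["2", "0", "'Baz'", "'')"]]
# The comma inside 'Foo, bar' tore the first tuple apart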

Sure, it's possible to write a single monster regular expression to do all that, but it would be quite error-prone. Wouldn't it be easier to simply treat whole SQL strings as single "objects" inside the string? That's exactly what we're going to do. First we need to get rid of \-escapes. This isn't strictly necessary - a regular expression for matching strings with \-escapes inside wouldn't really be that complicated - but it makes everything even simpler. So every /\\(.)/ becomes "\x00" + esc[$1], where the values of esc[...] are "safe" characters which won't interfere with further parsing, like A, B, C, and so on.
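
For example, applying the escape step to a tiny made-up fragment (with a plain Hash here instead of the hash_or_die helper used in the full code below):
esc = { "\\" => "A", "\"" => "B", "n" => "C", "'" => "D" }
data = "(1,0,'don\\'t','a\\\\b')"
data = data.gsub(/\\(.)/) {"\x00" + esc[$1]}
# data is now "(1,0,'don\x00Dt','a\x00Ab')" - no \-escapes left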

At this point every ' marks either the beginning or the end of a string. So we can replace each string with an object id like "\x01<1234>", where 1234 is a suitable number. After we do that, we can split on /\),\(/ and , just like we wanted. Afterwards we need to convert the embedded objects (like \x01<1234> and \x00A) back to their original form.
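
Continuing the same example, the strings become opaque tokens, and naive splitting is suddenly safe:
strs = []
data = data.gsub(/('[^']*')/) {
  strs << $1
  "\x01<#{strs.size-1}>"
}
# data is now "(1,0,\x01<0>,\x01<1>)" - every remaining comma is a field separator
p data.scan(/\((.*?)\)/)[0][0].split(/,/)
# => ["1", "0", "\x01<0>", "\x01<1>"]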

The complete code is here:
def hash_or_die(kw)
  # Like a plain Hash, but raises on unknown keys instead of returning nil
  Hash.new{|ht,k| raise "Unknown key: #{k}"}.merge(kw)
end

def parse(data)
  esc     = hash_or_die "\\" => "A", "\"" => "B", "n" => "C", "'" => "D"
  rev_esc = hash_or_die "A" => "\\", "B" => "\"", "C" => "n", "D" => "'"
  # Step 1: turn every \-escape into a safe token like "\x00A"
  data = data.gsub(/\\(.)/) {"\x00" + esc[$1]}
  # Step 2: now every ' really delimits a string, so turn each string
  # into an object id like "\x01<1234>"
  strs = []
  data = data.gsub(/('[^']*')/) { # '
    strs << $1
    "\x01<#{strs.size-1}>"
  }
  # Step 3: split into tuples and fields, then restore strings and
  # \-escapes - in their original backslashed form, so sql_str_unquote
  # can still interpret them (dropping the backslash here would make
  # strings containing \\ come out wrong)
  records = []
  data.scan(/\((.*?)\)/) {
    records << $1.split(/,/).map{|field|
      field.gsub(/\x01<(\d+)>/) {
        strs[$1.to_i]
      }.gsub(/\x00(.)/) {
        "\\" + rev_esc[$1]
      }
    }
  }
  records
end

def sql_str_unquote(str)
  str =~ /\A'(.*)'\Z/ or raise "SQL string format is wrong: #{str}"
  # Drop the backslash from every \-escape (good enough for titles and URLs)
  $1.gsub(/\\(.)/) {$1}
end

# Use the most recent dump files available
page_fn          = Dir["plwiki-*-page.sql"].sort[-1]
externallinks_fn = Dir["plwiki-*-externallinks.sql"].sort[-1]

pages = {}

File.open(page_fn).each{|line|
  next unless line =~ /\AINSERT INTO `page` VALUES (.*)\Z/
  parse($1).each{|id,ns,title,*stuff|
    next unless ns == "0" # articles only (main namespace)
    title = sql_str_unquote(title)
    pages[id] = title
  }
}

File.open(externallinks_fn).each{|line|
  next unless line =~ /\AINSERT INTO `externallinks` VALUES (.*)\Z/
  parse($1).each{|from,to,index|
    title = pages[from]
    next unless title
    to = sql_str_unquote(to)
    next unless to =~ /\Ahttp:\/\//
    puts "#{title}\t#{to}"
  }
}


The same technique can be used for many other things, like parsing Lisp code:
require 'pp'

lisp_code = '(a (b c) (d (e) f g) (((h))))'
nodes = []

# Turn every atom into a node id like "<0>"
lisp_code.gsub!(/([a-z]+)/) {
  nodes << [:atom, $1]
  "<#{nodes.size-1}>"
}
lisp_code.gsub!(/\s/,"")
# Keep collapsing innermost parenthesized groups of node ids
# until nothing is left to collapse
true while lisp_code.gsub!(/\(((?:<\d+>)*)\)/) {
  nodes << [:app, *$1.scan(/<(\d+)>/).map{|x,| nodes[x.to_i]}]
  "<#{nodes.size-1}>"
}
lisp_code =~ /<(\d+)>/
pp nodes[$1.to_i]
# Output:
# [:app,
#  [:atom, "a"],
#  [:app, [:atom, "b"], [:atom, "c"]],
#  [:app, [:atom, "d"], [:app, [:atom, "e"]], [:atom, "f"], [:atom, "g"]],
#  [:app, [:app, [:app, [:atom, "h"]]]]]


and math expressions:
require 'pp'

math_code = '(2 + 2 * 2) / ((2 + 2) * 2)'
nodes = []

# Turn every number into a node id like "<0>"
math_code.gsub!(/(\d+)/) {
  nodes << $1.to_i
  "<#{nodes.size-1}>"
}
math_code.gsub!(/\s/,"")

# Reduce until a single node id is left, honoring precedence:
# parentheses first, then * and /, then + and -
until math_code =~ /\A<(\d+)>\Z/
  next if math_code.gsub!(/\((<\d+>)\)/) { $1 }
  next if math_code.gsub!(/<(\d+)>([\*\/])<(\d+)>/) {
    nodes << [$2, nodes[$1.to_i], nodes[$3.to_i]]
    "<#{nodes.size-1}>"
  }
  next if math_code.gsub!(/<(\d+)>([\+\-])<(\d+)>/) {
    nodes << [$2, nodes[$1.to_i], nodes[$3.to_i]]
    "<#{nodes.size-1}>"
  }
end
pp nodes[$1.to_i]
# Output:
# ["/", ["+", 2, ["*", 2, 2]], ["*", ["+", 2, 2], 2]]


The technique of embedding objects in strings and matching such strings with regular expressions is simple and very powerful. The objects can be represented in many ways. If numbers and some sort of brackets are not relevant for parsing, and "\x00" doesn't occur in the input (it almost never does), "\x00<ID>" is a good choice. Fancy Unicode private-use characters can be used too, if the regular expression engine can handle them. If you want to treat different objects in different ways, you can represent them in some convenient form like \x00<CLASS;PROPERTY_A;PROPERTY_B;ID>. The technique works best in languages with integrated regular expression engines like Ruby, Perl, and RLisp. In others, like Python and Java, it's going to be somewhat uglier, but still better than full-blown "parsers".
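
Here is a minimal generic sketch of the pattern (freeze_matches and thaw are hypothetical names of mine, not from any library):
def freeze_matches(str, pattern, store)
  # Replace every match of pattern with an opaque token like "\x00<7>"
  str.gsub(pattern) {
    store << $&
    "\x00<#{store.size-1}>"
  }
end

def thaw(str, store)
  # Put the original matches back
  str.gsub(/\x00<(\d+)>/) { store[$1.to_i] }
end

store = []
s = freeze_matches("a 'b,c' d", /'[^']*'/, store)
# s is now "a \x00<0> d" - safe to split on commas
p thaw(s, store)
# => "a 'b,c' d"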

4 comments:

arrêt sur image said...

Sometimes it's also easy to let the Ruby parser do the job for you using eval. Just match the whole parenthesized expression, replace the first and last parens with brackets, and eval the resulting string.

Mathias Gaunard said...

That kind of stuff could probably be done more elegantly using C++ and Boost.Xpressive, because the engine already allows this by design, giving you the ability to define recursive grammars.

Note however that since Boost.Xpressive makes regular expressions part of the programming language, the syntax is a bit altered to match C++'s.

Anonymous said...

I'm kind of surprised someone else does this. When I work with MediaWiki dumps I generally only want the links tables (with bare information) and page_id <=> (page_namespace, page_title) mappings, so I wrote tiny Perl scripts to chop the SQL dumps to bits. page is as simple as `gzip -dc page.sql.gz | perl -ne 'print "$1\t$2\t$3\n" while m/\((\d+),(\d+),\x27(.*?)(?<!\\)\x27,.*?\)(?=[,;])/g;' > page.txt`. It is fine unless people make page titles that end in \ or contain tab characters (and probably some other edge cases). After that it is much easier to work with the data tab-delimited, and much faster to insert it into a database with LOAD DATA INFILE.

taw said...

Anonymous: Complicated plain regular expressions like the one in your solution also work, but they are often more limited, more fragile, harder to write, and less readable.