Regular expressions are among the most powerful programming techniques ever invented. Real world "regular expressions" are only loosely related to Computer Science "regular expressions". Computer Science "regular expressions" can only provide yes/no answers to "does this string match that regular expression" type of questions. We are usually interested in much more than that - we want to extract data from strings, convert strings to other strings and so on. We call expressions used for this purpose "regular" too for historical reasons.
Regular expressions are extremely concise, but sometimes they don't suffice and we need to write a "real parser". Unfortunately even with the best parser generating tools parsers tend to be many times more complex and error prone than equivalent regular expressions. And if the problem is too complex for regular expressions, very often it is also too complex for whichever parser generator you're using, and needs a lot of nontrivial hacking around limitation of it, or even writing a parser by hand.
Fortunately for many problems there's an alternative to parsers and parser generating tools - regular expressions plus a few tricks. This blog post is about one of such trick.
For my bot which verifies links in Wikipedia I needed to extract data from SQL dumps. SQL dumps look something like that:
INSERT INTO `page` VALUES (1,0,'Astronomia','',1800,0,0,0.600461925007833,'20070601091320',8076762,8584,0), (2,0,'AWK','',329,0,0,0.487812640599732,'20070530195555',8058046,4265,0), (4,0,'Alergologia','',108,0,0,0.580574716050713,'20070520093413',7912844,292,0), ...
INSERT INTO `page` VALUES (14880,0,'Dźwignica_linotorowa','',26,0,0,0.597327036408081,'20060814072401',4282357,727,0), (14881,0,'Urządzenia_transportowe','',91,0,0,0.176666489966834,'20070527090143',2976610,1041,0), ...
Basically a bunch of INSERT statements with multiple tuples each. I wanted to iterate over the tuples. Extracting tuple data from the sql dump is a simple
next unless line =~ /\AINSERT INTO `page` VALUES (.*)\Z/
. Spliting this blob into tuples and tuple fields is almost trivial - /\),\(/
separates tuples and ,
separates tuple fields. Unfortunately there are many strings inside, and some of them might contain commas.Sure, it's posisble to write a single monster regular expression to do just that, but it would be quite error-prone. Wouldn't it be easier to simply treat whole SQL strings as single "objects" inside the string ? That's exactly what we're going to do. First we need to get rid of \-escapes. That's not really necessary, as a regular expression for matching strings with \-escapes inside wouldn't really be that complicated, but we can make it even simpler this way. So every
/\\(.)/
becomes "\x00" + esc[$1]
, where values of esc[...]
are "safe" characters which won't interfere with further parsing, like A
, B
, C
, and so on.At this point every
'
marks either a beginning or an end of some string. So we can replace all strings by object ids like "\x01<1234>"
, where 1234 is a suitable number. After we do that we can split on /\),\(/
and ,
just like we wanted. Afterwards we need to convert embedded objects (like \x01<1234>
and \x00A
) back to their original form.The complete code is here:
def hash_or_die(kw)
Hash.new{|ht,k| raise "Unknown key: #{k}"}.merge(kw)
end
def parse(data)
esc = hash_or_die "\\" => "A", "\"" => "B", "n" => "C", "'" => "D"
rev_esc = hash_or_die "A" => "\\", 'B' => "\"", "C" => "n", "D" => "'"
data = data.gsub(/\\(.)/) {"\x00" + esc[$1]}
strs = []
data = data.gsub(/('[^']*')/) { # '
strs << $1
"\x01<#{strs.size-1}>"
}
records = []
data.scan(/\((.*?)\)/) {
records << $1.split(/,/).map{|field|
field.gsub(/\x01<(\d+)>/) {
strs[$1.to_i]}.gsub(/\x00(.)/){ rev_esc[$1]
}
}
}
records
end
def sql_str_unquote(str)
str =~ /\A'(.*)'\Z/ or raise "SQL string format is wrong: #{str}"
$1.gsub(/\\(.)/) {$1}
end
page_fn = Dir["plwiki-*-page.sql"].sort[-1]
externallinks_fn = Dir["plwiki-*-externallinks.sql"].sort[-1]
pages = {}
File.open(page_fn).each{|line|
next unless line =~ /\AINSERT INTO `page` VALUES (.*)\Z/
parse($1).each{|id,ns,title,*stuff|
next unless ns == "0"
title = sql_str_unquote(title)
pages[id] = title
}
}
File.open(externallinks_fn).each{|line|
next unless line =~ /\AINSERT INTO `externallinks` VALUES (.*)\Z/
parse($1).each{|from,to,index|
title = pages[from]
next unless title
to = sql_str_unquote(to)
next unless to =~ /\Ahttp:\/\//
puts "#{title}\t#{to}"
}
}
The same technique can be used to parse many other things like parsing Lisp code:
require 'pp'
lisp_code = '(a (b c) (d (e) f g) (((h))))'
nodes = []
lisp_code.gsub!(/([a-z]+)/) {
nodes << [:atom, $1]
"<#{nodes.size-1}>"
}
lisp_code.gsub!(/\s/,"")
true while lisp_code.gsub!(/\(((?:<\d+>)*)\)/) {
nodes << [:app, *$1.scan(/<(\d+)>/).map{|x,| nodes[x.to_i]}]
"<#{nodes.size-1}>"
}
lisp_code =~ /<(\d+)>/
pp nodes[$1.to_i]
# Output:
# [:app,
# [:atom, "a"],
# [:app, [:atom, "b"], [:atom, "c"]],
# [:app, [:atom, "d"], [:app, [:atom, "e"]], [:atom, "f"], [:atom, "g"]],
# [:app, [:app, [:app, [:atom, "h"]]]]]
and math expressions:
require 'pp'
math_code = '(2 + 2 * 2) / ((2 + 2) * 2)'
nodes = []
math_code.gsub!(/(\d+)/) {
nodes << $1.to_i
"<#{nodes.size-1}>"
}
math_code.gsub!(/\s/,"")
until math_code =~ /\A<(\d+)>\Z/
next if math_code.gsub!(/\((<\d+>)\)/) { $1 }
next if math_code.gsub!(/<(\d+)>([\*\/])<(\d+)>/) {
nodes << [$2, nodes[$1.to_i], nodes[$3.to_i]]
"<#{nodes.size-1}>"
}
next if math_code.gsub!(/<(\d+)>([\+\-])<(\d+)>/) {
nodes << [$2, nodes[$1.to_i], nodes[$3.to_i]]
"<#{nodes.size-1}>"
}
end
pp nodes[$1.to_i]
# Output:
# ["/", ["+", 2, ["*", 2, 2]], ["*", ["+", 2, 2], 2]]
Technique of embedding objects in strings and matching such strings with regular expressions is simple and very powerful. Objects can be represented in many ways. If numbers and some sort of brackets are not relevant for parsing, and
"\x00"
doesn't occur in the input (it almost never does), "\x00<ID>"
is a good idea. Fancy Unicode private characters can be used too if the regular expression engine can handle them. If you want to treat different objects in different way you can represent them in some convenient form like \x00<CLASS;PROPERTY_A;PROPERTY_B;ID>
. The technique works best in languages with integrated regular expression engines like Ruby, Perl, and RLisp. In others like Python and Java it's going to be somewhat uglier, but still better than full-blown "parsers".