The best kittens, technology, and video games blog in the world.

Tuesday, May 29, 2007

Regular expression matching in RLisp

fluffy ruff by jiva from flickr (CC-NC)
Ruby copies Perl regular expression matching semantics:

  • If a_string matches /a_rx/, return true and set $~ to match data
  • Otherwise return false and set $~ to nil
  • $~ is "Do-What-I-Mean-scoped". It is not local or global variable. "Do-What-I-Mean-scoping" is different in Perl and Ruby.
  • $1, $2, $`, and related other "variables" are not variables at all - they're simply parser-level macros which expand into $~.do_something.
  • Very often $1 etc. need to be converted to numbers. In Perl strings and numbers are unified and explicit conversion is not needed. In Ruby you often need to follow the match by something like a, b, c = $1.to_i, $2, $3.to_f.
I don't think it would be a good idea to copy all that to RLisp - semantically it's unbelievably ugly. It's also too damn convenient compared to the lengths one needs to go to parse something in Python, so something equally convenient but cleaner is needed instead.

Without any special text processing macros RLisp code to do something as simple as match IP numbers looks ugly, but at least it works:
(defun parse-ip (s)
(let m [s match /\A(\d+)\.(\d+)\.(\d+)\.(\d+)\Z/])
(if m
(let a [[m get 1] to_i])
(let b [[m get 2] to_i])
(let c [[m get 3] to_i])
(let d [[m get 4] to_i])
(list a b c d))
(raise "Cannot parse IP"))
(test-suite Text_Processing
(test parse_ip
(assert (parse-ip "") == '(1 2 3 4))
(assert (parse-ip "") == '(64 233 183 104)))
The solution - macros (who might have guessed that ...). Very simple pair of macros lets us use (rx-match a_string /a_rx/ a b c d) just like a_string =~ /a_rx/; a, b, c, d = $1, $2, $3, $4 would be used in Ruby. It returns true or false just like Ruby =~.

(defmacro rx-results (m . args)
(let res '(do))
[args each_with_index & (fn (x i)
[res push `(let ,x [,m get ,[i + 1]])]
[res push 'true]

(defmacro rx-match (s rx . args)
(let tmp (gensym))
(let ,tmp [,s match ,rx])
(if ,tmp
(rx-results ,tmp ,@args)

(defun parse-ip-2 (s)
(if (rx-match s /\A(\d+)\.(\d+)\.(\d+)\.(\d+)\Z/ a b c d)
(list [a to_i] [b to_i] [c to_i] [d to_i])
(raise "Cannot parse IP"))
That's pretty much as good as Ruby already, but we can do a bit better, integrating regexp matching and conversion to numbers. (i varname) and (f varname) mean convert the variable, while plain varname leaves it as a string.

(defmacro rx-results (m . args)
(let res '(do))
[args each_with_index & (fn (x i)
(match x
('f v) [res push `(let ,v [[,m get ,[i + 1]] to_f])]
('i v) [res push `(let ,v [[,m get ,[i + 1]] to_i])]
[res push `(let ,x [,m get ,[i + 1]])])
[res push 'true]
(defun parse-ip-3 (s)
(if (rx-match s /\A(\d+)\.(\d+)\.(\d+)\.(\d+)\Z/ (i a) (i b) (i c) (i d))
(list a b c d)
(raise "Cannot parse IP"))
By the way it would be very difficult to build such macros in Scheme or Common Lisp, not only due to their lack of builtin Perl-compatible regular expressions (a library can solve that problem), but also due to (let ...) in these Lisps being much more restrictive.

(let variable value) in RLisp sets variable in scope of current function. (let ((variable value)) code) in Scheme sets variable only in scope of code. It makes RLisp code clearer and reduces number of nested parentheses, but it also does something far more important - it makes writing RLisp macros a lot easier than Scheme/Common Lisp macros.


Fabien said...

About (let ...) extending the current scope instead of creating its own: I agree that it makes human-written code easier to read, by avoiding code nestings which don't correspond to the programmer's mental image. I also acknowledge that in some cases, it makes some macros easier to write. However, my experience is that it bits you more often than it helps when writing metastuff (metalua has essentially the same structure, inherited from Lua, which extends the conatining block's scope: `Local{ vars_list, values_list }).

As soon as a macro needs a tight control of scope, variable shadowing etc. I need to handle code blocks in custom ways in order to parse local declarations correctly. That's especially the case with code walkers: alpha renaming, runtime type-checking insertion...

IIRC, Paul Graham came to the same sort of conclusion in his design of Arc: he initially thought that inserting local variables in their containing scope was nicer, but found it to be a nightmare for macro writing and backed down. I'll try to find his quote.

Finally, your statement that "it makes macros much easier to write" should be slightly narrowed: having an "extend containing scope" semantics made it easy for you to write a regexp parser which also respects "extend containing scope" semantics. Schemers would naturally have written a macro which creates its own scope, so they would have completely avoided the problem you solved.

taw said...

Fabien: If you need to create new scope in RLisp you can do ((fn() code)), there's even standard macro for that - (local code).

Paul Graham did something else - every (do ...) created a new scope, and many macros accidentally created new scopes by using (do ...) when they didn't mean to. I don't think any macro will accidentally do ((fn() ...)).

So far it never bit me, and it helped me in most of the macros I wrote. If it ever does, I will certainly post about it.