The best kittens, technology, and video games blog in the world.

Wednesday, January 23, 2008

Really strange quirk of Ruby and Perl regular expressions

Sage's Kittens by T. Keller from flickr (CC-BY)

I've found a weird quirk of Ruby's regular expression engine. The same quirk is present in Ruby 1.8.6, Ruby 1.9.0, and Perl 5.8.8, but it is not present in Python 2.5.1. I'm going to let you decide if it's a bug or not.

I tried to replace all space at the end of a string by a single newline. The correct way to do it is str.sub(/\s*\Z/, "\n"). However I've done str.gsub(/\s*\Z/, "\n") instead. String has only one end-of-string, so there should be no way String#gsub could possibly match twice. But it does - the result is two newlines if there was any whitespace at the end, or one newline if there wasn't. I kinda see how it might be getting these results from implementation point of view, and it's not a big deal because there's a simple workaround of replacing #gsub by #sub, but it doesn't make much semantic sense to me.

Here are a few code snippets in Ruby, Perl and Python, with "foo" instead of whitespace for better readability.

# Ruby
"f".gsub(/o*\Z/, "o") # => "fo"
"fo".gsub(/o*\Z/, "o") # => "foo"
"foo".gsub(/o*\Z/, "o") # => "foo"
"fooo".gsub(/o*\Z/, "o") # => "foo"
"f".sub(/o*\Z/, "o") # => "fo"
"fo".sub(/o*\Z/, "o") # => "fo"
"foo".sub(/o*\Z/, "o") # => "fo"
"fooo".sub(/o*\Z/, "o") # => "fo"
"f".gsub(/o*\Z/, "x") # => "fx"
"fo".gsub(/o*\Z/, "x") # => "fxx"
"foo".gsub(/o*\Z/, "x") # => "fxx"
"fooo".gsub(/o*\Z/, "x") # => "fxx"

# Perl
perl -le '$_="f"; s/o*$/o/g; print $_' # => fo
perl -le '$_="fo"; s/o*$/o/g; print $_' # => foo
perl -le '$_="foo"; s/o*$/o/g; print $_' # => foo
perl -le '$_="fooo"; s/o*$/o/g; print $_' # => foo

# Python
re.compile(r'o*$').sub("o", "f") # => 'fo'
re.compile(r'o*$').sub("o", "fo") # => 'fo'
re.compile(r'o*$').sub("o", "foo") # => 'fo'
re.compile(r'o*$').sub("o", "fooo") # => 'fo'


For all you Ruby/Perl programmers out there, Python sub is equivalent of gsub or s///g not sub or s///.

Which behaviour makes more sense ? I think Python's much more intuitive. Should we treat it as a bug in Ruby/Perl and fix it, or accept it as an unintended feature ?

19 comments:

Anonymous said...

Yes, I'd also expect the Python behaviour and can't think of a reason why the outcome is different in Ruby and Perl.

By the way, the Python examples could also be written like this for shorter code: re.sub(r'o*$', "o", "f")

8jean said...

After some playing around and reading perlre, I think rubys/perls behaviour might indeed be correct, although it seems counterintuitive.

The problem is that your pattern (o*) *also* matches *the empty string*, which is not what you intended, but it does.

Just check:

perl -le 'print "matches: ", join ", ", map { ">$_<" } "foo" =~ /(o*\z)/g'

# output:
matches: >oo<, ><

So your pattern is interpreted as "any number of 'o's, or none, followed by the end of the string". Now, the first match ('oo') is what you wanted, but the second match is also valid, because you string (and indeed *any* string) also includes "nothing" (the empty string) right before the end of the line, which is dutifully matched.

taw said...

8jean: But it matches exactly twice instead of either once or infinite number of time. Why is "foooo".scan(/o*\Z/) returning ["oooo", ""], but not ["ooo", "oo", "o"] ?

8jean said...

taw: thats because of the 'greedy' behaviour of '*', always matching as much as it can.

The real problem is, as you wrote: how can the pattern match more than once under 'g', when there exists only one end of the string?

And since the '*' modifier is greedy, it should cover all of the cases '$', 'o$', 'oo$', ... an so on, matching only once.

I think the reason for the multiple matches is this:

Things like '$', '\z', etc. are "assertions", i.e. they can be used to "anchor" regexp patterns, but they are never a part of the actual match.

So, in your example, you ask for *every* occurrence of 'o*' followed by the end of the line.

So the regexp engine starts at the first 'o' it finds, and goes right up to the last one. Since EOL is next, the match ('ooo') is returned.

But you asked for *all* matches, so the search continues to look for 'o*$' after the end of the last match (the last 'o' character), where it finds "nothing", also followed by EOL. Since you said '*', nothing is OK => 2nd match.

Now my head hurts. ;)

Eridius said...

At first I thought this was a bug, but I have to agree that this is, in fact, correct behavior.

The first search for /o*$/ against "foo" matches the "oo". The second search will match the empty string "". The third time it searches, it will only find the exact same empty string, which it's already matched, so it finishes.

Anonymous said...

Eridius: Normally, the matching resumes at the position *after* the last match (it doesn't start over), and because the first match already matched the end of the string, how can there be a second match?

Anonymous said...

Actually, after rereading 8jean's explanation, I can see why this is the result. But it definitely doesn't seem to be clearly defined behaviour. How do other RE implementations behave?

Here's another example of the same problem:

"ha hu".gsub(/a*\b/, "x") #=> "xhxx xhux"

re.sub(r'a*\b', "x", "ha hu") #=> 'xhx xhux'

taw said...

8jean: It doesn't depend on greediness. "foo".scan(/o*?\Z/) returns the same two matches as "foo".scan(/o*\Z/). On the other hand "symmetrical" cases of "oof".scan(/\Ao*/) and "oof".scan(/\Ao*?/) return only ["oo"] and [""].

Is there any case where it's actually useful ?

Another interesting thing - if you scan "ofoo" with /\b/ or /\bo*/ you get two matches. With /o*\b/ and /\o*\b\o*/ you get three. It doesn't seem very intuitive.

sapphirecat said...

Is this because \Z and friends are "zero-width" assertions, so they match without consumption? If so, then the symmetrical case doesn't match, because the o* moves past \A, so that \A is never considered again at the beginning of the string where it would match. For the o*\bo* cases, I'm pretty sure the same thing is going on.

I'd really expect the Python behavior; the only reason I haven't run into this in the past is because usually I'm removing all whitespace, so deleting the 'extra' zero-width match doesn't affect anything.

lawrencepit said...

To me this seems like a bug in ruby and perl. 8jean explains it very well: he explains why this is a bug. ;-)

> "so the search continues to look for 'o*$' after the end of the last match (the last 'o' character)"

To be clear, the last match isn't the 'o' character. It's the \Z position. Ruby and perl should understand that after a \Z position there can't be anything else.

The same should apply for \b. In taw's last example, if you scan "foo" with /o*\b/ you get three matches, and if you scan "foo foo" with /o*\b/ you get six matches. Here too ruby and perl apparently seem to continue after the character match instead of after the position match.

Dave Doyle said...

@lawrencepit: Perl is not matching after \Z, it's matching a zero-width o* immediatly before \Z (which, as 8jean said, is an assertion) and after the o's that were replaced in the first match. This is a result of the /g modifier (global) combined with the o* that makes it global. You gotta be careful when you're dealing with greediness and the * qualifier in conjunction with assertions ($) and global substitution.

It's not a bug, it's doing exactly what it was asked. :) Python apparently chose to implement it differently, but the match is valid, in Perl at least. However, I think it should be noted that a * match used with the $ anchor and /g global modifer is not something you'd see in common practice anywhere. I get that it's a demonstration but it'd be much more common to see someone do:

s/o+$//; $_ .= 'o';

Because that's basically the task you're trying to accomplish. Further, if you have a multiline string it can be accomplished by doing:

s/o+$//gm; s/$/o/;

Not the point of the exercise I'm sure, but it's doing what you thought you were doing in the first place.

lawrencepit said...

@davedoyle:

> and after the o's that were replaced in the first match

I will again suggest this is exactly the bug in perl. It should not match after the o's. It should match after the \Z.

The command is to match "o*\Z". The command is not to match characters, but characters and /position/. The first match in the global search should include the position. The global search for a second match should resume after the position of \Z, not after the last o character, which should result in no match. Instead perl resumes it's search just before the position of \Z. That is a bug, because now it considers the position before \Z for a second time, even though this was /already matched/.

The key here is that a regular expression consists of characters and positions. What taw shows is that perl and ruby resume pattern matching in a global search ignoring position.

robmueller said...

@lawrencepit

> The command is not to match characters, but characters and /position/

I think talking in terms of positions is incorrect. The documentation says.

perldoc perlre

...
Perl defines the following zero-width assertions:

\z Match only at end of string
...

It's not a position, it's an "assertion". So \z matches *at* the end of the string, but doesn't match *the* end of the string, so restarting again because it's a \g expression is the right thing to do.

That seems terribly pedantic, but it does explain the behavior in this case and perl is correct with regard to it's documentation if you read it that way.

Paolo Bonzini said...

sed agrees with python

interestingly, split does not have the same behavior as gsub:

irb(main):001:0> "fooooo".split(/o*\Z/)
=> ["f"]
irb(main):002:0> "fooooo".split(/o*/)
=> ["f"]

(I put a parallel between split and gsub, because I just fixed GNU Smalltalk to behave like Python and I had to fix both split and gsub).

lawrencepit said...

@paolo: indeed, echo "foooo" | sed -e "s/o*$/x/g" results in fx.

@robmueller: the perlre doc says:

"The \A and \Z are just like ^ and $, except that they won't match multiple times when the /m modifier is used".

Yet it does match \Z multiple times when the /m modifier is used (with the option /g). So I'm still suggesting it's a bug in perl. ;-)

-----------------------------

In PHP:

echo preg_replace('/o*\Z/', "x", "f");
echo preg_replace('/o*\Z/', "x", "fo");
echo preg_replace('/o*\Z/', "x", "foo");
echo preg_replace('/o*\Z/', "x", "fooo");
echo preg_replace('/o*\Z/', "x", "foooo");

also results in:

fx
fxx
fxx
fxx
fxx

Using ereg_replace instead of preg_replace gives the same result.

trolling said...

The perl/ruby behavior is correct if you understand how assertions work. Take this example:

"abc".gsub(/b*(?=c)/, "z") #=> "azzc"

Looking at that, it should be obvious that this bahvior is right. There are two things that pattern matches:

1) the zero-width position before a "c"
2) any number of "b" characters before a "c"

Each of those two incidences occur once in the string, a total of two matches, so the pattern should match twice.

Think of it this way, if you were to try to match /$/ to "abc", it should find one match, correct? Now say you searched "abc" for /c?$/, that is effectively a search for two things: /c$/ or /$/ (a.k.a. /c$|$/, a pattern that ought to be functionally identical to /c?$/). Now, how many times does /c$/ and /$/ exist in "abc". Two.

The confusion perhapas arises because people don't understand that assertions don't consume characters. They expect a match on /c$/ will consume the end of the line and the /$/ won't match after. That's not how assertions work. Assertions do not consume characters. After /c$/ matches the cursor is placed after the "c" but before the end of the string. From this position it can still match /$/.

If the cursor being "after" the end of the string confuses you, replace $ with any other assertion like \b or (?=x) or whatever, the situation is the same.

trolling said...

So I've ben investigating this issue further.

From what little I've tested it seems like python's regex engine won't match a zero-width position if it matched the character before it. I'm not all too familiar with python's various tools so I can't really test this all that effectively, but from what I can gather that seems to be the cause of this inconsistency.

Here's an example that looks really odd to me (being used to perl/ruby):

>>> re.compile(r'(?= )').sub("z", "abc d")
'abcz d'
>>> re.compile(r'c?(?= )').sub("z", "abc d")
'abz d'

As you can see, python knows it can match zero-width strings. It knows there's a valid zero-width position for matching right before that space. But if you match the "c" before it, it suddenly gets ignored. This cannot be because the space itself has been consumed by the pattern (assertions don't eat characters), but rather it sort of consumes the empty position after it.

So what it actually seems like is that to python, zero-width positions somehow belong to the character before them. If you match the "c" the zero-width after it isn't a valid target any more. I suppose this is a sort of valid stance, and it can lead to less accidental matches, but I don't know if it's quite right. At least to me, it makes things feel less consistent: I know what my assertion ought to match, the fact that matching another character can eat the position I may want to use doesn't seem like a good thing.

sjs said...

Python also exhibits some inconsistency between re.sub and re.findall.

>>> re.findall(r'a?\b', "ha hu")
['', 'a', '', '', '']

Therefore when doing re.sub(r'a?\b', "x", "ha hu"), shouldn't the result be "xhxx xhux"?

Or, to use the original example:

>>> re.sub(r'o*\Z', "o", "f")
'fo'
>>> re.sub(r'o*\Z', "o", "fo")
'fo'
>>> re.sub(r'o*\Z', "o", "foo")
'fo'

and then

>>> re.findall(r'o*\Z', "f")
['']
>>> re.findall(r'o*\Z', "fo")
['o', '']
>>> re.findall(r'o*\Z', "foo")
['oo', '']

taw said...

sjs: And I thought that Python got it right, while Perl/Ruby were wrong. Now I see that Python is in fact even less consistent than Perl/Ruby. :-/