Regex Arcana

The Perl Journal August, 2004

By Jeff Pinyan

Jeff "japhy" Pinyan, author of Regexp::Parser, is finishing up at RPI in Troy, NY. He can be contacted at japhy@perlmonk.org.

Perl's regular expressions (hereafter known as "regexes") have come a long way since Version 1.0. They're like the eccentric uncle who spends a few years sojourning somewhere exotic and comes back older, wiser, and more worldly. Or perhaps, in Perl's case, more out-of-this-worldly. Perl's regexes aren't what computer science purists would call "regular"—that idea was thrown out the window when backreferences became something you could include in the pattern itself, as shown in this simple dynamic regex:

# find any any doubled words
my $text = "Four score and and seven years ago";
while ($text =~ /\b(\w+)\W+\1\b/g) {
  print "'$1' repeats itself";
  # 'and' repeats itself
}

This regex is self-referential: The backreference \1 tells the regex engine to match whatever has been captured into $1. This means the regex doesn't know for certain what it will match until runtime. And this is only the tip of the iceberg.

I'm going to share two pieces of regex arcana with you—features of the Perl regex engine that aren't used often, but offer those who can master them great power.

Code Evaluation

The code evaluation assertion (?{ CODE }) was introduced in Perl 5.005, but it was rough around the edges. In Perl 5.6, it was fine-tuned and more adjustments were made by 5.8. The code evaluation assertion allows a regex to execute an arbitrary block of Perl code during the course of its pattern matching. This device becomes particularly interesting when one realizes that the backtracking stack is much like a variable scoping stack.

Delayed Execution

The delayed execution assertion (??{ CODE }) is the heart of dynamic regexes. Using them is like walking across a bridge as you're building it. Their true beauty is revealed when you create a Regexp object (via qr//) with a delayed execution assertion in it. Inside that assertion is the self-same Regexp object. Then you have created a recursive regex, one that is capable of matching nested data structures or acting like a proper grammar.

Getting Up to Speed

This article is going to use some regex variables you might not be familiar with, so I introduce them to you in Table 1.

Here is a simple example exhibiting these variables:

"perl" =~ /(.(..).)/;
# $1  = "perl"
# $2  = "er"
# $+  = "er" (in this case, $2)
# $^N = "perl" (in this case, $1)
# @-  = (0, 0, 1)
# @+  = (4, 4, 3)

After a successful pattern match against $str, the pair of values $-[X] and $+[X] hold the offsets in $str of the Xth capture group. $-[0] and $+[0] are the offsets for $&:

substr($str, 0, $-[0]);		# produces $'
substr($str, $-[0], $+[0] - $-[0]);	# produces $&
substr($str, $+[0]);			# produces $'
substr($str, $-[1], $+[1] - $-[1]);	# produces $1
substr($str, $-[2], $+[2] - $-[2]);	# produces $2 (etc.)

You are probably aware that $& and its friends are bad news. They cause reductions in the speed of your programs, so use them only when debugging. If you use them once in your code, you might as well use them as many times as you can in that code, because you've already suffered the hit.

Using (?{ ... })

Our first incantation is the code evaluation assertion. It will execute the code inside when it is encountered, and then continue with the regex; they always succeed, unless their contents interrupt the execution of your program.

Let's look at some applications of this assertion.

Inspecting a Regex's State

It has been said that "you can't observe the behavior of a system without affecting the system's behavior." Not so, at least in Perl 5.8. The following regex uses code assertions to show how a particular optimization in the engine works:

# the /x modifier allows embedded whitespace
# to help improve readability and clarity
"salad" =~ m{
  .* (?{ print "[$&] " }) a
  .* (?{ print "($&) " }) a
}x;
# output:
# [sal] [s] (sal)

We often see backtracking as a character-at-a-time procedure, but it's really much more efficient than that. In this case, something like .*a makes the engine jump from one "a" to the next when matching or backtracking. The code assertion doesn't get in the way of this, even though it's in-between the .* and a. (Sadly, in Perl 5.6, it did get in the way. Try that code in 5.6 and you'll see it stops the optimization from occurring.)

You have access to the regex variables inside a code assertion; my example above uses $&, although it's general practice to avoid that variable. You can also access $1 and other digit variables, $^N, $+, and the arrays @- and @+. One caveat, though, is that you can't access a capture group until it has been completed; that is, the closing parenthesis for $1 must be passed before you can see the contents of $1 in a code assertion, otherwise you'll be seeing its previous value. This code won't print anything inside the quotes because $1 hasn't been closed before it's printed:

"perl" =~ m{
  ( (?: . (?{ print "'$1'\n" }) )+ )
}x;

In order to inspect the digit variables as they're being created, you need to be cunning: Create a variable inside a code evaluation immediately prior to the capture group that stores the current location in the string. Then, use substr() on $_. This is a copy of the string you're matching against inside a code evaluation:

"perl" =~ m{
  (?{ $p = $-[0] })
  ( (?:
    .
    (?{ print substr($_, $p, $+[0] - $p), "\n" })
  )* )
}x;

That $_ trick is undocumented as far as I can tell, which means it might not stay that way in the future. (But considering Perl 6 regexes will be radically different, we should have fun while we can.)

In this way, code evaluation assertions are good debugging tools because you can use them without interfering with the execution of the regex and you can get helpful diagnostic information, such as where you are in the string, what you've captured, and so on.

Capturing Repetitions

I've often been asked why a regex like /(\w)+/ only returns the last word character captured:

"japhy" =~ /(\w)+/ and print "<$1>";  # 'y'

The reason is because the regex is continually storing what \w matches to $1, and each time, it's overwriting the old contents. It won't create $2, $3, and so forth, and it doesn't create a @1 array. So the question remains: How can I keep track of repeated capture groups?

The answer works on the principle that the matching and backtracking in a regex is very similar to a variable scoping stack if you use local() on your variables inside code evaluations. local() works differently in a regex; it gives a variable a value that remains until that point is backtracked past. Let's say we wanted to count how many characters we matched before the last "r" in a string. Here's code that doesn't use local():

# not local()ized
"perl" =~ m{
  (?{ $x = 0 })
  (?: \w (?{ ++$x }) )*
  r
}x and print "x = $x\n";

It prints "4" as the value of $x, even though there are only two. This time, we'll use local() on a temporary variable and assign its value to $x only after the regex has succeeded:

# local()ized
"perl" =~ m{
  (?{ local $_x = 0 })
  (?: \w (?{ local $_x = $_x + 1 }) )*
  r
  (?{ $x = $_x })
}x and print "x = $x\n";

This time the value is reported correctly as "2."

The first code prints "4" because when the backtracking occurs, $x's value isn't rewound or reverted to its previous one. The second code prints "2" for the opposite reason: Because of the use of local(), when backtracking occurs, the scoped variable $_x reverts to its previous value. Think of your regex as being a recursive function.

Although it requires a bit more coding on your part, this technique can be extended to let you keep track of capturing groups that have quantifiers on them:

"age=22 name=jeff state=NJ" =~ m{
  (?{ local (@_1, @_2) })
  (?:
    ([^=]+) = (\S+) \s*
    (?{ local @_1 = (@_1, $1);
        local @_2 = (@_2, $2); })
  )*
  (?{ (@1 = @_1), (@2 = @_2) })
}x;

The code above produces two arrays, @1 and @2, whose contents are "age," "name," "state," and "22," "jeff," "NJ," respectively.

If you're wondering why I didn't just write push(@_1, $1), it's because I have to make a new locally scoped array, and push() wouldn't do that. You must use local() every time you need a variable to revert to its previous value when it's backtracked past.

Using $^R to Simplify Things

There's another regex variable, $^R, which stores the value of the last (?{ ... }) assertion. Perl is nice and makes $^R scope properly, so if your regex backtracks, $^R will revert to its previous value. This means we can write a much simpler looking regex:

"perl" =~ m{
  (?{ 0 })	# sets $^R to 0
  (?:
    \w
    (?{ $^R + 1 })	# sets $^R to $^R + 1
  )*
  r
  (?{ $x = $^R })	# sets $x to $^R
}x and print "x = $x\n";

Here, we don't need to worry about localizing our variables because $^R is properly dealt with by Perl. You need to be careful about how you modify it, though. We could not have written (?{ $^R++ }) or (?{ ++$^R }) in our regex because both of those explicitly assign to $^R. You usually don't want to assign to $^R explicitly; just make the last value in your code assertion be the value you want $^R to have. Perl builds up a stack of the values returned by code evaluations and rolls that stack back when backtracking occurs. Setting $^R explicitly interrupts that stacking procedure.

We can use $^R with our example that creates @1 and @2 as well. We have to be careful in how our code assertions work:

"age=22 name=jeff state=NJ" =~ m{
  #     @1  @2
  (?{ [ [], [] ] })
  (?:
    ([^=]+) = (\S+) \s*
    (?{ [ 
      [ @{$^R->[0]}, $1 ],
      [ @{$^R->[1]}, $2 ],
    ] })
  )*
  (?{
    @1 = @{ shift @{$^R} };
    @2 = @{ shift @{$^R} };
  })
}x;

You cannot rely on what value $^R will have outside of your regex, so you should always get the data from it at the very end of your regex, once you are sure it has succeeded.

If-Then Assertions

You can also use code assertions as the condition to an if-then assertion. If you're not familiar with them, they look like (?(cond)true|false). If the cond part evaluates to a true value, then the true pattern is matched; otherwise, the false pattern is matched. If the selected pattern fails to match, the if-then assertion fails altogether. The cond part can be a look-ahead, a look-behind, a number referring to a capture group, or a code assertion. If it is a code assertion, $^R does not get its value changed.

Here's a regex that matches $_ if it is an integer less than 100:

my $less_than_100 = qr{
  ^ ( -?\d+ ) $       # match the entire number
  (?(?{ $1 >= 100 })	# if $1 >= 100...
    (?!)		#   then fail
  )		# otherwise it succeeds
}x;
if ($n =~ $less_than_100) {
  # $n is less than 100
}

Using (??{ ... })

Our second gimmick is the delayed execution assertion. It acts very much like a code evaluation assertion, but the return value is returned to the regex as something to be matched. Consider our first example, which finds words that are repeated—here's a way to write it using a delayed execution assertion:

while ($text =~ /\b(\w+)\W+(??{ $1 })\b/g) {
  print "'$1' is repeated\n";
}

Because $1 only contains word characters, I didn't need to worry about any regex-specific characters being interpolated incorrectly. However, if we change from "words" to "non-whitespace chunks," we'll need to be safer:

while ($text =~ m{
  (?<!\S) (\S+)		# a chunk of non-spaces
  \s+			# some whitespace
  (??{ quotemeta $1 }) (?!\S)	# that chunk again
}xg) {
  print "'$1' is repeated\n";
}

If you're curious why I didn't use \b in this regex, it's because I'm not looking for word boundaries, so I can't count on it working properly.

But that's a very silly use of a very powerful feature. It's not very economically sound to use your car to drive one city block and this assertion is an SUV. So let's go off-roading.

Self-Constructing Regexes

One of the first uses I found for this assertion was to match a chunk of unique characters in a string. Take, for example, "the quick brown fox jumped over the lazy dog." If I want to grab the first chunk of unique characters, I use an assertion that builds a character class based on the characters it has already matched:

my $unique = qr{
  .		# match any character
  (??{		# evaluate and match...
    "[^\Q$&\E]"	# a self-modifying char class
  })*		# zero or more times
}sx;
my ($uniq_str) = $string =~ /($unique)/;

It's important to realize that we're not matching the same character class over and over again. The delayed execution assertion is being reexecuted each time. We can see what the character class looks like if we want:

my $unique = qr{
  .
  (??{
    print my $cc = "[^\Q$&\E]";
    $cc;
  })*
}sx;
my ($uniq_str) = $string =~ /($unique)/;

On "the quick brown...," this gives the output:

[^t]
[^th]
[^the]
[^the\ ]
[^the\ q]
[^the\ qu]
[^the\ qui]
[^the\ quic]
[^the\ quick]

It stops when it reaches the second occurrence of a space because the character class fails to match something it hasn't seen before. If we wanted to ignore spaces, we could simply add the ability to match any whitespace characters before we try matching a new character:

qr{
  . (?: \s+ | (??{ "[^\Q$&\E]" }) )*
}xs

This time we would match "the quick brown f."

Dynamic Quantifiers

You can use this assertion to make a quantifier based on text you've matched. If you have a string like "2aaaa 6a 3aaa 5 1a," and you want to match a number if it is followed by exactly that many a's, you capture the number and then make use of a delayed execution assertion, which uses that number as its quantifier:

@matches = $data =~ m{
  \b (\d+)
  (??{ "a{$1}" })
  (?! a )
}gx;

You could be more efficient and use "a" x $1 instead of "a{$1}", but the example can be expanded to match a lower and upper limit:

# match X,Y, then match between X and Y a's
$data = "1,3aa  2,5aaaaaaa 2,7aaaa";
@matches = $data =~ m{
  \b (\d+) , (\d+)
  (??{ "a{$1,$2}" })
  (?! a )
}xg;

Of course, you'd have to make sure $1 isn't greater than $2, but that's left as an exercise to the reader.

Let's look back to our "is X less than 100?" example. We can rewrite this with (??{ ... }) instead:

if ($num =~ m{
  ^ (-?\d+) $
  (??{ $1 >= 100 and '(?!)' })
}x) {
  # $1 is less than 100
}

Here, if $1 is equal to or greater than 100, we insert (?!) into the regex, an empty negative look-ahead, which always fails. If $1 is less than 100, then our >= operation returns "" as its false value, and the regex succeeds.

Recursive Regexes

More often than not, you'll see these assertions used to create recursive regexes, ones that can match arbitrarily deeply nested parentheses, etc. To get this to work, you create a Regexp object and then put that object inside a delayed execution assertion:

# if you're using a lexical, it MUST be
# declared before it is assigned to
my $px;
$px = qr{
  (?:
    (?> [^\\()]+ | \\. )
    |
    \( (??{ $px }) \)
  )*
}xs;

Now we can make sure a string is properly ()-balanced:

my $good = 'a + (b / (c - 2)) * (d ^ (e+f))';
my $bad1 = 'a + (b / (c - 2) * (d ^ (e+f))';
my $bad2 = 'a + (b / (c - 2)) * (d) ^ (e+f))';
print "not " if $good !~ /^$px$/; print "ok\n";
print "not " if $bad1 !~ /^$px$/; print "ok\n";
print "not " if $bad2 !~ /^$px$/; print "ok\n";

The first is reported OK, the second and third are reported as not OK (which is what we would expect).

Using this recursive mechanism, we can create a regex that matches a palindrome. A palindrome is a string that reads the same forwards and backwards, like "toot," or "madam, I'm adam." Punctuation and case are ignored.

my $pdrome;
$pdrome = qr{
  (\w) \W*
  (??{ $pdrome })?
  \W* \1
}xi;

This matches "toot," but not "rotor." Can you see why?

There is a boundary case: If the string has an odd number of letters, then the turning point letter won't be repeated. Let's add support for that:

my $pdrome;
$pdrome = qr{
  (\w) \W*
  (??{ $pdrome })?
  \W* \1
  |
  \w
}xi;

Now we can match "Sit on a potato pan, Otis."

The Future: Perl 6

So those are the two workhorses of Perl's eccentric regex implementation. In Perl 6 (http://dev.perl.org/perl6/), these features won't disappear. In fact, the code evaluation assertion will probably become increasingly prevalent. Here's a quick taste of how Perl 6 will make these assertions easier to write and simpler to understand.

Code Evaluation

In Perl 6, regex metacharacters will change meaning to make things clearer. Code evaluation assertions are now just enclosed by curly braces. Our regex earlier that matches a number that is less than 100 could be written in Perl 6 like this:

/ ^ (\d+) $ { $1 < 100 or fail } /

The first thing you should notice is that, with all that whitespace, there is no /x modifier on my regex. That's because it will be the default. Next, you can see the code evaluation is { code }, which looks like a closure (because it is). Finally, even though this is merely a code evaluation, it can affect the regex by calling fail.

Delayed Execution

But the syntax { COND or fail } has a shorthand in Perl 6, by way of the new look of the delayed execution assertion: <...>. You'll be able to use <$rx> to get the same effect as (??{ $rx }). A special case of this assertion is <(...)>, which regards its internals as an expression that causes the match to fail at its current point if it evaluates to false:

/ ^ (\d+) $ <( $1 < 100 )> /

Backtracking Control

Finally, a note on efficiency. Our regex here is currently less efficient than it should be. After a number like "300" fails, Perl backtracks the \d+ to match "30," which fails because "30" is followed by "0," not the end of the string. But Perl still does this needless backtracking. In Perl 5, we would use (?> (\d+) ), the "cut" assertion, which prevents internal backtracking.

In Perl 6, you'll be able to say it more succinctly with a colon (or two, or three). In our case, we'll want two colons, which says that this set of alternatives is all to fail if we have to backtrack past the ::. Since our regex has no alternatives in it, it means this match will fail if the :: is backtracked past.

/ ^ (\d+) $ :: <( $1 < 100 )> /

We can get even sleeker—if we use <commit> instead of ::, the entire regex will fail, not just the match at the current location. This will eliminate our need for anchors:

/ (\d+) <commit> <( $1 < 100 )> /

Food for Thought

With your new knowledge, you should try tackling these two tasks. First, construct a regex that determines the longest sequence of unique characters in a string. Now, decipher the goal of this regex. It uses both kinds of code assertions, the if-then assertion, and the $^R variable:

my %data = $str =~ m{
  (?{ [ "", "", "" ] })
  ( [^\s=]+ ) \s* = \s*
  (?(?=["'])
    (?{
      my $c = substr($_, $+[0], 1);
      [ $c, "[^\Q$c\E]*", $c ]
    })
    |
    (?{ [ "", '\S+', "" ] })
  )
  (??{ $^R->[0] })
  ((??{ $^R->[1] }))
  (??{ $^R->[2] })
}xg;

To see its dissection, go to http://japhy.perlmonk.org/regexes/.

TPJ