Diffstat (limited to 'contrib/perl5/pod/perlretut.pod')
-rw-r--r-- | contrib/perl5/pod/perlretut.pod | 2504 |
1 files changed, 0 insertions, 2504 deletions
diff --git a/contrib/perl5/pod/perlretut.pod b/contrib/perl5/pod/perlretut.pod deleted file mode 100644 index fa6479c..0000000 --- a/contrib/perl5/pod/perlretut.pod +++ /dev/null @@ -1,2504 +0,0 @@ -=head1 NAME - -perlretut - Perl regular expressions tutorial - -=head1 DESCRIPTION - -This page provides a basic tutorial on understanding, creating and -using regular expressions in Perl. It serves as a complement to the -reference page on regular expressions L<perlre>. Regular expressions -are an integral part of the C<m//>, C<s///>, C<qr//> and C<split> -operators and so this tutorial also overlaps with -L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>. - -Perl is widely renowned for excellence in text processing, and regular -expressions are one of the big factors behind this fame. Perl regular -expressions display an efficiency and flexibility unknown in most -other computer languages. Mastering even the basics of regular -expressions will allow you to manipulate text with surprising ease. - -What is a regular expression? A regular expression is simply a string -that describes a pattern. Patterns are in common use these days; -examples are the patterns typed into a search engine to find web pages -and the patterns used to list files in a directory, e.g., C<ls *.txt> -or C<dir *.*>. In Perl, the patterns described by regular expressions -are used to search strings, extract desired parts of strings, and to -do search and replace operations. - -Regular expressions have the undeserved reputation of being abstract -and difficult to understand. Regular expressions are constructed using -simple concepts like conditionals and loops and are no more difficult -to understand than the corresponding C<if> conditionals and C<while> -loops in the Perl language itself. In fact, the main challenge in -learning regular expressions is just getting used to the terse -notation used to express these concepts. 
- -This tutorial flattens the learning curve by discussing regular -expression concepts, along with their notation, one at a time and with -many examples. The first part of the tutorial will progress from the -simplest word searches to the basic regular expression concepts. If -you master the first part, you will have all the tools needed to solve -about 98% of your needs. The second part of the tutorial is for those -comfortable with the basics and hungry for more power tools. It -discusses the more advanced regular expression operators and -introduces the latest cutting edge innovations in 5.6.0. - -A note: to save time, 'regular expression' is often abbreviated as -regexp or regex. Regexp is a more natural abbreviation than regex, but -is harder to pronounce. The Perl pod documentation is evenly split on -regexp vs regex; in Perl, there is more than one way to abbreviate it. -We'll use regexp in this tutorial. - -=head1 Part 1: The basics - -=head2 Simple word matching - -The simplest regexp is simply a word, or more generally, a string of -characters. A regexp consisting of a word matches any string that -contains that word: - - "Hello World" =~ /World/; # matches - -What is this perl statement all about? C<"Hello World"> is a simple -double quoted string. C<World> is the regular expression and the -C<//> enclosing C</World/> tells perl to search a string for a match. -The operator C<=~> associates the string with the regexp match and -produces a true value if the regexp matched, or false if the regexp -did not match. In our case, C<World> matches the second word in -C<"Hello World">, so the expression is true. Expressions like this -are useful in conditionals: - - if ("Hello World" =~ /World/) { - print "It matches\n"; - } - else { - print "It doesn't match\n"; - } - -There are useful variations on this theme. 
The sense of the match can -be reversed by using the C<!~> operator: - - if ("Hello World" !~ /World/) { - print "It doesn't match\n"; - } - else { - print "It matches\n"; - } - -The literal string in the regexp can be replaced by a variable: - - $greeting = "World"; - if ("Hello World" =~ /$greeting/) { - print "It matches\n"; - } - else { - print "It doesn't match\n"; - } - -If you're matching against the special default variable C<$_>, the -C<$_ =~> part can be omitted: - - $_ = "Hello World"; - if (/World/) { - print "It matches\n"; - } - else { - print "It doesn't match\n"; - } - -And finally, the C<//> default delimiters for a match can be changed -to arbitrary delimiters by putting an C<'m'> out front: - - "Hello World" =~ m!World!; # matches, delimited by '!' - "Hello World" =~ m{World}; # matches, note the matching '{}' - "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin', - # '/' becomes an ordinary char - -C</World/>, C<m!World!>, and C<m{World}> all represent the -same thing. When, e.g., C<""> is used as a delimiter, the forward -slash C<'/'> becomes an ordinary character and can be used in a regexp -without trouble. - -Let's consider how different regexps would match C<"Hello World">: - - "Hello World" =~ /world/; # doesn't match - "Hello World" =~ /o W/; # matches - "Hello World" =~ /oW/; # doesn't match - "Hello World" =~ /World /; # doesn't match - -The first regexp C<world> doesn't match because regexps are -case-sensitive. The second regexp matches because the substring -S<C<'o W'> > occurs in the string S<C<"Hello World"> >. The space -character ' ' is treated like any other character in a regexp and is -needed to match in this case. The lack of a space character is the -reason the third regexp C<'oW'> doesn't match. The fourth regexp -C<'World '> doesn't match because there is a space at the end of the -regexp, but not at the end of the string. 
The lesson here is that -regexps must match a part of the string I<exactly> in order for the -statement to be true. - -If a regexp matches in more than one place in the string, perl will -always match at the earliest possible point in the string: - - "Hello World" =~ /o/; # matches 'o' in 'Hello' - "That hat is red" =~ /hat/; # matches 'hat' in 'That' - -With respect to character matching, there are a few more points you -need to know about. First of all, not all characters can be used 'as -is' in a match. Some characters, called B<metacharacters>, are reserved -for use in regexp notation. The metacharacters are - - {}[]()^$.|*+?\ - -The significance of each of these will be explained -in the rest of the tutorial, but for now, it is important only to know -that a metacharacter can be matched by putting a backslash before it: - - "2+2=4" =~ /2+2/; # doesn't match, + is a metacharacter - "2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary + - "The interval is [0,1)." =~ /[0,1)./ # is a syntax error! - "The interval is [0,1)." =~ /\[0,1\)\./ # matches - "/usr/bin/perl" =~ /\/usr\/bin\/perl/; # matches - -In the last regexp, the forward slash C<'/'> is also backslashed, -because it is used to delimit the regexp. This can lead to LTS -(leaning toothpick syndrome), however, and it is often more readable -to change delimiters. - - -The backslash character C<'\'> is a metacharacter itself and needs to -be backslashed: - - 'C:\WIN32' =~ /C:\\WIN/; # matches - -In addition to the metacharacters, there are some ASCII characters -which don't have printable character equivalents and are instead -represented by B<escape sequences>. Common examples are C<\t> for a -tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a -bell. If your string is better thought of as a sequence of arbitrary -bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape -sequence, e.g., C<\x1B> may be a more natural representation for your -bytes. 
Here are some examples of escapes: - - "1000\t2000" =~ m(0\t2) # matches - "1000\n2000" =~ /0\n20/ # matches - "1000\t2000" =~ /\000\t2/ # doesn't match, "0" ne "\000" - "cat" =~ /\143\x61\x74/ # matches, but a weird way to spell cat - -If you've been around Perl a while, all this talk of escape sequences -may seem familiar. Similar escape sequences are used in double-quoted -strings and in fact the regexps in Perl are mostly treated as -double-quoted strings. This means that variables can be used in -regexps as well. Just like double-quoted strings, the values of the -variables in the regexp will be substituted in before the regexp is -evaluated for matching purposes. So we have: - - $foo = 'house'; - 'housecat' =~ /$foo/; # matches - 'cathouse' =~ /cat$foo/; # matches - 'housecat' =~ /${foo}cat/; # matches - -So far, so good. With the knowledge above you can already perform -searches with just about any literal string regexp you can dream up. -Here is a I<very simple> emulation of the Unix grep program: - - % cat > simple_grep - #!/usr/bin/perl - $regexp = shift; - while (<>) { - print if /$regexp/; - } - ^D - - % chmod +x simple_grep - - % simple_grep abba /usr/dict/words - Babbage - cabbage - cabbages - sabbath - Sabbathize - Sabbathizes - sabbatical - scabbard - scabbards - -This program is easy to understand. C<#!/usr/bin/perl> is the standard -way to invoke a perl program from the shell. -S<C<$regexp = shift;> > saves the first command line argument as the -regexp to be used, leaving the rest of the command line arguments to -be treated as files. S<C<< while (<>) >> > loops over all the lines in -all the files. For each line, S<C<print if /$regexp/;> > prints the -line if the regexp matches the line. In this line, both C<print> and -C</$regexp/> use the default variable C<$_> implicitly. - -With all of the regexps above, if the regexp matched anywhere in the -string, it was considered a match. 
Sometimes, however, we'd like to -specify I<where> in the string the regexp should try to match. To do -this, we would use the B<anchor> metacharacters C<^> and C<$>. The -anchor C<^> means match at the beginning of the string and the anchor -C<$> means match at the end of the string, or before a newline at the -end of the string. Here is how they are used: - - "housekeeper" =~ /keeper/; # matches - "housekeeper" =~ /^keeper/; # doesn't match - "housekeeper" =~ /keeper$/; # matches - "housekeeper\n" =~ /keeper$/; # matches - -The second regexp doesn't match because C<^> constrains C<keeper> to -match only at the beginning of the string, but C<"housekeeper"> has -keeper starting in the middle. The third regexp does match, since the -C<$> constrains C<keeper> to match only at the end of the string. - -When both C<^> and C<$> are used at the same time, the regexp has to -match both the beginning and the end of the string, i.e., the regexp -matches the whole string. Consider - - "keeper" =~ /^keep$/; # doesn't match - "keeper" =~ /^keeper$/; # matches - "" =~ /^$/; # ^$ matches an empty string - -The first regexp doesn't match because the string has more to it than -C<keep>. Since the second regexp is exactly the string, it -matches. Using both C<^> and C<$> in a regexp forces the complete -string to match, so it gives you complete control over which strings -match and which don't. Suppose you are looking for a fellow named -bert, off in a string by himself: - - "dogbert" =~ /bert/; # matches, but not what you want - - "dilbert" =~ /^bert/; # doesn't match, but .. - "bertram" =~ /^bert/; # matches, so still not good enough - - "bertram" =~ /^bert$/; # doesn't match, good - "dilbert" =~ /^bert$/; # doesn't match, good - "bert" =~ /^bert$/; # matches, perfect - -Of course, in the case of a literal string, one could just as easily -use the string equivalence S<C<$string eq 'bert'> > and it would be -more efficient. 
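One difference between the two approaches is worth seeing in action. Because C<$> also matches before a newline at the end of the string, C</^bert$/> accepts C<"bert\n"> while C<eq> does not. The following is a minimal sketch of that comparison; the helper names C<is_bert_regexp> and C<is_bert_eq> are made up for illustration and are not part of this tutorial:

```perl
use strict;
use warnings;

# Hypothetical helpers (illustration only): test a string against the
# anchored regexp and against plain string equality.
sub is_bert_regexp { my ($s) = @_; return $s =~ /^bert$/ ? 1 : 0; }
sub is_bert_eq     { my ($s) = @_; return $s eq 'bert'   ? 1 : 0; }

for my $s ("bert", "bert\n", "bertram", "dilbert") {
    # Show the newline as a visible \n in the output.
    (my $shown = $s) =~ s/\n/\\n/g;
    printf "%-10s regexp=%d eq=%d\n", "'$shown'",
        is_bert_regexp($s), is_bert_eq($s);
}
```

Only the C<"bert\n"> row should differ between the two columns, so a line read from a file without C<chomp> passes the regexp test but fails the C<eq> test.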
The C<^...$> regexp really becomes useful when we -add in the more powerful regexp tools below. - -=head2 Using character classes - -Although one can already do quite a lot with the literal string -regexps above, we've only scratched the surface of regular expression -technology. In this and subsequent sections we will introduce regexp -concepts (and associated metacharacter notations) that will allow a -regexp to not just represent a single character sequence, but a I<whole -class> of them. - -One such concept is that of a B<character class>. A character class -allows a set of possible characters, rather than just a single -character, to match at a particular point in a regexp. Character -classes are denoted by brackets C<[...]>, with the set of characters -to be possibly matched inside. Here are some examples: - - /cat/; # matches 'cat' - /[bcr]at/; # matches 'bat', 'cat', or 'rat' - /item[0123456789]/; # matches 'item0' or ... or 'item9' - "abc" =~ /[cab]/; # matches 'a' - -In the last statement, even though C<'c'> is the first character in -the class, C<'a'> matches because the first character position in the -string is the earliest point at which the regexp can match. - - /[yY][eE][sS]/; # match 'yes' in a case-insensitive way - # 'yes', 'Yes', 'YES', etc. - -This regexp displays a common task: perform a case-insensitive -match. Perl provides a way of avoiding all those brackets by simply -appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;> -can be rewritten as C</yes/i;>. The C<'i'> stands for -case-insensitive and is an example of a B<modifier> of the matching -operation. We will meet other modifiers later in the tutorial. - -We saw in the section above that there were ordinary characters, which -represented themselves, and special characters, which needed a -backslash C<\> to represent themselves. 
The same is true in a -character class, but the sets of ordinary and special characters -inside a character class are different than those outside a character -class. The special characters for a character class are C<-]\^$>. C<]> -is special because it denotes the end of a character class. C<$> is -special because it denotes a scalar variable. C<\> is special because -it is used in escape sequences, just like above. Here is how the -special characters C<]$\> are handled: - - /[\]c]def/; # matches ']def' or 'cdef' - $x = 'bcr'; - /[$x]at/; # matches 'bat', 'cat', or 'rat' - /[\$x]at/; # matches '$at' or 'xat' - /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat' - -The last two are a little tricky. In C<[\$x]>, the backslash protects -the dollar sign, so the character class has two members C<$> and C<x>. -In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a -variable and substituted in double quote fashion. - -The special character C<'-'> acts as a range operator within character -classes, so that a contiguous set of characters can be written as a -range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]> -become the svelte C<[0-9]> and C<[a-z]>. Some examples are - - /item[0-9]/; # matches 'item0' or ... or 'item9' - /[0-9bx-z]aa/; # matches '0aa', ..., '9aa', - # 'baa', 'xaa', 'yaa', or 'zaa' - /[0-9a-fA-F]/; # matches a hexadecimal digit - /[0-9a-zA-Z_]/; # matches a "word" character, - # like those in a perl variable name - -If C<'-'> is the first or last character in a character class, it is -treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are -all equivalent. - -The special character C<^> in the first position of a character class -denotes a B<negated character class>, which matches any character but -those in the brackets. Both C<[...]> and C<[^...]> must match a -character, or the match fails. Then - - /[^a]at/; # doesn't match 'aat' or 'at', but matches - # all other 'bat', 'cat', '0at', '%at', etc. 
- /[^0-9]/; # matches a non-numeric character - /[a^]at/; # matches 'aat' or '^at'; here '^' is ordinary - -Now, even C<[0-9]> can be a bother to write multiple times, so in the -interest of saving keystrokes and making regexps more readable, Perl -has several abbreviations for common character classes: - -=over 4 - -=item * - -\d is a digit and represents [0-9] - -=item * - -\s is a whitespace character and represents [\ \t\r\n\f] - -=item * - -\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_] - -=item * - -\D is a negated \d; it represents any character but a digit [^0-9] - -=item * - -\S is a negated \s; it represents any non-whitespace character [^\s] - -=item * - -\W is a negated \w; it represents any non-word character [^\w] - -=item * - -The period '.' matches any character but "\n" - -=back - -The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside -of character classes. Here are some in use: - - /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format - /[\d\s]/; # matches any digit or whitespace character - /\w\W\w/; # matches a word char, followed by a - # non-word char, followed by a word char - /..rt/; # matches any two chars, followed by 'rt' - /end\./; # matches 'end.' - /end[.]/; # same thing, matches 'end.' - -Because a period is a metacharacter, it needs to be escaped to match -as an ordinary period. Because, for example, C<\d> and C<\w> are sets -of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in -fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as -C<[\W]>. Think DeMorgan's laws. - -An anchor useful in basic regexps is the S<B<word anchor> > -C<\b>. 
This matches a boundary between a word character and a non-word -character C<\w\W> or C<\W\w>: - - $x = "Housecat catenates house and cat"; - $x =~ /cat/; # matches cat in 'housecat' - $x =~ /\bcat/; # matches cat in 'catenates' - $x =~ /cat\b/; # matches cat in 'housecat' - $x =~ /\bcat\b/; # matches 'cat' at end of string - -Note in the last example, the end of the string is considered a word -boundary. - -You might wonder why C<'.'> matches everything but C<"\n"> - why not -every character? The reason is that often one is matching against -lines and would like to ignore the newline characters. For instance, -while the string C<"\n"> represents one line, we would like to think -of it as empty. Then - - "" =~ /^$/; # matches - "\n" =~ /^$/; # matches, "\n" is ignored - - "" =~ /./; # doesn't match; it needs a char - "" =~ /^.$/; # doesn't match; it needs a char - "\n" =~ /^.$/; # doesn't match; it needs a char other than "\n" - "a" =~ /^.$/; # matches - "a\n" =~ /^.$/; # matches, ignores the "\n" - -This behavior is convenient, because we usually want to ignore -newlines when we count and match characters in a line. Sometimes, -however, we want to keep track of newlines. We might even want C<^> -and C<$> to anchor at the beginning and end of lines within the -string, rather than just the beginning and end of the string. Perl -allows us to choose between ignoring and paying attention to newlines -by using the C<//s> and C<//m> modifiers. C<//s> and C<//m> stand for -single line and multi-line and they determine whether a string is to -be treated as one continuous string, or as a set of lines. The two -modifiers affect two aspects of how the regexp is interpreted: 1) how -the C<'.'> character class is defined, and 2) where the anchors C<^> -and C<$> are able to match. Here are the four possible combinations: - -=over 4 - -=item * - -no modifiers (//): Default behavior. C<'.'> matches any character -except C<"\n">. 
C<^> matches only at the beginning of the string and -C<$> matches only at the end or before a newline at the end. - -=item * - -s modifier (//s): Treat string as a single long line. C<'.'> matches -any character, even C<"\n">. C<^> matches only at the beginning of -the string and C<$> matches only at the end or before a newline at the -end. - -=item * - -m modifier (//m): Treat string as a set of multiple lines. C<'.'> -matches any character except C<"\n">. C<^> and C<$> are able to match -at the start or end of I<any> line within the string. - -=item * - -both s and m modifiers (//sm): Treat string as a single long line, but -detect multiple lines. C<'.'> matches any character, even -C<"\n">. C<^> and C<$>, however, are able to match at the start or end -of I<any> line within the string. - -=back - -Here are examples of C<//s> and C<//m> in action: - - $x = "There once was a girl\nWho programmed in Perl\n"; - - $x =~ /^Who/; # doesn't match, "Who" not at start of string - $x =~ /^Who/s; # doesn't match, "Who" not at start of string - $x =~ /^Who/m; # matches, "Who" at start of second line - $x =~ /^Who/sm; # matches, "Who" at start of second line - - $x =~ /girl.Who/; # doesn't match, "." doesn't match "\n" - $x =~ /girl.Who/s; # matches, "." matches "\n" - $x =~ /girl.Who/m; # doesn't match, "." doesn't match "\n" - $x =~ /girl.Who/sm; # matches, "." matches "\n" - -Most of the time, the default behavior is what is wanted, but C<//s> and -C<//m> are occasionally very useful. 
If C<//m> is being used, the start -of the string can still be matched with C<\A> and the end of string -can still be matched with the anchors C<\Z> (matches both the end and -the newline before, like C<$>), and C<\z> (matches only the end): - - $x =~ /^Who/m; # matches, "Who" at start of second line - $x =~ /\AWho/m; # doesn't match, "Who" is not at start of string - - $x =~ /girl$/m; # matches, "girl" at end of first line - $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string - - $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end - $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string - -We now know how to create choices among classes of characters in a -regexp. What about choices among words or character strings? Such -choices are described in the next section. - -=head2 Matching this or that - -Sometimes we would like our regexp to be able to match different -possible words or character strings. This is accomplished by using -the B<alternation> metacharacter C<|>. To match C<dog> or C<cat>, we -form the regexp C<dog|cat>. As before, perl will try to match the -regexp at the earliest possible point in the string. At each -character position, perl will first try to match the first -alternative, C<dog>. If C<dog> doesn't match, perl will then try the -next alternative, C<cat>. If C<cat> doesn't match either, then the -match fails and perl moves to the next position in the string. Some -examples: - - "cats and dogs" =~ /cat|dog|bird/; # matches "cat" - "cats and dogs" =~ /dog|cat|bird/; # matches "cat" - -Even though C<dog> is the first alternative in the second regexp, -C<cat> is able to match earlier in the string. - - "cats" =~ /c|ca|cat|cats/; # matches "c" - "cats" =~ /cats|cat|ca|c/; # matches "cats" - -Here, all the alternatives match at the first string position, so the -first alternative is the one that matches. 
If some of the -alternatives are truncations of the others, put the longest ones first -to give them a chance to match. - - "cab" =~ /a|b|c/ # matches "c" - # /a|b|c/ == /[abc]/ - -The last example points out that character classes are like -alternations of characters. At a given character position, the first -alternative that allows the regexp match to succeed will be the one -that matches. - -=head2 Grouping things and hierarchical matching - -Alternation allows a regexp to choose among alternatives, but by -itself it is unsatisfying. The reason is that each alternative is a whole -regexp, but sometimes we want alternatives for just part of a -regexp. For instance, suppose we want to search for housecats or -housekeepers. The regexp C<housecat|housekeeper> fits the bill, but is -inefficient because we had to type C<house> twice. It would be nice to -have parts of the regexp be constant, like C<house>, and some -parts have alternatives, like C<cat|keeper>. - -The B<grouping> metacharacters C<()> solve this problem. Grouping -allows parts of a regexp to be treated as a single unit. Parts of a -regexp are grouped by enclosing them in parentheses. Thus we could solve -the C<housecat|housekeeper> by forming the regexp as -C<house(cat|keeper)>. The regexp C<house(cat|keeper)> means match -C<house> followed by either C<cat> or C<keeper>. Some more examples -are - - /(a|b)b/; # matches 'ab' or 'bb' - /(ac|b)b/; # matches 'acb' or 'bb' - /(^a|b)c/; # matches 'ac' at start of string or 'bc' anywhere - /(a|[bc])d/; # matches 'ad', 'bd', or 'cd' - - /house(cat|)/; # matches either 'housecat' or 'house' - /house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or - # 'house'. Note groups can be nested. 
- - /(19|20|)\d\d/; # match years 19xx, 20xx, or the Y2K problem, xx - "20" =~ /(19|20|)\d\d/; # matches the null alternative '()\d\d', - # because '20\d\d' can't match - -Alternations behave the same way in groups as out of them: at a given -string position, the leftmost alternative that allows the regexp to -match is taken. So in the last example at the first string position, -C<"20"> matches the second alternative, but there is nothing left over -to match the next two digits C<\d\d>. So perl moves on to the next -alternative, which is the null alternative and that works, since -C<"20"> is two digits. - -The process of trying one alternative, seeing if it matches, and -moving on to the next alternative if it doesn't, is called -B<backtracking>. The term 'backtracking' comes from the idea that -matching a regexp is like a walk in the woods. Successfully matching -a regexp is like arriving at a destination. There are many possible -trailheads, one for each string position, and each one is tried in -order, left to right. From each trailhead there may be many paths, -some of which get you there, and some which are dead ends. When you -walk along a trail and hit a dead end, you have to backtrack along the -trail to an earlier point to try another trail. If you hit your -destination, you stop immediately and forget about trying all the -other trails. You are persistent, and only if you have tried all the -trails from all the trailheads and not arrived at your destination, do -you declare failure. To be concrete, here is a step-by-step analysis -of what perl does when it tries to match the regexp - - "abcde" =~ /(abd|abc)(df|d|de)/; - -=over 4 - -=item 0 - -Start with the first letter in the string 'a'. - -=item 1 - -Try the first alternative in the first group 'abd'. - -=item 2 - -Match 'a' followed by 'b'. So far so good. - -=item 3 - -'d' in the regexp doesn't match 'c' in the string - a dead -end. 
So backtrack two characters and pick the second alternative in -the first group 'abc'. - -=item 4 - -Match 'a' followed by 'b' followed by 'c'. We are on a roll -and have satisfied the first group. Set $1 to 'abc'. - -=item 5 - -Move on to the second group and pick the first alternative -'df'. - -=item 6 - -Match the 'd'. - -=item 7 - -'f' in the regexp doesn't match 'e' in the string, so a dead -end. Backtrack one character and pick the second alternative in the -second group 'd'. - -=item 8 - -'d' matches. The second grouping is satisfied, so set $2 to -'d'. - -=item 9 - -We are at the end of the regexp, so we are done! We have -matched 'abcd' out of the string "abcde". - -=back - -There are a couple of things to note about this analysis. First, the -third alternative in the second group 'de' also allows a match, but we -stopped before we got to it - at a given character position, leftmost -wins. Second, we were able to get a match at the first character -position of the string 'a'. If there were no matches at the first -position, perl would move to the second character position 'b' and -attempt the match all over again. Only when all possible paths at all -possible character positions have been exhausted does perl give -up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;> > to be false. - -Even with all this work, regexp matching happens remarkably fast. To -speed things up, during compilation stage, perl compiles the regexp -into a compact sequence of opcodes that can often fit inside a -processor cache. When the code is executed, these opcodes can then run -at full throttle and search very quickly. - -=head2 Extracting matches - -The grouping metacharacters C<()> also serve another completely -different function: they allow the extraction of the parts of a string -that matched. This is very useful to find out what matched and for -text processing in general. 
For each grouping, the part that matched -inside goes into the special variables C<$1>, C<$2>, etc. They can be -used just as ordinary variables: - - # extract hours, minutes, seconds - $time =~ /(\d\d):(\d\d):(\d\d)/; # match hh:mm:ss format - $hours = $1; - $minutes = $2; - $seconds = $3; - -Now, we know that in scalar context, -S<C<$time =~ /(\d\d):(\d\d):(\d\d)/> > returns a true or false -value. In list context, however, it returns the list of matched values -C<($1,$2,$3)>. So we could write the code more compactly as - - # extract hours, minutes, seconds - ($hours, $minutes, $second) = ($time =~ /(\d\d):(\d\d):(\d\d)/); - -If the groupings in a regexp are nested, C<$1> gets the group with the -leftmost opening parenthesis, C<$2> the next opening parenthesis, -etc. For example, here is a complex regexp and the matching variables -indicated below it: - - /(ab(cd|ef)((gi)|j))/; - 1 2 34 - -so that if the regexp matched, e.g., C<$2> would contain 'cd' or 'ef'. -For convenience, perl sets C<$+> to the highest numbered C<$1>, C<$2>, -... that got assigned. - -Closely associated with the matching variables C<$1>, C<$2>, ... are -the B<backreferences> C<\1>, C<\2>, ... . Backreferences are simply -matching variables that can be used I<inside> a regexp. This is a -really nice feature - what matches later in a regexp can depend on -what matched earlier in the regexp. Suppose we wanted to look -for doubled words in text, like 'the the'. The following regexp finds -all 3-letter doubles with a space in between: - - /(\w\w\w)\s\1/; - -The grouping assigns a value to \1, so that the same 3 letter sequence -is used for both parts. Here are some words with repeated parts: - - % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words - beriberi - booboo - coco - mama - murmur - papa - -The regexp has a single grouping which considers 4-letter -combinations, then 3-letter combinations, etc. and uses C<\1> to look for -a repeat. 
Although C<$1> and C<\1> represent the same thing, care should be -taken to use matched variables C<$1>, C<$2>, ... only outside a regexp -and backreferences C<\1>, C<\2>, ... only inside a regexp; not doing -so may lead to surprising and/or undefined results. - -In addition to what was matched, Perl 5.6.0 also provides the -positions of what was matched with the C<@-> and C<@+> -arrays. C<$-[0]> is the position of the start of the entire match and -C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the -position of the start of the C<$n> match and C<$+[n]> is the position -of the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>. Then -this code - - $x = "Mmm...donut, thought Homer"; - $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches - foreach $expr (1..$#-) { - print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n"; - } - -prints - - Match 1: 'Mmm' at position (0,3) - Match 2: 'donut' at position (6,11) - -Even if there are no groupings in a regexp, it is still possible to -find out what exactly matched in a string. If you use them, perl -will set C<$`> to the part of the string before the match, will set C<$&> -to the part of the string that matched, and will set C<$'> to the part -of the string after the match. An example: - - $x = "the cat caught the mouse"; - $x =~ /cat/; # $` = 'the ', $& = 'cat', $' = ' caught the mouse' - $x =~ /the/; # $` = '', $& = 'the', $' = ' cat caught the mouse' - -In the second match, S<C<$` = ''> > because the regexp matched at the -first character position in the string and stopped, it never saw the -second 'the'. It is important to note that using C<$`> and C<$'> -slows down regexp matching quite a bit, and C< $& > slows it down to a -lesser extent, because if they are used in one regexp in a program, -they are generated for I<all> regexps in the program. So if raw -performance is a goal of your application, they should be avoided. 
-If you need them, use C<@-> and C<@+> instead: - - $` is the same as substr( $x, 0, $-[0] ) - $& is the same as substr( $x, $-[0], $+[0]-$-[0] ) - $' is the same as substr( $x, $+[0] ) - -=head2 Matching repetitions - -The examples in the previous section display an annoying weakness. We -were only matching 3-letter words, or syllables of 4 letters or -less. We'd like to be able to match words or syllables of any length, -without writing out tedious alternatives like -C<\w\w\w\w|\w\w\w|\w\w|\w>. - -This is exactly the problem the B<quantifier> metacharacters C<?>, -C<*>, C<+>, and C<{}> were created for. They allow us to determine the -number of repeats of a portion of a regexp we consider to be a -match. Quantifiers are put immediately after the character, character -class, or grouping that we want to specify. They have the following -meanings: - -=over 4 - -=item * - -C<a?> = match 'a' 1 or 0 times - -=item * - -C<a*> = match 'a' 0 or more times, i.e., any number of times - -=item * - -C<a+> = match 'a' 1 or more times, i.e., at least once - -=item * - -C<a{n,m}> = match at least C<n> times, but not more than C<m> -times. - -=item * - -C<a{n,}> = match at least C<n> or more times - -=item * - -C<a{n}> = match exactly C<n> times - -=back - -Here are some examples: - - /[a-z]+\s+\d*/; # match a lowercase word, at least some space, and - # any number of digits - /(\w+)\s+\1/; # match doubled words of arbitrary length - /y(es)?/i; # matches 'y', 'Y', or a case-insensitive 'yes' - $year =~ /\d{2,4}/; # make sure year is at least 2 but not more - # than 4 digits - $year =~ /\d{4}|\d{2}/; # better match; throw out 3 digit dates - $year =~ /\d{2}(\d{2})?/; # same thing written differently. However, - # this produces $1 and the other does not. - - % simple_grep '^(\w+)\1$' /usr/dict/words # isn't this easier? 
- beriberi - booboo - coco - mama - murmur - papa - -For all of these quantifiers, perl will try to match as much of the -string as possible, while still allowing the regexp to succeed. Thus -with C</a?.../>, perl will first try to match the regexp with the C<a> -present; if that fails, perl will try to match the regexp without the -C<a> present. For the quantifier C<*>, we get the following: - - $x = "the cat in the hat"; - $x =~ /^(.*)(cat)(.*)$/; # matches, - # $1 = 'the ' - # $2 = 'cat' - # $3 = ' in the hat' - -This is what we might expect: the match finds the only C<cat> in the -string and locks onto it. Consider, however, this regexp: - - $x =~ /^(.*)(at)(.*)$/; # matches, - # $1 = 'the cat in the h' - # $2 = 'at' - # $3 = '' (0 matches) - -One might initially guess that perl would find the C<at> in C<cat> and -stop there, but that wouldn't give the longest possible string to the -first quantifier C<.*>. Instead, the first quantifier C<.*> grabs as -much of the string as possible while still having the regexp match. In -this example, that means matching the C<at> element against the final C<at> -in the string. The other important principle illustrated here is that -when there are two or more elements in a regexp, the I<leftmost> -quantifier, if there is one, gets to grab as much of the string as -possible, leaving the rest of the regexp to fight over scraps. Thus in -our example, the first quantifier C<.*> grabs most of the string, while -the second quantifier C<.*> gets the empty string. Quantifiers that -grab as much of the string as possible are called B<maximal match> or -B<greedy> quantifiers. - -When a regexp can match a string in several different ways, we can use -the principles above to predict which way the regexp will match: - -=over 4 - -=item * - -Principle 0: Taken as a whole, any regexp will be matched at the -earliest possible position in the string.
- -=item * - -Principle 1: In an alternation C<a|b|c...>, the leftmost alternative -that allows a match for the whole regexp will be the one used. - -=item * - -Principle 2: The maximal matching quantifiers C<?>, C<*>, C<+> and -C<{n,m}> will in general match as much of the string as possible while -still allowing the whole regexp to match. - -=item * - -Principle 3: If there are two or more elements in a regexp, the -leftmost greedy quantifier, if any, will match as much of the string -as possible while still allowing the whole regexp to match. The next -leftmost greedy quantifier, if any, will try to match as much of the -string remaining available to it as possible, while still allowing the -whole regexp to match. And so on, until all the regexp elements are -satisfied. - -=back - -As we have seen above, Principle 0 overrides the others - the regexp -will be matched as early as possible, with the other principles -determining how the regexp matches at that earliest character -position. - -Here is an example of these principles in action: - - $x = "The programming republic of Perl"; - $x =~ /^(.+)(e|r)(.*)$/; # matches, - # $1 = 'The programming republic of Pe' - # $2 = 'r' - # $3 = 'l' - -This regexp matches at the earliest string position, C<'T'>. One -might think that C<e>, being leftmost in the alternation, would be -matched, but C<r> produces the longest string in the first quantifier. - - $x =~ /(m{1,2})(.*)$/; # matches, - # $1 = 'mm' - # $2 = 'ing republic of Perl' - -Here, the earliest possible match is at the first C<'m'> in -C<programming>. C<m{1,2}> is the first quantifier, so it gets to match -a maximal C<mm>. - - $x =~ /.*(m{1,2})(.*)$/; # matches, - # $1 = 'm' - # $2 = 'ing republic of Perl' - -Here, the regexp matches at the start of the string. The first -quantifier C<.*> grabs as much as possible, leaving just a single -C<'m'> for the second quantifier C<m{1,2}>.
- - $x =~ /(.?)(m{1,2})(.*)$/; # matches, - # $1 = 'a' - # $2 = 'mm' - # $3 = 'ing republic of Perl' - -Here, C<.?> eats its maximal one character at the earliest possible -position in the string, C<'a'> in C<programming>, leaving C<m{1,2}> -the opportunity to match both C<m>'s. Finally, - - "aXXXb" =~ /(X*)/; # matches with $1 = '' - -because it can match zero copies of C<'X'> at the beginning of the -string. If you definitely want to match at least one C<'X'>, use -C<X+>, not C<X*>. - -Sometimes greed is not good. At times, we would like quantifiers to -match a I<minimal> piece of string, rather than a maximal piece. For -this purpose, Larry Wall created the S<B<minimal match> > or -B<non-greedy> quantifiers C<??>,C<*?>, C<+?>, and C<{}?>. These are -the usual quantifiers with a C<?> appended to them. They have the -following meanings: - -=over 4 - -=item * - -C<a??> = match 'a' 0 or 1 times. Try 0 first, then 1. - -=item * - -C<a*?> = match 'a' 0 or more times, i.e., any number of times, -but as few times as possible - -=item * - -C<a+?> = match 'a' 1 or more times, i.e., at least once, but -as few times as possible - -=item * - -C<a{n,m}?> = match at least C<n> times, not more than C<m> -times, as few times as possible - -=item * - -C<a{n,}?> = match at least C<n> times, but as few times as -possible - -=item * - -C<a{n}?> = match exactly C<n> times. Because we match exactly -C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for -notational consistency. - -=back - -Let's look at the example above, but with minimal quantifiers: - - $x = "The programming republic of Perl"; - $x =~ /^(.+?)(e|r)(.*)$/; # matches, - # $1 = 'Th' - # $2 = 'e' - # $3 = ' programming republic of Perl' - -The minimal string that will allow both the start of the string C<^> -and the alternation to match is C<Th>, with the alternation C<e|r> -matching C<e>. The second quantifier C<.*> is free to gobble up the -rest of the string. 
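The greedy and minimal versions of this regexp can be compared side by side. The following is a minimal sketch, not part of the original tutorial: the helper name C<captures> is made up for illustration, and it uses the regexp quote C<qr//> (covered in Part 2) only as a convenient way to pass a regexp to a subroutine.

```perl
use strict;
use warnings;

# Hypothetical helper: return the captured groups for a regexp applied
# to a string, or an empty list if the regexp does not match.
sub captures {
    my ($string, $regexp) = @_;
    return $string =~ $regexp;    # list context: ($1, $2, ...)
}

my $x = "The programming republic of Perl";

# Same regexp, differing only in the '?' after the first quantifier
my @greedy  = captures($x, qr/^(.+)(e|r)(.*)$/);
my @minimal = captures($x, qr/^(.+?)(e|r)(.*)$/);

print "greedy:  '$greedy[0]' '$greedy[1]' '$greedy[2]'\n";
print "minimal: '$minimal[0]' '$minimal[1]' '$minimal[2]'\n";
```

The only difference between the two regexps is the C<?> after C<.+>, yet it changes which alternative of C<e|r> ends up matching.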
- - $x =~ /(m{1,2}?)(.*?)$/; # matches, - # $1 = 'm' - # $2 = 'ming republic of Perl' - -The first string position that this regexp can match is at the first -C<'m'> in C<programming>. At this position, the minimal C<m{1,2}?> -matches just one C<'m'>. Although the second quantifier C<.*?> would -prefer to match no characters, it is constrained by the end-of-string -anchor C<$> to match the rest of the string. - - $x =~ /(.*?)(m{1,2}?)(.*)$/; # matches, - # $1 = 'The progra' - # $2 = 'm' - # $3 = 'ming republic of Perl' - -In this regexp, you might expect the first minimal quantifier C<.*?> -to match the empty string, because it is not constrained by a C<^> -anchor to match the beginning of the word. Principle 0 applies here, -however. Because it is possible for the whole regexp to match at the -start of the string, it I<will> match at the start of the string. Thus -the first quantifier has to match everything up to the first C<m>. The -second minimal quantifier matches just one C<m> and the third -quantifier matches the rest of the string. - - $x =~ /(.??)(m{1,2})(.*)$/; # matches, - # $1 = 'a' - # $2 = 'mm' - # $3 = 'ing republic of Perl' - -Just as in the previous regexp, the first quantifier C<.??> can match -earliest at position C<'a'>, so it does. The second quantifier is -greedy, so it matches C<mm>, and the third matches the rest of the -string. - -We can modify principle 3 above to take into account non-greedy -quantifiers: - -=over 4 - -=item * - -Principle 3: If there are two or more elements in a regexp, the -leftmost greedy (non-greedy) quantifier, if any, will match as much -(little) of the string as possible while still allowing the whole -regexp to match. The next leftmost greedy (non-greedy) quantifier, if -any, will try to match as much (little) of the string remaining -available to it as possible, while still allowing the whole regexp to -match. And so on, until all the regexp elements are satisfied. 
- -=back - -Just like alternation, quantifiers are also susceptible to -backtracking. Here is a step-by-step analysis of the example - - $x = "the cat in the hat"; - $x =~ /^(.*)(at)(.*)$/; # matches, - # $1 = 'the cat in the h' - # $2 = 'at' - # $3 = '' (0 matches) - -=over 4 - -=item 0 - -Start with the first letter in the string 't'. - -=item 1 - -The first quantifier '.*' starts out by matching the whole -string 'the cat in the hat'. - -=item 2 - -'a' in the regexp element 'at' doesn't match the end of the -string. Backtrack one character. - -=item 3 - -'a' in the regexp element 'at' still doesn't match the last -letter of the string 't', so backtrack one more character. - -=item 4 - -Now we can match the 'a' and the 't'. - -=item 5 - -Move on to the third element '.*'. Since we are at the end of -the string and '.*' can match 0 times, assign it the empty string. - -=item 6 - -We are done! - -=back - -Most of the time, all this moving forward and backtracking happens -quickly and searching is fast. There are some pathological regexps, -however, whose execution time exponentially grows with the size of the -string. A typical structure that blows up in your face is of the form - - /(a|b+)*/; - -The problem is the nested indeterminate quantifiers. There are many -different ways of partitioning a string of length n between the C<+> -and C<*>: one repetition with C<b+> of length n, two repetitions with -the first C<b+> length k and the second with length n-k, m repetitions -whose bits add up to length n, etc. In fact there are an exponential -number of ways to partition a string as a function of length. A -regexp may get lucky and match early in the process, but if there is -no match, perl will try I<every> possibility before giving up. So be -careful with nested C<*>'s, C<{n,m}>'s, and C<+>'s. The book -I<Mastering regular expressions> by Jeffrey Friedl gives a wonderful -discussion of this and other efficiency issues. 
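One common way to defuse a nested quantifier is to rewrite the regexp so that each character can be claimed by only one quantifier. The sketch below is an illustration under assumptions, not a general recipe: the anchors and the trailing C<c> are added here so that the match can fail, and the subroutine names are hypothetical. It checks that the flat form C<[ab]*> gives the same verdict as the nested form C<(a|b+)*> on a few short strings.

```perl
use strict;
use warnings;

# Nested form from the text, anchored and followed by a literal 'c'
sub matches_nested { return $_[0] =~ /^(a|b+)*c$/ ? 1 : 0 }

# Flat form: a single quantifier over a character class, so there is
# only one way to divide a run of a's and b's
sub matches_flat   { return $_[0] =~ /^[ab]*c$/ ? 1 : 0 }

foreach my $string ("abbac", "bbbbbbbb", "c", "abxc") {
    printf "%-10s nested: %d  flat: %d\n",
        $string, matches_nested($string), matches_flat($string);
}
```

Both forms accept exactly the same strings here; the flat form simply leaves perl with far fewer possibilities to try when the match fails.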
- -=head2 Building a regexp - -At this point, we have all the basic regexp concepts covered, so let's -give a more involved example of a regular expression. We will build a -regexp that matches numbers. - -The first task in building a regexp is to decide what we want to match -and what we want to exclude. In our case, we want to match both -integers and floating point numbers and we want to reject any string -that isn't a number. - -The next task is to break the problem down into smaller problems that -are easily converted into a regexp. - -The simplest case is integers. These consist of a sequence of digits, -with an optional sign in front. The digits we can represent with -C<\d+> and the sign can be matched with C<[+-]>. Thus the integer -regexp is - - /[+-]?\d+/; # matches integers - -A floating point number potentially has a sign, an integral part, a -decimal point, a fractional part, and an exponent. One or more of these -parts is optional, so we need to check out the different -possibilities. Floating point numbers which are in proper form include -123., 0.345, .34, -1e6, and 25.4E-72. As with integers, the sign out -front is completely optional and can be matched by C<[+-]?>. We can -see that if there is no exponent, floating point numbers must have a -decimal point, otherwise they are integers. We might be tempted to -model these with C<\d*\.\d*>, but this would also match just a single -decimal point, which is not a number. So the three cases of floating -point number sans exponent are - - /[+-]?\d+\./; # 1., 321., etc. - /[+-]?\.\d+/; # .1, .234, etc. - /[+-]?\d+\.\d+/; # 1.0, 30.56, etc. - -These can be combined into a single regexp with a three-way alternation: - - /[+-]?(\d+\.\d+|\d+\.|\.\d+)/; # floating point, no exponent - -In this alternation, it is important to put C<'\d+\.\d+'> before -C<'\d+\.'>. If C<'\d+\.'> were first, the regexp would happily match that -and ignore the fractional part of the number. 
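The effect of the ordering can be checked directly. This is a small sketch with a hypothetical helper, C<matched_text>, which returns the matched substring using the C<@-> and C<@+> arrays; C<qr//> is used only to hand the two regexps to the helper.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text matched by a regexp,
# or undef if there is no match.
sub matched_text {
    my ($string, $regexp) = @_;
    return $string =~ $regexp
        ? substr($string, $-[0], $+[0] - $-[0])
        : undef;
}

my $good = qr/[+-]?(\d+\.\d+|\d+\.|\.\d+)/;   # order used in the text
my $bad  = qr/[+-]?(\d+\.|\d+\.\d+|\.\d+)/;   # '\d+\.' moved to the front

foreach my $number ("30.56", "-1.", ".234") {
    printf "%-6s good order: %-6s bad order: %s\n",
        $number, matched_text($number, $good), matched_text($number, $bad);
}
```

With C<\d+\.> first, C<30.56> matches only as C<30.>, because the leftmost alternative that lets the whole regexp succeed is the one used.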
- -Now consider floating point numbers with exponents. The key -observation here is that I<both> integers and numbers with decimal -points are allowed in front of an exponent. Then exponents, like the -overall sign, are independent of whether we are matching numbers with -or without decimal points, and can be 'decoupled' from the -mantissa. The overall form of the regexp now becomes clear: - - /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/; - -The exponent is an C<e> or C<E>, followed by an integer. So the -exponent regexp is - - /[eE][+-]?\d+/; # exponent - -Putting all the parts together, we get a regexp that matches numbers: - - /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/; # Ta da! - -Long regexps like this may impress your friends, but can be hard to -decipher. In complex situations like this, the C<//x> modifier for a -match is invaluable. It allows one to put nearly arbitrary whitespace -and comments into a regexp without affecting its meaning. Using it, -we can rewrite our 'extended' regexp in the more pleasing form - - /^ - [+-]? # first, match an optional sign - ( # then match integers or f.p. mantissas: - \d+\.\d+ # mantissa of the form a.b - |\d+\. # mantissa of the form a. - |\.\d+ # mantissa of the form .b - |\d+ # integer of the form a - ) - ([eE][+-]?\d+)? # finally, optionally match an exponent - $/x; - -If whitespace is mostly irrelevant, how does one include space -characters in an extended regexp? The answer is to backslash it -S<C<'\ '> > or put it in a character class S<C<[ ]> >. The same thing -goes for pound signs: use C<\#> or C<[#]>. For instance, Perl allows -a space between the sign and the mantissa/integer, and we could add -this to our regexp as follows: - - /^ - [+-]?\ * # first, match an optional sign *and space* - ( # then match integers or f.p. mantissas: - \d+\.\d+ # mantissa of the form a.b - |\d+\. # mantissa of the form a.
- |\.\d+ # mantissa of the form .b - |\d+ # integer of the form a - ) - ([eE][+-]?\d+)? # finally, optionally match an exponent - $/x; - -In this form, it is easier to see a way to simplify the -alternation. Alternatives 1, 2, and 4 all start with C<\d+>, so it -could be factored out: - - /^ - [+-]?\ * # first, match an optional sign - ( # then match integers or f.p. mantissas: - \d+ # start out with a ... - ( - \.\d* # mantissa of the form a.b or a. - )? # ? takes care of integers of the form a - |\.\d+ # mantissa of the form .b - ) - ([eE][+-]?\d+)? # finally, optionally match an exponent - $/x; - -or written in the compact form, - - /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/; - -This is our final regexp. To recap, we built a regexp by - -=over 4 - -=item * - -specifying the task in detail, - -=item * - -breaking down the problem into smaller parts, - -=item * - -translating the small parts into regexps, - -=item * - -combining the regexps, - -=item * - -and optimizing the final combined regexp. - -=back - -These are also the typical steps involved in writing a computer -program. This makes perfect sense, because regular expressions are -essentially programs written in a little computer language that specifies -patterns. - -=head2 Using regular expressions in Perl - -The last topic of Part 1 briefly covers how regexps are used in Perl -programs. Where do they fit into Perl syntax? - -We have already introduced the matching operator in its default -C</regexp/> and arbitrary delimiter C<m!regexp!> forms. We have used -the binding operator C<=~> and its negation C<!~> to test for string -matches. Associated with the matching operator, we have discussed the -single line C<//s>, multi-line C<//m>, case-insensitive C<//i> and -extended C<//x> modifiers. - -There are a few more things you might want to know about matching -operators.
First, we pointed out earlier that variables in regexps are -substituted before the regexp is evaluated: - - $pattern = 'Seuss'; - while (<>) { - print if /$pattern/; - } - -This will print any lines containing the word C<Seuss>. It is not as -efficient as it could be, however, because perl has to re-evaluate -C<$pattern> each time through the loop. If C<$pattern> won't be -changing over the lifetime of the script, we can add the C<//o> -modifier, which directs perl to only perform variable substitutions -once: - - #!/usr/bin/perl - # Improved simple_grep - $regexp = shift; - while (<>) { - print if /$regexp/o; # a good deal faster - } - -If you change C<$pattern> after the first substitution happens, perl -will ignore it. If you don't want any substitutions at all, use the -special delimiter C<m''>: - - $pattern = 'Seuss'; - while (<>) { - print if m'$pattern'; # matches '$pattern', not 'Seuss' - } - -C<m''> acts like single quotes on a regexp; all other C<m> delimiters -act like double quotes. If the regexp evaluates to the empty string, -the regexp in the I<last successful match> is used instead. So we have - - "dog" =~ /d/; # 'd' matches - "dogbert" =~ //; # this matches the 'd' regexp used before - -The final two modifiers C<//g> and C<//c> concern multiple matches. -The modifier C<//g> stands for global matching and allows the -matching operator to match within a string as many times as possible. -In scalar context, successive invocations against a string will have -C<//g> jump from match to match, keeping track of position in the -string as it goes along. You can get or set the position with the -C<pos()> function. - -The use of C<//g> is shown in the following example. Suppose we have -a string that consists of words separated by spaces.
If we know how -many words there are in advance, we could extract the words using -groupings: - - $x = "cat dog house"; # 3 words - $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches, - # $1 = 'cat' - # $2 = 'dog' - # $3 = 'house' - -But what if we had an indeterminate number of words? This is the sort -of task C<//g> was made for. To extract all words, form the simple -regexp C<(\w+)> and loop over all matches with C</(\w+)/g>: - - while ($x =~ /(\w+)/g) { - print "Word is $1, ends at position ", pos $x, "\n"; - } - -prints - - Word is cat, ends at position 3 - Word is dog, ends at position 7 - Word is house, ends at position 13 - -A failed match or changing the target string resets the position. If -you don't want the position reset after failure to match, add the -C<//c> modifier, as in C</regexp/gc>. The current position in the string is -associated with the string, not the regexp. This means that different -strings have different positions and their respective positions can be -set or read independently. - -In list context, C<//g> returns a list of matched groupings, or if -there are no groupings, a list of matches to the whole regexp. So if -we wanted just the words, we could use - - @words = ($x =~ /(\w+)/g); # matches, - # $words[0] = 'cat' - # $words[1] = 'dog' - # $words[2] = 'house' - -Closely associated with the C<//g> modifier is the C<\G> anchor. The -C<\G> anchor matches at the point where the previous C<//g> match left -off. C<\G> allows us to easily do context-sensitive matching: - - $metric = 1; # use metric units - ... - $x = <FILE>; # read in measurement - $x =~ /^([+-]?\d+)\s*/g; # get magnitude - $weight = $1; - if ($metric) { # error checking - print "Units error!" unless $x =~ /\Gkg\./g; - } - else { - print "Units error!" unless $x =~ /\Glbs\./g; - } - $x =~ /\G\s+(widget|sprocket)/g; # continue processing - -The combination of C<//g> and C<\G> allows us to process the string a -bit at a time and use arbitrary Perl logic to decide what to do next.
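The same piecewise approach can be wrapped in a subroutine. What follows is a sketch under assumptions rather than a continuation of the example above: the name C<parse_measurement> and the input format (magnitude, units followed by a period, then an item name) are made up for illustration. It uses C<//gc> so that a failed match leaves the position alone, and every step is anchored with C<\G> to pick up exactly where the previous one stopped.

```perl
use strict;
use warnings;

# Hypothetical parser for strings such as "12 kg. widget".
# Returns (magnitude, units, item), or an empty list if any piece
# is missing at the current position.
sub parse_measurement {
    my ($x) = @_;
    my ($weight, $units, $item);

    # pos($x) starts out undefined, so the first \G matches at position 0
    $x =~ /\G\s*([+-]?\d+)\s*/gc     or return;
    $weight = $1;
    $x =~ /\G(kg|lbs)\./gc           or return;
    $units = $1;
    $x =~ /\G\s+(widget|sprocket)/gc or return;
    $item = $1;

    return ($weight, $units, $item);
}

foreach my $line ("12 kg. widget", "-3lbs. sprocket", "12 stone widget") {
    my @parts = parse_measurement($line);
    print "$line: ", (@parts ? join(", ", @parts) : "could not parse"), "\n";
}
```

Because each call works on its own copy of the string, the position from one call never leaks into the next.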
- -C<\G> is also invaluable in processing fixed length records with -regexps. Suppose we have a snippet of coding region DNA, encoded as -base pair letters C<ATCGTTGAAT...> and we want to find all the stop -codons C<TGA>. In a coding region, codons are 3-letter sequences, so -we can think of the DNA snippet as a sequence of 3-letter records. The -naive regexp - - # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC" - $dna = "ATCGTTGAATGCAAATGACATGAC"; - $dna =~ /TGA/; - -doesn't work; it may match a C<TGA>, but there is no guarantee that -the match is aligned with codon boundaries, e.g., the substring -S<C<GTT GAA> > gives a match. A better solution is - - while ($dna =~ /(\w\w\w)*?TGA/g) { # note the minimal *? - print "Got a TGA stop codon at position ", pos $dna, "\n"; - } - -which prints - - Got a TGA stop codon at position 18 - Got a TGA stop codon at position 23 - -Position 18 is good, but position 23 is bogus. What happened? - -The answer is that our regexp works well until we get past the last -real match. Then the regexp will fail to match a synchronized C<TGA> -and start stepping ahead one character position at a time, which is not what we -want. The solution is to use C<\G> to anchor the match to the codon -alignment: - - while ($dna =~ /\G(\w\w\w)*?TGA/g) { - print "Got a TGA stop codon at position ", pos $dna, "\n"; - } - -This prints - - Got a TGA stop codon at position 18 - -which is the correct answer. This example illustrates that it is -important not only to match what is desired, but to reject what is not -desired. - -B<search and replace> - -Regular expressions also play a big role in B<search and replace> -operations in Perl. Search and replace is accomplished with the -C<s///> operator. The general form is -C<s/regexp/replacement/modifiers>, with everything we know about -regexps and modifiers applying in this case as well. The -C<replacement> is a Perl double quoted string that replaces in the -string whatever is matched with the C<regexp>.
The operator C<=~> is -also used here to associate a string with C<s///>. If matching -against C<$_>, the S<C<$_ =~> > can be dropped. If there is a match, -C<s///> returns the number of substitutions made, otherwise it returns -false. Here are a few examples: - - $x = "Time to feed the cat!"; - $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!" - if ($x =~ s/^(Time.*hacker)!$/$1 now!/) { - $more_insistent = 1; - } - $y = "'quoted words'"; - $y =~ s/^'(.*)'$/$1/; # strip single quotes, - # $y contains "quoted words" - -In the last example, the whole string was matched, but only the part -inside the single quotes was grouped. With the C<s///> operator, the -matched variables C<$1>, C<$2>, etc. are immediately available for use -in the replacement expression, so we use C<$1> to replace the quoted -string with just what was quoted. With the global modifier, C<s///g> -will search and replace all occurrences of the regexp in the string: - - $x = "I batted 4 for 4"; - $x =~ s/4/four/; # doesn't do it all: - # $x contains "I batted four for 4" - $x = "I batted 4 for 4"; - $x =~ s/4/four/g; # does it all: - # $x contains "I batted four for four" - -If you prefer 'regex' over 'regexp' in this tutorial, you could use -the following program to replace it: - - % cat > simple_replace - #!/usr/bin/perl - $regexp = shift; - $replacement = shift; - while (<>) { - s/$regexp/$replacement/go; - print; - } - ^D - - % simple_replace regexp regex perlretut.pod - -In C<simple_replace> we used the C<s///g> modifier to replace all -occurrences of the regexp on each line and the C<s///o> modifier to -compile the regexp only once. As with C<simple_grep>, both the -C<print> and the C<s/$regexp/$replacement/go> use C<$_> implicitly. - -A modifier available specifically to search and replace is the -C<s///e> evaluation modifier. C<s///e> wraps an C<eval{...}> around -the replacement string and the evaluated result is substituted for the -matched substring. 
C<s///e> is useful if you need to do a bit of -computation in the process of replacing text. This example counts -character frequencies in a line: - - $x = "Bill the cat"; - $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself - print "frequency of '$_' is $chars{$_}\n" - foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars); - -This prints - - frequency of ' ' is 2 - frequency of 't' is 2 - frequency of 'l' is 2 - frequency of 'B' is 1 - frequency of 'c' is 1 - frequency of 'e' is 1 - frequency of 'h' is 1 - frequency of 'i' is 1 - frequency of 'a' is 1 - -As with the match C<m//> operator, C<s///> can use other delimiters, -such as C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are -used C<s'''>, then the regexp and replacement are treated as single -quoted strings and there are no substitutions. C<s///> in list context -returns the same thing as in scalar context, i.e., the number of -matches. - -B<The split operator> - -The B<C<split> > function can also optionally use a matching operator -C<m//> to split a string. C<split /regexp/, string, limit> splits -C<string> into a list of substrings and returns that list. The regexp -is used to match the character sequence that the C<string> is split -with respect to. The C<limit>, if present, constrains splitting into -no more than C<limit> number of strings. For example, to split a -string into words, use - - $x = "Calvin and Hobbes"; - @words = split /\s+/, $x; # $words[0] = 'Calvin' - # $words[1] = 'and' - # $words[2] = 'Hobbes' - -If the empty regexp C<//> is used, the regexp always matches and -the string is split into individual characters. If the regexp has -groupings, then the list produced contains the matched substrings from the -groupings as well.
For instance, - - $x = "/usr/bin/perl"; - @dirs = split m!/!, $x; # $dirs[0] = '' - # $dirs[1] = 'usr' - # $dirs[2] = 'bin' - # $dirs[3] = 'perl' - @parts = split m!(/)!, $x; # $parts[0] = '' - # $parts[1] = '/' - # $parts[2] = 'usr' - # $parts[3] = '/' - # $parts[4] = 'bin' - # $parts[5] = '/' - # $parts[6] = 'perl' - -Since the first character of $x matched the regexp, C<split> prepended -an empty initial element to the list. - -If you have read this far, congratulations! You now have all the basic -tools needed to use regular expressions to solve a wide range of text -processing problems. If this is your first time through the tutorial, -why not stop here and play around with regexps a while... S<Part 2> -concerns the more esoteric aspects of regular expressions and those -concepts certainly aren't needed right at the start. - -=head1 Part 2: Power tools - -OK, you know the basics of regexps and you want to know more. If -matching regular expressions is analogous to a walk in the woods, then -the tools discussed in Part 1 are analogous to topo maps and a -compass, basic tools we use all the time. Most of the tools in part 2 -are analogous to flare guns and satellite phones. They aren't used -too often on a hike, but when we are stuck, they can be invaluable. - -What follows are the more advanced, less used, or sometimes esoteric -capabilities of perl regexps. In Part 2, we will assume you are -comfortable with the basics and concentrate on the new features. - -=head2 More on characters, strings, and character classes - -There are a number of escape sequences and character classes that we -haven't covered yet. - -There are several escape sequences that convert characters or strings -between upper and lower case.
C<\l> and C<\u> convert the next -character to lower or upper case, respectively: - - $x = "perl"; - $string =~ /\u$x/; # matches 'Perl' in $string - $x = "M(rs?|s)\\."; # note the double backslash - $string =~ /\l$x/; # matches 'mr.', 'mrs.', and 'ms.' - -C<\L> and C<\U> convert a whole substring, delimited by C<\L> or -C<\U> and C<\E>, to lower or upper case: - - $x = "This word is in lower case:\L SHOUT\E"; - $x =~ /shout/; # matches - $x = "I STILL KEYPUNCH CARDS FOR MY 360"; - $x =~ /\Ukeypunch/; # matches punch card string - -If there is no C<\E>, case is converted until the end of the -string. The regexps C<\L\u$word> or C<\u\L$word> convert the first -character of C<$word> to uppercase and the rest of the characters to -lowercase. - -Control characters can be escaped with C<\c>, so that a control-Z -character would be matched with C<\cZ>. The escape sequence -C<\Q>...C<\E> quotes, or protects, most non-alphabetic characters. For -instance, - - $x = "\QThat !^*&%~& cat!"; - $x =~ /\Q!^*&%~&\E/; # check for rough language - -It does not protect C<$> or C<@>, so that variables can still be -substituted. - -With the advent of 5.6.0, perl regexps can handle more than just the -standard ASCII character set. Perl now supports B<Unicode>, a standard -for encoding the character sets from many of the world's written -languages. Unicode does this by allowing characters to be more than -one byte wide. Perl uses the UTF-8 encoding, in which ASCII characters -are still encoded as one byte, but characters greater than C<chr(127)> -may be stored as two or more bytes. - -What does this mean for regexps? Well, regexp users don't need to know -much about perl's internal representation of strings. But they do need -to know 1) how to represent Unicode characters in a regexp and 2) when -a matching operation will treat the string to be searched as a -sequence of bytes (the old way) or as a sequence of Unicode characters -(the new way).
The answer to 1) is that Unicode characters greater -than C<chr(127)> may be represented using the C<\x{hex}> notation, -with C<hex> a hexadecimal integer: - - use utf8; # We will be doing Unicode processing - /\x{263a}/; # match a Unicode smiley face :) - -Unicode characters in the range of 128-255 use two hexadecimal digits -with braces: C<\x{ab}>. Note that this is different than C<\xab>, -which is just a hexadecimal byte with no Unicode -significance. - -Figuring out the hexadecimal sequence of a Unicode character you want -or deciphering someone else's hexadecimal Unicode regexp is about as -much fun as programming in machine code. So another way to specify -Unicode characters is to use the S<B<named character> > escape -sequence C<\N{name}>. C<name> is a name for the Unicode character, as -specified in the Unicode standard. For instance, if we wanted to -represent or match the astrological sign for the planet Mercury, we -could use - - use utf8; # We will be doing Unicode processing - use charnames ":full"; # use named chars with Unicode full names - $x = "abc\N{MERCURY}def"; - $x =~ /\N{MERCURY}/; # matches - -One can also use short names or restrict names to a certain alphabet: - - use utf8; # We will be doing Unicode processing - - use charnames ':full'; - print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n"; - - use charnames ":short"; - print "\N{greek:Sigma} is an upper-case sigma.\n"; - - use charnames qw(greek); - print "\N{sigma} is Greek sigma\n"; - -A list of full names is found in the file Names.txt in the -lib/perl5/5.6.0/unicode directory. - -The answer to requirement 2), as of 5.6.0, is that if a regexp -contains Unicode characters, the string is searched as a sequence of -Unicode characters. Otherwise, the string is searched as a sequence of -bytes. If the string is being searched as a sequence of Unicode -characters, but matching a single byte is required, we can use the C<\C> -escape sequence. 
C<\C> is a character class akin to C<.> except that -it matches I<any> byte 0-255. So - - use utf8; # We will be doing Unicode processing - use charnames ":full"; # use named chars with Unicode full names - $x = "a"; - $x =~ /\C/; # matches 'a', eats one byte - $x = ""; - $x =~ /\C/; # doesn't match, no bytes to match - $x = "\N{MERCURY}"; # two-byte Unicode character - $x =~ /\C/; # matches, but dangerous! - -The last regexp matches, but is dangerous because the string -I<character> position is no longer synchronized to the string I<byte> -position. This generates the warning 'Malformed UTF-8 -character'. C<\C> is best used for matching the binary data in strings -with binary data intermixed with Unicode characters. - -Let us now discuss the rest of the character classes. Just as with -Unicode characters, there are named Unicode character classes -represented by the C<\p{name}> escape sequence. Closely associated is -the C<\P{name}> character class, which is the negation of the -C<\p{name}> class. 
For example, to match lower and uppercase -characters, - - use utf8; # We will be doing Unicode processing - use charnames ":full"; # use named chars with Unicode full names - $x = "BOB"; - $x =~ /^\p{IsUpper}/; # matches, uppercase char class - $x =~ /^\P{IsUpper}/; # doesn't match, char class sans uppercase - $x =~ /^\p{IsLower}/; # doesn't match, lowercase char class - $x =~ /^\P{IsLower}/; # matches, char class sans lowercase - -Here is the association between some Perl named classes and the -traditional Unicode classes: - - Perl class name Unicode class name or regular expression - - IsAlpha /^[LM]/ - IsAlnum /^[LMN]/ - IsASCII $code <= 127 - IsCntrl /^C/ - IsBlank $code =~ /^(0020|0009)$/ || /^Z[^lp]/ - IsDigit Nd - IsGraph /^([LMNPS]|Co)/ - IsLower Ll - IsPrint /^([LMNPS]|Co|Zs)/ - IsPunct /^P/ - IsSpace /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/) - IsSpacePerl /^Z/ || ($code =~ /^(0009|000A|000C|000D)$/) - IsUpper /^L[ut]/ - IsWord /^[LMN]/ || $code eq "005F" - IsXDigit $code =~ /^00(3[0-9]|[46][1-6])$/ - -You can also use the official Unicode class names with the C<\p> and -C<\P>, like C<\p{L}> for Unicode 'letters', or C<\p{Lu}> for uppercase -letters, or C<\P{Nd}> for non-digits. If a C<name> is just one -letter, the braces can be dropped. For instance, C<\pM> is the -character class of Unicode 'marks'. - -C<\X> is an abbreviation for a character class sequence that includes -the Unicode 'combining character sequences'. A 'combining character -sequence' is a base character followed by any number of combining -characters. An example of a combining character is an accent. Using -the Unicode full names, e.g., S<C<A + COMBINING RING> > is a combining -character sequence with base character C<A> and combining character -S<C<COMBINING RING> >, which translates in Danish to A with the circle -atop it, as in the word Angstrom. C<\X> is equivalent to C<\PM\pM*>, -i.e., a non-mark followed by zero or more marks.
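A combining character sequence can be tried out with a few one-line checks. This sketch assumes a reasonably modern perl, where C<\X> matches an extended grapheme cluster (slightly broader than the C<\PM\pM*> definition given above), and it writes the combining ring as code point C<\x{030A}>, which Unicode names COMBINING RING ABOVE. The subroutine names are hypothetical.

```perl
use strict;
use warnings;

# Does the whole string form exactly one \X sequence?
sub is_single_sequence { return $_[0] =~ /^\X$/ ? 1 : 0 }

# Same question, spelled out as a non-mark followed by zero or more marks
sub base_plus_marks    { return $_[0] =~ /^\PM\pM*$/ ? 1 : 0 }

# Count the characters in the Unicode 'mark' class
sub count_marks        { my @marks = $_[0] =~ /\pM/g; return scalar @marks }

my $a_ring = "A\x{030A}";    # 'A' followed by the combining ring

print "one sequence:    ", is_single_sequence($a_ring), "\n";
print "base plus marks: ", base_plus_marks($a_ring), "\n";
print "number of marks: ", count_marks($a_ring), "\n";
print "'AB' is one sequence? ", is_single_sequence("AB"), "\n";
```

The string is two characters long, but both C<\X> and C<\PM\pM*> treat it as a single unit.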
-
-As if all those classes weren't enough, Perl also defines POSIX style
-character classes. These have the form C<[:name:]>, with C<name> the
-name of the POSIX class. The POSIX classes are C<alpha>, C<alnum>,
-C<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>,
-C<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl
-extension to match C<\w>), and C<blank> (a GNU extension). If C<utf8>
-is being used, then these classes are defined the same as their
-corresponding perl Unicode classes: C<[:upper:]> is the same as
-C<\p{IsUpper}>, etc. The POSIX character classes, however, don't
-require using C<utf8>. The C<[:digit:]>, C<[:word:]>, and
-C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s>
-character classes. To negate a POSIX class, put a C<^> in front of
-the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
-C<utf8>, C<\P{IsDigit}>. The Unicode and POSIX character classes can
-be used just like C<\d>, with the exception that POSIX character
-classes can only be used inside a bracketed character class:
-
-    /\s+[abc[:digit:]xyz]\s*/;  # match a,b,c,x,y,z, or a digit
-    /^=item\s[[:digit:]]/;      # match '=item',
-                                # followed by a space and a digit
-    use utf8;
-    use charnames ":full";
-    /\s+[abc\p{IsDigit}xyz]\s+/;  # match a,b,c,x,y,z, or a digit
-    /^=item\s\p{IsDigit}/;        # match '=item',
-                                  # followed by a space and a digit
-
-Whew! That is all the rest of the characters and character classes.
-
-=head2 Compiling and saving regular expressions
-
-In Part 1 we discussed the C<//o> modifier, which compiles a regexp
-just once. This suggests that a compiled regexp is some data structure
-that can be stored once and used again and again.
-The regexp quote
-C<qr//> does exactly that: C<qr/string/> compiles the C<string> as a
-regexp and transforms the result into a form that can be assigned to a
-variable:
-
-    $reg = qr/foo+bar?/;  # reg contains a compiled regexp
-
-Then C<$reg> can be used as a regexp:
-
-    $x = "fooooba";
-    $x =~ $reg;     # matches, just like /foo+bar?/
-    $x =~ /$reg/;   # same thing, alternate form
-
-C<$reg> can also be interpolated into a larger regexp:
-
-    $x =~ /(abc)?$reg/;  # still matches
-
-As with the matching operator, the regexp quote can use different
-delimiters, e.g., C<qr!!>, C<qr{}> and C<qr~~>. The single quote
-delimiters C<qr''> prevent any interpolation from taking place.
-
-Pre-compiled regexps are useful for creating dynamic matches that
-don't need to be recompiled each time they are encountered. Using
-pre-compiled regexps, the C<simple_grep> program can be expanded into a
-program that matches multiple patterns:
-
-    % cat > multi_grep
-    #!/usr/bin/perl
-    # multi_grep - match any of <number> regexps
-    # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ...
-
-    $number = shift;
-    $regexp[$_] = shift foreach (0..$number-1);
-    @compiled = map qr/$_/, @regexp;
-    while ($line = <>) {
-        foreach $pattern (@compiled) {
-            if ($line =~ /$pattern/) {
-                print $line;
-                last;  # we matched, so move on to the next line
-            }
-        }
-    }
-    ^D
-
-    % multi_grep 2 last for multi_grep
-        $regexp[$_] = shift foreach (0..$number-1);
-            foreach $pattern (@compiled) {
-                    last;
-
-Storing pre-compiled regexps in an array C<@compiled> allows us to
-simply loop through the regexps without any recompilation, thus gaining
-flexibility without sacrificing speed.
-
-=head2 Embedding comments and modifiers in a regular expression
-
-Starting with this section, we will be discussing Perl's set of
-B<extended patterns>. These are extensions to the traditional regular
-expression syntax that provide powerful new tools for pattern
-matching.
-We have already seen extensions in the form of the minimal
-matching constructs C<??>, C<*?>, C<+?>, C<{n,m}?>, and C<{n,}?>. The
-rest of the extensions below have the form C<(?char...)>, where the
-C<char> is a character that determines the type of extension.
-
-The first extension is an embedded comment C<(?#text)>. This embeds a
-comment into the regular expression without affecting its meaning. The
-comment should not have any closing parentheses in the text. An
-example is
-
-    /(?# Match an integer:)[+-]?\d+/;
-
-This style of commenting has been largely superseded by the raw,
-freeform commenting that is allowed with the C<//x> modifier.
-
-The modifiers C<//i>, C<//m>, C<//s>, and C<//x> can also be embedded in
-a regexp using C<(?i)>, C<(?m)>, C<(?s)>, and C<(?x)>. For instance,
-
-    /(?i)yes/;  # match 'yes' case insensitively
-    /yes/i;     # same thing
-    /(?x)(          # freeform version of an integer regexp
-             [+-]?  # match an optional sign
-             \d+    # match a sequence of digits
-         )
-    /x;
-
-Embedded modifiers can have two important advantages over the usual
-modifiers. Embedded modifiers allow a custom set of modifiers for
-I<each> regexp pattern. This is great for matching an array of regexps
-that must have different modifiers:
-
-    $pattern[0] = '(?i)doctor';
-    $pattern[1] = 'Johnson';
-    ...
-    while (<>) {
-        foreach $patt (@pattern) {
-            print if /$patt/;
-        }
-    }
-
-The second advantage is that embedded modifiers only affect the regexp
-inside the group the embedded modifier is contained in. So grouping
-can be used to localize the modifier's effects:
-
-    /Answer: ((?i)yes)/;  # matches 'Answer: yes', 'Answer: YES', etc.
-
-Embedded modifiers can also turn off any modifiers already present
-by using, e.g., C<(?-i)>. Modifiers can also be combined into
-a single expression, e.g., C<(?s-i)> turns on single line mode and
-turns off case insensitivity.
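The scoping rules for embedded modifiers are easy to check by experiment. The following is a small sketch under stated assumptions, not code from the original tutorial: the sample lines and the variable names (C<@hits>, C<$scoped_ok>, C<$off_ok>, and so on) are invented for illustration. It exercises a per-pattern C<(?i)>, a C<(?i)> confined to a group, and C<(?-i)> switching the modifier back off.

```perl
use strict;
use warnings;

# each pattern carries its own modifiers
my @patterns = ('(?i)doctor', 'Johnson');
my @lines    = ("DOCTOR Who", "johnson", "Dr. Johnson");

my @hits;
for my $line (@lines) {
    for my $patt (@patterns) {
        if ($line =~ /$patt/) {
            push @hits, $line;
            last;    # one hit per line is enough
        }
    }
}

# (?i) inside the group applies only to 'yes', not to 'Answer: '
my $scoped_ok  = ("Answer: YES" =~ /Answer: ((?i)yes)/) ? 1 : 0;
my $scoped_bad = ("ANSWER: yes" =~ /Answer: ((?i)yes)/) ? 1 : 0;

# (?-i) switches case insensitivity back off for the rest of the pattern
my $off_ok  = ("Ab" =~ /^(?i)a(?-i)b$/) ? 1 : 0;
my $off_bad = ("AB" =~ /^(?i)a(?-i)b$/) ? 1 : 0;

print "hits: @hits\n";
print "scoped: $scoped_ok $scoped_bad, off: $off_ok $off_bad\n";
```

Only C<"DOCTOR Who"> and C<"Dr. Johnson"> end up in C<@hits>: the lowercase C<"johnson"> is rejected because the second pattern carries no C<(?i)>.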
-
-=head2 Non-capturing groupings
-
-We noted in Part 1 that groupings C<()> had two distinct functions: 1)
-group regexp elements together as a single unit, and 2) extract, or
-capture, substrings that matched the regexp in the
-grouping. Non-capturing groupings, denoted by C<(?:regexp)>, allow the
-regexp to be treated as a single unit, but don't extract substrings or
-set matching variables C<$1>, etc. Both capturing and non-capturing
-groupings are allowed to co-exist in the same regexp. Because there is
-no extraction, non-capturing groupings are faster than capturing
-groupings. Non-capturing groupings are also handy for choosing exactly
-which parts of a regexp are to be extracted to matching variables:
-
-    # match a number, $1-$4 are set, but we only want $1
-    /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/;
-
-    # match a number faster, only $1 is set
-    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/;
-
-    # match a number, get $1 = whole number, $2 = exponent
-    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;
-
-Non-capturing groupings are also useful for removing nuisance
-elements gathered from a split operation:
-
-    $x = '12a34b5';
-    @num = split /(a|b)/, $x;    # @num = ('12','a','34','b','5')
-    @num = split /(?:a|b)/, $x;  # @num = ('12','34','5')
-
-Non-capturing groupings may also have embedded modifiers:
-C<(?i-m:regexp)> is a non-capturing grouping that matches C<regexp>
-case insensitively and turns off multi-line mode.
-
-=head2 Looking ahead and looking behind
-
-This section concerns the lookahead and lookbehind assertions. First,
-a little background.
-
-In Perl regular expressions, most regexp elements 'eat up' a certain
-amount of string when they match. For instance, the regexp element
-C<[abc]> eats up one character of the string when it matches, in the
-sense that perl moves to the next character position in the string
-after the match.
-There are some elements, however, that don't eat up
-characters (advance the character position) if they match. The examples
-we have seen so far are the anchors. The anchor C<^> matches the
-beginning of the line, but doesn't eat any characters. Similarly, the
-word boundary anchor C<\b> matches, e.g., if the character to the left
-is a word character and the character to the right is a non-word
-character, but it doesn't eat up any characters itself. Anchors are
-examples of 'zero-width assertions'. Zero-width, because they consume
-no characters, and assertions, because they test some property of the
-string. In the context of our walk in the woods analogy to regexp
-matching, most regexp elements move us along a trail, but anchors have
-us stop a moment and check our surroundings. If the local environment
-checks out, we can proceed forward. But if the local environment
-doesn't satisfy us, we must backtrack.
-
-Checking the environment entails either looking ahead on the trail,
-looking behind, or both. C<^> looks behind, to see that there are no
-characters before. C<$> looks ahead, to see that there are no
-characters after. C<\b> looks both ahead and behind, to see if the
-characters on either side differ in their 'word'-ness.
-
-The lookahead and lookbehind assertions are generalizations of the
-anchor concept. Lookahead and lookbehind are zero-width assertions
-that let us specify which characters we want to test for. The
-lookahead assertion is denoted by C<(?=regexp)> and the lookbehind
-assertion is denoted by C<< (?<=fixed-regexp) >>.
-Some examples are
-
-    $x = "I catch the housecat 'Tom-cat' with catnip";
-    $x =~ /cat(?=\s+)/;  # matches 'cat' in 'housecat'
-    @catwords = ($x =~ /(?<=\s)cat\w+/g);  # matches,
-                                           # $catwords[0] = 'catch'
-                                           # $catwords[1] = 'catnip'
-    $x =~ /\bcat\b/;  # matches 'cat' in 'Tom-cat'
-    $x =~ /(?<=\s)cat(?=\s)/;  # doesn't match; no isolated 'cat' in
-                               # middle of $x
-
-Note that the parentheses in C<(?=regexp)> and C<< (?<=regexp) >> are
-non-capturing, since these are zero-width assertions. Thus in the
-second regexp, the substrings captured are those of the whole regexp
-itself. Lookahead C<(?=regexp)> can match arbitrary regexps, but
-lookbehind C<< (?<=fixed-regexp) >> only works for regexps of fixed
-width, i.e., a fixed number of characters long. Thus
-C<< (?<=(ab|bc)) >> is fine, but C<< (?<=(ab)*) >> is not. The
-negated versions of the lookahead and lookbehind assertions are
-denoted by C<(?!regexp)> and C<< (?<!fixed-regexp) >> respectively.
-They evaluate true if the regexps do I<not> match:
-
-    $x = "foobar";
-    $x =~ /foo(?!bar)/;  # doesn't match, 'bar' follows 'foo'
-    $x =~ /foo(?!baz)/;  # matches, 'baz' doesn't follow 'foo'
-    $x =~ /(?<!\s)foo/;  # matches, there is no \s before 'foo'
-
-=head2 Using independent subexpressions to prevent backtracking
-
-The last few extended patterns in this tutorial are experimental as of
-5.6.0. Play with them, use them in some code, but don't rely on them
-just yet for production code.
-
-S<B<Independent subexpressions> > are regular expressions, in the
-context of a larger regular expression, that function independently of
-the larger regular expression. That is, they consume as much or as
-little of the string as they wish without regard for the ability of
-the larger regexp to match. Independent subexpressions are represented
-by C<< (?>regexp) >>.
-We can illustrate their behavior by first
-considering an ordinary regexp:
-
-    $x = "ab";
-    $x =~ /a*ab/;  # matches
-
-This obviously matches, but in the process of matching, the
-subexpression C<a*> first grabbed the C<a>. Doing so, however,
-wouldn't allow the whole regexp to match, so after backtracking, C<a*>
-eventually gave back the C<a> and matched the empty string. Here, what
-C<a*> matched was I<dependent> on what the rest of the regexp matched.
-
-Contrast that with an independent subexpression:
-
-    $x =~ /(?>a*)ab/;  # doesn't match!
-
-The independent subexpression C<< (?>a*) >> doesn't care about the rest
-of the regexp, so it sees an C<a> and grabs it. Then the rest of the
-regexp C<ab> cannot match. Because C<< (?>a*) >> is independent, there
-is no backtracking and the independent subexpression does not give
-up its C<a>. Thus the match of the regexp as a whole fails. A similar
-behavior occurs with completely independent regexps:
-
-    $x = "ab";
-    $x =~ /a*/g;    # matches, eats an 'a'
-    $x =~ /\Gab/g;  # doesn't match, no 'a' available
-
-Here C<//g> and C<\G> create a 'tag team' handoff of the string from
-one regexp to the other. Regexps with an independent subexpression are
-much like this, with a handoff of the string to the independent
-subexpression, and a handoff of the string back to the enclosing
-regexp.
-
-The ability of an independent subexpression to prevent backtracking
-can be quite useful. Suppose we want to match a non-empty string
-enclosed in parentheses up to two levels deep. Then the following
-regexp matches:
-
-    $x = "abc(de(fg)h";  # unbalanced parentheses
-    $x =~ /\( ( [^()]+ | \([^()]*\) )+ \)/x;
-
-The regexp matches an open parenthesis, one or more copies of an
-alternation, and a close parenthesis. The alternation is two-way, with
-the first alternative C<[^()]+> matching a substring with no
-parentheses and the second alternative C<\([^()]*\)> matching a
-substring delimited by parentheses.
-The problem with this regexp is
-that it is pathological: it has nested indeterminate quantifiers
-of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers
-like this could take an exponentially long time to execute if there
-was no match possible. To prevent the exponential blowup, we need to
-prevent useless backtracking at some point. This can be done by
-enclosing the inner quantifier as an independent subexpression:
-
-    $x =~ /\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;
-
-Here, C<< (?>[^()]+) >> breaks the degeneracy of string partitioning
-by gobbling up as much of the string as possible and keeping it. Then
-match failures fail much more quickly.
-
-=head2 Conditional expressions
-
-A S<B<conditional expression> > is a form of if-then-else statement
-that allows one to choose which patterns are to be matched, based on
-some condition. There are two types of conditional expression:
-C<(?(condition)yes-regexp)> and
-C<(?(condition)yes-regexp|no-regexp)>. C<(?(condition)yes-regexp)> is
-like an S<C<'if () {}'> > statement in Perl. If the C<condition> is true,
-the C<yes-regexp> will be matched. If the C<condition> is false, the
-C<yes-regexp> will be skipped and perl will move on to the next regexp
-element. The second form is like an S<C<'if () {} else {}'> > statement
-in Perl. If the C<condition> is true, the C<yes-regexp> will be
-matched, otherwise the C<no-regexp> will be matched.
-
-The C<condition> can have two forms. The first form is simply an
-integer in parentheses C<(integer)>. It is true if the corresponding
-backreference C<\integer> matched earlier in the regexp. The second
-form is a bare zero-width assertion C<(?...)>, either a
-lookahead, a lookbehind, or a code assertion (discussed in the next
-section).
-
-The integer form of the C<condition> allows us to choose, with more
-flexibility, what to match based on what matched earlier in the
-regexp.
-This searches for words of the form C<"$x$x"> or
-C<"$x$y$y$x">:
-
-    % simple_grep '^(\w+)(\w+)?(?(2)\2\1|\1)$' /usr/dict/words
-    beriberi
-    coco
-    couscous
-    deed
-    ...
-    toot
-    toto
-    tutu
-
-The lookbehind C<condition> allows, along with backreferences,
-an earlier part of the match to influence a later part of the
-match. For instance,
-
-    /[ATGC]+(?(?<=AA)G|C)$/;
-
-matches a DNA sequence such that it either ends in C<AAG>, or some
-other base pair combination and C<C>. Note that the form is
-C<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the
-lookahead, lookbehind or code assertions, the parentheses around the
-conditional are not needed.
-
-=head2 A bit of magic: executing Perl code in a regular expression
-
-Normally, regexps are a part of Perl expressions.
-S<B<Code evaluation> > expressions turn that around by allowing
-arbitrary Perl code to be a part of a regexp. A code evaluation
-expression is denoted C<(?{code})>, with C<code> a string of Perl
-statements.
-
-Code expressions are zero-width assertions, and the value they return
-depends on their environment. There are two possibilities: either the
-code expression is used as a conditional in a conditional expression
-C<(?(condition)...)>, or it is not. If the code expression is a
-conditional, the code is evaluated and the result (i.e., the result of
-the last statement) is used to determine truth or falsehood. If the
-code expression is not used as a conditional, the assertion always
-evaluates true and the result is put into the special variable
-C<$^R>. The variable C<$^R> can then be used in code expressions later
-in the regexp. Here are some silly examples:
-
-    $x = "abcdef";
-    $x =~ /abc(?{print "Hi Mom!";})def/;  # matches,
-                                          # prints 'Hi Mom!'
-    $x =~ /aaa(?{print "Hi Mom!";})def/;  # doesn't match,
-                                          # no 'Hi Mom!'
-
-Pay careful attention to the next example:
-
-    $x =~ /abc(?{print "Hi Mom!";})ddd/;  # doesn't match,
-                                          # no 'Hi Mom!'
-                                          # but why not?
-
-At first glance, you'd think that it shouldn't print, because obviously
-the C<ddd> isn't going to match the target string. But look at this
-example:
-
-    $x =~ /abc(?{print "Hi Mom!";})[d]dd/;  # doesn't match,
-                                            # but _does_ print
-
-Hmm. What happened here? If you've been following along, you know that
-the above pattern should be effectively the same as the last one --
-enclosing the d in a character class isn't going to change what it
-matches. So why does the first not print while the second one does?
-
-The answer lies in the optimizations the REx engine makes. In the first
-case, all the engine sees are plain old characters (aside from the
-C<?{}> construct). It's smart enough to realize that the string 'ddd'
-doesn't occur in our target string before actually running the pattern
-through. But in the second case, we've tricked it into thinking that our
-pattern is more complicated than it is. It takes a look, sees our
-character class, and decides that it will have to actually run the
-pattern to determine whether or not it matches, and in the process of
-running it hits the print statement before it discovers that we don't
-have a match.
-
-To take a closer look at how the engine does optimizations, see the
-section L<"Pragmas and debugging"> below.
-
-More fun with C<?{}>:
-
-    $x =~ /(?{print "Hi Mom!";})/;          # matches,
-                                            # prints 'Hi Mom!'
-    $x =~ /(?{$c = 1;})(?{print "$c";})/;   # matches,
-                                            # prints '1'
-    $x =~ /(?{$c = 1;})(?{print "$^R";})/;  # matches,
-                                            # prints '1'
-
-The bit of magic mentioned in the section title occurs when the regexp
-backtracks in the process of searching for a match. If the regexp
-backtracks over a code expression and if the variables used within are
-localized using C<local>, the changes in the variables produced by the
-code expression are undone!
-Thus, if we wanted to count how many times
-a character got matched inside a group, we could use, e.g.,
-
-    $x = "aaaa";
-    $count = 0;  # initialize 'a' count
-    $c = "bob";  # test if $c gets clobbered
-    $x =~ /(?{local $c = 0;})         # initialize count
-           ( a                        # match 'a'
-             (?{local $c = $c + 1;})  # increment count
-           )*                         # do this any number of times,
-           aa                         # but match 'aa' at the end
-           (?{$count = $c;})          # copy local $c var into $count
-          /x;
-    print "'a' count is $count, \$c variable is '$c'\n";
-
-This prints
-
-    'a' count is 2, $c variable is 'bob'
-
-If we replace the S<C< (?{local $c = $c + 1;})> > with
-S<C< (?{$c = $c + 1;})> >, the variable changes are I<not> undone
-during backtracking, and we get
-
-    'a' count is 4, $c variable is 'bob'
-
-Note that only localized variable changes are undone. Other side
-effects of code expression execution are permanent. Thus
-
-    $x = "aaaa";
-    $x =~ /(a(?{print "Yow\n";}))*aa/;
-
-produces
-
-    Yow
-    Yow
-    Yow
-    Yow
-
-The result C<$^R> is automatically localized, so that it will behave
-properly in the presence of backtracking.
-
-This example uses a code expression in a conditional to match the
-article 'the' in either English or German:
-
-    $lang = 'DE';  # use German
-    ...
-    $text = "das";
-    print "matched\n"
-        if $text =~ /(?(?{
-                          $lang eq 'EN';  # is the language English?
-                         })
-                       the |              # if so, then match 'the'
-                       (die|das|der)      # else, match 'die|das|der'
-                     )
-                    /xi;
-
-Note that the syntax here is C<(?(?{...})yes-regexp|no-regexp)>, not
-C<(?((?{...}))yes-regexp|no-regexp)>. In other words, in the case of a
-code expression, we don't need the extra parentheses around the
-conditional.
-
-If you try to use code expressions with interpolating variables, perl
-may surprise you:
-
-    $bar = 5;
-    $pat = '(?{ 1 })';
-    /foo(?{ $bar })bar/;  # compiles ok, $bar not interpolated
-    /foo(?{ 1 })$bar/;    # compile error!
-    /foo${pat}bar/;       # compile error!
-
-    $pat = qr/(?{ $foo = 1 })/;  # precompile code regexp
-    /foo${pat}bar/;              # compiles ok
-
-If a regexp has (1) code expressions and interpolating variables, or
-(2) a variable that interpolates a code expression, perl treats the
-regexp as an error. If the code expression is precompiled into a
-variable, however, interpolating is ok. The question is, why is this
-an error?
-
-The reason is that variable interpolation and code expressions
-together pose a security risk. The combination is dangerous because
-many programmers who write search engines often take user input and
-plug it directly into a regexp:
-
-    $regexp = <>;        # read user-supplied regexp
-    chomp $regexp;       # get rid of possible newline
-    $text =~ /$regexp/;  # search $text for the $regexp
-
-If the C<$regexp> variable contains a code expression, the user could
-then execute arbitrary Perl code. For instance, some joker could
-search for S<C<system('rm -rf *');> > to erase your files. In this
-sense, the combination of interpolation and code expressions B<taints>
-your regexp. So by default, using both interpolation and code
-expressions in the same regexp is not allowed. If you're not
-concerned about malicious users, it is possible to bypass this
-security check by invoking S<C<use re 'eval'> >:
-
-    use re 'eval';      # throw caution out the door
-    $bar = 5;
-    $pat = '(?{ 1 })';
-    /foo(?{ 1 })$bar/;  # compiles ok
-    /foo${pat}bar/;     # compiles ok
-
-Another form of code expression is the S<B<pattern code expression> >.
-The pattern code expression is like a regular code expression, except
-that the result of the code evaluation is treated as a regular
-expression and matched immediately. A simple example is
-
-    $length = 5;
-    $char = 'a';
-    $x = 'aaaaabb';
-    $x =~ /(??{$char x $length})/x;  # matches, there are 5 of 'a'
-
-
-This final example contains both ordinary and pattern code
-expressions. It detects if a binary string C<1101010010001...> has a
-Fibonacci spacing 0,1,1,2,3,5,...
-of the C<1>'s:
-
-    $s0 = 0; $s1 = 1;  # initial conditions
-    $x = "1101010010001000001";
-    print "It is a Fibonacci sequence\n"
-        if $x =~ /^1         # match an initial '1'
-            (
-               (??{'0' x $s0})  # match $s0 of '0'
-               1                # and then a '1'
-               (?{
-                  $largest = $s0;   # largest seq so far
-                  $s2 = $s1 + $s0;  # compute next term
-                  $s0 = $s1;        # in Fibonacci sequence
-                  $s1 = $s2;
-                 })
-            )+   # repeat as needed
-          $      # that is all there is
-         /x;
-    print "Largest sequence matched was $largest\n";
-
-This prints
-
-    It is a Fibonacci sequence
-    Largest sequence matched was 5
-
-Ha! Try that with your garden variety regexp package...
-
-Note that the variables C<$s0> and C<$s1> are not substituted when the
-regexp is compiled, as happens for ordinary variables outside a code
-expression. Rather, the code expressions are evaluated when perl
-encounters them during the search for a match.
-
-The regexp without the C<//x> modifier is
-
-    /^1((??{'0'x$s0})1(?{$largest=$s0;$s2=$s1+$s0;$s0=$s1;$s1=$s2;}))+$/;
-
-and is a great start on an Obfuscated Perl entry :-) When working with
-code and conditional expressions, the extended form of regexps is
-almost necessary in creating and debugging regexps.
-
-=head2 Pragmas and debugging
-
-Speaking of debugging, there are several pragmas available to control
-and debug regexps in Perl. We have already encountered one pragma in
-the previous section, S<C<use re 'eval';> >, that allows variable
-interpolation and code expressions to coexist in a regexp. The other
-pragmas are
-
-    use re 'taint';
-    $tainted = <>;
-    @parts = ($tainted =~ /(\w+)\s+(\w+)/);  # @parts is now tainted
-
-The C<taint> pragma causes any substrings from a match with a tainted
-variable to be tainted as well. This is not normally the case, as
-regexps are often used to extract the safe bits from a tainted
-variable. Use C<taint> when you are not extracting safe bits, but are
-performing some other processing.
-Both C<taint> and C<eval> pragmas
-are lexically scoped, which means they are in effect only until
-the end of the block enclosing the pragmas.
-
-    use re 'debug';
-    /^(.*)$/s;  # output debugging info
-
-    use re 'debugcolor';
-    /^(.*)$/s;  # output debugging info in living color
-
-The global C<debug> and C<debugcolor> pragmas allow one to get
-detailed debugging info about regexp compilation and
-execution. C<debugcolor> is the same as debug, except the debugging
-information is displayed in color on terminals that can display
-termcap color sequences. Here is example output:
-
-    % perl -e 'use re "debug"; "abc" =~ /a*b+c/;'
-    Compiling REx `a*b+c'
-    size 9 first at 1
-       1: STAR(4)
-       2:   EXACT <a>(0)
-       4: PLUS(7)
-       5:   EXACT <b>(0)
-       7: EXACT <c>(9)
-       9: END(0)
-    floating `bc' at 0..2147483647 (checking floating) minlen 2
-    Guessing start of match, REx `a*b+c' against `abc'...
-     Found floating substr `bc' at offset 1...
-    Guessed: match at offset 0
-    Matching REx `a*b+c' against `abc'
-      Setting an EVAL scope, savestack=3
-       0 <> <abc>   |  1:  STAR
-                     EXACT <a> can match 1 times out of 32767...
-      Setting an EVAL scope, savestack=3
-       1 <a> <bc>   |  4:    PLUS
-                     EXACT <b> can match 1 times out of 32767...
-      Setting an EVAL scope, savestack=3
-       2 <ab> <c>   |  7:      EXACT <c>
-       3 <abc> <>   |  9:      END
-    Match successful!
-    Freeing REx: `a*b+c'
-
-If you have gotten this far into the tutorial, you can probably guess
-what the different parts of the debugging output tell you. The first
-part
-
-    Compiling REx `a*b+c'
-    size 9 first at 1
-       1: STAR(4)
-       2:   EXACT <a>(0)
-       4: PLUS(7)
-       5:   EXACT <b>(0)
-       7: EXACT <c>(9)
-       9: END(0)
-
-describes the compilation stage. C<STAR(4)> means that there is a
-starred object, in this case C<'a'>, and if it matches, goto line 4,
-i.e., C<PLUS(7)>. The middle lines describe some heuristics and
-optimizations performed before a match:
-
-    floating `bc' at 0..2147483647 (checking floating) minlen 2
-    Guessing start of match, REx `a*b+c' against `abc'...
-     Found floating substr `bc' at offset 1...
-    Guessed: match at offset 0
-
-Then the match is executed and the remaining lines describe the
-process:
-
-    Matching REx `a*b+c' against `abc'
-      Setting an EVAL scope, savestack=3
-       0 <> <abc>   |  1:  STAR
-                     EXACT <a> can match 1 times out of 32767...
-      Setting an EVAL scope, savestack=3
-       1 <a> <bc>   |  4:    PLUS
-                     EXACT <b> can match 1 times out of 32767...
-      Setting an EVAL scope, savestack=3
-       2 <ab> <c>   |  7:      EXACT <c>
-       3 <abc> <>   |  9:      END
-    Match successful!
-    Freeing REx: `a*b+c'
-
-Each step is of the form S<C<< n <x> <y> >> >, with C<< <x> >> the
-part of the string matched and C<< <y> >> the part not yet
-matched. The S<C<< | 1: STAR >> > says that perl is at line number 1
-in the compilation list above. See
-L<perldebguts/"Debugging regular expressions"> for much more detail.
-
-An alternative method of debugging regexps is to embed C<print>
-statements within the regexp. This provides a blow-by-blow account of
-the backtracking in an alternation:
-
-    "that this" =~ m@(?{print "Start at position ", pos, "\n";})
-                     t(?{print "t1\n";})
-                     h(?{print "h1\n";})
-                     i(?{print "i1\n";})
-                     s(?{print "s1\n";})
-                         |
-                     t(?{print "t2\n";})
-                     h(?{print "h2\n";})
-                     a(?{print "a2\n";})
-                     t(?{print "t2\n";})
-                     (?{print "Done at position ", pos, "\n";})
-                    @x;
-
-prints
-
-    Start at position 0
-    t1
-    h1
-    t2
-    h2
-    a2
-    t2
-    Done at position 4
-
-=head1 BUGS
-
-Code expressions, conditional expressions, and independent expressions
-are B<experimental>. Don't use them in production code. Yet.
-
-=head1 SEE ALSO
-
-This is just a tutorial. For the full story on perl regular
-expressions, see the L<perlre> regular expressions reference page.
-
-For more information on the matching C<m//> and substitution C<s///>
-operators, see L<perlop/"Regexp Quote-Like Operators">. For
-information on the C<split> operation, see L<perlfunc/split>.
-
-For an excellent all-around resource on the care and feeding of
-regular expressions, see the book I<Mastering Regular Expressions> by
-Jeffrey Friedl (published by O'Reilly, ISBN 1-56592-257-3).
-
-=head1 AUTHOR AND COPYRIGHT
-
-Copyright (c) 2000 Mark Kvale
-All rights reserved.
-
-This document may be distributed under the same terms as Perl itself.
-
-=head2 Acknowledgments
-
-The inspiration for the stop codon DNA example came from the ZIP
-code example in chapter 7 of I<Mastering Regular Expressions>.
-
-The author would like to thank Jeff Pinyan, Andrew Johnson, Peter
-Haworth, Ronald J Kimball, and Joe Smith for all their helpful
-comments.
-
-=cut
-