diff options
Diffstat (limited to 'contrib/perl5/pod/perlrequick.pod')
-rw-r--r-- | contrib/perl5/pod/perlrequick.pod | 503 |
1 files changed, 0 insertions, 503 deletions
diff --git a/contrib/perl5/pod/perlrequick.pod b/contrib/perl5/pod/perlrequick.pod deleted file mode 100644 index 5b72a35..0000000 --- a/contrib/perl5/pod/perlrequick.pod +++ /dev/null @@ -1,503 +0,0 @@ -=head1 NAME - -perlrequick - Perl regular expressions quick start - -=head1 DESCRIPTION - -This page covers the very basics of understanding, creating and -using regular expressions ('regexes') in Perl. - - -=head1 The Guide - -=head2 Simple word matching - -The simplest regex is simply a word, or more generally, a string of -characters. A regex consisting of a word matches any string that -contains that word: - - "Hello World" =~ /World/; # matches - -In this statement, C<World> is a regex and the C<//> enclosing -C</World/> tells perl to search a string for a match. The operator -C<=~> associates the string with the regex match and produces a true -value if the regex matched, or false if the regex did not match. In -our case, C<World> matches the second word in C<"Hello World">, so the -expression is true. This idea has several variations. - -Expressions like this are useful in conditionals: - - print "It matches\n" if "Hello World" =~ /World/; - -The sense of the match can be reversed by using C<!~> operator: - - print "It doesn't match\n" if "Hello World" !~ /World/; - -The literal string in the regex can be replaced by a variable: - - $greeting = "World"; - print "It matches\n" if "Hello World" =~ /$greeting/; - -If you're matching against C<$_>, the C<$_ =~> part can be omitted: - - $_ = "Hello World"; - print "It matches\n" if /World/; - -Finally, the C<//> default delimiters for a match can be changed to -arbitrary delimiters by putting an C<'m'> out front: - - "Hello World" =~ m!World!; # matches, delimited by '!' - "Hello World" =~ m{World}; # matches, note the matching '{}' - "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin', - # '/' becomes an ordinary char - -Regexes must match a part of the string I<exactly> in order for the -statement to be true: - - "Hello World" =~ /world/; # doesn't match, case sensitive - "Hello World" =~ /o W/; # matches, ' ' is an ordinary char - "Hello World" =~ /World /; # doesn't match, no ' ' at end - -perl will always match at the earliest possible point in the string: - - "Hello World" =~ /o/; # matches 'o' in 'Hello' - "That hat is red" =~ /hat/; # matches 'hat' in 'That' - -Not all characters can be used 'as is' in a match. Some characters, -called B<metacharacters>, are reserved for use in regex notation. -The metacharacters are - - {}[]()^$.|*+?\ - -A metacharacter can be matched by putting a backslash before it: - - "2+2=4" =~ /2+2/; # doesn't match, + is a metacharacter - "2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary + - 'C:\WIN32' =~ /C:\\WIN/; # matches - "/usr/bin/perl" =~ /\/usr\/local\/bin\/perl/; # matches - -In the last regex, the forward slash C<'/'> is also backslashed, -because it is used to delimit the regex. - -Non-printable ASCII characters are represented by B<escape sequences>. -Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r> -for a carriage return. Arbitrary bytes are represented by octal -escape sequences, e.g., C<\033>, or hexadecimal escape sequences, -e.g., C<\x1B>: - - "1000\t2000" =~ m(0\t2) # matches - "cat" =~ /\143\x61\x74/ # matches, but a weird way to spell cat - -Regexes are treated mostly as double quoted strings, so variable -substitution works: - - $foo = 'house'; - 'cathouse' =~ /cat$foo/; # matches - 'housecat' =~ /${foo}cat/; # matches - -With all of the regexes above, if the regex matched anywhere in the -string, it was considered a match. To specify I<where> it should -match, we would use the B<anchor> metacharacters C<^> and C<$>. The -anchor C<^> means match at the beginning of the string and the anchor -C<$> means match at the end of the string, or before a newline at the -end of the string. Some examples: - - "housekeeper" =~ /keeper/; # matches - "housekeeper" =~ /^keeper/; # doesn't match - "housekeeper" =~ /keeper$/; # matches - "housekeeper\n" =~ /keeper$/; # matches - "housekeeper" =~ /^housekeeper$/; # matches - -=head2 Using character classes - -A B<character class> allows a set of possible characters, rather than -just a single character, to match at a particular point in a regex. -Character classes are denoted by brackets C<[...]>, with the set of -characters to be possibly matched inside. Here are some examples: - - /cat/; # matches 'cat' - /[bcr]at/; # matches 'bat', 'cat', or 'rat' - "abc" =~ /[cab]/; # matches 'a' - -In the last statement, even though C<'c'> is the first character in -the class, the earliest point at which the regex can match is C<'a'>. - - /[yY][eE][sS]/; # match 'yes' in a case-insensitive way - # 'yes', 'Yes', 'YES', etc. - /yes/i; # also match 'yes' in a case-insensitive way - -The last example shows a match with an C<'i'> B<modifier>, which makes -the match case-insensitive. - -Character classes also have ordinary and special characters, but the -sets of ordinary and special characters inside a character class are -different than those outside a character class. The special -characters for a character class are C<-]\^$> and are matched using an -escape: - - /[\]c]def/; # matches ']def' or 'cdef' - $x = 'bcr'; - /[$x]at/; # matches 'bat, 'cat', or 'rat' - /[\$x]at/; # matches '$at' or 'xat' - /[\\$x]at/; # matches '\at', 'bat, 'cat', or 'rat' - -The special character C<'-'> acts as a range operator within character -classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]> -become the svelte C<[0-9]> and C<[a-z]>: - - /item[0-9]/; # matches 'item0' or ... or 'item9' - /[0-9a-fA-F]/; # matches a hexadecimal digit - -If C<'-'> is the first or last character in a character class, it is -treated as an ordinary character. - -The special character C<^> in the first position of a character class -denotes a B<negated character class>, which matches any character but -those in the brackets. Both C<[...]> and C<[^...]> must match a -character, or the match fails. Then - - /[^a]at/; # doesn't match 'aat' or 'at', but matches - # all other 'bat', 'cat, '0at', '%at', etc. - /[^0-9]/; # matches a non-numeric character - /[a^]at/; # matches 'aat' or '^at'; here '^' is ordinary - -Perl has several abbreviations for common character classes: - -=over 4 - -=item * - -\d is a digit and represents [0-9] - -=item * - -\s is a whitespace character and represents [\ \t\r\n\f] - -=item * - -\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_] - -=item * - -\D is a negated \d; it represents any character but a digit [^0-9] - -=item * - -\S is a negated \s; it represents any non-whitespace character [^\s] - -=item * - -\W is a negated \w; it represents any non-word character [^\w] - -=item * - -The period '.' matches any character but "\n" - -=back - -The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside -of character classes. Here are some in use: - - /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format - /[\d\s]/; # matches any digit or whitespace character - /\w\W\w/; # matches a word char, followed by a - # non-word char, followed by a word char - /..rt/; # matches any two chars, followed by 'rt' - /end\./; # matches 'end.' - /end[.]/; # same thing, matches 'end.' - -The S<B<word anchor> > C<\b> matches a boundary between a word -character and a non-word character C<\w\W> or C<\W\w>: - - $x = "Housecat catenates house and cat"; - $x =~ /\bcat/; # matches cat in 'catenates' - $x =~ /cat\b/; # matches cat in 'housecat' - $x =~ /\bcat\b/; # matches 'cat' at end of string - -In the last example, the end of the string is considered a word -boundary. - -=head2 Matching this or that - -We can match match different character strings with the B<alternation> -metacharacter C<'|'>. To match C<dog> or C<cat>, we form the regex -C<dog|cat>. As before, perl will try to match the regex at the -earliest possible point in the string. At each character position, -perl will first try to match the the first alternative, C<dog>. If -C<dog> doesn't match, perl will then try the next alternative, C<cat>. -If C<cat> doesn't match either, then the match fails and perl moves to -the next position in the string. Some examples: - - "cats and dogs" =~ /cat|dog|bird/; # matches "cat" - "cats and dogs" =~ /dog|cat|bird/; # matches "cat" - -Even though C<dog> is the first alternative in the second regex, -C<cat> is able to match earlier in the string. - - "cats" =~ /c|ca|cat|cats/; # matches "c" - "cats" =~ /cats|cat|ca|c/; # matches "cats" - -At a given character position, the first alternative that allows the -regex match to succeed wil be the one that matches. Here, all the -alternatives match at the first string position, so th first matches. - -=head2 Grouping things and hierarchical matching - -The B<grouping> metacharacters C<()> allow a part of a regex to be -treated as a single unit. Parts of a regex are grouped by enclosing -them in parentheses. The regex C<house(cat|keeper)> means match -C<house> followed by either C<cat> or C<keeper>. Some more examples -are - - /(a|b)b/; # matches 'ab' or 'bb' - /(^a|b)c/; # matches 'ac' at start of string or 'bc' anywhere - - /house(cat|)/; # matches either 'housecat' or 'house' - /house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or - # 'house'. Note groups can be nested. - - "20" =~ /(19|20|)\d\d/; # matches the null alternative '()\d\d', - # because '20\d\d' can't match - -=head2 Extracting matches - -The grouping metacharacters C<()> also allow the extraction of the -parts of a string that matched. For each grouping, the part that -matched inside goes into the special variables C<$1>, C<$2>, etc. -They can be used just as ordinary variables: - - # extract hours, minutes, seconds - $time =~ /(\d\d):(\d\d):(\d\d)/; # match hh:mm:ss format - $hours = $1; - $minutes = $2; - $seconds = $3; - -In list context, a match C</regex/> with groupings will return the -list of matched values C<($1,$2,...)>. So we could rewrite it as - - ($hours, $minutes, $second) = ($time =~ /(\d\d):(\d\d):(\d\d)/); - -If the groupings in a regex are nested, C<$1> gets the group with the -leftmost opening parenthesis, C<$2> the next opening parenthesis, -etc. For example, here is a complex regex and the matching variables -indicated below it: - - /(ab(cd|ef)((gi)|j))/; - 1 2 34 - -Associated with the matching variables C<$1>, C<$2>, ... are -the B<backreferences> C<\1>, C<\2>, ... Backreferences are -matching variables that can be used I<inside> a regex: - - /(\w\w\w)\s\1/; # find sequences like 'the the' in string - -C<$1>, C<$2>, ... should only be used outside of a regex, and C<\1>, -C<\2>, ... only inside a regex. - -=head2 Matching repetitions - -The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us -to determine the number of repeats of a portion of a regex we -consider to be a match. Quantifiers are put immediately after the -character, character class, or grouping that we want to specify. They -have the following meanings: - -=over 4 - -=item * - -C<a?> = match 'a' 1 or 0 times - -=item * - -C<a*> = match 'a' 0 or more times, i.e., any number of times - -=item * - -C<a+> = match 'a' 1 or more times, i.e., at least once - -=item * - -C<a{n,m}> = match at least C<n> times, but not more than C<m> -times. - -=item * - -C<a{n,}> = match at least C<n> or more times - -=item * - -C<a{n}> = match exactly C<n> times - -=back - -Here are some examples: - - /[a-z]+\s+\d*/; # match a lowercase word, at least some space, and - # any number of digits - /(\w+)\s+\1/; # match doubled words of arbitrary length - $year =~ /\d{2,4}/; # make sure year is at least 2 but not more - # than 4 digits - $year =~ /\d{4}|\d{2}/; # better match; throw out 3 digit dates - -These quantifiers will try to match as much of the string as possible, -while still allowing the regex to match. So we have - - $x = 'the cat in the hat'; - $x =~ /^(.*)(at)(.*)$/; # matches, - # $1 = 'the cat in the h' - # $2 = 'at' - # $3 = '' (0 matches) - -The first quantifier C<.*> grabs as much of the string as possible -while still having the regex match. The second quantifier C<.*> has -no string left to it, so it matches 0 times. - -=head2 More matching - -There are a few more things you might want to know about matching -operators. In the code - - $pattern = 'Seuss'; - while (<>) { - print if /$pattern/; - } - -perl has to re-evaluate C<$pattern> each time through the loop. If -C<$pattern> won't be changing, use the C<//o> modifier, to only -perform variable substitutions once. If you don't want any -substitutions at all, use the special delimiter C<m''>: - - $pattern = 'Seuss'; - m'$pattern'; # matches '$pattern', not 'Seuss' - -The global modifier C<//g> allows the matching operator to match -within a string as many times as possible. In scalar context, -successive matches against a string will have C<//g> jump from match -to match, keeping track of position in the string as it goes along. -You can get or set the position with the C<pos()> function. -For example, - - $x = "cat dog house"; # 3 words - while ($x =~ /(\w+)/g) { - print "Word is $1, ends at position ", pos $x, "\n"; - } - -prints - - Word is cat, ends at position 3 - Word is dog, ends at position 7 - Word is house, ends at position 13 - -A failed match or changing the target string resets the position. If -you don't want the position reset after failure to match, add the -C<//c>, as in C</regex/gc>. - -In list context, C<//g> returns a list of matched groupings, or if -there are no groupings, a list of matches to the whole regex. So - - @words = ($x =~ /(\w+)/g); # matches, - # $word[0] = 'cat' - # $word[1] = 'dog' - # $word[2] = 'house' - -=head2 Search and replace - -Search and replace is performed using C<s/regex/replacement/modifiers>. -The C<replacement> is a Perl double quoted string that replaces in the -string whatever is matched with the C<regex>. The operator C<=~> is -also used here to associate a string with C<s///>. If matching -against C<$_>, the S<C<$_ =~> > can be dropped. If there is a match, -C<s///> returns the number of substitutions made, otherwise it returns -false. Here are a few examples: - - $x = "Time to feed the cat!"; - $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!" - $y = "'quoted words'"; - $y =~ s/^'(.*)'$/$1/; # strip single quotes, - # $y contains "quoted words" - -With the C<s///> operator, the matched variables C<$1>, C<$2>, etc. -are immediately available for use in the replacement expression. With -the global modifier, C<s///g> will search and replace all occurrences -of the regex in the string: - - $x = "I batted 4 for 4"; - $x =~ s/4/four/; # $x contains "I batted four for 4" - $x = "I batted 4 for 4"; - $x =~ s/4/four/g; # $x contains "I batted four for four" - -The evaluation modifier C<s///e> wraps an C<eval{...}> around the -replacement string and the evaluated result is substituted for the -matched substring. Some examples: - - # reverse all the words in a string - $x = "the cat in the hat"; - $x =~ s/(\w+)/reverse $1/ge; # $x contains "eht tac ni eht tah" - - # convert percentage to decimal - $x = "A 39% hit rate"; - $x =~ s!(\d+)%!$1/100!e; # $x contains "A 0.39 hit rate" - -The last example shows that C<s///> can use other delimiters, such as -C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are used -C<s'''>, then the regex and replacement are treated as single quoted -strings. - -=head2 The split operator - -C<split /regex/, string> splits C<string> into a list of substrings -and returns that list. The regex determines the character sequence -that C<string> is split with respect to. For example, to split a -string into words, use - - $x = "Calvin and Hobbes"; - @word = split /\s+/, $x; # $word[0] = 'Calvin' - # $word[1] = 'and' - # $word[2] = 'Hobbes' - -To extract a comma-delimited list of numbers, use - - $x = "1.618,2.718, 3.142"; - @const = split /,\s*/, $x; # $const[0] = '1.618' - # $const[1] = '2.718' - # $const[2] = '3.142' - -If the empty regex C<//> is used, the string is split into individual -characters. If the regex has groupings, then list produced contains -the matched substrings from the groupings as well: - - $x = "/usr/bin"; - @parts = split m!(/)!, $x; # $parts[0] = '' - # $parts[1] = '/' - # $parts[2] = 'usr' - # $parts[3] = '/' - # $parts[4] = 'bin' - -Since the first character of $x matched the regex, C<split> prepended -an empty initial element to the list. - -=head1 BUGS - -None. - -=head1 SEE ALSO - -This is just a quick start guide. For a more in-depth tutorial on -regexes, see L<perlretut> and for the reference page, see L<perlre>. - -=head1 AUTHOR AND COPYRIGHT - -Copyright (c) 2000 Mark Kvale -All rights reserved. - -This document may be distributed under the same terms as Perl itself. - -=head2 Acknowledgments - -The author would like to thank Mark-Jason Dominus, Tom Christiansen, -Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful -comments. - -=cut - |