Perl has a powerful pattern-matching capability that lets it examine and modify the contents of a string compactly and precisely. In its most basic form, pattern matching tests whether a string matches a "regular expression" that you specify. Perl's regular expressions come from a software package developed by Henry Spencer, used in many UNIX utilities. They typically consist of a number of characters and character expressions chained together.
A pattern matching statement in Perl has this form:
$variable_containing_a_string =~ /regular expression/modifiers;
The "=~" operator indicates a regular expression operation. The slashes mark the beginning and end of the regular expression. The statement returns true if the string contained in the variable matches the regular expression, and false otherwise.
Here is a simple example:
$command = <STDIN>;
# If $command contains the word Simon,
# then execute the subroutine: do_what_Simon_says()
if($command =~ /Simon/) {
    do_what_Simon_says();
}
Here is another example showing words that do and do not match a particular regular expression:
| Matches | Doesn't Match |
| The man | The men |
| he maintained his dignity | hey man |
| he man | he-man |
| he ma | He ma |
Notice that the regular expression matches if it can be found anywhere within the string, the space character is significant, and letter case does matter.
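The three observations above can be checked with a short script. This is only a sketch: the pattern behind the table is not shown here, so `/he ma/` is an assumed pattern that happens to be consistent with every row, and `matches_he_ma` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Assumption: /he ma/ is a hypothetical pattern chosen to fit the table rows;
# the original pattern is not shown in the text.
my @should_match     = ('The man', 'he maintained his dignity', 'he man', 'he ma');
my @should_not_match = ('The men', 'hey man', 'he-man', 'He ma');

# Return 1 if the string contains the literal text "he ma", 0 otherwise.
sub matches_he_ma {
    my ($string) = @_;
    return $string =~ /he ma/ ? 1 : 0;
}

for my $string (@should_match, @should_not_match) {
    my $verdict = matches_he_ma($string) ? 'matches' : 'does not match';
    print "'$string' $verdict\n";
}
```

The strings in the first list match because "he ma" appears somewhere inside them; the strings in the second list fail because of a different letter, a missing space, or a capital letter.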
Some characters are written with a backslash escape:
\n # new line
\r # carriage return
\t # tab
\f # form-feed
Other symbols test whether a character belongs to a certain class. For these matches, the capitalized form means the character does NOT match the indicated type:
\d # digit
\D # NOT a digit
\s # whitespace (space, \n, \r, \t, \f)
\S # NOT a whitespace
\w # alphanumeric (word) character, including the underscore
\W # NOT an alphanumeric (word) character
| Matches | Doesn't Match |
| November 24, 1998 | Nov. 24, 1998 |
| Dec 25, 1901 | Dec 5, 1901 |
| 3 45 A19678 | 3 45A 19678 |
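As an illustration of the character-class symbols, the sketch below tests one pattern against the rows of the table. The pattern `/\w\s\d\d\W/` is an assumption: it is not given in the text and was chosen only because it agrees with all six rows. The names `$pattern`, `%expected`, and `class_match` are hypothetical.

```perl
use strict;
use warnings;

# Assumption: a hypothetical pattern, not taken from the text. It reads:
# a word character, a whitespace, two digits, then a non-word character.
my $pattern = qr/\w\s\d\d\W/;

# Expected result for each row of the table (1 = match, 0 = no match).
my %expected = (
    'November 24, 1998' => 1,
    'Dec 25, 1901'      => 1,
    '3 45 A19678'       => 1,
    'Nov. 24, 1998'     => 0,
    'Dec 5, 1901'       => 0,
    '3 45A 19678'       => 0,
);

# Return 1 if the string matches the assumed pattern, 0 otherwise.
sub class_match {
    my ($string) = @_;
    return $string =~ $pattern ? 1 : 0;
}

for my $string (sort keys %expected) {
    print "'$string': ", (class_match($string) ? 'matches' : 'does not match'), "\n";
}
```

For example, "Nov. 24, 1998" fails because the character before the space is a period, which is not a word character, and "Dec 5, 1901" fails because only one digit follows the space.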
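Modifiers placed after the closing slash change how the expression is interpreted:
i # ignore case of letters
x # extended legibility - ignore whitespace and comments inside the expression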
Example:
$test =~ /expression/i; # ignore letter case
The following expressions are equivalent:
$test =~ /\s\d\dthe\W\t/x;
$test =~ /\s    # whitespace
          \d\d  # two digits
          the   # the word 'the'
          \W    # non-word character
          \t/x; # tab / extended legibility modifier
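A small sketch can confirm how the two modifiers behave. The sample strings and variable names below are made up for illustration; only the patterns come from the examples above.

```perl
use strict;
use warnings;

# Hypothetical sample: contains a space, two digits, 'the', a hyphen, and a tab.
my $sample = "Chapter 12the-\tend";

# Compact form of the pattern.
my $compact = $sample =~ /\s\d\dthe\W\t/ ? 1 : 0;

# Same pattern written with the x modifier; whitespace and comments
# inside the pattern are ignored.
my $extended = $sample =~ /\s     # whitespace
                           \d\d   # two digits
                           the    # the word 'the'
                           \W     # non-word character
                           \t     # tab
                          /x ? 1 : 0;

# The i modifier makes letter case irrelevant.
my $case_sensitive   = 'SIMON says' =~ /simon/  ? 1 : 0;
my $case_insensitive = 'SIMON says' =~ /simon/i ? 1 : 0;

print "compact: $compact, extended: $extended\n";
print "without i: $case_sensitive, with i: $case_insensitive\n";
```

One caution when using the x modifier: a comment inside the pattern must not contain the delimiter character (here the slash), since that would end the pattern early.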
Anchors can check the relative position of an expression in a word or string.
^ # beginning of string
$ # end of string
\b # word boundary
\B # non-word boundary
Examples:
| Matches | Doesn't Match |
| the end. | the end. Or the beginning? |
| this is the end. | this is the end. Well, not quite. |

| Matches | Doesn't Match |
| Before we come to the end, let's consider | therefore, in conclusion, we must |
| Consider this phrase: the? | What do they want? |
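The patterns for these two tables are not shown, so the sketch below uses two assumed patterns that are consistent with the rows: `/the end\.$/` for the first table (end-of-string anchor) and `/\bthe\b/` for the second (word boundaries). The helper names are hypothetical.

```perl
use strict;
use warnings;

# Assumption: hypothetical patterns chosen to agree with the two tables.
my $end_pattern  = qr/the end\.$/;   # 'the end.' must be the last thing in the string
my $word_pattern = qr/\bthe\b/;      # 'the' must stand alone as a word

# Return 1 if the string ends with 'the end.', 0 otherwise.
sub ends_with_the_end {
    my ($string) = @_;
    return $string =~ $end_pattern ? 1 : 0;
}

# Return 1 if the string contains 'the' as a separate word, 0 otherwise.
sub has_word_the {
    my ($string) = @_;
    return $string =~ $word_pattern ? 1 : 0;
}

print ends_with_the_end('this is the end.'), "\n";
print ends_with_the_end('the end. Or the beginning?'), "\n";
print has_word_the('Consider this phrase: the?'), "\n";
print has_word_the('What do they want?'), "\n";
```

In "therefore" and "they", the letters "the" are followed by another word character, so there is no word boundary after them and `\b` prevents the match.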
Quantifiers specify how many times the preceding item may occur:
? # zero or one
* # zero or more
+ # one or more
{m,n} # at least m, up to n
Example:
$test =~ /\w+\s?the\s\d*\w+/;
OR:
$test =~ /\w+   # one or more alphanumeric (word) characters
          \s?   # zero or one whitespace character
          the   # the word 'the'
          \s    # whitespace
          \d*   # zero or more digits
          \w+/x; # one or more word characters / extended legibility
| Matches | Doesn't Match |
| November the 3rd | Novemberthe |
| Novemberthe third | November the3 |
| November the 24th | the 24th |
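The table can be reproduced by running the quantifier pattern from the example against each row. This is a minimal sketch; the hash `%expected` and the helper `quantifier_match` are names introduced here for illustration.

```perl
use strict;
use warnings;

# The pattern from the example above.
my $pattern = qr/\w+\s?the\s\d*\w+/;

# Expected result for each row of the table (1 = match, 0 = no match).
my %expected = (
    'November the 3rd'  => 1,
    'Novemberthe third' => 1,
    'November the 24th' => 1,
    'Novemberthe'       => 0,
    'November the3'     => 0,
    'the 24th'          => 0,
);

# Return 1 if the string matches the quantifier pattern, 0 otherwise.
sub quantifier_match {
    my ($string) = @_;
    return $string =~ $pattern ? 1 : 0;
}

for my $string (sort keys %expected) {
    print "'$string': ", (quantifier_match($string) ? 'matches' : 'does not match'), "\n";
}
```

"the 24th" fails because `\w+` requires at least one word character before "the", and "November the3" fails because the whitespace after "the" is not optional.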
Instead of matching one specific character, an expression can match any character from a particular set, enclosed in brackets []. If the first character of the set is '^', the set is negated.
[aDf35] # matches a, D, f, 3, or 5
[a-zA-Z] # any letter of the alphabet
[^G91p!] # any character except for a G, 9, 1, p, or !
| Matches | Doesn't Match |
| 1998. | 1889. |
| BnwRD42199K | November the 19th. |
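The three sets listed above can be tried out directly. The sketch below does not attempt to reconstruct the pattern behind the table; the test strings and helper names are made up for illustration.

```perl
use strict;
use warnings;

# Return 1 if the string contains a, D, f, 3, or 5.
sub in_set {
    my ($string) = @_;
    return $string =~ /[aDf35]/ ? 1 : 0;
}

# Return 1 if the whole string consists of letters only.
sub only_letters {
    my ($string) = @_;
    return $string =~ /^[a-zA-Z]+$/ ? 1 : 0;
}

# Return 1 if the string contains at least one character
# other than G, 9, 1, p, or !.
sub has_other_char {
    my ($string) = @_;
    return $string =~ /[^G91p!]/ ? 1 : 0;
}

print in_set('f'), in_set('x'), "\n";
print only_letters('Perl'), only_letters('Perl5'), "\n";
print has_other_char('G91'), has_other_char('G92'), "\n";
```

Note that a negated set still has to match a character: `[^G91p!]` fails on "G91" because every character in that string is excluded, and it would also fail on an empty string.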
The '|' character separates alternatives; the expression matches if any one of them is found:
$test =~ /the|this|that/;
| Matches |
| the item |
| this item |
| that item |
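Because a pattern matches anywhere in the string, alternatives can also match inside longer words. The sketch below compares the pattern above with a stricter variant that adds word boundaries around a parenthesized group. The stricter variant and the helper names are illustrative additions, not part of the original example.

```perl
use strict;
use warnings;

# Return 1 if any of the alternatives appears anywhere in the string.
sub has_alternative {
    my ($string) = @_;
    return $string =~ /the|this|that/ ? 1 : 0;
}

# Return 1 only if one of the alternatives appears as a whole word.
# Assumption: this stricter form is an illustrative variant.
sub has_whole_word {
    my ($string) = @_;
    return $string =~ /\b(the|this|that)\b/ ? 1 : 0;
}

for my $string ('the item', 'this item', 'that item', 'another item') {
    print "'$string': ", has_alternative($string), ' ', has_whole_word($string), "\n";
}
```

"another item" matches the plain alternation because "the" occurs inside "another", but it does not match the whole-word variant.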
A part of a regular expression enclosed in parentheses is remembered for later use. For each set of parentheses matched, a special variable is set. Within the same expression, the text matched by the first set of parentheses can be referred to as \1, the second as \2, and so on.
| Matches | Doesn't Match |
| the first of the list | the first of list |
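The pattern for this table is not shown, so the sketch below assumes `/(the).*\1/`, which is consistent with both rows: it requires the text captured by the parentheses to appear a second time later in the string. The helper name `repeats_the` is hypothetical.

```perl
use strict;
use warnings;

# Assumption: a hypothetical pattern chosen to agree with the table.
# (the) captures the word, .* skips any characters, \1 requires the
# captured text to appear again.
sub repeats_the {
    my ($string) = @_;
    return $string =~ /(the).*\1/ ? 1 : 0;
}

print repeats_the('the first of the list'), "\n";
print repeats_the('the first of list'), "\n";
```

The second string contains "the" only once, so the back-reference `\1` has nothing to match and the whole expression fails.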
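The substitution operator replaces the text matched by a regular expression with a replacement string:
$modifiable =~ s/regular expression to match/replacement expression/;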
Examples:
$date = "April 3, 1933";
$date =~ s/3/the 3rd/;
# $date = "April the 3rd, 1933"
$whitespaces = " ";
$whitespaces =~ s/\s+//;
# $whitespaces = ""
Normally, only the first match is replaced. The 'g' modifier causes every match to be replaced by the substitution string.
$date = "April 3, 1933";
$date =~ s/3/the 3rd/g;
# $date = "April the 3rd, 19the 3rdthe 3rd"
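The substitution operator also returns the number of replacements it made, which is a convenient way to see the effect of the 'g' modifier. The sketch below repeats the example and then shows one possible way to restrict the replacement to a standalone "3" using word boundaries; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# With g, every '3' is replaced; the operator returns the number of replacements.
my $date  = 'April 3, 1933';
my $count = ($date =~ s/3/the 3rd/g);
print "$count replacements: $date\n";

# Illustrative variant: \b on both sides limits the match to a '3'
# that is not part of a longer number.
my $careful       = 'April 3, 1933';
my $careful_count = ($careful =~ s/\b3\b/the 3rd/g);
print "$careful_count replacement: $careful\n";
```

In "1933" each 3 has a digit next to it, so there is no word boundary on at least one side and the stricter pattern leaves the year alone.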
Items matched in parentheses can be reused in the substitution expression. The value matched by the first set of parentheses is stored in the special variable $1, the second in $2, and so on.
$date = "April 15, 1933";
$date =~ s/(\d+), 19(\d+)/the $1th, 19$2 A.D./;
# $date = "April the 15th, 1933 A.D."
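Captured values can also be rearranged in the replacement. The following sketch wraps a substitution in a hypothetical helper, `day_first`, that works on a copy of its argument and moves the day in front of the month; the date layout it expects ("Month Day, Year") is an assumption for this example.

```perl
use strict;
use warnings;

# Rewrite 'Month Day, Year' as 'Day Month Year'.
# Strings that do not match the pattern are returned unchanged.
sub day_first {
    my ($date) = @_;
    # Copy the argument, then apply the substitution to the copy.
    (my $copy = $date) =~ s/(\w+) (\d+), (\d+)/$2 $1 $3/;
    return $copy;
}

print day_first('April 15, 1933'), "\n";
print day_first('no date here'), "\n";
```

Here $1 holds the month, $2 the day, and $3 the year, so the replacement simply lists them in a different order.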