Pattern Matching

Perl has a powerful pattern-matching capability that lets it examine and modify the contents of a string compactly and precisely. In its most basic form, pattern matching tests whether a string matches a "regular expression" that you specify. Perl's regular expressions derive from a software package developed by Henry Spencer, used in many UNIX utilities. They typically consist of a number of characters and character expressions chained together.


A pattern matching statement in Perl has this form:

$variable_containing_a_string =~ /regular expression/modifiers;

The "=~" indicates a regular expression operation. The slashes mark the beginning and end of the regular expression. The statement returns true if the string contained in the variable matches the regular expression, and false otherwise.

Here is a simple example:

$command = <STDIN>;
# If $command contains the word Simon, 
# then execute the subroutine: do_what_Simon_says() 
if($command =~ /Simon/) {
  do_what_Simon_says();
}

Here is another example showing words that do and do not match a particular regular expression:

Regular expression = /he ma/

Matches                      Doesn't Match
The man                      The men
he maintained his dignity    hey man
he man                       he-man
he ma                        He ma

Notice that the regular expression matches if it can be found anywhere within the string, the space character is significant, and letter case does matter.
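
The examples above can be checked with a short script. This is a minimal sketch: the names @samples and %result are illustrative and not part of the original examples.

```perl
use strict;
use warnings;

# Strings taken from the examples above.
my @samples = (
    'The man', 'he maintained his dignity', 'he man', 'he ma',
    'The men', 'hey man', 'he-man', 'He ma',
);

# Record, for each string, whether it contains "he ma" anywhere.
my %result;
foreach my $string (@samples) {
    $result{$string} = ($string =~ /he ma/) ? 1 : 0;
    printf "%-27s %s\n", $string,
        $result{$string} ? 'matches' : 'does not match';
}
```

The first four strings are reported as matching and the last four as not matching.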




Non-alphanumeric characters

Besides alphanumeric characters, other characters can be tested for:

\n  # new line
\r  # carriage return
\t  # tab
\f  # form-feed

Other symbols test whether a character belongs to a certain class. For these matches, the capitalized form means the character does NOT match the indicated type:

\d  # digit
\D  # NOT a digit
\s  # whitespace (space, \n, \r, \t, \f)
\S  # NOT a whitespace
\w  # word character (letter, digit, or underscore)
\W  # NOT a word character

Regular expression = /\w\s\d\d\W\D19\d\d/

Matches              Doesn't Match
November 24, 1998    Nov. 24, 1998
Dec 25, 1901         Dec 5, 1901
3 45 A19678          3 45A 19678
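
As a rough check of the character classes, the sketch below filters the sample strings with grep. The variable names are illustrative; the last two lines simply confirm that a plain space counts as whitespace and that an underscore counts as a word character.

```perl
use strict;
use warnings;

my @candidates = (
    'November 24, 1998', 'Nov. 24, 1998',
    'Dec 25, 1901',      'Dec 5, 1901',
    '3 45 A19678',       '3 45A 19678',
);

# Keep only the strings that contain the pattern somewhere.
my @matched = grep { /\w\s\d\d\W\D19\d\d/ } @candidates;
print "$_\n" for @matched;

# A space is whitespace; an underscore is a word character.
my $space_is_whitespace = (' ' =~ /\s/) ? 1 : 0;
my $underscore_is_word  = ('_' =~ /\w/) ? 1 : 0;
```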



Expression modifiers

Modifiers added after the closing slash of a regular expression change how it is interpreted. Some useful modifiers are:

i   # ignore case of letters
x   # extended legibility - ignore whitespace and comments inside the pattern


Example:

$test =~ /expression/i;   # ignore letter case


The following expressions are equivalent:

$test =~ /\s\d\dthe\W\t/;

$test =~ /\s     # whitespace
          \d\d   # two digits
          the    # the word 'the'
          \W     # non-word character
          \t/x;  # tab / extended legibility modifier
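
The effect of both modifiers can be sketched as follows. The sample strings and variable names are made up for illustration. Note that a comment inside an /x pattern must not contain the closing delimiter (here a slash).

```perl
use strict;
use warnings;

# Hypothetical test string: space, two digits, "the", a dash, a tab.
my $test = "x 42the-\t";

my $compact  = ($test =~ /\s\d\dthe\W\t/) ? 1 : 0;
my $extended = ($test =~ /\s      # whitespace
                          \d\d    # two digits
                          the     # the word 'the'
                          \W      # non-word character
                          \t      # tab
                         /x) ? 1 : 0;

# The i modifier makes the comparison ignore letter case.
my $case_sensitive   = ("SIMON SAYS" =~ /simon/)  ? 1 : 0;   # 0
my $case_insensitive = ("SIMON SAYS" =~ /simon/i) ? 1 : 0;   # 1

print "compact=$compact extended=$extended\n";
```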



Anchors


Anchors check the relative position of an expression in a word or string.

^   # beginning of string
$   # end of string (or just before a trailing newline)
\b  # word boundary
\B  # non-word boundary

Examples:

Regular expression = /the end\.$/

The period is escaped as \. because an unescaped . matches any single character.

Matches             Doesn't Match
the end.            the end. Or the beginning?
this is the end.    this is the end. Well, not quite.


Regular expression = /\bthe\B/

Matches                              Doesn't Match
therefore, in conclusion, we must    Before we come to the end, let's consider
What do they want?                   Consider this phrase: the?
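
The anchors can be tried out with a sketch like the one below (variable names are illustrative). The third test shows why $ is convenient for input read from <STDIN>, which normally still carries its trailing newline.

```perl
use strict;
use warnings;

# $ anchors the pattern to the end of the string.
my $at_end     = ("this is the end." =~ /the end\.$/)           ? 1 : 0;   # 1
my $not_at_end = ("the end. Or the beginning?" =~ /the end\.$/) ? 1 : 0;   # 0

# $ also matches just before a trailing newline.
my $before_newline = ("the end.\n" =~ /the end\.$/) ? 1 : 0;               # 1

# \bthe\B: "the" at the start of a word, followed by another word character.
my $starts_longer_word = ("What do they want?" =~ /\bthe\B/)         ? 1 : 0;   # 1
my $whole_word_only    = ("Consider this phrase: the?" =~ /\bthe\B/) ? 1 : 0;   # 0

print "$at_end $not_at_end $before_newline $starts_longer_word $whole_word_only\n";
```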




Multipliers

Multipliers specify how many times the preceding character or expression may occur:

?      # zero or one
*      # zero or more
+      # one or more
{m,n}  # at least m, up to n

Example:

$test =~ /\w+\s?the\s\d*\w+/;

OR:

$test =~ /\w+    # one or more alphanumeric (word) characters
          \s?    # zero or one whitespace character
          the    # the word 'the'
          \s     # whitespace
          \d*    # zero or more digits
          \w+/x; # one or more word characters / extended legibility

Regular expression = /\w+\s?the\s\d*\w+/

Matches              Doesn't Match
November the 3rd     Novemberthe
Novemberthe third    November the3
November the 24th    the 24th
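
A small sketch of the multipliers in use follows. The {m,n} example, with its ^ and $ anchors and its list of sample strings, is an illustrative addition rather than one of the original examples.

```perl
use strict;
use warnings;

my @samples = (
    'November the 3rd', 'Novemberthe third', 'November the 24th',
    'Novemberthe',      'November the3',     'the 24th',
);

# Strings containing: word, optional whitespace, "the", whitespace,
# optional digits, word.
my @matched = grep { /\w+\s?the\s\d*\w+/ } @samples;
print "$_\n" for @matched;

# {m,n}: strings made up of at least two and at most four digits.
my @two_to_four = grep { /^\d{2,4}$/ } ('7', '98', '1998', '19987');
print "@two_to_four\n";   # 98 1998
```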


Character ranges

Instead of matching just a single character, you can match any one of a particular set of characters enclosed in brackets []. If the first character of the set is '^', the set is negated.

[aDf35]   # matches a, D, f, 3, or 5
[a-zA-Z]  # any letter of the alphabet
[^G91p!]  # any character except for a G, 9, 1, p, or !

Regular expression = /19[0-9]+\D/

Matches        Doesn't Match
1998.          1889.
BnwRD42199K    November the 19th.
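
Character sets can be exercised with a sketch such as this one. The hexadecimal and vowel checks are hypothetical examples added for illustration.

```perl
use strict;
use warnings;

# Strings containing "19", one or more digits, then a non-digit.
my @matched = grep { /19[0-9]+\D/ }
    ('1998.', '1889.', 'BnwRD42199K', 'November the 19th.');
print "$_\n" for @matched;

# A set with several ranges: only hexadecimal digits from start to end.
my $is_hex  = ('3fA9' =~ /^[0-9a-fA-F]+$/) ? 1 : 0;   # 1
my $not_hex = ('3fG9' =~ /^[0-9a-fA-F]+$/) ? 1 : 0;   # 0

# A negated set: the string contains no lowercase vowels at all.
my $no_vowels = ('rhythm' =~ /^[^aeiou]+$/) ? 1 : 0;  # 1
```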



Alternation


Instead of matching one of many different characters, you can test for one of several different expressions. The alternatives are separated by '|', which indicates an OR.

Example:
$test =~ /the|this|that/;

Regular expression = /the|this|that/

Matches

the item
this item
that item
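
One point worth illustrating is that an unanchored alternation also matches inside longer words. The sketch below, using made-up strings and variable names, shows this and one possible way to restrict the match to whole words with \b and parentheses. The $1 variable it uses is described under Substitutions.

```perl
use strict;
use warnings;

# "other" contains the letters "the", so the plain alternation matches it.
my $loose = ('other' =~ /the|this|that/) ? 1 : 0;                 # 1

# Grouping the alternatives between word boundaries requires a whole word.
my $whole_word = ('other' =~ /\b(the|this|that)\b/) ? 1 : 0;      # 0

# The parentheses also record which alternative matched.
my $which = ('that item' =~ /\b(the|this|that)\b/) ? $1 : '';     # 'that'

print "loose=$loose whole_word=$whole_word which=$which\n";
```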



Parentheses and References

A part of a regular expression enclosed in parentheses is remembered for later use. Each set of parentheses that matches sets a numbered reference: within the same regular expression, the text matched by the first set of parentheses is available as \1, the second as \2, and so on.

Regular expression = /(the) first of \1 list/

Matches                  Doesn't Match
the first of the list    the first of list
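
A minimal sketch of back-references follows. The doubled-word check and its sample sentence are hypothetical additions, shown as one common use of \1.

```perl
use strict;
use warnings;

# \1 must match the same text that the first parentheses matched.
my $with_article    = ('the first of the list' =~ /(the) first of \1 list/) ? 1 : 0;   # 1
my $without_article = ('the first of list'     =~ /(the) first of \1 list/) ? 1 : 0;   # 0

# Find a word that is immediately repeated.
my $doubled = ('it was the the best' =~ /\b(\w+)\s+\1\b/) ? $1 : '';   # 'the'

print "$with_article $without_article '$doubled'\n";
```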


Substitutions

Substitutions are an important part of Perl's regular expression capabilities. To make a regular expression a substitution, prefix it with an 's' and append a replacement expression.

$modifiable =~ s/regular expression to match/replacement expression/;

Examples:

$date = "April 3, 1933";
$date =~ s/3/the 3rd/;      # $date = "April the 3rd, 1933"

$whitespaces = "       ";
$whitespaces =~ s/\s+//;    # $whitespaces = ""

Normally, only the first item matched is substituted for. The 'g' modifier causes all items matched to be replaced by the substitution string.

$date = "April 3, 1933";
$date =~ s/3/the 3rd/g;      
      #  $date = "April the 3rd, 19the 3rdthe 3rd"

Items matched in parentheses can be reused in the substitution expression. The value matched by the first set of parentheses is stored in the special variable $1, the second in $2, and so on.

$date = "April 15, 1933";
$date =~ s/(\d+), 19(\d+)/the $1th, 19$2 A.D./;
      # $date = "April the 15th, 1933 A.D."
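
The substitution forms above can be run together as a sketch. The variable names are illustrative, and the \b variant is an added example of limiting the 'g' modifier to a standalone 3. A substitution also returns the number of replacements it made, which is captured in $count here.

```perl
use strict;
use warnings;

# Without g, only the first match is replaced.
my $first_only = "April 3, 1933";
$first_only =~ s/3/the 3rd/;            # "April the 3rd, 1933"

# With g, every match is replaced; the return value is the count.
my $all   = "April 3, 1933";
my $count = ($all =~ s/3/the 3rd/g);    # 3 replacements

# Word boundaries keep g from touching the digits inside 1933.
my $whole = "April 3, 1933";
$whole =~ s/\b3\b/the 3rd/g;            # "April the 3rd, 1933"

# Captured text is reused in the replacement as $1 and $2.
my $date = "April 15, 1933";
$date =~ s/(\d+), 19(\d+)/the $1th, 19$2 A.D./;

print "$first_only\n$all ($count)\n$whole\n$date\n";
```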