Regex Operators
by pixelatedcyberdust
Date: April 03, 2004
To put it simply, a regex unlocks the full power of string comparison.
That is, it gives us full control over how we view and manipulate
any string (variable) we have. Regex is short for Regular
Expression, and it takes even advanced programmers a while to understand
regexes well enough to code them efficiently and accurately.
Regular expressions are one of the reasons
Perl is such a powerful language. Mastering them will give you full control
over the data you're using in your scripts. Before we begin,
here is a simple regex for us to look at.
$variable =~ s/text/TEXT/gi;
The m// operator
The m// operator is
how we deal with matching. It is used against the default
variable $_ unless told otherwise, but matching against another variable is as easy
as binding that variable with =~. The matching operator works well
when you need to know whether a string contains a certain character, group
of characters, word or group of words. Saying if
($line eq "test") will not work if all we want to
know is whether the word test exists in $line, so we would use m//
instead.
The main difference between a
simple eq or == and m// is that the first tests for equality and the second tests
for the existence of the value inside the string.
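To make the difference concrete, here is a minimal sketch that runs both tests on the same string. The variable names $is_equal and $contains are only illustrative.

```perl
use strict;
use warnings;

my $line = "this is a test";

# eq is true only when the whole string is exactly "test"
my $is_equal = ($line eq "test") ? 1 : 0;

# m// is true when "test" appears anywhere inside the string
my $contains = ($line =~ m/test/) ? 1 : 0;

print "equal: $is_equal, contains: $contains\n";
```

The equality test fails because $line holds more than the word test, while the match succeeds because the word is somewhere inside it.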
my $line;
while($line = <STDIN>)
{
if($line =~ m/exit/) { exit; }
}
This example acts as an infinite
loop until it matches what we're looking for. It keeps asking for input, and
unless a line contains the word exit it's not going to end
for us. From this you can see where our search pattern goes: the characters
or words we want to match are placed inside the //.
my $text = "a blue cow ate the cheese";
if ($text =~ m/cow/)
{
print "mooooooo";
}
We are taking a predefined variable $text
and seeing if we can match the word cow anywhere in it.
Running this code prints mooooooo
because the word can be found.
Remember, this matching operator doesn't test
for equality, it checks for the existence.
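As mentioned at the start of this section, m// is tested against $_ when no variable is bound with =~. The sketch below assumes a made-up list of lines and collects the ones that mention a cow. The names @lines and @with_cow are only for illustration.

```perl
use strict;
use warnings;

my @lines = ("a blue cow", "a red barn", "the cow ate cheese");
my @with_cow;

for (@lines) {
    # no =~ here, so m/cow/ is tested against the default variable $_
    push @with_cow, $_ if m/cow/;
}

print scalar(@with_cow), " lines mention a cow\n";
```

Inside the loop, each element is placed in $_ in turn, which is why the bare m/cow/ works without naming a variable.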
The s/// operator
The second most used operator
is the substitution operator. This gives us the power and
tools to manipulate our information in any way we wish. We could
scan an entire text file and change all the words "red" to "blue"
if that's what we wanted.
This works hand-in-hand with
the m// we just learned, in that our words either exist
and we can do something with them, or they don't. This is to say,
we can't substitute any part of our text unless the text we want to change
already exists.
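One way to see this is to check what s/// returns: the number of substitutions made, or a false value when nothing matched. This is a small sketch with two made-up strings, one that contains exit and one that doesn't.

```perl
use strict;
use warnings;

my $found   = "please exit now";
my $missing = "keep going";

# s/// returns the number of substitutions made, or a false value if none
my $changed_found   = ($found   =~ s/exit/go/) ? 1 : 0;
my $changed_missing = ($missing =~ s/exit/go/) ? 1 : 0;

print "$found (changed: $changed_found)\n";
print "$missing (changed: $changed_missing)\n";
```

The second string is left exactly as it was, because there was nothing for the substitution to find.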
my $line;
while($line = <STDIN>)
{
chomp($line);
$line =~ s/exit/go/;
print "Did you say $line?\n";
}
We're doing a bit more work in
this example because there is more to a substitution than to matching
words or phrases. This is nearly the same example we used before:
if you type any phrase containing the word exit, something will
happen. In this case, we are using s/exit/go/, which means that if
it finds the word exit, it will be replaced with the word
go.
The best way to learn is to do,
so run this script a few times and run a few tests. Type in words
that don't contain exit and some that do so you get familiar
with what's going on.
Unlike the match operator, where
we have m/word/, we have a new form: s/word/newword/. The part between
the second and third slashes is the replacement for whatever you asked for
between the first and second.
s/this/that/; # change the word this to that
s/apple/pear/; # change the word apple to pear
s/I have a red car/I have a red bike/; # change the entire sentence if it matches
A few things to note before we
move on: by default, s/// only works once and is case-sensitive.
Put simply, if we tried to change the word this to that,
it would only change the first occurrence of this,
leave the rest untouched, and never match THIS.
my $text = "the rabbit jumped down the hole where the cow lived.";
$text =~ s/the/THE/;
print $text;
This example substitutes the
lowercase word the to the uppercase THE. By running
this script you'll notice that only the first the that's found
gets replaced giving us the result: THE rabbit jumped down the hole where
the cow lived.
my $text = "the rabbit jumped down the hole where the cow lived.";
$text =~ s/the/THE/g;
print $text;
Using /g at the end of our substitution
means to substitute globally: instead of just matching the first instance
of the word or phrase, we'll substitute every time it appears in
our data. Taking the same sentence we used before, simply adding
the /g modifier to the end replaces every occurrence of the word the
and ends with the result: THE rabbit jumped down THE hole where THE cow
lived.
my $text = "The rabbit jumped down the hole where the cow lived.";
$text =~ s/the/THE/gi;
print $text;
With the small change to our sentence (we capitalized
the T in the first word, The), our substitution would normally skip it
and replace only the, because matching is case-sensitive by default.
The /i modifier changes the default to a case-insensitive
substitution. This will s/// (short for substitute) the words The,
THe, tHe and so forth with THE, and since we're still using the global
modifier /g, it will change all instances of these words.
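To compare the modifiers side by side, the sketch below applies each combination to a copy of the same sentence. The (my $copy = $original) =~ s/.../.../ idiom copies the string first so the original stays untouched. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $original = "The rabbit jumped down the hole where the cow lived.";

# no modifiers: first lowercase "the" only
(my $plain = $original) =~ s/the/THE/;

# /g: every lowercase "the"
(my $global = $original) =~ s/the/THE/g;

# /i: first "the" in any case
(my $insensitive = $original) =~ s/the/THE/i;

# /gi: every "the" in any case
(my $both = $original) =~ s/the/THE/gi;

print "$_\n" for $plain, $global, $insensitive, $both;
```

Only the /gi version changes all three words, because it is the only one that both ignores case and keeps going after the first match.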
Sometimes we want to remove certain words or phrases
instead of replacing them with another word or phrase. This can
be done by leaving the replacement part (between the second and third slashes) empty. Doing so tells
Perl that you want to substitute the matched words with nothing (an
empty substitution), therefore removing the words completely.
my $text = "The rabbit jumped down the hole where the cow lived.";
$text =~ s/the//gi;
print $text;
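In this last example, we're removing the word the
in any case and as many times as it can be found in the string.
Only the word itself is removed, so the spaces around it and the final period stay behind.
Ignoring the extra spaces, the result reads:
rabbit jumped down hole where cow lived.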
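If the leftover spaces are a problem, one possible fix is to let the pattern swallow the whitespace after each word. The sketch below goes slightly beyond what this tutorial has covered: it assumes two pattern pieces not introduced above, \b (a word boundary, so the inside of a longer word is not matched) and \s* (any whitespace that follows).

```perl
use strict;
use warnings;

my $text = "The rabbit jumped down the hole where the cow lived.";

# plain removal: the spaces around each removed word stay behind
(my $naive = $text) =~ s/the//gi;

# remove the whole word plus any whitespace after it
(my $tidy = $text) =~ s/\bthe\b\s*//gi;

print "[$naive]\n";
print "[$tidy]\n";
```

The square brackets in the output are only there to make the leading space in the first result visible.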
The tr/// operator
The translation operator also
works on $_ by default. With it we can make a character-by-character
translation. The s/// operator worked on words, numbers and phrases.
This operator works solely on characters.
my $line;
while($line = <STDIN>)
{
chomp($line);
$line =~ tr/1/0/;
print "Did you say $line?\n";
}
We are translating each occurrence of the character "1"
to "0". As with s///, the part between the second and third slashes is what
we're converting our data into if it matches. For another simple
example,
my $text = "bear";
$text =~ tr/b/t/;
This gives us the result tear, as we are replacing
the character "b" with "t".
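Like s///, tr/// also hands back a number: the count of characters it translated. This short sketch shows that return value and the default use of $_ mentioned at the top of this section. The names $count and $ones are only illustrative.

```perl
use strict;
use warnings;

my $text = "bear";

# tr/// returns how many characters were translated
my $count = ($text =~ tr/b/t/);
print "$text ($count character changed)\n";

# with no =~ binding, tr/// works on the default variable $_
$_ = "10110";
my $ones = tr/1/0/;
print "$_ ($ones characters changed)\n";
```

The count can be handy when you only want to know how often a character appears.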
We can also remove characters from our string instead
of swapping them for others. We do this using the /d (delete)
modifier. We create the character group we want to translate,
leave the second set of slashes empty and append d.
my $text = "This is a line of text";
$text =~ tr/a//d;
print "results: $text";
Take note: the second set of slashes // is to be left empty
if you want to delete the characters instead of swapping them with others.
In our example above, we removed all the "a"s from our text,
although there was only one. A better example would have been to
remove an "i" or an "e", but I'll leave that up to
you to test.
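If you want to check your own test against something, here is one way that exercise could look. It works on copies (the names $no_i and $no_e are made up) so both deletions start from the same sentence.

```perl
use strict;
use warnings;

my $text = "This is a line of text";

# delete every lowercase "i" from a copy of the sentence
(my $no_i = $text) =~ tr/i//d;

# delete every lowercase "e" from another copy
(my $no_e = $text) =~ tr/e//d;

print "without i: $no_i\n";
print "without e: $no_e\n";
```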
We now have a fairly good understanding of swapping one
character with another. Perl also allows us to swap more than one at a time.
This is to say, we can tr/// as few or
as many characters at a time as we want.
my $text = "This is the line that never ends. Yes it goes on and on my friend. "
         . "Some people started writing it, not knowing what it was. "
         . "And they'll continue writing it forever just because..."
         . "this is the line that never ends!";
$text =~ tr/th/ht/; # swap every t with h and every h with t
print "results: $text";
You will notice we are translating two different characters,
the "t" and the "h". We are swapping them with "h" and "t". You can
swap as many or as few as you want, as we discussed earlier, but keep
in mind that the order matters. The first character in the first set
will swap with the first character in the second set (our "t"
was swapped with "h"), and the second character in the first set
will always swap with the second character in the second set (our "h"
was swapped with "t").
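The position-by-position pairing is easier to see on short words. In this sketch the words and variable names are made up for illustration: the first translation maps three characters in order, the second repeats the t/h swap from above.

```perl
use strict;
use warnings;

my $word = "cabbage";

# a becomes x, b becomes y, c becomes z (paired by position)
(my $mapped = $word) =~ tr/abc/xyz/;

my $swap_me = "that";

# t becomes h and h becomes t
(my $swapped = $swap_me) =~ tr/th/ht/;

print "$word -> $mapped\n";
print "$swap_me -> $swapped\n";
```

Characters that are not listed in the first set, like the "g" and "e" in cabbage, pass through unchanged.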
This example lets us switch the h's and t's around, making
funny text :) Translations are case-sensitive too: tr/A// will not be the
same as tr/a//, and tr/// has no case-insensitive
modifier to remedy this. So you'll need to list both cases, as in tr/Aa//, if you want
to catch every form of the same character.
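Here is a small sketch of that workaround, using a made-up sentence with both upper- and lowercase "a"s and the /d modifier from earlier.

```perl
use strict;
use warnings;

my $text = "Anna ate an Apple";

# only the lowercase "a" is listed, so the capital "A"s survive
(my $lower_only = $text) =~ tr/a//d;

# both cases are listed, so every "a" and "A" is deleted
(my $both_cases = $text) =~ tr/Aa//d;

print "lowercase only: $lower_only\n";
print "both cases:     $both_cases\n";
```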
For our last example, let's have a little fun and remove
all the vowels from our text! We do that by adding each of
the vowels to the first set of // and appending the delete modifier.
my $text = "This is the line that never ends. Yes it goes on and on my friend. "
         . "Some people started writing it, not knowing what it was. "
         . "And they'll continue writing it forever just because..."
         . "this is the line that never ends!";
$text =~ tr/aeiou//d;
print "results: $text";
We get the results (LOL):
Ths s th ln tht nvr nds. Ys t gs n nd n my frnd. Sm ppl strtd wrtng t, nt knwng wht t ws. And thy'll cntn wrtng t frvr jst bcs...ths s th ln tht nvr nds!
Challenges
1) Of the three regex operators
we learned, which one(s) do not alter the data in any way?
------------------------------------------------------------------------
The m// match operator
only matches segments of a string,
s/// and tr/// are used to change the data.
------------------------------------------------------------------------
2) We are trying to remove all
the "a"s from our variable $sentence using s/// but it's not
removing "A". How can we remove all cases?
------------------------------------------------------------------------
We need to set up a
case-insensitive substitution. We do this using the case-insensitive
modifier, /i.
$sentence =~ s/a//gi;
------------------------------------------------------------------------
3) What is the difference between
substitution and translation?
------------------------------------------------------------------------
Substitution, or s///,
replaces words, numbers or phrases in a string. Translation, or
tr///, only translates or swaps characters.
An example
of s/// would be: s/word/this/gi, s/apple/pear/gi, s/moon is out/sun is
out/gi.
An example
of tr/// would be: tr/a/e/, tr/1/0/, tr/x/z/.
------------------------------------------------------------------------