Perl Functions Related to Regular Expressions

Perl Regular Expressions – Part 8

Perl Course

Foreword: In this part of the series, I talk about Perl functions that are related to regular expression features.

By: Chrysanthus Date Published: 6 Oct 2015

Introduction

This is part 8 of my series, Perl Regular Expressions. In this part of the series, I talk about Perl functions that are related to regular expression features. You should have read the previous parts of the series before coming here, as this is a continuation.

ucfirst
This function changes the first letter of a string to uppercase (titlecase). If the letter is already in uppercase, it remains unchanged. The function returns a new string, leaving the original string unchanged. Try the following code:

use strict;

    my $subject = "the big boys are in town.";

    my $newSub = ucfirst($subject);

    print $newSub;

The output is:

    The big boys are in town.

uc
This function changes all the letters in the string to uppercase. It returns a new string, leaving the original string unchanged. Try the following code:

use strict;

    my $subject = "the big boys are in town.";

    my $newSub = uc($subject);

    print $newSub;

The output is:

    THE BIG BOYS ARE IN TOWN.

lcfirst
This function changes the first letter of a string to lowercase. If the letter is already in lowercase, it remains unchanged. The function returns a new string. Try the following code:

use strict;

    my $subject = "THE BIG BOYS ARE IN TOWN.";

    my $newSub = lcfirst($subject);

    print $newSub;

The output is:

tHE BIG BOYS ARE IN TOWN.

lc
This function changes all the letters of a string to lowercase. The function returns a new string. Try the following code:

use strict;

    my $subject = "THE BIG BOYS ARE IN TOWN.";

    my $newSub = lc($subject);

    print $newSub;

The output is:

the big boys are in town.
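
The four case functions can be compared side by side. The following is a minimal sketch; the subject string and the variable names are my own and serve only as illustration. It also shows that the original string is left unchanged, and that ucfirst and lc can be combined to tidy up a string of mixed case.

```perl
use strict;
use warnings;

my $mixed = "pERL is FUN.";

my $upper      = uc($mixed);            # every letter to uppercase
my $lower      = lc($mixed);            # every letter to lowercase
my $firstUpper = ucfirst($mixed);       # only the first character is changed
my $firstLower = lcfirst("PERL");       # only the first character is changed
my $sentence   = ucfirst(lc($mixed));   # lowercase everything, then raise the first letter

print $upper, "\n";
print $lower, "\n";
print $firstUpper, "\n";
print $firstLower, "\n";
print $sentence, "\n";
print $mixed, "\n";                     # the original string is unchanged
```

The output is:

    PERL IS FUN.
    perl is fun.
    PERL is FUN.
    pERL
    Perl is fun.
    pERL is FUN.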

pos
This function returns or sets the current match position.

After a successful match with the global modifier (g), the pos() function returns the position in the subject at which the search for the next match will begin.

Counting of positions in a string begins from zero. For example, if the subject begins with “A cat is an animal.”, then after the first match of “cat”, pos() returns 5. A good way to use the pos() function is in a while loop. The following code illustrates this:

use strict;

my $subject = "A cat is an animal. A rat is an animal. A bat is a creature.";

while($subject =~ /[cbr]at/g)
  {
    print "Next search starts at position: ", pos($subject), "\n";
  }

Here is the output of the code:

Next search starts at position: 5
Next search starts at position: 25
Next search starts at position: 45

The pos() function takes the subject variable as its argument. The pos() function can also be assigned to, in order to set the position where the search will continue in the subject (see later).
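
Setting the position can be sketched as follows. This is a minimal illustration, assuming that we want the search to skip the first sentence; the offset 20 is simply the index at which the second sentence begins in this particular subject, and the @found array is my own addition.

```perl
use strict;
use warnings;

my $subject = "A cat is an animal. A rat is an animal. A bat is a creature.";

# Assign to pos() to tell the next /g match where to start searching.
pos($subject) = 20;

my @found;
while ($subject =~ /([cbr]at)/g) {
    # pos() now holds the position just after the match
    push @found, $1 . ": next search starts at " . pos($subject);
}

print "$_\n" for @found;
```

The output is:

    rat: next search starts at 25
    bat: next search starts at 45

The word “cat” is not reported, because the search began after it.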

quotemeta
This function is used on the pattern string itself. It escapes (precedes with a backslash) all ASCII non-word characters in the pattern and returns a new pattern string. The ASCII word characters are the set [A-Za-z_0-9]. Consider the following regex:

    /http://www/

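Assume that this is to match the beginning of a Uniform Resource Locator. Typed literally between / delimiters like this, the regex would not compile, because the forward slashes inside it are not escaped. The pattern is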

    http://www

In the pattern, : and / are ASCII non-word characters. If quotemeta() is used on the pattern, the pattern becomes http\:\/\/www . The following code illustrates this:

use strict;

    my $pat = "http://www";
    my $pattern = quotemeta($pat);

    "http://www.somesite.com" =~ /($pattern)/;

    print $pattern, "\n";
    print $1, "\n";

The output is:

    http\:\/\/www
    http://www
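
To see why escaping matters, consider a pattern that contains a dot, which in a regex matches any character. The sketch below is only an illustration with made-up host names. It also uses the \Q...\E escape inside the regex, which has the same effect on the interpolated variable as calling quotemeta().

```perl
use strict;
use warnings;

my $host    = "www.site.com";
my $similar = "wwwXsite.com";    # not the same host: X in place of the first dot

# Unescaped: each dot matches any character, so the wrong string matches too.
my $loose = ($similar =~ /^$host$/) ? "match" : "no match";

# Escaped with quotemeta(): each dot must now be a literal dot.
my $safe  = quotemeta($host);
my $exact = ($similar =~ /^$safe$/) ? "match" : "no match";

# \Q...\E inside the regex escapes the interpolated variable in the same way.
my $inline = ($similar =~ /^\Q$host\E$/) ? "match" : "no match";

print 'Unescaped: ', $loose, "\n";
print 'quotemeta: ', $exact, "\n";
print 'With \Q...\E: ', $inline, "\n";
```

The output is:

    Unescaped: match
    quotemeta: no match
    With \Q...\E: no match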

split
This function uses a regex to separate a string into parts. The syntax to use the function is:

split /pattern/, string

The split function splits a string into a list of substrings and returns the list. The pattern is the separator, e.g. a comma. The separator is normally not part of the returned list. Consider the following subject string:

$subject = "one two three";

If we know the regex pattern that identifies the space between words, then we can split this string into a list made up of the words, “one”, “two” and “three”. This list can be received by an array. In a regex, a literal space (or an escaped space, “\ ”) matches one space, and “ +” matches a space one or more times. The regex to separate the above words is:

               / +/

We assume that space might be created by hitting the spacebar more than once. The following code illustrates the use of the split function with this pattern.

use strict;

my $subject = "one two three";

my @words = split / +/, $subject;

print "First Element is: ", $words[0], "\n";
print "Second Element is: ", $words[1], "\n";
print "Third Element is: ", $words[2], "\n";

In the subject, the words are separated by spaces. The output of the code is:

First Element is: one
Second Element is: two
Third Element is: three

The split() function has split the words in the subject using the space between the words, and put each word as an element in the array.
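
The reason for the + quantifier shows up when the subject really does contain runs of spaces. The following sketch, with a made-up subject and my own array names, compares the pattern / / with the pattern / +/ by counting the elements each one produces.

```perl
use strict;
use warnings;

my $subject = "one  two   three";    # two spaces, then three spaces

my @single = split / /,  $subject;   # every single space is a separator
my @plus   = split / +/, $subject;   # a whole run of spaces is one separator

print "With / /:  ", scalar(@single), " elements\n";
print "With / +/: ", scalar(@plus), " elements\n";
```

The output is:

    With / /:  6 elements
    With / +/: 3 elements

With / /, an empty string is produced between every two adjacent spaces, which is why the first array has six elements instead of three.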

It is possible to have words in a string separated by a comma and a space, like

my $subject = "one, two, three";

The regex to separate these words is:

          /, +/

The following code illustrates this:

use strict;

my $subject = "one, two, three";

my @words = split /, +/, $subject;

print "First Element is: ", $words[0], "\n";
print "Second Element is: ", $words[1], "\n";
print "Third Element is: ", $words[2], "\n";

The output of the above code is:

First Element is: one
Second Element is: two
Third Element is: three

Now, if the regex has capturing groups, then the list produced contains the substrings matched by the groups as well. Consider the following subject string:

my $subject = "/dir1/dir2";

The subject is a path to a directory.

We can use the following regex to split the string:

/(\/)/

The forward slash in the pattern is escaped and is in a group. The following code illustrates the use:

use strict;

my $subject = "/dir1/dir2";

my @words = split /(\/)/, $subject;

print "First Element is: ", $words[0], "\n";
print "Second Element is: ", $words[1], "\n";
print "Third Element is: ", $words[2], "\n";
print "Fourth Element is: ", $words[3], "\n";
print "Fifth Element is: ", $words[4], "\n";

The output of the code is:

First Element is:
Second Element is: /
Third Element is: dir1
Fourth Element is: /
Fifth Element is: dir2

Now, this code and its output need explanation because of the value of the first array element. We said above that if the regex has capturing groups, then the list produced contains the substrings matched by the groups as well. So the array receives both the words and the substrings matched by the group. Note also that a separator begins the subject. So the split function separates the beginning of the subject, which is an empty string, from the first separator. It returns that empty string as its first value, and an empty string prints as nothing.
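
One way to make the empty first element visible is to print each element between brackets, together with its index. The sketch below only re-displays the result of the code above; the bracket format and the @shown array are my own choices.

```perl
use strict;
use warnings;

my $subject = "/dir1/dir2";

my @words = split /(\/)/, $subject;

# Put every element inside brackets so that an empty string can be seen.
my @shown = map { $_ . ": [" . $words[$_] . "]" } 0 .. $#words;

print "Number of elements: ", scalar(@words), "\n";
print "$_\n" for @shown;
```

The output is:

    Number of elements: 5
    0: []
    1: [/]
    2: [dir1]
    3: [/]
    4: [dir2]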

Can the grouping separator be removed from the returned list held by the array? Yes: use a non-capturing group for the separator (or no group at all). The following code illustrates this:

use strict;

my $subject = "/dir1/dir2";

my @words = split /(?:\/)/, $subject;

print "First Element is: ", $words[0], "\n";
print "Second Element is: ", $words[1], "\n";
print "Third Element is: ", $words[2], "\n";

The output now is:

First Element is:
Second Element is: dir1
Third Element is: dir2

Now, the separators are no longer in the array, which has only three elements. The first element is still an empty string, because the subject begins with a separator. An empty string prints as nothing.

Can the array be shrunk by removing the empty element? Yes: use the following code:

use strict;

my $subject = "/dir1/dir2";

my @words = split /(?:\/)/, $subject;

# keep only the elements that are not empty strings
@words = grep { $_ ne "" } @words;

print "First Element is: ", $words[0], "\n";
print "Second Element is: ", $words[1], "\n";

The output is:

    First Element is: dir1
    Second Element is: dir2

The grep function returns only the elements for which its block is true, so the empty first element is dropped.

An Interesting example
Consider the following subject string:

my $subject = "http://www.somewebsite.com/dir1/dir2/file.htm";

This is a URL. Let us split this URL into its components, that is, “http:”, “www.somewebsite.com”, “dir1”, “dir2” and “file.htm”. The separator here is either a forward slash or a double forward slash. The pattern for this separator is:

/\/{1,2}/

The pattern matches one or two forward slashes, which covers both the single and the double slashes. There is no need to use a group (capturing or non-capturing) in the pattern. The separator is included in the returned list only if the pattern has a capturing group. The following code illustrates this:

use strict;

my $subject = "http://www.somewebsite.com/dir1/dir2/file.htm";

my @words = split /\/{1,2}/, $subject;

print "First Element is: ", $words[0], "\n";
print "Second Element is: ", $words[1], "\n";
print "Third Element is: ", $words[2], "\n";
print "Fourth Element is: ", $words[3], "\n";
print "Fifth Element is: ", $words[4], "\n";

So “http:” becomes the first array element, “www.somewebsite.com”, becomes the second array element, “dir1” becomes the third array element, “dir2” becomes the fourth array element and “file.htm” becomes the fifth array element.

The output is:

First Element is: http:
Second Element is: www.somewebsite.com
Third Element is: dir1
Fourth Element is: dir2
Fifth Element is: file.htm
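
Writing one print statement per element only works when the number of parts is known in advance. As a sketch, the same result can be printed with a loop, which adapts to any number of path components; the label format and the @lines array are my own.

```perl
use strict;
use warnings;

my $subject = "http://www.somewebsite.com/dir1/dir2/file.htm";

my @words = split /\/{1,2}/, $subject;

# Build one line per element, whatever the number of elements is.
my @lines = map { "Element " . ($_ + 1) . " is: " . $words[$_] } 0 .. $#words;

print "$_\n" for @lines;
```

The output is:

    Element 1 is: http:
    Element 2 is: www.somewebsite.com
    Element 3 is: dir1
    Element 4 is: dir2
    Element 5 is: file.htm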

That is it for this part of the series.

Chrys

Broad Network
