Extra Features in PHP Regular Expressions
PHP Regular Expressions – Part VIII
Foreword: What we have learned so far solves many of our problems. However, there will come a time when you want to do more with regexes. So this last part of the series adds to what we have learned and introduces some extra features of PHP regular expressions.
By: Chrysanthus Date Published: 11 Aug 2012
Introduction
Internal Option Setting
You can embed modifiers in the regex (in the pattern). I will use the case-less modifier, i, to illustrate this. Remember, the case-less modifier makes matching case-insensitive. When you embed a modifier, it has its effect from the point of embedding to the end of the regex. The exception to this is when the modifier is in a subpattern (see below). A modifier is embedded by enclosing it between “(?” and “)”, so that the modifier letter comes just after the ‘?’ sign.
Consider the subject,
"XYZ"
and the regex,
"/(?i)xyz/"
Note the sequence, “(?i)”, which holds the i modifier. The above regex matches the whole of the above subject, since the modifier is the first element in the regex. The following expression produces a match:
preg_match("/(?i)xyz/", "XYZ")
Consider the following regex:
"/xy(?i)z/"
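Here, the modifier has been put just before the last character, ‘z’. When a modifier is embedded in the regex, it has effect from the point of inclusion to the end of the regex. So the above regex would match “xyZ” or “xyz”, but not “XYZ”. A modifier placed after the last character has nothing left to affect, so the following expression does not produce a match:
preg_match("/xyz(?i)/", "XYZ")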
Modifiers embedded in this way are called Internal Options.
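To see the effect of the embedding position, the patterns discussed above can be tried side by side. The following is only a small sketch; the variable names $cases and $results are my own choice for this illustration.

```php
<?php
// Each case pairs a pattern with a subject; preg_match() returns 1 on a match and 0 otherwise.
$cases = [
    ["/(?i)xyz/", "XYZ"], // modifier first: the whole pattern is case-insensitive
    ["/xy(?i)z/", "xyZ"], // only z is case-insensitive
    ["/xy(?i)z/", "XYZ"], // x and y are still case-sensitive
    ["/xyz(?i)/", "XYZ"], // nothing follows the modifier, so nothing is affected
];

$results = [];
foreach ($cases as [$pattern, $subject]) {
    $result = preg_match($pattern, $subject);
    $results[] = $result;
    echo $pattern . " against " . $subject . ": " . ($result ? "Matched" : "Not Matched") . "\n";
}
```

Only the first two cases should report a match, since in the other two the upper case letters of the subject fall in the case-sensitive part of the pattern.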
Now, the following two regexes are equivalent:
/(?i)xyz/
and
/xyz/i
For the second one, the whole regex is case insensitive; we have seen this. For the first one, the whole regex is case insensitive by virtue of the fact that the modifier is at the beginning of the regex (inside the pattern).
You can unset a modifier by preceding its letter inside the pattern with a hyphen. Let us now look for a regex that can match "XYz", "Xyz", "xYz" or "xyz". The regex for these subjects is:
/xy(?-i)z/i
Note that at the end of the regex, you have the i modifier, which makes the whole regex case insensitive. So in the regex, x and y are case insensitive. However, the case insensitivity of z has been unset by the presence of the option (?-i) just before it. So z in the regex is in lower case and would only match a corresponding lower case z in the subject.
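The unsetting of a modifier can be checked with a short loop over candidate subjects. This is just a sketch; the list of subjects and the variable names are chosen for illustration, and "XYZ" and "xyZ" are added as subjects that should fail.

```php
<?php
$pattern = "/xy(?-i)z/i"; // x and y are case-insensitive; z must be lower case

$subjects = ["XYz", "Xyz", "xYz", "xyz", "XYZ", "xyZ"];
$matched = [];
foreach ($subjects as $subject) {
    if (preg_match($pattern, $subject)) {
        $matched[] = $subject; // keep only the subjects that the regex matches
    }
}

echo implode(", ", $matched) . "\n"; // XYz, Xyz, xYz, xyz
```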
Internal options can be used with long subject strings as well. The following expression produces a match.
preg_match("/the I(?i)nternet/", "I work with the Internet.")
The regex would match “the Internet” or “the INTERNET”, since only the part after the upper case I is case insensitive.
You use the following tag to insert a comment into your regex:
(?#Comment)
You start with ‘(?#’, you type your comment, and then you end with ‘)’. The regex, /the I(?i)nternet/ can be commented as follows:
/the I(?# the first part of the regex)(?i)nternet(?# I for Internet must be in upper case)/
We saw the use of the x modifier to include a comment in a regex in part VI. Using the tag “(?#Comment)” is good when your regex and comments are on one line. If you want your regex and its comments to be on more than one line, then you should use the x modifier. Under the x modifier, unescaped white space in the pattern is ignored and everything from an unescaped # to the end of the line is a comment. So any white space that has to be matched literally must be escaped with a backslash, as follows:
$re = "/the\ I# the first part of the regex
nternet# I for Internet must be in upper case
/x";
The above literal is assigned to a variable, which can then be passed to the preg_match() function as its first argument. The literal can also be typed directly in the preg_match() function call, as follows:
preg_match("/the\ I# the first part of the regex
nternet# I for Internet must be in upper case
/x", "I work with the Internet.")
Note: the “(?#Comment)” tag cannot be nested. You cannot have “(?#Comment(?#Comment))” in a regex.
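The two ways of commenting can be compared in a short script. This is a sketch under my own assumptions: the variable names $inline and $extended, the comment wording and the test subjects are invented for the illustration. In the second pattern the space between “the” and “I” is escaped, because the x modifier ignores unescaped white space.

```php
<?php
// Comments with the (?#...) tag: the regex stays on one line.
$inline = "/the I(?# case-sensitive part)(?i)nternet(?# case-insensitive part)/";

// Comments with the x modifier: the regex spans several lines.
// "\\ " in this PHP string becomes a backslash and a space, that is, an escaped space.
$extended = "/the\\ I     # case-sensitive part
             (?i)nternet  # case-insensitive part
            /x";

foreach ([$inline, $extended] as $re) {
    echo preg_match($re, "I work with the INTERNET.") ? "Matched" : "Not Matched";
    echo "\n";
}
```

Both patterns should behave the same way: they match when “the I” appears exactly as typed, whatever the case of the letters that follow.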
A subpattern is a part of a regex enclosed in parentheses. By default, the sub string matched by any such subpattern is captured into an array. The variable of this array is the third argument in the preg_match() function. Consider the following code:
<html>
<head>
</head>
<body>
<?php
if (preg_match("/(one).*(two)/", "This is one and that is two.", $matches))
echo "Matched" . "<br />";
else
echo "Not Matched" . "<br />";
echo $matches[0] . "<br />";
echo $matches[1] . "<br />";
echo $matches[2] . "<br />";
?>
</body>
</html>
This is the output of the above code:
Matched
one and that is two
one
two
The first item in the output is “Matched”. This is displayed by the if-statement in the code when matching occurs in the function, preg_match(). The next three lines in the output are elements captured and stored in the array, $matches, by the preg_match() function. The first element in the array is the complete sub string matched in the subject string. The next two elements in the array are the sub strings of the subpatterns captured. The two subpatterns are “(one)” and “(two)”. So “one” and “two” in the subject string are captured.
You can prevent a subpattern from being captured by typing “?:” just after its opening parenthesis. In the following code, the first subpattern has been made non-capturing:
<html>
<head>
</head>
<body>
<?php
if (preg_match("/(?:one).*(two)/", "This is one and that is two.", $matches))
echo "Matched" . "<br />";
else
echo "Not Matched" . "<br />";
echo "<br />";
echo $matches[0] . "<br />";
echo $matches[1] . "<br />";
?>
</body>
</html>
The output of the code is:
Matched
one and that is two
two
The last two lines are the elements of the array. The first element of this array is the entire sub string matched. The rest of the elements are sub strings captured. We prevented the first subpattern, “(one)” from being captured by transforming it into, “(?:one)”. From the output, we see that “one” of the subject string has not been captured, as we expected. “two” has been captured.
The tag for making a group non-capturing is
(?:subpattern)
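The difference between a capturing and a non-capturing group shows up clearly when the two result arrays are printed side by side. In this sketch, captures() is a hypothetical helper written for the illustration; it is not a PHP built-in function.

```php
<?php
// Hypothetical helper: returns the $matches array for a pattern,
// or an empty array when there is no match.
function captures(string $pattern, string $subject): array
{
    return preg_match($pattern, $subject, $matches) === 1 ? $matches : [];
}

$subject = "This is one and that is two.";

$capturing    = captures("/(one).*(two)/", $subject);
$nonCapturing = captures("/(?:one).*(two)/", $subject);

print_r($capturing);    // three elements: the whole match, "one" and "two"
print_r($nonCapturing); // two elements: the whole match and "two"
```

With the non-capturing group, “two” moves up to index 1 of the array, because the first group no longer takes an index.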
We have seen how you can embed modifiers in a regex. You may want to include a modifier in a non-capturing subpattern. There are two ways of doing this. Let us say you want to include the modifier, i in the non-capturing subpattern “(?:one)” above. You can do it like this:
(?:(?i)one)
or like this
(?i:one)
The first method above is the more obvious way (based on what we have learned). The second method is like a contraction of the first method. The following expression produces a match:
preg_match("/(?:(?i)one).*(two)/", "This is ONE and that is two.")
The following expression also produces a match.
preg_match("/(?i:one).*(two)/", "This is ONE and that is two.")
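As a quick check that the two forms behave alike, the arrays they fill can be compared directly. This is a minimal sketch; the variable names $long and $short are my own.

```php
<?php
$subject = "This is ONE and that is two.";

// Long form: the modifier is embedded inside the non-capturing group.
preg_match("/(?:(?i)one).*(two)/", $subject, $long);

// Short form: the modifier is written between the '?' and the ':'.
preg_match("/(?i:one).*(two)/", $subject, $short);

var_dump($long === $short); // bool(true)
print_r($short);            // the whole match, then "two"
```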
Modifiers in Subpatterns
We said at the beginning of this part of the series, that a modifier embedded in a regex, has its effects from the point of inclusion to the end of the regex. The question you may have is this: “If the modifier is in a subpattern, would it have its effect only in the subpattern or right to the end of the regex out of the subpattern?”
Let us just write two short scripts to verify that. This is the first:
<html>
<head>
</head>
<body>
<?php
if (preg_match("/(?i:one).*(two)/", "This is ONE and that is TWO."))
echo "Matched" . "<br />";
else
echo "Not Matched" . "<br />";
?>
</body>
</html>
The above script does not produce a match. In the above script, the i modifier is inside a non-capturing subpattern. The word “TWO” inside the subject is in upper case. A match is not produced.
In the following script, we are not dealing with a non-capturing subpattern; we are dealing with a capturing subpattern.
<html>
<head>
</head>
<body>
<?php
if (preg_match("/((?i)one).*(two)/", "This is ONE and that is TWO."))
echo "Matched" . "<br />";
else
echo "Not Matched" . "<br />";
?>
</body>
</html>
The above script does not produce a match. In the above script, the i modifier is inside a capturing subpattern. The word “TWO” inside the subject is in upper case. A match is not produced.
We conclude for this section that if a modifier is in a subpattern, captured or non-captured, it has its effect only on that subpattern and not outside the subpattern. If the modifier is not inside a subpattern, it has its effect from its point of insertion to the end of the regex.
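The conclusion can be verified in one script by adding two patterns that the section does not test: one with the modifier outside any subpattern, and one with the modifier repeated in each subpattern. This is a sketch; the two extra patterns and the variable names are assumptions of mine for the illustration.

```php
<?php
$subject = "This is ONE and that is TWO.";

$patterns = [
    "/(?i:one).*(two)/",     // modifier confined to a non-capturing subpattern
    "/((?i)one).*(two)/",    // modifier confined to a capturing subpattern
    "/(?i)(one).*(two)/",    // modifier outside any subpattern: applies to the end
    "/(?i:one).*(?i:two)/",  // modifier repeated in each subpattern
];

$results = [];
foreach ($patterns as $pattern) {
    $results[$pattern] = preg_match($pattern, $subject);
    echo $pattern . ": " . ($results[$pattern] ? "Matched" : "Not Matched") . "\n";
}
```

The first two patterns should not match, because “TWO” lies outside the subpattern holding the modifier; the last two should match.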
That is it for this section.
And, finally we have come to the end of the series. We saw so many things. There are still some extra features in PHP regexes to be seen. I intend to address the extra issues as independent articles. I hope you appreciated this series.
Chrys