First Occurrence in PHP Regular Expression Matching
Advanced PHP Regular Expressions - Part 4
Foreword: When PHP evaluates a match, the regex matches the first (leftmost) occurrence of the substring in the subject. That is what I talk about in this part of the series.
By: Chrysanthus Date Published: 11 Jul 2019
Introduction
Illustration
Read and test the following script:
<?php
$subject = "I am a man. You are a man. He is a man.";
$regex = '/man/';
$ret = preg_match($regex, $subject, $matches);
echo $ret;
?>
The output is:
1
The preg_match() function returns 1 if the pattern matches the given subject, 0 if it does not, or FALSE if an error occurred.
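The three possible return values can be told apart with strict comparison. The following is only a minimal sketch: the function name describeMatch() is made up for this illustration and is not part of PHP or of the original script.

```php
<?php
// Hypothetical helper, for illustration only.
function describeMatch(string $regex, string $subject): string
{
    $ret = preg_match($regex, $subject);
    if ($ret === false) {
        // An error occurred; preg_last_error() gives the error code.
        return 'error code ' . preg_last_error();
    }
    // 1 means the pattern matched, 0 means it did not.
    return $ret === 1 ? 'match' : 'no match';
}

$subject = "I am a man. You are a man. He is a man.";
echo describeMatch('/man/', $subject), '<br>';
echo describeMatch('/woman/', $subject), '<br>';
```

The strict comparison (===) matters because 0 and FALSE are both treated as false in a plain if-condition, although they mean different things here.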
The output of 1 means that matching has taken place. In the subject string, the word “man” appears in three places. The regex is /man/. It is the first occurrence of the substring “man” that is matched in the subject; the other occurrences of “man” in the subject are ignored. If you want all the occurrences of the substring in question to be matched, you have to use the preg_match_all() function. The first occurrence can be called the leftmost occurrence. The following program shows the order in which preg_match_all() finds the occurrences:
<?php
$subject = "This is a cat. That is a rat. Here is a bat.";
$regex = '/[brc]at/';
preg_match_all($regex, $subject, $matches);
echo $matches[0][0], '<br>';
echo $matches[0][1], '<br>';
echo $matches[0][2], '<br>';
?>
In the following order, the output is:
cat
rat
bat
With the preg_match_all() function, “cat”, which occurs first in the subject, is matched first, as expected. The leftmost occurrence (substring) in the subject is always matched first. With preg_match() instead of preg_match_all(), the rest of the occurrences are ignored. Try the following code, which shows that the rest is ignored:
<?php
$subject = "This is a cat. That is a rat. Here is a bat.";
$regex = '/[brc]at/';
preg_match($regex, $subject, $matches);
echo $matches[0], '<br>';
// Indexes 1 and 2 do not exist here; ?? avoids an undefined-key warning.
echo $matches[1] ?? '', '<br>';
echo $matches[2] ?? '', '<br>';
?>
The match array acquires only:
cat
In this code, the regex matches just “cat”, which is the first occurrence of the possible matches in the subject. The preg_match() function, and not the preg_match_all() function, has been used. In this situation, the $matches array is one-dimensional and not two-dimensional, and it has only one element, at index 0.
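To see where in the subject that single match was found, preg_match() can be given the PREG_OFFSET_CAPTURE flag. This is a small sketch using the same subject and regex; the variable names $text and $offset are chosen just for this illustration.

```php
<?php
$subject = "This is a cat. That is a rat. Here is a bat.";
$regex = '/[brc]at/';
preg_match($regex, $subject, $matches, PREG_OFFSET_CAPTURE);
// With PREG_OFFSET_CAPTURE, each entry is [matched text, byte offset].
list($text, $offset) = $matches[0];
echo $text, ' at offset ', $offset, '<br>';
// Only the first occurrence is stored, so the array has one entry.
echo 'entries in $matches: ', count($matches), '<br>';
```

The offset is counted in bytes from the start of the subject, beginning at 0. Note that with this flag each entry of $matches is itself a small array, so $matches is no longer a plain list of strings.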
Same thing with Alternatives
A character class such as [brc] produces a set of alternatives in the regex. With any form of alternatives in the regex, with or without the preg_match_all() function, it is the first occurrence in the subject that is matched first. Try the following code, which uses the alternation operator, |, in the regex and uses preg_match() rather than preg_match_all():
<?php
$subject = "This is a child. That is a man. Here is a woman.";
$regex = '/woman|man|child/';
preg_match($regex, $subject, $matches);
echo $matches[0], '<br>';
// Indexes 1 and 2 do not exist here; ?? avoids an undefined-key warning.
echo $matches[1] ?? '', '<br>';
echo $matches[2] ?? '', '<br>';
?>
The output is just, “child”, which is the leftmost or first occurrence substring in the subject.
First occurrence is what is in the subject, not what is in the regex. The first occurrence substring may not correspond to the first subpattern in the regex and may even be the last, as in the above code.
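One way to check that the order of the alternatives in the regex does not decide which occurrence is found is to reorder them and compare the results. The sketch below assumes the same subject as above; the array names $regexes and $firstMatches are introduced only for this illustration.

```php
<?php
$subject = "This is a child. That is a man. Here is a woman.";
// The same three alternatives, written in three different orders.
$regexes = ['/woman|man|child/', '/man|child|woman/', '/child|woman|man/'];
$firstMatches = [];
foreach ($regexes as $regex) {
    preg_match($regex, $subject, $matches);
    // Remember what each regex matched first.
    $firstMatches[] = $matches[0];
    echo $regex, ' matches ', $matches[0], '<br>';
}
```

Each of the three regexes reports “child”, because “child” is the leftmost substring in the subject that any of the alternatives can match.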
Nested Groups
With nested groups, it is still the first occurrence in the subject that is matched first; it does not matter what nests what in the regex. Whichever alternatives inside the groups fit that first occurrence are the ones used. Try the following code, which illustrates this with the preg_match() function:
<?php
$subject = "keepers, bookkeepers, bookkeeper and book go together.";
$regex = '/book(keeper(s|)|)/';
preg_match($regex, $subject, $matches);
echo $matches[0], '<br>';
echo $matches[1], '<br>';
echo $matches[2], '<br>';
?>
The output is:
bookkeepers
keepers
s
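The search goes through the subject from left to right and stops at the first position where the whole regex can match. The word “keepers” at the very start of the subject is not matched, because every match of this regex must begin with “book”. The first occurrence in the subject that the regex can match is therefore “bookkeepers”, and it goes into $matches[0]. The other two outputs are not further occurrences; they are the parts of that same match captured by the groups: “keepers” is captured by the outer group and goes into $matches[1], and “s” is captured by the inner (nested) group and goes into $matches[2]. The later substrings “bookkeeper” and “book” are ignored by preg_match().
Note: the whole match always goes first into the array, followed by the groups in the order of their opening parentheses. The first occurrence in the subject is matched by whichever alternatives of the regex fit it.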
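The positions of the whole match and of its groups can be inspected with the PREG_OFFSET_CAPTURE flag. This is a minimal sketch using the same subject and regex as above; the loop variables $index and $pair are chosen just for this illustration.

```php
<?php
$subject = "keepers, bookkeepers, bookkeeper and book go together.";
$regex = '/book(keeper(s|)|)/';
preg_match($regex, $subject, $matches, PREG_OFFSET_CAPTURE);
// Index 0 is the whole match; indexes 1 and 2 are the two groups,
// numbered by the position of their opening parenthesis.
foreach ($matches as $index => $pair) {
    // Each $pair is [captured text, byte offset in the subject].
    echo $index, ': ', $pair[0], ' at offset ', $pair[1], '<br>';
}
```

The whole match starts at offset 9, not at offset 0, which shows that the leading “keepers” was skipped. The two groups start at offsets 13 and 19, both inside the span of the whole match.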
That is it for this part of the series. We stop here and continue in the next part.
Chrys