More PHP Regular Expression Patterns
PHP Regular Expressions with Security Considerations - Part 3
Foreword: In this part of the series, I continue to analyze patterns in PHP Regular Expressions.
By: Chrysanthus Date Published: 18 Jan 2019
Introduction
Matching Repetitions
In the subject string, characters or groups of characters may repeat. I shall talk about groups of characters, as a topic, later. For now, let us concentrate on a single character repeating itself. Quantifier metacharacters allow us to match repetitions of single characters or groups of characters in the subject string. These metacharacters are: ?, *, +, and {}. They let us specify the number of repeats we are looking for. A quantifier is placed immediately after the character, character class, or grouping (see later) in the regex. Here they are with their meanings, where x refers to a particular character:
x* : means match 'x' 0 or more times, i.e., any number of times
x+ : means match 'x' 1 or more times, i.e., at least once
x? : means match 'x' 0 or 1 time
x{n,} : means match 'x' at least n times; note the comma.
x{n} : match 'x' exactly n times
x{n,m} : match 'x' at least n times, but not more than m times.
Note: the letter ‘x’ above stands for any character of text, e.g. ‘b’, ‘c’, ‘d’, ‘1’, ‘2’, etc. The quantifier is typed inside a pattern (regex).
Examples
*
This matches the preceding item 0 or more times. /o*/ is able to match a run such as “oooo”, but because zero occurrences also count, the pattern on its own succeeds with an empty match at the very first position of "A ghost booooed", before any ‘o’ is reached. To give the regex more meaning you have to combine it with other characters. For example, /bo*/ matches 'boooo' in "A ghost booooed" and 'b' in "A bird warbled", but nothing in "A goat grunted", even though this last string has an ‘o’.
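The behaviour of the * quantifier can be checked with a short script. This is only a sketch: the variable names $subjects, $results and $emptyMatch are my own, and the third argument of preg_match() is used here to capture the matched text.

```php
<?php
// Collect the first match of /bo*/ in each subject (null when nothing matches).
$subjects = ["A ghost booooed", "A bird warbled", "A goat grunted"];
$results = [];
foreach ($subjects as $subject) {
    // preg_match() returns 1 on a match, 0 on no match, and false on error.
    if (preg_match('/bo*/', $subject, $matches) === 1) {
        $results[$subject] = $matches[0];
    } else {
        $results[$subject] = null;
    }
}

// /o*/ alone may match zero o's, so it succeeds at offset 0 with an empty match.
preg_match('/o*/', "A ghost booooed", $emptyMatch);

var_dump($results);        // "boooo", "b", NULL
var_dump($emptyMatch[0]);  // string(0) ""
```

The empty match in the last line is why a bare x* is rarely useful on its own.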
+
This matches the preceding item 1 or more times. Equivalent to {1,} – see below. /a+/ matches the 'a' in "candy" and all the a's in "caaaaaaandy".
?
This matches the preceding item 0 or 1 time. /e?le?/ matches the 'el' in "angel" and the 'le' in "angle". /e?le?/ means you are looking for an ‘l’ optionally preceded by ‘e’ and optionally followed by ‘e’. This means it will also match the ‘l’ in “lying”. By the time you finish this series, you will know how to modify the regex to restrict it to match only “angel” or “angle”.
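To see exactly which characters the ? quantifier picks up, the matched text can be printed. The sketch below assumes nothing beyond preg_match(); the array name $found is mine.

```php
<?php
// Try /e?le?/ against three subjects and keep the matched text.
$found = [];
foreach (['angel', 'angle', 'lying'] as $word) {
    preg_match('/e?le?/', $word, $m);
    $found[$word] = $m[0] ?? null;
}

print_r($found);  // angel => el, angle => le, lying => l
```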
{n,}
Here, n is a positive integer. This matches at least n occurrences of the preceding item.
For example, /a{2,}/ does not match the 'a' in "candy", but matches all of the a's in "caandy" and in "caaaaaaandy".
{n}
Here n is a positive integer. This matches exactly n occurrences of the preceding item. /a{2}/ does not match the 'a' in "candy," but it matches all of the a's in "caandy," and only the first two a's in "caaandy."
{n,m}
Here n and m are positive integers. This matches at least n and at most m occurrences of the preceding item.
For example, /a{1,3}/ matches nothing in "cndy", the 'a' in "candy," the first two a's in "caandy," and the first three a's in "caaaaaaandy". Notice that when matching "caaaaaaandy", the match is "aaa", even though the subject had more a's in it.
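The three counted forms can be compared side by side. In this sketch, firstMatch() is a hypothetical helper of my own (it is not a PHP built-in), and $long is a test subject built with seven a's in a row.

```php
<?php
// Hypothetical helper: return the first matched text, or null when nothing matches.
function firstMatch(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

$long = 'c' . str_repeat('a', 7) . 'ndy';  // seven a's in a row

var_dump(firstMatch('/a{2,}/', 'candy'));   // NULL: only one 'a'
var_dump(firstMatch('/a{2,}/', $long));     // all seven a's
var_dump(firstMatch('/a{2}/', 'caaandy'));  // "aa": exactly two
var_dump(firstMatch('/a{1,3}/', 'cndy'));   // NULL: no 'a' at all
var_dump(firstMatch('/a{1,3}/', $long));    // "aaa": at most three
```

By default the quantifier takes as many repeats as it is allowed, which is why {1,3} returns three a's and not one.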
The following code produces a match:
$year = "2009";
if (preg_match("/\d{2,4}/", $year) === 1)
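This check succeeds when the subject contains a run of 2 to 4 digits. Note that the pattern is not anchored, so by itself it does not guarantee that the whole subject is a year of at most 4 digits: "20099" also matches, because its first four digits satisfy the regex. For strict validation, anchor the pattern to the start and end of the subject. You can try the above with the following program: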
<?php
$year = "2009";
if (preg_match("/\d{2,4}/", $year) === 1)
{
echo 'Matched';
}
else
{
echo 'Not Matched or Error occurred!';
}
?>
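From a security point of view, it matters whether the regex must match the whole subject or only some part of it. The sketch below shows one possible way to make the year check strict, using the \A and \z anchors (start and end of subject). The function name isShortYear() is hypothetical, chosen only for this illustration.

```php
<?php
// Hypothetical helper: true only when the whole string is 2 to 4 digits.
function isShortYear(string $input): bool
{
    // \A and \z tie the match to the very start and very end of the subject.
    return preg_match('/\A\d{2,4}\z/', $input) === 1;
}

// Unanchored: the regex finds "2009" inside the longer string.
var_dump(preg_match('/\d{2,4}/', '20099abc'));  // int(1)

// Anchored: only a subject that is entirely 2 to 4 digits passes.
var_dump(isShortYear('2009'));      // bool(true)
var_dump(isShortYear('20099abc'));  // bool(false)
var_dump(isShortYear('9'));         // bool(false)
```

I use \z here and not $ because $ also accepts a subject that ends with a newline character, which is usually not wanted when validating input.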
Matching Alternations
We can match different character strings with the alternation metacharacter '|'. To match ‘pig’ or ‘sheep’, we form the regex, /pig|sheep/. PHP will try to match the regex at the earliest possible point in the subject. At each character position, PHP will first try to match the first alternative, ‘pig’. If ‘pig’ doesn't match, PHP will then try the next alternative, ‘sheep’. If ‘sheep’ does not match either, then PHP moves on to the next position in the subject and starts with the first alternative again.
Some examples:
The following produces a match:
if (preg_match("/pig|sheep|cow/", "pigs are a group of animals") === 1)
Here, ‘pig’ is matched. There is no ‘sheep’ or ‘cow’ in the subject.
Note that in the subject, it is the sequence of letters ‘p’, ‘i’, and ‘g’ that is matched, not ‘pigs’: there is no ‘s’ after “pig” in the regex. ‘pig’ is a sub-string of the subject. Also note that what is matched is not necessarily a word, but a sub-string (which consists of characters and may even be a single character).
Note as well that a space in the subject is a character, and it can be part of a matched sub-string. What I have just said applies to all other matching, not only alternations.
The following produces a match:
if (preg_match("/pig|sheep|cow/", "sheep are a group of animals") === 1)
Here, ‘sheep’ is matched. There is no ‘pig’ or ‘cow’ in the subject. The search did not find ‘pig’ at the first position, so it tried the next alternative and matched ‘sheep’.
The following produces a match:
if (preg_match("/pig|sheep|cow/", "cows are a group of animals") === 1)
Here, ‘cow’ is matched. There is no ‘pig’ or ‘sheep’ in the subject. The search did not find ‘pig’ or ‘sheep’ at the first position, so it tried the last alternative and matched ‘cow’.
Now, in the following expression ‘pig’ and not ‘sheep’ is matched.
if (preg_match("/pig|sheep|cow/", "pigs and sheep are groups of animals") === 1)
This is because ‘pig’ appears first in the subject before ‘sheep’.
Also, in the following expression, ‘pig’ and not ‘sheep’ is matched.
if (preg_match("/sheep|pig|cow/", "pigs and sheep are groups of animals") === 1)
This is because, though ‘sheep’ is the first alternative in the regex, ‘pig’ appears earlier in the subject than ‘sheep’.
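You can confirm which alternative wins by asking preg_match() for the matched text and its position. This is a small sketch using the PREG_OFFSET_CAPTURE flag; the variable names $a and $b are mine.

```php
<?php
$subject = "pigs and sheep are groups of animals";

// PREG_OFFSET_CAPTURE stores each match as [text, byte offset].
preg_match('/pig|sheep|cow/', $subject, $a, PREG_OFFSET_CAPTURE);
preg_match('/sheep|pig|cow/', $subject, $b, PREG_OFFSET_CAPTURE);

echo $a[0][0], ' at offset ', $a[0][1], PHP_EOL;  // pig at offset 0
echo $b[0][0], ' at offset ', $b[0][1], PHP_EOL;  // pig at offset 0
```

In both cases the earliest position in the subject decides the result. The order of the alternatives only matters when more than one alternative can match at the same position.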
Metacharacters
There are some characters that you cannot use as ordinary characters in a regex, because they have special meanings there. Here they are:
{ } [ ] ( ) ^ $ . | * + ? - \
They are called metacharacters.
A metacharacter can be matched by putting a backslash before it. The following examples illustrate this:
if (preg_match("/3+3/", "3+3=6") === 1) //doesn't match because '+' is a metacharacter
if (preg_match("/3\+3/", "3+3=6") === 1) //matches because '+' becomes an ordinary '+'
The following conditional produces a match.
if (preg_match("/www\.website\.com\/contact\.html/", "www.website.com/contact.html") === 1)
Always remember that a decimal point meant literally in a pattern (regex) has to be escaped, like this “\.” (outside a character class). This is because an unescaped period in the pattern matches any character, including the dot itself, so the pattern would also match subjects you did not intend. The forward slash is escaped as well, because it is the delimiter of the regex. Decimal point, period and dot mean the same thing here.
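The difference between an escaped and an unescaped dot is easy to demonstrate. In the sketch below, $lookalike is a made-up subject for illustration. PHP also has the function preg_quote(), which escapes the metacharacters in a string for you (and the delimiter, if you pass it as the second argument).

```php
<?php
$subject   = "www.website.com/contact.html";
$lookalike = "wwwXwebsiteYcom/contact.html";  // made-up subject, no dots in the host

// Unescaped dots match any character, so the look-alike also matches.
$loose  = '/www.website.com\/contact\.html/';
// Escaped dots match only a literal dot.
$strict = '/www\.website\.com\/contact\.html/';
// preg_quote() escapes the metacharacters and the given delimiter for us.
$quoted = '/' . preg_quote($subject, '/') . '/';

var_dump(preg_match($loose, $lookalike));   // int(1)
var_dump(preg_match($strict, $lookalike));  // int(0)
var_dump(preg_match($strict, $subject));    // int(1)
var_dump(preg_match($quoted, $subject));    // int(1)
```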
You can combine matching features. I have shown you some of these, such as in /[cbr]at/. Here is another example:
if (preg_match("/\d{2,4}/", $year) === 1)
This checks that $year contains a run of at least 2 and at most 4 digits; without anchors, extra characters before or after that run are not rejected. Here, $year is the subject, and should have been declared.
Character Classes Revisited
A character class is a set of characters in square brackets. At that position in the pattern, exactly one character of the subject is matched, and it must be one of the characters in the set. Consider the pattern (regex),
$re = "/[bcr]at/";
This would match bat, cat or rat. The class is [bcr] and only one of these characters in the square brackets, together with “at”, can match something in the subject. A class is a set of these characters; [gjd] is another class, [hdqwe] is another class, [opqd] is another class, etc. Only one of the characters in the square brackets, together with the rest of the pattern, would match something in the subject.
Note the following rules:
The dash character, -, inside a character class indicates a range. I have shown you this before. However, the dash character outside the character class is taken literally. If you want - itself as a character in the class (within square brackets), then escape it like this, \- .
The circumflex character, ‘^’, at the beginning of the character class, negates the class. Inside the character class, but not at the beginning of the class, it is taken as an ordinary character. Outside the character class, at the beginning of the overall pattern, it matches the start of the subject. Outside the character class, but not at the beginning of the overall pattern, it is still a metacharacter, so escape it, like this \^ , if you want a literal circumflex.
Outside the character class, the escape sequence, \b is treated as a word boundary; inside a character class, it is treated as a backspace character.
Inside the character class, the period has no special meaning. Outside the class and in the pattern, it matches any character except the \n character in the subject, by default. I will explain what “by default” means here, later.
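The rules above can be tried out one at a time. The following is only a sketch with test subjects I made up; each variable holds the return value of preg_match(), which is 1 for a match and 0 for no match.

```php
<?php
// An escaped dash inside a class is a literal dash, not a range.
$dash     = preg_match('/[a\-z]/', '-');   // 1: '-' is a member of the class
$notRange = preg_match('/[a\-z]/', 'm');   // 0: the class holds only a, - and z

// A leading ^ negates the class; elsewhere in the class it is an ordinary character.
$negated      = preg_match('/[^0-9]/', '123');  // 0: every character is a digit
$literalCaret = preg_match('/[0-9^]/', '^');    // 1

// \b is a word boundary outside a class and a backspace inside one.
$boundary   = preg_match('/\bcat\b/', 'a cat sat');    // 1
$noBoundary = preg_match('/\bcat\b/', 'concatenate');  // 0
$backspace  = preg_match('/[\b]/', "\x08");            // 1

// The period is literal inside a class; outside it matches any character except "\n".
$classDot   = preg_match('/[.]/', 'a');   // 0
$bareDot    = preg_match('/./', 'a');     // 1
$dotNewline = preg_match('/./', "\n");    // 0

var_dump($dash, $notRange, $negated, $literalCaret);
var_dump($boundary, $noBoundary, $backspace);
var_dump($classDot, $bareDot, $dotNewline);
```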
That is it for this part of the series. Let us take a break here. We continue in the next part.
Chrys