More Regular Expression Patterns in PHP
PHP Regular Expressions – Part III
Foreword: In this part of the series, we continue to analyze patterns in PHP regular expressions.
By: Chrysanthus Date Published: 11 Aug 2012
Introduction
Matching Repetitions
In the subject string, characters or groups of characters may repeat themselves. We shall talk about groups of characters later. For now, let us concentrate on a single character repeating itself. Quantifier metacharacters allow us to match repetitions of single characters or groups of characters in the subject string. These metacharacters are ?, *, + and {}. They let us specify the number of repeats we are looking for. A quantifier is placed immediately after the character, character class, or grouping (see later) in the regex. Here they are with their meanings, where x refers to a particular character:
x* : means match 'x' 0 or more times, i.e., any number of times
x+ : means match 'x' 1 or more times, i.e., at least once
x? : means match 'x' 0 or 1 times
x{n,} : means match 'x' at least n times; note the comma.
x{n} : match 'x' exactly n times
x{n,m} : match 'x' at least n times, but not more than m times.
Note: the letter ‘x’ above stands for any character of a text, e.g. ‘b’, ‘c’, ‘d’, ‘1’, ‘2’, etc. The quantifier is typed inside a pattern (regex).
*
Matches the preceding item 0 or more times. /o*/ can match the ‘o’ in 'ghost' or the “oooo” in 'booooed' of the subject string, "A ghost booooed". However, because zero occurrences also count, /o*/ on its own succeeds at the very start of any subject string with an empty match. To give the regex more meaning, you have to combine it with other characters. For example, /bo*/ matches 'boooo' in "A ghost booooed" and 'b' in "A bird warbled", but nothing in "A goat grunted", even though this last string has an ‘o’.
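The behaviour of /bo*/ described above can be checked with a short script. This is only a sketch: the variable names ($subjects, $results, $empty) are my own, and the third argument of preg_match() is used to capture the matched text.

```php
<?php
// For each subject, record the text that /bo*/ matched, or null if nothing matched.
$subjects = ["A ghost booooed", "A bird warbled", "A goat grunted"];
$results = [];
foreach ($subjects as $subject) {
    // preg_match() returns 1 on a match and 0 otherwise; $m[0] holds the matched text.
    $results[$subject] = preg_match('/bo*/', $subject, $m) ? $m[0] : null;
}
print_r($results);

// On its own, /o*/ succeeds at the very start of the subject with an empty match,
// because "zero times" is allowed.
preg_match('/o*/', "A ghost booooed", $empty);
var_dump($empty[0]);
```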
+
Matches the preceding item 1 or more times. Equivalent to {1,} – see below. /a+/ matches the 'a' in "candy" and all the a's in "caaaaaaandy".
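As a quick sketch of the + quantifier (the variable names are illustrative only), the script below captures what /a+/ matches and compares it with the brace form {1,}:

```php
<?php
// /a+/ needs at least one 'a'; $m[0] is the run of a's that was matched.
preg_match('/a+/', 'candy', $m1);        // $m1[0] is the single 'a'
preg_match('/a+/', 'caaaaaaandy', $m2);  // $m2[0] is the whole run of a's

// With no 'a' in the subject there is nothing to match, so preg_match() returns 0.
$noMatch = preg_match('/a+/', 'cndy');

// {1,} is the same quantifier written in brace form.
preg_match('/a{1,}/', 'caaaaaaandy', $m3);

echo $m1[0], "\n", $m2[0], "\n", $m3[0], "\n";
```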
?
Matches the preceding item 0 or 1 times. /e?le?/ matches the 'el' in "angel" and the 'le' in "angle". The pattern /e?le?/ describes an ‘l’ optionally preceded by ‘e’ and optionally followed by ‘e’. This means it will also match the ‘l’ in “lying”. By the time you finish this series, you will know how to modify the regex to restrict it to match only “angel” or “angle”.
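To see exactly which characters /e?le?/ picks up in each word, you can capture the match. The following is a small sketch; $pattern and $found are names chosen just for this illustration.

```php
<?php
$pattern = '/e?le?/';
$found = [];
foreach (['angel', 'angle', 'lying'] as $word) {
    // Store the matched sub-string for each word, or null if there is no match.
    $found[$word] = preg_match($pattern, $word, $m) ? $m[0] : null;
}
print_r($found);
```

The bare ‘l’ matched in “lying” shows why an all-optional context is usually too permissive on its own.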
{n,}
Where n is a positive integer. This matches at least n occurrences of the preceding item.
For example, /a{2,}/ doesn't match the 'a' in "candy", but matches all of the a's in "caandy" and in "caaaaaaandy".
{n}
Where n is a positive integer. This matches exactly n occurrences of the preceding item. /a{2}/ doesn't match the 'a' in "candy," but it matches all of the a's in "caandy," and only the first two a's in "caaandy."
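The difference between {2,} and {2} shows up once there are three a's in a row. Here is a minimal sketch, with array names of my own choosing, that records what each pattern matches:

```php
<?php
$atLeastTwo = [];
$exactlyTwo = [];
foreach (['candy', 'caandy', 'caaandy'] as $word) {
    // {2,} takes as many a's as it can, provided there are at least two.
    $atLeastTwo[$word] = preg_match('/a{2,}/', $word, $m) ? $m[0] : null;
    // {2} stops after exactly two a's.
    $exactlyTwo[$word] = preg_match('/a{2}/', $word, $m) ? $m[0] : null;
}
print_r($atLeastTwo);
print_r($exactlyTwo);
```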
{n,m}
Where n and m are positive integers. This matches at least n and at most m occurrences of the preceding item.
The following code produces a match:
$year = "2009";
preg_match("/\d{2,4}/", $year)
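This is a simple check that the year contains a run of at least 2 digits. Note that a run longer than 4 digits still matches, because the pattern is not anchored and the first 4 digits already satisfy it. To make sure the whole string is at least 2 and not more than 4 digits, anchor the pattern: /^\d{2,4}$/. You can try the above with the following program: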
$year = "2009";
if (preg_match("/\d{2,4}/", $year))
{
echo "Matched";
}
else
{
echo "Not Matched";
}
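If the goal is to validate the whole string rather than find digits somewhere inside it, the pattern can be anchored. The sketch below assumes a helper called isShortYear(), a hypothetical name introduced only for this example.

```php
<?php
// Hypothetical helper, not part of PHP: true when $year is made of 2 to 4 digits.
function isShortYear(string $year): bool
{
    // ^ and $ force the match to span the subject from start to end
    // (note that $ still tolerates a single trailing newline).
    return preg_match('/^\d{2,4}$/', $year) === 1;
}

// Without anchors, a five-digit string still matches: the first 4 digits satisfy the pattern.
$unanchored = preg_match('/\d{2,4}/', '20095');

var_dump(isShortYear('2009'), isShortYear('20095'), $unanchored);
```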
Matching Alternations
We can match different character strings with the alternation metacharacter '|'. To match ‘pig’ or ‘sheep’, we form the regex /pig|sheep/. PHP will try to match the regex at the earliest possible point in the subject string. At each character position, PHP will first try to match the first alternative, ‘pig’. If ‘pig’ doesn't match, PHP will then try the next alternative, ‘sheep’. If ‘sheep’ doesn't match either, then PHP moves on to the next position in the subject string and starts with the first alternative again.
Some examples:
The following produces a match:
preg_match("/pig|sheep|cow/", "pigs are a group of animals")
Here, ‘pig’ is matched. There is no ‘sheep’ or ‘cow’ in the subject string.
Note that in the subject string, it is the sequence of letters ‘p’, ‘i’ and ‘g’ that is matched, not ‘pigs’: there is no ‘s’ after “pig” in the regex. What is matched is a sub-string of the subject string, not a word; a sub-string consists of one or more characters.
Note as well that the space in the subject string is a character, which could be part of a matched sub-string. What I have just said applies to all other matching, not only alternations.
The following produces a match:
preg_match("/pig|sheep|cow/", "sheep are a group of animals")
Here, ‘sheep’ is matched. There is no ‘pig’ or ‘cow’ in the subject string. The search did not see ‘pig’, so it matched ‘sheep’.
The following produces a match:
preg_match("/pig|sheep|cow/", "cows are a group of animals")
Here, ‘cow’ is matched. There is no ‘pig’ or ‘sheep’ in the subject string. The search did not see ‘pig’ or ‘sheep’, so it matched ‘cow’.
Now, in the following expression ‘pig’ and not ‘sheep’ is matched.
preg_match("/pig|sheep|cow/", "pigs and sheep are groups of animals")
This is because ‘pig’ appears first in the subject string before ‘sheep’.
Also, in the following expression, ‘pig’ and not ‘sheep’ is matched.
preg_match("/sheep|pig|cow/", "pigs and sheep are groups of animals")
This is because, even though ‘sheep’ is the first alternative in the regex, ‘pig’ appears in the subject string before ‘sheep’.
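The "earliest position wins" rule can be observed by asking preg_match() for the offset of the match with the PREG_OFFSET_CAPTURE flag. The following is a sketch with variable names of my own; the last two calls add a case the text above does not cover, where two alternatives can match at the same position and the order in the regex decides.

```php
<?php
$subject = 'pigs and sheep are groups of animals';

// With PREG_OFFSET_CAPTURE, $m[0] is [matched text, offset in the subject].
preg_match('/pig|sheep|cow/', $subject, $m1, PREG_OFFSET_CAPTURE);
preg_match('/sheep|pig|cow/', $subject, $m2, PREG_OFFSET_CAPTURE);

// Both patterns match 'pig' at offset 0, whatever the order of the alternatives.
print_r($m1[0]);
print_r($m2[0]);

// When two alternatives can match at the same position, the first one listed wins.
preg_match('/pig|pigs/', $subject, $m3);  // takes 'pig'
preg_match('/pigs|pig/', $subject, $m4);  // takes 'pigs'
echo $m3[0], "\n", $m4[0], "\n";
```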
Metacharacters
Some characters cannot be used literally in a regex, because they have special meanings there. Here they are:
. \ + * ? [ ^ ] $ ( ) : { } = ! < > |
They are called metacharacters.
A metacharacter can be matched literally by putting a backslash before it. The following examples illustrate this:
preg_match("/3+3/", "3+3=6") # doesn't match because '+' is a metacharacter
preg_match("/3\+3/", "3+3=3") # matches because '\+' is treated as an ordinary '+'
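The following expression produces a match:
preg_match("/www\.website\.com\/contact\.html/", "www.website.com/contact.html")
Always remember that a period (decimal point) meant as a literal character in a pattern (regex) has to be escaped, that is, written “\.”. Likewise, a forward slash inside a pattern delimited by slashes has to be escaped as “\/”.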
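Escaping by hand is easy to get wrong for longer strings. As a sketch that goes slightly beyond this article, PHP's built-in preg_quote() function adds the backslashes for you; its second argument names the delimiter to escape as well. The variable names below are illustrative only.

```php
<?php
$subject = '3+3=6';
$plain   = preg_match('/3+3/', $subject);   // 0: '3+' means "one or more 3s", then another '3'
$escaped = preg_match('/3\+3/', $subject);  // 1: '\+' is a literal plus sign

// An unescaped period matches any character, so this loose pattern matches too much.
$loose = preg_match('/www.website.com/', 'wwwXwebsiteYcom');

// preg_quote() escapes the metacharacters and, with '/' given, the delimiter.
$url     = 'www.website.com/contact.html';
$pattern = '/' . preg_quote($url, '/') . '/';
echo $pattern, "\n";
var_dump(preg_match($pattern, $url));
```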
Combining Matching Features
You can combine matching features. We have seen some of these, such as in /[cbr]at/. This is another example:
preg_match("/\d{2,4}/", $year)
The above looks for a run of 2 to 4 digits in the year; it combines the \d escape with a quantifier. Here $year is the subject string, which must have been declared beforehand.
Variable in Regex
In a pattern, you can have a variable in place of a sub string. Consider the following statement:
$var = "dog";
The following statement matches:
preg_match("/his $var by/", "This is his dog by me.")
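Here, the pattern /his dog by/ is the same as /his $var by/. In the latter pattern, “dog” has been replaced by $var. This works because PHP interpolates variables inside double-quoted strings before the pattern reaches preg_match().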
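The substitution happens in the PHP string, not in the regex engine, which you can confirm by printing the pattern. In this sketch ($interpolated and $literal are names made up for the illustration), the single-quoted version is shown for contrast, since PHP does not interpolate variables in single-quoted strings.

```php
<?php
$var = 'dog';
$subject = 'This is his dog by me.';

// Double quotes: $var is replaced before preg_match() ever sees the pattern.
$interpolated = "/his $var by/";
echo $interpolated, "\n";
$matched = preg_match($interpolated, $subject);

// Single quotes: no interpolation, so the regex engine receives '$' as the
// end-of-subject metacharacter followed by 'var by', which cannot match here.
$literal = '/his $var by/';
$literalMatched = preg_match($literal, $subject);

var_dump($matched, $literalMatched);
```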
Character Classes Revisited
A character class is a set of characters in square brackets; exactly one of those characters is matched at that position in the subject. Consider the pattern (regex),
$re = "/[bcr]at/";
This would match bat, cat or rat. The class is [bcr], and only one of the characters in the square brackets, together with “at”, can match something in the subject string. A class is a set of such characters: [gjd] is another class, [hdqwe] is another class, [opqd] is another class, etc. Only one of the characters in the square brackets, together with the rest of the pattern, would match something in the string.
You have to accept the following:
The dash character, -, inside a character class indicates a range. We have seen this before. However, the dash character outside the character class and in the pattern is taken literally.
The circumflex character, ‘^’, at the beginning of the character class, negates the class; inside the character class but not at the beginning of the class, it is taken literally. Outside the character class, at the beginning of the overall pattern, it matches the start of the subject string.
Outside the character class, the escape sequence \b is treated as a word boundary; inside a character class, it is treated as a backspace character.
Inside the character class, the period has no special meaning. Outside a character class, it matches any character in the subject except the newline character (\n), by default. We shall see later what “by default” means here.
The newline character is never treated in any special way in character classes. A class such as [^e] will always match a \n character.
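The rules above can each be checked with a one-line test. The following sketch collects the results in an array; the keys are just labels chosen for this illustration. Single-quoted patterns are used so that the backslashes reach the regex engine unchanged.

```php
<?php
$checks = [
    // A dash inside a class is a range; outside a class it is a literal dash.
    'range_in_class'       => preg_match('/[a-c]at/', 'cat'),
    'dash_outside'         => preg_match('/a-c/', 'a-c'),
    // A leading ^ negates the class; elsewhere in the class it is a literal ^.
    'negated_class'        => preg_match('/[^bcr]at/', 'cat'),
    'caret_literal'        => preg_match('/[b^]at/', '^at'),
    // \b is a word boundary outside a class and a backspace inside one.
    'word_boundary'        => preg_match('/\bcat\b/', 'a cat sat'),
    'no_boundary'          => preg_match('/\bcat\b/', 'concatenate'),
    'backspace_in_class'   => preg_match('/[\b]/', "\x08"),
    // A period is literal inside a class; outside, it matches anything but \n.
    'dot_in_class'         => preg_match('/[.]at/', 'cat'),
    'dot_outside'          => preg_match('/.at/', 'cat'),
    'dot_vs_newline'       => preg_match('/a.b/', "a\nb"),
    // A negated class does match the newline character.
    'negated_with_newline' => preg_match('/a[^e]b/', "a\nb"),
];
print_r($checks);
```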
That is it for this part of the series. We have talked about matching repetitions, matching alternations, metacharacters, combining matching features and variables in regex, and we have revisited character classes. In the next part, we shall talk about the effects of having parentheses in a pattern.
Let us take a break here. We continue in the next part of the series.
Chrys