PHP Regular Expression Patterns
PHP Regular Expressions with Security Considerations - Part 2
Foreword: In this part of the series, I analyze patterns in PHP regular expressions.
By: Chrysanthus Date Published: 18 Jan 2019
Introduction
Character Classes
The Square Brackets
A character class specifies a set of characters, any one of which may match at a particular position (a single character) in the subject string. Character classes are denoted by square brackets [...], with the set (class) of candidate characters inside. Here is an example:
Let your subject string be
He has a cat.
You may know that he has an animal, but it does not matter to you which animal he has. You will be satisfied if he has a cat, a bat or a rat. Note that each of the words cat, bat and rat ends with at and begins with c, b or r. The regex to check this is
/[bcr]at/
The following produces a match
<?php
if (preg_match("/[bcr]at/", "He has a cat.") === 1)
{
echo 'Matched';
}
else
{
echo 'Not Matched or Error occurred!';
}
?>
Here, because of the square brackets, we interpret the regex as follows: the pattern matches any sequence of three characters in the subject whose first character is b, c or r, and whose remaining characters are at.
The square brackets denote a class of elements. However, it is any one element in the class (square brackets) that is to be matched, not all of them together. Here, the class is the group of letters, b, c and r; only one has to match in conjunction with at.
There is still more you have to know about the character class. I will talk about that later.
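To see which member of the class actually matched, preg_match() can be given a third argument that receives the matched text. The following is a minimal sketch; the helper name firstMatch() and the extra sample sentences are made up for illustration and are not part of the original example.

```php
<?php
// Hypothetical helper (not from the article): returns the first substring
// matched by $pattern in $subject, or null when there is no match or an error.
function firstMatch(string $pattern, string $subject): ?string
{
    if (preg_match($pattern, $subject, $matches) === 1) {
        return $matches[0];   // element 0 holds the text matched by the whole pattern
    }
    return null;
}

echo firstMatch('/[bcr]at/', 'He has a cat.'), "\n";   // cat
echo firstMatch('/[bcr]at/', 'He has a rat.'), "\n";   // rat
var_dump(firstMatch('/[bcr]at/', 'He has a hat.'));    // NULL, h is not in the class
```

Only one character from the class takes part in each match: c in the first call and r in the second.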
Range of Characters
The '-' Character
There may come a time when you want to match any occurrence of a digit from 0 to 9, or a lowercase character from a to z, or an uppercase character from A to Z. These are ranges of characters, and for each range you want to know whether one character in the range exists in the subject string.
The - character is used for this. So the range 0 to 9 is denoted by 0-9; a to z by a-z; and A to Z by A-Z.
The following code produces a match:
if (preg_match("/[0-9]/", "ID5id") === 1)
The square brackets indicate that any one of the elements they contain may match. A range of characters is a class, so you have to use the square brackets, as in the above expression. Here, a match occurs between the 5 in the range 0 to 9 and the 5 in the subject string, ID5id.
The above conditional is the same as
if (preg_match("/[0123456789]/", "ID5id") === 1)
Note the use of the square brackets. The following code will produce a match for a similar reason:
if (preg_match("/[a-z]/", "ID5i") === 1)
A match occurs between i in the range a-z and i, the only lowercase letter in the present subject.
Of course, you can combine a range with other characters in the regex. The regex /ID[0-9]id/ will match ID4id, ID5id, ID6id; in fact any word beginning with ID followed by a digit and then id. So
if (preg_match("/ID[0-9]id/", "ID2id is an ID") === 1)
produces a match. preg_match() is the main PHP function you use when you just want to know whether there is a match. I will talk about other PHP functions that are used with regular expressions later.
Note: the range format gives a short form of writing a class. The range has to be in square brackets to effectively be considered as a class. It is any one element in the square brackets that is matched.
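As a small sketch of the range in action, the following loop tests the regex /ID[0-9]id/ against several subjects. The subject strings other than the first are invented for this illustration.

```php
<?php
// Sample subjects invented for this illustration.
$subjects = ['ID2id is an ID', 'ID9id', 'IDxid', 'ID77id'];

$results = [];
foreach ($subjects as $subject) {
    // [0-9] stands for exactly one digit between ID and id.
    $results[$subject] = (preg_match('/ID[0-9]id/', $subject) === 1);
}

foreach ($results as $subject => $matched) {
    echo $subject, ': ', ($matched ? 'Matched' : 'Not Matched'), "\n";
}
```

The last subject, ID77id, is not matched because the class stands for one character only, so two digits between ID and id do not fit the pattern.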
Character ranges and some special regex characters can be negated.
If you are looking for a match with any character except a digit, you would write,
[^0-9]
This class stands for any character that is not in the range 0-9. The following conditional produces a match:
if (preg_match("/[^0-9]/", "12P34") === 1)
P is not in the class [0-9]; it is outside it. In other words, P belongs to the class [^0-9]. Note the absence and presence of the ^ character in the classes [0-9] and [^0-9] in this paragraph.
The special character used for negation is ^.
The range outside [a-z] is [^a-z]. That is [^a-z] is the negation of [a-z].
The range outside [A-Z] is [^A-Z]. That is [^A-Z] is the negation of [A-Z].
I show you other negations below.
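The sketch below illustrates negation with the class [^0-9], and also shows that ^ only negates when it is the first character inside the square brackets. The variable names and the subjects '12345' and 'a^b' are assumptions made for the illustration.

```php
<?php
// [^0-9] matches the first character that is not a digit.
preg_match('/[^0-9]/', '12P34', $found);
echo $found[0], "\n";                       // P

// With digits only, there is nothing for [^0-9] to match.
$digitsOnly = preg_match('/[^0-9]/', '12345');
var_dump($digitsOnly);                      // int(0)

// ^ negates only when it is the first character inside the brackets;
// elsewhere in the class it is an ordinary character.
preg_match('/[0-9^]/', 'a^b', $caret);
echo $caret[0], "\n";                       // ^
```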
Abbreviations for Common Character Classes
\d
\d means, any digit, and it abbreviates [0-9]. The following code produces a match:
if (preg_match("/ID\did/", "ID5id is an ID") === 1)
Negated \d
\D is negated \d. It represents any character that is not a digit, that is [^0-9].
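One way to convince yourself that the abbreviations and the bracketed classes behave alike is to run both forms over the same subjects and compare the results. This is only a sketch over a handful of made-up ASCII subjects, not a proof. The patterns are written in single quotes so that PHP leaves \d and \D untouched.

```php
<?php
// Sample subjects; all but the first two are invented for this illustration.
$subjects = ['ID5id is an ID', '12P34', 'no digits here', '2019'];

$agree = true;
foreach ($subjects as $subject) {
    // \d against [0-9], and \D against [^0-9], on the same subject.
    $sameDigit    = preg_match('/\d/', $subject) === preg_match('/[0-9]/', $subject);
    $sameNonDigit = preg_match('/\D/', $subject) === preg_match('/[^0-9]/', $subject);
    $agree = $agree && $sameDigit && $sameNonDigit;
}

var_dump($agree);   // bool(true)
```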
\s
The space, \t, \r, \n and \f are white space characters. The space character is produced when you press the spacebar of your keyboard. \t is produced when you press the Tab key on your keyboard. \r is the carriage return character, \n is the newline character and \f is the form feed character.
\s is the abbreviation for any white space character. That is, \s is equivalent to [ \t\r\n\f].
The following conditional produces a match:
if (preg_match("/\n/", "The first line.\r\nThe second line.") === 1)
The following conditional also produces a match:
if (preg_match("/\s/", "The first line.\r\nThe second line.") === 1)
\s is a class of white space characters, i.e. any white space character.
Negated \s
\S
\S is negated \s. It represents any character that is not a white space character, that is [^\s].
\S, [^\s] and [^ \t\r\n\f] are equivalent.
The negation symbol negates the class (within the square brackets).
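The following sketch exercises \s and \S on short subjects. The variable names and the test strings other than the two-line subject are assumptions made for the illustration. Note that [^s], written without the backslash, is a different class: it stands for any character except the letter s.

```php
<?php
$subject = "The first line.\r\nThe second line.";

// The first white space character in the subject is the space after "The".
preg_match('/\s/', $subject, $space);
var_dump($space[0] === ' ');                       // bool(true)

// \r\n is two consecutive white space characters.
var_dump(preg_match('/\s\s/', $subject) === 1);    // bool(true)

// \S and [^\s] both skip the leading white space and match x.
preg_match('/\S/', " \t x", $viaAbbrev);
preg_match('/[^\s]/', " \t x", $viaClass);
var_dump($viaAbbrev[0] === $viaClass[0]);          // bool(true)

// [^s] means "not the letter s", so here it matches the space.
preg_match('/[^s]/', 'ss s', $notLetterS);
var_dump($notLetterS[0] === ' ');                  // bool(true)
```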
\w
This is a word character. It represents any alphanumeric character including the underscore. \w and [0-9a-zA-Z_] are equivalent.
Negated \w
\W is negated \w. It represents any non-word character. \W and [^\w] are equivalent.
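A common use of \W is to check that a string contains nothing but word characters. The helper below is a hypothetical sketch (the name hasOnlyWordChars() is not from the series) and is not meant as a complete input validator.

```php
<?php
// Hypothetical helper: true when $text has no character outside [0-9a-zA-Z_].
// Note that an empty string also passes this check.
function hasOnlyWordChars(string $text): bool
{
    // \W matches the first non-word character, if there is one.
    // preg_match() returns 0 only when no such character was found.
    return preg_match('/\W/', $text) === 0;
}

var_dump(hasOnlyWordChars('user_name42'));   // bool(true), underscore is a word character
var_dump(hasOnlyWordChars('user name'));     // bool(false), space is not
var_dump(hasOnlyWordChars('user-name'));     // bool(false), hyphen is not
```

Comparing with === 0 means that an error inside preg_match(), which returns false, is treated as a failed check rather than a pass.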
The Period .
The period . matches any character except \n. For example, /.p/ matches 'ap' in the subject string, "An apple is on the tree". /.p/ represents two characters, which are any character (except \n) followed by p.
You can use the \d \s \w \D \S \W abbreviations both inside and outside of character classes.
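Here is a short sketch of the period and of an abbreviation used inside a character class. The class [\dA-F] and the subjects IDCid, ID7id and IDGid are made up for the illustration.

```php
<?php
// . followed by p: the first such pair in the subject is "ap".
preg_match('/.p/', 'An apple is on the tree', $dot);
echo $dot[0], "\n";                                   // ap

// The only p here is preceded by a newline, which . does not match.
var_dump(preg_match('/.p/', "\np"));                  // int(0)

// \d used inside a class together with a range: a digit or a letter A to F.
var_dump(preg_match('/ID[\dA-F]id/', 'IDCid'));       // int(1)
var_dump(preg_match('/ID[\dA-F]id/', 'ID7id'));       // int(1)
var_dump(preg_match('/ID[\dA-F]id/', 'IDGid'));       // int(0)
```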
The aim here is to see how you can match a regex from the beginning of the subject string or to the end of the subject string (or both the beginning and the end).
The ^ Character for Matching at the Beginning
If you want the matching to occur at the beginning of the subject string, start the regex with the ^ character.
The following conditional produces a match:
if (preg_match("/^one/", "one and two") === 1)
The following conditional does not produce a match:
if (preg_match("/^one/", "The one I saw") === 1)
In the first case the word one is at the beginning of the subject string. In the second case, the word one is not at the beginning of the subject string.
At this point, you may ask: is ^ not the negation symbol? Well, it is the negation symbol. The point is to know which role it plays in which position. When used as the first character inside a class (square brackets), it is the negation symbol; when used at the beginning of a regex, just after the opening forward slash, it is the regex character for matching at the beginning. In that role it is known as an anchor metacharacter.
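The two roles of ^ can appear in the same pattern. In the sketch below, the pattern /^[^0-9]/ and the subjects P123 and 1P23 are assumptions made for the illustration: the first ^ anchors the match at the beginning of the subject, and the second ^ negates the class.

```php
<?php
// ^ as an anchor: "one" must be at the beginning of the subject.
var_dump(preg_match('/^one/', 'one and two'));     // int(1)
var_dump(preg_match('/^one/', 'The one I saw'));   // int(0)

// First ^ is the anchor, second ^ is the negation:
// the subject must begin with a character that is not a digit.
var_dump(preg_match('/^[^0-9]/', 'P123'));         // int(1)
var_dump(preg_match('/^[^0-9]/', '1P23'));         // int(0)
```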
The $ Character for Matching at End
If you want the matching to occur at the end of the subject, end the pattern with the $ character.
The following expression produces a match:
if (preg_match("/last$/", "This is the last") === 1)
The following expression does not produce a match:
if (preg_match("/last$/", "The last boy") === 1)
In the first case the word last is at the end of the subject. In the second case, the word last is not at the end of the subject.
Note: $ actually matches the end of the subject string, or just before a newline character at the end of the subject string.
^ and $ are called anchor meta characters.
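The note about the newline matters when the subject comes from user input, since a trailing newline can slip past a check that ends in $. The sketch below shows this behaviour and, as one possible remedy not covered in the text above, the D pattern modifier, which makes $ match only at the very end of the subject.

```php
<?php
// $ matches at the very end of the subject ...
var_dump(preg_match('/last$/', 'This is the last'));       // int(1)

// ... and also just before a newline that ends the subject.
var_dump(preg_match('/last$/', "This is the last\n"));     // int(1)

// With the D modifier, $ matches only at the very end.
var_dump(preg_match('/last$/D', "This is the last\n"));    // int(0)

// Only one final newline is tolerated, not two.
var_dump(preg_match('/last$/', "This is the last\n\n"));   // int(0)
```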
Matching the Whole String
Now, note that the .* character combination (period followed by asterisk) in the pattern matches any substring that contains no \n, including a substring of zero length.
You can match the whole subject string, using the ^ with the $ characters. The following code produces a match:
if (preg_match("/^be.*end$/", "beginning and end") === 1)
The following code also produces a match:
if (preg_match("/^be.*end$/", "beginning with end") === 1)
The subject of the first case is, beginning and end. The subject of the second case is beginning with end. The difference occurs in the word in the middle (and/with). Matching occurs for both of them.
The regex pattern of both cases is the same. The pattern begins with ^ and ends with $. The regex indicates that the subject to be matched has to begin with be, followed by any character (except \n) any number of times, and that the subject has to end with end.
Note: Matching actually searches the subject for a sub-string, represented by the pattern of the regex. However, when you are matching the whole subject string, the regex represents the whole string.
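The difference between searching for a substring and matching the whole string can be sketched as follows. The subjects other than the first are invented for this illustration.

```php
<?php
$pattern = '/^be.*end$/';

// Sample subjects; all but the first are invented for this illustration.
$subjects = ['beginning and end', 'beend', 'to be the end', 'beginning and ending'];

$whole = [];
foreach ($subjects as $subject) {
    // Anchored at both ends, the pattern has to describe the whole subject.
    $whole[$subject] = (preg_match($pattern, $subject) === 1);
    echo $subject, ': ', ($whole[$subject] ? 'Matched' : 'Not Matched'), "\n";
}

// Without the anchors, the same pattern is only a substring search.
var_dump(preg_match('/be.*end/', 'to be the end'));   // int(1)
```

The subject beend matches because .* may stand for a substring of zero length. The subject to be the end fails the anchored pattern because it does not begin with be, although the unanchored pattern finds be the end inside it.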
So, you can now match a whole string. By the time you complete this series, you will be able to match a whole subject having particular words within the string. I will not show you how to do that. It will be an exercise for you. You will simply need to combine many of the features I explain in the series.
We have covered a lot, but there are still many things to be learned. We shall continue to take it step by step.
This is a good place to take a break. We continue in the next part of the series.
Chrys