PHP Regular Expression Patterns

PHP Regular Expressions with Security Considerations - Part 2

Foreword: In this part of the series, I analyze patterns in PHP regular expressions.

By: Chrysanthus Date Published: 18 Jan 2019

Introduction

This is part 2 of my series, PHP Regular Expressions with Security Considerations. In this part of the series, I analyze patterns in PHP regular expressions. You should have read the previous part of the series before coming here, as this is a continuation.

Character Classes
The Square Brackets
A character class defines a set of characters, any one of which may match at a particular position (a single character) in the subject string. Character classes are denoted by square brackets [...], with the set (class) of candidate characters inside. Here is an example:

Let your subject string be

                             “He has a cat.”

You may know that he has an animal, but it does not matter to you which animal he has. You will be satisfied if he has a cat, a bat or a rat. Note that the words “cat”, “bat” and “rat” each end in “at” but begin with “c”, “b” or “r”. The regex to check this is

                              /[bcr]at/

The following produces a match

<?php

        if (preg_match("/[bcr]at/", "He has a cat.") === 1)
            {
                echo 'Matched';
            }
        else
            {
                echo 'Not Matched or Error occurred!';
            }

?>

Here, because of the square brackets, we interpret the regex as follows: the pattern matches any three-character sequence in the subject whose first character is “b”, “c” or “r”, and whose remaining characters are “at”.

The square brackets denote a class of elements. However, it is any one element in the class (square brackets) that is to be matched, not all of them together. Here, the class is the group of letters, ‘b’, ‘c’ and ‘r’; only one has to match in conjunction with “at”.
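To see that only one element of the class has to match, here is a minimal sketch that tries the same regex against several subjects. The helper name hasPet() and the extra sample sentences are my own illustrations, not part of any library.

```php
<?php
// Hypothetical helper: true when the subject contains "bat", "cat" or "rat".
function hasPet(string $subject): bool
{
    // preg_match() returns 1 on a match, 0 on no match and false on error.
    return preg_match('/[bcr]at/', $subject) === 1;
}

// Sample subjects made up for this illustration.
$subjects = ["He has a cat.", "He has a bat.", "He has a rat.", "He has a hat."];

foreach ($subjects as $subject) {
    echo $subject, ' => ', hasPet($subject) ? 'Matched' : 'Not Matched', PHP_EOL;
}
```

The first three subjects match; the last one does not, because ‘h’ is not in the class.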

There is still more you have to know about the character class. I will talk about that later.

Range of Characters
The '-' Character
There may come a time when you want to match any occurrence of a digit from 0 to 9, or a lowercase letter from ‘a’ to ‘z’, or an uppercase letter from A to Z. These are ranges of characters, and for each range you would want to know if one character in the range exists in the subject string.

The ‘-’ character is used for this, inside square brackets. So the range 0 to 9 is denoted by 0-9; ‘a’ to ‘z’ by a-z; and A to Z by A-Z.

The following code produces a match:

            if (preg_match("/[0-9]/", "ID5id") === 1)

The square brackets indicate that any one of the elements they contain may be the one that matches. A range of characters is a class, and so you have to use the square brackets, as in the above expression. In that case, a match occurs between 5 in the range 0 to 9 and 5 in the subject string, “ID5id”.

The above conditional is the same as

            if (preg_match("/[0123456789]/", "ID5id") === 1)

Note the use of the square brackets. The following code will produce a match for a similar reason:

           if (preg_match("/[a-z]/", "ID5i") === 1)

A match occurs between ‘i’ in the range a-z and ‘i’, the only lowercase letter in the present subject.

Of course, you can combine a range with other characters in the regex. The regex /ID[0-9]id/ will match “ID4id”, “ID5id”, “ID6id”; in fact any word beginning with ‘ID’ followed by a digit and then ‘id’. So

            if (preg_match("/ID[0-9]id/", "ID2id is an ID") === 1)

produces a match. preg_match() is the main PHP function you use when you just want to know whether there is a match. I will talk about other PHP functions that are used with regular expressions later.

Note: the range format gives a short form of writing a class. The range has to be in square brackets to effectively be considered as a class. It is any one element in the square brackets that is matched.
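If you want to see which text was matched, and not just whether there was a match, preg_match() accepts an optional third argument that receives the matched text. The sketch below assumes a helper name, findId(), that I made up for illustration.

```php
<?php
// Hypothetical helper: returns the first substring of the form ID<digit>id,
// or null when the subject has none.
function findId(string $subject): ?string
{
    // $matches[0] receives the text matched by the whole pattern.
    if (preg_match('/ID[0-9]id/', $subject, $matches) === 1) {
        return $matches[0];
    }
    return null;
}

var_dump(findId("ID2id is an ID")); // the matched text, "ID2id"
var_dump(findId("IDxid is an ID")); // null, since 'x' is not in the range 0-9
```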

Negation
Character ranges and some special regex characters can be negated.

If you are looking for a match with any character except a digit, you would write,

             [^0-9]

This matches any single character that is not in the range 0-9. The following conditional produces a match:

            if (preg_match("/[^0-9]/", "12P34") === 1)

P is not in the range [0-9]; it lies outside it. In other words, P belongs to the class [^0-9]. Note the absence and presence of the ‘^’ character in the classes [0-9] and [^0-9] in this paragraph.

The special character used for negation is “^”.

The range outside [a-z] is [^a-z]. That is [^a-z] is the negation of [a-z].

The range outside [A-Z] is [^A-Z]. That is [^A-Z] is the negation of [A-Z].

I show you other negations below.
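As a small sketch of negation in practice, the following checks whether a subject contains anything other than digits. The function name hasNonDigit() is hypothetical; only the class [^0-9] comes from the discussion above.

```php
<?php
// Hypothetical helper: true when at least one character is not a digit.
function hasNonDigit(string $subject): bool
{
    return preg_match('/[^0-9]/', $subject) === 1;
}

var_dump(hasNonDigit("12P34")); // bool(true): P is outside the range 0-9
var_dump(hasNonDigit("12345")); // bool(false): every character is a digit
```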

Abbreviations for Common Character Classes
\d
\d means any digit, and it abbreviates [0-9]. The following code produces a match:

            if (preg_match("/ID\did/", "ID5id is an ID") === 1)

Negated \d
\D is negated \d. It represents any character that is not a digit, that is [^0-9].

\s
The space character and \t, \r, \n and \f are white space characters. The space (‘ ’) is produced when you press the spacebar of your keyboard. \t is produced when you press the tab key. \r is the carriage return character, \n is the newline character and \f is the form feed character.

\s is the abbreviation for any white space character. That is \s is equivalent to [ \t\r\n\f].

The following conditional produces a match:

            if (preg_match("/\n/", "The first line.\r\nThe second line.") === 1)

The following conditional also produces a match:

            if (preg_match("/\s/", "The first line.\r\nThe second line.") === 1)

\s is a class of white space characters, i.e. any white space character.

Negated \s
\S
\S is negated \s. It represents any character that is not a white space character, that is [^\s]. Note the backslash: [^s] without it would mean any character except the letter ‘s’.

\S, [^\s] and [^ \t\r\n\f] are equivalent.

The negation symbol negates the class (within the square brackets).

\w
This is a word character. It represents any alphanumeric character including the underscore. \w and [0-9a-zA-Z_] are equivalent.

Negated \w
\W is negated \w. It represents any non-word character. \W and [^\w] are equivalent. Again, the backslash is required: [^w] would mean any character except the letter ‘w’.
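The following sketch compares each abbreviation with the explicit class given for it above, using a handful of sample characters that I chose for illustration. It assumes plain ASCII subjects and the default locale; the helper name shorthandAgrees() is hypothetical.

```php
<?php
// Each pair holds an abbreviation and the explicit class given for it.
$pairs = [
    ['/\d/', '/[0-9]/'],
    ['/\D/', '/[^0-9]/'],
    ['/\s/', '/[ \t\r\n\f]/'],
    ['/\S/', '/[^ \t\r\n\f]/'],
    ['/\w/', '/[0-9a-zA-Z_]/'],
    ['/\W/', '/[^0-9a-zA-Z_]/'],
];

// One-character sample subjects (an assumption made for this test only).
$samples = ["5", "P", "z", " ", "\t", "\n", "_", "-"];

// Hypothetical helper: true when both patterns of every pair give the
// same result for every sample.
function shorthandAgrees(array $pairs, array $samples): bool
{
    foreach ($pairs as [$shorthand, $explicit]) {
        foreach ($samples as $sample) {
            if (preg_match($shorthand, $sample) !== preg_match($explicit, $sample)) {
                return false;
            }
        }
    }
    return true;
}

var_dump(shorthandAgrees($pairs, $samples)); // bool(true)
```

The patterns are written in single quotes so that PHP passes the backslashes to the regex engine unchanged.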

The Period ‘.’
The period ‘.’ matches any character except \n. For example, /.p/ matches 'ap' in the subject string, "An apple is on the tree". /.p/ represents two characters, which are any character (except \n) followed by ‘p’.

You can use the \d \s \w \D \S \W abbreviations both inside and outside of character classes.
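Here is a minimal sketch of the period and of an abbreviation used inside a class. The helper firstMatch() and the second and third subjects are made up for this illustration.

```php
<?php
// Hypothetical helper: returns the first matched text, or null if none.
function firstMatch(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $matches) === 1 ? $matches[0] : null;
}

// The period matches any character except \n.
var_dump(firstMatch('/.p/', "An apple is on the tree")); // "ap"
var_dump(firstMatch('/.p/', "\np"));                     // null: \n is excluded

// \s and \d used inside a class: a white space character or a digit, then 'p'.
var_dump(firstMatch('/[\s\d]p/', "a 7p"));               // "7p"
```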

Beginning and End of a String
The aim here is to see how you can match a regex from the beginning of the subject string or to the end of the subject string (or both the beginning and the end).

The ^ Character for Matching at the Beginning
If you want the matching to occur at the beginning of the subject string, start the regex with the ‘^’ character.

The following conditional produces a match:

            if (preg_match("/^one/", "one and two") === 1)

The following conditional does not produce a match:

            if (preg_match("/^one/", "The one I saw") === 1)

In the first case the word ‘one’ is at the beginning of the subject string. In the second case, the word ‘one’ is not at the beginning of the subject string.

At this point, you may ask, “Is ‘^’ not the negation symbol?” Well, it is. The issue is knowing which meaning applies. When it is the first character inside a class (square brackets), it is the negation symbol; when used at the beginning of a regex, just after the opening forward slash, it is the regex character for matching at the beginning of the subject. In that role it is known as an anchor metacharacter.
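Both meanings of ‘^’ can appear in the same regex. The sketch below, with a hypothetical helper name, uses the first ‘^’ as the anchor and the second ‘^’ (inside the square brackets) as the negation symbol.

```php
<?php
// Hypothetical helper: true when the first character of the subject
// is not a digit.
function startsWithNonDigit(string $subject): bool
{
    // First ^ : match at the beginning of the subject.
    // Second ^: negate the class 0-9.
    return preg_match('/^[^0-9]/', $subject) === 1;
}

var_dump(startsWithNonDigit("P12")); // bool(true)
var_dump(startsWithNonDigit("1P2")); // bool(false): the subject begins with a digit
```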

The $ Character for Matching at End
If you want the matching to occur at the end of the subject, end the pattern with the ‘$’ character.

The following expression produces a match:

            if (preg_match("/last$/", "This is the last") === 1)

The following expression does not produce a match:

            if (preg_match("/last$/", "The last boy") === 1)

In the first case the word ‘last’ is at the end of the subject. In the second case, the word ‘last’ is not at the end of the subject.

Note: $ actually matches the end of the subject string, or just before a newline character at the end of the subject string.

^ and $ are called anchor metacharacters.
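The note about the newline matters when you validate input: a subject with one trailing newline still satisfies a pattern that ends with $. The sketch below demonstrates this. The \z assertion and the D modifier are not covered above; I bring them in here as two ways PCRE offers to match only at the very end of the subject. The function names are hypothetical.

```php
<?php
// $ matches at the end of the subject, or just before a final newline.
function endsWithLast(string $subject): bool
{
    return preg_match('/last$/', $subject) === 1;
}

// \z (an addition, not from the text above) matches only at the very end.
function endsWithLastStrict(string $subject): bool
{
    return preg_match('/last\z/', $subject) === 1;
}

var_dump(endsWithLast("This is the last"));         // bool(true)
var_dump(endsWithLast("This is the last\n"));       // bool(true): newline tolerated
var_dump(endsWithLastStrict("This is the last\n")); // bool(false)

// The D modifier makes $ behave like \z.
var_dump(preg_match('/last$/D', "This is the last\n") === 1); // bool(false)
```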

Matching the Whole String
Now, note that the .* character combination (period followed by asterisk) in the pattern matches any substring that contains no newline character, including a substring of zero length.

You can match the whole subject string, using the ‘^’ with the ‘$’ characters. The following code produces a match:

            if (preg_match("/^be.*end$/", "beginning and end") === 1)

The following code also produces a match:

            if (preg_match("/^be.*end$/", "beginning with end") === 1)

The subject of the first case is, “beginning and end”. The subject of the second case is “beginning with end”. The difference occurs in the word in the middle (and/with). Matching occurs for both of them.

The regex pattern of both cases is the same. The pattern begins with ‘^’ and ends with ‘$’. The regex indicates that the subject to be matched has to begin with “be”, followed by any character (except \n) any number of times, and the subject has to end with “end”.

Note: Matching actually searches the subject for a sub-string, represented by the pattern of the regex. However, when you are matching the whole subject string, the regex represents the whole string.
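The difference between searching for a substring and matching the whole subject can be sketched as follows. The helper names and the extra subjects are my own, chosen only to illustrate the point.

```php
<?php
// Hypothetical helper: the whole subject must begin with "be" and end with "end".
function matchesWhole(string $subject): bool
{
    return preg_match('/^be.*end$/', $subject) === 1;
}

// Hypothetical helper: the same pattern without anchors, so any
// substring of the subject may satisfy it.
function matchesAnywhere(string $subject): bool
{
    return preg_match('/be.*end/', $subject) === 1;
}

var_dump(matchesWhole("beginning and end"));    // bool(true)
var_dump(matchesWhole("to begin and end"));     // bool(false): does not begin with "be"
var_dump(matchesWhole("beginning and ending")); // bool(false): does not end with "end"
var_dump(matchesAnywhere("to begin and end"));  // bool(true): "begin and end" is a substring
```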

So, you can now match a whole string. By the time you complete this series, you will be able to match a whole subject having particular words within the string. I will not show you how to do that. It will be an exercise for you. You will simply need to combine many of the features I explain in the series.

Wow, we have done a lot, but there are still many things to be learned. We shall continue to take it step by step.

This is a good place to take a break. We continue in the next part of the series.

Chrys
