Regular Expression Patterns in PHP

PHP Regular Expressions – Part II

Foreword: In this part of the series, we start analyzing patterns in PHP Regular Expressions.

By: Chrysanthus Date Published: 11 Aug 2012

Introduction

In this part of the series, we start analyzing patterns in PHP Regular Expressions.

Character Classes
The Square Brackets
A character class specifies a set of possible characters, any one of which may match at a particular position (a single character) in the subject string. Character classes are denoted by square brackets [...], with the set (class) of characters to be possibly matched inside. Here are some examples:

Let your subject string be

                             “He has a cat.”

You may know that he has an animal, but it does not matter to you which animal he has. You will be satisfied if he has a cat, a bat or a rat. Note that the words “cat”, “bat” and “rat” each end in “at” but begin with a “c”, “b” or “r”. The regex to check this is

                              /[bcr]at/

The following produces a match

                    preg_match("/[bcr]at/", "He has a cat.")

Here, because of the square brackets, we interpret the regex as follows: the pattern matches any sub-string whose first character is a “b”, “c” or “r”, with the rest of the characters being ‘at’.

The square brackets denote a class of elements. However, it is any one element in the class (square brackets) that is to be matched, not all of them together. Here, the class is the group of letters, ‘b’, ‘c’ and ‘r’; only one has to match in conjunction with “at”.
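The behaviour of the class can be checked with a short script. This is only a sketch: the extra subject strings (bat, rat, hat) and the $results array are my own illustrations, not part of the example above.

```php
<?php
// Illustrative subjects; only the first three contain "bat", "cat" or "rat".
$subjects = ["He has a cat.", "He has a bat.", "He has a rat.", "He has a hat."];

$results = [];
foreach ($subjects as $subject) {
    // preg_match() returns 1 when the pattern matches and 0 when it does not.
    $results[$subject] = preg_match("/[bcr]at/", $subject);
    echo $subject, " => ", $results[$subject], "\n";
}
```

The last subject gives 0 because ‘h’ is not one of the characters in the class.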

There is still more we have to know about the character class. We shall see that later.

Range of Characters
The ‘-‘ Character
There may come a time when you would want to match any occurrence of a digit from 0 to 9, or a lowercase character from ‘a’ to ‘z’, or an uppercase character from A to Z. These are ranges of characters, and for each range you would want to know if one character in the range exists in the subject string (I will address the issue of multiple occurrences of a character from a range later).

The ‘-‘ character is used for this. So the range 0 to 9 is denoted by 0-9; ‘a’ to ‘z’ by a-z; and A to Z by A-Z.

The following code produces a match:

                    preg_match("/[0-9]/", "ID5id")

The square brackets indicate that any element they contain should be tested for matching. A range of characters is a class, and so you have to use the square brackets, as in the above expression. In that case, a match occurs between 5 in the range 0 to 9 and 5 in the subject string, “ID5id”.

The above expression is the same as

               preg_match("/[0123456789]/", "ID5id")

Note the use of the square brackets. The following code will produce a match for a similar reason:

                    preg_match("/[a-z]/", "ID5i")

A match occurs between ‘i’ in the range a-z and ‘i’, the only lowercase letter in our present subject.

Of course, you can combine a range with other characters in the regex. The regex /ID[0-9]id/ will match “ID4id”, “ID5id”, “ID6id”; in fact, any sub-string consisting of ‘ID’ followed by a digit and then ‘id’. So

                    preg_match("/ID[0-9]id/", "ID2id is an ID")

produces a match. Remember, preg_match() is the main PHP function you use when you want a match.

Note: the range format gives a short form of writing a class. The range has to be in square brackets to effectively be considered as a class. It is any one element in the square brackets that is matched.
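To see which character a range actually matched, you can pass a third argument to preg_match(). That argument is not covered in this part of the series; here it is used only to inspect the result, and the variable names and the "IDxid" subject are illustrative.

```php
<?php
// The third argument receives the match; element 0 holds the matched text.
preg_match("/[0-9]/", "ID5id", $digit);            // $digit[0] is "5"
preg_match("/[a-z]/", "ID5i", $lower);             // $lower[0] is "i"
preg_match("/ID[0-9]id/", "ID2id is an ID", $id);  // $id[0] is "ID2id"

// With no digit between "ID" and "id", there is no match.
$noMatch = preg_match("/ID[0-9]id/", "IDxid");     // 0

echo $digit[0], " ", $lower[0], " ", $id[0], " ", $noMatch, "\n";
```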

Negation
Character ranges and some special regex characters can be negated.

Any character except a digit is written as

             [^0-9]

This refers to every character that is not in the range 0-9. The following expression produces a match:

                    preg_match("/[^0-9]/", "12P34")

P is not in the range [0-9]; P is outside. Concerning all characters, P is in the range [^0-9]. Note the presence and absence of the ‘^’ character in the classes [0-9] and [^0-9], in this paragraph.

The special character used for negation is “^”.

The range outside [a-z] is [^a-z]. That is [^a-z] is the negation of [a-z].

The range outside [A-Z] is [^A-Z]. That is [^A-Z] is the negation of [A-Z].

We shall see other negations below.
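A quick way to convince yourself about negated ranges is to try them on a subject that has only one character outside the range. The subjects and variable names in this sketch are illustrative.

```php
<?php
// "P" is the only character in the subject that is not a digit.
preg_match("/[^0-9]/", "12P34", $nonDigit);       // $nonDigit[0] is "P"

// Every character here is a digit, so the negated range finds nothing.
$allDigits = preg_match("/[^0-9]/", "1234");      // 0

// "D" is the only character that is not a lowercase letter.
preg_match("/[^a-z]/", "abcDef", $nonLower);      // $nonLower[0] is "D"

echo $nonDigit[0], " ", $allDigits, " ", $nonLower[0], "\n";
```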

Abbreviations for Common Character Classes
\d
\d means, any digit, and it abbreviates [0-9]. The following code produces a match:

               preg_match("/ID\did/", "ID5id is an ID")

Negated \d
\D is negated \d. It represents any character that is not a digit, that is [^0-9].
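The following sketch compares \d with [0-9] and shows \D picking out a non-digit. The variable names are illustrative. Note that "\d" and "\D" are not escape sequences of PHP double-quoted strings, so the backslash reaches the regex engine unchanged.

```php
<?php
// \d stands in for [0-9] between "ID" and "id".
preg_match("/ID\did/", "ID5id is an ID", $withDigit);   // $withDigit[0] is "ID5id"

// \D matches the first character that is not a digit.
preg_match("/\D/", "12P34", $nonDigit);                 // $nonDigit[0] is "P"

// Both forms give the same result for this subject.
$same = preg_match("/\d/", "ID5id") === preg_match("/[0-9]/", "ID5id");

echo $withDigit[0], " ", $nonDigit[0], " ", var_export($same, true), "\n";
```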

\s
The space, \t, \r, \n and \f are white space characters. The space (‘ ‘) is produced when you press the spacebar of your keyboard. \t is produced when you press the tab key. \r is the carriage return character, \n is the new line character and \f is the form feed character.

\s is the abbreviation for any white space character. That is \s is equivalent to [ \t\r\n\f].

The following expression produces a match:

             preg_match("/\n/", "The first line.\r\nThe second line.")

The following expression also produces a match:

              preg_match("/\s/", "The first line.\r\nThe second line.")

\s is a class of white space characters.

Negated \s
\S
\S is negated \s. It represents any character that is not a white space, that is [^\s].

\S, [^\s] and [^ \t\r\n\f] are equivalent.

The negation symbol negates the class (within the square brackets).
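Here is a small sketch of \s and \S at work. The second and third subject strings are made up for illustration; they are built from the white space characters listed above.

```php
<?php
$subject = "The first line.\r\nThe second line.";

// The first white space character in the subject is the space after "The".
preg_match("/\s/", $subject, $space);          // $space[0] is " "

// \S skips the leading white space and matches "x".
preg_match("/\S/", " \t\r\nx", $visible);      // $visible[0] is "x"

// A subject made only of white space has nothing for \S to match.
$onlySpace = preg_match("/\S/", " \t\r\n\f");  // 0

echo var_export($space[0], true), " ", $visible[0], " ", $onlySpace, "\n";
```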

\w
This is a word character. It represents any alphanumeric character or the underscore. \w and [0-9a-zA-Z_] are equivalent.

Negated \w
\W is negated \w. It represents any non-word character. \W and [^\w] are equivalent.
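The sketch below tries \w and \W on subjects of plain ASCII characters. The subjects and variable names are illustrative.

```php
<?php
// The underscore counts as a word character; the hyphens do not.
preg_match("/\w/", "--_--", $word);               // $word[0] is "_"

// The space is the first non-word character in this subject.
preg_match("/\W/", "user_name1 ok", $nonWord);    // $nonWord[0] is " "

// Letters, digits and the underscore only: \W finds nothing.
$allWord = preg_match("/\W/", "user_name1");      // 0

echo $word[0], " ", var_export($nonWord[0], true), " ", $allWord, "\n";
```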

The Period ‘.’
The period ‘.’ matches any character except \n. For example, /.s/ matches 'is' in the subject string, "An apple is on the tree". /.s/ represents two characters, which are any character (except \n) followed by ‘s’.

You can use the \d\s\w\D\S\W abbreviations both inside and outside of character classes.
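The two points above, the period and the use of an abbreviation inside a class, can be tried as follows. The class [\dA-F] (a digit or an uppercase letter from A to F) and the subject "xyzB7" are my own illustrations.

```php
<?php
// ".s" is any character (except \n) followed by "s"; the first hit is "is".
preg_match("/.s/", "An apple is on the tree", $dot);   // $dot[0] is "is"

// The period does not match the newline between "a" and "b".
$acrossNewline = preg_match("/a.b/", "a\nb");          // 0

// \d used inside a class, next to an ordinary range.
preg_match("/[\dA-F]/", "xyzB7", $hex);                // $hex[0] is "B"

echo $dot[0], " ", $acrossNewline, " ", $hex[0], "\n";
```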

Beginning and End of a String
The aim here is to see how you can match a regex to the beginning of the subject string or the end of the subject string (or both the beginning and the end).

The ^ Character for Matching at the Beginning
If you want the matching to occur at the beginning of the subject string, start the regex with the ‘^’ character.

The following expression produces a match:

preg_match("/^one/", "one and two")

The following expression does not produce a match:

preg_match("/^one/", "The one I saw")

In the first case the word ‘one’ is at the beginning of the subject string. In the second case, the word ‘one’ is not at the beginning of the subject string.

At this point, you may ask, “Is ‘^’ not the negation symbol?” Well, it is the negation symbol, but its meaning depends on where it appears. When used as the first character inside a class (square brackets), it is the negation symbol; when used at the beginning of a regex, just after the forward slash, it is the regex character for matching at the beginning of the subject string. In that role it is an anchor metacharacter.
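The two roles of ‘^’ can be seen side by side in this sketch. The pattern /[^one]/ is an illustration of my own: inside the brackets, ‘^’ negates the class made of the letters ‘o’, ‘n’ and ‘e’.

```php
<?php
// ^ as an anchor: "one" must be at the beginning of the subject.
$atStart    = preg_match("/^one/", "one and two");     // 1
$notAtStart = preg_match("/^one/", "The one I saw");   // 0

// ^ as negation: the first character that is not "o", "n" or "e" is the space.
preg_match("/[^one]/", "one and two", $other);         // $other[0] is " "

echo $atStart, " ", $notAtStart, " ", var_export($other[0], true), "\n";
```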

The $ Character for Matching at End
If you want the matching to occur at the end of the subject string, end the regex with the ‘$’ character.

The following expression produces a match:

                    preg_match("/last$/", "This is the last")

The following expression does not produce a match:

                    preg_match("/last$/", "The last boy")

In the first case the word ‘last’ is at the end of the subject string. In the second case, the word ‘last’ is not at the end of the subject string.

Note: $ actually matches the end of the subject string, or just before a newline character at the end of the subject string.

^ and $ are called anchor metacharacters.
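The note about the newline at the end can be checked with the sketch below. The subject strings ending in "\n" are illustrative additions.

```php
<?php
$atEnd    = preg_match("/last$/", "This is the last");   // 1
$notAtEnd = preg_match("/last$/", "The last boy");       // 0

// $ also matches just before a single newline that ends the subject.
$beforeNewline = preg_match("/last$/", "This is the last\n");    // 1

// With two trailing newlines, "last" is no longer just before the final one.
$twoNewlines = preg_match("/last$/", "This is the last\n\n");    // 0

echo $atEnd, " ", $notAtEnd, " ", $beforeNewline, " ", $twoNewlines, "\n";
```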

Matching the Whole String
Now, note that the .* character combination (period followed by asterisk) in the pattern matches any sub-string that contains no \n, including a sub-string of zero length.

You can match the whole subject string, using the ‘^’ with the ‘$’ characters. The following code produces a match:

                  preg_match("/^be.*end$/", "beginning and end")

The following code also produces a match:

                 preg_match("/^be.*end$/", "beginning with end")

The subject string of the first case is, “beginning and end”. The subject string of the second case is “beginning with end”. The difference occurs in the word in the middle (and/with).

The regex pattern of both cases is the same. The pattern begins with ‘^’ and ends with ‘$’. The regexp indicates that the subject string to be matched has to begin with “be”, followed by any character, any number of times; and the subject string has to end with “end”.
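The sketch below runs the same pattern against a few more subjects, to show what the two anchors rule out. The last three subject strings are my own illustrations.

```php
<?php
$pattern = "/^be.*end$/";

$first  = preg_match($pattern, "beginning and end");       // 1
$second = preg_match($pattern, "beginning with end");      // 1

// Does not begin with "be", so the ^ anchor fails.
$badStart = preg_match($pattern, "the beginning and end"); // 0

// Does not end with "end", so the $ anchor fails.
$badEnd = preg_match($pattern, "beginning and ending");    // 0

// .* may match a sub-string of zero length.
$shortest = preg_match($pattern, "beend");                 // 1

echo "$first $second $badStart $badEnd $shortest\n";
```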

Note: All along, when we say match, we are actually searching the subject string for a sub-string represented by the pattern of the regex; that is what preg_match() does. When you are matching the whole subject string, the regex, anchored with ‘^’ and ‘$’, represents the whole string.

So, you can now match a whole string. By the time you complete this series, you will be able to match a whole subject string having particular words within the string. I will not go into the details. It will be an exercise for you. You will simply need to combine many of the features I explain in the series.

We have done a lot so far, but there are still many things to be learned. So, we shall continue to take it step by step.

This is a good place to take a break. We continue in the next part of the series.

Chrys

Broad Network
