MySQL Regular Expressions Pattern Constructs
MySQL Regular Expressions - Part 3
Division 8
Forward: In this part of the series we look at MySQL Regular Expressions Pattern Constructs.
By: Chrysanthus Date Published: 8 Aug 2012
Introduction
Database table for Illustration
We shall use the following database table called Pets, for illustration:
| name | owner | species | sex | birth | death |
+--------+--------+---------+------+------------+------------+
| Buffy | Harold | dog | f | 1999-05-13 | NULL |
| Bowser | Diane | dog | m | 1999-08-31 | 2005-07-29 |
| Fluffy | Harold Taylor | cat | f | 2003-02-04 | NULL |
| Timtim | Gwen Jones | cat | m | 2004-03-17 | NULL |
| Whistler | Gwin | bir5d | NULL | 2007-12-09 | NULL |
The data in this table is slightly different from the one in the table of the previous part of the series. You should be testing the SELECT statements given below with this table.
The (abc) Construct
Here, (abc) simply means one, two, three, four, five, etc. characters in parentheses. In a pattern, parentheses group characters into a set. The set of characters in the pattern behaves as a single character and the special characters we saw in the previous part of the series, can be applied on them as if they were a single character. For example, you can have something like, (ab)+, (abc)*, (abcd)?, etc. The following SELECT statement returns the fourth row of the above table, matching the fourth cell value, “Timtim” of the first column:
SELECT * from Pets where name rLike "(tim)+";
The pattern is, "(tim)+" . The pattern matches one or more (tim) in one subject. The subject matched is, “Timtim”; there are two (tim) in the subject. If the pattern were, “(tim)*” , all the cell values of the first column would have matched, because, * means zero or more (while + means one or more).
A word is a sequence of word characters that is not preceded by or followed by word characters. A word character is an alphanumeric character or an underscore, _. Examples of word characters are: ‘a’, ‘b’, ‘c’, ‘1’, ‘2’, ‘3’, etc. the marker, [[:<:]] in a pattern matches the beginning of a word. The marker, [[:>:]] in a pattern matches the end of a word. These constructs are word-boundary markers. They are not characters; they are boundary identifiers. The following SELECT statement returns the fourth row of the above table, matching “Gwen” in the subject, “Gwen Jones” of the fourth cell in the second column:
SELECT * from Pets where owner rLike "Gwen[[:>:]]";
The pattern is, "Gwen[[:>:]]" . It begins with the word, “Gwen” followed by the end of word marker, [[:>:]] . So, it matches “Gwen” and its end in the subject, “Gwen Jones”. In the subject, after the word, “Gwen”, you have a word-boundary before the space character. The [[:<:]] is used in a similar way, but in front of the word.
Note that each word-boundary marker has two pairs of opposite square brackets, two colons and a < or > character.
The [a-dX] and [^a-dX] Constructs
The dot special character matches any single character. What about the case when you want to match a single character from a set of characters? Consider the words, “Gwen”, “Gwin”, “Gwan” and “Gw4n”. The words differ in the character at the third position. The first word has ‘e’ in the third position; the second word has ‘i’ in the third position; the third word has ‘a’ in the third position; and the fourth word has ‘4’ in the third position. So you may be interested to match the characters, ‘e’, ‘i’, ‘a’ and ‘4’. This means you want to match one character from the set, [eia4]. The pattern for this case is, "[eia4]", which would match a single character at a position in a subject that is, ‘e’ or ‘i’ or ‘a’ or ‘4’.
The following SELECT statement will return the fourth and fifth rows of the table above, matching “Gwen Jones” in the fourth cell and “Gwin” in the fifth cell, all in the second column.
SELECT * from Pets where owner rLike "Gw[eia4]n";
The pattern in the statement is, "Gw[eia4]n" . The subjects are “Gwen Jones” and “Gwin”. The word involved in the subjects, is “Gwen” for “Gwen Jones” and “Gwin” for “Gwin”. “Gwen” and “Gwin” differ in the third position of the characters. So, in the subject, “Gwen Jones”, ‘e’ in “Gwen” is matched and in the subject, “Gwin”, ‘i’ is matched. ‘e’ and ‘i’ are in the third position of the words, “Gwen” and “Gwin” respectively. In the pattern, "Gw[eia4]n" , [eia4] is in the third position.
Note: [eia4] and (eia4) mean different things. [eia4] means ‘e’ or ‘i’ or ‘a’ or ‘4’; while (eia4) means treat the set of characters, “eia4”, as a single character.
[eia4] from above is an example of a class. What about the case, where you want a match from the class, [abcde] ? Note that this is a range of characters of the alphabet from ‘a’ to ‘e’. [abcde] is better written as, [a-e] . Note the use of the hyphen.
You can combine a range with single characters. For example, [a-dX] means ‘a’ or ‘b’ or ‘c’ or ‘d’ or ‘X’ , that is, the range a-d and X.
You can also have a range with digits, e.g. [0-5], [3-9], [0-9]
Negating the Class
If you want to match a character at a position in the subject except the ones in the class, then you would use the ^ symbol. So, if you want to match any character except the ones in [a-dX] , you would type [^a-dX] in the pattern. Note the use and position of ^ in the class construct. The following SELECT statement will return the first and second rows, matching, “dog” in the first and second cells of the third colon:
SELECT * from Pets where species rLike "d[^a-n]g";
The pattern in the statement is "d[^a-n]g" meaning that no character in the range, a-n, should be matched at the position in question. The pattern matched, begins with ‘d’, followed by a character that is not in the range, a-n, and then followed by ‘g’. ‘o’ is not in the range, so “dog” is matched.
Formal Character Classes
Some classes are so important that they have names. The following list gives the names of these classes and their descriptions:
Name: Description
alnum: Alphanumeric characters
alpha: Alphabetic characters
blank: Whitespace characters
cntrl: Control characters
digit: Digit characters
graph: Graphic characters
lower: Lowercase alphabetic characters
print: Graphic or space characters
punct: Punctuation characters
space: Space, tab, newline, and carriage return
upper: Uppercase alphabetic characters
xdigit: Hexadecimal digit characters
To use any of these classes, you use the marker construct, [[:name:]] .
The following SELECT statement will return the last row in the table matching “bir5d” for the digit, 5, in the fifth cell of the third column:
SELECT * from Pets where species rLike "bir[[:digit:]]d";
The pattern is, "bir[[:digit:]]d" . So the subject “bir5d” , having the digit, 5, at the fourth position is matched. [[:digit:]] is at the fourth position of the pattern, "bir[[:digit:]]d".
Let m and n be integers (whole numbers). x{m} in a pattern matches exactly m consecutive instances of x. x{m, } matches at least m consecutive instances of x. x{m,n} matches between m and n consecutive instances of x. We now look at some examples.
The pattern, "fe{1}d" , matches the subject “fed”, where ‘e’ is matched exactly one time. The pattern, "fe{1,}d" , matches the subject, “fed” and the subject, “feed”, where ‘e’ is matched at least once. The pattern, "fe{1,2}d" , matches the subject, “fed” and the subject, “feed” , where ‘e’ is matched at least one time and at most 2 times.
The constructs are not only used for a character; they can also be used for an atom (piece of the pattern). So, the pattern, "a[bcd]{3}e" matches the subject, “abcde”. The pattern, "a[bcd]{3,}e" matches the subject, “abcde”. The pattern, "a[bcd]{1,10}e" matches the subject, “abcde”. The atom is [bcd].
Now, [bcd] means ‘b’ or ‘c’ or ‘d’. So the sub pattern, "[bcd]{3)" , means “bbb” or “ccc” or “ddd” or “bcb” or “cbc” or “dcd” or “bdb” or “cdc” or “dbd”. The sub pattern [bcd]{3,} means a combination of at least any of the three characters. The sub pattern, [bcd]{1,10}, means at least one character from the three characters, up to a combination of 10 characters from the 3 characters.
White space character literals are: '\b', '\t', '\n', '\v', '\f', and '\r'. The names of these literals are respectively, backspace, tab, newline, vertical-tab, form-feed and carriage-return.
You can use the whitespace character literals, directly in the pattern. Assume that you had the following row added to the table:
| Lion | Peter | fi\nsh | f | 2008-12-09 | NULL |
The cell for species has the value, “fi\nsh”, with the newline character, ‘\n’. The following SELECT statement will return the row, matching the subject, “fi\nsh”.
SELECT * from Pets where species rLike "fi\nsh";
The pattern is, "fi\nsh".
You can also use the name of the white space character, but in this case, you will have to place it in the marker, [[.name.]] . The following SELECT statement returns the added row, matching the subject, "fi\nsh".
SELECT * from Pets where species rLike "fi[[.newline.]]sh";
The pattern is, "fi[[.newline.]]sh" .
You can use the literal in the marker. The following SELECT statement returns the added row, matching the subject, "fi\nsh".
SELECT * from Pets where species rLike "fi[[.\n.]]sh";
The pattern is, "fi[[.\n.]]sh"
The marker has two pairs of opposite square brackets and two dots.
The MySQL Regular Expressions Metacharacters
A metacharacter is a character used to control or define other characters. In the case of patterns, the MySQL Regular Expressions metacharacters are:
^ $ . * + ? | { } [ ] : = > <
In order to use any of these metacharacters in a pattern, you have to precede it with two backslashes. In this way, it will be seen as an ordinary character and it will not have its defined effect. So, the pattern, "2\\+3\\=6" , will match the subject, “2+3=5” .
That is it for this part of the series. We stop here and continue in the next part.
Chrys
Related Links
Major in Website DesignWeb Development Course
HTML Course
CSS Course
ECMAScript Course
NEXT