Security Risks and Prevention Explained for PHP Regular Expressions
PHP Cheat Sheet and Prevention Explained - Part 7
Foreword: In this part of the series, I explain security risks in PHP Regular Expressions, and how to prevent them.
By: Chrysanthus Date Published: 29 Jan 2019
Introduction
Risks are weaknesses, either in PHP itself or in your own code, that you may overlook and that attackers (hackers) can take advantage of.
The preg_match() Function
- if (preg_match("/World/", "Hello World") == 1)
And
- if (preg_match("/World/", "Hello World"))
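behave the same way in an if-condition. The first compares loosely with == , and the second converts the return value to a Boolean; neither can tell 0 (no match) from FALSE (error).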
The preg_match() function returns 1 if the pattern matches the given subject, 0 if it does not, or FALSE if an error occurred (for example, if the coding in the pattern does not make sense).
Solution: Since both forms treat 0 and FALSE alike, henceforth use the strict comparison operator === instead of
if (preg_match("/World/", "Hello World") == 1)
Or
if (preg_match("/World/", "Hello World"))
as in the following code:
<?php
if (preg_match("/World/", "Hello World") === 1)
{
echo 'Matched';
}
else
{
echo 'Not Matched or Error occurred!';
}
?>
This explanation for the preg_match() function also applies to the preg_match_all() function.
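When you need to tell "no match" apart from "error", you can test the return value against FALSE first. The following is a minimal sketch; describeMatch() is a hypothetical helper name, and the malformed pattern /(World/ is only there to provoke the error case:

```php
<?php
// Hypothetical helper: classifies the three possible results of preg_match().
function describeMatch(string $pattern, string $subject): string
{
    // @ suppresses the warning PHP raises for a malformed pattern,
    // so that only the return value is inspected here.
    $result = @preg_match($pattern, $subject);

    if ($result === false) {
        return 'error';
    }
    return $result === 1 ? 'matched' : 'not matched';
}

echo describeMatch('/World/', 'Hello World'), "\n";   // pattern found
echo describeMatch('/Earth/', 'Hello World'), "\n";   // pattern not found
echo describeMatch('/(World/', 'Hello World'), "\n";  // unbalanced parenthesis
```

With == or a bare if-condition, the last two calls could not be told apart.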
Class Range
Basic Range
A class is any (one) character within square brackets that would match at that position in the pattern. So /[bcr]at/ will match "bat" or "cat" or "rat", because b will match b in bat, c will match c in cat and r will match r in rat.
When the characters in the square brackets form a range, you use a hyphen for the range, to match any of the characters. So /[0-9]/ will match "ID5id" because 5 is a character in the range 0 to 9 inclusive. However, the use of hyphen for range in square brackets does not go without problems.
The hyphen is a character in its own right, that needs to be matched. Normally /big-file/ matches the subject, "big-file.php is the filename". However, - has a special meaning in the class with square brackets.
The class of word characters is, [0-9a-zA-Z_] . It means, any character from 0 to 9 inclusive, no space, any character from a to z inclusive, no space, any character from A to Z inclusive, no space, or the underscore. The hyphen is not a word character; it is a metacharacter here.
Meta Characters for regex are { } [ ] ( ) ^ $ . | * + ? - \
So, if you want a hyphen in a class, what do you do? PHP allows us to escape the hyphen, as in [g\-p] meaning either g or - or p. You can also place the hyphen at the end, as in [gp-] , or at the beginning, as in [-gp] , since a class matches any one of its characters, whatever their order.
The problem with the last two options (non escaping), is that the human mind can interpret p- as any character from p to z or interpret -g as any character from a to g. Prevention: always remember the rules for hyphen in a class, or always escape the hyphen when you need a hyphen in the class.
Outside a class, such as in /[0-9]big-file/ you do not need to escape the hyphen. However, if you escape it, such as in /[0-9]big\-file/ there is no problem.
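The following sketch compares the three ways of writing a literal hyphen in a class with a true range. The subjects '-', 'g' and 'h' are arbitrary test characters chosen for illustration:

```php
<?php
// Three ways to write a literal hyphen in a class: escaped, last, or first.
$literalHyphen = ['/^[g\-p]$/', '/^[gp-]$/', '/^[-gp]$/'];

foreach ($literalHyphen as $regex) {
    // Each class holds exactly g, p and the hyphen, so "h" is rejected.
    echo $regex, ' : ',
        preg_match($regex, '-'), ' ',
        preg_match($regex, 'g'), ' ',
        preg_match($regex, 'h'), "\n";
}

// With the hyphen between g and p, the class is the range g to p.
$range = '/^[g-p]$/';
echo $range, ' : ', preg_match($range, '-'), ' ', preg_match($range, 'h'), "\n";
```

For the first three patterns, the hyphen and g match while h does not; for the range, it is the other way round for the hyphen and h.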
[a-fz] and [a-f-m]
[a-fz] matches any letter between 'a' and 'f' (inclusive) or the letter 'z'. There is no problem here: just know that a range can be followed by other optional characters.
[a-f-m] matches any letter between 'a' and 'f' (inclusive), the hyphen ('-'), or the letter 'm'. The problem here is that the human mind can easily interpret this as the matching of any letter between a and f or any letter between f and m. Prevention: always remember the rules for hyphen in a class, or always escape the hyphen when you need a hyphen in a class.
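You can check this reading of [a-f-m] with a short script. It is only a sketch; the four test characters are arbitrary:

```php
<?php
$regex = '/^[a-f-m]$/';

// c lies in the range a-f; the hyphen and m are listed individually;
// h lies between f and m, but is NOT part of the class.
$results = [];
foreach (['c', '-', 'm', 'h'] as $char) {
    $results[$char] = preg_match($regex, $char);
}
print_r($results);

// The escaped form says the same thing without the ambiguity.
$explicit = '/^[a-f\-m]$/';
echo preg_match($explicit, 'h'), "\n";
```

The letter h is rejected by both patterns, which shows that f-m is not a second range.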
['-?]
Internally in the computer, a range depends on the character set. ['-?] matches any of the characters '()*+,-./0123456789:;<=>? because these characters occupy consecutive code points, from the apostrophe (0x27) to the question mark (0x3F), in the ASCII character set. Unicode keeps the same code points for these characters, so the range behaves the same way with ASCII, Latin-1 and UTF-8 subjects. The range would not be continuous on a system whose native character set is not based on ASCII (EBCDIC, for example). The bigger problem is that few readers can tell which characters lie between ' and ? , so an unwanted character can easily slip into the class. Note that the Perl named-character syntax, \N{APOSTROPHE} and \N{QUESTION MARK}, is not supported by PCRE, the engine behind the PHP preg functions. Prevention: list the characters you want explicitly, or write the end points of the range as hexadecimal escapes, as in [\x27-\x3F] .
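The following sketch writes the same range with hexadecimal escapes, which PCRE accepts inside a class. It assumes an ASCII-compatible encoding (ASCII, Latin-1 or UTF-8), where the apostrophe is 0x27 and the question mark is 0x3F:

```php
<?php
// Assumption: the subject is in an ASCII-compatible encoding.
$terse    = '/^[\'-?]+$/';        // end points given as characters
$explicit = '/^[\x27-\x3F]+$/';   // end points given as code points

// Every character from the apostrophe to the question mark.
$inRange = "'()*+,-./0123456789:;<=>?";

echo preg_match($terse, $inRange), "\n";
echo preg_match($explicit, $inRange), "\n";

// Neighbours just outside the range: & is 0x26 and @ is 0x40.
echo preg_match($explicit, '&'), "\n";
echo preg_match($explicit, '@'), "\n";
```

The hexadecimal form at least makes the end points of the range unambiguous to the reader.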
The \s Class
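\s does not behave with the Unicode character set as it does with the ASCII character set. By default it matches only the ASCII white-space characters, so a Unicode white-space character, such as the no-break space, may not be matched unless the pattern uses the u modifier. This gives rise to portability problems between systems. I will not go into the details - consult some other document for the details.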
\b and Class
Outside the character class and in the pattern, the escape sequence, \b is treated as a word boundary; inside a class, it is treated as a backspace character. Prevention: Know where to use \b and be careful how you code.
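The two meanings of \b can be seen side by side in the following sketch. The subjects are arbitrary, and "\x08" is the backspace character written as a PHP hexadecimal escape:

```php
<?php
// Outside a class, \b is a word boundary.
$wordBoundary = '/\bcat\b/';
echo preg_match($wordBoundary, 'a cat sat'), "\n";    // "cat" stands alone as a word
echo preg_match($wordBoundary, 'concatenate'), "\n";  // "cat" is buried inside a word

// Inside a class, \b is the backspace character (code 0x08).
$backspace = '/[\b]/';
echo preg_match($backspace, "a\x08b"), "\n";      // subject contains a backspace
echo preg_match($backspace, 'a cat sat'), "\n";   // no backspace in the subject
```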
The Period (.) and the Class
Outside the character class and in the pattern, the period matches any character except \n . Inside the character class, it is ordinary and it matches only the dot. Prevention: Know where to use the period as a meta character in the pattern, and be careful how you code. Escaping or not escaping the period in the class means the same thing. If you are looking for the dot in a pattern outside the class, you must escape it; otherwise the pattern accepts any character where you expected a dot.
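The following sketch shows what an unescaped period lets through. The filename report.txt and the variable names are only illustrative:

```php
<?php
$loose  = '/^report.txt$/';    // unescaped: the period matches any character
$strict = '/^report\.txt$/';   // escaped: only a real dot matches
$class  = '/^report[.]txt$/';  // in a class, the period is already ordinary

// All three patterns accept the intended name.
echo preg_match($loose, 'report.txt'),
     preg_match($strict, 'report.txt'),
     preg_match($class, 'report.txt'), "\n";

// Only the loose pattern accepts a name with no dot at all.
echo preg_match($loose, 'reportXtxt'),
     preg_match($strict, 'reportXtxt'),
     preg_match($class, 'reportXtxt'), "\n";
```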
Group and the Class
For a pattern, a class inside a group, as in ([abc]) , is accepted, but a group inside a class does not work: in [(ab)] the parentheses are ordinary characters. That is, () does not group or capture inside a class. Actually, most characters that are meta characters in regular expressions (that is, characters that carry a special meaning, like . , * , or ) ) lose their special meaning inside a character class and can be used there without the need to escape them.
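So, though [h(ei)g] matches in the following script, there is no capturing:
<?php
preg_match("/[h(ei)g]+/", "height", $matches);
echo $matches[0], '<br>';
// $matches[1] is never set, so test for it instead of echoing it.
var_dump(isset($matches[1]));
?>
The output is:
heigh
bool(false)
bool(false) here means that $matches[1] does not exist; that is, nothing was captured.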
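For contrast, here is a sketch of the accepted arrangement, a class inside a group. The pattern ([hei]+)g is only an illustration:

```php
<?php
// The class [hei] sits inside the capturing group ( ... ).
preg_match('/([hei]+)g/', 'height', $matches);

echo $matches[0], "\n";  // the whole match
echo $matches[1], "\n";  // what the group captured
```

Here $matches[0] holds heig and $matches[1] holds hei, because the parentheses are outside the class and therefore capture.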
Negation can occur in a class. A class can be negated by having the caret (^) as the first character within the square brackets. An escape sequence can also be a negated class. [^0-9] matches any character that is not a digit at that position. It is the same as \D , which is an escape sequence. The immediate problem is, if the caret is a normal character that you want to match in the class, what do you do? Solution: escape the caret, as in [\^0123] or [01\^23]. Or do not let it be the first character, as in [0123^].
If you are looking for the caret outside the class, you must escape it, otherwise it would mean the start of the string. So /[0123]\^/ matches "3^", while /[0123]^/ is accepted by PHP but can never match, because the start of the string cannot come after a character. The caret is also a metacharacter.
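The following sketch puts the different positions of the caret to the test. The leading ^ and trailing $ outside the brackets are anchors, added so that each subject must be exactly one character:

```php
<?php
$escaped  = '/^[\^0123]$/';  // caret escaped: the class holds ^, 0, 1, 2, 3
$notFirst = '/^[0123^]$/';   // caret not first: the same five characters
$negated  = '/^[^0123]$/';   // caret first: anything EXCEPT 0, 1, 2, 3

foreach (['^', '0', 'x'] as $char) {
    echo $char, ' : ',
        preg_match($escaped, $char), ' ',
        preg_match($notFirst, $char), ' ',
        preg_match($negated, $char), "\n";
}

// Outside the class: an escaped caret is a literal, an unescaped one is an anchor.
echo preg_match('/[0123]\^/', '3^'), "\n";
echo preg_match('/[0123]^/', '3^'), "\n";
```

Note that the negated class also matches the caret itself, since ^ is not one of 0, 1, 2, 3; that is easy to mistake for a literal match.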
Matching the Start of the Subject
To match the start of the subject, place the caret at the beginning of the pattern, as in /^one/ . The immediate problem is, if the caret is a normal character at the start of the subject, what do you do? Solution: escape the caret as in /\^one/ .
Another problem arises if the caret is in the middle of the subject and you do not escape it in the pattern: your program will say, Not Matched, when there should have been a match, and no error message is issued. For example, /one^two/ will not match "one^two" unless you escape ^ like so, /one\^two/ . Your script continues to run erroneously, with users ending up with wrong results. Solution: If you are looking for ^ as an ordinary character, always escape it in the pattern.
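A short sketch of the silent failure described above, using the same patterns and subjects:

```php
<?php
$subject = 'one^two';

// Unescaped: ^ is an anchor in the middle of the pattern, so no match and no error.
$unescaped = preg_match('/one^two/', $subject);

// Escaped: ^ is an ordinary character.
$escaped = preg_match('/one\^two/', $subject);

var_dump($unescaped);  // 0 is a normal "not matched" result, not FALSE
var_dump($escaped);

// At the start of the pattern, the same distinction applies.
var_dump(preg_match('/^one/', 'one two'));   // anchor: subject starts with "one"
var_dump(preg_match('/\^one/', '^one two')); // literal caret followed by "one"
```

Since the unescaped pattern returns 0 and not FALSE, even a strict === check cannot reveal the mistake.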
The Escape Character \
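If you are looking for the ordinary escape character in a pattern, inside or outside the class, always escape it, like so, \\ . Remember that the backslash is also the escape character of PHP string literals, so the two backslashes of the regex have to be typed as four in the PHP source, as in '/\\\\/' .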
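The following sketch searches for a literal backslash. The subject C:\temp is only illustrative; preg_quote() is the PHP function that escapes regex metacharacters, the backslash included:

```php
<?php
$subject = 'C:\\temp';   // the seven characters C:\temp (one backslash)

// Four backslashes in the PHP source give the regex /\\/ ,
// which matches one literal backslash.
$regex = '/\\\\/';
echo preg_match($regex, $subject), "\n";

// preg_quote() does the regex-level escaping for you.
$quoted = preg_quote('a\\b.c', '/');   // input is the five characters a\b.c
echo $quoted, "\n";
echo preg_match('/^' . $quoted . '$/', 'a\\b.c'), "\n";
```

Using preg_quote() is a convenient way to avoid counting backslashes when a literal piece of text has to go into a pattern.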
Matching Alternatives
Sometimes we would like our regex to be able to match different possible words or character strings. This is accomplished by using the alternation metacharacter '|' . To match dog or cat , we form the regex /dog|cat/ . PHP will try to match the regex at the earliest possible point in the subject string. At each character position, PHP will first try to match the first alternative, dog . If dog doesn't match, PHP will then try the next alternative, cat . If cat doesn't match either, then the match fails and PHP moves to the next position in the string.
The following conditional will match "cat":
if (preg_match("/cat|dog|bird/", "cats and dogs") === 1)
"cat" alone (without "dog") is matched.
The following conditional will still match "cat":
if (preg_match("/dog|cat|bird/", "cats and dogs") === 1)
"cat" alone (without "dog") is matched.
Though dog is the first alternative in the second regex, cat is able to match earlier in the string.
Problem: In the second regex, "cat" alone is matched, earlier, instead of dog. Solution: To match all possibilities, use the preg_match_all() function. The following code illustrates this:
<?php
preg_match_all("/dog|cat|bird/", "cats and dogs", $matches);
echo $matches[0][0], '<br>';
echo $matches[0][1], '<br>';
?>
The output is:
cat
dog
The following conditional matches "c":
if (preg_match("/c|ca|cat|cats/", "cats") === 1)
Here, all the alternatives match at the first string position, so the first complete alternative in the regex, is the one that matches (is taken) - this is a problem. Solution: If some of the alternatives are truncations of the others, put the longest ones first to give them a chance to match, as in the following code, which matches, "cats":
<?php
if (preg_match("/cats|cat|ca|c/", "cats", $matches) === 1)
echo $matches[0];
?>
The output is:
cats
If you really want to match all alternatives, then use preg_match_all().
The following conditional matches "c":
if (preg_match("/a|b|c/", "cab") === 1)
Although c is the last alternative in the regex, it is the only one that can match at the first position of the subject, so it is the one taken. This example also shows that character classes are like alternations of single characters: /a|b|c/ is equivalent to /[abc]/ .
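The equivalence can be checked with the PREG_OFFSET_CAPTURE flag, which makes preg_match() report the position of the match as well as its text. This is only a sketch; the variable names are illustrative:

```php
<?php
$subject = 'cab';

// With PREG_OFFSET_CAPTURE, $matches[0] is [matched text, offset in subject].
preg_match('/a|b|c/', $subject, $byAlternation, PREG_OFFSET_CAPTURE);
preg_match('/[abc]/', $subject, $byClass, PREG_OFFSET_CAPTURE);

print_r($byAlternation[0]);
print_r($byClass[0]);
```

Both patterns report the text c at offset 0: the earliest position in the subject wins, whatever the order of the alternatives.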
Quantifiers
Quantifiers are:
x* : means match 'x' 0 or more times, i.e., any number of times
x+ : means match 'x' 1 or more times, i.e., at least once
x? : means match 'x' 0 or 1 times
x{n,} : means match 'x' n or more times; note the comma.
x{n} : match 'x' exactly n times
x{n,m} : match 'x' at least n times, but not more than m times.
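The following sketch exercises each quantifier once. The patterns are anchored with ^ and $ so that the whole subject has to fit, and the subjects are arbitrary:

```php
<?php
// Each row: pattern, subject, expected result of preg_match().
$checks = [
    ['/^ax*$/',     'a',     1],  // zero x's is allowed
    ['/^ax+$/',     'a',     0],  // at least one x is required
    ['/^ax?$/',     'axx',   0],  // at most one x is allowed
    ['/^ax{2,}$/',  'axxxx', 1],  // two or more x's
    ['/^ax{2}$/',   'axxx',  0],  // exactly two x's
    ['/^ax{2,3}$/', 'axxx',  1],  // between two and three x's
];

foreach ($checks as [$regex, $subject, $expected]) {
    $result = preg_match($regex, $subject);
    echo $regex, ' on "', $subject, '" : ', $result,
        ($result === $expected ? '' : ' (unexpected)'), "\n";
}
```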
The Greediness of x* or x+ with the Dot
Consider the following code segment that produces a match:
$subject = "In a meeting, you have to greet people";
$regex = "/m.*t/";
preg_match($regex, $subject, $matches);
The regex says, match from ‘m’ and then any character as many times as possible until ‘t’. From the subject string, the possible matches are “meet” or “meeting, you have to greet”. In practice, the matching statement above will match, “meeting, you have to greet”; that is greediness.
Consider this time, the following code segment that also produces a match:
$subject = "In a meeting, you have to greet people";
$regex = "/m.+t/";
preg_match($regex, $subject, $matches);
The regex says, match from ‘m’ and then any character you meet next, but as many times as possible until ‘t’. From the subject, the possible matches again are “meet” or “meeting, you have to greet”. In practice, the matching statement above will match, “meeting, you have to greet”; that is greediness.
Limiting the greediness makes the quantifier match as few characters as possible, so that the match ends at the first place where the rest of the pattern can match. To achieve this, append ? to the quantifier symbol, that is, x*? or x+? . Try the following script:
<?php
$subject1 = "In a meeting, you have to greet people";
$regex1 = "/m.*t/";
preg_match($regex1, $subject1, $matches);
echo $matches[0], '<br>';
$subjectA = "In a meeting, you have to greet people";
$regexA = "/m.*?t/";
preg_match($regexA, $subjectA, $matches);
echo $matches[0], '<br>';
$subject2 = "In a meeting, you have to greet people";
$regex2 = "/m.+t/";
preg_match($regex2, $subject2, $matches);
echo $matches[0], '<br>';
$subjectB = "In a meeting, you have to greet people";
$regexB = "/m.+?t/";
preg_match($regexB, $subjectB, $matches);
echo $matches[0], '<br>';
?>
The output is:
meeting, you have to greet
meet
meeting, you have to greet
meet
In this code, where ? was appended, you have “meet” as the matched substring.
The x?, x{n,} and x{n,m} Quantifiers
The x?, x{n,} and x{n,m} quantifiers are greedy as well: each matches as many times as it is allowed to. In each case, the limitation is to append ? to the quantifier symbol. Let us consider them one-by-one.
The x? Quantifier
Consider the following statement:
preg_match("/(b.?)/", "The book is nice", $matches);
where the subject is "The book is nice" and the regex is /(b.?)/.
The regex says, match b followed by any character, zero or 1 time. So, it can match “b” or “bo”. In practice, this statement will match “bo”; that is greediness. The limitation solution is to type ? after the quantifier symbol ? , as in /(b.??)/ , to match ‘b’ alone.
The x{n,} Quantifier
Consider the following statement:
preg_match("/(m.{2,}t)/", "In a meeting, you have to greet people.", $matches);
In practice, you will have “meeting, you have to greet” and not “meet” matched; that is greediness. To have “meet”, use the syntax x{n,}? or exactly x{n}. With this subject, m.{2}t gives the same result as m.{2,}?t , because exactly two characters separate the first m from the next t.
The x{n,m} Quantifier
Consider the following statement:
preg_match("/(m.{2,24}t)/", "In a meeting, you have to greet people", $matches);
In practice, you will have “meeting, you have to greet” and not “meet” matched; that can be interpreted as greediness. To have “meet”, use m.{2,24}?t .
Read and try the following code that demonstrates the above:
<?php
$subject1 = "The book is nice";
$regex1 = "/b.?/";
preg_match($regex1, $subject1, $matches);
echo $matches[0], '<br>';
$regexA = "/b.??/";
preg_match($regexA, $subject1, $matches);
echo $matches[0], '<br>';
$subject2 = "In a meeting, you have to greet people";
$regex2 = "/m.{2,}t/";
preg_match($regex2, $subject2, $matches);
echo $matches[0], '<br>';
$regexB = "/m.{2,}?t/";
preg_match($regexB, $subject2, $matches);
echo $matches[0], '<br>';
$regexB1 = "/m.{2}t/";
preg_match($regexB1, $subject2, $matches);
echo $matches[0], '<br>';
$subject3 = "In a meeting, you have to greet people";
$regex3 = "/m.{2,24}t/";
preg_match($regex3, $subject3, $matches);
echo $matches[0], '<br>';
$regexC = "/m.{2,24}?t/";
preg_match($regexC, $subject3, $matches);
echo $matches[0], '<br>';
?>
The output is:
bo
b
meeting, you have to greet
meet
meet
meeting, you have to greet
meet
Note: When the limitation of the greediness is given, the quantifier may be said to be non-greedy.
An empty pattern matches a subject that has text. The following code illustrates this:
<?php
if (preg_match("//", "Text of coded message.") === 1)
echo 'Matched';
?>
The output is:
Matched
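An empty pattern looks for nothing (the empty string), and the empty string can be found at the very start of any subject; so it matches.
You can also have a pattern that matches far more than you may intend. The negated class [^0-9] matches any character that is not a digit, so the following pattern matches as soon as the subject has one non-digit character: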
<?php
if (preg_match("/[^0-9]/", "abcdefg") === 1)
echo 'Matched';
?>
The output is:
Matched
An empty pattern also matches an empty string, as in the following code:
<?php
if (preg_match("//", "") === 1)
echo 'Matched';
?>
The output is:
Matched
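The risk with an empty pattern shows up when the pattern is built from input, such as a search term. The following is a minimal sketch under that assumption; containsTerm() is a hypothetical helper, not a PHP function:

```php
<?php
// What the empty pattern actually matched: the empty string at offset 0.
preg_match('//', 'Text of coded message.', $matches, PREG_OFFSET_CAPTURE);
var_dump($matches[0][0]);  // matched text
var_dump($matches[0][1]);  // offset in the subject

// Hypothetical guard: refuse to search with an empty term,
// because the resulting empty pattern would match every subject.
function containsTerm(string $term, string $subject): bool
{
    if ($term === '') {
        return false;
    }
    // preg_quote() makes the term literal; "/" is the delimiter used here.
    return preg_match('/' . preg_quote($term, '/') . '/', $subject) === 1;
}

var_dump(containsTerm('', 'Text of coded message.'));
var_dump(containsTerm('coded', 'Text of coded message.'));
```

Whether an empty term should count as "not found" is a design choice; the point is to decide it explicitly and not let the empty pattern decide for you.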
That is it for this part of the series. We stop here and continue in the next part.
Chrys