Regular Expressions

A regular expression (regex) is a pattern used for searching, editing, extracting, and manipulating text. With a regular expression you can check whether a string matches a given pattern, or find a particular set of characters within a sentence or a larger body of text.

Regular expressions are patterns that can be matched against strings. They are also used for replacing, splitting, and rearranging text. Regular expressions follow a similar syntax in most programming languages.

For example, to match the HTML tags <h1>, <h2>, <h3>, <h4>, <h5>, <h6> and </h1>, </h2>, </h3>, </h4>, </h5>, </h6>, write a regex pattern like this: /<\/?h[1-6]>/. See the following code:

<?php
 $html = '<H1>Heading 1</H1>';
 $pattern = '/<\/?h[1-6]>/i';
 echo preg_match($pattern, $html);
 //Prints 1

The preg_match function searches the string for a match using the pattern (/ is the pattern delimiter) and returns 1 if the pattern matches, 0 if it does not match, or false if an error occurred.

PHP uses preg_match and preg_match_all functions for matching and preg_replace function for replacement. You’ll read about these functions later.
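As a brief preview, the following sketch calls all three functions with the heading pattern from above. The sample markup and the variable names are made up for illustration.

```php
<?php
 // Made-up sample markup
 $html = '<h1>Title</h1><h2>Subtitle</h2>';
 $pattern = '/<\/?h[1-6]>/i';

 // preg_match() stops at the first match and stores it in $matches
 preg_match($pattern, $html, $matches);
 echo $matches[0], "\n"; //Prints: <h1>

 // preg_match_all() finds every match and returns how many it found
 $count = preg_match_all($pattern, $html, $all);
 echo $count, "\n"; //Prints: 4

 // preg_replace() replaces every match, here with an empty string
 $text = preg_replace($pattern, '', $html);
 echo $text, "\n"; //Prints: TitleSubtitle
```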

JavaScript example:

As already mentioned, regular expressions follow a similar syntax in most programming languages. In the following example, we use the same pattern in JavaScript code.

 var html = '<H1>Heading 1</H1>';
 var pattern = /<\/?h[1-6]>/i;
 pattern.test(html);
 //true

This tutorial covers the following topics:

  1. Regular Expression Syntax
  2. Pattern delimiter
  3. Anchors – matching the start and end of a string
  4. Quantifiers
  5. Character classes
  6. Negation character classes
  7. Named character classes
  8. Matching any character with wildcard
  9. Groups or Subpatterns
  10. Backreferences
  11. Alternation – multiple regular expressions
  12. Escape sequences
  13. Pattern modifiers

Regular Expression Syntax

  1. Regular expressions must always be quoted as strings, for example, '/pattern/' or "/pattern/".
  2. The pattern is enclosed in a pair of delimiters, which can be almost any punctuation character, for example, '/pattern/', '#pattern#', "%pattern%", etc.

The following example shows how preg_match() is called to find the literal pattern "cat" in the subject string "raining cats and dogs":

<?php
 $string = 'raining cats and dogs';
 $pattern= '/cat/';
 if (preg_match($pattern, $string) === 1)
  echo 'Found "cat"';
 //Prints: Found "cat"

Pattern delimiter

As we already mentioned above, a delimiter can be any non-alphanumeric, non-backslash, non-whitespace character. The following are equivalent:

<?php
 $pattern = "/cat/";

 //same as previous, but different delimiter
 $pattern = '~cat~';

 //same as previous, but different delimiter
 $pattern = '!cat!';

 //same as previous, but different delimiter
 $pattern = '#cat#';

If the delimiter occurs within the regex, it must be escaped with a backslash. To avoid this, choose a delimiter that does not occur in the regex. For example, if the forward slash / occurs in the regex, you can use the pipe sign | as the delimiter: '|reg/ex|'. Otherwise, prefer the forward slash /, which is the conventional regex delimiter in most programming languages. See the following example:

<?php
 //Escape forward slashes 
 $pattern = '/https:\/\//';

 //No need to escape
 $pattern = '#https://#';

Meta Characters

The punctuation characters $^*()+.?[]\{}| are called metacharacters. They have a special meaning inside a pattern and are what make regular expressions work. Here is an overview of these special characters:

Anchors – start and end of a string

A regular expression can specify that a pattern occurs at the start or end of a subject string using anchors.

Meta   Description
^      Beginning of the text
$      End of the text; defines where the pattern ends

  • /^hello$/ – The “h” must be at the beginning of the text and the “o” at the end, so this matches the string “hello” exactly, with nothing before or after it.
  • /^hello/ – If the word you’re looking for is at the beginning of the text.
  • /hello$/ – If the word you’re looking for is at the end of the text.
  • /hello/ – If the word you’re looking for is anywhere in the text.

<?php
 // Matches if string starts with "hello"
  echo preg_match('/^hello/', 'hello, world'); #Prints: 1
  echo preg_match('/^hello/', 'hi, world');    #Prints: 0

 // Matches if string ends with "world"
  echo preg_match('/world$/', 'hi, world');    #Prints: 1
  echo preg_match('/world$/', 'hi, friends');  #Prints: 0

 // Must match "hello" exactly
 echo preg_match('/^hello$/', 'hello');        #Prints: 1
 echo preg_match('/^hello$/', 'hello, world'); #Prints: 0

Quantifiers

The characters *, +, ?, {, and } are interpreted as quantifiers unless they appear inside a character class. Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found.

Meta    Description
?       Once or not at all (0 – 1), equivalent to {0,1}
        /hi!?/ matches hi and hi!
*       Zero or more times (0 – ∞), equivalent to {0,}
        /hi!*/ matches hi, hi!, hi!!, hi!!!, and so on.
+       One or more times (1 – ∞), equivalent to {1,}
        /hi!+/ matches hi!, hi!!, hi!!!, and so on.
{n}     Exactly n times (where n is a number)
        /hi!{3}/ matches hi!!!
{n,}    At least n times (n – ∞)
        /hi!{0,}/ works like *
        /hi!{1,}/ works like +
{0,m}   Between zero and m times (0 – m), where m is a number
        /hi!{0,3}/ matches hi, hi!, hi!!, and hi!!!
{n,m}   At least n but not more than m times
        /hi!{0,1}/ works like ?
        /hi!{0,2}/ matches hi, hi!, and hi!!

Write the braces without spaces. Older PCRE versions treat {0, 1} (with a space) and the shorthand {,m} as literal text rather than as quantifiers; only newer versions, such as the one bundled with PHP 8.4, accept them.
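The quantifiers can be checked with a few anchored patterns. This is only a sketch: the anchors ^ and $ are added so that extra exclamation marks cause a mismatch, and the helper name matches_fully() is hypothetical.

```php
<?php
 // Hypothetical helper: true if preg_match() reports a match
 function matches_fully(string $pattern, string $subject): bool {
  return preg_match($pattern, $subject) === 1;
 }

 var_dump(matches_fully('/^hi!?$/', 'hi'));        // bool(true)
 var_dump(matches_fully('/^hi!?$/', 'hi!!'));      // bool(false)
 var_dump(matches_fully('/^hi!*$/', 'hi!!!!'));    // bool(true)
 var_dump(matches_fully('/^hi!+$/', 'hi'));        // bool(false)
 var_dump(matches_fully('/^hi!{3}$/', 'hi!!!'));   // bool(true)
 var_dump(matches_fully('/^hi!{2,}$/', 'hi!'));    // bool(false)
 var_dump(matches_fully('/^hi!{0,2}$/', 'hi!!'));  // bool(true)
```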

Character classes

A character class [ ] matches exactly one character out of the set listed between the brackets. For example, [aeiou] matches either a, e, i, o, or u. A hyphen - creates a range when it is placed between two characters. The range includes the character before the hyphen, the character after the hyphen, and all characters that lie between them in numerical order. See the following examples:

Meta          Description
[0-9]         Matches any digit.
[a-z]         Matches any lowercase letter.
[A-Z]         Matches any uppercase letter.
[a-zA-Z0-9]   Matches any alphanumeric character.
gr[ae]y       Matches grey or gray but not graey.

For example, to match a four-character string that starts with b, ends with ll, and contains a vowel in between, use the following expression:

<?php
 echo preg_match('/b[aeiou]ll/', 'bell'); #Prints: 1

This matches any string that contains “ball”, “bell”, “bill”, “boll”, or “bull”.
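Several ranges can be combined inside one class. As an illustration (the naming rule is an assumption made for this example), the following sketch accepts names that start with a letter or underscore and continue with letters, digits, or underscores.

```php
<?php
 // First character: letter or underscore; the rest: letters, digits, underscores
 $pattern = '/^[a-zA-Z_][a-zA-Z0-9_]*$/';
 echo preg_match($pattern, 'user_name1'); #Prints: 1
 echo preg_match($pattern, '1st_user');   #Prints: 0
 echo preg_match($pattern, 'user-name');  #Prints: 0
```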

Negation character classes (match any character that is not listed)

The caret ^ at the beginning of a class means “not”. If a character class starts with the ^ metacharacter, it matches only those characters that are not in that class.

Meta           Description
[^A-Z]         Matches everything except the uppercase letters A through Z
[^a-z]         Matches everything except the lowercase letters a through z
[^0-9]         Matches everything except the digits 0 through 9
[^a-zA-Z0-9]   Matches everything except letters and digits

The following code matches strings that contain no letters at all (lowercase or uppercase):

<?php
 $pattern  = '/^[^a-zA-Z]+$/';
 echo preg_match($pattern, 'hello'); #Prints: 0
 echo preg_match($pattern, '1234'); #Prints: 1
 echo preg_match($pattern, 'hi12'); #Prints: 0
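Negated classes are also handy for cleaning up text. The sketch below uses preg_replace() to drop every character that is not an ASCII letter or digit; the sample string is made up.

```php
<?php
 // Made-up sample input
 $raw = 'Order #42: paid!';
 // Remove every run of characters that are not ASCII letters or digits
 $clean = preg_replace('/[^a-zA-Z0-9]+/', '', $raw);
 echo $clean; #Prints: Order42paid
```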

Named character classes

Named Class   Description
[:alnum:]     Matches all ASCII letters and numbers. Equivalent to [a-zA-Z0-9].
[:alpha:]     Matches all ASCII letters. Equivalent to [a-zA-Z].
[:blank:]     Matches spaces and tab characters. Equivalent to [ \t].
[:space:]     Matches any whitespace character, including space, tab, newline, carriage return, vertical tab, and form feed. Equivalent to [ \t\n\r\x0b\x0c].
[:cntrl:]     Matches unprintable control characters. Equivalent to [\x00-\x1f\x7f].
[:digit:]     Matches ASCII digits. Equivalent to [0-9].
[:lower:]     Matches lowercase letters. Equivalent to [a-z].
[:upper:]     Matches uppercase letters. Equivalent to [A-Z].

A named class is only valid inside a character class, so it is written with double brackets, for example [[:digit:]]:

<?php
 $ptrn = '/[[:digit:]]/';
 echo preg_match($ptrn, 'Hello'); # Prints 0
 echo preg_match($ptrn, '150'); # Prints 1

Wild Card – dot or period to match any character

To represent any character in a pattern, a . (period) is used as a wildcard. The pattern "/e../" matches any three-character sequence that begins with a lowercase "e", for example, eat, egg, end, etc. To match an actual period, escape it with the backslash character \. For example, "/brainbell\.com/" matches brainbell.com but not brainbell_com.
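A short sketch of the difference between an unescaped and an escaped dot. The anchors ^ and $ in the first two calls are an addition for this example, so that the subject must be exactly three characters long.

```php
<?php
 // The dot stands for exactly one character (other than a newline)
 echo preg_match('/^e..$/', 'egg');  #Prints: 1
 echo preg_match('/^e..$/', 'eggs'); #Prints: 0

 // Unescaped dot: any character is accepted in that position
 echo preg_match('/brainbell.com/', 'brainbell_com');  #Prints: 1
 // Escaped dot: only a literal period is accepted
 echo preg_match('/brainbell\.com/', 'brainbell_com'); #Prints: 0
 echo preg_match('/brainbell\.com/', 'brainbell.com'); #Prints: 1
```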

Groups or subpatterns

Parentheses ( ) are used to define groups in regular expressions. You can apply the quantifiers *, +, and ? to such a group, too. Groups also let you extract parts of the matched input.

Meta        Description
( )         Capturing group
(?<name>)   Named capturing group
(?:)        Non-capturing group
(?=)        Positive look-ahead
(?!)        Negative look-ahead
(?<=)       Positive look-behind
(?<!)       Negative look-behind

Repeating operators (quantifiers) can be applied to groups. The following pattern matches “123”, “123123”, “123123123”, and so on.

<?php
 $pattern = '/(123)+/';
 echo preg_match($pattern, '123'); # Prints 1
 echo preg_match($pattern, '123123'); # Prints 1
 echo preg_match($pattern, '123123123'); # Prints 1

The following example matches a simple URL:

<?php
 $pattern = '!^(https?://)?[a-zA-Z]+(\.[a-zA-Z]+)+$!';
 echo preg_match($pattern, 'brainbell.com'); //Prints: 1
 echo preg_match($pattern, 'http://brainbell.com'); //Prints: 1
 echo preg_match($pattern, 'https://brainbell.com'); //Prints: 1
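The group types listed in this section can be tried out with the optional third argument of preg_match(), which receives the captured text. The following is a minimal sketch; the sample strings and the group names year and month are made up for illustration.

```php
<?php
 // Capturing groups: $m[1] and $m[2] hold the text matched by each group
 preg_match('/(\d{4})-(\d{2})/', 'Released 2024-05', $m);
 echo $m[1], ' ', $m[2], "\n"; //Prints: 2024 05

 // Named capturing groups are also available under their (made-up) names
 preg_match('/(?<year>\d{4})-(?<month>\d{2})/', 'Released 2024-05', $named);
 echo $named['year'], ' ', $named['month'], "\n"; //Prints: 2024 05

 // A non-capturing group groups the alternatives without storing them
 preg_match('/(?:Mr|Ms)\. (\w+)/', 'Ms. Smith', $title);
 echo $title[1], "\n"; //Prints: Smith

 // Positive look-ahead: digits only if "px" follows; "px" is not part of the match
 preg_match('/\d+(?=px)/', 'width: 100px', $ahead);
 echo $ahead[0], "\n"; //Prints: 100

 // Negative look-behind: a whole number that is not preceded by "$"
 preg_match('/(?<!\$)\b\d+/', 'cost $15 for 30 items', $behind);
 echo $behind[0], "\n"; //Prints: 30
```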

Backreferences

You can refer to a group (or subpattern) captured earlier in a pattern with a backreference. The \1 refers to the first subpattern, \2 refers to the second subpattern, and so on.

<?php
 $subject = 'PHP PHP Tutorials';
 $pattern = '/(PHP)\s+\1/';
 echo preg_match($pattern, $subject);
 //Prints: 1

You cannot use backreferences with a non-capturing subpattern (?:). See the following code:

<?php
 $subject = 'PHP PHP Tutorials';
 $pattern = '/(?:PHP)\s+\1/';
 echo preg_match($pattern, $subject);
 /*Warning: preg_match():
    Compilation failed:
    reference to non-existent subpattern*/

If a pattern is enclosed in double quotes, backreferences must be written as \\1, \\2, \\3, and so on, because PHP would otherwise interpret \1 as an octal character escape.

<?php
 $subject = 'PHP PHP Tutorials';
 $pattern = "/(PHP)\s+\\1/";
 echo preg_match($pattern, $subject);
 //Prints: 1
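A backreference becomes more useful when the group matches a variable word. The following sketch finds any word that is immediately repeated; it relies on \w and \b (word character and word boundary). A named group can be referenced with \k<name>; the group name word is hypothetical.

```php
<?php
 // Made-up sample sentence with two doubled words
 $subject = 'This is is a test test sentence';

 // (\w+) captures a word, \1 requires the same word again
 preg_match_all('/\b(\w+)\s+\1\b/', $subject, $m);
 echo implode(', ', $m[1]), "\n"; //Prints: is, test

 // The same idea with a named group and a named backreference
 echo preg_match('/(?<word>\w+)\s+\k<word>/', 'PHP PHP Tutorials'); //Prints: 1
```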

For more information visit: Using backreferences with preg_replace().

Alternation – combine multiple regex

The | operator has the lowest precedence of the regular expression operators, treating the largest surrounding expressions as alternative patterns. This operator splits the regular expression into multiple alternatives. School|College|University matches School, College, or University with each match attempt. Only one name matches each time, but a different name can match each time. /a|b|c/ matches a, or b, or c with each match attempt.

<?php
 $pattern = '/(c|b|r)at/';
 echo preg_match($pattern, 'cat'); // 1
 echo preg_match($pattern, 'rat'); // 1
 echo preg_match($pattern, 'bat'); // 1
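Because | has the lowest precedence, an anchor belongs to a single alternative unless the alternatives are grouped. A small sketch of the difference, using made-up subject strings:

```php
<?php
 // Without a group: "starts with cat" OR "ends with dog"
 echo preg_match('/^cat|dog$/', 'catalog');   // 1
 // With a group: the whole string must be "cat" or "dog"
 echo preg_match('/^(cat|dog)$/', 'catalog'); // 0
 echo preg_match('/^(cat|dog)$/', 'dog');     // 1
```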

Escape characters \

\ (the backslash) masks metacharacters and special characters so they no longer have a special meaning. To match a metacharacter as a regular character, put a backslash in front of it. For example, if you want to match one of these characters: $^*()+.?[\{|, you have to escape that character with \. The following example matches $100:

<?php
 echo preg_match('/\$[0-9]+/', '$100'); #Prints: 1

Note: In a double-quoted string, PHP itself interprets the backslash, so to pass a backslash on to the regex engine you need to escape it with another backslash. The following example shows how the regular expression pattern “\$” is written in a double-quoted string:

<?php
 //Escaping $ with \\ in a double-quoted pattern
 echo preg_match("/\\$[0-9]+/", '$100'); #Prints: 1

 //A single \ is consumed by PHP, leaving the anchor $, so the pattern does not match
 echo preg_match("/\$[0-9]+/", '$100'); #Prints: 0

It’s better to avoid confusion and use single quotes when passing a string as a regular expression:

<?php
 echo preg_match('/\$[0-9]+/', '$100'); #Prints: 1

The backslash itself is a metacharacter, too. To match a literal backslash, the regex engine must receive \\, so in the PHP string you write \\\ (or, less ambiguously, \\\\).

<?php
 #Single-quoted string
 echo preg_match('/\\\/', '\ backslash'); #Prints: 1
 
 #Double-quoted string with only two backslashes: PHP reduces \\ to \, which escapes the closing delimiter
 echo preg_match("/\\/", '\ backslash');
 #Warning: preg_match(): No ending delimiter '/' found

Read more details on Escaping special characters in a regular expression.
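When the text to match comes from a variable, escaping it by hand is error-prone. PHP's preg_quote() function adds the backslashes for you, and its optional second argument names the delimiter so that it is escaped as well. A minimal sketch with a made-up search term:

```php
<?php
 // Made-up search term containing several metacharacters
 $term = '$5.00 (approx.)';
 // preg_quote() escapes $, ., ( and ) so they are matched literally
 $pattern = '/' . preg_quote($term, '/') . '/';
 echo preg_match($pattern, 'Price: $5.00 (approx.)'); #Prints: 1
 echo preg_match($pattern, 'Price: $5x00 (approx.)'); #Prints: 0
```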

You can also use the backslash to specify generic character types:

  • \d any decimal digit, equivalent to [0-9]
  • \D any character that is not a decimal digit, equivalent to [^0-9]
  • \s any whitespace character
  • \S any character that is not a whitespace character
  • \w any “word” character
  • \W any “non-word” character

See the full list of escape sequences on php.net.

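To search for a date, you can use '/\d{2}.\d{2}.\d{4}/'. It matches 01-12-2020, 10/10/2023, or 12.04.2025. Note that the unescaped dot accepts any character as the separator, so the pattern also matches strings such as 01a12b2020.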

  • \d is equivalent to [0-9].
  • \d{2} matches exactly two digits.
  • \d{4} matches exactly four digits.
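A stricter variant can list the allowed separators in a character class instead of using the dot. The sketch below is one possible refinement, not a full date validator: it does not check that the day and month are in a valid range.

```php
<?php
 // Allow only -, / or . between the parts; # is used as the delimiter
 // so that the forward slash needs no escaping
 $pattern = '#^(\d{2})[-/.](\d{2})[-/.](\d{4})$#';

 echo preg_match($pattern, '01-12-2020'); #Prints: 1
 echo preg_match($pattern, '10/10/2023'); #Prints: 1
 echo preg_match($pattern, '01a12b2020'); #Prints: 0

 // The groups give access to the individual parts
 preg_match($pattern, '12.04.2025', $m);
 echo $m[3]; #Prints: 2025
```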

Pattern Modifiers

You can change the behavior of a regular expression by placing a pattern modifier after the closing delimiter. For example, for a case-insensitive match, use the i modifier: /pattern/i.

<?php
 $string = 'Hello, World';
 $pattern = '/hello/i';
 echo preg_match($pattern, $string); //Prints: 1

Following is the list of pattern modifiers:

  • i modifier Perform case insensitive match.
  • m modifier By default, ^ and $ match only at the beginning and end of the string. With this modifier they also match at the beginning and end of each line. It has no effect if the subject string contains no \n (newline) characters or the pattern contains no ^ or $.
  • s modifier By default, the dot . in a pattern matches any character except newline characters. With this modifier the dot matches newlines too.
  • x modifier Ignores whitespace in the pattern, except when it is escaped with a backslash or appears inside a character class. To match whitespace, use "\ " (for a space), "\t" (for a tab), or "\s" (for any whitespace character). This allows you to split a long regular expression over several lines and indent it like PHP code.
  • A modifier Anchors the pattern to the beginning of the string. The pattern matches only at the start of the subject, even if newlines are embedded and the m modifier is used.
  • D modifier A dollar sign in the pattern matches only at the very end of the string. Without this modifier, a dollar sign also matches immediately before a final newline. This modifier is ignored if the m modifier is set.
  • S modifier Formerly spent extra time studying a frequently used pattern in order to speed up matching. As of PHP 7.3 this modifier has no effect.
  • U modifier This modifier makes the quantifiers lazy by default and using the ? after them instead marks them as greedy.
  • X modifier Any backslash in a pattern followed by a letter that has no special meaning causes an error.
  • J modifier Allow duplicate names for subpatterns.
  • u modifier Treats the pattern and subject as being UTF-8 encoded.
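A few of these modifiers in action, as a rough sketch. The subject strings are made up for illustration, and "\u{e9}" is PHP's escape for the UTF-8 encoded character é.

```php
<?php
 $text = "first line\nsecond line";

 // m: ^ matches at the start of every line, not only the start of the string
 echo preg_match('/^second/', $text);  #Prints: 0
 echo preg_match('/^second/m', $text); #Prints: 1

 // s: the dot also matches the newline between the two lines
 echo preg_match('/first.*second/', $text);  #Prints: 0
 echo preg_match('/first.*second/s', $text); #Prints: 1

 // x: whitespace and # comments inside the pattern are ignored
 $pattern = '/
   \d{4}   # year
   -
   \d{2}   # month
 /x';
 echo preg_match($pattern, '2024-05'); #Prints: 1

 // u: the subject is read as UTF-8, so the dot matches the two-byte character as one
 echo preg_match('/^.$/', "\u{e9}");  #Prints: 0
 echo preg_match('/^.$/u', "\u{e9}"); #Prints: 1
```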
