Matching Phone Numbers

A slightly more interesting example is matching U.S. and Canadian telephone numbers. In their most basic form, these are seven digits, usually written as a group of three and a group of four separated by a character such as a space, a dash (-), or a dot (.). A regular expression for this would be as follows:

[0-9]{3,3}[-. ]?[0-9]{4,4}

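This simple expression says: match exactly three digits ([0-9]{3,3}), followed optionally by a single dash, period, or space ([-. ]?), and then match exactly four more digits ([0-9]{4,4}). Because the minimum and maximum counts are the same, {3,3} and {4,4} are equivalent to the shorter {3} and {4}. Inside the square brackets, the leading dash and the dot are matched literally.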
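To see the seven-digit expression in action, here is a minimal sketch in Python, which is an assumption on our part since the expression itself is not tied to any one language. The helper name is_seven_digit_number is hypothetical, and fullmatch is used so that the whole string must match rather than just part of it.

```python
import re

# The seven-digit pattern from the text, unchanged.
SEVEN_DIGIT = re.compile(r"[0-9]{3,3}[-. ]?[0-9]{4,4}")

def is_seven_digit_number(text):
    """Return True only if the entire string matches the pattern."""
    return SEVEN_DIGIT.fullmatch(text) is not None

# Try a few candidate strings, valid and invalid.
for candidate in ["555-1234", "555.1234", "555 1234", "5551234",
                  "55-51234", "555--1234"]:
    print(candidate, is_seven_digit_number(candidate))
```

The first four candidates are accepted. The last two are rejected because the separator is in the wrong place or appears twice.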

Adding the area code, which is itself a three-digit number, is a bit more involved. It can be wrapped in parentheses, or left unwrapped and separated from the other digits by a space, dash, or dot. Our regular expression begins to get more complicated. The new portion of the expression to match the area code will look like this:

\(?[0-9]{3,3}\)?[-. ]?

Because the ( and ) characters have a special meaning in regular expressions (they create groups), we have to escape them with a backslash (\) to match them as literal characters. Our complete regular expression thus far would be this:

\(?[0-9]{3,3}\)?[-. ]?[0-9]{3,3}[-. ]?[0-9]{4,4}
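The complete expression above can be exercised against a handful of common formats. This is only a sketch, assuming Python's re module and a hypothetical helper named matches_full_number. As before, fullmatch requires the whole string to match.

```python
import re

# The complete ten-digit pattern from the text, unchanged.
FULL_NUMBER = re.compile(
    r"\(?[0-9]{3,3}\)?[-. ]?[0-9]{3,3}[-. ]?[0-9]{4,4}"
)

def matches_full_number(text):
    """Return True only if the entire string matches the pattern."""
    return FULL_NUMBER.fullmatch(text) is not None

# Formats with and without parentheses and separators.
for candidate in ["(555)123-4567", "555-123-4567", "555.123.4567",
                  "555 123 4567", "5551234567", "123-4567"]:
    print(candidate, matches_full_number(candidate))
```

Every ten-digit candidate is accepted. The seven-digit string 123-4567 is rejected because the area code portion is no longer optional.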

If you look closely at the preceding expression, however, you should see that in addition to correctly matching strings such as (###)###-####, it also matches strings such as (###)-###-####, which might not be what we want. To improve this, we could use some grouping:

(\(?[0-9]{3,3}\)?|[0-9]{3,3}[-. ]?)[0-9]{3,3}[-. ]?[0-9]{4,4}

The new area code portion of the expression

(\(?[0-9]{3,3}\)?|[0-9]{3,3}[-. ]?)

contains two alternatives inside a group (denoted by the unescaped ( and )): an area code with optional parentheses and no separator, or a bare area code followed by an optional separator. The | character indicates that exactly one of the two alternatives is matched.
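The effect of the grouping is easiest to see by running the ungrouped and grouped expressions side by side. The following is a sketch, again assuming Python. The names LOOSE, GROUPED, and compare are hypothetical and chosen only for this illustration.

```python
import re

# The expression before grouping was introduced.
LOOSE = re.compile(
    r"\(?[0-9]{3,3}\)?[-. ]?[0-9]{3,3}[-. ]?[0-9]{4,4}"
)
# The expression with the area code alternatives grouped.
GROUPED = re.compile(
    r"(\(?[0-9]{3,3}\)?|[0-9]{3,3}[-. ]?)[0-9]{3,3}[-. ]?[0-9]{4,4}"
)

def compare(text):
    """Return (loose_matches, grouped_matches) for the entire string."""
    return (LOOSE.fullmatch(text) is not None,
            GROUPED.fullmatch(text) is not None)

for candidate in ["(555)123-4567", "(555)-123-4567",
                  "555-123-4567", "(555) 123-4567"]:
    print(candidate, compare(candidate))
```

Only the ungrouped expression accepts (555)-123-4567. Note that the grouped expression also rejects (555) 123-4567, because a separator is no longer allowed after a closing parenthesis. Whether that is acceptable depends on the formats we want to allow.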

Our regular expression now refuses to accept strings such as (###)-###-####. On reflection, however, we may decide that we do not care what format the user enters the phone number in, as long as it contains 10 digits. This relieves the user completely of having to worry about the format, but it means we have to do a bit more work to extract the digits later on. A regular expression for this might be as follows:


As mentioned in previous sections, this might not be the most efficient regular expression, because the ".*" sequence is greedy and will almost certainly force the engine to backtrack; for infrequent form validation, however, it should not stress our servers significantly.
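The extra work of pulling out the digits can be sketched as follows. This is an illustration in Python under our own assumptions, not the expression discussed above: the pattern TEN_DIGITS is a hypothetical alternative that uses [^0-9]* in place of ".*" so that there is only one way to match each character, and the helper names are likewise hypothetical.

```python
import re

# Hypothetical pattern (not from the text): exactly ten digits, with any
# number of non-digit characters before, between, and after them.
TEN_DIGITS = re.compile(r"[^0-9]*([0-9][^0-9]*){10,10}")

def has_ten_digits(text):
    """Return True if the entire string contains exactly ten digits."""
    return TEN_DIGITS.fullmatch(text) is not None

def extract_digits(text):
    """Remove every character that is not a digit."""
    return re.sub(r"[^0-9]", "", text)

for candidate in ["(555) 123-4567", "555.123.4567", "555-1234",
                  "1-555-123-4567"]:
    print(candidate, has_ten_digits(candidate), extract_digits(candidate))
```

With this approach, validation and normalization are separate steps: we first check that the input contains ten digits, and then strip everything else before storing or comparing the number.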