Regular expressions

This page is marked as In Progress so expect small errors or unfinished bits

A regular expression allows you to hunt for a pattern in a string.

When you first see a regular expression it can be scary, but if you start at the beginning it should make sense.

JavaScript, PHP and regular expressions

JavaScript and PHP have their own way of implementing regular expressions but the expressions themselves work the same way. This page attempts to cover both but that does mean you will need to apply the regular expressions in one or other (or both) of those languages.

In JavaScript (looking for the character Z):

var someVariable = 'why look for the Z';
var position = someVariable.search(/Z/);
if (position != -1) {
    // Z was found; position holds its index in the string
}
        

JavaScript has a RegExp object to use regular expressions but it is easier to use the methods built into the String object (match, search and split). Most of these examples will work with all three of those methods. Search returns -1 if there is no match.
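As a rough sketch of how the three String methods differ, the snippet below runs each of them on the sample string from the earlier example. The variable names are only for illustration.

```javascript
// Sample text reused from the example above.
const someVariable = 'why look for the Z';

// search() gives the index of the first match, or -1 if there is none.
const position = someVariable.search(/Z/);   // 17
const missing = someVariable.search(/Q/);    // -1

// match() gives an array describing the match, or null if there is none.
const found = someVariable.match(/Z/);       // found[0] is 'Z'

// split() cuts the string up wherever the pattern matches (here, at each space).
const words = someVariable.split(/ /);       // ['why', 'look', 'for', 'the', 'Z']

console.log(position, missing, found[0], words);
```

So search is the one to use when you only care whether (and where) the pattern appears, while match and split hand back pieces of the string.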

In PHP you would do this to find the Z:

$something = "why look for the Z";
if (preg_match("/Z/", $something)) {
    // Z was found somewhere in $something
}
        

In addition to preg_match() PHP provides other functions, including preg_match_all(), preg_replace() and preg_split().

Basics

The details of using regular expressions within PHP and JavaScript are not covered here. What matters is how to construct a regular expression which will work in either language.

The above search looks for Z inside the variable. The two slashes are special characters used to mark the beginning and end of the regular expression. To use any of these examples in PHP or JavaScript, paste them into either of the examples above to replace /Z/.

To check if the Z appears at the beginning of the string you use another special character (caret):

/^Z/
        

You can do the same but check if it is at the end using a dollar character:

/Z$/
        

Putting the two together would check if Z was at the beginning and at the end (meaning it was the only character):

/^Z$/
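
To try the three patterns above in JavaScript, one option is the test() method of a regular expression, which simply answers true or false. This is only a sketch; the variable names and sample strings are made up for illustration.

```javascript
// test() returns true when the pattern matches the string, false otherwise.
const startsWithZ = /^Z/;
const endsWithZ = /Z$/;
const onlyZ = /^Z$/;

console.log(startsWithZ.test('Zebra'));               // true
console.log(startsWithZ.test('why look for the Z'));  // false
console.log(endsWithZ.test('why look for the Z'));    // true
console.log(onlyZ.test('Z'));                         // true
console.log(onlyZ.test('ZZ'));                        // false
```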
        

Wildcards

The period (.) is used to represent any single character (a letter, digit, space or symbol, but not normally a line break):

/l.ok/
        

That will match "look" but would also match "lhok", "l9ok" and so on. To represent any number of characters (including none) follow the period with a *:

/l.*ok/
        

That will match the same as the previous example but also "lok", "looook" or "lzzzzzzzzzzzzzok".

To return to the hunt for Z and adjust the last example where Z had to be the only character:

/^Z.*Z$/
        

That would match any string which began and ended with Z. There could be any number of characters between.
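
The wildcard patterns from this section can be checked quickly in JavaScript. This is a small sketch using the test() method; the variable names and the sample strings are invented for the purpose.

```javascript
const oneWild = /l.ok/;    // exactly one character between l and ok
const manyWild = /l.*ok/;  // any number of characters between l and ok
const zToZ = /^Z.*Z$/;     // begins with Z and ends with Z

console.log(oneWild.test('look'));     // true
console.log(oneWild.test('lhok'));     // true
console.log(oneWild.test('looook'));   // false (too many characters for a single period)
console.log(manyWild.test('looook'));  // true
console.log(manyWild.test('lok'));     // true (the * allows none at all)
console.log(zToZ.test('ZanzibarZ'));   // true
console.log(zToZ.test('Z'));           // false (the pattern needs two separate Zs)
```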

Escaping reserved characters

You will already have a problem if you want to match / ^ . * $ in a string. To use those as ordinary text in a regular expression (not as special, reserved characters) you have to escape them with a backslash, as you do with quotes in PHP echo statements:

/\$/
        

That would check if a dollar sign appears anywhere in the string. This would match only if that $ was at the end of the string:

/\$$/
        

You also need to escape quotes and some other characters (including of course the \ itself if you want to treat it as a character to search for).
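
A short JavaScript sketch of escaping in practice follows. The sample strings are made up, and the period example is an extra illustration of why escaping matters, not something taken from the patterns above.

```javascript
const hasDollar = /\$/;        // a dollar sign anywhere
const endsWithDollar = /\$$/;  // a dollar sign as the last character
const hasBackslash = /\\/;     // a backslash (escaped with another backslash)
const hasSlash = /\//;         // a forward slash (escaped so it does not end the pattern)

console.log(hasDollar.test('costs $5'));         // true
console.log(endsWithDollar.test('costs $5'));    // false
console.log(endsWithDollar.test('price in $'));  // true
console.log(hasBackslash.test('C:\\temp'));      // true (the string holds one backslash)
console.log(hasSlash.test('a/b'));               // true

// An unescaped period is a wildcard; an escaped one is a real full stop.
console.log(/3.50/.test('3x50'));   // true
console.log(/3\.50/.test('3x50'));  // false
console.log(/3\.50/.test('3.50'));  // true
```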

Quantifiers

Sometimes you might want to look for a character repeated a number of times. You might do this by putting in the pattern ZZZ but you can also use another method:

/Z{3}/
        

That will match ZZZ, and also ZZZZ (because it contains ZZZ), but not ZZ. This may not look useful with smaller numbers but Z{1000} would be easier than typing Z a thousand times! More likely to be useful is this:

/Z{1,3}/
        

That will match Z or ZZ or ZZZ. It is searching for between 1 and 3 in a row. Missing out the second value (e.g. {1,}) means at least 1 time with no upper limit. To match any number up to a limit (including none), write the zero yourself, e.g. {0,3}. Missing out the first value, as in {,3}, is not treated as a quantifier in JavaScript and is not reliable across PHP versions.

There are also some special symbols used as quantifiers. * you met in the wildcard section. It means the previous character can appear 0, 1 or many times. + means 1 or many times. ? means 0 or 1 times. These duplicate {0,}, {1,} and {0,1} but some prefer the simplicity.
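
The quantifiers can be compared side by side in JavaScript. This is a minimal sketch with invented sample strings; the ^ and $ anchors are added in some lines so that the whole string has to fit the pattern.

```javascript
// Exact and ranged counts.
console.log(/Z{3}/.test('ZZ'));        // false
console.log(/Z{3}/.test('ZZZZ'));      // true (ZZZZ contains ZZZ)
console.log(/^Z{1,3}$/.test('ZZZZ'));  // false (four is too many when anchored)

// A quantifier takes as many characters as it is allowed to.
const upToThree = 'aZZZZZb'.match(/Z{1,3}/)[0];  // 'ZZZ'
const twoOrMore = 'aZZZZZb'.match(/Z{2,}/)[0];   // 'ZZZZZ'
console.log(upToThree, twoOrMore);

// The shorthand symbols.
console.log(/^Z*$/.test(''));    // true  (* allows none)
console.log(/^Z+$/.test(''));    // false (+ needs at least one)
console.log(/^Z?$/.test('ZZ'));  // false (? allows one at most)
```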

Finding a range of characters

Rather than just looking for Z you could look for a choice of characters:

/[tZ]/
        

This will find the position of either t or Z, whichever comes first. Combine that with a quantifier and you can look for two characters from the set in a row:

/[tZ]{2}/
        

That will find tt or ZZ, but also tZ or Zt, because the quantifier repeats the whole set rather than the particular character that was matched. In the same way you can search for any two lower-case letters in a row:

/[a-z]{2}/
        

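That will find the "lo" at the start of look, since any two lower-case letters in a row will do. To find a letter which is genuinely doubled you need a backreference, e.g. /([a-z])\1/, which matches the double o in look. You can also search for any digit using [0-9] or capitals using [A-Z].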
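
A quick way to see exactly what a bracket pattern picks out is to run it and look at the matched text. The sketch below does that in JavaScript; the sample strings are invented, and the last line uses a backreference (\1, meaning "the same text the brackets just matched"), which is an extra technique not otherwise covered on this page.

```javascript
// Position of the first t or Z.
const firstTorZ = 'why look for the Z'.search(/[tZ]/);  // 13 (the t in "the")

// {2} repeats the whole set, so mixed pairs match too.
const mixedPair = 'tZ'.match(/[tZ]{2}/)[0];             // 'tZ'
const doubleT = 'butter'.match(/[tZ]{2}/)[0];           // 'tt'

// Any two lower-case letters in a row, not necessarily the same letter.
const twoLetters = 'look'.match(/[a-z]{2}/)[0];         // 'lo'

// Two digits in a row.
const twoDigits = 'room 42'.match(/[0-9]{2}/)[0];       // '42'

// A backreference insists on the same letter twice.
const doubled = 'look'.match(/([a-z])\1/)[0];           // 'oo'

console.log(firstTorZ, mixedPair, doubleT, twoLetters, twoDigits, doubled);
```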

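The upright bar is used as an OR. You can search for a number of things which could match. Group the alternatives in round brackets, not square brackets (inside square brackets the bar is just another character to look for):

/(ste|ven|372|the)/
        

In this case "the" or "ven" or the others would be matched.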
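
The sketch below tries the OR in JavaScript with the alternatives grouped in round brackets, and contrasts it with the same text placed inside square brackets, where it behaves as a set of single characters. The variable names and sample strings are made up.

```javascript
// Alternatives grouped in round brackets: whole words are matched.
const choice = /(ste|ven|372|the)/;

console.log('why look for the Z'.match(choice)[0]);  // 'the'
console.log(choice.test('seven'));                   // true ("ven")
console.log(choice.test('372 items'));               // true
console.log(choice.test('no match'));                // false

// The same text in square brackets is a set of single characters,
// and the bar is simply one of those characters.
const charSet = /[ste|ven|372|the]/;
console.log(charSet.test('x|y'));                    // true (the bar itself matches)
console.log(charSet.test('s'));                      // true (a lone s is enough)
```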

Modifiers

As well as the pattern to be matched you can include modifiers in the regular expression. These come after the final slash:

/l.ok/i
        

The i says to ignore case when searching. So this pattern would now match LOOK or Look as well as all the previous matches.

A g modifier will look for all occurrences of the pattern rather than just matching the first. It is a JavaScript modifier; PHP has no g, and you use preg_match_all() instead of preg_match() to find every match.

The m modifier switches on multi-line mode: ^ and $ then match at the start and end of each line in the string, not just at the start and end of the whole string.
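
Here is a small JavaScript sketch of the three modifiers at work. The sample strings and variable names are invented for illustration.

```javascript
// i: ignore case.
console.log(/l.ok/.test('LOOK'));   // false
console.log(/l.ok/i.test('LOOK'));  // true

// g: with match(), return every occurrence rather than just the first.
const firstOnly = 'why look for the Z'.match(/o/);   // one match
const allOfThem = 'why look for the Z'.match(/o/g);  // ['o', 'o', 'o']
console.log(firstOnly.length, allOfThem.length);     // 1 3

// m: let ^ and $ match at the start and end of each line.
const twoLines = 'first line\nZ starts the second line';
console.log(/^Z/.test(twoLines));   // false (Z is not at the start of the string)
console.log(/^Z/m.test(twoLines));  // true  (Z is at the start of a line)
```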

Meta characters

These let you search for special things. There are many but three of the most useful might be:

A complex example

Often HTML forms are used to collect user email addresses (for contact forms, on-line ordering etc.). It is worth checking that the user has put in a valid email address, not so much to catch deliberately wrong emails as to catch mistakes. This is one possible regular expression for checking for a valid email address:

/^([a-z]|[A-Z]){1,}([a-z]|[A-Z]|[0-9]|-|_|\.){0,}@([a-z]|[A-Z]){1,}\.([a-z]|[A-Z]){2,}/
        

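If you saw that without knowing the above stuff it would probably make you scream! As it is you may be considering a career change. It says: the string must begin with one or more letters, followed by any number (including none) of letters, digits, hyphens, underscores or periods, then an @, then one or more letters, then a period, and finally at least two letters.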

This will accept some non-valid addresses still but also allows for .me or .co.uk domains and similar. Some regular expressions for email checking will block anything except the original domain-dot-three letter addresses (e.g. whatever.com).

There is a briefer way to write this same code because the OR symbols (|) are not actually needed. The square brackets which include the acceptable characters can contain more than one range. So rather than [a-z]|[A-Z] you can just put [a-zA-Z]. The regular expression is now:

/^[a-zA-Z]{1,}[a-zA-Z0-9_.-]{0,}@[a-zA-Z]{1,}\.[a-zA-Z]{2,}/
        

If that is more confusing to you then stick with the OR structure.
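
To see how the shorter pattern behaves, here is a sketch in JavaScript. The function name looksLikeEmail and the sample addresses are hypothetical, and the hyphen is placed last inside the square brackets so that it cannot be read as part of a range. Treat this as a rough check only, not a full validator.

```javascript
// The bracket-only pattern from above (hyphen placed last in the set).
const emailPattern = /^[a-zA-Z]{1,}[a-zA-Z0-9_.-]{0,}@[a-zA-Z]{1,}\.[a-zA-Z]{2,}/;

// Hypothetical helper: true when the address fits the pattern.
function looksLikeEmail(address) {
    return emailPattern.test(address);
}

console.log(looksLikeEmail('steven@example.com'));         // true
console.log(looksLikeEmail('s.t-e_v3@example.co.uk'));     // true
console.log(looksLikeEmail('steven.example.com'));         // false (no @)
console.log(looksLikeEmail('3steven@example.com'));        // false (must start with a letter)
console.log(looksLikeEmail('steven@example'));             // false (no dot after the domain)
console.log(looksLikeEmail('steven@example.c'));           // false (needs two letters after the dot)

// There is no $ at the end, so anything may follow a good beginning.
console.log(looksLikeEmail('steven@example.com!!!'));      // true
```

The last line shows one of the non-valid addresses which still get through: the pattern only checks how the string begins.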
