Regular Expressions Tutorial

Regular Expressions

Regular Expressions were introduced 1956 by Stephen Cole Kleene to model nervous activity, but proved to be very effective to describe all sorts of character strings.

A RE describes the Form of character strings. If a character string can be described by a RE, then the RE is said to match the given string. REs are greedy, which means they try to match as many characters as possible in a given input.

The three most common forms (flavours) of REs are:

basic REs (used in ed, sed, lex, ...)
extended REs (used in egrep, awk, regex(3), ...)
Perl compatible REs (used in perl, libpcre, ...)

Documentation

A full documentation for REs can be found in:

man pages regex(7), awk(1), lex(1)
Regular Expressions in the The Single UNIX Specification, Version 2
Jeffrey E. F. Friedl, Mastering Regular Expressions, O'Reilly
Compilers: Principle, Techniques and Tools, by Aho, Sethi and Ullman, Addison Wesley

Definition of a (extended) RE

A RE is one or more non-empty branches, separated by '|'. It matches anything that matches one of the branches.
A branch is the concatenation of one or more pieces.
A piece is an atom, possibly followed by a single '*', '+', '?', or a bound.
A bound is a '{', followed by an integer, possibly followed by ',' possibly followed another number, followed by '}'.
An atom is any other character, the special characters '^' or '$', or any RE enclosed in brackets '()'.

Atoms

Atoms are the basic components of a RE

x	the character 'x' itself.
\X	if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C interpretation of \x. Otherwise, a literal 'X' (used to escape operators such as '*').
\123	the character with octal value 123.
\xe5	the character with hexadecimal value e5.
.	any character (byte) except newline.
[xyz]	a character class: x OR y OR z.
[ako‑sP]	a character class with a range in it; matches an 'a', a 'k', any letter from 'k' through 's', or a 'P'.
[^A‑Z]	a negated character class: i.e., any character but those in the class. In our example, any character EXCEPT an uppercase letter.
[:class:]	a character class expression: allowed only within another character class. The valid contents of class are: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit.

Pieces

Pieces are used to concatenate one or more REs, or to specify how often a precedent piece must be repeated.

(r)	the RE r itself. (This is actually an atom.)
rs	the RE r followed by the RE s.
r\|s	the RE r OR the RE s.
r*	the RE r zero or more times.
r+	the RE r one or more time.
r?	the RE r zero or one time.
r{2,6}	the RE r anywhere from two to six times.
r{2,}	the RE r two or more times.
r{,6}	the RE r up to six times.
r{4}	the RE r exactly for times.

Examples

The RE (x|y|z) is equivalent to [xyz], both match the single character 'x' or 'y' or 'z'.

The RE (a|b) is equivalent to (b|a).

The RE Twe{2}dled(ee|um) matches both strings Tweedledum and Tweedledee.

Regular Examples

The real numbers such as 0.75 or 1.23e6 can be described by the RE:

[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?

The above RE can be rewritten using a character class:

[[:digit:]]+\.[[:digit:]]*([eE][+-]?[[:digit:]]+)?

There is a small problem with the previous RE: numbers like 3. are accepted, but not .3. We want to accept any number with at least one digit either before or after the decimal point. The following RE solves the problem by stating both options separately: the first part matches all numbers with a leading digit, the second part matches all numbers with a leading decimal point.

(([[:digit:]]+\.[[:digit:]]*)|(\.[[:digit:]]+))([eE][+-]?[[:digit:]]+)?

Basic REs

Basic REs are used by older but still widely used programs such as sed and lex. The difference with extended REs are:

'|', '+', and '?' are ordinary characters and there is no equivalent for their functionality
The delimiters for bounds are '\{' and '\}', with '{' and '}' by themselves ordinary characters
The parentheses for nested sub-expressions are '$' and '$', with '(' and ')' by themselves ordinary characters
'^' is an ordinary character except at the beginning of the RE or(!) the beginning of a parenthesised sub-expression
'$' is an ordinary character except at the end of the RE or(!) the end of a parenthesised sub-expression
'*' is an ordinary character if it appears at the beginning of the RE or the beginning of a parenthesised sub-expression (after a possible leading '^')
There is one new type of atom, a back reference: '\' followed by a nonzero decimal digit x matches the same sequence of characters matched by the parenthesised sub-expression number x (counting all sub-expressions by the positions of their opening parentheses, left to right), so that for example the RE '$[bc]$\1' matches 'bb' and 'cc' but not 'bc'

The simple RE matching real numbers, rewritten as a basic RE is:

[0-9][0-9]*\.[0-9]*\([eE][+-]\{0,1\}[0-9][0-9]*\)\{0,1\}