Regular Expressions Tutorial
Regular Expressions
Regular Expressions were introduced 1956 by Stephen Cole Kleene to model nervous activity, but proved to be very effective to describe all sorts of character strings.
A RE describes the Form of character strings. If a character string can be described by a RE, then the RE is said to match the given string. REs are greedy, which means they try to match as many characters as possible in a given input.
The three most common forms (flavours) of REs are:
- basic REs (used in ed, sed, lex, ...)
- extended REs (used in egrep, awk, regex(3), ...)
- Perl compatible REs (used in perl, libpcre, ...)
Documentation
A full documentation for REs can be found in:
- man pages regex(7), awk(1), lex(1)
- Regular Expressions in the The Single UNIX Specification, Version 2
- Jeffrey E. F. Friedl, Mastering Regular Expressions, O'Reilly
- Compilers: Principle, Techniques and Tools, by Aho, Sethi and Ullman, Addison Wesley
Definition of a (extended) RE
- A RE is one or more non-empty branches, separated by '|'. It matches anything that matches one of the branches.
- A branch is the concatenation of one or more pieces.
- A piece is an atom, possibly followed by a single '*', '+', '?', or a bound.
- A bound is a '{', followed by an integer, possibly followed by ',' possibly followed another number, followed by '}'.
- An atom is any other character, the special characters '^' or '$', or any RE enclosed in brackets '()'.
Atoms
Atoms are the basic components of a RE
x | the character 'x' itself. |
\X | if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C interpretation of \x.
Otherwise, a literal 'X' (used to escape operators such as '*'). |
\123 | the character with octal value 123. |
\xe5 | the character with hexadecimal value e5. |
. | any character (byte) except newline. |
[xyz] | a character class : x OR y OR z. |
[ako‑sP] | a character class with a range in it; matches an 'a', a 'k', any letter from 'k' through 's', or a 'P'. |
[^A‑Z] | a negated character class : i.e., any character but those in the class. In our example, any character EXCEPT an uppercase letter. |
[:class:] | a character class expression : allowed only within another character class. The valid contents of class are:
alnum, alpha, blank, cntrl, digit,
graph, lower, print, punct, space,
upper, xdigit. |
Pieces
Pieces are used to concatenate one or more REs, or to specify how often a precedent piece must be repeated.
(r) | the RE r itself. (This is actually an atom.) |
rs | the RE r followed by the RE s. |
r|s | the RE r OR the RE s. |
r* | the RE r zero or more times. |
r+ | the RE r one or more time. |
r? | the RE r zero or one time. |
r{2,6} | the RE r anywhere from two to six times. |
r{2,} | the RE r two or more times. |
r{,6} | the RE r up to six times. |
r{4} | the RE r exactly for times. |
Examples
The RE (x|y|z) is equivalent to [xyz], both match the single character 'x' or 'y' or 'z'.
The RE (a|b) is equivalent to (b|a).
The RE Twe{2}dled(ee|um) matches both strings Tweedledum
and Tweedledee
.
Regular Examples
The real numbers such as 0.75
or 1.23e6
can be described by the RE:
[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?
The above RE can be rewritten using a character class:
[[:digit:]]+\.[[:digit:]]*([eE][+-]?[[:digit:]]+)?
There is a small problem with the previous RE: numbers like 3.
are accepted, but not .3
. We want to accept any number with at least one digit either before or after the decimal point. The following RE solves the problem by stating both options separately: the first part matches all numbers with a leading digit, the second part matches all numbers with a leading decimal point.
(([[:digit:]]+\.[[:digit:]]*)|(\.[[:digit:]]+))([eE][+-]?[[:digit:]]+)?
Basic REs
Basic REs are used by older but still widely used programs such as sed and lex. The difference with extended REs are:
- '|', '+', and '?' are ordinary characters and there is no equivalent for their functionality
- The delimiters for bounds are '\{' and '\}', with '{' and '}' by themselves ordinary characters
- The parentheses for nested sub-expressions are '\(' and '\)', with '(' and ')' by themselves ordinary characters
- '^' is an ordinary character except at the beginning of the RE or(!) the beginning of a parenthesised sub-expression
- '$' is an ordinary character except at the end of the RE or(!) the end of a parenthesised sub-expression
- '*' is an ordinary character if it appears at the beginning of the RE or the beginning of a parenthesised sub-expression (after a possible leading '^')
- There is one new type of atom, a back reference: '\' followed by a nonzero decimal digit x matches the same sequence of characters matched by the parenthesised sub-expression number x (counting all sub-expressions by the positions of their opening parentheses, left to right), so that for example the RE '\([bc]\)\1' matches 'bb' and 'cc' but not 'bc'
The simple RE matching real numbers, rewritten as a basic RE is:
[0-9][0-9]*\.[0-9]*\([eE][+-]\{0,1\}[0-9][0-9]*\)\{0,1\}