Regular Expressions Tutorial
Regular Expressions
Regular Expressions were introduced 1956 by Stephen Cole Kleene to model nervous activity, but proved to be very effective to describe all sorts of character strings.
A RE describes the Form of character strings. If a character string can be described by a RE, then the RE is said to match the given string. REs are greedy, which means they try to match as many characters as possible in a given input.
The three most common forms (flavours) of REs are:
- basic REs (used in ed, sed, lex, ...)
- extended REs (used in egrep, awk, regex(3), ...)
- Perl compatible REs (used in perl, libpcre, ...)
Documentation
A full documentation for REs can be found in:
- man pages regex(7), awk(1), lex(1)
- Regular Expressions in the The Single UNIX Specification, Version 2
- Jeffrey E. F. Friedl, Mastering Regular Expressions, O'Reilly
- Compilers: Principle, Techniques and Tools, by Aho, Sethi and Ullman, Addison Wesley
Definition of a (extended) RE
- A RE is one or more non-empty branches, separated by '|'. It matches anything that matches one of the branches.
- A branch is the concatenation of one or more pieces.
- A piece is an atom, possibly followed by a single '*', '+', '?', or a bound.
- A bound is a '{', followed by an integer, possibly followed by ',' possibly followed another number, followed by '}'.
- An atom is any other character, the special characters '^' or '$', or any RE enclosed in brackets '()'.
Atoms
Atoms are the basic components of a RE
x | the character 'x' itself. |
\X | if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C interpretation of \x. Otherwise, a literal 'X' (used to escape operators such as '*'). |
\123 | the character with octal value 123. |
\xe5 | the character with hexadecimal value e5. |
. | any character (byte) except newline. |
[xyz] | a character class: x OR y OR z. |
[ako‑sP] | a character classwith a range in it; matches an 'a', a 'k', any letter from 'k' through 's', or a 'P'. |
[^A‑Z] | a negated character class: i.e., any character but those in the class. In our example, any character EXCEPT an uppercase letter. |
[:class:] | a character class expression: allowed only within another character class. The valid contents of class are: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit. |
Pieces
Pieces are used to concatenate one or more REs, or to specify how often a precedent piece must be repeated.
(r) | the REr itself. (This is actually an atom.) |
rs | the REr followed by the REs. |
r|s | the REr OR the REs. |
r* | the REr zero or more times. |
r+ | the REr one or more time. |
r? | the REr zero or one time. |
r{2,6} | the REr anywhere from two to six times. |
r{2,} | the REr two or more times. |
r{,6} | the REr up to six times. |
r{4} | the REr exactly for times. |
Examples
The RE (x|y|z) is equivalent to [xyz], both match the single character 'x' or 'y' or 'z'.
The RE (a|b) is equivalent to (b|a).
The RE Twe{2}dled(ee|um) matches both strings Tweedledum
and
Tweedledee
.
Regular Examples
The real numbers such as 0.75
or 1.23e6
can be described by the RE:
[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?
The above RE can be rewritten using a character class:
[[:digit:]]+\.[[:digit:]]*([eE][+-]?[[:digit:]]+)?
There is a small problem with the previous RE:
numbers like 3.
are accepted, but not .3
.
We want to accept any number with at least one digit either before or after the decimal point.
The following RE solves the problem by stating both options separately:
the first part matches all numbers with a leading digit, the second part matches all numbers with a leading decimal point.
(([[:digit:]]+\.[[:digit:]]*)|(\.[[:digit:]]+))([eE][+-]?[[:digit:]]+)?
Basic REs
Basic REs are used by older but still widely used programs such as sed and lex. The difference with extended REs are:
- '|', '+', and '?' are ordinary characters and there is no equivalent for their functionality
- The delimiters for bounds are '\{' and '\}', with '{' and '}' by themselves ordinary characters
- The parentheses for nested sub-expressions are '\(' and '\)', with '(' and ')' by themselves ordinary characters
- '^' is an ordinary character except at the beginning of the RE or(!) the beginning of a parenthesised sub-expression
- '$' is an ordinary character except at the end of the RE or(!) the end of a parenthesised sub-expression
- '*' is an ordinary character if it appears at the beginning of the RE or the beginning of a parenthesised sub-expression (after a possible leading '^')
- There is one new type of atom, a back reference: '\' followed by a nonzero decimal digit x matches the same sequence of characters matched by the parenthesised sub-expression number x (counting all sub-expressions by the positions of their opening parentheses, left to right), so that for example the RE '\([bc]\)\1' matches 'bb' and 'cc' but not 'bc'
The simple RE matching real numbers, rewritten as a basic RE is:
[0-9][0-9]*\.[0-9]*\([eE][+-]\{0,1\}[0-9][0-9]*\)\{0,1\}