# Regular Expressions Tutorial

## Regular Expressions

Regular Expressions were introduced 1956 by Stephen Cole Kleene to model nervous activity, but proved to be very effective to describe all sorts of character strings.

A RE describes the *Form* of character strings. If a character string can be described by a RE, then the RE is said to *match* the given string. REs are *greedy*, which means they try to *match* as many characters as possible in a given input.

The three most common forms (flavours) of REs are:

- basic REs (used in ed, sed, lex, ...)
- extended REs (used in egrep, awk, regex(3), ...)
- Perl compatible REs (used in perl, libpcre, ...)

### Documentation

A full documentation for REs can be found in:

- man pages regex(7), awk(1), lex(1)
- Regular Expressions in the The Single UNIX Specification, Version 2
- Jeffrey E. F. Friedl, Mastering Regular Expressions, O'Reilly
- Compilers: Principle, Techniques and Tools, by Aho, Sethi and Ullman, Addison Wesley

### Definition of a (extended) RE

- A RE is one or more non-empty
*branches*, separated by '|'. It matches anything that matches one of the branches. - A
*branch*is the concatenation of one or more*pieces*. - A
*piece*is an*atom*, possibly followed by a single '*', '+', '?', or a*bound*. - A
*bound*is a '{', followed by an integer, possibly followed by ',' possibly followed another number, followed by '}'. - An
*atom*is any other character, the special characters '^' or '$', or any RE enclosed in brackets '()'.

## Atoms

Atoms are the basic components of a RE

x | the character 'x' itself. |

\X | if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C interpretation of \x. Otherwise, a literal 'X' (used to escape operators such as '*'). |

\123 | the character with octal value 123. |

\xe5 | the character with hexadecimal value e5. |

. | any character (byte) except newline. |

[xyz] | a character class: x OR y OR z. |

[ako‑sP] | a character classwith a range in it; matches an 'a', a 'k', any letter from 'k' through 's', or a 'P'. |

[^A‑Z] | a negated character class: i.e., any character but those in the class. In our example, any character EXCEPT an uppercase letter. |

[:class:] | a character class expression: allowed only within another character class. The valid contents of class are:
alnum, alpha, blank, cntrl, digit,
graph, lower, print, punct, space,
upper, xdigit. |

## Pieces

Pieces are used to concatenate one or more REs, or to specify how often a precedent piece must be repeated.

(r) | the RE r itself. (This is actually an atom.) |

rs | the RE r followed by the RE s. |

r|s | the RE r OR the RE s. |

r* | the RE r zero or more times. |

r+ | the RE r one or more time. |

r? | the RE r zero or one time. |

r{2,6} | the RE r anywhere from two to six times. |

r{2,} | the RE r two or more times. |

r{,6} | the RE r up to six times. |

r{4} | the RE r exactly for times. |

## Examples

The RE (x|y|z) is equivalent to [xyz], both match the single character 'x' or 'y' or 'z'.

The RE (a|b) is equivalent to (b|a).

The RE Twe{2}dled(ee|um) matches both strings Tweedledum

and Tweedledee

.

## Regular Examples

The real numbers such as 0.75

or 1.23e6

can be described by the RE:

[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?

The above RE can be rewritten using a character class:

[[:digit:]]+\.[[:digit:]]*([eE][+-]?[[:digit:]]+)?

There is a small problem with the previous RE: numbers like 3.

are accepted, but not .3

. We want to accept any number with at least one digit either before or after the decimal point. The following RE solves the problem by stating both options separately: the first part matches all numbers with a leading digit, the second part matches all numbers with a leading decimal point.

(([[:digit:]]+\.[[:digit:]]*)|(\.[[:digit:]]+))([eE][+-]?[[:digit:]]+)?

## Basic REs

Basic REs are used by older but still widely used programs such as sed and lex. The difference with extended REs are:

- '|', '+', and '?' are ordinary characters and there is no equivalent for their functionality
- The delimiters for bounds are '\{' and '\}', with '{' and '}' by themselves ordinary characters
- The parentheses for nested sub-expressions are '\(' and '\)', with '(' and ')' by themselves ordinary characters
- '^' is an ordinary character except at the beginning of the RE or(!) the beginning of a parenthesised sub-expression
- '$' is an ordinary character except at the end of the RE or(!) the end of a parenthesised sub-expression
- '*' is an ordinary character if it appears at the beginning of the RE or the beginning of a parenthesised sub-expression (after a possible leading '^')
- There is one new type of atom, a back reference: '\' followed by a nonzero decimal digit
*x*matches the same sequence of characters matched by the parenthesised sub-expression number*x*(counting all sub-expressions by the positions of their opening parentheses, left to right), so that for example the RE '\([bc]\)\1' matches 'bb' and 'cc' but not 'bc'

The simple RE matching real numbers, rewritten as a basic RE is:

[0-9][0-9]*\.[0-9]*\([eE][+-]\{0,1\}[0-9][0-9]*\)\{0,1\}