sed & awk Workshop

This is a somewhat crude transcript of a sed & awk workshop held for the Linux User Group Bolzano-Bozen-Bulsan on 25 January 2003.

Regular Expressions

introduced in 1956 by S. C. Kleene to describe the states of an FSA (finite state automaton, used as a model of nervous activity)
REs describe the form of character strings
A string is matched by a RE if the string is an element of the class described by the RE
REs are greedy: a repetition matches as many characters as it can
Forms of REs:
- basic REs (ed, sed, lex, ...)
- extended REs (egrep, awk, regex(3), ...)
- perl compatible REs (perl, libpcre, ...)
Definition of an (extended) RE
- A RE is one or more non-empty branches, separated by '|'. It matches anything that matches one of the branches
- A branch is the concatenation of one or more pieces
- A piece is an atom, possibly followed by a single(!) '*', '+', '?', or a bound
Documentation
- man pages regex(7), awk(1), lex(1)
- Regular Expressions in The Single UNIX ® Specification, Version 2
- Compilers: Principles, Techniques, and Tools, by Aho, Sethi, and Ullman, Addison-Wesley

Atoms

Atoms are the basic components of a RE

x: the character 'x' itself
\X: if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI C interpretation of \X. Otherwise, a literal 'X' (used to escape operators such as '*')
\123: the character with octal value 123
\xe5: the character with hexadecimal value e5
.: any character (byte) except newline
[xyz]: a character class: x OR y OR z
[ako-sP]: a character class with a range in it; matches an 'a', a 'k', any letter from 'o' through 's', or a 'P'
[^A-Z]: a negated character class: i.e., any character but those in the class. In our example, any character except an uppercase letter
[:str:]: a character class expression: Allowed only within another character class. The valid contents of str are: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit

Pieces

Pieces are used to concatenate one or more REs, or to specify how often the preceding atom must be repeated

(r): the RE r itself
rs: the RE r followed by the RE s
r|s: the RE r OR the RE s
r*: the RE r zero or more times
r+: the RE r one or more times
r?: the RE r zero or one time
r{2,6}: the RE r anywhere from two to six times
r{2,}: the RE r two or more times
r{,6}: the RE r up to six times (not portable; write r{0,6} where it is rejected)
r{4}: the RE r exactly four times
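
A quick way to try these operators is grep -E with -x (the whole line must match). The following is only a sketch; the helper name matches is made up for this example and is not part of any tool.

```shell
# matches RE STRING: succeed if the whole STRING is matched by the extended RE
matches() {
    printf '%s\n' "$2" | grep -E -x -q -e "$1"
}

matches 'ab{2,3}c' 'abbc'   && echo 'abbc: yes'     # two b's are within the bound
matches 'ab{2,3}c' 'abc'    || echo 'abc: no'       # one b is too few
matches 'ab?c'     'ac'     && echo 'ac: yes'       # ? allows zero occurrences
matches '(ab)+'    'ababab' && echo 'ababab: yes'   # + repeats the whole group
```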

Regular Examples

The RE (x|y|z) is equivalent to the RE [xyz].

And the RE (a|b) is equivalent to (b|a).

The RE (B|F)al{2} matches both the strings Ball and Fall.

Regular Examples to match real numbers

Real numbers (simple)

[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?

Real numbers (character class)

[[:digit:]]+\.[[:digit:]]*([eE][+-]?[[:digit:]]+)?

Problem: numbers like 3. are accepted, but not .3.

Real numbers (catch all)

(([[:digit:]]+\.[[:digit:]]*)|(\.[[:digit:]]+))([eE][+-]?[[:digit:]]+)?
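
The catch-all RE can be checked against a few candidates. A minimal sketch, assuming a grep that supports -E, -x and -q; the function name is_real is only illustrative. With this RE, 3.14, 3., .3 and 1.5e-10 are accepted, while 42, 1e5 and a lone . are rejected because the RE always requires a decimal point with at least one digit next to it.

```shell
# is_real STRING: succeed if STRING as a whole matches the catch-all RE
is_real() {
    printf '%s\n' "$1" |
        grep -E -x -q '(([[:digit:]]+\.[[:digit:]]*)|(\.[[:digit:]]+))([eE][+-]?[[:digit:]]+)?'
}

for s in 3.14 3. .3 1.5e-10 42 1e5 .; do
    if is_real "$s"; then echo "$s: real"; else echo "$s: rejected"; fi
done
```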

Basic REs

'|', '+', and '?' are ordinary characters and there is no equivalent for their functionality
The delimiters for bounds are '\{' and '\}', with '{' and '}' by themselves ordinary characters
The parentheses for nested sub-expressions are '\(' and '\)', with '(' and ')' by themselves ordinary characters
'^' is an ordinary character except at the beginning of the RE or(!) the beginning of a parenthesized sub-expression
'$' is an ordinary character except at the end of the RE or(!) the end of a parenthesized sub-expression
'*' is an ordinary character if it appears at the beginning of the RE or the beginning of a parenthesized sub-expression (after a possible leading '^')
There is one new type of atom, a back reference: '\' followed by a nonzero decimal digit d matches the same sequence of characters matched by the d-th parenthesized sub-expression (numbering sub-expressions by the positions of their opening parentheses, left to right), so that (e.g.) '\([bc]\)\1' matches 'bb' or 'cc' but not 'bc'
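
The back reference can be tried with plain grep, which uses basic REs. A small sketch; the variable name doubled is arbitrary.

```shell
# \1 must repeat exactly what \([bc]\) matched; -x requires the whole line to match.
doubled=$(printf '%s\n' bb cc bc cb | grep -x '\([bc]\)\1')
printf '%s\n' "$doubled"    # prints bb and cc, but neither bc nor cb
```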

Real numbers as Extended RE

[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?

Real numbers as Basic RE

[0-9][0-9]*\.[0-9]*\([eE][+-]\{0,1\}[0-9][0-9]*\)\{0,1\}

Real numbers as Basic RE, written in the shell

\[0-9\]\[0-9\]\*\\.\[0-9\]\*\\\(\[eE\]\[+-\]\\\{0,1\\\}\[0-9\]\[0-9\]\*\\\)\\\{0,1\\\}

sed

was written in 1973 or 1974 by Lee E. McMahon
stands for stream editor
uses Basic Regular Expressions
works as follows:
1. read an entire line from stdin into its pattern buffer
2. modify the pattern buffer according to the supplied commands
3. print the pattern buffer to stdout
4. if not EOF then goto 1
has roughly 20 commands, which makes it a real RISE (Reduced Instruction Set Editor)
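
The four steps can be modelled in a few lines of shell. This is only a toy model, not how sed is implemented; the names sed_cycle and edit are made up, and edit (here: convert to upper case) stands in for the supplied commands.

```shell
# edit LINE: stands in for "the supplied commands"
edit() {
    printf '%s\n' "$1" | tr 'a-z' 'A-Z'
}

sed_cycle() {
    while IFS= read -r pattern_buffer; do          # 1. read a line into the pattern buffer
        pattern_buffer=$(edit "$pattern_buffer")   # 2. modify the pattern buffer
        printf '%s\n' "$pattern_buffer"            # 3. print the pattern buffer
    done                                           # 4. repeat until EOF
}

printf 'one\ntwo\n' | sed_cycle
```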

sed Synopsis

bash$ sed [options] program [inputfile]

This simple program consists of the command 'd'. It tells sed to delete the pattern buffer.

bash$ sed -e 'd' /etc/hosts

Another command is 'p'. It tells sed to print the pattern buffer. (Every line is printed twice)

bash$ sed -e 'p' /etc/hosts

We don't always want to work on the whole document --> There must be a mechanism to address a line or several lines

Addresses

n: selects the line n
$: selects the last line
/re/: selects the lines matching the RE re
\crec: selects the lines matching the RE re. The c may be any character
first~step: GNU extension! Selects every step'th line starting with line first
addr1,addr2: Address range: selects all input lines which match the inclusive range of lines starting from the first address and continuing to the second address
addr!: select those lines, where the addr does not match
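
The address forms can be compared on ten numbered lines. A sketch using only POSIX addresses (the GNU first~step form is left out); the helper numbers is made up for the example.

```shell
# numbers: print the lines 1 .. 10
numbers() { awk 'BEGIN { for (i = 1; i <= 10; i++) print i }'; }

numbers | sed -n -e '3p'         # line 3 only
numbers | sed -n -e '$p'         # the last line: 10
numbers | sed -n -e '/^1/p'      # lines matching the RE: 1 and 10
numbers | sed -n -e '\,^1,p'     # the same RE with ',' as delimiter
numbers | sed -n -e '4,6p'       # the range 4, 5, 6
numbers | sed -n -e '/5/,/7/p'   # from the first line matching 5 to the next one matching 7
numbers | sed -n -e '2,9!p'      # everything outside 2..9: 1 and 10
```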

Examples

The command = prints the current line number. A substitute program for wc -l might be:

bash$ sed -n -e '$='

This one emulates head:

bash$ sed -n -e '1,10p'
bash$ sed -e '10q'

sed commands

Eliminate comments

bash$ sed -e 's/#.*//' /etc/inetd

The substitute command:

s/re/repl/flags

flags is zero or more of the characters

g: substitute all matches of re
n (a number): substitute only the nth match
p: print the pattern buffer after a successful substitution
w file: If the substitution was made, then write out the result to the named file
I: GNU extension! match case-insensitive

s/// is not recursive: the replacement text is not scanned again for further matches

s/abc/abc/g

This is not an endless loop!

s/otto/o/g

The string ottotto is changed to otto, not to o.
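
The effect of the flags can be seen on short test strings. A small sketch; the input strings are arbitrary.

```shell
echo 'ottotto' | sed -e 's/otto/o/g'     # prints otto: the result is not scanned again
echo 'a-a-a-a' | sed -e 's/a/X/'         # prints X-a-a-a: without flags, the first match only
echo 'a-a-a-a' | sed -e 's/a/X/g'        # prints X-X-X-X: all matches
echo 'a-a-a-a' | sed -e 's/a/X/3'        # prints a-a-X-a: only the third match
echo 'b-b-b-b' | sed -n -e 's/a/X/p'     # prints nothing: -n is given and no substitution was made
```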

Eliminate comments

bash$ sed -e 's/#.*//' /etc/inetd

Eliminate comments and empty lines

bash$ sed -e 's/#.*//;/^$/d' /etc/inetd

Have a 133t prompt

bash$ ls -l | sed -e 's/o/0/;s/l/1/;s/e/3/'
bash$ ls -l | sed -e 's/o/0/g;s/l/1/g;s/e/3/g'
bash$ ls -l | sed -e 'y/ole/013/'

Convert a file from DOS to UNIX and back

# Under UNIX: convert DOS newlines (CR/LF) to Unix format
bash$ sed 's/.$//' file     # assumes that all lines end with CR/LF
bash$ sed 's/^M$//' file    # in bash/tcsh, press Ctrl-V then Ctrl-M
# Under DOS: convert Unix newlines (LF) to DOS format
C:\> sed "s/$//" file    # method 1
C:\> sed -n p file       # method 2

Or use the utilities dos2unix and unix2dos, or the command

tr -d '\r' < inputfile > outputfile

for a conversion from DOS to UNIX, or

:set fileformat=dos
:set fileformat=unix

from within vim, or...

The character # starts a comment; it is a command (which cannot have any address)

This is useful if the sed program is stored in a file. The whole program can be executed with

bash$ sed -f programfile < inputdata
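
A complete round trip with a commented program file might look like this. It is only a sketch: the file is created with mktemp, and the variable names prog and result as well as the sample input are made up.

```shell
# Write a small sed program, including comments, to a temporary file.
prog=$(mktemp)
cat > "$prog" <<'EOF'
# strip comments
s/#.*//
# drop lines that are now empty or blank
/^[[:blank:]]*$/d
EOF

# Run the whole program with -f and clean up afterwards.
result=$(printf 'a = 1  # first\n# only a comment\n\nb = 2\n' | sed -f "$prog")
printf '%s\n' "$result"
rm -f "$prog"
```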

The { and } commands group different commands. } is a command --> it must be preceded by a semicolon.

bash$ sed -ne '/gimme this line number/{=;q;}'

The command n prints the pattern space (unless -n is given) and then replaces it with the next line from stdin

/skip this line/{n;}
 # do some nasty stuff
 ...

REs are greedy

eliminating HTML tags from a file

bash$ sed -e 's/<.*>//g' text.html

If the file contains a line like

This <b> is </b> a <i>example</i>.

then the result will be:

This .

Solution:

bash$ sed -e 's/<[^>]*>//g' text.html
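
Both variants can be compared directly on the sample line. A small sketch; the variable name line is arbitrary.

```shell
line='This <b> is </b> a <i>example</i>.'
echo "$line" | sed -e 's/<.*>//g'       # greedy: one match from the first < to the last >
echo "$line" | sed -e 's/<[^>]*>//g'    # every match stops at the first > after its <
```

The second command keeps the text between the tags; only the spaces around the removed tags remain doubled.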

References

The elleff-Language:

Every vowel c in a word is substituted with clcfc.

--> The ampersand (&) holds the matched string:

bash$ sed -e 's/[aeiou]\+/&l&f&/g'

Referencing a sub-string

Sub-strings enclosed with \( and \) can be referenced with \n (n is a digit from 1 to 9)

bash$ sed -e 's/\([^ ]\+\)  *\([^ ]\+\)  *\([^ ]\+\)/\3 \2 \1/'

reverses the order of the first three words in a line
does nothing if the line contains fewer than three words.

The elleff back-transform

The RE [aeiou]l[aeiou]f[aeiou] also matches strings which are not elleff vowels (alefi, for example), because the three vowels need not be the same.

Basic REs can use the back-reference in the RE itself!

bash$ sed -e 's/\([aeiou]\+\)l\1f\1/\1/g'
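
Forward and back transform can be tested as a round trip. In this sketch the GNU \+ is written as [aeiou][aeiou]* so that it should run with any POSIX sed; the function names to_elleff and from_elleff are made up.

```shell
to_elleff()   { sed -e 's/[aeiou][aeiou]*/&l&f&/g'; }
from_elleff() { sed -e 's/\([aeiou][aeiou]*\)l\1f\1/\1/g'; }

echo 'sed' | to_elleff                  # prints selefed
echo 'sed' | to_elleff | from_elleff    # prints sed again
```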

Space Balls

The patterns are manipulated in the pattern space
The hold space can store multiple lines, separated by newline.
There are commands to fill/empty the hold space
There aren't any commands to work directly on the hold space

D: Delete text in the pattern space up to the first newline
N: Add a newline to the pattern space, then append the next line of input to the pattern space
P: Print out the portion of the pattern space up to the first newline
h: Replace the contents of the hold space with the contents of the pattern space
H: Append a newline to the contents of the hold space, and then append the contents of the pattern space to that of the hold space
g: Replace the contents of the pattern space with the contents of the hold space
G: Append a newline to the contents of the pattern space, and then append the contents of the hold space to that of the pattern space
x: Exchange the contents of the hold and pattern spaces
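
A few one-liners show what these commands do to short inputs. This is a sketch; the swap example assumes an even number of input lines.

```shell
printf '1\n2\n3\n4\n' | sed -e '$!N;s/\n/ /'     # N: join pairs of lines: "1 2" and "3 4"
printf 'a\nb\n'       | sed -e 'G'               # G with an empty hold space: an empty line after each line
printf '1\n2\n3\n4\n' | sed -n -e 'h;n;p;g;p'    # h and g: swap each pair of lines: 2 1 4 3
```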

Space Balls: Example

Print the first line as last

bash$ sed -n -e '1h;1!p;${g;p;}'

h: hold space <- pattern space
g: pattern space <- hold space

Emulation of tac

bash$ sed -n -e 'G;h;$p'

G: pattern space <<- '\n' hold space

Problem:

The output ends with a superfluous empty line: G appends a newline followed by the contents of the hold space to the pattern space even on the first line (which is printed last), when the hold space is still empty.

tac improved

bash$ sed -n -e 'G;h;$s/.$//p'
bash$ sed -n -e '1!G;h;$p'

A simple counter in sed

/^[[:digit:]][[:digit:]]*$/!n;         # the line must contain only digits
x;s/.*//;x;                            # clear the hold space
: add
/9$/{s/9$//;x;s/.*/0&/;x;b add;};      # eliminate the last 9 from the p.s.
                                       # and add a 0 in front of the h.s.
s/8$/9/
s/7$/8/
s/6$/7/
s/5$/6/
s/4$/5/
s/3$/4/
s/2$/3/
s/1$/2/
s/0$/1/
s/^$/1/
G;s/\n//g;            # add the content of the h.s to the p.s

Branches

: label: Definition of label (up to 8 characters)
b label: unconditionally branch to label
t label: branch to label only if there has been a successful 's'ubstitution since the last input line was read or 't' branch was taken

If label is omitted in the b or t command, then the next cycle is started.
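
A typical use of t is a loop that repeats a substitution until it no longer matches. A minimal sketch; the label again and the function name strip_parens are made up. In a basic RE, ( and ) are ordinary characters.

```shell
# Remove the innermost parenthesised group, then try again until nothing is left to remove.
strip_parens() {
    sed -e ':again' -e 's/([^()]*)//' -e 't again'
}

echo 'a(b(c)d)e' | strip_parens    # prints ae
```

The first pass removes (c), the second removes (bd); the third substitution fails, so t does not branch and the line is printed.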

Eliminate K/K++ comments

one-line comments (K++): kk...
multi-line-comments (K): ko...ok

#!/bin/sed -f

# delete K++ comments
/^[[:blank:]]*kk.*/d
s/kk.*//

# If no comment is found, then start a new cycle
: test
/ko/!b

# Append new lines to the pattern space until an entire K-comment is in the
# pattern space
: append
/ok/!{N;b append;}

# delete every K-comment (but don't be greedy!)
s/ko\([^o]\|o[^k]\)*o\?ok//g

t test

awk

was written by Aho, Weinberger, and Kernighan
was first described in Software Practice and Experience in July, 1978
uses Extended Regular Expressions
has a rich grammar (with if, while, for etc.)
works as follows
1. execute the BEGIN block
2. read an entire line from stdin into $0
3. process it according to the code in the program body
4. if not EOF (or exit) then goto 2
5. execute the END block
isn't at all awkward

Program Structure

organisation of an awk program

pattern { action }

A pattern can be:

empty
BEGIN
END
expression
expression , expression

A simple program

BEGIN { print "START" }
{ print }
END { print "STOP" }

A simple program with quit command

BEGIN { print "START" }
/quit/{ exit }
{ print }
END { print "STOP" }

Use of variables

bash$ awk '{ a++ } END{ print a, "lines." }'

Variables are created automatically when they are first referenced
they are initialised to 0 or to the empty string
they can hold numbers and/or strings (or be arrays)
the type is determined by the content
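
The automatic initialisation and conversion can be observed in a BEGIN block. A small sketch; all variable names are arbitrary.

```shell
out=$(awk 'BEGIN {
    print "[" s "]"      # an unset variable used as a string is empty: []
    print n + 0          # used as a number it is 0
    x = "3"; y = 4
    print x + y          # the string "3" is converted to a number: 7
    print x y            # concatenation treats both as strings: 34
}')
printf '%s\n' "$out"
```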

Example: Slicing the input

bash$ ls -lg | awk '{ print $3, ":", $7 }'

Who tells awk which character to take as field separator?

And why are there spaces between the fields in the output string?

The field separator can be specified with the FS variable.

BEGIN { FS=":"; OFS=""; }
{ print $1, "'s name is: ", $5 }

called as

awk -f programfile /etc/passwd

Some built-in variables

FS field separator
OFS output field separator
NF number of fields
RS record separator (default: \n)
ORS output record separator (default: \n)
ARGC number of command line arguments
ARGV vector of command line arguments
etc.
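
The separators and counters can be tried on short inputs. A sketch with made-up input; NR, the number of the current record, is used in addition to the variables listed above.

```shell
# FS splits the fields, OFS joins the arguments of print.
printf 'a:b:c\nd:e\n' | awk 'BEGIN { FS = ":"; OFS = "-" } { print NR, NF, $1, $NF }'
# prints 1-3-a-c and 2-2-d-e

# RS splits the input into records, ORS terminates every printed record.
printf 'x;y;z' | awk 'BEGIN { RS = ";"; ORS = "|" } { print }'
# prints x|y|z| (without a final newline)
```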

Example: Emulation of wc -w

BEGIN{ w=0 }
{ w+= NF }
END{ print w }

END{ print w }; { w+=NF }; BEGIN{ w=0 }

would work too.

Example: String manipulation

bash$ awk '{ sub(/[^ ]* */,""); print $0 }'

sub() changes $0, so NF and the fields $n are recomputed.
Unlike in sed, it is not possible to reference sub-strings of the RE (like \1)
gsub() performs a global substitution
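
A few calls on a fixed line show these points. A small sketch; as in sed, & in the replacement stands for the matched text.

```shell
echo 'one two three' | awk '{ sub(/[^ ]* */, ""); print NF ": " $1 }'   # prints 2: two
echo 'one two three' | awk '{ n = gsub(/o/, "0"); print n, $0 }'        # prints 2 0ne tw0 three
echo 'sed'           | awk '{ gsub(/[aeiou]+/, "&l&f&"); print }'       # prints selefed
```

gsub() returns the number of substitutions it made, which is stored in n here.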

A simple calculator

BEGIN{ print "type a number" }
{ print $1, "squared =", $1*$1 }

Example: Rotating the input column

{ j=1+j%3; print $j }

Control structures

Conditions

if (expr) statement
if (expr) statement else statement

Loops

while (expr) statement
do statement while (expr)
for (opt_expr ; opt_expr ; opt_expr) statement
for (var in array) statement

and also

continue
break
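
Two short programs use these statements on the fields of a line. This is a sketch; the function names reverse_fields and sum_even are made up.

```shell
# Print the fields of every line in reverse order.
reverse_fields() {
    awk '{
        for (i = NF; i > 0; i--)
            printf "%s%s", $i, (i > 1 ? " " : "\n")
    }'
}

# Sum the even fields of every line; odd ones are skipped with continue.
sum_even() {
    awk '{
        s = 0
        for (i = 1; i <= NF; i++) {
            if ($i % 2)
                continue
            s += $i
        }
        print s
    }'
}

echo 'a b c'   | reverse_fields    # prints c b a
echo '1 2 3 4' | sum_even          # prints 6
```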

Arrays

One-dimensional Arrays

all arrays have string indexes (A[2] is equivalent to A["2"])
an element can be deleted with delete A[expr]

printing all elements of an array:

for (i in A)
    print A[i]

Multidimensional Arrays

are mapped to one-dimensional arrays by concatenating the indices, separated by SUBSEP
A[i,j] is equivalent to A[i SUBSEP j]
testing for an element uses a parenthesized index

if ( (i,j) in A ) print A[i,j]
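
Because the indices are concatenated, a loop over such an array yields the combined index, which split() can take apart again. A minimal sketch; the names k and idx are arbitrary.

```shell
out=$(awk 'BEGIN {
    A[1, 2] = "x"
    if ((1, 2) in A)    print "A[1,2] exists"
    if (!((2, 1) in A)) print "A[2,1] does not"
    for (k in A) {                 # k is "1" SUBSEP "2"
        split(k, idx, SUBSEP)      # recover the two indices
        print idx[1], idx[2], A[k]
    }
}')
printf '%s\n' "$out"
```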

Functions

A function is defined as

function name( args ) { statements }

and can return a value

return expression

All variables are global..

function set_n(i)
{ n=i; }

BEGIN{ n=6; set_n(1); print n }

.. but arguments are local

function set_n(i,   n)
{ n=i; }

BEGIN{ n=6; set_n(1); print n }

I/O

print: writes $0 ORS to standard output.
print expr1, expr2, ..., exprn: writes expr1 OFS expr2 OFS ... exprn ORS to standard output.
printf format, expr-list: duplicates the printf C library function writing to standard output.
print > file: writes $0 ORS to file
getline: reads into $0, updates the fields, NF, NR and FNR
getline < file: reads into $0 from file, updates the fields and NF
getline var: reads the next record into var, updates NR and FNR
getline var < file: reads the next record of file into var
command | getline: pipes a record from command into $0 and updates the fields and NF
command | getline var: pipes a record from command into var
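
The pipe forms of getline can be tried with echo as the command. A small sketch; the variable names are arbitrary, and close() is called so that the command could be run again.

```shell
out=$(awk 'BEGIN {
    cmd = "echo hello world"
    cmd | getline greeting              # one record of the output goes into greeting
    close(cmd)
    print greeting

    cmd = "echo a; echo b"
    while ((cmd | getline line) > 0)    # getline returns 1 per record, 0 at EOF, -1 on error
        count++
    close(cmd)
    print count, "records"
}')
printf '%s\n' "$out"
```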

Dividing even/odd pages of a Text (RFC)

Assuming the pages pre-formatted and separated by ^L (0x0c)

BEGIN{ job = 1 }

{ print > "txt.out." job }

/^\x0c$/ { job = job % 2 + 1 }

Passing variables to awk

Let us suppose we have a shell variable $SearchString, and we want to pass it to an awk program (which emulates grep)

First Try

awk '/$SearchString/{ print }' textfile.txt

This doesn't work, because the shell inhibits variable expansion between single quotes (').

awk /$SearchString'/{ print }' textfile.txt

What happens if $SearchString contains a space?

Second Try

awk /"$SearchString"'/{ print }'

Another solution

awk -v ss="$SearchString" '$0 ~ ss { print }'

The -v option is available with POSIX compliant awk implementations. mawk and gawk support it, but oawk does not. Some of the nawk implementations support it, some do not.
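
With -v, a search string containing a space causes no trouble. A minimal sketch with made-up input; note that awk interprets backslash escapes in the value given to -v, so a string containing backslashes needs extra care.

```shell
SearchString='two words'
found=$(printf 'one word\ntwo words here\n' | awk -v ss="$SearchString" '$0 ~ ss { print }')
printf '%s\n' "$found"    # prints two words here
```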

Statistics of password generators

BEGIN { bytes = 0 }
{
    n=length($0)
    for(i=1; i<=n; i++)
        A[substr($0,i,1)]++
    bytes+=n
}
END {
    n = 0
    med = 0
    print bytes, "bytes"
    for(i in A)
    {
        med+=A[i]
        n++
    }
    print n, "chars"
    med/=n
    print "average frequency =", med
    var = 0;
    for(i in A)
        var+=(A[i]-med)^2/n
    print "variance =", var
    print "std. dev =", sqrt(var)
}

base64-encoded random values

bash$ (base64-encode < /dev/urandom | tr -d +/\\n | \
head -c "${1:-8000}" 2> /dev/null ; echo) | awk -f stat
8000 bytes
62 chars
average frequency = 129.032
variance = 108.354

passwords generated with pwgen

bash$ pwgen -c -n 8 1000|awk -f stat
8000 bytes
56 chars
average frequency = 142.857
variance = 38521.9

ed

was written by Ken Thompson (1972 or earlier)
uses Basic Regular Expressions (with some extensions)
is a fully featured editor
has two distinct modes: command and input
reads the user commands from stdin
has one(!) error message: '?'

The form of an ed command

[address [,address]]command[parameters]

The addresses are like those of sed with many extensions:

'.' for the present line
'-' and '+' for the previous and next line
'%' for the whole document
'/re/' for the next line matching re
'?re?' for the previous line matching re
etc...

Some of the commands are

'a' append text after the selected line
'd' delete the selected lines from buffer
'g/re/command-list' apply command-list to each of the addressed lines matching re
'p' print the selected lines
's/re/repl/fl' substitute re with repl
'w file' write buffer to file
'q' quit

Examples

Invocation of ed

bash$ ed textfile.txt < commandfile

bash$ ed textfile.txt <<EOF
a
This is now the last line.
.
wq
EOF

Appends a line at the end of the file (the initial position is the last line).

A single period on a line by itself ends input mode.

The notation <<string ... string is a here-document of the shell: the lines up to the closing string are fed to the command on stdin.

Print all lines matching a RE

bash$ ed textfile.txt <<EOF
g/re/p
q
EOF

Useful readings

Notes on the history of sed, awk, ed etc.
- Unix History
- Early History of Unix is a chapter from the netbook: Netizens: On the History and Impact of Usenet and the Internet
- Interview with Bill Joy about the history of vi, which he wrote to have a visual editor resembling en, which he wrote to improve em, which George Coulouris wrote to improve ed.
Documentation of sed
- Single UNIX® Specification, Version 2
- texinfo manual of GNU sed
- Eric Pemente's sedfaq
- Carlos Jorge Duarte's do it with sed
- Good introduction to sed
- A German introduction to sed
Documentation of awk
Other resources
- Newsgroup: de.comp.os.unix.shell
- The asr manpage collection