sed & awk Workshop
This is a somewhat crude transcript of a sed & awk workshop held for the Linux User Group Bolzano-Bozen-Bulsan on 25 January 2003.
Regular Expressions
- introduced in 1956 by S.C. Kleene, to describe the states of a FSA (model of nervous activity)
- REs describe the form of character strings
- A string is matched by a RE if the string is an element of the class described by the RE
- REs are greedy
- Forms of REs:
- basic REs (ed, sed, lex, ...)
- extended REs (egrep, awk, regex(3), ...)
- perl compatible REs (perl, libpcre, ...)
- Definition of an (extended) RE
- A RE is one or more non-empty branches, separated by '|'. It matches anything that matches one of the branches
- A branch is the concatenation of one or more pieces
- A piece is an atom, possibly followed by a single(!) '*', '+', '?', or a bound
- Documentation
Atoms
Atoms are the basic components of a RE
- x
- the character 'x' itself
- \X
- if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C interpretation of \X. Otherwise, a literal 'X' (used to escape operators such as '*')
- \123
- the character with octal value 123
- \xe5
- the character with hexadecimal value e5
- .
- any character (byte) except newline
- [xyz]
- a character class: x OR y OR z
- [ako-sP]
- a character class with a range in it; matches an 'a', a 'k', any letter from 'o' through 's', or a 'P'
- [^A-Z]
- a negated character class: i.e., any character but those in the class. In our example, any character except an uppercase letter
- [:str:]
- a character class expression: allowed only within another character class. The valid contents of str are: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit
Pieces
Pieces are used to concatenate one or more REs, or to specify how often the preceding piece must be repeated
- (r)
- the RE r itself
- rs
- the RE r followed by the RE s
- r|s
- the RE r OR the RE s
- r*
- the RE r zero or more times
- r+
- the RE r one or more times
- r?
- the RE r zero or one time
- r{2,6}
- the RE r anywhere from two to six times
- r{2,}
- the RE r two or more times
- r{,6}
- the RE r up to six times
- r{4}
- the RE r exactly four times
Regular Examples
The RE (x|y|z) is equivalent to the RE [xyz].
And the RE (a|b) is equivalent to (b|a).
The RE (B|F)al{2} matches both the strings Ball and Fall.
Regular Examples to match real numbers
Real numbers (simple)
[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?
Real numbers (character class)
[[:digit:]]+\.[[:digit:]]*([eE][+-]?[[:digit:]]+)?
Problem: numbers like '3.' are accepted, but not '.3'.
Real numbers (catch all)
(([[:digit:]]+\.[[:digit:]]*)|(\.[[:digit:]]+))([eE][+-]?[[:digit:]]+)?
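The catch-all RE can be tried out with grep -E. This is only a sketch: the helper name is_real and the sample strings are made up for illustration, and the RE is anchored with ^ and $ so that the whole string has to match.

```shell
# The "catch all" RE from above, anchored so the whole line must match.
re='^(([[:digit:]]+\.[[:digit:]]*)|(\.[[:digit:]]+))([eE][+-]?[[:digit:]]+)?$'

# is_real is a hypothetical helper: exit status 0 if $1 is a real number.
is_real() {
    printf '%s\n' "$1" | grep -Eq "$re"
}

for s in 3. .3 3.14 2.5e-3 42 abc; do
    if is_real "$s"; then
        echo "$s: match"
    else
        echo "$s: no match"
    fi
done
```

Note that a plain integer such as 42 is rejected, because the RE requires a decimal point.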
Basic REs
- '|', '+', and '?' are ordinary characters and there is no equivalent for their functionality
- The delimiters for bounds are '\{' and '\}', with '{' and '}' by themselves ordinary characters
- The parentheses for nested sub-expressions are '\(' and '\)', with '(' and ')' by themselves ordinary characters
- '^' is an ordinary character except at the beginning of the RE or(!) the beginning of a parenthesized sub-expression
- '$' is an ordinary character except at the end of the RE or(!) the end of a parenthesized sub-expression
- '*' is an ordinary character if it appears at the beginning of the RE or the beginning of a parenthesized sub-expression (after a possible leading '^')
- There is one new type of atom, a back reference: '\' followed by a nonzero decimal digit d matches the same sequence of characters matched by the d-th parenthesized sub-expression (numbering sub-expressions by the positions of their opening parentheses, left to right), so that (e.g.) '\([bc]\)\1' matches 'bb' or 'cc' but not 'bc'
Real numbers as Extended RE
[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?
Real numbers as Basic RE
[0-9][0-9]*\.[0-9]*\([eE][+-]\{0,1\}[0-9][0-9]*\)\{0,1\}
Real numbers as Basic RE, written in the shell
\[0-9\]\[0-9\]\*\\.\[0-9\]\*\\\(\[eE\]\[+-\]\\\{0,1\\\}\[0-9\]\[0-9\]\*\\\)\\\{0,1\\\}
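Putting the RE between single quotes avoids the extra layer of backslashes. As a sketch (variable names and sample input are assumptions), the basic and the extended form can be compared with grep and grep -E:

```shell
# The same RE in both dialects; single quotes keep the shell from touching it.
bre='[0-9][0-9]*\.[0-9]*\([eE][+-]\{0,1\}[0-9][0-9]*\)\{0,1\}'
ere='[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?'

sample='pi is 3.14
no number here
2.5e-3 is small'

# grep uses basic REs, grep -E extended REs; -c counts the matching lines.
n_bre=$(printf '%s\n' "$sample" | grep -c "$bre")
n_ere=$(printf '%s\n' "$sample" | grep -Ec "$ere")
echo "basic: $n_bre, extended: $n_ere"
```

Both dialects should select the same two lines.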
sed
- was written in 1973 or 1974 by Lee E. McMahon
- is an acronym for stream editor
- uses Basic Regular Expressions
- works as follows:
1. read an entire line from stdin into its pattern buffer
2. modify the pattern buffer according to the supplied commands
3. print the pattern buffer to stdout
4. if not EOF then goto 1
- has roughly 20 commands, which makes it a real RISE (Reduced Instruction Set Editor)
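The read/modify/print cycle can be imitated with a plain shell loop. This is a rough sketch, not how sed is implemented: sed_cycle is a made-up name, and the only "command" it knows is the equivalent of y/abc/ABC/.

```shell
# Emulates the sed cycle for the single command y/abc/ABC/.
# sed_cycle is a hypothetical name used only for this sketch.
sed_cycle() {
    while IFS= read -r pattern_buffer; do
        # read: one line is now in the "pattern buffer"
        # modify: apply the command to the buffer
        pattern_buffer=$(printf '%s\n' "$pattern_buffer" | tr abc ABC)
        # print: write the buffer to stdout
        printf '%s\n' "$pattern_buffer"
    done   # repeat until EOF
}

by_loop=$(printf '%s\n' 'a cab' 'back' | sed_cycle)
by_sed=$(printf '%s\n' 'a cab' 'back' | sed -e 'y/abc/ABC/')
echo "$by_loop"
```

Both variants should produce the same two lines.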
sed Synopsis
bash$ sed [options] program [inputfile]
This simple program consists of the command 'd'. It tells sed to delete the pattern buffer.
bash$ sed -e 'd' /etc/hosts
Another command is 'p'. It tells sed to print the pattern buffer. (Every line is printed twice)
bash$ sed -e 'p' /etc/hosts
We don't always want to work on the whole document --> There must be a mechanism to address a line or several lines
Addresses
- n
- selects the line n
- $
- selects the last line
- /re/
- selects the lines matching the RE re
- \crec
- selects the lines matching the RE re. The c may be any character
- first~step
- GNU extension! Selects every step'th line starting with line first
- addr1,addr2
- Address range: selects all input lines which match the inclusive range of lines starting from the first address and continuing to the second address
- addr!
- select those lines, where the addr does not match
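The address forms can be tried on a small input. In this sketch, pick is a made-up helper that appends the p command to the address given as its argument; the GNU-only first~step form is left out.

```shell
# pick is a hypothetical helper: it prints the lines selected by the
# sed address $1 from five sample lines, joined by spaces.
pick() {
    printf '%s\n' one two three four five | sed -n -e "${1}p" | paste -s -d ' ' -
}

pick 2            # two
pick '$'          # five
pick '/^t/'       # two three
pick '\,^f,'      # four five
pick '2,/four/'   # two three four
pick '2,4!'       # one five
```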
Examples
The command = prints the current line number. A substitute program for wc -l might be:
bash$ sed -n -e '$='
This one emulates head:
bash$ sed -n -e '1,10p'
bash$ sed -e '10q'
sed commands
Eliminate comments
bash$ sed -e 's/#.*//' /etc/inetd.conf
The substitute command:
s/re/repl/flags
flags is zero or more of the characters
- g: substitute all matches of re
- N (a number): substitute only the Nth match
- p: print the pattern buffer after a successful substitution
- w file: If the substitution was made, then write out the result to the named file
- I: GNU extension! match case-insensitive
- s/// is not recursive
s/abc/abc/g
This is not an endless loop!
s/otto/o/g
The string ottotto will be changed to otto, not to o.
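The effect of the flags, and the fact that replaced text is not scanned again, can be checked on short strings. A minimal sketch with made-up sample input:

```shell
first=$(echo 'a-a-a' | sed -e 's/a/X/')        # no flag: first match only
second=$(echo 'a-a-a' | sed -e 's/a/X/2')      # numeric flag: second match only
all=$(echo 'a-a-a' | sed -e 's/a/X/g')         # g: every match
once=$(echo 'ottotto' | sed -e 's/otto/o/g')   # the result is not scanned again
echo "$first $second $all $once"
# prints: X-a-a a-X-a X-X-X otto
```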
Eliminate comments
bash$ sed -e 's/#.*//' /etc/inetd.conf
Eliminate comments and empty lines
bash$ sed -e 's/#.*//;/^$/d' /etc/inetd.conf
Have a 133t prompt
bash$ ls -l | sed -e 's/o/0/;s/l/1/;s/e/3/'
bash$ ls -l | sed -e 's/o/0/g;s/l/1/g;s/e/3/g'
bash$ ls -l | sed -e 'y/ole/013/'
Convert a file from DOS to UNIX and back
# Under UNIX: convert DOS newlines (CR/LF) to Unix format
bash$ sed 's/.$//' file # assumes that all lines end with CR/LF
bash$ sed 's/^M$//' file # in bash/tcsh, press Ctrl-V then Ctrl-M
# Under DOS: convert Unix newlines (LF) to DOS format
C:\> sed 's/$//' file # method 1
C:\> sed -n p file # method 2
Or use the utilities dos2unix and unix2dos, or the command
tr -d '\r' < inputfile > outputfile
for a conversion from DOS to UNIX, or
:set fileformat=dos
:set fileformat=unix
from within vim, or...
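The DOS to UNIX direction can be checked by counting bytes before and after the conversion. This is a sketch under assumptions: the temporary file comes from mktemp and the two-line sample is made up.

```shell
# Build a two-line DOS file in a temporary location.
tmp=$(mktemp)
printf 'one\r\ntwo\r\n' > "$tmp"

before=$(wc -c < "$tmp")                       # 10 bytes: each line ends in CR LF
after_tr=$(tr -d '\r' < "$tmp" | wc -c)        # tr removes every CR
after_sed=$(sed -e 's/.$//' "$tmp" | wc -c)    # sed removes the last character of each line
echo "before: $before, tr: $after_tr, sed: $after_sed"
rm -f "$tmp"
```

The two methods differ in one respect: tr also deletes a CR in the middle of a line, while the sed command removes the last character of every line, whatever it is.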
The character # starts a comment. It is a command (which cannot have any address)
This is useful if the sed program is stored in a file. The whole program can be executed with
bash$ sed -f programfile < inputdata
The { and } commands group different commands. } is a command --> it must be preceded by a semicolon.
bash$ sed -ne '/gimme this line number/{=;q;}'
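The command n prints the pattern buffer (unless -n is given) and then replaces it with the next line from stdin
/skip this line/n
# the matching line is printed as it is; the commands below see the next line
# do some nasty stuff
...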
REs are greedy
eliminating HTML tags from a file
bash$ sed -e 's/<.*>//g' text.html
If the file contains a line like
This <b> is </b> a <i>example</i>.
then the result will be:
This .
Solution:
bash$ sed -e 's/<[^>]*>//g' text.html
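Both variants can be compared directly on the sample line. A small sketch (the variable names are arbitrary):

```shell
line='This <b> is </b> a <i>example</i>.'

# Greedy: <.*> stretches from the first '<' to the last '>'.
greedy=$(printf '%s\n' "$line" | sed -e 's/<.*>//g')
# [^>]* cannot run past the closing '>' of the same tag.
careful=$(printf '%s\n' "$line" | sed -e 's/<[^>]*>//g')

echo "greedy:  $greedy"
echo "careful: $careful"
```

The blanks around the removed tags are kept, so the second result contains some double spaces.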
References
The elleff language:
Every vowel c in a word is substituted with clcfc.
--> The ampersand (&) holds the matched string (the \+ used here is a GNU extension):
bash$ sed -e 's/[aeiou]\+/&l&f&/g'
Referencing a sub-string
Sub-strings enclosed with \( and \) can be referenced with \n (n is a digit from 1 to 9)
bash$ sed -e 's/\([^ ]\+\) *\([^ ]\+\) *\([^ ]\+\)/\3 \2 \1/'
- swaps the first three words in a line
- does nothing if the line contains fewer than 3 words.
The elleff back-transform
The RE [aeiou]l[aeiou]f[aeiou] also matches strings which are not elleff vowels (e.g. alefi).
Basic REs can use the back-reference in the RE itself!
bash$ sed -e 's/\([aeiou]\+\)l\1f\1/\1/g'
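A round trip through both transformations shows & and \1 at work. In this sketch the function names are made up, and [aeiou][aeiou]* is written in place of [aeiou]\+ so that it does not depend on GNU sed.

```shell
# to_elleff and from_elleff are hypothetical names for the two filters.
to_elleff()   { sed -e 's/[aeiou][aeiou]*/&l&f&/g'; }
from_elleff() { sed -e 's/\([aeiou][aeiou]*\)l\1f\1/\1/g'; }

encoded=$(echo 'hello world' | to_elleff)
decoded=$(echo "$encoded" | from_elleff)
echo "$encoded"   # helefellolofo woloforld
echo "$decoded"   # hello world
```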
Space Balls
- The patterns are manipulated in the pattern space
- The hold space can store multiple lines, separated by newline.
- There are commands to fill/empty the hold space
- There aren't any commands to work directly on the hold space
- D
- Delete text in the pattern space up to the first newline
- N
- Add a newline to the pattern space, then append the next line of input to the pattern space
- P
- Print out the portion of the pattern space up to the first newline
- h
- Replace the contents of the hold space with the contents of the pattern space
- H
- Append a newline to the contents of the hold space, and then append the contents of the pattern space to that of the hold space
- g
- Replace the contents of the pattern space with the contents of the hold space
- G
- Append a newline to the contents of the pattern space, and then append the contents of the hold space to that of the pattern space
- x
- Exchange the contents of the hold and pattern spaces
Space Balls: Example
Print the first line as last
bash$ sed -n -e '1h;1!p;${g;p;}'
- h
- hold space <- pattern space
- g
- pattern space <- hold space
Emulation of tac
bash$ sed -n -e 'G;h;$p'
- G
- pattern space <<- '\n' hold space
Problem:
The output shows a superfluous newline at the end: this is because G adds a newline followed by the content of the hold buffer to the pattern buffer, even in the first line (which is printed at the end).
tac improved
bash$ sed -n -e 'G;h;$s/.$//p'
bash$ sed -n -e '1!G;h;$p'
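Both hold space programs can be checked on a three-line input. A minimal sketch with made-up input:

```shell
# tac emulation: prepend each line to what has been collected so far.
reversed=$(printf '%s\n' 1 2 3 | sed -n -e '1!G;h;$p')
# First line as last: park line 1 in the hold space, print it at the end.
rotated=$(printf '%s\n' 1 2 3 | sed -n -e '1h;1!p;${g;p;}')

echo "$reversed"
echo "$rotated"
```

The first command should print 3, 2, 1 and the second 2, 3, 1, one number per line.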
A simple counter in sed
/^[[:digit:]][[:digit:]]*$/!b; # lines not consisting only of digits pass through unchanged
x;s/.*//;x; # clear the hold space
: add
/9$/{s/9$//;x;s/.*/0&/;x;b add;}; # eliminate the last 9 from the p.s.
# and add a 0 in front of the h.s.
s/8$/9/
s/7$/8/
s/6$/7/
s/5$/6/
s/4$/5/
s/3$/4/
s/2$/3/
s/1$/2/
s/0$/1/
s/^$/1/
G;s/\n//g; # add the content of the h.s to the p.s
Branches
- : label
- Definition of label (up to 8 characters)
- b label
- unconditionally branch to label
- t label
- branch to label only if there has been a successful 's'ubstitution since the last input line was read or 't' branch was taken
If label is omitted in the b or t command, then the next cycle is started.
- one-line comments (K++): kk...
- multi-line-comments (K): ko...ok
#!/bin/sed -f
# delete K++ comments
/^[[:blank:]]*kk.*/d
s/kk.*//
# If no comment is found, then start a new cycle
: test
/ko/!b
# Append new lines to the pattern space until an entire K-comment is in the
# pattern space
: append
/ok/!{N;b append;}
# delete every K-comment (but don't be greedy!)
s/ko\([^o]\|o[^k]\)*o\?ok//g
t test
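A smaller illustration of a t loop (not part of the workshop material; group_digits is a made-up name): the substitution inserts one thousands separator per pass, and t jumps back to the label as long as a substitution succeeded.

```shell
# group_digits is a hypothetical helper: 1234567 -> 1,234,567
group_digits() {
    sed -e ':a' \
        -e 's/^\([0-9][0-9]*\)\([0-9]\{3\}\)/\1,\2/' \
        -e 'ta'
}

big=$(echo 1234567 | group_digits)
small=$(echo 12 | group_digits)
echo "$big $small"   # 1,234,567 12
```

Each command is given in its own -e, so the label ends at the end of its script fragment and no semicolon handling is involved.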
awk
- was written by Aho, Weinberger, and Kernighan
- was first described in Software Practice and Experience in July, 1978
- uses Extended Regular Expressions
- has a rich grammar (with if, while, for etc.)
- works as follows:
1. execute the BEGIN block
2. read an entire line from stdin into $0
3. process it according to the code in the program body
4. if not EOF (or exit) then goto 2
5. execute the END block
- isn't at all awkward
Program Structure
organisation of an awk program
pattern { action }
A pattern can be:
- empty
- BEGIN
- END
- expression
- expression , expression
A simple program
BEGIN { print "START" }
{ print }
END { print "STOP" }
A simple program with quit command
BEGIN { print "START" }
/quit/{ exit }
{ print }
END { print "STOP" }
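The program with the quit command can be run on a few made-up input lines. The sketch shows that exit stops the reading of input, while the END block is still executed:

```shell
out=$(printf '%s\n' first second quit third | awk '
BEGIN  { print "START" }
/quit/ { exit }          # stops reading input, but END still runs
       { print }
END    { print "STOP" }')
echo "$out"
```

The lines quit and third never reach the output.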
Use of variables
bash$ awk '{ a++ } END{ print a, "lines." }'
- variables are created automatically when they are referenced for the first time
- they are initialised to 0 or to the empty string
- they can hold numbers and/or strings (or be arrays)
- the type is determined by the content
Example: Slicing the input
bash$ ls -lg | awk '{ print $3, ":", $7 }'
Who tells awk which character to take as field separator?
And why are there spaces between the fields in the output string?
The field separator can be specified with the FS variable.
BEGIN { FS=":"; OFS=""; }
{ print $1, "'s name is: ", $5 }
called as
awk -f programfile /etc/passwd
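The effect of FS and OFS can be seen on a single record. In this sketch the passwd entry is made up, and \047 (the octal escape for an apostrophe) is used to avoid quoting trouble inside the single-quoted program.

```shell
# One made-up passwd entry.
entry='jdoe:x:1000:1000:John Doe:/home/jdoe:/bin/sh'

# Default OFS is a single space: the commas in print become blanks.
plain=$(printf '%s\n' "$entry" | awk 'BEGIN { FS = ":" } { print $1, $5 }')

# With an empty OFS the expressions are glued together.
glued=$(printf '%s\n' "$entry" | awk '
BEGIN { FS = ":"; OFS = "" }
      { print $1, "\047s name is: ", $5 }')

echo "$plain"   # jdoe John Doe
echo "$glued"   # jdoe's name is: John Doe
```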
Some built-in variables
- FS field separator
- OFS output field separator
- NF number of fields
- RS record separator (default: \n)
- ORS output record separator (default: \n)
- ARGC number of command line arguments
- ARGV vector of command line arguments
- etc.
Example: Emulation of wc -w
BEGIN{ w=0 }
{ w+= NF }
END{ print w }
or
END{ print w }; { w+=NF }; BEGIN{ w=0 }
would work too.
Example: String manipulation
bash$ awk '{ sub(/[^ ]* */,""); print $0 }'
- The content of the variables NF and $n is reassigned.
- Unlike in sed, the replacement cannot reference sub-expressions of the RE (like \1); only & for the whole match is available
- gsub() for global substitution
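A short sketch of sub() and gsub() on a made-up line. It also shows that $0 is split again after the substitution, so NF changes:

```shell
out=$(echo 'one two three two' | awk '{
    n = gsub(/two/, "TWO")      # global; returns the number of replacements
    sub(/[^ ]* */, "")          # removes the first word and the blanks after it
    print n " replacements"
    print $0
    print NF " fields"
}')
echo "$out"
```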
A simple calculator
BEGIN{ print "type a number" }
{ print $1, "squared =", $1*$1 }
Example: Rotating the input column
{ j=1+j%3; print $j }
Control structures
Conditions
if (expr) statement
if (expr) statement else statement
Loops
while (expr) statement
do statement while (expr)
for (opt_expr ; opt_expr ; opt_expr) statement
for (var in array) statement
and also
continue
break
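A small sketch combining for, if and continue; the input numbers are made up:

```shell
out=$(echo '3 8 5 10' | awk '{
    sum = 0
    for (i = 1; i <= NF; i++) {
        if ($i % 2 != 0)
            continue           # skip odd values
        sum += $i
    }
    print "sum of even fields:", sum
}')
echo "$out"   # sum of even fields: 18
```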
Arrays
One-dimensional Arrays
- all arrays have string indexes (A[2] is equivalent to A["2"])
- an element can be deleted with delete A[expr]
Printing all elements of an array:
for (i in A)
print A[i]
Multidimensional Arrays
- are mapped to one-dimensional arrays by concatenating the indices, separated by SUBSEP
- A[i,j] is equivalent to A[i SUBSEP j]
# k holds the combined index; split() recovers i and j
for (k in A) { split(k, idx, SUBSEP); print A[idx[1], idx[2]] }
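A sketch using both kinds of arrays on made-up input. The array names count and seen are arbitrary, and the output is piped through sort because the order of a for (... in ...) loop is not defined.

```shell
out=$(printf '%s\n' 'a b a' 'c a b' | awk '
{
    for (i = 1; i <= NF; i++) {
        count[$i]++            # one-dimensional: word -> frequency
        seen[NR, i] = $i       # two-dimensional: (line, field) -> word
    }
}
END {
    for (w in count)
        print "word", w, count[w]
    for (k in seen) {
        split(k, idx, SUBSEP)  # recover the two indices from the combined key
        print "cell", idx[1], idx[2], seen[k]
    }
}' | sort)
echo "$out"
```

The output has one line per distinct word and one line per input field.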
Functions
A function is defined as
function name( args ) { statements }
and can return a value
return expression
All variables are global...
function set_n(i)
{ n=i; }
BEGIN{ n=6; set_n(1); print n }   # prints 1
... but arguments are local
function set_n(i, n)
{ n=i; }
BEGIN{ n=6; set_n(1); print n }   # prints 6
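Since arguments are local, an extra parameter that the caller never passes can serve as a local variable. A sketch of this idiom (the function fact and its parameter names are made up):

```shell
out=$(awk '
# "result" is never passed by the caller; it only acts as a local variable.
function fact(n,    result) {
    result = 1
    while (n > 1) {
        result *= n
        n--
    }
    return result
}
BEGIN {
    result = "untouched"       # global variable with the same name
    print fact(5), result
}')
echo "$out"   # 120 untouched
```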
I/O
- print
- writes $0 ORS to standard output.
- print expr1, expr2, ..., exprn
- writes expr1 OFS expr2 OFS ... exprn ORS to standard output.
- printf format, expr-list
- duplicates the printf C library function writing to standard output.
- print > file
- writes $0 ORS to file
- getline
- reads into $0, updates the fields, NF, NR and FNR
- getline < file
- reads into $0 from file, updates the fields and NF
- getline var
- reads the next record into var, updates NR and FNR
- getline var < file
- reads the next record of file into var
- command | getline
- pipes a record from command into $0 and updates the fields and NF
- command | getline var
- pipes a record from command into var
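Two of the getline forms in a short sketch; the input lines and the echo command are made up:

```shell
out=$(printf '%s\n' alpha beta | awk '
NR == 1 {
    first = $0
    getline                        # $0 becomes the next input line, NR is incremented
    print first, "->", $0, "NR=" NR
    cmd = "echo from-command"      # any shell command would do here
    cmd | getline line             # reads one record into line, $0 stays unchanged
    close(cmd)
    print line, $0
}')
echo "$out"
```

The rule runs only once: the second input line is consumed by getline inside it.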
Dividing even/odd pages of a Text (RFC)
Assuming the pages are pre-formatted and separated by ^L (0x0c)
BEGIN{ job = 1 }
{ print > ("txt.out." job) }
/^\x0c$/ { job = job % 2 + 1 }
Passing variables to awk
Let us suppose we have a shell variable $SearchString, and we want to pass it to an awk program (which emulates grep)
First Try
awk '/$SearchString/{ print }' textfile.txt
This doesn't work, because the shell inhibits variable expansion between single quotes (').
awk /$SearchString'/{ print }' textfile.txt
What happens if $SearchString contains a space?
Second Try
awk /"$SearchString"'/{ print }'
Another solution
awk -v ss="$SearchString" '$0 ~ ss { print }'
The -v option is available with POSIX compliant awk implementations. mawk and gawk support it, but oawk does not. Some of the nawk implementations support it, some do not.
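The two working variants can be compared with a search string that contains a space. A sketch with made-up sample text:

```shell
SearchString='two words'
text='one word
these are two words
two  words apart'

# Variant with -v: the shell variable is copied into the awk variable ss.
with_v=$(printf '%s\n' "$text" | awk -v ss="$SearchString" '$0 ~ ss { print }')
# Variant with quoting: the string is pasted into the program text itself.
with_quotes=$(printf '%s\n' "$text" | awk /"$SearchString"'/{ print }')

echo "$with_v"
echo "$with_quotes"
```

Both select the same line here. The quoting variant breaks as soon as the string contains a slash, because the slash ends the RE.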
Statistics of password generators
BEGIN { bytes = 0 }
{
    n = length($0)
    for (i = 1; i <= n; i++)
        A[substr($0, i, 1)]++
    bytes += n
}
END {
    n = 0
    med = 0
    print bytes, "bytes"
    for (i in A)
    {
        med += A[i]
        n++
    }
    print n, "chars"
    med /= n
    print "average frequency =", med
    var = 0
    for (i in A)
        var += (A[i] - med)^2 / n
    print "variance =", var
    print "std. dev =", sqrt(var)
}
base64-encoded random values
bash$ (base64-encode < /dev/urandom | tr -d +/\\n | \
head -c "${1:-8000}" 2> /dev/null ; echo) | awk -f stat
8000 bytes
62 chars
average frequency = 129.032
variance = 108.354
passwords generated with pwgen
bash$ pwgen -c -n 8 1000|awk -f stat
8000 bytes
56 chars
average frequency = 142.857
variance = 38521.9
ed
- was written by Ken Thompson (1972 or earlier)
- uses Basic Regular Expressions (with some extensions)
- is a fully featured editor
- has two distinct modes: command and input
- reads the user commands from stdin
- has one(!) error message: '?'
The form of an ed command
[address [,address]]command[parameters]
The addresses are like those of sed with many extensions:
- '.' for the present line
- '-' and '+' for the previous and next line
- '%' for the whole document
- '/re/' for the next line matching re
- '?re?' for the previous line matching re
- etc...
Some of the commands are
- 'a' append text to the selected line
- 'd' delete the selected lines from buffer
- 'g/re/command-list' apply command-list to each of the addressed lines matching re
- 'p' print the selected lines
- 's/re/repl/fl' substitute re with repl
- 'w file' write buffer to file
- 'q' quit
Examples
Invocation of ed
bash$ ed textfile.txt < commandfile
bash$ ed textfile.txt <<EOF
a
This is now the last line.
.
wq
EOF
Appends a line at the end of the file (the initial position is the last line).
A single period on a line by itself exits from input mode.
The notation <<string\n...\nstring is a here-document of the shell.
Print all lines matching a RE
bash$ ed textfile.txt <<EOF
g/re/p
q
EOF
Useful readings
- Notes on the history of sed, awk, ed etc.
- Documentation of sed
- Documentation of awk
- Other resources