Chapter 6 Sed, awk, and regular expressions
In the course of doing bioinformatics, you will be dealing with myriad different
text files. As we noted in the previous chapters, Unix, with its file I/O model,
piping capabilities, and numerous utilities, is well-suited to handling
large text files. Two utilities found on every Unix installation—awk and sed—merit
special attention in this context. awk is a lightweight scripting language that lets
you write succinct programs to operate line-by-line on the contents of text files. It is
particularly useful for handling text files that have columns of data separated by spaces
or tabs. sed, on the other hand, is particularly useful for automating
“find-and-replace” operations on text files. Each of them is optimized to
handle large files without storing a lot of information in memory, so they can be
useful for quick operations on large bioinformatic data sets. Neither is a fully-featured
programming language that you would want to write large, complex programs in (that said,
I did once implement a complete program for full-sibling inference from multiallelic markers
in awk); however, they do share many of the
useful text-manipulation capabilities of such languages, such as Perl and Python. Additionally,
awk and sed are deployed in a consistent fashion across most Unix operating systems, and they
don’t require much time to learn to use effectively for common text-processing tasks.
As a consequence, awk and sed are a useful addition to the bioinformatician’s toolbox.
Both awk and sed rely heavily on regular expressions to describe patterns
in text upon which some operation should be performed. You can think of
regular expressions as providing a succinct language for performing very advanced “find”
and “find-and-replace” operations in a text file.
In this chapter we will only scratch the surface of what can be done with awk and sed. Indeed,
there is an entire 432-page book published decades ago by O’Reilly about awk and sed.
Our goal here is to provide an introduction to a few basic
maneuvers with both awk and sed and to describe instances where they can be useful, as well as
to give an introduction to many (but certainly not all) of the patterns that can be expressed
using regular expressions. We will start with a basic overview of how awk works. Then we will
have a short look at regular expressions, then we will use those further in awk, and finally we
will play with sed a little bit.
In order to have a set of files to use in the examples, I have made a small GitHub
repository called awk-and-sed-inputs. You can get that with:
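Presumably the repository is cloned with git; the GitHub account name below is a placeholder, so substitute the location where awk-and-sed-inputs actually lives:

# clone the example-data repository and move into it
# (GITHUB_USER is a placeholder for the account hosting awk-and-sed-inputs)
git clone https://github.com/GITHUB_USER/awk-and-sed-inputs
cd awk-and-sed-inputs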
All of the examples in this chapter that use such external files
assume that the current working directory is the repository directory awk-and-sed-inputs.
6.1 awk
6.1.1 Line-cycling, tests and actions
awk operates by cycling through a file (or an input stream from stdin) line-by-line. At each
new line awk tests whether it should do anything with the contents of the line. If the answer
to that test is “yes”, then it executes the code that describes what should be done to the
contents of the line. The scope of ways that awk can operate on text is quite wide, but
the most common use of awk is to print parts of the line, or do small calculations on parts of the
line. The instructions that describe the tests and the actions are included in the awk “script”, though,
in practice, this script is often given to awk not as a file but, rather, as text between single quotes on the
command line. The basic syntax looks like this:
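awk '
  test1 {action1}
  test2 {action2}
' file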
Where file is the path of the file you want awk to read, and the “tests” and “actions”
above are just placeholders showing where those parts of the script are written. The layout
makes it clear that if test1 is TRUE, action1 will be executed, and if test2 is TRUE, then
action2 is executed. Code describing the actions must always appear within a set of
curly braces. We will refer to the text within those curly braces as an action block.
Two things to note: first, different tests and actions do not have to be written on separate lines.
Separate lines make things easier to read, but if you are writing these short scripts on the command line, it is often
easier to put everything on a single line, like 'test1 {action1} test2 {action2}', and that is fine.
Second, if you don’t supply a file path to awk, it expects (and gladly processes) data from stdin.
This makes it easy to pipe data into awk. For example, putting the above two points together,
the above awk script skeleton could have been written this way:
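# reading from stdin, with the tests and actions all on one line
cat file | awk 'test1 {action1} test2 {action2}'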
Before we get too abstract talking about tests and actions, let’s look at a few examples:
# print lines starting with @SQ in the SAM header of a file
awk '/^@SQ/ {print}' data/DPCh_plate1_F12_S72.sam
# print only those @SQ lines with a sequence name tag starting with "NC_"
awk '/^@SQ/ && /SN:NC_/ {print}' data/DPCh_plate1_F12_S72.sam
# same as above, but quit processing the file as soon as you hit
# an @SQ line with a sequence name starting with "NW_"
awk '
/^@SQ/ && /SN:NC_/ {print}
/^@SQ/ && /SN:NW_/ {exit}
' data/DPCh_plate1_F12_S72.sam
# print only lines 101 and 102 from the fastq file
gzcat data/DPCh_plate1_F12_S72.R1.fq.gz | awk 'NR==101 || NR==102 {print}'
The above examples show only two different actions (print and exit) and
a variety of tests based on matching substrings in each line, or on the
line number (NR, which stands for “number of the record”).
Take a moment to make sure you understand which parts are the tests, and which are the actions in the above examples.
6.1.2 Column splitting, fields, -F, $, NF, print, OFS, and BEGIN
Every time awk processes a line of text it breaks it into different fields, which you
can think of as columns (as in a spreadsheet). By default, any number of whitespace (space or TAB)
characters constitutes a break between columns. If you want field-splitting to be done using
more specific characters, you can specify that on the command line with the -F option.
For example:
awk -F"\t" 'test {action}' file # split lines into fields on single TAB characters
awk -F"," 'test {action}' file # split lines into fields on single commas
awk -F":" 'test {action}' file # split lines into fields on single colons
awk -F";" 'test {action}' file # split lines into fields on single semicolons
# and so forth...
While in all of those examples the field separator is a single character, in practice the field separator can be specified as a regular expression (see below).
Within an action block, you can access the different fields that awk has split out
of the line using a $ followed immediately by a field number, i.e., $1, $2, $3, … . Fields
with indexes greater than 9 work the same way ($10, $11, …); parentheses are needed only when
the field number is itself an expression, like $(NF-1). The special value $0 denotes the whole line, not just one of its
fields. (Note that if you give the action print without any arguments, that also just
prints the whole line.) The special variable NF is the “Number of Fields”, so
$NF is the “last column.”
Let’s say we have a tab-separated file in data/wgs-chinook-samples.tsv that looks like this:
vcf_name ID_Berk NMFS_DNA_ID BOX_ID BOX_POSITION Population Concentration (ng/ul) run_type
DPCh_plate1_A01_S1 CH_Plate1_1A T144767 T1512 1A Salmon River Fall 23.4 Fall
DPCh_plate1_A02_S2 CH_Plate1_2A T144804 T1512 5F Salmon River Fall 67.4 Fall
DPCh_plate1_A03_S3 CH_Plate1_3A T145109 T1515 7G Salmon River Spring 3.52 Spring
DPCh_plate1_A04_S4 CH_Plate1_4A T145118 T1515 8H Salmon River Spring 10.3 Spring
DPCh_plate1_A05_S5 CH_Plate1_5A T144863 T1513 1A Feather River Hatchery Fall 220 Fall
Then, if we wanted to print the first and the last columns (fields) we could do
awk -F"\t" '{print $1, $NF}' data/wgs-chinook-samples.tsv
# the first few lines of output look like:
vcf_name run_type
DPCh_plate1_A01_S1 Fall
DPCh_plate1_A02_S2 Fall
DPCh_plate1_A03_S3 Spring
DPCh_plate1_A04_S4 Spring
Notice that if there is no test before the action block, then the action is done on every line.
The print command prints variables to stdout. If you separate those variables with
a comma, then in the output they will be separated by the output field separator which,
by default, is a single space. You can set the output field separator using the OFS
variable, in an action block that is forced to run at the beginning of execution using the
special BEGIN test keyword:
awk -F"\t" '
BEGIN {OFS=";"}
{print $1, $NF}
' data/wgs-chinook-samples.tsv
# makes output like:
vcf_name;run_type
DPCh_plate1_A01_S1;Fall
DPCh_plate1_A02_S2;Fall
DPCh_plate1_A03_S3;Spring
DPCh_plate1_A04_S4;Spring
DPCh_plate1_A05_S5;Fall
The comma between the arguments to print is what makes awk print the output field
separator between the items. If you separate arguments to print with just a space
(and not a comma), there will be nothing printed between the two arguments on output.
This, coupled with the fact that print will happily print any strings and numbers
(in addition to variables!) you pass to it, provides a way to do some light formatting
of the output text:
awk -F"\t" 'NR > 1 {print "sample_name:" $1, "run_type:" $NF}' data/wgs-chinook-samples.tsv
# gives output like:
sample_name:DPCh_plate1_A01_S1 run_type:Fall
sample_name:DPCh_plate1_A02_S2 run_type:Fall
sample_name:DPCh_plate1_A03_S3 run_type:Spring
sample_name:DPCh_plate1_A04_S4 run_type:Spring
sample_name:DPCh_plate1_A05_S5 run_type:Fall
Note that if you don’t provide an action block after a test, the default action, if the
test is true, is assumed to be “print the whole line as is.” Thus you can use awk to
print matching lines simply like this:
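awk '/regex/' file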
where /regex/ just means “some regular expression,” as is explained in the next section.
Now it’s your turn!
6.1.3 A brief introduction to regular expressions
In a few of the above examples you will see tests that look like /^@SQ/. This is a regular expression. In
awk, regular expressions are enclosed in forward slashes, so the actual regular expression part
in the above is ^@SQ; the enclosing forward slashes are just delimiters that are telling awk, “Hey! The stuff inside here should
be interpreted as a regular expression.”
At this stage, you can think of a regular expression as a “search-string” that awk will try to match
against the text in a line. At its simplest, a regular expression just describes how characters should
match between the regular expression and the line being matched. For example:
/ACGGTC/
This is saying, “search for the word ACGGTC,” which is something that one might find in a DNA string.
If all that regular expressions did was express a search word (like your familiar find function in
Microslop Word, for example), then they would be very easy to learn, but also very limited in
utility. The good news is that all your standard numerals and upper- and lowercase letters work
in regular expressions just like they do in your vanilla “find” function. So, all of the following
regular expressions are just requesting that the word enclosed in the /’s be found:
/Fall/
/A01/
/plate/
/LN/
Regular expressions get more complicated (and useful) because some of the familiar punctuation
marks have special meaning within a regular expression. These are called metacharacters.
The fundamental metacharacters in awk are listed in Table 6.1. It is
worth getting familiar with all of these, as most are common to all languages that use
regular expressions, such as R, Python, and Perl, so learning these will be helpful not
just in using awk, but also in your programming in general.
Metacharacter(s) | Description |
---|---|
. | Match any single character |
[ ] | Match any single character from a character class (see below) |
^ | Signifies/matches the beginning of the line or word (remember /^@SQ/) |
$ | Signifies/matches the end of the line or word |
* | Match zero or more occurrences of the character (or grouped pattern) immediately to the left |
? | Match zero or one occurrences of the character (or grouped pattern) immediately to the left |
+ | Match one or more occurrences of the character (or grouped pattern) immediately to the left |
{n} | Match n occurrences of the character (or grouped pattern) immediately to the left |
{m,n} | Match between m and n occurrences of the character (or grouped pattern) immediately to the left |
\| | Combine regular expressions with an OR |
\ | Use it if you want to match a literal metacharacter. For example \. matches an actual period, and \? matches an actual question mark |
( ) | Used to group characters into a single unit to which modifiers like * or + can be applied, to delimit the extent of \|’s, or to be used in more advanced expressions for inserting replacement groups |
We note in the above table that more explanation is needed for the concept of character classes defined by [ ].
These are similar to what you are already familiar with in terms of globbing file names. In short, if you
put a variety of characters between square brackets, it means “match any one of these characters.”
For example, /[aceACE]/ means “match any upper- or lowercase a, c, or e.” Within
those square brackets, ^ and - have special meaning depending on where between
the square brackets they occur:
- A -, when between letters or numbers, indicates a range: /[a-zA-Z]/ means any letter; /[0-5]/ means any numeral between 0 and 5.
- A - at the beginning or end of the characters inside the square brackets just means -: /[-;ab]/ means match any of the characters -, ;, a, or b.
- A ^ at the beginning of the characters inside the square brackets negates the character class, meaning the match will be to anything not within the character class, i.e., [^ABC] means match any character that is not A, nor B, nor C. If the ^ is not at the beginning of the characters within the [ ] then it carries no special meaning: [:;^%] matches any of those four punctuation characters.
- All characters, except - and ^ in the positions just described, and the backslash \, are interpreted literally (i.e., not as metacharacters) within the [ ]. So you can match any one of ?, *, (, ), or +, for instance, with /[?*()+]/.
All these metacharacters might seem a little cryptic, so I provide a few
examples here that might help to make them more clear. They are presented
(with comments like “match any line …”) as if they are part of an awk test.
# match any line that starts with @RG
/^@RG/
# match any line that starts with @ followed by any two characters
/^@../
# match any line that starts with @ followed by any
# two uppercase letters
/^@[A-Z][A-Z]/
# the above could also be written as:
/^@[A-Z]{2}/
# This matches phone numbers formatted either like
# (###) ###-####, or ###-###-####
/\(?[0-9]{3}(\) |-)[0-9]{3}-[0-9]{4}/
# And if you can parse that out, you get an A for the day!
# match anything that starts with T456, then has any
# number (including 0) of any other characters, then
# ends with ".fq"
/T456.*\.fq/
# Note the use of backslash to escape the .
# Match either "big mess" or "huge mess"
/(big|huge) mess/
Now it’s your turn
Here are some things to try.
# 1. use alternation (the |) to match lines in the SAM file data/DPCh_plate1_F12_S72.sam
# that start with either @PG or @RG
# 2. search through the fastq file data/DPCh_plate1_F12_S72.R1.fq.gz
# to find any line with exactly two consecutive occurrences of any of the following characters:
# ! " # $ % &
# What are we searching for in the above?
6.1.4 A variety of tests
When you want to test whether a line will be processed by an action block or not, you have many options. The major ones, each illustrated with an example or two after the list, are:
- Match a regular expression anywhere in a line.
- Match a regular expression in a specific column/field.
- Test for equality, using ==, of a single column to a string or a number. Note these are not regular expressions, but actual statements of equality. Strings in such a context are surrounded by double quotes.
- Use comparison operators <, <=, >, >=, and != (not equals) to compare a column to a string (compared by lexicographical order) or a number (compared by numerical order).
- Test the value of a user-defined variable that may be changing as lines are getting processed.
- Test the value of an internal variable, like NR or NF.
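As a rough sketch (the specific files, columns, and values are made up for illustration), each kind of test looks something like this:

# match a regular expression anywhere in the line
awk '/^@SQ/ {print}' file

# match a regular expression only within a specific column (the ~ operator)
awk '$3 ~ /NC_/ {print}' file

# test a column for equality to a string or to a number
awk '$3 == "Fall" {print}' file
awk '$2 == 25 {print}' file

# compare a column to a string or number with <, <=, >, >=, or !=
awk '$7 > 100 {print}' file
awk '$1 != "vcf_name" {print}' file

# test the value of a user-defined variable that changes as lines are processed
awk '{n++} n <= 10 {print}' file

# test the value of an internal variable like NR or NF
awk 'NR > 1 {print}' file
awk 'NF == 8 {print}' file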
As these tests are effectively things that return a value of TRUE or FALSE, they can
be combined with logical AND and logical OR operators. In awk, the logical AND is
&& and the logical OR is ||. To make the intent of long combinations clear, and to
specify particular combinations, the tests can be grouped with parentheses:
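For example, a sketch using the Chinook sample sheet (the choice of columns and values here is just illustrative):

# print lines past the header whose population (column 6) is either Salmon River collection
awk -F"\t" 'NR > 1 && ($6 == "Salmon River Fall" || $6 == "Salmon River Spring") {print}' data/wgs-chinook-samples.tsv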
6.1.5 Code in the action blocks
Within the action blocks you write computer code to do things, and awk has many features
you expect in a programming language.
Separate lines of code can be ended with a line return (i.e., in scripts), or they can be
ended with a semicolon.
Variable assignment is done with the = sign. Unlike the bash shell, you can have spaces around
the =. One interesting aspect of awk is that the variables are untyped. This means that you
don’t have to tell awk ahead of time whether a variable is going to hold a number, or a string, etc.
Everything is stored as a string, but when used in a numeric context, if it makes sense, the
variable will be treated as a number. So, you don’t have to worry too much about the type of variables.
If a value has not yet been assigned to a variable, but it is used in a numeric
context, the value is assumed to be 0; if used in a string context, its value
is assumed to be the empty string "".
You can use for loops within awk. They have the syntax of the C language:
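# the general form: run the initialization, then repeat the action while the test
# is true, doing the increment after each pass
for(initialization; test; increment) action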
For example, this cycles i over values starting from 1, until 10, each time incrementing the value by 1:
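for(i=1; i<=10; i++) print i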
Note that the ++ means “add one to the variable to my left.”
You can also increment by larger amounts. For example, what do you think this would do?
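for(i=1; i<=10; i+=2) print i   # (an illustrative loop; try to predict its output)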
If the body of the loop consists of multiple lines of code (or even just one), those lines
can be grouped using curly braces.
Here is an example of using a for loop to print the columns of the Chinook sample sheet
in row format:
awk -F"," '
NR == 1 {for(i=1;i<=NF;i++) head[i] = $i; next}
{for(i=1;i<=NF;i++) {
print i ". " head[i] ": " $i;
}
print "----------------------"
}
' data/wgs-chinook-samples.csv
Arrays in awk are implemented as associative arrays. Thus, rather than being arrays
with elements indexed (and accessed) by natural numbers (i.e., array[1], array[2], etc.),
arrays are indexed by any string. This is sometimes called a hash; in Python it is
called a dictionary. So, within an awk action block you could see
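n[$3]++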
which means “find the element of the array n that is associated with the key $3, and then
add 1 to that element.” This can be quite useful for counting up the occurrences of different
strings or values in a column. Of course, in order to actually see what the values are, after they
have been counted up, you need to be able to cycle over all the different keys of the array,
and print the key and the value for each. In awk, you cycle over the keys of an array using
for (i in array)
where for and in are keywords, i is any variable name you want, and array is the
name of an array variable. Unfortunately, you have no control over the order in which the
different keys in the array are visited! Example:
# count the number of fish from different sampling collections
# in the wgs-chinook-samples.tsv file:
awk -F"\t" '
NR > 1 {n[$6]++}
END {for(i in n) print i ":", n[i]}
' data/wgs-chinook-samples.tsv
# gives us:
Coleman Hatchery Late Fall: 16
Feather River Hatchery Spring: 16
Feather River Hatchery Fall: 16
Salmon River Spring: 16
Trinity River Hatchery Fall: 16
Trinity River Hatchery Spring: 16
Butte Creek Spring: 16
Sacramento River Winter: 16
Salmon River Fall: 16
San Joaquin River Fall: 16
The above demonstrates the use of the very important END specifier in the
test position. The END there means “perform the actions in the action block
once all the lines of the file have been processed.”
Now it’s your turn!
We want to cycle through the file DPCh_plate1_F12_S72.R1.fq.gz and
count up the number of times different base-quality sequences occur in the file. The
base quality scores occur on lines 4, 8, 12, … (so the line number, divided by 4, has
no remainder; note that x % 4 gives the remainder when x is divided by 4). We want the
output on each line to be in the format:
Number_of_occurrences Base_quality_score_sequence
After doing that, you might wish that things were sorted differently, like highest to lowest in terms of number of occurrences. Try piping the output into:
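sort -nr     # a reasonable choice here: sort numerically on the leading count, largest first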
Pipe that to less and look through it. It is actually pretty cool.
Conditional tests use if and else, with blocking by curly braces.
Below we demonstrate the use of if and else, and how they can be put together
to make an if…else if…else construction. At the same time we demonstrate how
output from print statements in awk can be redirected to files from within
the awk script itself. This uses much the same syntax as the shell.
Imagine that we want to put the IDs (called the vcf_names) of the fish
in data/wgs-chinook-samples.csv into separate files, one for each
of the run_types of “Fall”, “Winter”, and “Spring”, and another file
for anything else. We could do that like this:
awk -F"," '
NR > 1 {
if($NF == "Fall") {
print $1, $NF > "fall.txt"
} else if($NF == "Winter") {
print $1, $NF > "winter.txt"
} else if($NF == "Spring") {
print $1, $NF > "spring.txt"
} else {
print $1, $NF > "other.txt"
}
}
' data/wgs-chinook-samples.csv
Now look at the four files produced.
NB: Parsing CSV files with awk is not always as straightforward as doing -F"," because
the CSV specification allows for commas that do not separate fields to be hidden within quotation marks.
Don’t expect complex CSV files made by Excel, for example, to be parsed this easily
with awk. It is typically better to save them as TAB-separated files.
Mathematical operations: awk has got them all: +, -, *, /, and %, as well as the “operate and reassign”
versions +=, -=, *=, /=, and the built-in math functions exp, log, sqrt, sin, cos, and atan2.
Built-in functions: awk has a limited set of built-in functions. See man awk for a full
listing. The ones I find I use all the time are:
- length(n): return the number of characters of the variable n
- substr(s, m, n): return the portion of string s that begins at position m (counted starting from 1) and is n characters long
- sub(r, t, s): substitute string t for the first occurrence of the regular expression r in the string s. If s is not given, $0 is used.
- gsub(r, t, s): just like sub but replaces all (rather than just the first) occurrences of the regular expression r in string s
- split(s, a, fs): split the string s into an array variable named a at occurrences of the regular expression fs. The function returns the number of pieces the string was split into. Afterward each part of the string can be accessed like a[1], a[2], and so forth. (A good example is splitting out fields from a column in a VCF file; see the sketch below.)
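As a rough sketch of that last use case (the file name here is hypothetical, and column 10 is assumed to hold the colon-separated data of the first sample in the VCF), split can pull the genotype out of the sample field:

# print CHROM, POS, and the genotype (the first colon-separated piece of column 10)
awk -F"\t" '!/^#/ {n = split($10, parts, ":"); print $1, $2, parts[1]}' file.vcf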
Take control of output formatting with the printf() function: I still need to write this.
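In the meantime, a minimal sketch of the idea: printf takes a format string with placeholders such as %s (string), %d (integer), and %f (floating point), and, unlike print, it adds no newline unless you include \n yourself. For example, using the Chinook sample sheet:

# print each sample name and its DNA concentration to two decimal places
awk -F"\t" 'NR > 1 {printf("%s\t%.2f\n", $1, $7)}' data/wgs-chinook-samples.tsv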
6.1.6 Using awk to assign to shell variables
We leave our discussion of awk by noting that its terse syntax makes it
a perfect tool for creating shell variables that we can then do things with
(like cycling over them).
For example, imagine we wanted to do something to all the FASTQ files associated with
the fish in data/wgs-chinook-samples.csv that are of run_type = “Winter”. We can
get all their IDs using command substitution and then cycle over them in the shell like this:
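WINTERS=$(awk -F"," '$NF == "Winter" {print $1}' data/wgs-chinook-samples.csv)
for i in $WINTERS; do
  echo "FASTQS are: $i.R1.fq.gz $i.R2.fq.gz"
done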
6.2 sed
sed is a “stream editor”. Though it is capable of all sorts of things,
to be honest, I use it almost exclusively to do simple find-and-replace
operations. The syntax for that is:
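sed 's/regex/replacement/g' file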
where regex is a regular expression and replacement is the string with which you wish to replace
any segments of file that match regex. The s means “substitute” and is like
a command to sed, and the g means “globally”, without which only the first
match of regex on each line would be replaced.
Multiple instances of the “s” command can be given, like:
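For example (a hypothetical pair of substitutions; each one can also be given with its own -e option):

sed 's/regex1/replacement1/g; s/regex2/replacement2/g' file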
The separate commands are done in order, line by line.
Now it’s your turn!
Take the output of:
WINTERS=$(awk -F"," '$NF == "Winter" {print $1}' data/wgs-chinook-samples.csv)
for i in $WINTERS; do
echo "FASTQS are: $i.R1.fq.gz $i.R2.fq.gz"
done
and pipe it into sed to change the R1 and R2 into r1 and r2,
and to remove the .gz from the end of each file name. Remember to
backslash-escape that period!