Chapter 6 Sed, awk, and regular expressions

In the course of doing bioinformatics, you will be dealing with myriad different text files. As we noted in the previous chapters, Unix, with its file I/O model, piping capabilities, and numerous utilities, is well-suited to handling large text files. Two utilities found on every Unix installation—awk and sed—merit special attention in this context. awk is a lightweight scripting language that lets you write succinct programs to operate line-by-line on the contents of text files. It is particularly useful for handling text files that have columns of data separated by whitespace or tabs. sed, on the other hand, is particularly useful for automating “find-and-replace” operations on text files. Each of them is optimized to handle large files without storing a lot of information in memory, so they can be useful for quick operations on large bioinformatic data sets. Neither is a fully-featured programming language that you would want to write large, complex programs in (that said, I did once implement a complete program for full-sibling inference from multiallelic markers in awk); however, they share many of the useful text-manipulation capabilities of such languages, like Perl and Python. Additionally, awk and sed are deployed in a consistent fashion across most Unix operating systems, and they don’t require much time to learn to use effectively for common text-processing tasks. As a consequence, awk and sed are a useful addition to the bioinformatician’s toolbox.

Both awk and sed rely heavily on regular expressions to describe patterns in text upon which some operation should be performed. You can think of regular expressions as providing a succinct language for performing very advanced “find” and “find-and-replace” operations in a text file.

In this chapter we will only scratch the surface of what can be done with awk and sed. Indeed, there is an entire 432-page book published decades ago by O’Reilly about awk and sed. Our goal here is to provide an introduction to a few basic maneuvers with both awk and sed and to describe instances where they can be useful, as well as to give an introduction to many (but certainly not all) of the patterns that can be expressed using regular expressions. We will start with a basic overview of how awk works. Then we will have a short look at regular expressions, then we will use those further in awk, and finally we will play with sed a little bit.

In order to have a set of files to use in the examples, I have made a small GitHub repository called awk-and-sed-inputs. You can get that with

git clone https://github.com/eriqande/awk-and-sed-inputs

All of the examples in this chapter that use such external files assume that the current working directory is the repository directory awk-and-sed-inputs.

6.1 awk

6.1.1 Line-cycling, tests and actions

awk operates by cycling through a file (or an input stream from stdin) line by line. At each new line, awk tests whether it should do anything with the contents of the line. If the answer to that test is “yes”, then it executes the code that describes what should be done to the contents of the line. The scope of ways that awk can operate on text is quite wide, but the most common use of awk is to print parts of the line, or to do small calculations on parts of the line. The instructions that describe the tests and the actions are included in the awk “script”, though in practice this script is often given to awk not as a file but, rather, as text between single quotes on the command line. The basic syntax looks like this:

% awk '
  test1  {action1}
  test2  {action2}
' file

Where file is the path of the file you want awk to read, and the “tests” and “actions” above are just placeholders showing where those parts of the script are written. The layout makes it clear that if test1 is TRUE, action1 will be executed, and if test2 is TRUE, then action2 is executed. Code describing the actions must always appear within a set of curly braces. We will refer to the text within those curly braces as an action block.

Two things to note: first, different tests and actions do not have to be written on separate lines. That makes things easier to read, but if you are writing these short scripts on the command line, it is often easier to put everything on a single line, like 'test1 {action1} test2 {action2}', and that is fine. Second, if you don’t supply a file path to awk, it expects (and gladly processes) data from stdin.
This makes it easy to pipe data into awk. For example, putting the above two points together, the above awk script skeleton could have been written this way:

% cat file | awk 'test1  {action1} test2  {action2}'

Before we get too abstract talking about tests and actions, let’s look at a few examples:

# print lines starting with @SQ in the SAM header of a file
awk '/^@SQ/ {print}' data/DPCh_plate1_F12_S72.sam

# print only those @SQ lines with a sequence name tag starting with "NC_"
awk '/^@SQ/ && /SN:NC_/ {print}' data/DPCh_plate1_F12_S72.sam

# same as above, but quit processing the file as soon as you hit
# an @SQ line with a sequence name starting with "NW_"
awk '
  /^@SQ/ && /SN:NC_/ {print}
  /^@SQ/ && /SN:NW_/ {exit}  
' data/DPCh_plate1_F12_S72.sam


# print only lines 101 and 102 from the fastq file
gzcat data/DPCh_plate1_F12_S72.R1.fq.gz | awk 'NR==101 || NR==102 {print}'

The above examples show only two different actions (print and exit) and a variety of tests, based either on matching substrings in each line or on the line number (NR, which stands for “number of the record”).

Take a moment to make sure you understand which parts are the tests, and which are the actions in the above examples.

6.1.2 Column splitting, fields, -F, $, NF, print, OFS and BEGIN

Every time awk processes a line of text it breaks it into different fields, which you can think of as columns (as in a spreadsheet). By default, any number of whitespace (space or TAB) characters constitutes a break between columns. If you want field-splitting to be done using more specific characters, you can specify that on the command line with the -F option. For example:

awk -F"\t" 'test {action}' file  # split lines into fields on single TAB characters
awk -F"," 'test {action}' file  # split lines into fields on single commas
awk -F":" 'test {action}' file  # split lines into fields on single colons
awk -F";" 'test {action}' file  # split lines into fields on single semicolons
# and so forth...

While in all those examples, the field separator is a single character, in full-blown practice, the field separator can be specified as a regular expression (see below).
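For example, a bracketed character class as the field separator lets a single -F setting handle several different delimiters at once. A small sketch (the input line here is made up purely for illustration):

echo "a1,b2;c3" | awk -F"[,;]" '{print $2}'   # prints: b2
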

Within an action block, you can access the different fields that awk has split out of the line using a $ followed immediately by a field number, i.e., $1, $2, $3, … . This works for any field number, like $10 or $11; however, if you want to compute the field number with an expression, you must wrap the expression in parentheses, like $(NF-1) for the second-to-last field (without the parentheses, $NF-1 means “1 subtracted from the value of the last field”). The special value $0 denotes the whole line, not just one of its fields. (Note that if you give the action print without any arguments, that also just prints the whole line.) The special variable NF is the “Number of Fields”, so $NF is the “last column.”
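
As a quick illustration of these (with a made-up input line supplied by echo):

echo "alpha beta gamma" | awk '{print $1}'    # first field:  alpha
echo "alpha beta gamma" | awk '{print $NF}'   # last field:   gamma
echo "alpha beta gamma" | awk '{print $0}'    # whole line:   alpha beta gamma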

Let’s say we have a tab-separated file in data/wgs-chinook-samples.tsv that looks like this:

vcf_name    ID_Berk NMFS_DNA_ID BOX_ID  BOX_POSITION    Population  Concentration (ng/ul)   run_type
DPCh_plate1_A01_S1  CH_Plate1_1A    T144767 T1512   1A  Salmon River Fall   23.4    Fall
DPCh_plate1_A02_S2  CH_Plate1_2A    T144804 T1512   5F  Salmon River Fall   67.4    Fall
DPCh_plate1_A03_S3  CH_Plate1_3A    T145109 T1515   7G  Salmon River Spring 3.52    Spring
DPCh_plate1_A04_S4  CH_Plate1_4A    T145118 T1515   8H  Salmon River Spring 10.3    Spring
DPCh_plate1_A05_S5  CH_Plate1_5A    T144863 T1513   1A  Feather River Hatchery Fall 220 Fall

Then, if we wanted to print the first and the last columns (fields) we could do

awk -F"\t" '{print $1, $NF}' data/wgs-chinook-samples.tsv

# the first few lines of output look like:
vcf_name run_type
DPCh_plate1_A01_S1 Fall
DPCh_plate1_A02_S2 Fall
DPCh_plate1_A03_S3 Spring
DPCh_plate1_A04_S4 Spring

Notice that if there is no test before the action block, then the action is done on every line.

The print command prints variables to stdout. If you separate those variables with a comma, then in the output they will be separated by the output field separator, which, by default, is a single space. You can set the output field separator using the OFS variable, in an action block that is forced to run at the beginning of execution using the special BEGIN test keyword:

awk -F"\t" '
  BEGIN {OFS=";"}
  {print $1, $NF}
' data/wgs-chinook-samples.tsv

# makes output like:
vcf_name;run_type
DPCh_plate1_A01_S1;Fall
DPCh_plate1_A02_S2;Fall
DPCh_plate1_A03_S3;Spring
DPCh_plate1_A04_S4;Spring
DPCh_plate1_A05_S5;Fall

The comma between the arguments to print is what makes awk print the output field separator between the items. If you separate arguments to print with just a space (and not a comma), there will be nothing printed between the two arguments on output. This, coupled with the fact that print will happily print any strings and numbers (in addition to variables!) you pass to it, provides a way to do some light formatting of the output text:

awk -F"\t" 'NR > 1 {print "sample_name:" $1, "run_type:" $NF}' data/wgs-chinook-samples.tsv

# gives output like:
sample_name:DPCh_plate1_A01_S1 run_type:Fall
sample_name:DPCh_plate1_A02_S2 run_type:Fall
sample_name:DPCh_plate1_A03_S3 run_type:Spring
sample_name:DPCh_plate1_A04_S4 run_type:Spring
sample_name:DPCh_plate1_A05_S5 run_type:Fall

Note that if you don’t provide an action block after a test, the default action, if the test is true, is assumed to be “print the whole line as is.” Thus you can use awk to print matching lines simply like this:

awk '/regex/'

where /regex/ just means “some regular expression,” as is explained in the next section.
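
For example, the first example from this chapter could be shortened to:

# print lines starting with @SQ, relying on the default action
awk '/^@SQ/' data/DPCh_plate1_F12_S72.sam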

Now it’s your turn!

# 1. Using the file data/wgs-chinook-samples.csv, print out the
# NMFS_DNA_ID the BOX_ID and the BOX_POSITION, separated by periods

6.1.3 A brief introduction to regular expressions

In a few of the above examples you will see tests that look like: /^@SQ/. This is a regular expression. In awk, regular expressions are enclosed in forward slashes, so the actual regular expression part in the above is ^@SQ; the enclosing forward slashes are just delimiters telling awk, “Hey! The stuff inside here should be interpreted as a regular expression.”

At this stage, you can think of a regular expression as a “search-string” that awk will try to match against the text in a line. At its simplest, a regular expression just describes how characters should match between the regular expression and the line being matched. For example:

/ACGGTC/ 

Is saying, “search for the word ACGGTC,” which is something that you might find in a DNA string.

If all that regular expressions did was express a search word (like the familiar find function in Microsoft Word, for example), then they would be very easy to learn, but also very limited in utility. The good news is that all your standard numerals and upper- and lowercase letters work in regular expressions just like they do in your vanilla “find” function. So, all of the following regular expressions are just requesting that the word enclosed in the /’s be found:

/Fall/
/A01/
/plate/
/LN/

Regular expressions get more complicated (and useful) because some of the familiar punctuation marks have special meaning within a regular expression. These are called metacharacters. The fundamental metacharacters in awk are listed in Table 6.1. It is worth getting familiar with all of these, as most are common to all languages that use regular expressions, such as R, Python, and Perl; learning them will be helpful not just in using awk, but also in your programming in general.

TABLE 6.1: Basic metacharacters used in regular expressions with awk.
Metacharacter(s) Description
. Match any single character
[ ] Match any single character from a character class (see below)
^ Signifies/matches the beginning of the line or word (remember /^@SQ/)
$ Signifies/matches the end of the line or word
* Match zero or more occurrences of the character (or grouped pattern) immediately to the left
? Match zero or one occurrences of the character (or grouped pattern) immediately to the left
+ Match one or more occurrences of the character (or grouped pattern) immediately to the left
{n} Match n occurrences of the character (or grouped pattern) immediately to the left
{m,n} Match any number between m and n occurrences of character (or grouped pattern) immediately to the left
| Combine regular expressions with an OR
\ Use it if you want to match a literal metacharacter. For example \. matches an actual period, and \? matches an actual question mark
( ) Used to group characters into a single unit to which modifiers like * or + can be applied, to delimit the extent of |’s (or to be used in more advanced expressions for inserting replacement groups)

We note in the above table that more explanation is needed for the concept of character classes defined by [ ]. These are similar to what you are already familiar with in terms of globbing file names. In short, if you put a variety of characters between square brackets, it means “match any one of these characters.” For example, /[aceACE]/ means “match any upper- or lowercase a, c, or e.” Within those square brackets, ^ and - have special meaning depending on where between the square brackets they occur (a few quick examples follow the list below):

  • A -, when between letters or numbers, indicates a range: /[a-zA-Z]/ means any letter; /[0-5]/ means any numeral between 0 and 5.
  • A - at the beginning or end of the characters inside the square brackets just means -: /[-;ab]/ means match any of the characters -, ;, a, or b.
  • A ^ at the beginning of the characters inside the square brackets negates the character class, meaning the match will be to anything not within the character class, i.e., [^ABC] means match any character that is not A, B, or C. If the ^ is not at the beginning of the characters within the [ ] then it carries no special meaning: [:;^%] matches any of those four punctuation characters.
  • All characters, except - and ^ in the correct positions, and the backslash \, are interpreted literally (i.e., not as metacharacters) within the [ ]. So you can match any one of ?, *, (, ), or +, for instance, with /[?*()+]/.
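
Here is a quick sketch of those rules in action (the input lines are made up for illustration):

# a negated class: match lines containing any non-DNA-base character
echo "ACGTNACGT" | awk '/[^ACGT]/'    # prints the line (the N matches)
echo "ACGTACGT"  | awk '/[^ACGT]/'    # prints nothing

# a literal dash at the start of a class
echo "T144767-T1512" | awk '/[-:]/'   # matches: the line contains a dash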

All these metacharacters might seem a little cryptic, so here are a few examples that might help to make things more clear. Each is presented, with a comment describing what it matches (e.g., “match any line…”), as if it were an awk test.

# match any line that starts with @RG
/^@RG/

# match any line that starts with @ followed by any two characters
/^@../

# match any line that starts with @ followed by any
# two uppercase letters
/^@[A-Z][A-Z]/

# the above could also be written as:
/^@[A-Z]{2}/

# This matches phone numbers formatted either like 
# (###) ###-####, or ###-###-####
/\(?[0-9]{3}(\) |-)[0-9]{3}-[0-9]{4}/

# And if you can parse that out, you get an A for the day!

# match anything that starts with T456, then has any
# number (including 0) of any other characters, then
# ends with ".fq"
/T456.*\.fq/

# Note the use of backslash to escape the .

# Match either "big mess" or "huge mess"
/(big|huge) mess/

Now it’s your turn

Here are some things to try.

# 1. use alternation (the |) to match lines in the SAM file data/DPCh_plate1_F12_S72.sam
# that start with either @PG or @RG


# 2. search through the fastq file data/DPCh_plate1_F12_S72.R1.fq.gz 
# to find any line with exactly two consecutive occurrences of any of the following characters:
# ! " # $ % &


# What are we searching for in the above?

6.1.4 A variety of tests

When you want to test whether a line will be processed by an action block, or not, you have many options. Among the major ones (each given with an example or two) are:

  1. match a regular expression anywhere in a line:

    awk '/^@RG/'
    awk '/Fall/'
  2. match a regular expression in a specific column/field

    awk '$3 ~ /Sacramento/'
    awk '$1 ~ /Chr[123]/'
  3. test for equality, using ==, of a single column to a string, or a number. Note these are not regular expressions, but actual statements of equality. Strings in such a context are surrounded by double quotes.

    awk '$1 == "NC_07124.1"'
    awk '$5 == 100'
  4. Use comparison operators <, <=, >, >=, and != (not equals) to compare a column to a string (compared by lexicographical order) or a number (compared by numerical order)

    awk '$2 <= "Aardvark"'
    awk '$7 > 25'
  5. test the value of a user-defined variable that may be changing as lines are getting processed.

    awk 'n > 356'
    awk 'my_word == "Loony"'
  6. test the value of an internal variable, like NR or NF

    awk 'NR < 25'
    awk 'NF == 13'

As these tests are effectively things that return a value of TRUE or FALSE, they can be combined with logical AND and logical OR operators. In awk, the logical AND is && and the logical OR is ||. To make the intent of long combinations clear, and to specify specific combinations, the tests can be grouped with parentheses:

awk '(units == "days" && n > 365) || (units == "months" && n > 12) {print "More than a year!"}'

6.1.5 Code in the action blocks

Within the action blocks you write computer code to do things, and awk has many features you expect in a programming language.

Separate lines of code can be ended with a line return (i.e., in scripts), or they can be ended with a semicolon. Variable assignment is done with the = sign. Unlike in the bash shell, you can have spaces around the =. One interesting aspect of awk is that its variables are untyped. This means that you don’t have to tell awk ahead of time whether a variable is going to hold a number, or a string, etc. Everything is stored as a string, but when a variable is used in a numeric context, if it makes sense, it will be treated as a number. So, you don’t have to worry too much about the type of variables.

If a value has not yet been assigned to a variable, but it is used in a numeric context, the value is assumed to be 0; if used in a string context, its value is assumed to be the empty string "".
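
A minimal illustration of those defaults, using a BEGIN block so that no input file is needed:

awk 'BEGIN {
  print n + 5      # n is unset, so it is 0 here: prints 5
  print "[" s "]"  # s is unset, so it is "" here: prints []
}'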

You can use for loops within awk. They have the syntax of the C language:

for(var = initial; test; increment)

For example, this cycles i over values starting from 1, until 10, each time incrementing the value by 1:

for(i=1;i<=10;i++)

Note that the ++ means “add one to the variable to my left.”

You can also increment by larger amounts. What do you think this would do?

echo boing | awk '{for(i=5; i<=25; i+=5) print i}' 

If the body of the loop consists of multiple lines of code (or even just one) those lines can be grouped using curly braces. Here is an example of using a for loop to print the columns of the Chinook sample sheet in row format:

awk -F"," '
  NR == 1 {for(i=1;i<=NF;i++) head[i] = $i; next}
  {for(i=1;i<=NF;i++) {
      print i ". " head[i] ": " $i; 
    }
    print "----------------------"
  }
' data/wgs-chinook-samples.csv

Arrays in awk are implemented as associative arrays. Thus, rather than being arrays with elements indexed (and accessed) by natural numbers (i.e. array[1], array[2], etc.), arrays are indexed by any string. This kind of structure is sometimes called a hash; in Python it is called a dictionary. So, within an awk action block you could see

n[$3]++

which means “find the element of the array n that is associated with the key $3, and then add 1 to that element”. This can be quite useful for counting up the occurrences of different strings or values in a column. Of course, in order to actually see what the values are, after they have been counted up, you need to be able to cycle over all the different keys of the array, and print the key and the value for each. In awk, you cycle over the keys of an array using

for (i in array)

where for and in are keywords, i is any variable name you want, and array is the name of an array variable. Unfortunately, you have no control over the order in which the different keys in the array are visited! Example:

# count the number of fish from different sampling collections
# in the wgs-chinook-samples.tsv file:
awk -F"\t" '
  NR > 1 {n[$6]++}
  END {for(i in n) print i ":", n[i]}
' data/wgs-chinook-samples.tsv 

# gives us:
Coleman Hatchery Late Fall: 16
Feather River Hatchery Spring: 16
Feather River Hatchery Fall: 16
Salmon River Spring: 16
Trinity River Hatchery Fall: 16
Trinity River Hatchery Spring: 16
Butte Creek Spring: 16
Sacramento River Winter: 16
Salmon River Fall: 16
San Joaquin River Fall: 16

The above demonstrates the use of the very important END specifier in the test position. The END there means “perform the actions in the action block once all the lines of the file have been processed.”

Now it’s your turn!
We want to cycle through the file DPCh_plate1_F12_S72.R1.fq.gz and count up the number of times different base-quality sequences occur in the file. The base quality scores occur on lines 4, 8, 12, … (so, the line number divided by 4 has no remainder). Note that x % 4 gives the remainder when x is divided by 4. We want the output on each line to be in the format:

Number_of_occurrences  Base_quality_score_sequence
# go for it!

After doing that, you might wish that things were sorted differently, like highest to lowest in terms of # of occurrences. Try piping the output into:

sort -n -b -r -k 1

Pipe that to less and look through it. It is actually pretty cool.

Conditional tests use if and else, with blocking by curly braces.
Below we will demonstrate the use of if and else, and how they can be put together to make an if … else if … else construction. At the same time we demonstrate how output from print statements in awk can be redirected to files from within the awk script itself, using much the same redirection syntax as the shell.

Imagine that we want to put the IDs (called the vcf_names) of the fish in data/wgs-chinook-samples.csv into separate files, one for each of the run_types of “Fall”, “Winter”, and “Spring”, and another file for anything else. We could do that like this:

awk -F"," '
  NR > 1 {
    if($NF == "Fall") {
      print $1, $NF > "fall.txt"
    } else if($NF == "Winter") {
      print $1, $NF > "winter.txt"
    } else if($NF == "Spring") {
      print $1, $NF > "spring.txt"
    } else {
      print $1, $NF > "other.txt"
    }
  }
' data/wgs-chinook-samples.csv

Now look at the four files produced.

NB: Parsing CSV files with awk is not always as straightforward as supplying -F"," because the CSV specification allows commas that do not separate fields to be hidden within quotation marks. Don’t expect complex CSV files made by Excel, for example, to be parsed this easily with awk. It is typically better to save them as TAB-separated files.

Mathematical operations: awk has them all: +, -, *, /, and %, as well as “operate and reassign” versions (+=, -=, *=, /=), and the functions exp, log, sqrt, sin, cos, and atan2.
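
A BEGIN block is a handy place to try these out (this little example is purely illustrative):

awk 'BEGIN {
  x = 10
  x += 5         # operate and reassign: x is now 15
  print x % 4    # modulo: prints 3
  print sqrt(x)  # prints 3.87298
  print log(x)   # natural logarithm: prints 2.70805
}'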

Built-in functions: awk has a limited set of built-in functions. See man awk for a full listing. The ones I find I use all the time are listed below (a short demonstration follows the list):

  • length(n) : return the number of characters of the variable n
  • substr(s, m, n) : return the portion of string s that begins at position m (counted starting from 1), and is n characters long
  • sub(r, t, s) : substitute string t for the first occurrence of the regular expression r in the string s. If s is not given, $0 is used.
  • gsub : just like sub but replaces all (rather than just the first) occurrences of the regular expression r in string s.
  • split(s, a, fs) : split the string s into an array variable named a at occurrences of the regular expression fs. The function returns the number of pieces the string was split into. Afterward each part of the string can be accessed like a[1], a[2], and so forth. (Good example = splitting out fields from a column in a VCF file.)
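
Here is a short demonstration of most of these (the colon-separated string is made up, loosely in the style of a field from a VCF file):

awk 'BEGIN {
  s = "NC_031965.1:AA:0/1"
  print length(s)         # prints 18
  print substr(s, 1, 11)  # prints NC_031965.1
  n = split(s, a, ":")    # n gets 3; a[1], a[2], a[3] hold the pieces
  print n, a[3]           # prints 3 0/1
  gsub(/:/, "-", s)       # replace all colons in s with dashes
  print s                 # prints NC_031965.1-AA-0/1
}'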

Take control of output formatting with the printf() function:
printf works much like it does in the C language (or in the shell’s printf utility): the first argument is a format string containing conversion specifications like %s (a string), %d (an integer), and %f (a floating-point number), and the remaining arguments are the values to be dropped into those slots. Unlike print, printf does not automatically add a line ending, so you will typically want to end the format string with \n.
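
For example, here is a minimal sketch (the field widths are chosen arbitrarily) that prints the sample names and DNA concentrations from the tab-separated sample sheet in neatly aligned columns:

awk -F"\t" 'NR > 1 {printf("%-20s %6.1f\n", $1, $7)}' data/wgs-chinook-samples.tsv

# the first few lines of output look like:
DPCh_plate1_A01_S1     23.4
DPCh_plate1_A02_S2     67.4
DPCh_plate1_A03_S3      3.5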

6.1.6 Using awk to assign to shell variables

We leave our discussion of awk by noting that its terse syntax makes it a perfect tool for creating shell variables that we can then do things with (like cycling over them).

For example, imagine we wanted to do something to all the FASTQ files associated with the fish in data/wgs-chinook-samples.csv that are of run_type = “Winter”. We can get all their IDs using command substitution and then cycle over them in the shell like this:

WINTERS=$(awk -F"," '$NF == "Winter" {print $1}' data/wgs-chinook-samples.csv)

for i in $WINTERS; do
  echo "FASTQS are: $i.R1.fq.gz     $i.R2.fq.gz"
done

6.1.7 Passing Variables into awk with -v

Because the awk script is typically enclosed in single quotes, the shell will not expand any shell variables inside it. If you want to use the value of a shell variable within an awk script, the cleanest approach is the -v option: awk -v var=value assigns value to the awk variable var before the script starts running.
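
For example, a quick sketch using the Chinook sample sheet from earlier:

# pass the value of the shell variable RT into the awk variable rt
RT="Winter"
awk -F"," -v rt="$RT" '$NF == rt {print $1}' data/wgs-chinook-samples.csv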

6.1.8 Writing awk scripts in files

Once an awk script grows beyond a line or two, it can be nicer to store it in its own file and pass that file to awk with the -f option, rather than quoting the whole script on the command line.
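
For example, the run-type counting script from above could be stored in a file named, say, count-run-types.awk (a hypothetical name):

# contents of count-run-types.awk
NR > 1 {n[$6]++}
END {for(i in n) print i ":", n[i]}

and then run like this:

awk -F"\t" -f count-run-types.awk data/wgs-chinook-samples.tsv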

6.2 sed

sed is a “stream editor”. Though it is capable of all sorts of things, to be honest, I use it almost exclusively to do simple find-and-replace operations. The syntax for that is:

sed 's/regex/replacement/g;' file

where regex is a regular expression and replacement is the text with which you wish to replace any segments of file that match regex. The s means “substitute” and is like a command to sed, and the g means “globally”; without it, only the first match of regex on each line would be replaced.
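
For example (a made-up line piped in with echo, to show the effect of the g):

echo "fall run, fall fish" | sed 's/fall/spring/'    # gives: spring run, fall fish
echo "fall run, fall fish" | sed 's/fall/spring/g'   # gives: spring run, spring fish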

Multiple instances of the “s” command can be given, like:

sed '
  s/regex1/replacement1/g;
  s/regex2/replacement2/g;
  s/regex3/replacement3/g;
' file

The separate commands are done in order, line by line.
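
For example, here is a small sketch that abbreviates a couple of words in the sample sheet (the abbreviations are chosen purely for illustration):

sed '
  s/Hatchery/H./g;
  s/River/R./g;
' data/wgs-chinook-samples.tsv

# in the output, "Feather River Hatchery Fall" becomes "Feather R. H. Fall"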

Now it’s your turn!
Take the output of:

WINTERS=$(awk -F"," '$NF == "Winter" {print $1}' data/wgs-chinook-samples.csv)

for i in $WINTERS; do
  echo "FASTQS are: $i.R1.fq.gz     $i.R2.fq.gz"
done

and pipe it into sed to change the R1 and R2 into r1 and r2, and to remove the .gz from the end of each file name. Remember to backslash-escape that period!