Chapter 5 Shell programming

In our first foray into Unix and the shell, we restricted ourselves mostly to navigating the file system, handling files, and working with streams of data (via redirection and pipes). These are all crucial skills, but the bash shell becomes truly powerful when we start to adopt it as a sort of programming language. That’s right, even though the functionality of bash is geared toward running jobs and calling commands, it still exhibits most of the features expected in a programming language, like variables, iteration and flow control.

5.1 An example script

We start this chapter by taking a look at a short bash program (typically called a script) that the author wrote in order to efficiently download (clone, really) repositories from GitHub that have been submitted by students to GitHub Classroom. The program is not long, but it exhibits many useful features of bash as a programming language. If you are reading this outside the context of a class with lectures, just go ahead and read through it and see if you can figure out what is going on in each line of the script. Afterward, we will address many features of bash by referencing different parts of the script.

The script, which is stored in a file of its own, is printed below with line numbers, since we will be referring back to specific sections of the script later.


# define a function to print the usage or "help" for the script
function usage {
      echo Syntax:
      echo "  $(basename $0)  GH_Prefix  Repo_Prefix  Branch  Dir
      GH_Prefix: the URL of the GitHub site where the repository exists.
      Repo_Prefix: the prefix of the name of each repository to be cloned.
      Branch: the name of the branch to create and switch to in the repository,
         once the repo has been cloned.
      Dir: path to the directory (will be created if necessary) to clone all
         the repositories to.
      $(basename $0)  illumina-video-questions- erics-edits  /tmp/illumina-questions
      "
}

#  test for right number of required args.  If not, print usage message
if [ $# -ne 4 ]; then
    usage
    exit 1
fi

# copy positional parameters into other variables
GHP=$1
RP=$2
BRANCH=$3
DD=$4

# assign string with student GitHub handles into a variable

# assign my GitHub username to the variable USER

# assign the current working directory to the variable RUNDIR
RUNDIR=$PWD

# make a new directory named whatever the user wanted for the output directory
mkdir -p $DD

# make variables to hold log and error file names

# print the date/time when the process is starting
echo "STARTING at $(date)"

# make a clean slate. remove any files with the name
# of the error output file
rm -f $ERR

# cycle over the student GitHub names, and for each one *do*
# the commands that appear before the *done* keyword. Indenting
# is used to make it easier to read, but is not essential.
for L in $GHNAMES; do

    echo "Working on $L, starting at $(date)"  # print a progress line to stdout
    REPO=$GHP/${RP}$L     # combine variables into new variables that
    echo $REPO            # hold the URL for the repository to be
    DEST=$DD/$L           # cloned and the path where it should be cloned to
    # store the commands themselves into variables. Note the 
    # use of double quotes.
    CLONE_IT="git clone ${REPO/$} $DEST" 
    BRANCH_IT="git checkout -B $BRANCH"
    PUSH_IT="git push -u origin $BRANCH"
    # now, run those commands, chained together by exit-status-AND
    # operators (so it will stop if any one part fails), while
    # all the while appending error statements to the Error file. Run it
    # all within an "if" statement so you can deliver a report as to
    # whether the whole shebang succeeded or failed.
    if $CLONE_IT 2>> $ERR  && \
        cd $DEST && \
        $BRANCH_IT 2>> $ERR  && \
        $PUSH_IT 2>> $ERR  && \
        cd $RUNDIR   # at the very end make sure to return to the original working directory
    then
        echo "FULL SUCCESS $L"
    else
        echo "FAILURE $L"
        cd $RUNDIR  # get back to the working directory from which the original command was run.
                    # so we are ready to handle the next student repo.
    fi
done  # signifies the end of the for loop we are cycling over

If my current working directory is where the script resides, I can run it like this:

% ./ 

And if I wanted to be fancy, I could put the script in a directory (like ~/bin perhaps) that I have included in my PATH variable. In which case I could run it like:


from anywhere on my computer.

When I run the script in either of those two ways, because I have not provided the proper number of arguments to the command, it returns a message telling me what syntax is required to use it (i.e., its usage syntax):

Syntax:  GH_Prefix  Repo_Prefix  Branch  Dir
      GH_Prefix: the URL of the GitHub site where the repository exists.
      Repo_Prefix: the prefix of the name of each repository to be cloned.
      Branch: the name of the branch to create and switch to in the repository,
         once the repo has been cloned.
      Dir: path to the directory (will be created if necessary) to clone all
         the repositories to.
  illumina-video-questions- erics-edits  /tmp/illumina-questions

That is handy, and the code to do it exists in the script itself. Looking at the output, how many arguments do you think the script is expecting?

Now, if I wanted to clone all of the student GitHub repos associated with the illumina-video-questions homework set, and then, once cloned, set up a new git branch called erics-edits so that I can make edits and/or comments and send those to students via a pull request, here is the command I would give (remembering, again, that the % signifies the command prompt here):

%  illumina-video-questions- erics-edits  /tmp/illumina-questions

And when I do, I see output like this:

STARTING at Thu Feb 13 06:02:05 MST 2020
Working on AmandaCicchino, starting at Thu Feb 13 06:02:05 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS AmandaCicchino
Working on BrennaF, starting at Thu Feb 13 06:02:07 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
Working on CaitlinWells, starting at Thu Feb 13 06:02:08 MST 2020
Working on EllenMCampbell, starting at Thu Feb 13 06:02:09 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
Working on FayDong, starting at Thu Feb 13 06:02:10 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
Working on LibbyGH, starting at Thu Feb 13 06:02:12 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
Working on NathanPhipps, starting at Thu Feb 13 06:02:14 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
Working on RGCheek, starting at Thu Feb 13 06:02:15 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
Working on Ronan17, starting at Thu Feb 13 06:02:17 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
Working on abeulke, starting at Thu Feb 13 06:02:19 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
Working on carolazari, starting at Thu Feb 13 06:02:21 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS carolazari
Working on cbossu, starting at Thu Feb 13 06:02:23 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
Working on ccolumbu, starting at Thu Feb 13 06:02:25 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
Working on elenacorrea, starting at Thu Feb 13 06:02:26 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS elenacorrea
Working on eriqande, starting at Thu Feb 13 06:02:28 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
Working on jenleon07, starting at Thu Feb 13 06:02:29 MST 2020
Working on kimhoke, starting at Thu Feb 13 06:02:30 MST 2020
Working on kruegg, starting at Thu Feb 13 06:02:30 MST 2020
Working on lauracgoetz, starting at Thu Feb 13 06:02:30 MST 2020
Working on mdrod110, starting at Thu Feb 13 06:02:31 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
Working on mgdesaix, starting at Thu Feb 13 06:02:32 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
Working on raven-wings, starting at Thu Feb 13 06:02:34 MST 2020
Working on seamus100, starting at Thu Feb 13 06:02:34 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS seamus100
Working on taylorbobowski, starting at Thu Feb 13 06:02:36 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS taylorbobowski
Working on wcfunk, starting at Thu Feb 13 06:02:37 MST 2020
Branch erics-edits set up to track remote branch erics-edits from origin.

From that report, it is clear that it takes about 1 to 2 seconds to handle each repository. If I were doing each repository by hand (i.e., cloning and branching through a GUI like RStudio’s), each repository would probably take me about 30 seconds to a minute, with a lot of copying and pasting and chance for errors, and I would have destroyed my wrists with all the repetitive tasks. So, this is a HUGE deal.

It is also easy to scan through the results and see, “Holy Moly! These are some dedicated students!” Everyone has successfully submitted their homework repositories (and thus we were SUCCESS-ful in cloning them), except for a handful who were traveling or otherwise occupied and had warned me they wouldn’t be able to do the assignment.

When all is said and done, I have the following git repositories on my laptop which I can peruse at my leisure:

% ls /tmp/illumina-questions/
AmandaCicchino/ FayDong/        RGCheek/        carolazari/     elenacorrea/    mgdesaix/       wcfunk/
BrennaF/        LibbyGH/        Ronan17/        cbossu/         eriqande/       seamus100/
EllenMCampbell/ NathanPhipps/   abeulke/        ccolumbu/       mdrod110/       taylorbobowski/

Additionally, in the directory where I ran the command, I have a file called illumina-video-questions-log.stderr that gives me a more detailed report of things when they worked or failed.

If you are new to Unix, then the above script likely appears a bit daunting. Our goal by the end of the chapter is to have described every little piece of bash syntax needed, so that you will be able to read and understand the above script. You will thus also be in a good position to start writing your own scripts to automate tasks and analyses on your computer.

We will start with an overview of the structure of a script and then delve into specific areas of syntax. For each area of syntax, we will provide some examples, and then leave some openings for you, the reader, to try your own hand at implementing each pattern that you see.

5.2 The Structure of a Bash Script

A bash script is merely a text file that is a collection of different command lines, one after the next, which the bash shell will run in sequence, one after the other. If you want to run the script, you must make sure you have set its permissions to include execute permissions (e.g., by running chmod ug+x on the script file).

It is important to point out however, that the bash programming syntax that we will be describing in this chapter is not solely useful in the context of scripts that are stored in files. Rather, all the programming syntax can still be used directly on the shell command line itself! This means you can employ all the little tricks you will learn in this chapter while directly “hacking away” at the command line. In this context, it is worth noting that if you want to write multiple distinct commands on a single line, as if they were on separate lines, you can separate them with a semicolon, ;. For example:

echo "Put this in a file! (and catenate it later...twice!)" > tmp.txt
cat tmp.txt
cat tmp.txt
## Put this in a file! (and catenate it later...twice!)
## Put this in a file! (and catenate it later...twice!)

Is equivalent to:

echo "Put this in a file! (and catenate it later...twice!)" > tmp.txt; cat tmp.txt; cat tmp.txt
## Put this in a file! (and catenate it later...twice!)
## Put this in a file! (and catenate it later...twice!)

Sometimes when you are writing a script, (or even working on the command line) you might want a very long expression to be treated as being all part of the same command line, even though you would like to break it up over multiple lines. A backslash (\) immediately followed by a line ending (i.e., the “return” key) has the effect of treating the lines that it separates as all being on the same line.

# this:  \  \
    illumina-video-questions- \
    erics-edits  \

# is the same as this:  illumina-video-questions- erics-edits  /tmp/illumina-questions

Using backslashes (and indenting) in this way can sometimes dramatically improve readability of your scripts.
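As a small, self-contained illustration of line continuation, the following sketch stores the output of the same echo command written both ways, so that the two can be compared. The variable names LONG and SHORT are made up for this example and do not appear in the script above, and the $( ) syntax used to capture the output is described later in this chapter.

```shell
# one command, broken over three lines with backslashes
LONG=$(echo one \
    two \
    three)

# the same command written on a single line
SHORT=$(echo one two three)

echo "$LONG"
echo "$SHORT"
## one two three
## one two three
```

Note that nothing, not even a space, may come between the backslash and the line ending.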

You might also notice a number of lines or statements in the script above that start with a #. The # is known as a comment character. When the bash interpreter is reading the script, it ignores the comment character and anything following it until the end of the line. This makes it quite convenient to pepper your scripts with notes to yourselves or others that can be extremely helpful when you come back to a piece of code and are trying to remember what it does! Example:

# this might report an error if there is no file
# named SillyRidiculous.what
ls -l SillyRidiculous.what
## ls: SillyRidiculous.what: No such file or directory

(Note: each line of the output shown above follows two # symbols. This is not commenting. It is just the convention that the ‘bookdown’ package uses to signify that what it is showing you is output, rather than input. No relation, really, to commenting…)

In the script, listed above, on the very top line (line 1) you will see a special statement that follows a # comment character. This is one of the few cases you will see when the contents after a comment character are not ignored (the other place you will see this is when preparing additional statements for schedulers for high-performance computing systems!). In this case the combination #! on the first line is telling Unix to get ready to learn how to interpret the contents of the script:


The part after the #! is just the path to the program bash that implements the bash shell. It is telling Unix how to run the script. While it is common practice to put this top line on a script, on most systems, if the line is absent, Unix will interpret the script using bash, anyway.
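To see the whole cycle in miniature, here is a sketch that writes a two-line script to a temporary file, gives it execute permission, and runs it. It assumes that bash lives at /bin/bash on your system (true on most, but not all, Unix systems). The temporary file made by mktemp and the variable names SCRIPT and GREETING are just placeholders for this example.

```shell
# make an empty temporary file to hold the script
SCRIPT=$(mktemp)

# write two lines into it: the shebang line, and one command
printf '%s\n' '#!/bin/bash' 'echo "Hello from a script"' > "$SCRIPT"

# give yourself execute permission, then run the script by its path
chmod u+x "$SCRIPT"
GREETING=$("$SCRIPT")
echo "$GREETING"

# clean up
rm -f "$SCRIPT"
```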

If you ever find yourself having a hard time remembering the order of those first two symbols (i.e., is it #! or !#?), just remember that it is sometimes called the shebang. You know the first part should be a comment character so the line is treated a little differently, but then that has to be followed by the “bang” which is !.

5.2.1 A bit more on ; and &

Recall the semicolon: it provides a way to combine multiple commands on a single line. In fact, when you think of putting multiple commands on different lines, you could think of each one being followed by a semicolon. Like this:

# this
echo one
echo two

# is the same as
echo one;
echo two;

# is the same as
echo one; echo two;

However, there is another character, beyond the semicolon, with which you might follow commands. It is the single & symbol. Thus, you could use it like a semicolon, writing and executing this:

echo one & echo two &

But, if you do that on your Unix system, you should see some numbers and a report about jobs, like this:

[1] 1007
[2] 1008
[1]-  Done                    echo one
[2]+  Done                    echo two

Whoa! What the heck is going on there? The & symbol means, “run the command that came before it, but don’t bother waiting for it to be done before running the next one.” As a consequence of this, when you run a command followed by a &, the computer returns to you the job number and process ID of each job that has been started, and when they are done, it also tells you that the jobs have completed.

If you want to run multiple jobs on your own computer, the & syntax can be helpful. But, most of the time as a bioinformatician, you will be vying with countless others to run jobs on a large server or cluster. In those cases there are more refined systems for allocating jobs, and, as a consequence, you may rarely use the single & syntax when working on a high performance computing system.
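If you do want to try this on your own computer, the following sketch starts two jobs in the background and then waits for both of them. The wait builtin and the special SECONDS variable are not used elsewhere in this chapter. They are brought in here only to make the effect of & visible, and ELAPSED is a made-up variable name.

```shell
SECONDS=0        # bash counts up from here, in whole seconds

sleep 1 &        # start a one-second job in the background...
sleep 1 &        # ...and a second one, without waiting for the first
wait             # now pause until all background jobs are done

ELAPSED=$SECONDS
echo "Both jobs finished after roughly $ELAPSED second(s)"
```

Because the two sleep commands overlap, the total should be close to one second rather than the two seconds they would take if run one after the other.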

5.3 Variables

When your goal is to script up repetitive tasks, one of the main ingredients to your success is bash’s ability to assign values to variables, and then later, retrieve those values and replace a variable in your script with its value, a process known as variable substitution. Both variable assignment and variable substitution happen all over the script, in lines like:


REPO=$GHP/${RP}$L
DEST=$DD/$L

We are going to break this down because it is of central importance.

5.3.1 Assigning values to variables

The syntax to assign a value to a variable is, “put the variable on the left, follow it with an equals sign with absolutely no spaces around it, and then put the value on the right,” like this:

HUNGRY="My cat, Oliver"
SWEET="A chocolately treat"
lowercase_variables_work_too="oh yeah."

Identify the variables and the values in the above.

The names of variables must start with either an upper- or lowercase letter or an underscore. After that initial character, valid variable names can then include any combination of underscores, upper- and lowercase letters and numerals.

In the following, identify the variable names that are valid and those that are not on each line. After you have made your choices, paste all these lines into your terminal to see which ones work and which do not.

1_tough_cookie="hard to eat with false teeth"
_bring_it_on="A fine musical"

Ha! That is a pretty easy task because the syntax highlighter in ‘bookdown’ colors variables differently than other parts of a script. Oh well, you still get the point!

The value of the variable, on the other hand can be pretty much any string (as long as it doesn’t confuse the shell with characters like &, ;, or !). The shell typically understands that strings are delimited by whitespace, so, if your string (value) should include multiple words separated by spaces, you must enclose them in quotation marks:

NAME="Eric C. Anderson"

You can use either pairs of single quotes ('this is single-quoted') or double quotes ("this is double-quoted"), but the shell treats these very differently, as we will see. Most of the time, in bioinformatics, you will want to be using double quotes.

If you want to include, in a variable’s value, characters that have special meaning to the shell, like *, [, {, &, and ;, among others, you must enclose the string inside quotation marks to assign it to your variable:

FOO="bar&grill"  # works
FOO=bar&grill    # won't work the way you want it to

5.3.2 Accessing values from variables

This is called “variable substitution.” Remember, when you want something from someone, (or even just from a variable) it might cost you some money. Which is how you can remember that you need to use the $ to access values from variables. The $ tells the shell that you want the value of the variable, in a process called variable substitution. It is called that, because if you write $VAR, somewhere in a command line, then the shell will happily go along and substitute the value of the variable $VAR in place of the variable itself, after which the shell will evaluate the command line.

If you have a variable called VAR, then writing $VAR substitutes its value on the command line. The same occurs if you write ${VAR}. The latter is a little more formal, but is also, in some sense, a little more flexible, because it lets you append text immediately after the variable:


# first, assign a value to the variable
FIRST_PART=oxy

# this works:
echo "I like the word ${FIRST_PART}moron"

# this doesn't
echo "I like the word $FIRST_PARTmoron"

# but this would
echo "Remove the stain with $FIRST_PART-clean"
## I like the word oxymoron
## I like the word 
## Remove the stain with oxy-clean

In the case above that does not work, the shell is substituting the value of the variable FIRST_PARTmoron, which actually does not exist, so it substitutes nothing (and doesn’t even give you an error). This type of mistake typically occurs when you forget that the underscore is part of a valid variable name, like:


cd $GENUS_$SPECIES  # Fail!

cd ${GENUS}_${SPECIES}  # Works!

Now it is your turn: save your three favorite foods into the variables ONE, TWO, and THREE, and then use the echo command to print My three favorite foods are... where you include those foods via variable substitution:

Are we starting to feel the power of Unix scripting yet?

5.3.3 What does the shell do with the value substituted for a variable?

This is a great question, and it gets at the heart of why Unix is so powerful for scripting. Recall that variable substitution occurs before command evaluation. So, basically, after variable substitution, the shell has a command line that includes the values instead of the variables. Then it just evaluates that command line. So, what happens to the values that have been substituted for the variables depends on what context they appear in, in the command line!

Follow along with this example:

# Start by assigning to DIR the absolute path of a directory you often go to
# Note, you should use a path from your own computer!
DIR=/path/to/a/directory/you/often/visit

# Now, note that we can do different things with that variable
# depending on how/where we put it in a command line.

# print it
echo $DIR

# list its contents
ls $DIR

# from your home directory (reached with `cd` and nothing after it)
# you can go directly to DIR like so:
cd $DIR

# If you just put $DIR on the command line by itself, the shell
# interprets it as a command, and tries to execute it, which gives an error:
$DIR

# On the other hand, if you wanted to make a new variable:
LIST_IT="ls $DIR"

# then, that would work as a command on its own:
$LIST_IT
Oh! Now I am getting excited. Notice that in the script I did this quite a bit, making a few different command lines…

    # store the commands themselves into variables. Note the 
    # use of double quotes.
    CLONE_IT="git clone ${REPO/$} $DEST" 
    BRANCH_IT="git checkout -B $BRANCH"
    PUSH_IT="git push -u origin $BRANCH"

…that I would then call later:

    $CLONE_IT 2>> $ERR  && \
        cd $DEST && \
        $BRANCH_IT 2>> $ERR  && \
        $PUSH_IT 2>> $ERR  && \
        cd $RUNDIR

While you can count on substituted variables to be evaluated in context, the above-displayed behaviors of the shell don’t always work. Because of the intricacies of how the bash shell command parser works, and the order of parsing and variable substitution, there are times when the string which is a substituted variable will not be evaluated exactly as that same string would be evaluated if it were just text in a script. Typically the differences in behavior are found when = signs or redirects are found in the string.

An example should make this clear:

# here is a command that is a simple string
echo me oh my, i like pie.

# if we run it as a command we see:
me oh my, i like pie.

# here is that same command stored in a variable
boing="echo me oh my, i like pie."

# Variable substitution for boing on the command line looks like:
$boing

# and the result we see is:
me oh my, i like pie.

# However, if we type this:
echo me oh my, i like pie. > /tmp/out

# then the file /tmp/out will hold:
me oh my, i like pie.

# But if we say
bonk="echo me oh my, i like pie. > /tmp/out"

# And do:
$bonk

# We get:
me oh my, i like pie. > /tmp/out

# which is clearly not the same thing

In the above example, when > occurred within a variable, the command parser did not recognize it as the redirection operator.

Here is another example: imagine that we want to assign three values to three different variables. If we just assigned them on the command line we could do something like this:

# simple variable assignment
A=one; B=two; C=three

# after that, if we do:
echo $A $B $C

# we get
one two three

But, what if we wanted to assign three values to three variables within a string that gets assigned to a variable, and then substitute that variable on the command line to actually assign those variables? For example:

ASSIGNMENTS="D=four; E=five; F=six"

# then try to substitute ASSIGNMENTS to actually make
# the variable assignments:
$ASSIGNMENTS

# bash replies with an error message
-bash: D=four;: command not found

It is clear that bash is not able to interpret the = sign as the assignment operator in this context.

We can solve this problem by using the eval command to tell bash to explicitly evaluate the command line a second time after variable substitution has occurred. Observe. In the latter case:

# create variable that is a line with assignments
ASSIGNMENTS="D=four; E=five; F=six"

# evaluate that line after variable substitution:
eval $ASSIGNMENTS

# see that the variables D, E, and F have values assigned:
echo $D, $E, $F

# cool!

And in the former:

# assign complete command line with redirection to bonk
bonk="echo me oh my, i like pie. > /tmp/out"

# evaluate command line after $bonk's value is substituted
eval $bonk

# now /tmp/out holds the contents:
me oh my, i like pie.

The eval command is not often used in bash, but it is very useful in bioinformatics for evaluating command lines that include their redirection specifiers and that have been stored in a variable. Why is this important? If your script stores a command line (after variable substitution) in a variable, then you can print that command line before evaluating it. When developing your scripts this makes it easy to test your command lines (just echo it, copy it, and paste it into a shell you have open for testing). When running scripts, if you print out each line to a log, it is easier to go back to ones that failed and figure out why.
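The print-then-evaluate pattern described here can be sketched as follows. The file made by mktemp stands in for a real output file, and CMD and RESULT are hypothetical names chosen for the illustration. Note that $CMD is wrapped in double quotes when handed to eval, which is a common precaution so that the stored line reaches eval unchanged.

```shell
# a temporary file to receive the redirected output
OUT=$(mktemp)

# build the complete command line, redirection included, as a string
CMD="echo me oh my, i like pie. > $OUT"

# print it first (in a real script this line might go to a log file)...
echo "Running: $CMD"

# ...then evaluate it, so that > is treated as a redirection
eval "$CMD"

# the output landed in the file, not on the screen
RESULT=$(cat "$OUT")
echo "$RESULT"
rm -f "$OUT"
```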

Your Turn: Make a variable called YELL_IT that holds the command line that would print the line “Oh, I just gotta be me” and redirect it into the file yawp.txt. Once you have done that, do variable substitution without, and then with, the eval command. Which one actually gets the job done?

Remember that in RStudio, if you highlight some text and do CMD-Option-Return on a Mac, the text gets sent to the RStudio Unix Terminal. (I suspect the PC equivalent is cntrl-alt-return). That can be helpful for quickly testing lines you have written.

5.3.4 Double and Single Quotation Marks and Variable Substitution

Quite often you will want to save a value to a variable that, itself, includes other variables. In other words, you want to do variable substitution on the value you are assigning to the variable. Double quotes let you do this: variable substitution will proceed within double quotes:

FOO=sandwiches
BAR="I like $FOO"
echo $BAR
## I like sandwiches

However, inside single quotes, variable substitution will not occur:

BAR='I like $FOO'
echo $BAR
## I like $FOO

Each behavior has its uses, but most of the time, as I said, you will be wanting to use double quotes.

Now it is your turn: Save three pairs of foods that you like (for example, “cookies & cream” or “bananas & walnuts”) into three variables, PAIR_ONE, PAIR_TWO, and PAIR_THREE. Then combine those into a variable called SENTENCE and print it with echo, so that the result looks something like: “I like cookies & cream, and bananas & walnuts, and rooibos and milk.”

5.3.5 One useful, fancy, variable-substitution method

Bash is full of fancy variable-substitution embellishments. One that I use all the time replaces strings in a variable with other strings. Check this out:

FILE=my_picture.jpg
echo $FILE
echo ${FILE/jpg/png}
echo ${FILE/my/your}
## my_picture.jpg
## my_picture.png
## your_picture.jpg

That turns out to be some powerful stuff.
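Two close relatives of this substitution are worth a quick sketch. They are standard bash features, but they are not used in the script above: doubling the first slash replaces every match rather than just the first, and ${VAR%pattern} strips a matching suffix from the end of the value. The file name and the variable names FIRST_ONLY, ALL, and NO_EXT are invented for the example.

```shell
FILE=my_picture.my_copy.jpg

FIRST_ONLY=${FILE/my/your}   # replace only the first "my"
ALL=${FILE//my/your}         # replace every "my"
NO_EXT=${FILE%.jpg}          # remove ".jpg" from the end

echo $FIRST_ONLY
echo $ALL
echo $NO_EXT
## your_picture.my_copy.jpg
## your_picture.your_copy.jpg
## my_picture.my_copy
```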

5.3.6 Integer Arithmetic with Shell Variables

Another occasionally useful feature of the bash shell is being able to do integer arithmetic with shell variables that are, properly, integers. Basically, if you have variables that are integers such as 1, -43, 15, 1400, as opposed to strings (like Boing, foo, bar, etc.) or non-integer numbers (like 3.14, -0.002, -7.3, etc.), then you can do integer calculations on them including addition, subtraction, multiplication, integer division and modulo division. Doing so involves expanding the variables within double parentheses: $(( )). Within those double parentheses, you do not need to include a $ before the variable name. You can also just use literal numbers (such as 56) in those double parentheses.

To show examples, start with assigning integer values to three shell variables: A=2; B=5; C=17. Then, see:

  • Addition and Subtraction use + and -:

    echo $((A + B))
    echo $((C - B))
  • Multiplication uses *:

    echo $((3 * C))
    echo $((A * B * C))
  • Division uses /. Note that it only gives the integer part of the result. This is called “integer division.”

    echo $((C / -3))
    echo $((B / A))
  • Modulo division, using %, returns the remainder when two integers are divided by one another:

    echo $((C % 2))
    echo $((15 % B))
  • Rules of precedence are followed, but order of operations can also be expressed using parentheses inside the double parentheses:

    echo $((2 + 3 * 5))
    # same thing grouping by parentheses:
    echo $((2 + (3 * 5) ))
    # group order differently by parentheses:
    echo $(( (2 + 3) * 5 ))

You can’t do really complex math with these features of the bash shell, but sometimes when you just need to manipulate integers, this can be handy.
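As one more worked case, integer division and modulo division fit together: the quotient times the divisor, plus the remainder, gives back the original number. The sketch below checks that with the same values of B and C used in the examples above. QUOT, REM, and CHECK are names made up for this example.

```shell
B=5; C=17

QUOT=$((C / B))            # integer part of 17 / 5
REM=$((C % B))             # what is left over
CHECK=$((QUOT * B + REM))  # should rebuild the value of C

echo "$C divided by $B is $QUOT with remainder $REM"
echo "$QUOT * $B + $REM = $CHECK"
## 17 divided by 5 is 3 with remainder 2
## 3 * 5 + 2 = 17
```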

5.3.7 Variable arrays

Most programming languages allow you to store things in arrays. In R they are called vectors (or, more generally, lists). The same is true in bash: you can store a number of values into a single variable, and then access each individual value one at a time. This can be quite useful sometimes.

The syntax for assigning values to an array variable is like this:

ArrayVariable=(words or things 'or stuff' separated 'by whitespace')

In other words, wrap the items inside a pair of parentheses, with white-space separating them, and the shell will break those up into different, numbered parts of the array variable. It is important to note that single-quoted groups of words are treated as single values that will be assigned to an array element as a group. (Double-quoted groups are kept together as a single element, too; the difference is that variable substitution still occurs inside double quotes.)

Once values are stored in an array, how do we access them, i.e., what sort of variable substitution can we do?

Well, if we just do the traditional variable substitution, like ${ArrayVariable} or $ArrayVariable, then we just get the first element:

# doing this
echo $ArrayVariable

# produces this:
words

That is not super helpful. So, to get each separate element we can subscript the array by adding a number in square brackets inside the curly braces, like this:

# Doing this:
echo ${ArrayVariable[2]}

# gives:
things

The subscripting is done with 0 as the starting value, so ${ArrayVariable[0]} will give words, and ${ArrayVariable[1]} gives or. Here is a for loop (see below) that prints each of the different subscripts followed by their associated values:

# this loop...
for i in {0..5}; do
  echo $i: ${ArrayVariable[$i]}
done

# produces output like:
0: words
1: or
2: things
3: or stuff
4: separated
5: by whitespace

There are a few more important ways of accessing array variable values:

You can get all the values at once as a single string. There are two ways you can do this: either ${ArrayVariable[*]} or ${ArrayVariable[@]}. These two methods differ only when you have wrapped the variable up in double quotation marks, but that involves some serious bash arcana.

# when just printing output, both of these give the same result
echo ${ArrayVariable[*]}
echo ${ArrayVariable[@]}

# producing:
words or things or stuff separated by whitespace
words or things or stuff separated by whitespace
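For the curious, here is a small peek at that difference. This sketch leans on set --, a builtin not otherwise covered here, which replaces the positional parameters ($1, $2, and so on) with whatever words follow it, so that $# can then count them. N_AT and N_STAR are hypothetical variable names.

```shell
ArrayVariable=(words or things 'or stuff' separated 'by whitespace')

# inside double quotes, [@] keeps every element as a separate word
set -- "${ArrayVariable[@]}"
N_AT=$#

# inside double quotes, [*] glues all the elements into one single word
set -- "${ArrayVariable[*]}"
N_STAR=$#

echo "with @: $N_AT words;  with *: $N_STAR word"
## with @: 6 words;  with *: 1 word
```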

And finally, the length of the array (meaning, how many elements are in it) is found with a peculiar syntax: ${#ArrayVariable[@]}.

# this:
echo ${#ArrayVariable[@]}

# produces:
6

Although bash arrays can be quite useful, I must admit that I only use them in a few bioinformatic situations, primarily when I want to break a single row of whitespace-delimited columns (that I grabbed from a file, for example) into its constituent words that I can then manipulate.
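
As a small sketch of that use case (the row of text and the names ROW and FIELDS are invented for illustration), an unquoted variable inside the parentheses gets split on whitespace into array elements:

```shell
# a made-up, whitespace-delimited row, as might be read from a file
ROW="sample_07   chr12   1048576   homozygous"

# leaving $ROW unquoted lets the shell split it into words
FIELDS=($ROW)

echo "number of columns: ${#FIELDS[@]}"
echo "second column:     ${FIELDS[1]}"
echo "last column:       ${FIELDS[3]}"
```

One caveat: because $ROW is unquoted, any glob characters in it (like *) would also be expanded, so this approach is best kept to rows that you know contain only plain words.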

Your turn: Create a variable called MY_ARRAY in which each element is the first name of 5 different people in our bioinformatics class. Then print the 1st, 2nd and 5th of them each on a separate line.

5.4 Evaluate a command and substitute the result on the command line

Sometimes what you want to put on a command line isn’t just a variable you have previously defined but rather the result of a command that is executed. We see this around line 75 of our example script:

echo "STARTING at $(date)"

In general, if you put a command inside $(), like $(command), it means: take the output of the command and insert it into the command line.

You can even assign the result to a variable, like RESULT=$(command).

For example, try this:

# what do you think this is doing?
HOME_LIST=$(ls -l ~)

# what about this?
echo $HOME_LIST

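Whoa! What happened to all my line endings? The result gets put onto the command line and parsed there. Because $HOME_LIST is not quoted, the shell splits its value into words at every run of whitespace (line endings included), and echo prints those words separated by single spaces. Wrapping the variable in double quotes, as in echo "$HOME_LIST", keeps the line endings intact.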
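
The effect of quoting on line endings can be checked with a short sketch. The variable names here are hypothetical, and printf stands in for a command like ls -l so that the result does not depend on what happens to be in your home directory:

```shell
# capture three lines of output in a variable
LISTING=$(printf 'first line\nsecond line\nthird line\n')

# unquoted: split into words and re-joined with single spaces (one line)
UNQUOTED_LINES=$(echo $LISTING | wc -l | tr -d ' ')

# double-quoted: the line endings inside the value are kept (three lines)
QUOTED_LINES=$(echo "$LISTING" | wc -l | tr -d ' ')

echo "unquoted: $UNQUOTED_LINES line(s); quoted: $QUOTED_LINES line(s)"
```

The tr -d ' ' at the end of each pipeline is only there to strip the padding that some versions of wc put in front of the count.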

Your Turn: Make a three-line script. In the first line, assign the output of the date command to the variable BEFORE; in the second line, give the command sleep 5; and in the third line, assign the output of the date command to the variable AFTER. Evaluate all three lines at once (by CMD-OPTION-Return or by copying and pasting into the terminal). Then, look at BEFORE and AFTER to compare.

5.5 Grouping/Collecting output from multiple commands: (commands) and { commands; }

Quite often you may wish to think of the result of a group of commands (taken together, such a group is called a “list” in bash parlance) as being the result of a single command. A typical use case is when you want to redirect the output from three separate commands into a single file. For example, if you want to put the contents of two files, FileA and FileB, into a file called “Both,” but separate the contents of FileA and FileB by a short line of x’s, you could do this:

cat FileA > Both
echo xxxxxxxxxxxx >> Both
cat FileB >> Both

But, it is easier to see what is going on and to maintain code that looks like this:

(cat FileA; echo xxxxxxxxxxxx; cat FileB) > Both

or like:

{ cat FileA; echo xxxxxxxxxxxx; cat FileB;} > Both

Both parentheses and curly braces can be used to group commands into one grouped command, thereby making it easy to redirect the output from that command. These two forms of grouping have subtle differences.

When you group commands into parentheses, all the commands get evaluated in a separate subshell. By contrast, grouped commands inside curly braces are all evaluated within the current shell. In many cases this will make almost no difference to you. However, if you are assigning variable values within the grouped commands, then, using parentheses for grouping, you won’t have access to those variable values in the current shell:

# include a variable assignment in parentheses
(cat FileA; echo xxxxx; cat FileB; NewVar=15) > Both

# if you try this:
echo $NewVar

# the shell knows nothing about it, and prints only an empty line

Using curly braces, the variable assignments will be known in the current shell you are in:

# include a variable assignment in curly braces
{ cat FileA; echo xxxxx; cat FileB; NewVar=15; } > Both

# Now, if you try this:
echo $NewVar

# the shell knows that $NewVar has a value:
15

The syntax for using curly braces for grouping, however, is more finicky than it is with parentheses: the left curly brace cannot be touching anything on the right, and the last command in the group must be followed by a semicolon or a newline. Thus, both of these would fail:

{cat FileA; echo xxxxxxxxxxxx; cat FileB;} > Both
{ cat FileA; echo xxxxxxxxxxxx; cat FileB} > Both
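
The subshell-versus-current-shell difference described in this section can be checked with a small self-contained sketch. The scratch directory and the variable names WORKDIR, InParens, and InBraces are made up for the illustration:

```shell
# set up two small files in a scratch directory
WORKDIR=$(mktemp -d)
echo "contents of A" > "$WORKDIR/FileA"
echo "contents of B" > "$WORKDIR/FileB"

# parentheses: the assignment happens in a subshell and is forgotten
(cat "$WORKDIR/FileA"; echo xxxxx; cat "$WORKDIR/FileB"; InParens=15) > "$WORKDIR/Both1"

# curly braces: the assignment happens in the current shell and persists
{ cat "$WORKDIR/FileA"; echo xxxxx; cat "$WORKDIR/FileB"; InBraces=15; } > "$WORKDIR/Both2"

echo "InParens is '$InParens'"
echo "InBraces is '$InBraces'"
```

The two output files end up identical; only the fate of the variable assignment differs.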

5.6 Exit Status

When you run a command in Unix, that command might do a lot of different things, like print something to stdout, or copy a file around, or delete a file, or index a whole genome. Regardless of all the things a command might do, it also should let the operating system know whether it was successful or not. The exit status of a Unix command records whether it finished its task successfully or not. It is important to understand how exit statuses work so that you can design bioinformatic pipelines that will stop when something has gone wrong, and will let you know about that.

If a command exits normally (SUCCESS!) the exit status is 0. If the command does not exit normally (it may have been unsuccessful) then its exit status will be anything but 0. Some programs, when they fail, return a non-zero integer as their exit status, and the value can tell you what kind of error occurred. Other programs might just return 1.

Exit statuses are not typically seen by the user, but they do get passed to the shell. You can always access the exit status of the last command with $?. You can remember that because, with the question mark, it is kind of like you are asking the operating system, “What’s up!!??”.

Here is an example: if we try to list a file, using ls, that exists, the file gets listed and the exit status is 0 (SUCCESS!). If the file does not exist we get an error message and a nonzero exit status (NO_SUCCESS!): 1 with the ls on a Mac, as shown here, or 2 with the GNU ls found on most Linux systems.

# list a file we know exists
% ls -d ~/Documents 

# check exit status of last command
% echo $?
0

# list a file we are pretty sure does not exist
% ls I-doubt-this-file-exists.yeah.sure
ls: I-doubt-this-file-exists.yeah.sure: No such file or directory

# check the exit status of last command
% echo $?
1

While we don’t get to “see” the exit status of a command without looking at $?, to the bash shell, when a command has completed on the command line, it effectively becomes the value of its exit status. This is important to understand when dealing with combinations of exit statuses.
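
Because $? always refers to the most recent command, a script that wants to use an exit status later has to save it right away. A minimal sketch (the names STATUS_EXISTS and STATUS_MISSING are invented for this example):

```shell
# a path that exists on any Unix system
ls / > /dev/null 2>&1
STATUS_EXISTS=$?      # save it right away: $? is overwritten by the next command

# a path that (almost certainly) does not exist
ls /I-doubt-this-file-exists.yeah.sure > /dev/null 2>&1
STATUS_MISSING=$?

echo "existing path gave exit status: $STATUS_EXISTS"
echo "missing path gave exit status:  $STATUS_MISSING"
```

The first status is 0. The second is nonzero, but its exact value depends on which version of ls you have, so it is safest to test only for "zero or not zero".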

5.6.1 Combinations of exit statuses

In bioinformatics, exit statuses can be helpful in a script to “decide” whether to continue processing the next command, based on whether the previous one failed or not. bash has an elegant way of implementing this in terms of the binary operators && and || that combine exit statuses. The && is a logical-AND combining operator. If ES1 and ES2 are two exit statuses, then ES1 && ES2 is “SUCCESS!” only if both ES1 and ES2 have exit statuses of “SUCCESS!”. Here is a quick table:

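ES1          ES2          ES1 && ES2
NO_SUCCESS   NO_SUCCESS   NO_SUCCESS
NO_SUCCESS   SUCCESS!     NO_SUCCESS
SUCCESS!     NO_SUCCESS   NO_SUCCESS
SUCCESS!     SUCCESS!     SUCCESS!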

If you study this table for a moment, you will see that in the first two cases, the value of the combination, ES1 && ES2 is apparent, just from knowing ES1. It doesn’t matter whether ES2 has an exit status of SUCCESS! or NO_SUCCESS; either way, because ES1 has failed, we know that ES1 && ES2 will be NO_SUCCESS.

This knowledge, combined with an understanding that the bash shell is typically very busy, and so is not going to do any extra work that it does not need to do, will help you to understand the behavior of the shell when two (or more) commands are joined into a “compound command” with the && symbol. Consider this:

command1 && command2

When bash looks at that, it sees that whoever wrote it wants to know the &&-combination of exit statuses of command1 and command2. bash keeps that thought in the back of its mind, and then starts working through the commands from left to right. If command1 fails, the shell says, “Hey! At this point, I know the exit status of command1 && command2, so I am not even going to evaluate command2!” On the other hand, if the exit status of command1 is “SUCCESS!”, then the shell will proceed to evaluating command2, because it knows that if the exit status of command1 was “SUCCESS!”, then it must evaluate command2 so as to get its exit status to properly evaluate command1 && command2.

The upshot of this is a construction like the following:

command1 && \
  command2 && \
  command3 && \
  command4

can be very useful in bioinformatics. This construction says: evaluate each command only if all the preceding commands were successful. If any of the commands fails, then none of the commands after it are evaluated. This is helpful if future steps depend on the successful completion of previous steps. In our example script, at the beginning of this chapter, we see the && used on lines 103–106. This is saying, “if we didn’t successfully clone the repository, then don’t try to make a new branch in it, and if we didn’t successfully make a new branch in it, then don’t try to push that branch back to GitHub.”
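
Here is a minimal sketch of such a chain that can be run anywhere. The scratch directory and file names are hypothetical; mkdir and touch merely stand in for real pipeline steps:

```shell
WORKDIR=$(mktemp -d)

# every step succeeds, so the last command in the chain gets run
mkdir "$WORKDIR/results" && \
  touch "$WORKDIR/results/step1.done" && \
  echo "all steps finished" > "$WORKDIR/results/log.txt"

# the first step fails (the parent directory does not exist),
# so the touch after it is never evaluated
mkdir "$WORKDIR/no/such/parent" 2> /dev/null && \
  touch "$WORKDIR/should-not-exist"
CHAIN_STATUS=$?

echo "exit status of the failed chain: $CHAIN_STATUS"
```

The exit status of the whole chain is the exit status of the last command that was actually evaluated, which here is the failed mkdir.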

The counterpart of && is ||, which combines exit statuses in an OR fashion, with a table like the following:

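ES1          ES2          ES1 || ES2
NO_SUCCESS   NO_SUCCESS   NO_SUCCESS
NO_SUCCESS   SUCCESS!     SUCCESS!
SUCCESS!     NO_SUCCESS   SUCCESS!
SUCCESS!     SUCCESS!     SUCCESS!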

In this case, if ES1 is SUCCESS! then we know that ES1 || ES2 will have a combined exit status of “SUCCESS!”, so the shell does not bother to evaluate the second command. Consequently, we use the || to force evaluation of another command only in case the previous one failed. Like this:

ls I-doubt-this-file-exists.yeah.sure || \
  echo "Aw shucks! That file aint there" > /dev/stderr

Note, in the above, we redirect the text, “Aw shucks! That file aint there” to a file named /dev/stderr. That file, /dev/stderr is a special file: anything that you send into it gets immediately printed on stderr.
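
A sketch of the same idea that you can run to convince yourself where the text goes. It uses >&2, a standard shell redirection that sends output to stderr without naming the /dev/stderr file; the variable names are invented for the example:

```shell
MISSING_FILE=/I-doubt-this-file-exists.yeah.sure

# ls fails, so the echo after || runs; >&2 sends its text to stderr
ls "$MISSING_FILE" 2> /dev/null || echo "Aw shucks! That file aint there" >&2

# capturing stdout shows that the message really did not go there
CAPTURED_STDOUT=$(ls "$MISSING_FILE" 2> /dev/null || echo "Aw shucks!" >&2)

# when the first command succeeds, the one after || is skipped entirely
FALLBACK_RAN=no
ls / > /dev/null || FALLBACK_RAN=yes

echo "captured from stdout: '$CAPTURED_STDOUT'"
echo "fallback ran: $FALLBACK_RAN"
```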

We end by noting that the numbers assigned to SUCCESS! and NO_SUCCESS do not accord with the 0’s and 1’s used in a standard “truth-table” context. We just have to deal with that. I find it much easier to think in terms of exit statuses of SUCCESS! and NO_SUCCESS, than in terms of the 0’s and 1’s, respectively, by which those statuses are represented in the computer.

5.7 Loops and repetition

By this point, you have probably heard, or been told, many times over, that Unix shell scripting is particularly good for taking care of repetitive tasks. But, by this point in this book, it might not yet be clear how that is the case. Wait no more! This section will reveal a wonderful construct called the for loop that lets you do a task repeatedly, each time setting the value of a variable to something different that you want to be applying some commands to.

The basic syntax of the for loop is:

# here the cycled variable is "i"
for i in some things separated by whitespace; do
  commands involving $i
done

For example:

for i in oranges bananas apples; do
  echo "I like $i"
done

# produces this:
I like oranges
I like bananas
I like apples

When you write a for loop in a script, it is good practice to indent the command lines that will get evaluated multiple times. This is particularly useful if you have nested for loops (one inside another) such as:

for fruit in pears figs; do
  for who in Mark Alice; do
    echo "$who likes $fruit"
  done
done

# which produces:
Mark likes pears
Alice likes pears
Mark likes figs
Alice likes figs

However, in bash (unlike Python, which is particularly obsessed with indentation) you are not required to indent things. In fact the above could have been written all on one line:

for fruit in pears figs; do for who in Mark Alice; do echo "$who likes $fruit"; done; done

It is also worth pointing out that, while many languages use curly braces to denote blocks of repeated code, bash uses the pair do...done, which I find to be quite cute.

As found with the variable GHNAMES in our example script, you can also use variable substitution to provide a list of terms to cycle over. Here is another small example:

ITEMS="cats dogs mice shrews"
for critters in $ITEMS; do
  echo $critters are vertebrates
done

You can also use globbing (path expansion) to provide a list of things (files, specifically) to cycle over. The following prints the path and the first two lines of all files in the directory table_inputs in this book’s repository, at this point in the book’s formation:

for i in table_inputs/*; do
  echo "======= file: $i  ========="
  head -n 2 $i
done

# this produces:
======= file: table_inputs/minimal-tmux.txt  =========
Within tmux?   ;  Command         ;  Effect
N              ; `tmux ls`          ; List any tmux sessions the server knows about
======= file: table_inputs/sam-columns-table.txt  =========
Column   &  Field   & Data Type  & Description
1        & QNAME    & String     & Name/ID of the read (from FASTQ file)
======= file: table_inputs/sam-flag-table.txt  =========
bit-#    &   bit-gram   &  $2^x$   &   dec    &   hex     & Meaning
1    &  $\bitsa$    &  $2^0$   &    1     &   0x1    & the read is paired (i.e. comes from paired-endsequencing.)
======= file: table_inputs/tmux-pane-strokes.txt  =========
Within tmux?   ;  Command         ;  Effect
Y              ; `<cntrl>-b /`       ; Split current window/pane vertically into two panes

Those files are the ones used as input to a few of the tables in the book.

If it turns out that you want to cycle over some integers, in order (whether increasing or decreasing), from a starting value to a stopping value, you can use curly braces and two dots, {1..5}, like this:

for i in {1..5}; do echo The number is: $i; done

# this makes:
The number is: 1
The number is: 2
The number is: 3
The number is: 4
The number is: 5

And you can have negative numbers and a reverse order, too:

for i in {4..-2}; do echo The number is: $i; done

# this makes
The number is: 4
The number is: 3
The number is: 2
The number is: 1
The number is: 0
The number is: -1
The number is: -2
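
One pitfall worth knowing about: the curly-brace-and-dots form only works with literal numbers, because bash performs brace expansion before it substitutes variables. A minimal sketch, in which the variable names are made up and seq (a separate utility available on Linux and macOS, not part of bash itself) serves as one possible alternative:

```shell
N=3

# brace expansion happens before $N is substituted, so this loop runs
# just once, with i set to the literal text {1..3}
BRACE_RESULT=$(for i in {1..$N}; do echo "$i"; done)

# seq generates the numbers after $N has been substituted
SEQ_RESULT=$(for i in $(seq 1 $N); do echo "$i"; done)

echo "with braces: $BRACE_RESULT"
echo "with seq:" $SEQ_RESULT
```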

One cool thing to realize is that any output that goes to stdout from within a for loop—if it is not redirected from within the loop—effectively comes “flowing out” to stdout from right after the done keyword. So, you can redirect it in one swell foop from that point in your code like this:

for i in table_inputs/*; do 
  echo "======= file: $i  ========="
  head -n 2 $i
done > file-to-redirect-it-all-into.txt 

…or, you could even pipe it to another command, like this, to print just the first column of text of each line:

for i in table_inputs/*; do 
  echo "======= file: $i  ========="
  head -n 2 $i
done | awk '{print $1}'

5.8 More Conditional Evaluation: if, then, else, and friends

We have already seen how exit statuses can be combined with && or || to control the flow of a script (i.e., if this failed, don’t do the next line…). There is also a traditional if/then/else construct in bash to control the execution of a script on the basis of exit status. Thus, we can do something like:

if ls; then
  echo "We found the README"
else
  echo "Can't find the README"
fi

Note that this opens an if block of code with if and then it closes it (after the then and the else) with a backward if: fi.

The general syntax is:

if exit_status; then
  Do this if exit_status = SUCCESS!
else
  Do this if exit_status = NO_SUCCESS
fi

There is also an else-if construct, that is named elif, with syntax like this:

if exit_status1; then
  Do this if exit_status1 = SUCCESS!
elif exit_status2; then
  Do this if exit_status1 = NO_SUCCESS and exit_status2 = SUCCESS!
else
  Do this if both were NO_SUCCESS
fi
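
Any command can supply the exit status that if inspects. A minimal runnable sketch, assuming a made-up file called reads.fa and using grep -q, which prints nothing and simply reports through its exit status whether a match was found:

```shell
# make a tiny FASTA-like file in a scratch directory
WORKDIR=$(mktemp -d)
printf '>seq1\nACGT\n' > "$WORKDIR/reads.fa"

# FASTQ records start with @, FASTA records start with >
if grep -q '^@' "$WORKDIR/reads.fa"; then
  FORMAT=fastq
elif grep -q '^>' "$WORKDIR/reads.fa"; then
  FORMAT=fasta
else
  FORMAT=unknown
fi

echo "format: $FORMAT"
```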

Many of the times when you want to use an if construct, you will want to be testing things about files, or strings, or variables, rather than assessing the exit status of commands. Alas, all bash is capable of is assessing exit statuses. But hark! All is not lost, because there is a command called test that lets you test whether statements are true, and if they are, it returns an exit status of 0 (SUCCESS!).

For example, to test if a file exists, you can use:

# this file does exist in the current directory on my system...
test -f eca-bioinf-handbook.Rproj

# do this to see what the exit status was
echo $?

Or, if we want to test if the value of a variable is the same as some string.

VAR=big_and_bad # set a variable to some string value

test $VAR = small_and_sweet

# get exit status, which will be NO_SUCCESS (1)
echo $?

The value of integer variables can be tested too. See man test for all the details.

Now, the thing to remember about the test command is that there is an intuitive-looking shorthand for writing it: write its arguments between a [ and a ], with whitespace separating them from both brackets.


# this
test $VAR = small_and_sweet

# is equivalent to:
[ $VAR = "small_and_sweet" ];
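
Put together with if, the bracket shorthand reads quite naturally. The sketch below shows a string comparison, a file test, and an integer comparison. The variable names are hypothetical, and each variable is wrapped in double quotes so that the test still has all its arguments even if the variable turns out to be empty:

```shell
# string comparison
VAR=big_and_bad
if [ "$VAR" = "small_and_sweet" ]; then
  VERDICT="sweet"
else
  VERDICT="not sweet"
fi

# file test: -f succeeds if the path exists and is a regular file
TMPFILE=$(mktemp)
if [ -f "$TMPFILE" ]; then
  FILE_CHECK="exists"
else
  FILE_CHECK="missing"
fi

# integer comparison: -gt means "greater than"
COUNT=7
if [ "$COUNT" -gt 5 ]; then
  SIZE="big"
else
  SIZE="small"
fi

echo "$VERDICT, $FILE_CHECK, $SIZE"
```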

Note that the RStudio bash shell seems to be doing something weird: I don’t think the “last command” is quite what we think it should be, so $? does not seem to be reliable there.

5.9 Finally…positional parameters

Way back when we started this chapter, in the example script, right at the top we see lines like:


What the heck is that $1? That is not proper syntax for a variable name! A variable name can’t start with a number, after all. Aha! $1, $2, $3, and so forth are variables that store the positional parameters of a script. In other words, if you write a script called and you invoke it with some words after it, like:  BigFile.txt  small_file.txt  Yee-ha   'Oh Yeah'

Then, when that script is executing the code inside it, the value of $1 will be BigFile.txt, the value of $2 will be small_file.txt, the value of $3 will be Yee-ha, and the value of $4 will be Oh Yeah.

The variable $# inside the script holds the number of positional parameters, and, in our example script, [ $# -ne 4 ] evaluates to SUCCESS! if $# is not equal to 4, in which case the script prints a message about how to use it.
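
To see $#, $1, and $2 at work without touching the example script, here is a hedged sketch that writes a tiny two-parameter script into a temporary file and then invokes it twice. Everything in it (the temporary script, its messages, and the variable names) is invented for illustration:

```shell
# write a tiny two-parameter script into a temporary file
SCRIPT=$(mktemp)
cat > "$SCRIPT" << 'EOF'
if [ $# -ne 2 ]; then
  echo "Syntax: $(basename "$0") Input Output" >&2
  exit 1
fi
echo "Input is $1 and Output is $2"
EOF

# invoked with two parameters (the second contains a space, so it is quoted)
GOOD_OUTPUT=$(bash "$SCRIPT" BigFile.txt 'Oh Yeah')
echo "$GOOD_OUTPUT"

# invoked with the wrong number of parameters: usage message, exit status 1
BAD_STATUS=0
bash "$SCRIPT" BigFile.txt 2> /dev/null || BAD_STATUS=$?
echo "exit status with one parameter: $BAD_STATUS"
```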

Your Turn: Write a script in a separate file called “” that is expecting to take three positional parameters. Inside the script, have it print the actual values passed to it in reverse order, for example:

Third parameter is: -----
Second parameter is: -----
First parameter is: -----

Where ----- would be replaced by the actual value of the positional parameters.

5.10 basename and dirname: two useful little utilities

When the shell expands paths to files and directories (with globbing, for example), they are listed as paths relative to the current working directory. Sometimes, you just want the name of the file. This is what basename gives you. For example:

# expand the filenames a couple directories down:

# print all those file names
echo $files

# that produces this output:
figure-creation/1.01-unix/ figure-creation/1.01-unix/file-hierarchy.pdf figure-creation/1.01-unix/file-hierarchy.png figure-creation/1.01-unix/

# do this to get just the filenames:
basename $files

# which produces this output:

# if you want just the relative path to the directories those files
# are in you can use dirname, but have to operate on one file at a time:
for i in $files; do
  dirname $i
done

# makes this output:
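
Both utilities work purely on the text of the path, so they can be tried out on a path whether or not the file exists. A minimal sketch using one of the paths listed above (the variable names are made up); it also shows that basename takes an optional second argument, a suffix to strip off:

```shell
FILE_PATH=figure-creation/1.01-unix/file-hierarchy.pdf

NAME=$(basename "$FILE_PATH")        # just the file name
DIR=$(dirname "$FILE_PATH")          # just the directory part
STEM=$(basename "$FILE_PATH" .pdf)   # file name with the .pdf suffix removed

echo "name: $NAME"
echo "dir:  $DIR"
echo "stem: $STEM"
```

Stripping the suffix is handy for building output file names from input file names, like "$STEM.png".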

5.11 bash functions

In our example script, we define a bash function called usage that prints a helpful message showing the syntax and an example invocation of the script. This provides an example of how you can write functions in bash. In all honesty, I only rarely use functions in bash, but it is good to know about nonetheless.

In bash, a function is just a collection of commands, grouped using curly braces, that will be evaluated when the function’s name is issued on the command line or within a script. Not only will those lines be evaluated, but you can also pass positional parameters to the function by following its name with other words/tokens/values. The positional parameters within a function are distinct from the positional parameters within, say, the main script.

Functions can be defined, most transparently, by using the function keyword. As an example, here is a silly function, called Silly that takes two positional parameters and then merely prints them separated by monkey noises.

function Silly {
  echo "$1  Ooooh-oooh  Aaaah-aaah  $2"
}

# now, try
Silly foo bar

Another syntax you might see is to not use the function keyword, but rather follow the function’s name with ():

Silly2() {
  echo "$1  Ooooh-oooh  Aaaah-aaah  $2"
}

# now, try
Silly2 boing bonk

You can even put it all on one line, but you have to make sure that you respect the curly braces’ sensitivity to spacing and their need for an explicit semicolon after the last command:

function Silly3 { echo "$1  Ooooh-oooh  Aaaah-aaah  $2";}

# now, try
Silly3 bing bap
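
The claim that a function’s positional parameters are distinct from those of the main script can be checked with a short sketch. Here set -- is used to give the current shell some positional parameters of its own, standing in for the arguments a script would have received; the function ShowParams and the variable names are hypothetical:

```shell
# give the current shell two positional parameters of its own
set -- script-arg-1 script-arg-2

# inside the function, $1 and $# refer to the function's own parameters
ShowParams() {
  echo "function got $# parameter(s), the first is $1"
}

INSIDE=$(ShowParams function-arg)
OUTSIDE="script has $# parameter(s), the first is $1"

echo "$INSIDE"
echo "$OUTSIDE"
```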

5.12 Reading files line by line

This is handy. Note the line can be broken into a shell array:

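# this is an example of reading a file in which each row is delimited
# by whitespace, the second column is a file name and the
# third column is a number
cat a_file | while read -r line; do
  array=($line)   # break the line into a shell array of its columns
  echo "file: ${array[1]}  number: ${array[2]}"   # illustrative loop body
done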
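
A variation on the same idea, offered as a sketch rather than as the one right way: read can split each line into several named variables directly, and feeding the file in with < after done (instead of piping from cat) keeps the loop in the current shell, so variables set inside the loop are still available afterward. The file contents and variable names below are made up:

```shell
# make a small three-column file to read
A_FILE=$(mktemp)
printf 'sampleA reads_A.fq 10\nsampleB reads_B.fq 25\n' > "$A_FILE"

TOTAL=0
# read puts the first three whitespace-delimited columns into three variables
while read -r name file number; do
  echo "$name: $file has $number"
  TOTAL=$((TOTAL + number))
done < "$A_FILE"

echo "total: $TOTAL"
```

With cat a_file | while ..., each part of the pipe runs in its own subshell (much like commands grouped in parentheses), so TOTAL would still be 0 after the loop.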

5.13 Further reading

An excellent chapter on the development of Unix (Raymond 2003)

A nice set of bash scripting tutorials can be found at


Raymond, Eric S. 2003. The Art of UNIX Programming. 1st ed. Addison-Wesley Professional Computing Series. Addison-Wesley Professional.