Chapter 5 Shell programming
In our first foray into Unix and the shell, we restricted ourselves mostly to navigating the file system, handling files, and working with streams of data (via redirection and pipes). These are all crucial skills, but the bash shell becomes truly powerful when we start to adopt it as a sort of programming language. That’s right, even though the functionality of bash is geared toward running jobs and calling commands, it still exhibits most of the features expected in a programming language, like variables, iteration and flow control.
5.1 An example script
We start this chapter by taking a look at a short bash program (typically called a script) that the author wrote in order to efficiently download (clone, really) repositories that students have submitted to GitHub Classroom. The program is not long, but it exhibits many useful features of bash as a programming language. If you are reading this outside the context of a class with lectures, just go ahead and read through it and see if you can figure out what is going on in each line of the script. Afterward, we will address many features of bash by referencing different parts of the script.
The script, which happens to be stored in a file called clone-classroom-repos.sh, is printed below, with line numbers, since we will be referring back to specific sections of the script later.
#!/bin/bash
# define a function to print the usage or "help" for the script
function usage {
echo Syntax:
echo " $(basename $0) GH_Prefix Repo_Prefix Branch Dir
GH_Prefix: the URL of the GitHub site where the repository exists.
Repo_Prefix: the prefix of the name of each repository to be cloned.
Branch: the name of the branch to create and switch to in the repository,
once the repo has been cloned.
Dir: path to the directory (will be created if necessary) to clone all
the repositories to.
Example:
$(basename $0) https://github.com/CSU-con-gen-bioinformatics-2020 illumina-video-questions- erics-edits /tmp/illumina-questions
"
echo
}
# test for right number of required args. If not, print usage message
if [ $# -ne 4 ]; then
usage;
exit 1;
fi
# copy positional parameters into other variables
GHP=$1
RP=$2
BRANCH=$3
DD=$4
# assign string with student GitHub handles into a variable
GHNAMES="AmandaCicchino
BrennaF
CaitlinWells
EllenMCampbell
FayDong
LibbyGH
NathanPhipps
RGCheek
Ronan17
abeulke
carolazari
cbossu
ccolumbu
elenacorrea
eriqande
jenleon07
kimhoke
kruegg
lauracgoetz
mdrod110
mgdesaix
raven-wings
seamus100
taylorbobowski
wcfunk"
# assign my GitHub username to the variable USER
USER=eriqande
# assign the current working directory to the variable RUNDIR
RUNDIR=$PWD
# make a new directory named whatever the user wanted for the output directory
mkdir -p $DD
# make variables to hold log and error file names
LOG=${PWD}/${RP}log
ERR=$LOG.stderr
# print the date/time when the process is starting
echo "STARTING at $(date)"
# make a clean slate. remove any files with the name
# of the error output file
rm -f $ERR
# cycle over the student GitHub names, and for each one *do*
# the commands that appear before the *done* keyword. Indenting
# is used to make it easier to read, but is not essential.
for L in $GHNAMES; do
echo "Working on $L, starting at $(date)" # print a progress line to stdout
REPO=$GHP/${RP}$L # combine variables into new variables that
echo $REPO # hold the URL for the repository to be
DEST=$DD/$L # cloned and the path where it should be cloned to
# store the commands themselves into variables. Note the
# use of double quotes.
CLONE_IT="git clone ${REPO/github.com/$USER@github.com} $DEST"
BRANCH_IT="git checkout -B $BRANCH"
PUSH_IT="git push -u origin $BRANCH"
# now, run those commands, chained together by exit-status-AND
# operators (so it will stop if any one part fails), while
# all the while appending error statements to the Error file. Run it
# all within an "if" statement so you can deliver a report as to
# whether the whole shebang succeeded or failed.
if $CLONE_IT 2>> $ERR && \
cd $DEST && \
$BRANCH_IT 2>> $ERR && \
$PUSH_IT 2>> $ERR && \
cd $RUNDIR # at the very end make sure to return to the original working directory
then
echo "FULL SUCCESS $L"
else
echo "FAILURE SOMEWHERE WITHIN $L"
cd $RUNDIR # get back to the working directory from which the original command was run.
# so we are ready to handle the next student repo.
fi
done # signifies the end of the for loop we are cycling over
If my current working directory is where the script resides, I can run it like this:
% ./clone-classroom-repos.sh
And if I wanted to be fancy, I could put the script in a directory (like ~/bin, perhaps) that I have included in my PATH variable, in which case I could run it from anywhere on my computer like this:
% clone-classroom-repos.sh
When I run the script in either of those two ways, because I have not provided the proper number of arguments to the command, it returns a message telling me what syntax is required to use it (i.e., its usage syntax):
% clone-classroom-repos.sh
Syntax:
clone-classroom-repos.sh GH_Prefix Repo_Prefix Branch Dir
GH_Prefix: the URL of the GitHub site where the repository exists.
Repo_Prefix: the prefix of the name of each repository to be cloned.
Branch: the name of the branch to create and switch to in the repository,
once the repo has been cloned.
Dir: path to the directory (will be created if necessary) to clone all
the repositories to.
Example:
clone-classroom-repos.sh https://github.com/CSU-con-gen-bioinformatics-2020 illumina-video-questions- erics-edits /tmp/illumina-questions
That is handy, and the code to do it exists in the script itself. Looking at the output, how many arguments do you think the script is expecting?
Now, if I wanted to clone all of the student GitHub repos associated with the illumina-video-questions homework set, and then, once cloned, set up a new git branch called erics-edits so that I can make edits and/or comments and send those to students via a pull request, here is the command I would give (remembering, again, that the % signifies the command prompt here):
% clone-classroom-repos.sh https://github.com/CSU-con-gen-bioinformatics-2020 illumina-video-questions- erics-edits /tmp/illumina-questions
And when I do, I see output like this:
STARTING at Thu Feb 13 06:02:05 MST 2020
Working on AmandaCicchino, starting at Thu Feb 13 06:02:05 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-AmandaCicchino
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS AmandaCicchino
Working on BrennaF, starting at Thu Feb 13 06:02:07 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-BrennaF
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS BrennaF
Working on CaitlinWells, starting at Thu Feb 13 06:02:08 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-CaitlinWells
FAILURE SOMEWHERE WITHIN CaitlinWells
Working on EllenMCampbell, starting at Thu Feb 13 06:02:09 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-EllenMCampbell
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS EllenMCampbell
Working on FayDong, starting at Thu Feb 13 06:02:10 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-FayDong
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS FayDong
Working on LibbyGH, starting at Thu Feb 13 06:02:12 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-LibbyGH
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS LibbyGH
Working on NathanPhipps, starting at Thu Feb 13 06:02:14 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-NathanPhipps
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS NathanPhipps
Working on RGCheek, starting at Thu Feb 13 06:02:15 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-RGCheek
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS RGCheek
Working on Ronan17, starting at Thu Feb 13 06:02:17 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-Ronan17
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS Ronan17
Working on abeulke, starting at Thu Feb 13 06:02:19 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-abeulke
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS abeulke
Working on carolazari, starting at Thu Feb 13 06:02:21 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-carolazari
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS carolazari
Working on cbossu, starting at Thu Feb 13 06:02:23 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-cbossu
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS cbossu
Working on ccolumbu, starting at Thu Feb 13 06:02:25 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-ccolumbu
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS ccolumbu
Working on elenacorrea, starting at Thu Feb 13 06:02:26 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-elenacorrea
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS elenacorrea
Working on eriqande, starting at Thu Feb 13 06:02:28 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-eriqande
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS eriqande
Working on jenleon07, starting at Thu Feb 13 06:02:29 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-jenleon07
FAILURE SOMEWHERE WITHIN jenleon07
Working on kimhoke, starting at Thu Feb 13 06:02:30 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-kimhoke
FAILURE SOMEWHERE WITHIN kimhoke
Working on kruegg, starting at Thu Feb 13 06:02:30 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-kruegg
FAILURE SOMEWHERE WITHIN kruegg
Working on lauracgoetz, starting at Thu Feb 13 06:02:30 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-lauracgoetz
FAILURE SOMEWHERE WITHIN lauracgoetz
Working on mdrod110, starting at Thu Feb 13 06:02:31 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-mdrod110
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS mdrod110
Working on mgdesaix, starting at Thu Feb 13 06:02:32 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-mgdesaix
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS mgdesaix
Working on raven-wings, starting at Thu Feb 13 06:02:34 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-raven-wings
FAILURE SOMEWHERE WITHIN raven-wings
Working on seamus100, starting at Thu Feb 13 06:02:34 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-seamus100
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS seamus100
Working on taylorbobowski, starting at Thu Feb 13 06:02:36 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-taylorbobowski
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS taylorbobowski
Working on wcfunk, starting at Thu Feb 13 06:02:37 MST 2020
https://github.com/CSU-con-gen-bioinformatics-2020/illumina-video-questions-wcfunk
Branch erics-edits set up to track remote branch erics-edits from origin.
FULL SUCCESS wcfunk
From that report, it is clear that it takes about 1 to 2 seconds to handle each repository. If I were doing each repository by hand (i.e., cloning and branching through a GUI like RStudio’s), each repository would probably take me about 30 seconds to a minute, with a lot of copying and pasting and chance for errors, and I would have destroyed my wrists with all the repetitive tasks. So, this is a HUGE deal.
It is also easy to scan through the results and see, “Holy Moly! These are some dedicated students!” Everyone has successfully submitted their homework repositories (and thus we were SUCCESS-ful in cloning them), except for a handful who were traveling or otherwise occupied and had warned me they wouldn’t be able to do the assignment.
When all is said and done, I have the following git repositories on my laptop which I can peruse at my leisure:
% ls /tmp/illumina-questions/
AmandaCicchino/ FayDong/ RGCheek/ carolazari/ elenacorrea/ mgdesaix/ wcfunk/
BrennaF/ LibbyGH/ Ronan17/ cbossu/ eriqande/ seamus100/
EllenMCampbell/ NathanPhipps/ abeulke/ ccolumbu/ mdrod110/ taylorbobowski/
Additionally, in the directory where I ran the command, I have a file called illumina-video-questions-log.stderr that gives me a more detailed report of things when they worked or failed.
If you are new to Unix, then the above script likely appears a bit daunting. Our goal by the end of the chapter is to have described every little piece of bash syntax needed, so that you will be able to read and understand the above script. You will thus also be in a good position to start writing your own scripts to automate tasks and analyses on your computer.
We will start with an overview of the structure of a script and then delve into specific areas of syntax. For each area of syntax, we will provide some examples, and then leave some openings for you, the reader, to try your own hand at implementing each pattern that you see.
5.2 The Structure of a Bash Script
A bash script is merely a text file containing a collection of command lines, which the bash shell will run in sequence, one after the other. If you want to run the script, you must make sure you have set its permissions to include execute permissions (i.e., chmod ug+x script.sh).
It is important to point out, however, that the bash programming syntax that we will be describing in this chapter is not solely useful in the context of scripts that are stored in files. Rather, all the programming syntax can still be used directly on the shell command line itself! This means you can employ all the little tricks you will learn in this chapter while directly “hacking away” at the command line. In this context, it is worth noting that if you want to write multiple distinct commands on a single line, as if they were on separate lines, you can separate them with a semicolon, ;. For example:
## Put this in a file! (and catenate it later...twice!)
## Put this in a file! (and catenate it later...twice!)
Is equivalent to:
## Put this in a file! (and catenate it later...twice!)
## Put this in a file! (and catenate it later...twice!)
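Commands that would produce output like that shown above can be sketched as follows. This is only a minimal illustration, not necessarily the exact commands that generated that output, and the scratch file name semicolon-demo.txt is an assumption made for the example.

```shell
# three commands on a single line, separated by semicolons
echo "Put this in a file! (and catenate it later...twice!)" > semicolon-demo.txt; cat semicolon-demo.txt semicolon-demo.txt; rm semicolon-demo.txt

# the same three commands, one per line
echo "Put this in a file! (and catenate it later...twice!)" > semicolon-demo.txt
cat semicolon-demo.txt semicolon-demo.txt
rm semicolon-demo.txt
```

Either form writes the line to the file, prints the file twice, and then removes the file.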
Sometimes when you are writing a script (or even working on the command line), you might want a very long expression to be treated as being all part of the same command line, even though you would like to break it up over multiple lines. A backslash (\) immediately followed by a line ending (i.e., the “return” key) has the effect of treating the lines that it separates as all being on the same line.
# this:
clone-classroom-repos.sh \
https://github.com/CSU-con-gen-bioinformatics-2020 \
illumina-video-questions- \
erics-edits \
/tmp/illumina-questions
# is the same as this:
clone-classroom-repos.sh https://github.com/CSU-con-gen-bioinformatics-2020 illumina-video-questions- erics-edits /tmp/illumina-questions
Using backslashes (and indenting) in this way can sometimes dramatically improve readability of your scripts.
You might also notice a number of lines or statements in the script above that start with a #. The # is known as a comment character. When the bash interpreter is reading the script, it ignores the comment character and anything following it until the end of the line. This makes it quite convenient to pepper your scripts with notes to yourself or others that can be extremely helpful when you come back to a piece of code and are trying to remember what it does! Example:
# this might report an error if there is no file
# named SillyRidiculous.what
ls -l SillyRidiculous.what
## ls: SillyRidiculous.what: No such file or directory
(Note: each line of the output shown above follows two # symbols. This is not commenting. It is just the convention that the ‘bookdown’ package uses to signify that what it is showing you is output, rather than input. No relation, really, to commenting.)
In the clone-classroom-repos.sh script, listed above, on the very top line (line 1) you will see a special statement that follows a # comment character. This is one of the few cases you will see when the contents after a comment character are not ignored (the other place you will see this is when preparing additional statements for schedulers for high-performance computing systems!). In this case the combination #! on the first line is telling Unix to get ready to learn how to interpret the contents of the script:
#!/bin/bash
The part after the #! is just the path to the program bash that implements the bash shell. It is telling Unix how to run the script. While it is common practice to put this top line on a script, on most systems, if the line is absent and you launch the script from a bash shell, the script will typically be interpreted using bash anyway.
If you ever find yourself having a hard time remembering the order of those first two symbols (i.e., is it #! or !#?), just remember that it is sometimes called the shebang. You know the first part should be a comment character so the line is treated a little differently, but then that has to be followed by the “bang” which is !.
5.2.1 A bit more on ; and &
Recall the semicolon: it provides a way to combine multiple commands on a single line. In fact, when you think of putting multiple commands on different lines, you could think of each one being followed by a semicolon. Like this:
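As a small sketch of that idea, the two snippets below behave identically. The particular commands are chosen only for illustration.

```shell
# each command on its own line, followed by an explicit semicolon
echo one;
echo two;

# exactly the same as leaving the semicolons off
echo one
echo two
```

In both cases the shell runs the first command, waits for it to finish, and only then runs the second.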
However, there is another character, beyond the semicolon, with which you might follow commands. It is the single & symbol. Thus, you could use it like a semicolon, writing and executing this:
echo one & echo two &
But, if you do that on your Unix system, you should see some numbers and a report about jobs, like this:
[1] 1007
[2] 1008
one
two
[1]- Done echo one
[2]+ Done echo two
Whoa! What the heck is going on there? The & symbol means, “run the command that came before it, but don’t bother waiting for it to be done before running the next one.” As a consequence of this, when you run a command followed by a &, the computer returns to you the job ID numbers for the jobs that have been started, and when they are done, it also tells you that the jobs have completed.
If you want to run multiple jobs on your own computer, the & syntax can be helpful. But, most of the time as a bioinformatician, you will be vying with countless others to run jobs on a large server or cluster. In those cases there are more refined systems for allocating jobs, and, as a consequence, you may rarely use the single & syntax when working on a high-performance computing system.
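When you do launch jobs in the background on your own machine, the shell’s built-in wait command pauses until they have all finished. The wait command is not discussed in the text above, so treat this as a small sketch rather than a recommended workflow: the sleep commands stand in for real, long-running jobs, and the file names are assumptions.

```shell
# start two jobs in the background; the shell does not wait for either one
(sleep 1; echo "job one finished" > bg-job-one.txt) &
(sleep 1; echo "job two finished" > bg-job-two.txt) &

# block here until all background jobs started by this shell are done
wait

# both files exist now, because wait did not return until the jobs completed
cat bg-job-one.txt bg-job-two.txt
```

Without the wait, the cat command could run before either file had been written.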
5.3 Variables
When your goal is to script up repetitive tasks, one of the main ingredients to your success is bash’s ability to assign values to variables, and then later, retrieve those values and replace a variable in your script with its value, a process known as variable substitution. Both variable assignment and variable substitution happen all over the clone-classroom-repos.sh script, in lines like REPO=$GHP/${RP}$L and DEST=$DD/$L.
We are going to break this down because it is of central importance.
5.3.1 Assigning values to variables
The syntax to assign a value to a variable is, “put the variable on the left, follow it with an equals sign with absolutely no spaces around it, and then put the value on the right,” like this:
VAR=value
HUNGRY="My cat, Oliver"
SWEET="A chocolately treat"
MY_FILE=/Users/eriq/Documents/git-repos/eca-bioinf-handbook/eca-bioinf-handbook.Rproj
lowercase_variables_work_too="oh yeah."
Or_even_MIXtures_of_cases=boing
Identify the variables and the values in the above.
The names of variables must start with either an upper- or lowercase letter or an underscore. After that initial character, valid variable names can then include any combination of underscores, upper- and lowercase letters and numerals.
In the following, identify the variable names that are valid and those that are not on each line. After you have made your choices, paste all these lines into your terminal to see which ones work and which do not.
vaRiAble=value
1_tough_cookie="hard to eat with false teeth"
_bring_it_on="A fine musical"
PLATE.FIFTEEN=/home/me/labwork/plate_15
Ha! That is a pretty easy task because the syntax highlighter in ‘bookdown’ colors variables differently than other parts of a script. Oh well, you still get the point!
The value of the variable, on the other hand, can be pretty much any string (as long as it doesn’t confuse the shell with characters like &, ;, or !). The shell typically understands that strings are delimited by whitespace, so, if your string (value) should include multiple words separated by spaces, you must enclose them in quotation marks:
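Here is a short sketch of that rule. The variable name SPECIES and its value are invented for illustration.

```shell
# Without quotes, the shell reads this as "run the command mykiss with
# SPECIES=Oncorhynchus set in its environment", which fails:
# SPECIES=Oncorhynchus mykiss

# With quotes, the whole two-word phrase becomes the value
SPECIES="Oncorhynchus mykiss"
echo "$SPECIES"
```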
You can use either pairs of single quotes ('this is single-quoted') or double quotes ("this is double-quoted"), but the shell treats these very differently, as we will see. Most of the time, in bioinformatics, you will want to be using double quotes.
If you want to include, in a variable’s value, characters that have special meaning to the shell, like *, [, {, &, and ;, among others, you must enclose the string inside quotation marks to assign it to your variable:
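A minimal sketch, using made-up variable names and values, might look like this:

```shell
# quoting keeps the shell from interpreting *, &, and ; inside the values
PATTERN="*.fastq.gz"
MESSAGE="trim & align; then call variants"

# quote the variables when using them, too; otherwise the * could be
# expanded into the names of matching files in the current directory
echo "$PATTERN"
echo "$MESSAGE"
```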
5.3.2 Accessing values from variables
Accessing the value of a variable is called “variable substitution.” Remember, when you want something from someone (or even just from a variable), it might cost you some money, which is how you can remember that you need to use the $ to access values from variables. The $ tells the shell that you want the value of the variable. The process is called variable substitution because, if you write $VAR somewhere in a command line, then the shell will happily go along and substitute the value of the variable VAR in place of $VAR, after which the shell will evaluate the command line.
If you have a variable called VAR, then writing $VAR substitutes its value on the command line. The same occurs if you write ${VAR}. The latter is a little more formal, but is also, in some sense, a little more flexible, because it lets you append text immediately after the variable:
FIRST_PART=oxy
# this works:
echo "I like the word ${FIRST_PART}moron"
# this doesn't
echo "I like the word $FIRST_PARTmoron"
# but this would
echo "Remove the stain with $FIRST_PART-clean"
## I like the word oxymoron
## I like the word
## Remove the stain with oxy-clean
In the case above that does not work, the shell is substituting the value of the variable FIRST_PARTmoron, which actually does not exist, so it substitutes nothing (and doesn’t even give you an error). This type of mistake typically occurs when you forget that the underscore is part of a valid variable name, like:
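The following sketch shows the underscore version of the mistake. The variable SAMPLE and the file-name pattern are assumptions made for the example.

```shell
SAMPLE=s001

# the shell looks for a variable named SAMPLE_R1, which is not set,
# so only ".fq" is left after substitution
echo "$SAMPLE_R1.fq"

# curly braces mark exactly where the variable name ends
echo "${SAMPLE}_R1.fq"
```

The first echo prints .fq and the second prints s001_R1.fq.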
Now it is your turn: save your three favorite foods into the variables ONE, TWO, and THREE, and then use the echo command to print My three favorite foods are... where you include those foods via variable substitution:
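One possible answer is sketched below. The foods are, of course, just placeholders.

```shell
# multi-word values need quotes; single words do not
ONE="green chile"
TWO=tamales
THREE="peach pie"

# double quotes allow variable substitution inside the sentence
echo "My three favorite foods are $ONE, $TWO, and $THREE"
```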
Are we starting to feel the power of Unix scripting yet?
5.3.3 What does the shell do with the value substituted for a variable?
This is a great question, and it gets at the heart of why Unix is so powerful for scripting. Recall that variable substitution occurs before command evaluation. So, basically, after variable substitution, the shell has a command line that includes the values instead of the variables. Then it just evaluates that command line. So, what happens to the values that have been substituted for the variables depends on what context they appear in, in the command line!
Follow along with this example:
# Start by assigning to DIR the absolute path of a directory you often go to
# Note, you should use a path from your own computer!
DIR=/Users/eriq/Documents/git-repos/eca-bioinf-handbook
# Now, note that we can do different things with that variable
# depending on how/where we put it in a command line.
# print it
echo $DIR
# list its contents
ls $DIR
# from your home directory (reached with `cd` and nothing after it)
# you can go directly to DIR like so:
cd
cd $DIR
# If you just put $DIR on the command line by itself, the shell
# interprets it as a command, and tries to execute it, which give an error:
$DIR
# On the other hand, if you wanted to make a new variable:
GODIR="cd $DIR"
# then, that would work as a command on its own:
cd
$GODIR
Oh! Now I am getting excited. Notice that in the clone-classroom-repos.sh script I did this quite a bit, making a few different command lines…
# store the commands themselves into variables. Note the
# use of double quotes.
CLONE_IT="git clone ${REPO/github.com/$USER@github.com} $DEST"
BRANCH_IT="git checkout -B $BRANCH"
PUSH_IT="git push -u origin $BRANCH"
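…that I would then call later:
if $CLONE_IT 2>> $ERR && \
cd $DEST && \
$BRANCH_IT 2>> $ERR && \
$PUSH_IT 2>> $ERR && \
cd $RUNDIR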
While you can count on substituted variables to be evaluated in context, the above-displayed behaviors of the shell don’t always work. Because of the intricacies of how the bash shell command parser works, and the order of parsing and variable substitution, there are times when the string which is a substituted variable will not be evaluated exactly as that same string would be evaluated if it were just text in a script. Typically the differences in behavior are found when = signs or redirects are found in the string.
An example should make this clear:
# here is a command that is a simple string
echo me oh my, i like pie.
# if we run it as a command we see:
me oh my, i like pie.
# here is that same command stored in a variable
boing="echo me oh my, i like pie."
# Variable substitution for boing on the command line looks like:
$boing
# and the result we see is:
me oh my, i like pie.
# However, if we type this:
echo me oh my, i like pie. > /tmp/out
# then the file /tmp/out will hold:
me oh my, i like pie.
# But if we say
bonk="echo me oh my, i like pie. > /tmp/out"
# And do:
$bonk
# We get:
me oh my, i like pie. > /tmp/out
# which is clearly not the same thing
In the above example, when > occurred within a variable, the command parser did not recognize it as the redirection operator.
Here is another example: imagine that we want to assign three values to three different variables. If we just assigned them on the command line we could do something like this:
# simple variable assignment
A=one; B=two; C=three
# after that, if we do:
echo $A $B $C
# we get
one two three
But, what if we wanted to assign three values to three variables within a string that gets assigned to a variable, and then substitute that variable on the command line to actually assign those variables? For example:
ASSIGNMENTS="D=four; E=five; F=six"
# then try to substitute ASSIGNMENTS to actually make
# the variable assignments:
$ASSIGNMENTS
# bash replies with an error message
-bash: D=four;: command not found
It is clear that bash is not able to interpret the = sign as the assignment operator in this context.
We can solve this problem by using the eval keyword to tell bash to explicitly evaluate the command line a second time after variable substitution has occurred. Observe. In the latter case:
# create variable that is a line with assignments
ASSIGNMENTS="D=four; E=five; F=six"
# evaluate that line after variable substitution:
eval $ASSIGNMENTS
# see that the variables D, E, and F have values assigned:
echo $D, $E, $F
# cool!
And in the former:
# assign complete command line with redirection to bonk
bonk="echo me oh my, i like pie. > /tmp/out"
# evaluate command line after $bonk's value is substituted
eval $bonk
# now /tmp/out holds the contents:
me oh my, i like pie.
The eval keyword is not often used in bash, but it is very useful in bioinformatics for evaluating command lines that include their redirection specifiers and which have been stored in a variable. Why is this important? If your script stores a command line (after variable substitution) in a variable, then you can print that command line before evaluating it. When developing your scripts this makes it easy to test your command lines (just echo the command line, copy it, and paste it into a shell you have open for testing). When running scripts, if you print out each line to a log, it is easier to go back to ones that failed and figure out why.
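The print-then-evaluate pattern described above can be sketched like this. The file paths and the variable names OUT, LOG, and CMD are assumptions made for the example, and the stored command is deliberately trivial. Because eval runs whatever the string contains, it is safest to use it only on command lines that your own script has built.

```shell
OUT=/tmp/eval-demo-out.txt
LOG=/tmp/eval-demo-commands.log

# build the full command line, redirection included, as a string
CMD="echo me oh my, i like pie. > $OUT"

# record exactly what is about to be run
echo "$CMD" >> "$LOG"

# now evaluate the stored command line, so that > acts as a redirect
eval "$CMD"

# show what ended up in the output file
cat "$OUT"
```

If something goes wrong later, the log holds the exact command line that was evaluated, ready to be copied and rerun by hand.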
Your Turn: Make a variable called YELL_IT that holds the command line that would print the line “Oh, I just gotta be me” and redirect it into the file yawp.txt. Once you have done that, do variable substitution without, and then with, the eval keyword. Which one actually gets the job done?
Remember that in RStudio, if you highlight some text and do CMD-Option-Return on a Mac, the text gets sent to the RStudio Unix Terminal. (I suspect the PC equivalent is Ctrl-Alt-Return.) That can be helpful for quickly testing lines you have written.
5.3.4 Double and Single Quotation Marks and Variable Substitution
Quite often you will want to save a value to a variable that, itself, includes other variables. In other words, you want to do variable substitution on the value you are assigning to the variable. Double quotes let you do this: variable substitution will proceed within double quotes:
## I like sandwiches
However, inside single quotes, variable substitution will not occur:
## I like $FOO
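Commands along the following lines would produce the two results shown above. The variable names BAR and BAZ are assumptions; only FOO and its value can be inferred from the output.

```shell
FOO=sandwiches

# double quotes: $FOO is replaced by its value
BAR="I like $FOO"
echo "$BAR"

# single quotes: the characters $FOO are kept literally
BAZ='I like $FOO'
echo "$BAZ"
```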
Each behavior has its uses, but most of the time, as I said, you will be wanting to use double quotes.
Now it is your turn: Save three pairs of foods that you like (for example, “cookies & cream” or “bananas & walnuts”) into three variables, PAIR_ONE, PAIR_TWO, and PAIR_THREE. Then combine those into a variable called SENTENCE and print it with echo, so that the result looks something like: “I like cookies & cream, and bananas & walnuts, and rooibos and milk.”
5.3.5 One useful, fancy, variable-substitution method
Bash is full of fancy variable-substitution embellishments. One that I use all the time replaces strings in a variable with other strings. Check this out:
## my_picture.jpg
## my_picture.png
## your_picture.jpg
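The three results above are consistent with commands like the ones in this sketch, in which the variable name FILE is an assumption. The pattern is ${VARIABLE/search/replacement}, and it replaces the first occurrence of the search string in the variable’s value.

```shell
FILE=my_picture.jpg

echo "$FILE"               # the original value
echo "${FILE/jpg/png}"     # replace the first "jpg" with "png"
echo "${FILE/my/your}"     # replace the first "my" with "your"
```

The clone-classroom-repos.sh script uses this same form in ${REPO/github.com/$USER@github.com} to insert the username into the repository URL.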
That turns out to be some powerful stuff.
5.3.6 Integer Arithmetic with Shell Variables
Another occasionally useful feature of the bash shell is being able to do integer arithmetic with shell variables that are, properly, integers. Basically, if you have variables that are integers such as 1, -43, 15, 1400, as opposed to strings (like Boing, foo, bar, etc.) or non-integer numbers (like 3.14, -0.002, -7.3, etc.), then you can do integer calculations on them including addition, subtraction, multiplication, integer division and modulo division. Doing so involves expanding the variables within double parentheses: $(( )). Within those double parentheses, you do not need to include a $ before the variable name. You can also just use literal numbers (such as 56) in those double parentheses.
To show examples, start with assigning integer values to three shell variables: A=2; B=5; C=17. Then, see:
Addition and subtraction use + and -.
Multiplication uses *.
Division uses /. Note that it only gives the integer part of the result. This is called “integer division.”
Modulo division, using %, returns the remainder when two integers are divided by one another.
Rules of precedence are followed, but order of operations can also be expressed using parentheses inside the double parentheses.
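Each of those operations can be tried with the values A=2, B=5, and C=17 given above. The particular expressions in this sketch are chosen for illustration.

```shell
A=2; B=5; C=17

echo $((A + B))        # 7
echo $((C - B))        # 12
echo $((A * B))        # 10
echo $((C / B))        # 3   (17 / 5 is 3.4; only the integer part is kept)
echo $((C % B))        # 2   (17 = 3 * 5 + 2)
echo $((A + B * C))    # 87  (multiplication happens before addition)
echo $(((A + B) * C))  # 119 (inner parentheses force the addition first)
```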
You can’t do really complex math with these features of the bash shell, but sometimes when you just need to manipulate integers, this can be handy.
5.3.7 Variable arrays
Most programming languages allow you to store things in arrays. In R they are called vectors (or, more generally, lists). The same is true in bash: you can store a number of values into a single variable, and then access each individual value one at a time. This can be quite useful sometimes.
The syntax for assigning values to an array variable is the variable name, an equals sign, and then the values inside parentheses, for example ArrayVariable=(words or things 'or stuff' separated 'by whitespace').
In other words, wrap the items inside a pair of parentheses, with white-space separating them, and the shell will break those up into different, numbered parts of the array variable. It is important to note that quoted groups of words are treated as single values that will be assigned to an array element as a group. This holds for single-quoted and double-quoted groups alike; the only difference is that variables inside double quotes are still expanded.
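Here is a sketch of such an assignment. The exact values are an assumption on my part, chosen to be consistent with the output shown later in this section:

```shell
# six elements: the two quoted groups each count as a single element
ArrayVariable=(words or things 'or stuff' separated 'by whitespace')

# declare -p prints the array together with its numbered elements
declare -p ArrayVariable
```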
Once values are stored in an array, how do we access them, i.e., what sort of variable substitution can we do?
Well, if we just do the traditional variable substitution, like ${ArrayVariable} or $ArrayVariable, then we just get the first element.
That is not super helpful. So, to get each separate element we can subscript the array by adding a number in square brackets inside the curly braces, as in ${ArrayVariable[3]}.
The subscripting is done with 0 as the starting value, so ${ArrayVariable[0]} will give words, and ${ArrayVariable[1]} gives or.
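A short sketch of both behaviors, assuming the same six-element array used in the rest of this section (FIRST is just a name I picked to hold a result):

```shell
ArrayVariable=(words or things 'or stuff' separated 'by whitespace')

# without a subscript, only the first element (index 0) is substituted
FIRST=$ArrayVariable
echo "$FIRST"                 # words

# with a subscript inside the curly braces, any element can be picked out
echo "${ArrayVariable[0]}"    # words
echo "${ArrayVariable[1]}"    # or
echo "${ArrayVariable[3]}"    # or stuff
```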
Here is a for loop (see below) that prints each of the different subscripts followed by their associated values:
# this loop...
for i in {0..5}; do
  echo $i: ${ArrayVariable[$i]}
done
# produces output like:
0: words
1: or
2: things
3: or stuff
4: separated
5: by whitespace
There are a few more important ways of accessing array variable values:
You can get all the values at once as a single string. There are two ways you can do this: either ${ArrayVariable[*]} or ${ArrayVariable[@]}. These two methods differ only when you have wrapped the variable up in double quotation marks, but that involves some serious bash arcana.
# when just printing output, both of these give the same result
echo ${ArrayVariable[*]}
echo ${ArrayVariable[@]}
# producing:
words or things or stuff separated by whitespace
words or things or stuff separated by whitespace
And finally, the length of the array (meaning, how many elements are in it) is found with a peculiar syntax: ${#ArrayVariable[@]}.
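For the curious, here is a minimal sketch of the length syntax, along with a peek at the double-quote arcana mentioned above. The counter names N, COUNT_AT and COUNT_STAR are made up for this example:

```shell
ArrayVariable=(words or things 'or stuff' separated 'by whitespace')

# number of elements in the array
N=${#ArrayVariable[@]}
echo "$N"     # 6

# inside double quotes, [@] keeps each element as its own word...
COUNT_AT=0
for item in "${ArrayVariable[@]}"; do COUNT_AT=$((COUNT_AT + 1)); done

# ...whereas [*] joins all the elements into one single word
COUNT_STAR=0
for item in "${ArrayVariable[*]}"; do COUNT_STAR=$((COUNT_STAR + 1)); done

echo "$COUNT_AT $COUNT_STAR"   # 6 1
```

So, when looping over the elements of an array, the double-quoted [@] form is usually the one you want, because it keeps elements like 'or stuff' in one piece.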
Although bash arrays can be quite useful, I must admit that I only use them in a few bioinformatic situations, primarily when I want to break a single row of white-space delimited columns (that I grabbed from a file, for example) into its constituent words that I can then manipulate.
Your turn: Create a variable called MY_ARRAY in which each element is the first name of one of 5 different people in our bioinformatics class. Then print the 1st, 2nd and 5th of them each on a separate line.
5.4 Evaluate a command and substitute the result on the command line
Sometimes what you want to put on a command line isn’t just a variable you have previously defined but rather the result of a command that is executed. We see this in clone-classroom-repos.sh around line 75.
In general, if you put a command inside a $(), like $(command), it means take the output of the command and insert it into the command line. You can even assign the result to a variable, like RESULT=$(command).
For example, try this:
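The original command for this example is not preserved in this copy, so here is a stand-in sketch. It assumes printf as a convenient command that produces three lines of output, and reuses the variable name RESULT from above:

```shell
# capture three lines of output in a variable
RESULT=$(printf 'line one\nline two\nline three\n')

# unquoted: the shell splits the value on whitespace, newlines included,
# and echo prints the pieces separated by single spaces
echo $RESULT

# quoted: the newlines inside the value are preserved
echo "$RESULT"
```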
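Whoa! What happened to all my line endings? The result gets put onto the command line and parsed there. When a substitution is not wrapped in double quotes, the command line parser treats line endings as just more whitespace, and it breaks the text into words at every run of whitespace, so echo prints those words separated by single spaces. Wrapping the substitution in double quotes preserves the line endings.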
Your Turn: Make a three line script. In the first line, assign the output of the date command to the variable BEFORE, then in the second line give the command sleep 5, then in the third line, assign the output of the date command to the variable AFTER. Evaluate all three lines at once (by CMD-OPTION-Return or copying and pasting into the terminal.) Then, look at BEFORE and AFTER to compare.
5.5 Grouping/Collecting output from multiple commands: (commands) and { commands; }
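Quite often you may wish to think of the result of a group of commands (taken together such a group is called a “list” in bash parlance) as being the result of a single command. Typical use cases are when you want to redirect the output from three separate commands into a single file. For example, if you want to put the contents of two files, FileA and FileB, into a file called “Both,” but separate the contents of FileA and FileB by a short line of x’s, you could redirect the output of each command separately, using > for the first command and >> to append the output of the later ones.
But it is easier to see what is going on, and to maintain the code, if the three commands are grouped and redirected just once, either with parentheses, as in (cat FileA; echo xxxxx; cat FileB) > Both, or with curly braces, as in { cat FileA; echo xxxxx; cat FileB; } > Both.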
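Here is a runnable sketch of the three approaches. It works in a throwaway directory (SCRATCH, made with mktemp, is my own addition) and writes to three differently named output files so that the results can be compared:

```shell
# work in a scratch directory so that no real files are touched
SCRATCH=$(mktemp -d)
echo "contents of A" > "$SCRATCH/FileA"
echo "contents of B" > "$SCRATCH/FileB"

# one redirection per command: > creates the file, >> appends to it
cat "$SCRATCH/FileA" > "$SCRATCH/Both"
echo xxxxx >> "$SCRATCH/Both"
cat "$SCRATCH/FileB" >> "$SCRATCH/Both"

# the same result with the commands grouped in parentheses...
(cat "$SCRATCH/FileA"; echo xxxxx; cat "$SCRATCH/FileB") > "$SCRATCH/Both_parens"

# ...or grouped in curly braces
{ cat "$SCRATCH/FileA"; echo xxxxx; cat "$SCRATCH/FileB"; } > "$SCRATCH/Both_braces"

cat "$SCRATCH/Both"
```

All three output files end up with identical contents.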
Both parentheses and curly braces can be used to group commands into one grouped command, thereby making it easy to redirect the output from that command. These two forms of grouping have subtle differences.
When you group commands into parentheses, all the commands get evaluated in a separate subshell. By contrast, grouped commands inside curly braces are all evaluated within the current shell. In many cases this will make almost no difference to you. However, if you are assigning variable values within the grouped commands, then, using parentheses for grouping, you won’t have access to those variable values in the current shell:
# include a variable assignment in parentheses
(cat FileA; echo xxxxx; cat FileB; NewVar=15) > Both
# if you try this:
echo $NewVar
# you get an empty line: the current shell knows nothing about NewVar
Using curly braces, the variable assignments will be known in the current shell you are in:
# include a variable assignment in curly braces
{ cat FileA; echo xxxxx; cat FileB; NewVar=15; } > Both
# Now, if you try this:
echo $NewVar
# the shell knows that $NewVar has a value:
15
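The syntax for using curly braces for grouping, however, is more finicky than it is with parentheses: the left curly brace must be followed by whitespace, and the last command in the group must be followed by a semicolon or a newline. Thus, both {cat FileA; echo xxxxx; cat FileB; } > Both (no space after the left brace) and { cat FileA; echo xxxxx; cat FileB } > Both (no semicolon after the last command) would fail.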
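The failing commands from the original example are not preserved in this copy, so here is a sketch of the rule as I understand it. The two malformed variants are handed to a child bash with error messages hidden, so that their failure does not interrupt the current shell (BRACES_OK, NO_SPACE and NO_SEMICOLON are names invented for this example):

```shell
# correct: a space after "{" and a semicolon before "}"
{ echo hello; echo world; } > /dev/null && BRACES_OK=yes

# malformed: nothing separates "{" from the first command
bash -c '{echo hello; echo world; }' > /dev/null 2>&1 || NO_SPACE=failed

# malformed: no semicolon (or newline) after the last command
bash -c '{ echo hello; echo world }' > /dev/null 2>&1 || NO_SEMICOLON=failed

echo "$BRACES_OK $NO_SPACE $NO_SEMICOLON"
# prints: yes failed failed
```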
5.6 Exit Status
When you run a command in Unix, that command might do a lot of different things, like print something to stdout, or copy a file around, or delete a file, or index a whole genome. Regardless of all the things a command might do, it also should let the operating system know whether it was successful or not. The exit status of a Unix command records whether it finished its task successfully or not. It is important to understand how exit statuses work so that you can design bioinformatic pipelines that will stop when something has gone wrong, and will let you know about that.
If a command exits normally (SUCCESS!) the exit status is 0. If the command does not exit normally (it may have been unsuccessful) then its exit status will be anything but 0. Some programs, when they fail, return a non-zero integer as their exit status, and the value can tell you what kind of error occurred. Other programs might just return 1.
Exit statuses are not typically seen by the user, but they do get passed to the shell. You can always access the exit status of the last command with $?. You can remember that because, with the question mark, it is kind of like you are asking the operating system, “What’s up!!??”.
Here is an example: if we try to list a file, using ls, that exists, the file gets listed and the exit status is 0 (SUCCESS!). If the file does not exist we get an error message and an exit status of 1 (NO_SUCCESS!). The output below is from macOS; GNU ls on Linux reports 2 in this situation, which is just as much a non-zero NO_SUCCESS.
# list a file we know exists
% ls -d ~/Documents
/Users/eriq/Documents/
# check exit status of last command
% echo $?
0
# list a file we are pretty sure does not exist
% ls I-doubt-this-file-exists.yeah.sure
ls: I-doubt-this-file-exists.yeah.sure: No such file or directory
# check the exit status of last command
% echo $?
1
While we don’t get to “see” the exit status of a command without looking at $?, to the bash shell, when a command has completed on the command line, it effectively becomes the value of its exit status. This is important to understand when dealing with combinations of exit statuses.
5.6.1 Combinations of exit statuses
In bioinformatics, exit statuses can be helpful in a script to “decide” whether to continue processing the next command, based on whether the previous one failed or not. bash has an elegant way of implementing this in terms of the binary operators && and || that combine exit statuses. The && is a logical-AND combining operator. If ES1 and ES2 are two exit statuses, then ES1 && ES2 is “SUCCESS!” only if both ES1 and ES2 have exit statuses of “SUCCESS!”. Here is a quick table:
ES1 | ES2 | ES1 && ES2 |
---|---|---|
NO_SUCCESS | NO_SUCCESS | NO_SUCCESS |
NO_SUCCESS | SUCCESS! | NO_SUCCESS |
SUCCESS! | NO_SUCCESS | NO_SUCCESS |
SUCCESS! | SUCCESS! | SUCCESS! |
If you study this table for a moment, you will see that in the first two cases, the value of the combination, ES1 && ES2, is apparent just from knowing ES1. It doesn’t matter whether ES2 has an exit status of SUCCESS! or NO_SUCCESS; either way, because ES1 has failed, we know that ES1 && ES2 will be NO_SUCCESS.
This knowledge, combined with an understanding that the bash shell is typically very busy, and so is not going to do any extra work that it does not need to do, will help you to understand the behavior of the shell when two (or more) commands are joined into a “compound command” with the && symbol.
Consider a compound command of the form command1 && command2.
When bash looks at that, it sees that whoever wrote it wants to know the &&-combination of exit statuses of command1 and command2. bash keeps that thought in the back of its mind, and then starts working through the commands from left to right. If command1 fails, the shell says, “Hey! At this point, I know the exit status of command1 && command2, so I am not even going to evaluate command2!” On the other hand, if the exit status of command1 is “SUCCESS!”, then the shell will proceed to evaluating command2, because it needs the exit status of command2 in order to properly evaluate command1 && command2.
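To watch this short-circuiting happen, here is a small sketch that uses the standard commands true (which always succeeds) and false (which always fails) as stand-ins for command1. The two RAN_AFTER variables are made up for this example:

```shell
RAN_AFTER_TRUE=no
RAN_AFTER_FALSE=no

true  && RAN_AFTER_TRUE=yes     # left side succeeds, so the right side is evaluated
false && RAN_AFTER_FALSE=yes    # left side fails, so the right side is skipped

echo "$RAN_AFTER_TRUE $RAN_AFTER_FALSE"
# prints: yes no
```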
The upshot of this is that a construction like command1 && command2 && command3 && command4 can be very useful in bioinformatics. This construction says: evaluate each command only if all the preceding commands were successful. If any of the commands fails, then none of the commands after it are evaluated. This is helpful if future steps depend on the successful completion of previous steps. In our example script, clone-classroom-repos.sh at the beginning of this chapter, we see the && used on lines 103–106. This is saying, “if we didn’t successfully clone the repository, then don’t try to make a new branch in it, and if we didn’t successfully make a new branch in it, then don’t try to push that branch back to GitHub.”
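The same stop-at-the-first-failure behavior can be seen in a toy chain, sketched here with false standing in for a step that fails (in the real script the steps are git commands; CHAIN_STATUS is a name I made up):

```shell
# step 1 succeeds, the stand-in for step 2 fails, so step 3 is never evaluated
echo "step 1" && false && echo "step 3 (never printed)"

# the exit status of the whole chain is that of the command that failed
CHAIN_STATUS=$?
echo "exit status of the whole chain: $CHAIN_STATUS"
# last line printed: exit status of the whole chain: 1
```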
The counterpart of && is ||, which combines exit statuses in an OR fashion with a table like the following:
ES1 | ES2 | ES1 || ES2 |
---|---|---|
NO_SUCCESS | NO_SUCCESS | NO_SUCCESS |
NO_SUCCESS | SUCCESS! | SUCCESS! |
SUCCESS! | NO_SUCCESS | SUCCESS! |
SUCCESS! | SUCCESS! | SUCCESS! |
In this case, if ES1 is SUCCESS! then we know that ES1 || ES2 will have a combined exit status of “SUCCESS!”, so the shell does not bother to evaluate the second command. Consequently, we use the || to force evaluation of another command in case the previous one failed, as in command1 || command2.
A typical use is to print a warning when a command fails, for example by echoing the text “Aw shucks! That file aint there” and redirecting it to a file named /dev/stderr. That file, /dev/stderr, is a special file: anything that you send into it gets immediately printed on stderr.
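The original example is not preserved in this copy. A sketch consistent with the description might look like the following, where MISSING is a hypothetical path that I am assuming does not exist on your system:

```shell
# MISSING is a made-up path that is assumed not to exist
MISSING=/no/such/dir/I-doubt-this-file-exists.txt

# ls fails, so the command after || is evaluated, and its text goes to stderr
ls "$MISSING" || echo "Aw shucks! That file aint there" > /dev/stderr

# the combined exit status is that of the echo, which succeeded
OR_STATUS=$?
echo "$OR_STATUS"
```

Note that the error message from ls itself also appears, since ls writes its complaints to stderr too.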
We end by noting that the numbers assigned to SUCCESS! and NO_SUCCESS do not accord with the 0’s and 1’s used in a standard “truth-table” context: there, 1 usually stands for true and 0 for false, whereas here 0 means SUCCESS!. We just have to deal with that. I find it much easier to think in terms of exit statuses of SUCCESS! and NO_SUCCESS than in terms of the 0’s and 1’s, respectively, by which those statuses are represented in the computer.
5.7 Loops and repetition
By this point, you have probably heard, or been told, many times over, that Unix shell scripting is particularly good for taking care of repetitive tasks. But, by this point in this book, it might not yet be clear how that is the case. Wait no more! This section will reveal a wonderful construct called the for loop that lets you do a task repeatedly, each time setting the value of a variable to something different that you want to be applying some commands to.
The basic syntax of the for loop is:
# here the cycled variable is "i"
for i in some things separated by whitespace; do
  commands involving $i
done
For example:
for i in oranges bananas apples; do
  echo "I like $i"
done
# produces this:
I like oranges
I like bananas
I like apples
When you write a for loop in a script, it is good practice to indent the command lines that will get evaluated multiple times. This is particularly useful if you have nested for loops (one inside another) such as:
for fruit in pears figs; do
  for who in Mark Alice; do
    echo "$who likes $fruit"
  done
done
# which produces:
Mark likes pears
Alice likes pears
Mark likes figs
Alice likes figs
However, in bash (unlike Python, which is particularly obsessed with indentation) you are not required to indent things. In fact the above could have been written all on one line:
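A sketch of that one-line form: each line ending of the indented version is simply replaced by a semicolon, except right after do, where no semicolon is needed.

```shell
for fruit in pears figs; do for who in Mark Alice; do echo "$who likes $fruit"; done; done
```

It produces the same four lines as the indented version, but it is a lot harder to read.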
It is also worth pointing out that, while many languages use curly braces to denote blocks of repeated code, bash uses the pair do...done, which I find to be quite cute.
As found with the variable GHNAMES in our example script, clone-classroom-repos.sh, you can also use variable substitution to provide a list of terms to cycle over. Here is another small example:
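The original example is not preserved in this copy, so here is a stand-in sketch. SAMPLES is a hypothetical variable holding a whitespace-separated list of made-up sample names:

```shell
# SAMPLES is a made-up variable holding three made-up sample names
SAMPLES="s001 s002 s003"

# $SAMPLES is deliberately left unquoted, so the shell splits it into words
for s in $SAMPLES; do
  echo "Processing sample $s"
done
```

If you wrapped $SAMPLES in double quotes there, the loop would run just once, with the whole string as the value of s.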
You can also use globbing (path expansion) to provide a list of things (files, specifically) to cycle over. The following prints the path and the first two lines of all files in the directory table_inputs in this book’s repository, at this point in the book’s formation:
for i in table_inputs/*; do
  echo "======= file: $i ========="
  head -n 2 $i
done
# this produces:
======= file: table_inputs/minimal-tmux.txt =========
Within tmux? ; Command ; Effect
N ; `tmux ls` ; List any tmux sessions the server knows about
======= file: table_inputs/sam-columns-table.txt =========
Column & Field & Data Type & Description
1 & QNAME & String & Name/ID of the read (from FASTQ file)
======= file: table_inputs/sam-flag-table.txt =========
bit-# & bit-gram & $2^x$ & dec & hex & Meaning
1 & $\bitsa$ & $2^0$ & 1 & 0x1 & the read is paired (i.e. comes from paired-endsequencing.)
======= file: table_inputs/tmux-pane-strokes.txt =========
Within tmux? ; Command ; Effect
Y ; `<cntrl>-b /` ; Split current window/pane vertically into two panes
Those files are the ones used as input to a few of the tables in the book.
If it turns out that you want to cycle over some integers, in order (whether increasing or decreasing), from a starting value to a stopping value, you can use curly braces and two dots, like {1..5}:
for i in {1..5}; do echo The number is: $i; done
# this makes:
The number is: 1
The number is: 2
The number is: 3
The number is: 4
The number is: 5
And you can have negative numbers and a reverse order, too:
for i in {4..-2}; do echo The number is: $i; done
# this makes
The number is: 4
The number is: 3
The number is: 2
The number is: 1
The number is: 0
The number is: -1
The number is: -2
One cool thing to realize is that any output that goes to stdout from within a for loop (if it is not redirected from within the loop) effectively comes “flowing out” to stdout from right after the done keyword. So, you can redirect it in one swell foop from that point in your code like this:
for i in table_inputs/*; do
  echo "======= file: $i ========="
  head -n 2 $i
done > file-to-redirect-it-all-into.txt
…or, you could even pipe it to another command, like this, to print just the first column of text of each line:
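The original command is not preserved in this copy. Here is a self-contained sketch of the idea that loops over a few made-up strings rather than over the files in table_inputs, and assumes awk as the command that picks out the first column:

```shell
# everything the loop writes to stdout flows into the pipe after "done"
for i in "red apple" "yellow banana" "green pear"; do
  echo "$i"
done | awk '{print $1}'
# prints red, yellow and green, each on its own line
```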
5.8 More Conditional Evaluation: if, then, else, and friends
We have already seen how exit statuses can be combined with && or || to control the flow of a script (i.e., if this failed, don’t do the next line…). There is also a traditional if/then/else construct in bash to control the execution of a script on the basis of exit status. Thus, we can do something like:
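The original example is not preserved in this copy. Here is a minimal sketch of the idea, in which ls is asked to list a file that cannot exist because it lives in a freshly made, empty scratch directory (SCRATCH and LS_RESULT are names invented for this example):

```shell
# SCRATCH is a brand new, empty directory, so the file below does not exist
SCRATCH=$(mktemp -d)

# the if looks only at the exit status of the ls command
if ls "$SCRATCH/no-such-file" > /dev/null 2>&1; then
  LS_RESULT="ls succeeded"
else
  LS_RESULT="ls failed"
fi

echo "$LS_RESULT"
# prints: ls failed
```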
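Note that this opens an if block of code with if and then it closes it (after the then and the else) with a backward if: fi.
The general syntax is: if command; then commands; else commands; fi. The commands after then are evaluated when the exit status of the command after if is SUCCESS!, and the commands after else are evaluated otherwise.
There is also an else-if construct, that is named elif, with syntax like this: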
if exit_status1; then
  Do this if exit_status1 = SUCCESS!
elif exit_status2; then
  Do this if exit_status1 = NO_SUCCESS and exit_status2 = SUCCESS!
else
  Do this if both were NO_SUCCESS
fi
Many of the times when you want to use an if construct, you will want to be testing things about files, or strings or variables, rather than assessing the exit status of commands. Alas, all bash is capable of is assessing exit statuses. But hark! All is not lost, because there is a command called test that lets you test whether statements are true, and if they are, then it returns an exit status of 0 (SUCCESS!).
For example, to test if a file exists (and is a regular file), you can use:
# this file does exist in the current directory on my system...
test -f eca-bioinf-handbook.Rproj
# do this to see what the exit status was
echo $?
Or, to test whether the value of a variable is the same as some string:
VAR=big_and_bad # set a variable to some string value
test "$VAR" = small_and_sweet
# get exit status, which will be NO_SUCCESS (1)
echo $?
The value of integer variables can be tested too. See man test for all the details.
Now, the thing to remember about the test command is that there is an intuitive-looking shorthand for writing it: that is to write its arguments between a [ and a ], but not touching either of them. So, test -f eca-bioinf-handbook.Rproj can also be written as [ -f eca-bioinf-handbook.Rproj ].
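Here is a short sketch of the shorthand in action, using the variable VAR from above (STATUS_TEST, STATUS_BRACKET and SIZE are names made up for this example):

```shell
VAR=big_and_bad

# these two lines are equivalent ways of running the same test
test "$VAR" = big_and_bad;  STATUS_TEST=$?
[ "$VAR" = big_and_bad ];   STATUS_BRACKET=$?
echo "$STATUS_TEST $STATUS_BRACKET"    # 0 0

# the bracket form makes for readable conditions in an if block
if [ "$VAR" = small_and_sweet ]; then
  SIZE=small
else
  SIZE=big
fi
echo "$SIZE"    # big
```

The spaces just inside the brackets are not optional: [ is really a command name, and ] is its last argument.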
Note that the RStudio bash shell seems to be doing something weird: I don’t think the “last command” is quite what we think it should be there, so $? does not seem to be reliable in that setting.
5.9 Finally…positional parameters
Way back when we started this chapter, in the example script, clone-classroom-repos.sh, right at the top we see lines that use $1.
What the heck is that $1? That is not proper syntax for a variable name! A variable name can’t start with a number, after all. Aha! $1, $2, $3, and so forth are variables that store the positional parameters of a script. In other words, if you write a script called my_script.sh and you invoke it with some words after it, like my_script.sh BigFile.txt small-file.txt Yee-ha "Oh Yeah", then, when that script is executing the code inside it, the value of $1 will be BigFile.txt, the value of $2 will be small-file.txt, the value of $3 will be Yee-ha, and the value of $4 will be Oh Yeah.
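To see this for yourself, here is a sketch that writes a tiny my_script.sh into a scratch directory and then runs it. The contents of the script are my own invention, and SCRATCH and OUTPUT are names made up for this example:

```shell
# write a tiny script into a scratch directory
SCRATCH=$(mktemp -d)
cat > "$SCRATCH/my_script.sh" <<'EOF'
#!/bin/bash
echo "Number of parameters: $#"
echo "First:  $1"
echo "Second: $2"
echo "Third:  $3"
echo "Fourth: $4"
EOF

# the quotes make "Oh Yeah" a single parameter despite the space in it
OUTPUT=$(bash "$SCRATCH/my_script.sh" BigFile.txt small-file.txt Yee-ha "Oh Yeah")
echo "$OUTPUT"
```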
The variable $# inside the script holds the number of positional parameters, and, in our example script, [ $# -ne 4 ] evaluates to SUCCESS! if $# is not equal to 4, in which case the script prints a message about how to use it.
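A minimal sketch of that check, using set -- to fake the positional parameters of the current shell (something I am adding only so that the example can run outside of a script; USAGE_NEEDED is likewise a made-up name):

```shell
# pretend that the script was given only three parameters
set -- one two three

# -ne means "not equal", so the test succeeds when the count is wrong
if [ $# -ne 4 ]; then
  USAGE_NEEDED=yes
else
  USAGE_NEEDED=no
fi

echo "$# parameters given, usage message needed: $USAGE_NEEDED"
# prints: 3 parameters given, usage message needed: yes
```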
Your Turn: Write a script in a separate file called “three-things.sh” that is expecting to take three positional parameters. Inside the script, have it print the actual values passed to it in reverse order, for example:
Third parameter is: -----
Second parameter is: -----
First parameter is: -----
Where ----- would be replaced by the actual value of the positional parameters.
5.10 basename and dirname, two useful little utilities
When a relative path pattern is expanded, the results are listed as paths relative to the current working directory. Sometimes, you just want the name of the file. This is what basename gives you. For example:
# expand the filenames a couple directories down:
files="figure-creation/1.01-unix/*"
# print all those file names
echo $files
# that produces this output:
figure-creation/1.01-unix/file-hierarchy.dot figure-creation/1.01-unix/file-hierarchy.pdf figure-creation/1.01-unix/file-hierarchy.png figure-creation/1.01-unix/file-hierarchy.sh
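# do this to get just the filenames (the -a option lets basename take
# several paths at once; without it, GNU basename complains about extra operands):
basename -a $files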
# which produces this output:
file-hierarchy.dot
file-hierarchy.pdf
file-hierarchy.png
file-hierarchy.sh
# if you want just the relative path to the directories those files
# are in you can use dirname, but have to operate on one file at a time:
for i in $files; do
  dirname $i
done
# makes this output:
figure-creation/1.01-unix
figure-creation/1.01-unix
figure-creation/1.01-unix
figure-creation/1.01-unix
5.11 bash functions
In our example script, clone-classroom-repos.sh, we define a bash function called usage that prints a helpful message showing the syntax and an example invocation of the script. This provides an example of how you can write functions in bash. In all honesty, I only rarely use functions in bash, but it is good to know about them nonetheless.
In bash, a function is just a collection of commands, grouped using curly braces, that will be evaluated when the function’s name is issued on the command line or within a script. Not only will those lines be evaluated, but you can also pass positional parameters to the function by following its name with other words/tokens/values. The positional parameters within a function are distinct from the positional parameters within, say, the main script.
Functions can be defined, most transparently, by using the function keyword. As an example, here is a silly function, called Silly, that takes two positional parameters and then merely prints them separated by monkey noises.
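The original definition is not preserved in this copy, so the following is a sketch of what such a function might look like. The particular monkey noises are my own invention:

```shell
# defined with the "function" keyword; $1 and $2 are the function's
# own positional parameters, not those of the script that calls it
function Silly {
  echo "$1 Ooh-ooh, ah-ah! $2"
}

# call it like any other command, following its name with two values
Silly bananas coconuts
# prints: bananas Ooh-ooh, ah-ah! coconuts
```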
Another syntax you might see omits the function keyword and instead follows the function’s name with ().
You can even put it all on one line, but you have to make sure that you respect the curly braces’ sensitivity to spacing and the explicit semicolon that ends the last command.
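Both of those alternatives are sketched here, reusing the name Silly with made-up monkey noises. Each definition simply replaces the one before it:

```shell
# the name followed by (), with no "function" keyword
Silly() {
  echo "$1 Ooh-ooh, ah-ah! $2"
}

# the same thing on one line: note the space after "{" and the ";" before "}"
Silly() { echo "$1 Ooh-ooh, ah-ah! $2"; }

Silly pears figs
# prints: pears Ooh-ooh, ah-ah! figs
```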
5.13 Further reading
An excellent chapter on the development of Unix (Raymond 2003)
A nice set of bash scripting tutorials can be found at https://ryanstutorials.net/bash-scripting-tutorial/