Chapter 8 High Performance Computing Clusters (HPCCs)
One interesting aspect of modern next generation sequencing data is simply its sheer size: it is not uncommon to receive a half or a full terabyte of data from a sequencing center. It is impractical to store this much data on your laptop, let alone analyze it there. Crunching through such a quantity of data on a single computer or server could take a very long time. Instead, you will likely break up the analysis of such data into a number of smaller jobs, and send them off to run on an assortment of different computers in a High Performance Computing Cluster (HPCC).
Thus, even if you have immersed yourself in bioinformatic data file formats and honed your skills at shell programming and accessing remote computers, sequence data analysis remains such a formidable foe that there is still one last key area of computing in which you must be fluent, in order to comfortably do bioinformatics: you must understand how to submit and manage jobs sent to an HPCC.
My first experience with HPCCs occurred when I started analyzing high-throughput sequencer output. I had over 15 years of experience in shell programming at that time, and I was given some example analysis scripts to emulate, but I still found it took several weeks before I was moderately comfortable in an HPCC environment. Most HPCCs have some sort of tutorial web pages that provide a little bit of background on cluster computing, but I didn’t find the ones available to me, at the time, to be particularly helpful.
The goal of this chapter is to provide the sort of background I wish that I had when I started doing cluster computing for bioinformatics. I will not be providing a comprehensive overview of parallel computation. For example, we will not focus at all upon the rich tradition of parallel computing applications through “message passing” interfaces which can maintain a synchronized analysis from a single program executing on multiple computers at the same time. Rather, we will focus on the manner in which most bioinformatic problems can be broken down into a series of smaller jobs, each of which can be run, independently, on its own processor without the need for maintaining synchrony between multiple processes.
We start with an overview of what an HPCC consists of, defining a few important terms. Then we provide some background on the fundamental problem of cluster computing: namely that a lot of people want to use the computing power of the cluster, but it needs to be allocated to users in an equitable fashion. An understanding of this forms the basis for our discussion of job scheduling and the methods you must use to tell the job scheduler what resources you will need, so that those resources will eventually be allocated to you. We will cover a job scheduler called SLURM, which stands for the Simple Linux Utility for Resource Management. It is the scheduler used on the Summit supercomputer in Boulder and the Hummingbird and Sedna clusters deployed at UCSC and the NWFSC, respectively.
8.1 An oversimplified, but useful, view of a computing cluster
At its simplest, an HPCC can be thought of as a whole lot of computers that are all put in a room somewhere for the purposes of doing a lot of computations. Most of these computers do not possess all the elements that typically come to mind when you think of a “computer.” For example, none of them are attached to monitors—each computer resembles just the “box” part of a desktop computer. Each of these “boxes” is called a node. The node is the unit in a cluster that corresponds most closely to what you think of as a “computer.” As is typical of most computers today, each of these nodes has some amount of Random Access Memory (RAM) that is only accessible by the node itself. RAM is the space in memory that is used for active computation and calculation.
Each of these nodes might also have a hard drive used for operating system software. The bulk of the hard drive space that each node can access, however, is in the form of a large array of hard disks that are connected to all of the nodes by an interface that allows data to be transferred back and forth between each node and the hard-drive array at speeds that would be expected of an internal hard drive (i.e., this array of hard drives is not just plugged in with a USB cable). Memory on hard drives is used for storing data and the results of calculations, but is not used for active calculations the way RAM is used. In order for calculations to be done on data that are on the hard drive array, it must first be read into a node’s RAM. After the calculations are done, the results are typically written back out onto the hard drive array.
On a typical cluster, there are usually several different “portions” of the hard drive
array attached to every node. One part of the array holds home directories of the
different users. Each user typically has a limited amount of storage in their home directory, but
the data in these home directories is usually safe or protected, meaning you can put a file there
and expect that it will be there next week, month, or year, etc. It is also likely that the home directories are
backed up (but check the docs for your cluster to confirm this!). Since the space in home directories is limited,
you typically will not put large data sets in your home directory for long-term storage, but you will
store scripts and programs, and such items as cloned GitHub repositories there.
(On the SUMMIT computer, space on the home directory is very restricted, and as we have
already seen, your projects directory at /projects/username
serves the purpose of a typical
home directory on SUMMIT).
Another
type of storage that exists on the hard drive array is called persistent long-term storage. This type of storage
is purchased for use by research groups to store large quantities of
data on the cluster for long periods of time. As discussed in the last chapter, the rise of cloud-based storage solutions,
like Google Drive, offering unlimited storage to institutional users, makes persistent long-term storage less
important (and less cost-effective) for many research groups. Finally, the third type of
storage in the hard drive array is called scratch storage. There are usually fairly
light limits (if any at all) to how much data can be placed in scratch storage, but there
will typically be Draconian time limits placed on your scratch storage. For example, on the
Hoffman2 cluster at UCLA, you are granted 2 Tb of scratch storage, but any files that have sat unmodified in scratch for more than 14 days will be deleted (and if space is tight, the system administrators may delete things from scratch in far fewer than 14 days). Check your local cluster documentation for information about time and space limits on scratch.
On many clusters, scratch space is also configured to be very fast for input and output
(for example, on many systems, the scratch storage will be composed of solid state drives rather
than spinning hard disks). On jobs that require that a lot of data be accessed from the
drive or written to it (this includes most operations on BAM files), significant decreases
in overall running time can be seen by using fast storage. Finally, scratch space exists on a cluster
expressly as the place to put data and outputs on the hard drive when running jobs. For all these
reasons, when you run jobs, you will always want to read and write data from and to scratch. We will talk more about the specifics of doing so, but for now, you should be developing a generic picture of your cluster computing workflow that looks like:

- Download big data files from the cloud to scratch.
- Run analyses on those data files, writing output back to scratch.
- When done, copy, to the cloud, any products that you wish to keep.
- Remove data and outputs left on scratch.
As is the case with virtually all modern desktop or laptop computers, within each node, there are multiple (typically between 16 and 48) cores, which are the computer chip units that actually do the computations within a node. A serial job is one that just runs on a single core within a node, while a parallel job might run on multiple cores, at the same time, within a single node. In such a parallel job, each core has access to the same data within the node’s RAM (a “shared-memory,” parallel job). The core is the fundamental unit of computing machinery that gets allocated to perform jobs in an HPCC.
Most of the nodes in a cluster are there to hold the cores which are the computational workhorses, slogging through calculations for the HPCC’s myriad users. However, some nodes are more appropriate to certain types of computations than others (for example, some might have lots of memory for doing genome assembly, while others will have big, hurkin’, graphical processing units to be used for GPU calculations). Or, some nodes might be available preferentially for different users than for others. For these reasons, nodes are grouped into different collections. Depending on the system you are using, these collections of different nodes are called, either, partitions or queues. The world of SLURM uses the term partitions for these collections, and we will adopt that language as well, using queue to refer to the line of jobs that are waiting to start on an HPCC; however we warn the reader that other job schedulers (like the Univa Grid Engine) use the term “queues” to refer to collections of nodes.
On every cluster, there will be one to several nodes that are reserved not for
doing computation, but for allowing users to access the cluster. These are called the
login nodes or the head nodes. These nodes are solely for logging in, light editing of
scripts, minor manipulation of directories, and scheduling and managing jobs. They are absolutely not
for doing major computations. For example, you should never login to the head node and immediately
start using it in an interactive bash
session to, say, sort BAM files or run bwa mem
to do alignments. Running commands that require a lot of computation—or a lot of
input and output from the disk—on the login nodes is an egregious
faux pas of cluster computing. Doing so can
negatively impact the ability of other users to login or otherwise get their work done, and it
might lead the system administrators to disable your account. Therefore,
never do it! All of your hardcore computation on a cluster has to be done on
a compute node. We will show how to do that shortly, but first we will talk about why.
8.2 Cluster computing and the job scheduler
When you do work, or stream a video, or surf the web on your laptop computer, there are numerous different computer processes running, to make sure that your computer keeps working and doing what it is supposed to be doing. In the case of your laptop, the operating system, itself, orchestrates all these different processes, making sure that each one is given some compute time on your laptop’s processors in order to get its work done. Your laptop’s operating system has a fair bit of flexibility in how it allocates resources to these different processes: it has multiple cores to assign different processes to, and it allows multiple processes to run on a single core, alternating between these different processes over different cycles of the central processing unit. Things work differently on a shared resource like an HPCC. The main, interesting problem of cluster computing is basically this: lots of people want to use the cluster to run big jobs, but the cluster does not run like a single computer.
The cluster is not happy to give lots of different jobs from lots of different users a chance to all run on the same core, sharing time by dividing up cycles. When a user wants to use the computational resources of an HPCC, she cannot just start a job and be confident that it will launch immediately and be granted at least a few CPU cycles every now and again. Rather, on an HPCC, every job that is submitted by a user will be assigned to a dedicated core (or several cores, if requested, and granted) with a dedicated amount of memory. If a core is not available for use, the job “gets in line” (into the “queue”, so to speak) where it sits and waits (doing nothing) until a core (or multiple cores if requested) with sufficient associated resources, like RAM memory, becomes available. When such a core becomes available in the cluster, the job gets launched on it. All of this is orchestrated by the job scheduler, of which SLURM is an example.
In this computing model, a job, once it is launched, ties up the core and the memory that has been allocated to it until the job is finished. While that job is running, no one else’s jobs or processes can run on the core or share the RAM memory that was allocated to the job. For this reason, the job scheduler needs to know, ahead of time, how long each job might run and what resources will be required during that time. A simple contrived example illustrates things easily: imagine that Joe and Cheryl each have 1000 separate jobs to run. Each of Cheryl’s jobs involves running a machine-learning algorithm to identify seabirds in high-resolution, aerial images of the ocean, and takes only about 20 minutes running on a single core. Each of Joe’s jobs, on the other hand, involves mapping billions of sequencing reads, a task which requires about 36 hours when run on a single core. If their cluster has only 640 cores, and Joe submits his jobs first, then, if the job scheduler were naive, it might put all of his jobs in line first, requiring some 50 or 60 hours before the first of Cheryl’s jobs even runs. This would be a huge buzz kill for Cheryl. However, if Cheryl and Joe both have to provide estimates to the scheduler of how long their jobs will run, the scheduler can make more equitable decisions, starting a few of Joe’s jobs, but retaining many more cores for Cheryl’s jobs, each of which runs much faster.
Thus, when you want to run any jobs on a cluster, you must provide an estimate of the resources that the job will require. The three main axes upon which these resources are measured are:
- The number of cores the job will require.
- The maximum amount of RAM (memory) the job will require.
- The amount of time for which the job will run.
Requests for large amounts of resources for long periods of time generally take longer to start. There are two main reasons for this: either 1) the scheduler does not want to launch too many long-duration, high-memory jobs because it anticipates other users will want to use resources down the road and no single user should tie up the compute resources for too long; or 2) there are so many jobs running on myriad nodes and cores, that only infrequently do nodes with sufficient numbers of cores and RAM come available to start the new jobs.
The second reason is a particular bane of new cluster users who unwittingly request more resources
than actually exist (i.e. 52 cores, when no single node has more than 32; or 50 Gb of RAM when no single
node has more than 48 Gb). Unfortunately (or, perhaps comically, if you have a sick sense of
humor), some job schedulers will not notify you of this sort of transgression.
Rather, your job will just sit in line waiting to be launched, but it never will be, because sufficient
resources never become available!
(Fortunately, however, the SLURM scheduler is “wise” enough to notify the user if
she or he has requested more resources than will ever become available.)
It is worth noting that regardless of whether reason 1 or reason 2 is the dominant cause influencing how long it takes to start a job, asking for fewer resources for less time will generally allow your jobs to start faster. Particularly because of reason #2, however, breaking your jobs down (if possible) into small chunks that will run relatively quickly on a single core with low RAM needs can render many more opportunities for your jobs to start, letting you tap into resources that are not often fully utilized in a cluster. Since I started working on a large cluster in which it took a long time to start a job that required all or most of the cores on a single node—but in which there were many nodes harboring a few cores that were not being used—I tend to endorse the approach of breaking jobs down into small units that will run relatively quickly, with low RAM requirements, on a single core.
Since the requested resources for a job can play such a large role in wait times for jobs to start, you might wonder why people don’t intentionally underestimate the resources they request for their jobs. The answer is simple: the job scheduler is a highly efficient and completely dispassionate manager. If you requested 2 hours for your job, but your job has not finished in that amount of time, the job scheduler will waste no time hemming and hawing or having an emotional struggle with itself about whether it should stop your job. No, at 2 hours and 0.2 seconds it WILL kill your job, regardless of whether it is just about to finish, or not. Similarly, if you requested 4 Gb of RAM, but five hours into your job, the program you are running ends up using 5 Gb of RAM to store a large chunk of data, your job WILL be killed immediately.
Thus, it is best to be able to accurately estimate the time and resources a job will require. You always want to request more time and resources than your job will actually need, but not too much more. A large part of getting good at computing in a shared cluster resource is gaining experience in predicting how long different jobs will run, and how much RAM they will require. Later we will describe how the records of your jobs, stored by the job scheduler, can be accessed and organized to aid in predicting the resource demand of future jobs.
8.3 Learning about the resources on your HPCC
Most computing clusters have a web page that describes the configuration of the system and the resources available on it. However, you can use various SLURM commands to learn about those resources as well, and also to gain information about which of the resources are in use and how many users are waiting to use them.
Requests for resources, and for information about the computing cluster, are made to the job scheduler using a few basic commands. As said, we will focus on the commands available in a SLURM-based system. In a later section, we will more fully cover the commands used to launch, schedule, and manage jobs. Here we will first explore the SLURM commands that you can use to “get to know” your cluster.
All SLURM commands begin with an s
, and all SLURM systems support the sinfo
command
that gives you information about the cluster’s nodes and their status (whether they are currently
running jobs or not.) On your cluster, man sinfo
will tell you about this command.
On the Sedna cluster, shortly after it was installed (and before it had many users), issuing the sinfo command shows this:
% sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
nodes* up infinite 28 idle node[01-28]
himem up infinite 1 alloc himem01
himem up infinite 2 idle himem[02-03]
which tells us that there are two partitions (collections of nodes) named nodes and himem. The himem partition has one node which is currently allocated to a job (STATE = alloc), and two more that are not currently allocated to jobs. The himem partition holds machines with lots of RAM for tasks like sequence assembly (and, when I ran that command, one of the nodes in himem was busy assembling the genome of some interesting sea creature). The cluster also has 28 compute nodes in the nodes partition that are idle, which means they are available for use. This is a very small cluster.
If you do the same command on SUMMIT you get many more lines of output. There are many more different partitions, and there is a lot of information about how many nodes are in each:
% sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
shas* up 1-00:00:00 2 drain* shas[0136-0137]
shas* up 1-00:00:00 2 down* shas[0404,0506]
shas* up 1-00:00:00 1 drain shas0101
shas* up 1-00:00:00 3 resv shas[0102,0521,0853]
shas* up 1-00:00:00 74 mix shas[0125,0130,0133,0138,0141,0149,0156,0158,0218,0222,0229,0236-0237,0240-0241,0243,0246,0255,0303,0311,0314,0322,0335,0337,0341,0343,0351,0357,0402,0411,0413,0415-0416,0418,0423,0428,0432-0433,0435-0436,0440,0452,0455-0456,0459,0501,0504,0514,0522-0524,0526-0527,0556,0608,0611-0613,0615-0616,0631,0637,0801,0810-0811,0815,0834-0835,0850,0855,0907-0909,0921]
shas* up 1-00:00:00 359 alloc shas[0103-0124,0126-0129,0131-0132,0134-0135,0139-0140,0142-0148,0150-0155,0157,0159-0160,0201-0217,0219-0221,0223-0225,0227-0228,0230-0235,0238-0239,0242,0247-0254,0256-0260,0301-0302,0304-0310,0312-0313,0315-0321,0323-0334,0336,0338-0340,0342,0344-0350,0352-0356,0358-0360,0401,0405-0410,0412,0414,0417,0419-0422,0424-0427,0429-0431,0434,0437-0439,0442-0451,0453-0454,0457-0458,0460,0502-0503,0505,0507-0513,0515-0520,0525,0528-0555,0560,0605-0607,0609,0614,0617-0630,0632-0636,0638-0650,0652-0664,0802-0809,0812-0814,0816-0833,0836-0849,0851-0852,0854,0856-0860,0901-0906,0910-0913,0915-0920,0922-0932]
shas* up 1-00:00:00 11 idle shas[0226,0244-0245,0403,0441,0557-0559,0610,0651,0914]
shas-testing up infinite 2 drain* shas[0136-0137]
shas-testing up infinite 2 down* shas[0404,0506]
shas-testing up infinite 1 drain shas0101
shas-testing up infinite 3 resv shas[0102,0521,0853]
shas-testing up infinite 74 mix shas[0125,0130,0133,0138,0141,0149,0156,0158,0218,0222,0229,0236-0237,0240-0241,0243,0246,0255,0303,0311,0314,0322,0335,0337,0341,0343,0351,0357,0402,0411,0413,0415-0416,0418,0423,0428,0432-0433,0435-0436,0440,0452,0455-0456,0459,0501,0504,0514,0522-0524,0526-0527,0556,0608,0611-0613,0615-0616,0631,0637,0801,0810-0811,0815,0834-0835,0850,0855,0907-0909,0921]
shas-testing up infinite 359 alloc shas[0103-0124,0126-0129,0131-0132,0134-0135,0139-0140,0142-0148,0150-0155,0157,0159-0160,0201-0217,0219-0221,0223-0225,0227-0228,0230-0235,0238-0239,0242,0247-0254,0256-0260,0301-0302,0304-0310,0312-0313,0315-0321,0323-0334,0336,0338-0340,0342,0344-0350,0352-0356,0358-0360,0401,0405-0410,0412,0414,0417,0419-0422,0424-0427,0429-0431,0434,0437-0439,0442-0451,0453-0454,0457-0458,0460,0502-0503,0505,0507-0513,0515-0520,0525,0528-0555,0560,0605-0607,0609,0614,0617-0630,0632-0636,0638-0650,0652-0664,0802-0809,0812-0814,0816-0833,0836-0849,0851-0852,0854,0856-0860,0901-0906,0910-0913,0915-0920,0922-0932]
shas-testing up infinite 11 idle shas[0226,0244-0245,0403,0441,0557-0559,0610,0651,0914]
shas-interactive up infinite 2 drain* shas[0136-0137]
shas-interactive up infinite 2 down* shas[0404,0506]
shas-interactive up infinite 1 drain shas0101
shas-interactive up infinite 3 resv shas[0102,0521,0853]
shas-interactive up infinite 74 mix shas[0125,0130,0133,0138,0141,0149,0156,0158,0218,0222,0229,0236-0237,0240-0241,0243,0246,0255,0303,0311,0314,0322,0335,0337,0341,0343,0351,0357,0402,0411,0413,0415-0416,0418,0423,0428,0432-0433,0435-0436,0440,0452,0455-0456,0459,0501,0504,0514,0522-0524,0526-0527,0556,0608,0611-0613,0615-0616,0631,0637,0801,0810-0811,0815,0834-0835,0850,0855,0907-0909,0921]
shas-interactive up infinite 359 alloc shas[0103-0124,0126-0129,0131-0132,0134-0135,0139-0140,0142-0148,0150-0155,0157,0159-0160,0201-0217,0219-0221,0223-0225,0227-0228,0230-0235,0238-0239,0242,0247-0254,0256-0260,0301-0302,0304-0310,0312-0313,0315-0321,0323-0334,0336,0338-0340,0342,0344-0350,0352-0356,0358-0360,0401,0405-0410,0412,0414,0417,0419-0422,0424-0427,0429-0431,0434,0437-0439,0442-0451,0453-0454,0457-0458,0460,0502-0503,0505,0507-0513,0515-0520,0525,0528-0555,0560,0605-0607,0609,0614,0617-0630,0632-0636,0638-0650,0652-0664,0802-0809,0812-0814,0816-0833,0836-0849,0851-0852,0854,0856-0860,0901-0906,0910-0913,0915-0920,0922-0932]
shas-interactive up infinite 11 idle shas[0226,0244-0245,0403,0441,0557-0559,0610,0651,0914]
sgpu up 1-00:00:00 1 resv sgpu0501
sgpu up 1-00:00:00 1 alloc sgpu0502
sgpu up 1-00:00:00 9 idle sgpu[0101-0102,0201-0202,0301-0302,0401-0402,0801]
sgpu-testing up infinite 1 resv sgpu0501
sgpu-testing up infinite 1 alloc sgpu0502
sgpu-testing up infinite 9 idle sgpu[0101-0102,0201-0202,0301-0302,0401-0402,0801]
sknl up 1-00:00:00 1 drain sknl0710
sknl up 1-00:00:00 1 resv sknl0706
sknl up 1-00:00:00 18 alloc sknl[0701-0705,0707-0709,0711-0720]
sknl-testing up infinite 1 drain sknl0710
sknl-testing up infinite 1 resv sknl0706
sknl-testing up infinite 18 alloc sknl[0701-0705,0707-0709,0711-0720]
smem up 7-00:00:00 1 drng smem0201
smem up 7-00:00:00 4 alloc smem[0101,0301,0401,0501]
ssky up 1-00:00:00 1 drng ssky0944
ssky up 1-00:00:00 1 mix ssky0952
ssky up 1-00:00:00 3 alloc ssky[0942-0943,0951]
ssky-preemptable up 1-00:00:00 1 drng ssky0944
ssky-preemptable up 1-00:00:00 1 mix ssky0952
ssky-preemptable up 1-00:00:00 9 alloc ssky[0933-0934,0937-0940,0942-0943,0951]
ssky-preemptable up 1-00:00:00 9 idle ssky[0935-0936,0941,0945-0950]
ssky-ucb-aos up 7-00:00:00 6 alloc ssky[0933-0934,0937-0940]
ssky-ucb-aos up 7-00:00:00 3 idle ssky[0935-0936,0941]
ssky-csu-mbp up 7-00:00:00 1 idle ssky0945
ssky-csu-asb up 7-00:00:00 1 idle ssky0946
ssky-csu-rsp up 7-00:00:00 4 idle ssky[0947-0950]
Yikes! That is a lot of info. The numbers that you see on the ends of the lines there are node numbers that belong to each category.
The first column gives the partition that the nodes on the row belong to. (Note that some nodes belong to multiple partitions.) The second column tells us whether the machine(s) are operational (i.e., up). The third column is the time limit for jobs in the partition, and the fourth column is the number of nodes in the row. The fifth column is very important: it describes the state of the nodes in the row. There are technically a lot of states that a node can be in, but the most common and relevant ones are:

- alloc: Every core in the node is allocated to jobs that are currently running.
- mix: Some, but not all, of the cores on the node are allocated to currently running jobs.
- drain: Jobs are running on the node, but no new jobs will be run on it, as it is being reserved (typically for maintenance).
- idle: No cores are allocated to jobs. These are available for use!
If you wanted to see information about each node in the shas
partition, you could
print long information for each node like this:
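# for example: node-oriented (-N) output in the long (-l) format
sinfo -N -l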
On Summit, that creates almost 1500 lines of output. Some nodes are listed several times
because they belong to different partitions. To look at results for just the standard compute
partition (shas
: 380 nodes with Intel Xeon Haswell processors) you can use
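# -p restricts the output to the named partition
sinfo -N -l -p shas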
Try that command. (Or try a similar command with an appropriate partition name on your own cluster.) If you want to see explicitly how many cores are available vs allocated on each node, how much total memory each node has, and how much of that total memory is free, in that partition you can do:
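# -O chooses which output fields (columns) to print for each node
sinfo -N -p shas -O nodelist,cpusstate,memory,allocmem,freemem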
The top part of the output from that command on SUMMIT looks like:
NODELIST CPUS(A/I/O/T) MEMORY ALLOCMEM FREE_MEM
shas0101 0/0/24/24 116368 0 125156
shas0102 0/24/0/24 116368 0 112325
shas0103 24/0/0/24 116368 116352 87228
shas0104 24/0/0/24 116368 116352 80769
This says shas0101
has 24 CPUs that are Out (not functional at this point).
shas0102
, on the other hand, has 24 CPUs that are Idle, while shas0103
has
24 CPUs that are allocated, and so forth. (In our parlance, here,
CPU is being used to mean “core”). All of the nodes have
116 Gb of memory total. Most of them have about that much memory allocated to
the jobs they are running.
We can throw down some awk
to count the total number of available cores
(and the total number of all the cores):
% sinfo -N -p shas -O nodelist,cpusstate | awk -F"/" '{avail+=$2; tots+=$4} END {print "Out of", tots, "cores, there are", avail, "available"}'
Out of 10848 cores, there are 331 available
That could explain why it can be hard to get time on the supercomputer: at this time, only about 3% of the cores in the system are idle, waiting for jobs to go on them.
If you want to see how many jobs are in line, waiting to be launched, you can use the
squeue
command. Simply issuing the command squeue
will give a (typically very long)
list of all jobs that are either currently running or are in the queue,
waiting to be launched. This is worth doing in order to see how many
different people are using (or waiting to use) the resources.
If we run the command on Sedna, we see that (like we saw before) only one job is currently running:
% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
160 himem MaSuRCA_ ggoetz R 34-15:56:16 1 himem01
Holy cow! Looking at the TIME column you can see that the genome assembly job has been running for over 34 days! (The time format is DAYS-Hours:Minutes:Seconds).
The output can be filtered to just PENDING
or RUNNING jobs using the -t
option. So,
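squeue -t PENDING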
lists all jobs waiting to be launched. If you wanted to see jobs that a particular
user (like yourself) has running or pending, you can use the -u
option. For example,
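squeue -u eriq@colostate.edu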
In fact that is such a worthwhile command that it is worth creating an alias
for it by adding a line such as the following to your ~/.bashrc
:
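alias myjobs='squeue -u eriq@colostate.edu'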
Then, typing myjobs
shows all of the jobs that you have running, or waiting to be launched
on the cluster. (Of course, you should replace eriq@colostate.edu
with your own
user name!)
8.4 Getting compute resources allocated to your jobs on an HPCC
8.4.1 Interactive sessions
Although most of your serious computing on the cluster will occur in batch jobs that get submitted to the scheduler and run without any further interaction from you, the user, we will begin our foray into the SLURM job scheduling commands by getting an interactive shell on a compute node. “Why?” you might ask? Well, before you launch any newly-written script as a batch job on the cluster, you really should run through the code, line by line, inside that script and ensure that things work as they should. Since your script likely does some heavy computing, you can’t do this on the head nodes (i.e., login nodes) of the cluster.
The system administrators of every different HPCC seem to have their own preferred (or required) way of requesting an interactive shell on a compute node. Before you start doing work on an HPCC, you should always read through the documents (usually available on a website associated with the HPCC) to learn the accepted way of doing certain tasks (like requesting an interactive shell). The following is a short survey of three different ways I have seen for requesting an interactive session. None of these methods is universally applicable to all HPCCs. (Again, read the documents for your own HPCC.)
One way to get an interactive shell on a compute node of a cluster is to request it with:
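srun --pty /bin/bash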
This tells slurm to run bash
(which is typically found at /bin/bash
on all Unix systems) within
a pseudo-terminal (the --pty
part). If you run this command you might be given a bash shell on a compute
node. This works on the Sedna cluster, but not on all others.
On the SUMMIT cluster at CU Boulder, there is
a different recommended way of requesting an interactive shell. The
sysadmins have written their own script to do it, called sinteractive
.
So, you could do:
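sinteractive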
However, that method uses screen
under the hood, and that can interfere
with your tmux session. So, you might prefer to do:
# get on the scompile node
ssh scompile
# Then use srun, requesting the shas-interactive partition
srun --partition shas-interactive --pty /bin/bash
Note that you might only be able to get one job at a time on the
shas-interactive
partition.
You might also try the shas-testing
partition on SUMMIT.
On the Hummingbird cluster, at UCSC, it takes a few more steps. The webpage at https://www.hb.ucsc.edu/getting-started/
tells you how to do that. Go to that page and search for Job Allocation – Interactive, Serial. That takes you to
the instructions. Minimally, here is what it looks like to get one core
on a node on the Instruction partition, for 1 hour, with 1 Gigabyte of memory
allocated to you:
[~]--% salloc --partition=Instruction --nodes=1 --time=01:00:00 --mem=1G --cpus-per-task=1
salloc: Granted job allocation 48130
[~]--% ssh $SLURM_NODELIST
# check that you are on the right host.
[~]--% hostname
hbcomp-005.hbhpc.ucsc.edu
Regardless of the method used to request an interactive shell, when the command
returns with a command prompt, you should always check the hostname to make sure
that you are not, inadvertently, still on the login node. Do that with the hostname
command:
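hostname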
Or, to make it very easy to see which node you are logged into, you can change
the settings for your Unix command prompt to show you the name of the host you
are logged into (Section 4.2.2). For example, to include in your
command prompt the short name of the computer you are logged into, and the current
working directory, separated by a :
and a space, you could modify your .bashrc
file to define the PS1
environment
variable like so:
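# for example: \h is the short hostname and \w is the current working directory
PS1='\h: \w '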
Now, whenever you see your command prompt, it will indicate the name of the computer you are logged into.
It is also worth using your alias myjobs
(see the last part of Section 8.3
to learn how to make such an alias) to check that your interactive shell session is listed amongst
your running jobs.
Once you have obtained an interactive shell on a compute node by one of the above methods, it should behave like any other shell you work on. You can give commands with it and define shell variables, etc. In short, it is perfect for running through your scripts, line by line, to make sure that they are working.
When you are done using your interactive shell, you should logout from it. This can be
done by typing logout
, or by hitting the d
key while holding down the control
button
(i.e. <cntrl>-d
). This returns the core that you were using to the pool of cores that can
be allocated to other users.
Different HPCCs have different settings for the default amount of time an interactive shell will be
granted to a user. On some systems, you can request a shell for a certain amount of time using the
--time
option to the slurm command.
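For example, on a system where srun is used to obtain interactive shells, a two-hour session might be requested with something like:

srun --time=02:00:00 --pty /bin/bash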
8.4.2 Batch jobs
After you have tested your bioinformatic scripts in an interactive shell and are ready to run them “for real,” you will want to launch the script in a non-interactive, or batch job. Such a job is one that you submit to the job scheduler, and then, when resources for the job become available, the scheduler will launch the job without any further interaction from you. The job will run until it completes (or fails, or runs out of time, etc.), all without you having to interact directly with the computing cluster after submitting the job.
As you might expect, the SLURM command used to submit batch jobs is named (…wait for it!…) sbatch.
The syntax for using sbatch
is:
sbatch sbatch_options script.sh script_arguments
where:
- sbatch is the command.
- sbatch_options shows the place where different options to the sbatch command should be placed. These options can be a number of things like --time=00:10:00 and --mem=4G. We will talk about sbatch options in more detail below.
- script.sh is the script that you have written that you want to run on the HPCC. We refer to this as the job script. It does not have to be named script.sh, but, since it is a shell script, it ought to have the .sh extension. The contents of this script file effectively define the job that you wish to run. It should be a shell script, and, on most SLURM systems you need to be specific about what “flavor” of shell it should be run in. This is done using the “shebang” line at the top of the script, like #!/bin/bash (see the paragraph immediately before 5.2.1). If you want to know what path should come after the #! (the “shebang”), you can give the command which bash from the shell on one of your cluster’s login nodes (or an interactive shell on a compute node), and the shell should respond with the absolute path to the bash shell interpreter.
- script_arguments are any other options, or other arguments (like filenames) that you may want to pass to the script.
For practice in submitting batch jobs, we will first write a simple job script that prints a few pieces of information about the shell that is running and the environment variables defined within that shell.
So, first, change into your scratch directory
and make a new directory called sbatch-play
and cd
into it. Then, using a text editor,
copy the following lines to a shell script called easy.sh
and save it within
your current working directory (sbatch-play
). Note that the first line might
have to be modified for your system, depending on the output of which bash
, as mentioned
above:
#!/bin/bash
echo "Starting at $(date)"
echo
echo "Current working directory is: $PWD"
echo "Current user is: $USER"
echo "Current hostname is: $(hostname)"
echo
echo "Currently, the PATH is set to: $PATH"
echo
# now, let's print the states of all the environment variables
# and send them to stderr:
env > /dev/stderr
sleep 180
echo "Done at $(date)"
Once you have made that file, schedule it to run, telling the scheduler
to put it in the queue to run, and that you are asking for only
5 minutes of computer time to run it, and to send stdout from the
job to a file named output-XXXXX
, and stderr to a file named
error-XXXXX
, where XXXXX
is the job number that was assigned to your job.
Job numbers get assigned in series by the job scheduler, so if the number
is 4373238, then yours is the 4,373,238-th job that has been requested
on the cluster.
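In other words, the submission command will look something like this (SLURM expands the %j in the file names to the job number):

sbatch --time=00:05:00 --output=output-%j --error=error-%j easy.sh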
You can check with your myjobs
alias to see if the job has been scheduled
and/or if it is running. Once it is running, you should see two new files
in your current working directory: output-XXXXX and error-XXXXX.
First, check the output-XXXXX
file, using the cat
command. You should
see that the current working directory printed by the easy.sh
script
is the same as the current working directory from which you scheduled
the job. Additionally, the PATH variable accessed inside the easy.sh
script
is the same as the PATH variable in the shell that you scheduled
the job from. You can check this by issuing:
echo $PATH
in the shell from which you scheduled the job.
The take-home message from this little exploration is that the shell
in which your job script, easy.sh
, gets executed, inherits some of the important environment
variables (PATH, PWD, etc.) that are in effect in the shell from which you
scheduled the job.
However, a number of new environment variables have also been defined, most of them
supplying extra information from SLURM itself. To see that, we can compare the
file error-XXXXX
(which holds information about all the environment variables
in effect when the job scheduler launched easy.sh
) to the output of the
env
command (which prints the values of all the environment variables) when we
invoke it in the shell from which the job was scheduled. Try doing that like this:
# first, make a file of the environment variables in effect in the
# shell the job was scheduled from. Sort these alphabetically to
# make it easier to compare between the two environments
env | sort > scheduling_env.txt
# also, sort the error-XXXXX file
sort error-XXXXX > slurm_running_env.txt
# then use less to compare scheduling_env.txt and slurm_running_env.txt.
# The adventurous can use tmux to split their screen vertically into
# two panes to make this easier:
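# for example (within less, :n moves to the next file and :p moves back):
less scheduling_env.txt slurm_running_env.txt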
Sorting the files, as we have done above, garbles up some multi-line bash functions, but
still makes it easy to see how the files differ. In particular, the
large block of environment variables that start with SLURM_
give information
about the job resources and job numbers, etc. These can come in handy, though
we will end up being more concerned with some of those variables when we
start using SLURM job arrays.
8.4.2.1 The sbatch --test-only option
Especially when you start using an HPCC, scheduling jobs can be a confusing and somewhat
daunting process. I used to always find my blood pressure going up when I submitted large
jobs. Questions swirled in my head: “Will my script run? How long will I have to wait
for my job to launch? Have I requested more resources than are available on the cluster
such that my job will never launch?”, etc. SLURM provides the --test-only
option to
sbatch
that can be helpful in this regard. When sbatch
is run with the --test-only
option,
the scheduler verifies that your script looks like a shell script, and then gives you information
about how many cores in how many nodes and in which partition the job will be scheduled for.
Try it like this:
sbatch --test-only --time=00:05:00 --output=output-%j --error=error-%j easy.sh
# then see what happens if you request more than 24 hours:
sbatch --test-only --time=24:05:00 --output=output-%j --error=error-%j easy.sh
As the name of the option implies, when you are using --test-only
, your job does not
actually get scheduled.
sbatch
with the --test-only
option also gives an estimate of the time
when your job would be launched. This estimate seems to
be completely uninformative. It might tell you that it estimates your job would launch in
3 hours, but it could be much faster than that. One absolutely critical piece of information the
--test-only
option will give you is a nice big warning if you are requesting more resources than
are available to you. For example, on the shas
partition on SUMMIT, job run lengths are
capped at 24 hours, so you can’t request more than that amount of time. You can always use the --test-only
option to
ensure that you are not requesting more resources than are available.
We will note here that the --test-only
option is somewhat akin to rclone’s --dry-run
option, which lets you see what files would be copied, without actually copying them. This
sort of option is always a nice feature when you want to check that you have set your requests
up properly before committing computing or network resources to them.
8.4.2.2 Get messages from SLURM
If you want to be notified by email when a job starts and finishes (or fails or gets killed by the scheduler), you can use the --mail-user
option to supply
an email address and the --mail-type
option to tell the scheduler when it should
email you with updates regarding your job status. For example, supply your own
email address in the following (replacing yourname@yourmail.edu
) and try it:
sbatch --mail-type=ALL --mail-user=yourname@yourmail.edu --time=00:05:00 --output=output-%j --error=error-%j easy.sh
You should receive an email from the cluster when your job starts and completes.
This can be a handy feature when you are not running too many jobs. If you have a large number
of jobs, being notified
every time a job starts and finishes can quickly fill up your inbox. In those cases, consider
setting --mail-type=FAIL
to only be notified when jobs have failed.
8.4.2.3 Include options inside your job script
You might note
that the above command line is getting quite long and cumbersome, and it would surely become
very difficult to remember and type commands with even more options, every time you wanted to start a job.
On top of that, the options that you would want to invoke are typically going to be specific to the job
script that you are launching (i.e. easy.sh
in the above example). Consequently, SLURM
lets you specify the options passed to sbatch
inside the script file itself. This leads
to much simpler commands on the command line, and makes your workflows somewhat more reproducible.
To include the sbatch
options in the job script, you add them below the “shebang” line, in lines
that start with #SBATCH
before any other commands in the script.
For example, we could make a new file called easy2.sh
like this:
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --output=output-%j
#SBATCH --error=error-%j
echo "Starting at $(date)"
echo
echo "Current working directory is: $PWD"
echo "Current user is: $USER"
echo "Current hostname is: $(hostname)"
echo
echo "Currently, the PATH is set to: $PATH"
echo
# now, let's print the states of all the environment variables
# and send them to stderr:
env > /dev/stderr
sleep 180
And we could then test it like this:
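sbatch --test-only easy2.sh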
or schedule it like this:
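sbatch easy2.sh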
As you might imagine, sbatch
can take many different options.
We list the main ones that you will
be concerned with while doing bioinformatics in Table 8.1.
Option | What it does |
---|---|
--cpus-per-task=<ncpus> | Indicate that ncpus cores are needed for each task in the job. |
--chdir=<directory> | Set the working directory of the batch script to “directory” before the script is run. This is seldom used if you schedule batch scripts from the directory in which they should be run. |
--error=<filename pattern> | Path to which stderr should be written. Like the --output option, this allows for %j, %A, and %a (see below). |
--job-name=<jobname> | Assign a name to the batch job. This name then appears in, for example, squeue -u user. It should be short and descriptive. |
--mail-type=<type> | Upon which job-scheduling events for this job should email be sent to the user? Choices include NONE, BEGIN, END, FAIL, REQUEUE, ALL. |
--mail-user=<user> | Address to which notification emails should be sent. |
--mem=<size[units]> | How much memory is requested per node. Most easily specified in gigabytes with a trailing G, as in --mem=4G. |
--mem-per-cpu=<size[units]> | How much memory is requested per core. For example, --mem-per-cpu=4.6G. Only one of --mem or --mem-per-cpu should be given. |
--output=<filename pattern> | Path to which stdout should be written while executing the batch script. The path can include %j, which expands to the job ID, or, if running as a job array, %A, which expands to the job ID, and %a, which expands to the job array index. |
--partition=<partition_names> | Request an allocation of resources from a compute node in one of the partitions listed in partition_names, which is a comma-separated list of partitions. For example, --partition=shas,compute,himem. |
--time=<time> | Request an allocation of a certain amount of time, specified as days-hours:minutes:seconds or hours:minutes:seconds, for example --time=2-12:00:00 or --time=05:00:00. Jobs running longer than this time will be killed. |
--test-only | Usually given directly on the command line to verify that the job submission is compliant without actually scheduling the job. |
8.4.2.4 Override within-file sbatch directives from the command line
Since you can write options for the sbatch
command either 1) on the command
line (as in sbatch --time=00:03:00 ...
), or 2) within the script file given to
the sbatch
command (i.e., lines like #SBATCH --time=00:05:00
within
the top part of the script), you might wonder what happens when the same
option is supplied both on the command line and within the script.
It turns out that the version of the option on the command line takes precedence.
That is, any options that are given on the command line itself will
override the options if they are given within the file. For example,
if we submitted the above easy2.sh
script with the command:
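sbatch --time=01:00:00 easy2.sh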
it would be submitted with a time limit of 1 hour (01:00:00
) rather
than of 5 minutes (00:05:00
, as given within the file.)
This feature is particularly useful when you find yourself needing to re-run just certain elements of a SLURM job array, as we will see in a few sections.
8.4.2.5 The vagaries of conda within sbatch
As we noted above, the shell environment in which SLURM executes
your scheduled job inherits many important variables (like PATH and
the current working directory) from the shell that you submit the job from.
One might hope that this would be sufficient to allow you to activate
a conda environment from within your job script. This turns out not to be
the case: even though several conda
environment variables are set up
and the path to conda
is set, there are some conda
initialization
features that have to happen within your job script in order to be able to
explicitly set your conda
environment within your job script.
To see this, first make a script called conda-test1.sh
that
tries to activate a different conda
environment within the job script.
In this case, it is activating an environment called trimmo
. If you
don’t have a trimmo environment, then you should replace trimmo
with
the name of another environment that you do have.
#!/bin/bash
#SBATCH --time=00:03:00
#SBATCH --output=conda-test1-output-%j
#SBATCH --error=conda-test1-error-%j
echo "Starting at $(date)"
echo
echo "Current working directory is: $PWD"
echo "Current user is: $USER"
echo "Current hostname is: $(hostname)"
echo
echo "Currently, the PATH is set to: $PATH"
echo
echo "Activating environment bioinf"
conda activate trimmo
echo
echo "Now the PATH is: $PATH"
sleep 120
and then run that with:
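sbatch conda-test1.sh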
Make a note of the job number reported after you submit this job. That is the XXXX you will use in conda-test1-error-XXXX.
After that has run, reading the contents of the conda-test1-error-XXXX
file you should
see an error about not being able to activate a conda environment. Likewise,
if you look at the two PATHs printed out in conda-test1-output-XXXX, you will see that they are the same. So, indeed, the trimmo environment was not activated.
One way to remedy this problem is to source your ~/.bashrc
(which seems to be where Miniconda3 establishes its “conda init” block)
from within your job script.
For example, create a new file called conda-init2.sh
and fill it with
the following:
#!/bin/bash
#SBATCH --time=00:03:00
#SBATCH --output=conda-test2-output-%j
#SBATCH --error=conda-test2-error-%j
source ~/.bashrc
echo "Starting at $(date)"
echo
echo "Current working directory is: $PWD"
echo "Current user is: $USER"
echo "Current hostname is: $(hostname)"
echo
echo "Currently, the PATH is set to: $PATH"
echo
echo "Activating environment bioinf"
conda activate trimmo
echo
echo "Now the PATH is: $PATH"
echo
echo "And the environment variables look like:"
echo "================================"
env
sleep 120
Then run it with:
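sbatch conda-init2.sh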
If conda
is working properly within the job script there should be no output
to conda-test2-error-XXXX
and you should see that invoking conda activate trimmo
changed the PATH variable to include paths from the trimmo
environment.
The above points are very important if you want to use software like bwa
, samtools
,
and bcftools
, that you installed using Miniconda, from within a job script.
To repeat: to activate a conda environment named env_name
, for example, from within a job script running
under sbatch
you must add these lines:
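source ~/.bashrc
conda activate env_name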
8.4.2.6 Test a longer running bash script
Now we are going to make a job script to map a pair of trimmed fastq
files in our non-model-wgs-example-data
directory.
Make a new directory in non-model-wgs-example-data
called scripts
and then make a job script called map_s002---1.sh
inside the scripts
directory with these lines in it (modifying the
email to be your email address)
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --output=results/slurm_logs/map/map_s002---1-output-%j
#SBATCH --error=results/slurm_logs/map/map_s002---1-error-%j
#SBATCH --mail-type=ALL
#SBATCH --mail-user=yourname@yourmail.edu
source ~/.bash_profile # or ~/.bashrc if that is how you are set up
conda activate bwasam
mkdir -p results/bam # a directory in which to put the output from stdout
mkdir -p results/logs/bwa # a directory in which to put the output from stderr
mkdir -p results/logs/samtools-sort
SU=s002---1
RG='@RG\tID:s002_T199968_Lib-1_HY75HDSX2_1_CAGGTTCA+GGCGTTAT\tSM:T199968\tPL:ILLUMINA\tLB:Lib-1\tPU:HY75HDSX2.1.CAGGTTCA+GGCGTTAT'
PREFIX=results/trimmed/$SU
OUT=results/bam/$SU.bam
BWALOG=results/logs/bwa/$SU.log
SAMSORTLOG=results/logs/samtools-sort/$SU.log
bwa mem \
-R $RG \
resources/genome.fasta \
${PREFIX}_R1.fq.gz ${PREFIX}_R2.fq.gz 2> $BWALOG | \
samtools view -u - | \
samtools sort -l 9 -o $OUT - 2> $SAMSORTLOG
And schedule it to run (from within the non-model-wgs-example-data
directory) with:
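# make the directory for the slurm log files first, then submit the job
mkdir -p results/slurm_logs/map
sbatch scripts/map_s002---1.sh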
Note that we make the directory results/slurm_logs/map
to put the slurm output into, but most of
the program output should go to the output files and the logs.
While this job is running it is worthwhile to reflect on how crazy it would be to
prepare a separate script for each of the pairs of FASTQ files in our
non-model-wgs-example-data
data set. Doing so would mean a lot of copying and pasting
followed by painstaking revision of the read group strings in the files, etc.
There would be a lot of nearly redundant scripts and it would even be a hassle to
manually launch each one. Obviously, using a little programming and some variable
names, it would be better to have just a single script that could be applied, with
minor adjustments that happen at run-time, to all the pairs of FASTQ files.
The next section shows one powerful way that the SLURM scheduler makes such automation of repetitive tasks a little easier: job arrays.
8.4.3 SLURM Job Arrays
Very often in bioinformatics, large jobs can be broken down into a series of smaller tasks. As an example, aligning sequencing read data from an individual sample to a reference genome can always be done independently of the sequences from any other individuals. Therefore a job that demands that sequencing data from 96 different individuals be aligned to a reference genome can be broken into 96 different tasks: each task involves mapping the sequencing data of just a single individual to the reference genome.
The bash shell for
loop provides a way to cycle over different repetitive tasks.
In the case of alignment over multiple individuals we might use a for
loop
in a shell script in the following fashion, to repeatedly run a script
called align.sh
on the paired-end read from each of four different individuals:
Such an approach will work OK on an HPCC if each task (aligning a single individual, in this case) takes just a short amount of time. But, what if aligning each individual takes almost a day of compute time, and you are only authorized to run jobs less than 24 hours on your HPCC? What is needed in that event is a convenient way to break jobs down into separate tasks, such that each task is run as its own separate job on the HPCC.
SLURM provides a feature known as job arrays for handling just this type of use case. When you submit a job array, you are, in effect, submitting to SLURM a recipe that is telling the scheduler that you are submitting some number of instances of a particular job. As resources become available, each new instance of that job starts to run. The different “instances” can be different replicates of a complex simulation, or analyses done on different instances of a data set. Our alignment example falls into the latter case: each instance of the job would be aligning the reads of a different individual.
SLURM job arrays are submitted by using the --array=<array_spec>
option to sbatch
.
Each task of a job array is associated with a whole number, and the
<array_spec>
argument to the --array
option tells which numbers (and, hence,
which tasks) should be scheduled to run. For example, if you submit a script
named mytasks.sh
to sbatch
,
and the script includes in the SLURM preamble a line that says:
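#SBATCH --array=1-10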
then SLURM will schedule ten different jobs, indexed from 1 up to 10. That is all
well and good, but, in order to be useful, those 10 different jobs had best do
different things (for example, operate on 10 different data sets). The user must
see to it that each task of a job array performs the proper instance of a
job, and this is done using the shell environment variable SLURM_ARRAY_TASK_ID
.
When the first of the 10 scheduled tasks in the above job array is launched,
the environment variable SLURM_ARRAY_TASK_ID
is set to the value 1, and then
the contents of mytasks.sh
are executed. When the second task is launched,
SLURM_ARRAY_TASK_ID
is set to 2, and then mytasks.sh
is run, and so forth,
until the tenth job is launched with SLURM_ARRAY_TASK_ID
set to 10. Therefore, inside
the mytasks.sh
script, the user can use the value of SLURM_ARRAY_TASK_ID
(accessed
via variable substitution with a dollar sign, e.g., $SLURM_ARRAY_TASK_ID
) to
direct the script to perform the proper instance of the job that is to be done.
Thus, we could write our alignment example with four individuals, above, as a job array with a script file like this:
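#!/bin/bash
#SBATCH --array=1-4

# a sketch: the array task ID picks out which individual to align
# (the Ind_* file names, as above, are hypothetical)
./align.sh Ind_${SLURM_ARRAY_TASK_ID}_R1.fq.gz Ind_${SLURM_ARRAY_TASK_ID}_R2.fq.gz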
Submitting that job with sbatch
will submit four different tasks as part of
a job array, each task being the alignment of sequencing reads from one of
four different individuals.
8.4.3.1 Naming SLURM job array tasks
Internally, when you submit a job array to SLURM, it creates a job ID for the job array and then it reserves specific job numbers for the tasks within the array. However, you will find it easiest to refer to the separate tasks in a SLURM job array using the syntax ArrayJobNumber_ArrayTaskNumber. In other words, if we submitted the job array script listed above for aligning the reads from four individuals, and SLURM told us that the job array had been assigned the job number 6677, then you can refer to the different tasks as 6677_1 for the job aligning reads from Ind_1, 6677_2 for the job aligning reads from Ind_2, and so forth.
When using the --output or --error options to direct stdout and stderr to files during execution of the scripts submitted using the --array option to sbatch, you can name those files using %A, which expands to the overall job array number, and %a, which expands to the SLURM array task ID. So, for example, a line like #SBATCH --output=my_output_%A_%a would be useful.
8.4.3.2 Variations on the <array_spec>
In the simplest case, you will want to evaluate tasks with indexes that proceed from 1 up to some stopping number, stepping by 1 each time. However, you can exert rather finely tuned control over which array task IDs you would like to be executed. The important thing to remember with all these variations is that no spaces are allowed!
As we have seen already, a dash is used to represent a range of numbers, such as --array=1-10, which requests task indexes 1 through 10.
To change the step size, you can immediately follow the second number of the range with a colon, :, and give the step size. For example, to cycle over jobs with array indexes 1, 4, 7, and 10, you could use --array=1-10:3.
If you wanted to pick out specific single indexes, you can combine single values and ranges with commas. For example, using just single values interspersed with commas, you could give something like --array=1,3,6,9.
You can even combine ranges with step sizes with commas. For example, --array=1-21:10,100-200:50 would run indexes 1, 11, 21 and 100, 150, 200.
Being able to pick out single array index values to be done is wonderfully convenient for redoing tasks that might have failed. For example, suppose you ran a job with 500 tasks, and upon investigating the job logs you found that array tasks 47, 368, 477, and 489 had failed because insufficient time was allocated to them. In that case you could update the --time setting of your script and then re-run just those four jobs by including --array=47,368,477,489.
Be warned, however, that SLURM does not check your <array_spec> to ensure that each task index is given only once. If you specify that the same index should be run twice, as is the case for 4 in a specification like --array=1-5,4, then that task will be run twice, and possibly those two jobs will run at the same time, so that output from each job will intermittently overwrite output from the other job, leading to corrupted results. Just be careful not to request the same task index more than once!
Finally, if you want to be friendly to the other users of your cluster and want no more than X of your array tasks to be running at any time, you can stipulate that by ending the <array_spec> with %X. For example, to have no more than 5 array tasks running at any one time, you could use something like --array=1-500%5.
8.4.3.3 Translating Array Indexes to Job Instances
The user is left to translate what a job array index of, say, 7, means in terms of what actions that array task should take. Quite often you will want to map an array index to a different file to analyze, or perhaps a different region of a chromosome to do variant calling on, etc. A flexible and generic way of doing this mapping from array indexes to job specifics is to first define the variables (things like filenames, etc.) required by each array task in a simple TAB-delimited text file in which the first row holds the names of the variables in different TAB-separated columns, and each row below that holds the values that those variables should take for different values of the array index. The array index itself should be listed in the first column.
For example, if we have 10 different files to process, each with different read-group values (Section 19.2), we can specify the filenames and elements of the read group values in a TAB-delimited text file called files-and-groups.tsv that looks like this:
index file_prefix LB Flowcell Lane
1 DPCh_plate1_A01_S1_L1_R Lib-1 HTYYCBBXX 1
2 DPCh_plate1_A01_S1_L2_R Lib-1 HTYYCBBXX 2
3 DPCh_plate1_A01_S1_L3_R Lib-1 HTYYCBBXX 3
4 DPCh_plate1_A01_S1_L4_R Lib-1 HTYYCBBXX 4
5 DPCh_plate1_A01_S1_L5_R Lib-1 HTYYCBBXX 5
6 DPCh_plate1_A01_S1_L6_R Lib-1 HTYYCBBXX 6
7 DPCh_plate1_A01_S1_L7_R Lib-1 HTYYCBBXX 7
8 DPCh_plate1_A01_S1_L8_R Lib-1 HTYYCBBXX 8
9 DPCh_plate1_A02_S2_L1_R Lib-1 HTYYCBBXX 1
10 DPCh_plate1_A02_S2_L2_R Lib-1 HTYYCBBXX 2
Then, when executing a job array script, this file can be used to convert the SLURM_ARRAY_TASK_ID into the necessary file names and other variables, with a simple awk script and variable assignments within your job script that look like this:
# make a command line that holds assignments of values to shell variables
ASSIGN=$(awk -F"\t" -v N=$SLURM_ARRAY_TASK_ID '
NR==1 {for(i=1;i<=NF;i++) vars[i]=$i; next}
$1 == N {for(i=1;i<=NF;i++) printf("%s=%s; ", vars[i], $i)}
' files-and-groups.tsv)
# make those assignments
eval $ASSIGN
Then, when SLURM_ARRAY_TASK_ID is equal to 7, for example, the variable ASSIGN gets the string value:
index=7; file_prefix=DPCh_plate1_A01_S1_L7_R; LB=Lib-1; Flowcell=HTYYCBBXX; Lane=7;
and then evaluating that string with eval $ASSIGN will define for you the values of the shell variables index, file_prefix, LB, Flowcell, and Lane, as indicated above. These shell variables may be used in subsequent code in your script (via variable substitution with the $, like $file_prefix) to ensure that the proper files are processed, etc., for each different value of SLURM_ARRAY_TASK_ID.
8.4.3.4 SLURM Job Array hands-on
Here, you can run some small scripts to get your feet wet with defining SLURM job arrays. Create a directory in your scratch directory named job-array-fun and then cd into that directory. In that directory, save the following lines to a script named test-arrays.sh and then submit it via sbatch test-arrays.sh. It should produce three output files.
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --output=my_output_%A_%a
#SBATCH --array=1-3
echo "The SLURM_ARRAY_JOB_ID is : $SLURM_ARRAY_JOB_ID"
echo "The SLURM_ARRAY_TASK_ID is: $SLURM_ARRAY_TASK_ID"
echo "The SLURM_JOB_ID is: $SLURM_JOB_ID"
echo
echo "You can refer to this individual SLURM array task as: ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
echo
sleep 180
Once this has been submitted, use your myjobs alias to see how it looks when it is queued. It should look something like this:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4499013_[1-3] shas test-arr eriq@col PD 0:00 1 (Priority)
If it is taking a while to start this job, and you don’t have a myjobs alias, put one in your ~/.bashrc by adding something like the line sketched below.
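The exact alias is up to you, but a minimal version simply restricts squeue to your own jobs:
# list only my own jobs in the queue
alias myjobs='squeue -u username'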
where you have to change username to be your login name on the HPCC. Note that, for Colorado State University affiliates on SUMMIT, that username is of the form CSUeID@colostate.
If you would like to see each one of the tasks in a job array listed on its own line, you can use the -r, --array option to squeue. In terms of a myjobs alias, that means issuing myjobs -r or myjobs --array (both are equivalent):
% myjobs -r
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4499052_1 shas test-arr eriq@col PD 0:00 1 (Priority)
4499052_2 shas test-arr eriq@col PD 0:00 1 (Priority)
4499052_3 shas test-arr eriq@col PD 0:00 1 (Priority)
I want people to look closely at the variables SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID, which get written to stdout by each job into the files my_output_%A_%a (where %A and %a expand to the job array ID and the job array task ID, respectively), and to understand how those are different from SLURM_JOB_ID in this context. We will discuss that in class.
Unfortunately, it seems like it can take a long time for that to start on SUMMIT. So, here are the results:
==> my_output_27234_1 <==
The SLURM_ARRAY_JOB_ID is : 27234
The SLURM_ARRAY_TASK_ID is: 1
The SLURM_JOB_ID is: 27235
You can refer to this individual SLURM array task as: 27234_1
==> my_output_27234_2 <==
The SLURM_ARRAY_JOB_ID is : 27234
The SLURM_ARRAY_TASK_ID is: 2
The SLURM_JOB_ID is: 27236
You can refer to this individual SLURM array task as: 27234_2
==> my_output_27234_3 <==
The SLURM_ARRAY_JOB_ID is : 27234
The SLURM_ARRAY_TASK_ID is: 3
The SLURM_JOB_ID is: 27234
You can refer to this individual SLURM array task as: 27234_3
Next, we want to explore the possibility of overriding the #SBATCH --array option given internally in test-arrays.sh with one given on the command line. Thus, try submitting the job as sketched below.
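Command-line options to sbatch take precedence over the corresponding #SBATCH lines inside the script; the particular range here is just an example:
# the --array on the command line overrides the "#SBATCH --array=1-3" inside the script
sbatch --array=4-6 test-arrays.sh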
When that job has finished, go ahead and look at the output files and make sure it is clear how the SLURM_JOB_IDs and the SLURM_ARRAY_JOB_IDs and the SLURM_ARRAY_TASK_IDs vary.
8.4.3.5 More on converting Task IDs to variables from a tab delimited file
We want to return, for a moment, to the useful task of assigning shell variables from a line of values in a TAB-separated-values file, as was done in Section 8.4.3.3. This can be incredibly useful, so we will be a little more formal about it.
8.4.3.5.1 Make a numbered-units.tsv file
First, we will describe how to make a file called numbered-units.tsv from the file units.tsv that we have in our non-model-wgs-example-data directory. This file is just like units.tsv, but the first column is called index, and it holds the row number of each row in the file. We do this with awk. To learn more about awk, see Section 6.1.
# Do this within your non-model-wgs-example-data directory
awk 'BEGIN {OFS="\t"} NR==1 {print "index", $0; next} {print ++n, $0}' units.tsv > numbered-units.tsv
Now, the first few columns of file numbered-units.tsv
look like this:
index sample unit library flowcell platform lane sample_id barcode
1 s001 1 Lib-1 HY75HDSX2 ILLUMINA 1 T199967 AAGACCGT+CAATCGAC
2 s002 1 Lib-1 HY75HDSX2 ILLUMINA 1 T199968 CAGGTTCA+GGCGTTAT
3 s002 2 Lib-1 HY75HDSX2 ILLUMINA 2 T199968 CAGGTTCA+GGCGTTAT
4 s003 1 Lib-1 HY75HDSX2 ILLUMINA 2 T199969 TTACCGAC+CGTATTCG
5 s003 2 Lib-1 HY75HDSX2 ILLUMINA 3 T199969 TTACCGAC+CGTATTCG
6 s003 3 Lib-1 HTYYCBBXX ILLUMINA 2 T199969 TTACCGAC+CGTATTCG
7 s004 1 Lib-1 HY75HDSX2 ILLUMINA 3 T199970 TCGTCTGA+TCAAGGAC
8 s004 2 Lib-2 HY75HDSX2 ILLUMINA 4 T199970 AGTGACCT+CTCCTAGA
9 s005 1 Lib-1 HY75HDSX2 ILLUMINA 2 T199971 TTCCAGGT+AAGCACTG
10 s005 2 Lib-1 HY75HDSX2 ILLUMINA 4 T199971 TTCCAGGT+AAGCACTG
11 s005 3 Lib-3 HTYYCBBXX ILLUMINA 2 T199971 AGCCTATC+GTTACGCA
12 s006 1 Lib-1 HY75HDSX2 ILLUMINA 1 T199972 TACGGTCT+GCAATGGA
13 s006 2 Lib-1 HTYYCBBXX ILLUMINA 3 T199972 TACGGTCT+GCAATGGA
14 s006 3 Lib-2 HY75HDSX2 ILLUMINA 2 T199972 TCATCTCC+CTAGCAAG
15 s006 4 Lib-2 HTYYCBBXX ILLUMINA 4 T199972 TCATCTCC+CTAGCAAG
16 s007 1 Lib-1 HY75HDSX2 ILLUMINA 2 T199973 TAGGAGCT+GTTAAGGC
17 s007 2 Lib-1 HY75HDSX2 ILLUMINA 3 T199973 TAGGAGCT+GTTAAGGC
18 s007 3 Lib-2 HY75HDSX2 ILLUMINA 2 T199973 CCAGTATC+ATCTCGCT
19 s007 4 Lib-2 HY75HDSX2 ILLUMINA 3 T199973 CCAGTATC+ATCTCGCT
20 s008 1 Lib-1 HY75HDSX2 ILLUMINA 1 T199974 TACTCCAG+CCTATACC
21 s008 2 Lib-2 HY75HDSX2 ILLUMINA 1 T199974 TTGCGAGA+GTGCCATA
22 s008 3 Lib-3 HY75HDSX2 ILLUMINA 1 T199974 GAACGAAG+GGTGATTC
Having the row number in the first column is helpful if we want to pick out particular rows in the file.
8.4.3.5.2 Make a line_assign.sh script and put it in your PATH
Next, we are going to wrap the awk code we used in section 8.4.3.3 into a little shell script that we put in our PATH so that we can use it from any directory we might be in. Create a directory called bin in your home directory, and then open a file called line_assign.sh in that bin directory using nano, as shown below.
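Commands along these lines (assuming you want the script at ~/bin/line_assign.sh) will do it:
mkdir -p ~/bin                # create the bin directory if it does not already exist
nano ~/bin/line_assign.sh     # open (and thereby create) the script in nano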
Copy the following script into nano, then save it and exit nano.
#!/bin/bash
# simple script. Give it a TAB delimited file $2 with the first column holding
# indexes. This will pick out the line that matches $1, and return a command
# line assigning the column header names as shell variables whose values are the
# corresponding entries in that row of the file $2.
if [ $# -ne 2 ]; then
echo "Wrong number of arguments in $0
Expecting two arguments, with syntax like this:
line_assign.sh IndexValue TSV-file
Where
IndexValue is the number in the first column of the row you want to
extract values from.
TSV-file is a tab-separated-values file with the first column named
index, and the other columns named according to shell variables whose
values you want to set to the values in the row IndexValue
Typical usage:
eval \$(line_assign.sh 7 numbered-units.tsv)
That will set, in the current shell, a series of shell variables whose
names are the column names to values in the
row IndexValue
" > /dev/stderr
exit 1
fi
awk -F"\t" -v LINE=$1 '
$1 == "index" {for(i=1; i<=NF; i++) vars[i]=$i; next}
$1 == LINE {for(i=1; i<=NF; i++) printf("%s=\"%s\"; ", vars[i], $i)}
' $2
Once you have saved and exited that file, you have to use chmod to make sure that it is executable.
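A command like the following (adjust the path if you put the script somewhere else) does the trick:
chmod u+x ~/bin/line_assign.sh    # give yourself execute permission on the script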
Finally, we just have to make sure that your shell always knows to look in ~/bin for shell scripts. So, edit your ~/.bashrc file with nano (nano ~/.bashrc) and add the line shown below above the # >>> conda initialize >>> block.
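The essential point is to add ~/bin to your PATH; one common way of writing that line is:
export PATH=$PATH:~/bin    # let the shell find scripts kept in ~/bin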
When you have made that change to your ~/.bashrc file, be sure to run source ~/.bashrc to make sure the change is applied to your current shell. (NOTE: you only need to do this to get the changes in currently open terminals. When you log in fresh, your full ~/.bashrc contents will be loaded.)
After this, you should be able to type line_assign.sh (with no arguments) from any directory on your system, and get the usage message about how to use the line_assign.sh script.
Finally, in your non-model-wgs-example-data directory, run the script, for example as line_assign.sh 10 numbered-units.tsv, to see the command lines it returns.
See how that is a series of commands separated by semicolons. Each one is an assignment of a value to a shell variable.
Recalling what we learned about how to substitute the output of commands back on the command line with the $(comm) syntax (Section 5.4) and how to use the eval keyword to ensure the shell evaluates substituted values (Section 5.3.3), note that you can do this:
eval $(line_assign.sh 10 numbered-units.tsv)
and, all of a sudden, you will have shell variables named index, sample, unit, library, flowcell, platform, lane, sample_id, barcode, fq1, fq2, kb1, and kb2, and their values are all of the values in row index 10 of the file numbered-units.tsv.
Try it out by looking at the values of some of these. Evaluate all of the following:
echo "
sample will expand to: $sample
unit will expand to: $unit
library will expand to: $library
sample_id will expand to: $sample_id
"
Yes! This is just what we need to prepare a killer SLURM job array
to map all of our trimmed reads in non-model-wgs-example-data/results/trimmed
to our genome.
Let’s proceed to Section 27.6.3.
8.5 Keeping track of how many resources were used or are being used
8.5.1 Checking out resource use by currently running jobs
If you want to get more information about a currently running job you can use:
scontrol show job <job_id>
where <job_id> is the SLURM job ID.
This can be helpful if you see someone using all the cores on a node, and you want to see what they are doing. You can log in via ssh nodeXX (or whatever the node name is) and then run top or ps -ax to find their job. You might find that someone (or even you yourself) has checked out 20 cores, but their job only ends up using one of them. So, this can be helpful for guiding yourself and others.
8.5.2 How efficient were you with allocated cores and memory?
After a job has finished, some SLURM environments have the seff command, which stands for SLURM efficiency. It gives a nice summary of how much CPU use you made as a function of how much you were allocated, and it does the same with memory. The syntax is seff <job_id>. For example, seff 474963 gives, in my case, output that looks like:
Job ID: 474963
Cluster: sedna
User/Group: eanderson/eanderson
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:10:30
CPU Efficiency: 99.21% of 00:10:35 core-walltime
Job Wall-clock time: 00:10:35
Memory Utilized: 443.09 MB
Memory Efficiency: 9.63% of 4.49 GB
8.6 Last year’s stuff. Now somewhat defunct
Finally, we want to contemplate strategies for turning the SLURM_ARRAY_TASK_ID into “directions” for the script about what to do (which files to process, etc.). We discussed one approach, above, using a TAB-delimited file, but want to note here that, depending on how your actual script is configured, any number of approaches might be suitable.
For an example germane to doing alignment to a reference genome in our homework chr-32-bioinformatics, consider that you have 1280 pairs of files to process, and that you will launch jobs to process 128 such pairs at a time. A script called run-single-job.sh that mapped the first 128 pairs of files looked like this:
#!/bin/bash
#SBATCH --time=05:00:00
#SBATCH --output=run-single-out-%j
#SBATCH --error=run-single-error-%j
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@email.edu
./map-N-files-from-K.sh 128 1
If you wanted to map the next 128 pairs of files, the final line in the above script should look like ./map-N-files-from-K.sh 128 129; to do the next 128 pairs of files after that, the line should look like ./map-N-files-from-K.sh 128 257; and so forth.
So, if everyone has already run the first 128 pairs of files (as everyone should have finished for the homework) I contend that, if we wanted to modify this script so that the remaining 1152 pairs of files could be run as a job array, our new script could look like this:
#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --output=run-array-out-%A_%a
#SBATCH --error=run-array-error-%A_%a
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@email.edu
#SBATCH --array=1-9
./map-N-files-from-K.sh 128 $((SLURM_ARRAY_TASK_ID * 128 + 1))
Save that file as run-array-job.sh. Note the changes:
- output and error files modified to use %A_%a
- Added the --array option running from 1 to 9.
- Changed the starting index from 1 to $((SLURM_ARRAY_TASK_ID * 128 + 1))
The last one deserves some explanation: it uses the integer arithmetic capabilities of the bash shell (Section 5.3.6).
So, what is that last line doing? You should check it for yourself. Doing so will emphasize an important point when it comes to testing your own job-array scripts: if you are testing the code in them interactively, you can simply assign any value you want to a variable named SLURM_ARRAY_TASK_ID and verify the results are what you would like them to be. So, try the following sort of thing.
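For instance (the value 3 here is arbitrary; any task index will do), set the variable by hand and echo the command that would be run:
SLURM_ARRAY_TASK_ID=3
echo ./map-N-files-from-K.sh 128 $((SLURM_ARRAY_TASK_ID * 128 + 1))    # prints: ./map-N-files-from-K.sh 128 385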
And to check them all, including when SLURM_ARRAY_TASK_ID=0 (which is the case that you already did for your homework, running it as a single job), try this:
for SLURM_ARRAY_TASK_ID in {0..9}; do
echo ./map-N-files-from-K.sh 128 $((SLURM_ARRAY_TASK_ID * 128 + 1))
done
Once you have convinced yourself that your run-array-job.sh
script will
work correctly, and it is set for an array=1-9, and you have configured
your email properly within it, then you should copy it to your
chr-32-bioinformatics
directory and submit it as a job to finish aligning
the reads from all the remaining individuals, so we will have them for
variant calling down the road.
THAT LAST STEP ABOVE IS PART OF YOUR HOMEWORK FOR NEXT WEEK THAT WILL BE DUE NEXT TUESDAY
8.7 PREPARATION INTERLUDE: An in-class exercise to make sure everything is configured correctly
Here, we run through a number of steps to start aligning sequence data to a reference genome. The main purpose of this exercise is to ensure that everyone has their systems configured correctly. Namely, we want to make sure that:
- You can log in to your cluster (SUMMIT, Hummingbird, etc.)
- You can use a few SLURM (job scheduler) commands.
- You can find your way to your scratch space on SUMMIT (or Hummingbird). scratch is fast, largely unlimited short-term storage. It is where you will do the majority of your bioinformatics.
- You can use rclone to transfer files from Google Drive to your cluster. Doing things this way will allow you to access your Lab’s Team Drive. The model for bioinformatic work is:
  - Copy data from your Lab’s Google Team Drive to scratch on the cluster.
  - Do analyses of the data (in scratch).
  - Copy the results back to your Lab’s Google Team Drive before it gets deleted (by the system administrators) off of scratch. Note: today you are not using your lab’s Team Drive, you are using the “class” shared drive, and you won’t be copying anything back to it. But this is a start at least…
- You have Miniconda installed, you have a conda environment with bioinformatic tools in it, and you can activate that environment and use the tools within it. Miniconda makes it easy to install software that you need to run analyses.
Being able to successfully do all these things will be necessary to complete the next homework assignment which involves aligning short reads to a reference genome and will be assigned after spring break.
- If you are not already in a tmux session, start a new tmux session called inclass on the login node.
- Get a bash shell on a compute node with sinteractive.
- Check what host you are on with hostname:
hostname   # The hostname should be something like shasXXX.
           # Do not proceed if you are on a login node still (ie. hostname is loginXX)
If you are still on a login node, try running sinteractive again.
- Use your alias myjobs (if you have created it) to see what jobs you have running.
- Navigate to your scratch directory using cd. On SUMMIT that will look like this:
cd /scratch/summit/wcfunk\@colostate.edu/
You can make a symbolic link to that in your home directory like this:
cd    # returns you to your home directory
# this line makes a symbolic link called scratch in your home directory
# that points to your scratch space at /scratch/summit/wcfunk\@colostate.edu/
ln -s /scratch/summit/wcfunk\@colostate.edu/ scratch
Just be sure to use your own user name instead of wcfunk. Once that symbolic link is set up, you can cd into it (cd ~/scratch), just like it was your scratch directory. But if you don’t want to set up a symbolic link, just use the full path. On Hummingbird you go to the corresponding scratch directory, whose path includes your UCSC username; if you want to make a symbolic link to that, then while in your home directory you can use the same sort of ln -s command.
- Within your account’s scratch space, make a directory called chinook-play and cd into it:
mkdir chinook-play
cd chinook-play
- Make sure you have followed the link in your email to the shared google drive folder I sent you last night. If the email went to an email address that is not associated with your rclone configuration for google drive, then request access (at google drive) for that email. (Or set up a new rclone config for your other gmail address…)
- Now, use rclone to list the files in the folder I shared with you, with something like:
rclone ls --drive-shared-with-me gdrive-pers:CSU-con-gen-2020
Note that gdrive-pers above is the name of my rclone configuration associated with the gmail account where I got an email (from myself at my CSU account) about the shared folder. You will have to change gdrive-pers to be appropriate to your config, depending on how you set it up. Note also how you access folders shared with you using --drive-shared-with-me.
- After you get the above command to work, you should see something like the following, though it will change as more things get added to this directory…
- Use rclone to download big-fastq-play.zip (about 170 MB). Once again you have to change the name of the rclone remote (gdrive-pers in this case) to the name of your Google Drive rclone remote.
- Unzip the file you just downloaded. Note: on Hummingbird, unzip is not installed by default, so I recommend doing conda install unzip to add it to your base Miniconda environment. Unzipping creates a directory called big-fastq-play within your chinook-play directory.
- Create a directory called fastq and move the two paired-end FASTQ files from big-fastq-play into it.
- Download the Chinook salmon genome into a directory called genome:
mkdir genome
cd genome
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_other/Oncorhynchus_tshawytscha/all_assembly_versions/GCA_002872995.1_Otsh_v1.0/GCA_002872995.1_Otsh_v1.0_genomic.fna.gz
That dude is 760 Mb, but it goes really quickly (at least on SUMMIT. Hummingbird seems to be a different story…)
- Activate your bioinf conda environment (conda activate bioinf) to be able to use bwa.
- Index the genome with bwa in order to align reads to it, with something like bwa index GCA_002872995.1_Otsh_v1.0_genomic.fna.gz. This will take about 10 to 15 minutes. During that time, do this:
  - If you are doing this within a tmux session, create a new tmux window (cntrl-b c). Note that this shell opens on the login node (the one that you started your tmux session on).
  - Rename this new tmux window “browse” (cntrl-b ,).
  - Navigate to ~/scratch/chinook-play/genome/ and list the files there. These new files are being created by bwa.
  - Switch back to your original tmux window.
- If bwa is still running, ask me some questions. Or maybe I will tell you more about how conda works…
- If bwa is still running after that, ask me about nested tmux sessions.
- Truth be told, indexing this thing can take a long time (30 to 40 minutes on a single core, perhaps). So, instead, here is what we will do:
  - Kill the current bwa job by issuing cntrl-c in the shell where it is running.
  - Use rclone to get the genome and a bwa index for it from my Google Drive. We want to put the result in the chinook-play directory, inside a directory called chinook-genome-idx. NOTE: Again, you will likely have to use your name for your rclone remote, rather than gdrive-pers.
cd ../   # move up to the chinook-play directory
rclone copy -P --drive-shared-with-me gdrive-pers:CSU-con-gen-2020/pre-indexed-chinook-genome chinook-genome-idx
That thing is 4.6 Gb, but can take less than a minute to transfer. (Let’s see if things slow down with everyone running this command at the same time.)
- Now when you do ls you should see the directories you have made so far, including the new chinook-genome-idx.
- For a final hurrah, we will start aligning those fastqs to the chinook genome. We will put the result in a directory we create called sam. To make it clearer what we are doing, we will define some shell variables:
mkdir sam
GENOME=chinook-genome-idx/GCA_002872995.1_Otsh_v1.0_genomic.fna.gz
FQB=fastq/Battle_Creek_01_chinook_R
SAMOUT=$(basename $FQB).sam

# for fun, try echoing each of those variables and make sure they look right.
# for example:
echo $FQB
# ...

# then, launch bwa mem to map those reads to the genome.
# the syntax is:
# Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]
bwa mem $GENOME ${FQB}1.fq.gz ${FQB}2.fq.gz > sam/$SAMOUT  2>sam/$SAMOUT.error
- Once that has started, it is hard to know anything is happening, because we have redirected both stdout and stderr to files. So, do what you did before: use tmux to go to another shell and then use tail or less to look at the output files, which are in sam/Battle_Creek_01_chinook_R.sam and sam/Battle_Creek_01_chinook_R.sam.error within the chinook-play directory in scratch. You might even count how many lines of the alignment file have been formed, for example with wc -l sam/Battle_Creek_01_chinook_R.sam.
- Note that, in practice, we would not usually save the .sam file to disk. Rather, we would convert it into a compressed .bam file right as it comes off of bwa mem, and maybe even sort it as it is coming off…
- Note that this process might not finish before the end of class. That is not a huge problem if you are running the job within tmux. You can logout and it will still keep running.
However, as said before, most of the time you will run batch jobs rather than interactive jobs. I just wanted to first give everyone a chance to step through these tasks in an interactive shell on a compute node because that is crucial when developing your scripts, and we wanted to make sure that everyone has rclone, and miniconda installed and working.
8.8 More Boneyard…
Here is some nice stuff for summarizing all the information from the different runs from the chinook-wgs project:
Explain scratch space and how clusters are configured with respect to storage, etc.
Strategies—break names up with consistent characters:
- dashes within population names
- underscores for different groups of chromosomes
- periods for catenating pairs of pops
etc. Basically, it just makes it much easier to split things up when the time comes.
8.11 Compiling programs without admin privileges
Inevitably you will want to use a piece of software that is not available as a module or is not otherwise installed on the system. Typically these software programs have a frightful web of dependencies. Unix/Linux distros typically maintain all these dependencies as libraries or packages that can be installed using rpm or yum. However, the simple “plug-and-play” approach to using these programs requires having administrator privileges so that the software can be installed in one of the (typically protected) paths in the root (like /usr/bin).
But, you can use these programs to install packages into your home directory. Once you have done that, you need to let your system know where to look for these packages when it needs them (i.e., when running a program, or linking to it whilst compiling up a program that uses it as a dependency).
Hoffman2 runs CentOS. Turns out that CentOS uses yum
as a package manager.
Let’s see if we can install llvm using yum.
yum search all llvm # <- this got me to devtoolset-7-all.x86_64 : Package shipping all available toolsets.
# a little web searching made it look like llvm-toolset-7-5.0.1-4.el7.x86_64.rpm or devtoolset-7-llvm-7.0-5.el7.x86_64.rpm
# might be what we want. The first is a dependency of the second...
mkdir ~/centos
Was using instructions at https://stackoverflow.com/questions/36651091/how-to-install-packages-in-linux-centos-without-root-user-with-automatic-depen
Couldn’t get yum downloader to download any packages. The whole thing looked like it was going to be a mess, so I thought I would try with miniconda.
I installed miniconda (python 2.7 version) into /u/nobackup/kruegg/eriq/programs/miniconda/
and then did this:
# probably could have listed them all at once, but wanted to watch them go
# one at a time...
conda install numpy
conda install scipy
conda install pandas
conda install numba
# those all ran great.
conda install pysnptools
# that one didn't find a match, but I found on the web that I should try:
conda install -c bioconda pysnptools
# that worked!
Also we want to touch briefly on LD_LIBRARY_PATH (linking failures—and note that libraries are often named libxxx.a) and CPATH (for failure to find xxxx.h), etc.
8.12 Job arrays
Quick note: Redefine IFS to break on TABs so you can have full commands in there. This is super useful for parsing job-array COMMLINES files.
IFS=$'\t\n'; BOP=($(echo boing | awk '{printf("first\tsecond\tthird that is long\tfourth\n");}')); IFS=$' \t\n';
Definitely mention the eval
keyword in bash for when you want to print
command lines with redirects.
Show the routine for it, and develop a good approach to efficiently orchestrating redos. If you know the taskIDs of the ones that failed then it is pretty easy to write an awk script that picks out the commands and puts them in a new file. Actually, it is probably better to just cycle over the numbers and use the -t option to launch each. Then there is no changing the job-ids file.
In fact, I am starting to think that the -t option is better than putting it into the file.
Question: if you give something on the command line, does that override the directive in the header of the file? If so, then you don’t even need to change the file. Note that using the qsub command line options instead of the directives really opens up a lot of possibilities for writing useful scripts that are flexible.
Also use short names for the jobs and have a system for naming the redos (append numbers so you know which round it is, too); possibly base the name on the ways things failed the first time, like fsttf1 = “Fst run for things that failed due to time limits, 1”. Or structure things so that redos can just be done by invoking it with -t and the jobid.
8.13 Writing stdout and stderr to files
This is always good to do. Note that stdbuf is super useful here so that things don’t get buffered super long. (PCAngsd doesn’t seem to write anything till the end…)
8.14 Breaking stuff down
It is probably worth talking about how problems can be broken down into smaller ones. Maybe give an example, and then say that we will be talking about this for every step of the way in bioinformatic pipelines.
One thing to note—sometimes processes go awry for one reason or another. When things are in smaller chunks it is not such a huge investment to re-run it. (Unlike stuff that runs for two weeks before you realize that it ain’t working right).