Chapter 8 High Performance Computing Clusters (HPCCs)

One interesting aspect of modern next generation sequencing data is simply its sheer size: it is not uncommon to receive a half or a full terabyte of data from a sequencing center. It is impractical to store this much data on your laptop, let alone analyze it there. Crunching through such a quantity of data on a single computer or server could take a very long time. Instead, you will likely break up the analysis of such data into a number of smaller jobs, and send them off to run on an assortment of different computers in a High Performance Computing Cluster (HPCC).

Thus, even if you have immersed yourself in bioinformatic data file formats and honed your skills at shell programming and accessing remote computers, sequence data analysis remains such a formidable foe that there is still one last key area of computing in which you must be fluent, in order to comfortably do bioinformatics: you must understand how to submit and manage jobs sent to an HPCC.

My first experience with HPCCs occurred when I started analyzing high-throughput sequencer output. I had over 15 years of experience in shell programming at that time, and I was given some example analysis scripts to emulate, but I still found it took several weeks before I was moderately comfortable in an HPCC environment. Most HPCCs have some sort of tutorial web pages that provide a little bit of background on cluster computing, but I didn’t find the ones available to me, at the time, to be particularly helpful.

The goal of this chapter is to provide the sort of background I wish that I had when I started doing cluster computing for bioinformatics. I will not be providing a comprehensive overview of parallel computation. For example, we will not focus at all upon the rich tradition of parallel computing applications through “message passing” interfaces which can maintain a synchronized analysis from a single program executing on multiple computers at the same time. Rather, we will focus on the manner in which most bioinformatic problems can be broken down into a series of smaller jobs, each of which can be run, independently, on its own processor without the need for maintaining synchrony between multiple processes.

We start with an overview of what an HPCC consists of, defining a few important terms. Then we provide some background on the fundamental problem of cluster computing: namely that a lot of people want to use the computing power of the cluster, but it needs to be allocated to users in an equitable fashion. An understanding of this forms the basis for our discussion of job scheduling and the methods you must use to tell the job scheduler what resources you will need, so that those resources will eventually be allocated to you. We will cover a job scheduler called SLURM, which stands for the Simple Linux Utility for Resource Management. It is the scheduler used on the Summit supercomputer in Boulder and the Hummingbird and Sedna clusters deployed at UCSC and the NWFSC, respectively.

8.1 An oversimplified, but useful, view of a computing cluster

At its simplest, an HPCC can be thought of as a whole lot of computers that are all put in a room somewhere for the purposes of doing a lot of computations. Most of these computers do not possess all the elements that typically come to mind when you think of a “computer.” For example, none of them are attached to monitors—each computer resembles just the “box” part of a desktop computer. Each of these “boxes” is called a node. The node is the unit in a cluster that corresponds most closely to what you think of as a “computer.” As is typical of most computers today, each of these nodes has some amount of Random Access Memory (RAM) that is only accessible by the node itself. RAM is the space in memory that is used for active computation and calculation.

Each of these nodes might also have a hard drive used for operating system software. The bulk of the hard drive space that each node can access, however, is in the form of a large array of hard disks that are connected to all of the nodes by an interface that allows data to be transferred back and forth between each node and the hard-drive array at speeds that would be expected of an internal hard drive (i.e., this array of hard drives is not just plugged in with a USB cable). Memory on hard drives is used for storing data and the results of calculations, but is not used for active calculations the way RAM is used. In order for calculations to be done on data that are on the hard drive array, those data must first be read into a node’s RAM. After the calculations are done, the results are typically written back out onto the hard drive array.

On a typical cluster, there are usually several different “portions” of the hard drive array attached to every node. One part of the array holds home directories of the different users. Each user typically has a limited amount of storage in their home directory, but the data in these home directories is usually safe or protected, meaning you can put a file there and expect that it will be there next week, month, or year, etc. It is also likely that the home directories are backed up (but check the docs for your cluster to confirm this!). Since the space in home directories is limited, you typically will not put large data sets in your home directory for long-term storage, but you will store scripts and programs, and such items as cloned GitHub repositories there. (On the SUMMIT computer, space on the home directory is very restricted, and as we have already seen, your projects directory at /projects/username serves the purpose of a typical home directory on SUMMIT).

Another type of storage that exists on the hard drive array is called persistent long-term storage. This type of storage is purchased for use by research groups to store large quantities of data on the cluster for long periods of time. As discussed in the last chapter, the rise of cloud-based storage solutions, like Google Drive, offering unlimited storage to institutional users, makes persistent long-term storage less important (and less cost-effective) for many research groups. Finally, the third type of storage in the hard drive array is called scratch storage. There are usually fairly light limits (if any at all) to how much data can be placed in scratch storage, but there will typically be Draconian time limits placed on your scratch storage. For example, on the Hoffman2 cluster at UCLA, you are granted 2 Tb of scratch storage, but any files that have sat unmodified in scratch for more than 14 days will be deleted (and if space is tight, the system administrators may delete things from scratch in far fewer than 14 days). Check your local cluster documentation for information about time and space limits on scratch.

On many clusters, scratch space is also configured to be very fast for input and output (for example, on many systems, the scratch storage will be composed of solid state drives rather than spinning hard disks). On jobs that require that a lot of data be accessed from the drive or written to it (this includes most operations on BAM files), significant decreases in overall running time can be seen by using fast storage. Finally, scratch space exists on a cluster expressly as the place to put data and outputs on the hard drive when running jobs. For all these reasons, when you run jobs, you will always want to read and write data from and to scratch. We will talk more about the specifics of doing so, but for now, you should be developing a generic picture of your cluster computing workflow that looks like:

  1. Download big data files from the cloud to scratch
  2. Run analyses on those data files, writing output back to scratch.
  3. When done, copy, to the cloud, any products that you wish to keep.
  4. Remove data and outputs left on scratch.
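In shell commands, that workflow might look something like the following minimal sketch. Here the rclone remote gdrive, the directory names, and the script my_analysis.sh are all hypothetical placeholders; substitute your own cluster's scratch path and your own analysis commands:

# 1. download big data files from the cloud to scratch
rclone copy gdrive:raw-fastqs /scratch/username/myproject/raw

# 2. run analyses on those data files, writing output back to scratch
my_analysis.sh /scratch/username/myproject/raw /scratch/username/myproject/results

# 3. copy any products that you wish to keep back to the cloud
rclone copy /scratch/username/myproject/results gdrive:myproject-results

# 4. remove data and outputs left on scratch
rm -r /scratch/username/myproject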

As is the case with virtually all modern desktop or laptop computers, within each node, there are multiple (typically between 16 and 48) cores, which are the computer chip units that actually do the computations within a node. A serial job is one that just runs on a single core within a node, while a parallel job might run on multiple cores, at the same time, within a single node. In such a parallel job, each core has access to the same data within the node’s RAM (a “shared-memory,” parallel job). The core is the fundamental unit of computing machinery that gets allocated to perform jobs in an HPCC.

Most of the nodes in a cluster are there to hold the cores which are the computational workhorses, slogging through calculations for the HPCC’s myriad users. However, some nodes are more appropriate to certain types of computations than others (for example, some might have lots of memory for doing genome assembly, while others will have big, hurkin’, graphical processing units to be used for GPU calculations). Or, some nodes might be available preferentially for different users than for others. For these reasons, nodes are grouped into different collections. Depending on the system you are using, these collections of different nodes are called, either, partitions or queues. The world of SLURM uses the term partitions for these collections, and we will adopt that language as well, using queue to refer to the line of jobs that are waiting to start on an HPCC; however we warn the reader that other job schedulers (like the Univa Grid Engine) use the term “queues” to refer to collections of nodes.

On every cluster, there will be one to several nodes that are reserved not for doing computation, but for allowing users to access the cluster. These are called the login nodes or the head nodes. These nodes are solely for logging in, light editing of scripts, minor manipulation of directories, and scheduling and managing jobs. They are absolutely not for doing major computations. For example, you should never log in to the head node and immediately start using it, in an interactive bash session, to, say, sort BAM files or run bwa mem to do alignments. Running commands that require a lot of computation (or a lot of input and output from the disk) on the login nodes is an egregious faux pas of cluster computing. Doing so can negatively impact the ability of other users to log in or otherwise get their work done, and it might lead the system administrators to disable your account. Therefore, never do it! All of your hardcore computation on a cluster has to be done on a compute node. We will show how to do that shortly, but first we will talk about why.

8.2 Cluster computing and the job scheduler

When you do work, or stream a video, or surf the web on your laptop computer, there are numerous different computer processes running, to make sure that your computer keeps working and doing what it is supposed to be doing. In the case of your laptop, the operating system, itself, orchestrates all these different processes, making sure that each one is given some compute time on your laptop’s processors in order to get its work done. Your laptop’s operating system has a fair bit of flexibility in how it allocates resources to these different processes: it has multiple cores to assign different processes to, and it allows multiple processes to run on a single core, alternating between these different processes over different cycles of the central processing unit. Things work differently on a shared resource like an HPCC. The main, interesting problem of cluster computing is basically this: lots of people want to use the cluster to run big jobs, but the cluster does not run like a single computer.

The cluster is not happy to give lots of different jobs from lots of different users a chance to all run on the same core, sharing time by dividing up cycles. When a user wants to use the computational resources of an HPCC, she cannot just start a job and be confident that it will launch immediately and be granted at least a few CPU cycles every now and again. Rather, on an HPCC, every job that is submitted by a user will be assigned to a dedicated core (or several cores, if requested, and granted) with a dedicated amount of memory. If a core is not available for use, the job “gets in line” (into the “queue”, so to speak) where it sits and waits (doing nothing) until a core (or multiple cores if requested) with sufficient associated resources, like RAM memory, becomes available. When such a core becomes available in the cluster, the job gets launched on it. All of this is orchestrated by the job scheduler, of which SLURM is an example.

In this computing model, a job, once it is launched, ties up the core and the memory that has been allocated to it until the job is finished. While that job is running, no one else’s jobs or processes can run on the core or share the RAM memory that was allocated to the job. For this reason, the job scheduler needs to know, ahead of time, how long each job might run and what resources will be required during that time. A simple contrived example illustrates things easily: imagine that Joe and Cheryl each have 1000 separate jobs to run. Each of Cheryl’s jobs involves running a machine-learning algorithm to identify seabirds in high-resolution, aerial images of the ocean, and takes only about 20 minutes running on a single core. Each of Joe’s jobs, on the other hand, involves mapping billions of sequencing reads, a task which requires about 36 hours when run on a single core. If their cluster has only 640 cores, and Joe submits his jobs first, then, if the job scheduler were naive, it might put all of his jobs in line first, requiring some 50 or 60 hours before the first of Cheryl’s jobs even runs. This would be a huge buzz kill for Cheryl. However, if Cheryl and Joe both have to provide estimates to the scheduler of how long their jobs will run, the scheduler can make more equitable decisions, starting a few of Joe’s jobs, but retaining many more cores for Cheryl’s jobs, each of which runs much faster.

Thus, when you want to run any jobs on a cluster, you must provide an estimate of the resources that the job will require. The three main axes upon which these resources are measured are:

  1. The number of cores the job will require.
  2. The maximum amount of RAM (memory) the job will require.
  3. The amount of time for which the job will run.

Requests for large amounts of resources for long periods of time generally take longer to start. There are two main reasons for this: either 1) the scheduler does not want to launch too many long-duration, high-memory jobs because it anticipates other users will want to use resources down the road and no single user should tie up the compute resources for too long; or 2) there are so many jobs running on myriad nodes and cores, that only infrequently do nodes with sufficient numbers of cores and RAM come available to start the new jobs.

The second reason is a particular bane of new cluster users who unwittingly request more resources than actually exist (i.e. 52 cores, when no single node has more than 32; or 50 Gb of RAM when no single node has more than 48 Gb). Unfortunately (or, perhaps comically, if you have a sick sense of humor), some job schedulers will not notify you of this sort of transgression. Rather, your job will just sit in line waiting to be launched, but it never will be, because sufficient resources never become available!
(Fortunately, however, the SLURM scheduler is “wise” enough to notify the user if she or he has requested more resources than will ever become available.)

It is worth noting that regardless of whether reason 1 or reason 2 is the dominant cause influencing how long it takes to start a job, asking for fewer resources for less time will generally allow your jobs to start faster. Particularly because of reason #2, however, breaking your jobs down (if possible) into small chunks that will run relatively quickly on a single core with low RAM needs can render many more opportunities for your jobs to start, letting you tap into resources that are not often fully utilized in a cluster. Since I started working on a large cluster in which it took a long time to start a job that required all or most of the cores on a single node—but in which there were many nodes harboring a few cores that were not being used—I tend to endorse the approach of breaking jobs down into small units that will run relatively quickly, with low RAM requirements, on a single core.

Since the requested resources for a job can play such a large role in wait times for jobs to start, you might wonder why people don’t intentionally underestimate the resources they request for their jobs. The answer is simple: the job scheduler is a highly efficient and completely dispassionate manager. If you requested 2 hours for your job, but your job has not finished in that amount of time, the job scheduler will waste no time hemming and hawing or having an emotional struggle with itself about whether it should stop your job. No, at 2 hours and 0.2 seconds it WILL kill your job, regardless of whether it is just about to finish, or not. Similarly, if you requested 4 Gb of RAM, but five hours into your job, the program you are running ends up using 5 Gb of RAM to store a large chunk of data, your job WILL be killed, immediately.

Thus, it is best to be able to accurately estimate the time and resources a job will require. You always want to request more time and resources than your job will actually need, but not too much more. A large part of getting good at computing in a shared cluster resource is gaining experience in predicting how long different jobs will run, and how much RAM they will require. Later we will describe how the records of your jobs, stored by the job scheduler, can be accessed and organized to aid in predicting the resource demand of future jobs.
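As a small preview, SLURM's job accounting command, sacct, can report how long your past jobs actually ran and how much memory they used, which is exactly the information you need for calibrating future requests. The format fields shown here are standard sacct fields, but check man sacct on your own cluster:

# elapsed time, time limit, peak memory, and requested memory for your recent jobs
sacct --format=JobID,JobName,Elapsed,Timelimit,MaxRSS,ReqMem,State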

8.3 Learning about the resources on your HPCC

Most computing clusters have a web page that describes the configuration of the system and the resources available on it. However, you can use various SLURM commands to learn about those resources as well, and also to gain information about which of the resources are in use and how many users are waiting to use them.

Requests for resources, and for information about the computing cluster, are made to the job scheduler using a few basic commands. As said, we will focus on the commands available in a SLURM-based system. In a later section, we will more fully cover the commands used to launch, schedule, and manage jobs. Here we will first explore the SLURM commands that you can use to “get to know” your cluster.

All SLURM commands begin with an s, and all SLURM systems support the sinfo command that gives you information about the cluster’s nodes and their status (whether they are currently running jobs or not.) On your cluster, man sinfo will tell you about this command.

On the Sedna cluster, shortly after it was installed (and therefore before it had many users), the sinfo command gave this:

% sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
nodes*       up   infinite     28   idle node[01-28]
himem        up   infinite      1  alloc himem01
himem        up   infinite      2   idle himem[02-03]

which tells us that there are two partitions (collections of nodes) named nodes and himem. The himem partition has one node that is currently allocated to a job (STATE = alloc), and two more that are not currently allocated to jobs. The himem partition holds machines with lots of RAM for tasks like sequence assembly (and, when I ran that command, one of the nodes in himem was busy assembling the genome of some interesting sea creature). The cluster also has 28 compute nodes in the nodes partition that are idle, which means they are available for use. This is a very small cluster.

If you do the same command on SUMMIT you get many more lines of output. There are many more different partitions, and there is a lot of information about how many nodes are in each:

% sinfo
PARTITION        AVAIL  TIMELIMIT  NODES  STATE NODELIST
shas*               up 1-00:00:00      2 drain* shas[0136-0137]
shas*               up 1-00:00:00      2  down* shas[0404,0506]
shas*               up 1-00:00:00      1  drain shas0101
shas*               up 1-00:00:00      3   resv shas[0102,0521,0853]
shas*               up 1-00:00:00     74    mix shas[0125,0130,0133,0138,0141,0149,0156,0158,0218,0222,0229,0236-0237,0240-0241,0243,0246,0255,0303,0311,0314,0322,0335,0337,0341,0343,0351,0357,0402,0411,0413,0415-0416,0418,0423,0428,0432-0433,0435-0436,0440,0452,0455-0456,0459,0501,0504,0514,0522-0524,0526-0527,0556,0608,0611-0613,0615-0616,0631,0637,0801,0810-0811,0815,0834-0835,0850,0855,0907-0909,0921]
shas*               up 1-00:00:00    359  alloc shas[0103-0124,0126-0129,0131-0132,0134-0135,0139-0140,0142-0148,0150-0155,0157,0159-0160,0201-0217,0219-0221,0223-0225,0227-0228,0230-0235,0238-0239,0242,0247-0254,0256-0260,0301-0302,0304-0310,0312-0313,0315-0321,0323-0334,0336,0338-0340,0342,0344-0350,0352-0356,0358-0360,0401,0405-0410,0412,0414,0417,0419-0422,0424-0427,0429-0431,0434,0437-0439,0442-0451,0453-0454,0457-0458,0460,0502-0503,0505,0507-0513,0515-0520,0525,0528-0555,0560,0605-0607,0609,0614,0617-0630,0632-0636,0638-0650,0652-0664,0802-0809,0812-0814,0816-0833,0836-0849,0851-0852,0854,0856-0860,0901-0906,0910-0913,0915-0920,0922-0932]
shas*               up 1-00:00:00     11   idle shas[0226,0244-0245,0403,0441,0557-0559,0610,0651,0914]
shas-testing        up   infinite      2 drain* shas[0136-0137]
shas-testing        up   infinite      2  down* shas[0404,0506]
shas-testing        up   infinite      1  drain shas0101
shas-testing        up   infinite      3   resv shas[0102,0521,0853]
shas-testing        up   infinite     74    mix shas[0125,0130,0133,0138,0141,0149,0156,0158,0218,0222,0229,0236-0237,0240-0241,0243,0246,0255,0303,0311,0314,0322,0335,0337,0341,0343,0351,0357,0402,0411,0413,0415-0416,0418,0423,0428,0432-0433,0435-0436,0440,0452,0455-0456,0459,0501,0504,0514,0522-0524,0526-0527,0556,0608,0611-0613,0615-0616,0631,0637,0801,0810-0811,0815,0834-0835,0850,0855,0907-0909,0921]
shas-testing        up   infinite    359  alloc shas[0103-0124,0126-0129,0131-0132,0134-0135,0139-0140,0142-0148,0150-0155,0157,0159-0160,0201-0217,0219-0221,0223-0225,0227-0228,0230-0235,0238-0239,0242,0247-0254,0256-0260,0301-0302,0304-0310,0312-0313,0315-0321,0323-0334,0336,0338-0340,0342,0344-0350,0352-0356,0358-0360,0401,0405-0410,0412,0414,0417,0419-0422,0424-0427,0429-0431,0434,0437-0439,0442-0451,0453-0454,0457-0458,0460,0502-0503,0505,0507-0513,0515-0520,0525,0528-0555,0560,0605-0607,0609,0614,0617-0630,0632-0636,0638-0650,0652-0664,0802-0809,0812-0814,0816-0833,0836-0849,0851-0852,0854,0856-0860,0901-0906,0910-0913,0915-0920,0922-0932]
shas-testing        up   infinite     11   idle shas[0226,0244-0245,0403,0441,0557-0559,0610,0651,0914]
shas-interactive    up   infinite      2 drain* shas[0136-0137]
shas-interactive    up   infinite      2  down* shas[0404,0506]
shas-interactive    up   infinite      1  drain shas0101
shas-interactive    up   infinite      3   resv shas[0102,0521,0853]
shas-interactive    up   infinite     74    mix shas[0125,0130,0133,0138,0141,0149,0156,0158,0218,0222,0229,0236-0237,0240-0241,0243,0246,0255,0303,0311,0314,0322,0335,0337,0341,0343,0351,0357,0402,0411,0413,0415-0416,0418,0423,0428,0432-0433,0435-0436,0440,0452,0455-0456,0459,0501,0504,0514,0522-0524,0526-0527,0556,0608,0611-0613,0615-0616,0631,0637,0801,0810-0811,0815,0834-0835,0850,0855,0907-0909,0921]
shas-interactive    up   infinite    359  alloc shas[0103-0124,0126-0129,0131-0132,0134-0135,0139-0140,0142-0148,0150-0155,0157,0159-0160,0201-0217,0219-0221,0223-0225,0227-0228,0230-0235,0238-0239,0242,0247-0254,0256-0260,0301-0302,0304-0310,0312-0313,0315-0321,0323-0334,0336,0338-0340,0342,0344-0350,0352-0356,0358-0360,0401,0405-0410,0412,0414,0417,0419-0422,0424-0427,0429-0431,0434,0437-0439,0442-0451,0453-0454,0457-0458,0460,0502-0503,0505,0507-0513,0515-0520,0525,0528-0555,0560,0605-0607,0609,0614,0617-0630,0632-0636,0638-0650,0652-0664,0802-0809,0812-0814,0816-0833,0836-0849,0851-0852,0854,0856-0860,0901-0906,0910-0913,0915-0920,0922-0932]
shas-interactive    up   infinite     11   idle shas[0226,0244-0245,0403,0441,0557-0559,0610,0651,0914]
sgpu                up 1-00:00:00      1   resv sgpu0501
sgpu                up 1-00:00:00      1  alloc sgpu0502
sgpu                up 1-00:00:00      9   idle sgpu[0101-0102,0201-0202,0301-0302,0401-0402,0801]
sgpu-testing        up   infinite      1   resv sgpu0501
sgpu-testing        up   infinite      1  alloc sgpu0502
sgpu-testing        up   infinite      9   idle sgpu[0101-0102,0201-0202,0301-0302,0401-0402,0801]
sknl                up 1-00:00:00      1  drain sknl0710
sknl                up 1-00:00:00      1   resv sknl0706
sknl                up 1-00:00:00     18  alloc sknl[0701-0705,0707-0709,0711-0720]
sknl-testing        up   infinite      1  drain sknl0710
sknl-testing        up   infinite      1   resv sknl0706
sknl-testing        up   infinite     18  alloc sknl[0701-0705,0707-0709,0711-0720]
smem                up 7-00:00:00      1   drng smem0201
smem                up 7-00:00:00      4  alloc smem[0101,0301,0401,0501]
ssky                up 1-00:00:00      1   drng ssky0944
ssky                up 1-00:00:00      1    mix ssky0952
ssky                up 1-00:00:00      3  alloc ssky[0942-0943,0951]
ssky-preemptable    up 1-00:00:00      1   drng ssky0944
ssky-preemptable    up 1-00:00:00      1    mix ssky0952
ssky-preemptable    up 1-00:00:00      9  alloc ssky[0933-0934,0937-0940,0942-0943,0951]
ssky-preemptable    up 1-00:00:00      9   idle ssky[0935-0936,0941,0945-0950]
ssky-ucb-aos        up 7-00:00:00      6  alloc ssky[0933-0934,0937-0940]
ssky-ucb-aos        up 7-00:00:00      3   idle ssky[0935-0936,0941]
ssky-csu-mbp        up 7-00:00:00      1   idle ssky0945
ssky-csu-asb        up 7-00:00:00      1   idle ssky0946
ssky-csu-rsp        up 7-00:00:00      4   idle ssky[0947-0950]

Yikes! That is a lot of info. The numbers that you see on the ends of the lines there are node numbers that belong to each category.

The first column gives the partition that the nodes on the row belong to. (Note that some nodes belong to multiple partitions.) The second column tells us whether the machine(s) are operational (i.e., up). The third column gives the time limit for jobs in the partition, and the fourth column is the number of nodes in the row. The fifth column is very important: it describes the state of the nodes in the row. There are technically a lot of states that a node can be in, but the most common and relevant ones are:

  • alloc: Every core in the node is allocated to jobs that are currently running.
  • mix: Some, but not all of the cores on the node are allocated to currently running jobs.
  • drain: Jobs are running on the node, but no new jobs will be run on it, as it is being reserved (typically for maintenance).
  • idle: No cores are allocated to jobs. These are available for use!

If you wanted to see information about each node in the shas partition, you could print long information for each node like this:

% sinfo -l -N

On Summit, that creates almost 1500 lines of output. Some nodes are listed several times because they belong to different partitions. To look at results for just the standard compute partition (shas: 380 nodes with Intel Xeon Haswell processors) you can use

% sinfo -l -N -p shas

Try that command. (Or try a similar command with an appropriate partition name on your own cluster.) If you want to see explicitly, for each node in that partition, how many cores are available versus allocated, how much total memory each node has, and how much of that memory is free, you can do:

% sinfo -N -p shas -O nodelist,cpusstate,memory,allocmem,freemem

The top part of the output from that command on SUMMIT looks like:

NODELIST            CPUS(A/I/O/T)       MEMORY              ALLOCMEM            FREE_MEM            
shas0101            0/0/24/24           116368              0                   125156              
shas0102            0/24/0/24           116368              0                   112325              
shas0103            24/0/0/24           116368              116352              87228               
shas0104            24/0/0/24           116368              116352              80769  

This says shas0101 has 24 CPUs that are Out (not functional at this point). shas0102, on the other hand, has 24 CPUs that are Idle, while shas0103 has 24 CPUs that are allocated, and so forth. (In our parlance, here, CPU is being used to mean “core”). All of the nodes have 116 Gb of memory total. Most of them have about that much memory allocated to the jobs they are running.

We can throw down some awk to count the total number of available cores (and the total number of all the cores):

% sinfo -N -p shas -O  nodelist,cpusstate | awk -F"/" '{avail+=$2; tots+=$4} END {print "Out of", tots, "cores, there are", avail, "available"}' 
Out of 10848 cores, there are 331 available

That could explain why it can be hard to get time on the supercomputer: at this time, only about 3% of the cores in the system are idle, waiting for jobs to go on them.

If you want to see how many jobs are in line, waiting to be launched, you can use the squeue command. Simply issuing the command squeue will give a (typically very long) list of all jobs that are either currently running or are in the queue, waiting to be launched. This is worth doing in order to see how many different people are using (or waiting to use) the resources.

If we run the command on Sedna, we see that (as we saw before) only one job is currently running:

% squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               160     himem MaSuRCA_   ggoetz  R 34-15:56:16      1 himem01

Holy cow! Looking at the TIME column you can see that the genome assembly job has been running for over 34 days! (The time format is DAYS-Hours:Minutes:Seconds).

The output can be filtered to just PENDING or RUNNING jobs using the -t option. So,

squeue -t PENDING

lists all jobs waiting to be launched. If you wanted to see jobs that a particular user (like yourself) has running or pending, you can use the -u option. For example,

squeue -u eriq@colostate.edu

In fact that is such a worthwhile command that it is worth creating an alias for it by adding a line such as the following to your ~/.bashrc:

alias myjobs='squeue -u eriq@colostate.edu'

Then, typing myjobs shows all of the jobs that you have running, or waiting to be launched on the cluster. (Of course, you should replace eriq@colostate.edu with your own user name!)

8.4 Getting compute resources allocated to your jobs on an HPCC

8.4.1 Interactive sessions

Although most of your serious computing on the cluster will occur in batch jobs that get submitted to the scheduler and run without any further interaction from you, the user, we will begin our foray into the SLURM job scheduling commands by getting an interactive shell on a compute node. “Why?” you might ask. Well, before you launch any newly-written script as a batch job on the cluster, you really should run through the code, line by line, inside that script and ensure that things work as they should. Since your script likely does some heavy computing, you can’t do this on the head nodes (i.e., login nodes) of the cluster.

The system administrators of every different HPCC seem to have their own preferred (or required) way of requesting an interactive shell on a compute node. Before you start doing work on an HPCC, you should always read through the documents (usually available on a website associated with the HPCC) to learn the accepted way of doing certain tasks (like requesting an interactive shell). The following is a short survey of three different ways I have seen for requesting an interactive session. None of these methods is universally applicable to all HPCCs. (Again, read the documents for your own HPCC.)

One way to get an interactive shell on a compute node of a cluster is to request it with:

srun --pty /bin/bash

This tells SLURM to run bash (which is typically found at /bin/bash on most Unix systems) within a pseudo-terminal (the --pty part). If you run this command, you might be given a bash shell on a compute node. This works on the Sedna cluster, but not on all others.

On the SUMMIT cluster at CU Boulder, there is a different recommended way of requesting an interactive shell. The sysadmins have written their own script to do it, called sinteractive. So, you could do:

sinteractive

However, that method uses screen under the hood, and that can interfere with your tmux session. So, you might prefer to do:

# get on the scompile node
ssh scompile

# Then use srun, requesting the shas-interactive partition
srun --partition shas-interactive --pty /bin/bash

Note that you might only be able to get one job at a time on the shas-interactive partition.

You might also try the shas-testing partition on SUMMIT.

On the Hummingbird cluster, at UCSC, it takes a few more steps. The webpage at https://www.hb.ucsc.edu/getting-started/ tells you how to do that. Go to that page and search for Job Allocation – Interactive, Serial. That takes you to the instructions. Minimally, here is what it looks like to get one core on a node on the Instruction partition, for 1 hour, with 1 Gigabyte of memory allocated to you:

[~]--% salloc --partition=Instruction --nodes=1 --time=01:00:00 --mem=1G --cpus-per-task=1
salloc: Granted job allocation 48130
[~]--% ssh $SLURM_NODELIST

# check that you are on the right host.
[~]--% hostname
hbcomp-005.hbhpc.ucsc.edu

Regardless of the method used to request an interactive shell, when the command returns with a command prompt, you should always check the hostname to make sure that you are not, inadvertently, still on the login node. Do that with the hostname command:

hostname

Or, to make it very easy to see which node you are logged into, you can change the settings for your Unix command prompt to show you the name of the host you are logged into (Section 4.2.2). For example, to include in your command prompt the short name of the computer you are logged into, and the current working directory, separated by a : and a space, you could modify your .bashrc file to define the PS1 environment variable like so:

PS1='[\h: \W]--% '

Now, whenever you see your command prompt, it will indicate the name of the computer you are logged into.

It is also worth using your alias myjobs (see the last part of Section 8.3 to learn how to make such an alias) to check that your interactive shell session is listed amongst your running jobs.

Once you have obtained an interactive shell on a compute node by one of the above methods, it should behave like any other shell you work on. You can give commands with it and define shell variables, etc. In short, it is perfect for running through your scripts, line by line, to make sure that they are working.

When you are done using your interactive shell, you should logout from it. This can be done by typing logout, or by hitting the d key while holding down the control button (i.e. <cntrl>-d). This returns the core that you were using to the pool of cores that can be allocated to other users.

Different HPCCs have different settings for the default amount of time an interactive shell will be granted to a user. On some systems, you can request a shell for a specific amount of time by adding the --time option to the command (such as srun or salloc) that you use to request the shell.
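For example, on a system where the bare srun approach shown above is accepted, a two-hour interactive shell might be requested like this (a sketch only; follow your own cluster's recommended method):

srun --time=02:00:00 --pty /bin/bash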

8.4.2 Batch jobs

After you have tested your bioinformatic scripts in an interactive shell and are ready to run them “for real,” you will want to launch the script in a non-interactive, or batch, job. Such a job is one that you submit to the job scheduler, and then, when resources for the job become available, the scheduler will launch the job without any further interaction from you. The job will run until it completes (or fails, or runs out of time, etc.), all without you having to interact directly with the computing cluster after submitting the job.

As you might expect, the SLURM command used to submit batch jobs is named (…wait for it!…) sbatch.
The syntax for using sbatch is:

sbatch sbatch_options script.sh script_arguments

where:

  • sbatch is the command
  • sbatch_options shows the place where different options to the sbatch command should be placed. These options can be a number of things like --time=00:10:00 and --mem=4G. We will talk about sbatch options in more detail below.
  • script.sh is the script that you have written that you want to run on the HPCC. We refer to this as the job script. It does not have to be named script.sh, but, since it is a shell script, it ought to have the .sh extension. The contents of this script file effectively define the job that you wish to run. It should be a shell script, and, on most SLURM systems you need to be specific about what “flavor” of shell it should be run in. This is done using the “shebang” line at the top of the script, like #!/bin/bash (see the paragraph immediately before 5.2.1). If you want to know what path should come after the #! (the “shebang”), you can give the command which bash from the shell on one of your cluster’s login nodes (or an interactive shell on a compute node), and the shell should respond with the absolute path to the bash shell interpreter.
  • script_arguments are any other options, or other arguments (like filenames) that you may want to pass to the script.

For practice in submitting batch jobs, we will first write a simple job script that prints a few pieces of information about the shell that is running and the environment variables defined within that shell.

So, first, change into your scratch directory and make a new directory called sbatch-play and cd into it. Then, using a text editor, copy the following lines to a shell script called easy.sh and save it within your current working directory (sbatch-play). Note that the first line might have to be modified for your system, depending on the output of which bash, as mentioned above:

#!/bin/bash

echo "Starting at $(date)"
echo 
echo "Current working directory is: $PWD"
echo "Current user is: $USER"
echo "Current hostname is: $(hostname)"
echo
echo "Currently, the PATH is set to: $PATH"
echo

# now, let's print the values of all the environment variables
# and send them to stderr:
env > /dev/stderr

sleep 180
echo "Done at $(date)"

Once you have made that file, submit it to the scheduler, requesting only 5 minutes of compute time, and asking that stdout from the job be sent to a file named output-XXXXX and stderr to a file named error-XXXXX, where XXXXX is the job number assigned to your job. Job numbers are assigned in series by the job scheduler, so if the number is 4373238, then yours is the 4,373,238-th job that has been requested on the cluster.

sbatch --time=00:05:00 --output=output-%j --error=error-%j  easy.sh

You can check with your myjobs alias to see if the job has been scheduled and/or if it is running. Once it is running, you should see two new files in your current working directory: output-XXXXX and error-XXXXX.

First, check the output-XXXXX file, using the cat command. You should see that the current working directory printed by the easy.sh script is the same as the current working directory from which you scheduled the job. Additionally, the PATH variable accessed inside the easy.sh script is the same as the PATH variable in the shell that you scheduled the job from. You can check this by issuing:

echo $PATH

in the shell from which you scheduled the job.

The take-home message from this little exploration is that the shell in which your job script, easy.sh, gets executed, inherits some of the important environment variables (PATH, PWD, etc.) that are in effect in the shell from which you scheduled the job.

However, a number of new environment variables have also been defined, most of them supplying extra information from SLURM itself. To see that, we can compare the file error-XXXXX (which holds information about all the environment variables in effect when the job scheduler launched easy.sh) to the output of the env command (which prints the values of all the environment variables) when we invoke it in the shell from which the job was scheduled. Try doing that like this:

# first, make a file of the environment variables in effect in the
# shell the job was scheduled from.  Sort these alphabetically to
# make it easier to compare between the two environments
env | sort > scheduling_env.txt

# also, sort the error-XXXXX file
sort error-XXXXX > slurm_running_env.txt

# then use less to compare scheduling_env.txt and slurm_running_env.txt.

# The adventurous can use tmux to split their screen vertically into 
# two panes to make this easier.

Sorting the files, as we have done above, garbles up some multi-line bash functions, but still makes it easy to see how the files differ. In particular, the large block of environment variables that start with SLURM_ give information about the job resources and job numbers, etc. These can come in handy, though we will end up being more concerned with some of those variables when we start using SLURM job arrays.
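If you would rather compare the two files programmatically than page through them in less, the comm utility works well here, since both files are already sorted:

# lines found only in the environment of the running SLURM job
comm -13 scheduling_env.txt slurm_running_env.txt

# lines found only in the shell from which the job was scheduled
comm -23 scheduling_env.txt slurm_running_env.txt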

8.4.2.1 The sbatch --test-only option

Especially when you start using an HPCC, scheduling jobs can be a confusing and somewhat daunting process. I used to always find my blood pressure going up when I submitted large jobs. Questions swirled in my head: “Will my script run? How long will I have to wait for my job to launch? Have I requested more resources than are available on the cluster such that my job will never launch?”, etc. SLURM provides the --test-only option to sbatch that can be helpful in this regard. When sbatch is run with the --test-only option, the scheduler verifies that your script looks like a shell script, and then tells you how many cores, on how many nodes, and in which partition the job would be scheduled. Try it like this:

sbatch --test-only --time=00:05:00 --output=output-%j --error=error-%j  easy.sh

# then see what happens if you request more than 24 hours:
sbatch --test-only --time=24:05:00 --output=output-%j --error=error-%j  easy.sh

As the name of the option implies, when you are using --test-only, your job does not actually get scheduled. sbatch with the --test-only option also gives an estimate of the time when your job would be launched. This estimate seems to be completely uninformative. It might tell you that it estimates your job would launch in 3 hours, but it could be much faster than that. One absolutely critical piece of information the --test-only option will give you is a nice big warning if you are requesting more resources than are available to you. For example, on the shas partition on SUMMIT, job run lengths are capped at 24 hours, so you can’t request more than that amount of time. You can always use the --test-only option to ensure that you are not requesting more resources than are available.

We will note here that the --test-only option is somewhat akin to rclone’s --dry-run option, which lets you see what files would be copied, without actually copying them. This sort of option is always a nice feature when you want to check that you have set your requests up properly before committing computing or network resources to them.

8.4.2.2 Get messages from SLURM

If you want to be notified by email when a job starts and finishes (or fails or gets killed by the scheduler), you can use the --mail-user option to supply an email address and the --mail-type option to tell the scheduler when it should email you with updates regarding your job status. For example, supply your own email address in the following (replacing yourname@yourmail.edu) and try it:

sbatch --mail-type=ALL --mail-user=yourname@yourmail.edu --time=00:05:00 --output=output-%j --error=error-%j  easy.sh

You should receive an email from the cluster when your job starts and completes.

This can be a handy feature when you are not running too many jobs. If you have a large number of jobs, being notified every time a job starts and finishes can quickly fill up your inbox. In those cases, consider setting --mail-type=FAIL to only be notified when jobs have failed.
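For example, a failure-only version of the earlier command would be:

sbatch --mail-type=FAIL --mail-user=yourname@yourmail.edu --time=00:05:00 --output=output-%j --error=error-%j  easy.sh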

8.4.2.3 Include options inside your job script

You might note that the above command line is getting quite long and cumbersome, and it would surely become very difficult to remember and type commands with even more options, every time you wanted to start a job. On top of that, the options that you would want to invoke are typically going to be specific to the job script that you are launching (i.e., easy.sh in the above example). Consequently, SLURM lets you specify the options passed to sbatch inside the script file itself. This leads to much simpler commands on the command line, and makes your workflows somewhat more reproducible. To include the sbatch options in the job script, you add them below the “shebang” line, in lines that start with #SBATCH, before any other commands in the script.

For example, we could make a new file called easy2.sh like this:

#!/bin/bash

#SBATCH --time=00:05:00
#SBATCH --output=output-%j
#SBATCH --error=error-%j 


echo "Starting at $(date)"
echo 
echo "Current working directory is: $PWD"
echo "Current user is: $USER"
echo "Current hostname is: $(hostname)"
echo
echo "Currently, the PATH is set to: $PATH"
echo

# now, let's print the values of all the environment variables
# and send them to stderr:
env > /dev/stderr

sleep 180

And we could then test it like this:

sbatch --test-only easy2.sh

or schedule it like this:

sbatch easy2.sh

As you might imagine, sbatch can take many different options. We list the main ones that you will be concerned with while doing bioinformatics in Table 8.1.

TABLE 8.1: Commonly used options to sbatch.
Option What it does
--cpus-per-task=<ncpus> Indicate that ncpus cores are needed for each task in the job.
--chdir=<directory> Set the working directory of the batch script to “directory” before the script is run. This is seldom used if you schedule batch scripts from the directory in which they should be run.
--error=<filename pattern> Path to which stderr should be written. Like the --output option, this allows for %j, %A, and %a (see below).
--job-name=<jobname> Assign a name to the batch job. This name then appears in, for example, squeue -u user. It should be short and descriptive.
--mail-type=<type> Upon which job-scheduling events for this job should email be sent to the user? Choices include NONE, BEGIN, END, FAIL, REQUEUE, ALL.
--mail-user=<user> Address to which notification emails should be sent.
--mem=<size[units]> How much memory is requested per node. Most easily specified in gigabytes with a trailing G, as in --mem=4G.
--mem-per-cpu=<size[units]> How much memory is requested per core. For example, --mem-per-cpu=4.6G. Only one of --mem or --mem-per-cpu should be given.
--output=<filename pattern> Path to which stdout should be written while executing the batch script. Path can include %j, which expands to the job ID, or, if running as a job array, %A expands to the job ID, and %a to the job array index.
--partition=<partition_names> Request an allocation of resources from a compute node in one of the partitions listed in partition_names, which is a comma-separated list of partitions. For example, --partition=shas,compute,himem.
--time=<time> Request an allocation of a certain amount of time. Specified in days-hours:minutes:seconds or hours:minutes:seconds, for example: --time=2-12:00:00 or --time=05:00:00. Jobs running longer than this time will be killed.
--test-only Usually given directly on the command line to verify that the job submission is compliant without actually scheduling the job.
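To see how several of these options fit together, here is a sketch of a job-script preamble that combines a number of them. The partition name shas, the job name, and the resource amounts are just illustrative; adjust them to match your own cluster and job:

#!/bin/bash

#SBATCH --job-name=bwa_map
#SBATCH --partition=shas
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=02:00:00
#SBATCH --output=output-%j
#SBATCH --error=error-%j
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=yourname@yourmail.edu

# commands for the job go below here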

8.4.2.4 Override within-file sbatch directives from the command line

Since you can write options for the sbatch command either 1) on the command line (as in sbatch --time=00:03:00 ...), or 2) within the script file given to the sbatch command (i.e., lines like #SBATCH --time=00:05:00 within the top part of the script), you might wonder what happens when the same option is supplied both on the command line and within the script. It turns out that the version of the option on the command line takes precedence. That is, any options that are given on the command line itself will override the options if they are given within the file. For example, if we submitted the above easy2.sh script with the command:

sbatch --time=01:00:00 easy2.sh

it would be submitted with a time limit of 1 hour (01:00:00) rather than of 5 minutes (00:05:00, as given within the file.)

This feature is particularly useful when you find yourself needing to re-run just certain elements of a SLURM job array, as we will see in a few sections.

8.4.2.5 The vagaries of conda within sbatch

As we noted above, the shell environment in which SLURM executes your scheduled job inherits many important variables (like PATH and the current working directory) from the shell that you submit the job from. One might hope that this would be sufficient to allow you to activate a conda environment from within your job script. This turns out not to be the case: even though several conda environment variables are set up and the path to conda is set, some conda initialization steps must still happen within your job script before you can explicitly activate a conda environment there.

To see this, first make a script called conda-test1.sh that tries to activate a different conda environment within the job script. In this case, it is activating an environment called trimmo. If you don’t have a trimmo environment, then you should replace trimmo with the name of another environment that you do have.

#!/bin/bash

#SBATCH --time=00:03:00
#SBATCH --output=conda-test1-output-%j
#SBATCH --error=conda-test1-error-%j 


echo "Starting at $(date)"
echo 
echo "Current working directory is: $PWD"
echo "Current user is: $USER"
echo "Current hostname is: $(hostname)"
echo
echo "Currently, the PATH is set to: $PATH"
echo
echo "Activating environment bioinf"
conda activate trimmo
echo
echo "Now the PATH is: $PATH"

sleep 120

and then run that with:

sbatch conda-test1.sh

Make a note of the job number reported after you submit this job. That is the XXXX you will use in conda-test1-error-XXXX.

After that has run, reading the contents of the conda-test1-error-XXXX file, you should see an error about not being able to activate a conda environment. Likewise, if you look at the two PATHs printed out in conda-test1-output-XXXX, you will see that they are the same. So, indeed, the trimmo environment was not activated.

One way to remedy this problem is to source your ~/.bashrc (which seems to be where Miniconda3 establishes its “conda init” block) from within your job script.

For example, create a new file called conda-test2.sh and fill it with the following:

#!/bin/bash

#SBATCH --time=00:03:00
#SBATCH --output=conda-test2-output-%j
#SBATCH --error=conda-test2-error-%j 

source ~/.bashrc

echo "Starting at $(date)"
echo 
echo "Current working directory is: $PWD"
echo "Current user is: $USER"
echo "Current hostname is: $(hostname)"
echo
echo "Currently, the PATH is set to: $PATH"
echo
echo "Activating environment bioinf"
conda activate trimmo
echo
echo "Now the PATH is: $PATH"
echo
echo "And the environment variables look like:"
echo "================================"
env

sleep 120

Then run it with:

sbatch conda-test2.sh

If conda is working properly within the job script there should be no output to conda-test2-error-XXXX and you should see that invoking conda activate trimmo changed the PATH variable to include paths from the trimmo environment.

The above points are very important if you want to use software like bwa, samtools, and bcftools, that you installed using Miniconda, from within a job script.

To repeat: to activate a conda environment named env_name, for example, from within a job script running under sbatch you must add these lines:

source ~/.bashrc
conda activate env_name

8.4.2.6 Test a longer running bash script

Now we are going to make a job script to map a pair of trimmed fastq files in our non-model-wgs-example-data directory.

Make a new directory in non-model-wgs-example-data called scripts and then make a job script called map_s002---1.sh inside the scripts directory with these lines in it (modifying the email to be your email address):

#!/bin/bash

#SBATCH --time=01:00:00
#SBATCH --output=results/slurm_logs/map/map_s002---1-output-%j
#SBATCH --error=results/slurm_logs/map/map_s002---1-error-%j 
#SBATCH --mail-type=ALL 
#SBATCH --mail-user=yourname@yourmail.edu


source ~/.bash_profile  # or ~/.bashrc if that is how you are set up
conda activate bwasam


mkdir -p results/bam  # a directory in which to put the output from stdout
mkdir -p results/logs/bwa  # a directory in which to put the output from stderr
mkdir -p results/logs/samtools-sort
SU=s002---1

RG='@RG\tID:s002_T199968_Lib-1_HY75HDSX2_1_CAGGTTCA+GGCGTTAT\tSM:T199968\tPL:ILLUMINA\tLB:Lib-1\tPU:HY75HDSX2.1.CAGGTTCA+GGCGTTAT'
PREFIX=results/trimmed/$SU
OUT=results/bam/$SU.bam
BWALOG=results/logs/bwa/$SU.log
SAMSORTLOG=results/logs/samtools-sort/$SU.log

bwa mem \
-R $RG \
resources/genome.fasta \
${PREFIX}_R1.fq.gz  ${PREFIX}_R2.fq.gz 2> $BWALOG |  \
samtools view -u - |  \
samtools sort -l 9 -o $OUT - 2> $SAMSORTLOG

And schedule it to run (from within the non-model-wgs-example-data directory) with:

mkdir -p results/slurm_logs/map
sbatch scripts/map_s002---1.sh

Note that we make the directory results/slurm_logs/map to put the slurm output into, but most of the program output should go to the output files and the logs.

While this job is running, it is worthwhile to reflect on how crazy it would be to prepare a separate script for each of the pairs of FASTQ files in our non-model-wgs-example-data data set. Doing so would mean a lot of copying and pasting followed by painstaking revision of the read group strings in the files, etc. There would be a lot of nearly redundant scripts, and it would even be a hassle to manually launch each one. Obviously, using a little programming and some variable names, it would be better to have just a single script that could be applied, with minor adjustments that happen at run-time, to all the pairs of FASTQ files.

The next section shows one powerful way that the SLURM scheduler makes such automation of repetitive tasks a little easier: job arrays.

8.4.3 SLURM Job Arrays

Very often in bioinformatics, large jobs can be broken down into a series of smaller tasks. As an example, aligning sequencing read data from an individual sample to a reference genome can always be done independently of the sequences from any other individuals. Therefore a job that demands that sequencing data from 96 different individuals be aligned to a reference genome can be broken into 96 different tasks: each task involves mapping the sequencing data of just a single individual to the reference genome.

The bash shell for loop provides a way to cycle over different repetitive tasks. In the case of alignment over multiple individuals, we might use a for loop in a shell script in the following fashion, to repeatedly run a script called align.sh on the paired-end reads from each of four different individuals:

INDIVS="Ind_1 Ind_2 Ind_3 Ind_4"

for ind in $INDIVS; do
  align.sh $ind.R1.fq.gz $ind.R2.fq.gz 
done

Such an approach will work OK on an HPCC if each task (aligning a single individual, in this case) takes just a short amount of time. But, what if aligning each individual takes almost a day of compute time, and you are only authorized to run jobs less than 24 hours on your HPCC? What is needed in that event is a convenient way to break jobs down into separate tasks, such that each task is run as its own separate job on the HPCC.

SLURM provides a feature known as job arrays for handling just this type of use case. When you submit a job array, you are, in effect, submitting to SLURM a recipe that is telling the scheduler that you are submitting some number of instances of a particular job. As resources become available, each new instance of that job starts to run. The different “instances” can be different replicates of a complex simulation, or analyses done on different instances of a data set. Our alignment example falls into the latter case: each instance of the job would be aligning the reads of a different individual.

SLURM job arrays are submitted by using the --array=<array_spec> option to sbatch. Each task of a job array is associated with a whole number, and the <array_spec> argument to the --array option tells which numbers (and, hence, which tasks) should be scheduled to run. For example, if you submit a script named mytasks.sh to sbatch, and the script includes in the SLURM preamble a line that says:

#SBATCH --array=1-10

then SLURM will schedule ten different jobs, indexed from 1 up to 10. That is all well and good, but, in order to be useful, those 10 different jobs had best do different things (for example, operate on 10 different data sets). The user must see to it that each task of a job array performs the proper instance of a job, and this is done using the shell environment variable SLURM_ARRAY_TASK_ID.

When the first of the 10 scheduled tasks in the above job array is launched, the environment variable SLURM_ARRAY_TASK_ID is set to the value 1, and then the contents of mytasks.sh are executed. When the second task is launched, SLURM_ARRAY_TASK_ID is set to 2, and then mytasks.sh is run, and so forth, until the tenth job is launched with SLURM_ARRAY_TASK_ID set to 10. Therefore, inside the mytasks.sh script, the user can use the value of SLURM_ARRAY_TASK_ID (accessed via variable substitution with a dollar sign, e.g., $SLURM_ARRAY_TASK_ID) to direct the script to perform the proper instance of the job that is to be done.

Thus, we could write our alignment example with four individuals, above, as a job array with a script file like this:

#!/bin/bash
#SBATCH --array=1-4

IND=Ind_$SLURM_ARRAY_TASK_ID

align.sh $IND.R1.fq.gz $IND.R2.fq.gz

Submitting that job with sbatch will submit four different tasks as part of a job array, each task being the alignment of sequencing reads from one of four different individuals.

8.4.3.1 Naming SLURM job array tasks

Internally, when you submit a job array to SLURM, it creates a job ID for the job array and then it reserves specific job numbers for the tasks within the array. However, you will find it easiest to refer to the separate tasks in a SLURM job array using the syntax ArrayJobNumber_ArrayTaskNumber. In other words, if we submitted the job array script listed above for aligning the reads from four individuals, and SLURM told us that the job-array had been assigned the job number 6677, then you can refer to the different tasks as 6677_1 for the job aligning reads from Ind_1, 6677_2 for the job aligning reads from Ind_2, and so forth.
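
This ArrayJobNumber_ArrayTaskNumber syntax is accepted by the usual SLURM commands. For example (using the hypothetical job number 6677 from above), you could cancel a single task of the array, or the whole array, with scancel:

# cancel only task 2 of the (hypothetical) job array 6677
scancel 6677_2

# cancel all tasks of the job array
scancel 6677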

When using the --output or --error options to direct stdout and stderr to files during execution of the scripts submitted using the --array option to sbatch, you can name those files using %A which expands to the overall job array number and %a which expands to the SLURM array task ID. So for example:

#SBATCH --output  stdout_%A_%a.txt

would be useful.

8.4.3.2 Variations on the <array_spec>

In the simplest case, you will want to evaluate tasks with indexes that proceed from 1 up to some stopping number, stepping by 1 each time. However, you can exert rather finely tuned control over which array task IDs you would like to be executed. The important thing to remember with all these variations is that no spaces are allowed!

As we have seen already, a dash is used to represent a range of numbers, such as:

#SBATCH --array=1-50

To change the step size, you can immediately follow the second number of the range with a colon, :, and give the step size. For example, to cycle over jobs with array indexes 1, 4, 7, and 10, you could do:

#SBATCH --array=1-10:3

If you want to pick out specific single indexes, you can combine single values and ranges with commas. For example, using just single values separated by commas:

#SBATCH --array=1,2,3,6,9,15

You can even combine ranges with step sizes with commas. This would run indexes 1, 11, 21 and 100, 150, 200:

#SBATCH --array=1-21:10,100-200:50

Being able to pick out single array index values to be done is wonderfully convenient for redoing tasks that might have failed. For example, suppose you ran a job with 500 tasks, and upon investigating the job logs you found that array tasks 47, 368, 477, and 489 had failed because insufficient time was allocated to them. In that case you could update the --time setting of your script and then re-run just those four jobs by including:

#SBATCH --array=47,368,477,489

Be warned, however, that SLURM does not check your <array_spec> to ensure that each task index is given only once. If you specify that the same index should be run twice, as is the case for 4 in the following:

#SBATCH --array=1-10,4

then task 4 will be run twice, and possibly those two runs will occur at the same time, so that output from each will intermittently overwrite output from the other, leading to corrupted results. Just be careful not to request the same task index more than once!

Finally, if you want to be friendly to the other users of your cluster and want no more than X of your array tasks to be running at any time, you can stipulate that by ending the <array_spec> with %X. For example, to have no more than 5 array tasks running at any one time, you could do:

#SBATCH --array=1-20%5

8.4.3.3 Translating Array Indexes to Job Instances

The user is left to translate what a job array index of, say, 7, means in terms of what actions that array task should take. Quite often you will want to map an array index to a different file to analyze, or perhaps a different region of a chromosome to do variant calling on, etc. A flexible and generic way of doing this mapping from array indexes to job specifics is to first define the variables (things like filenames, etc.) required by each array task in a simple TAB-delimited text file in which the first row holds the names of the variables in different TAB-separated columns, and each row below that holds the values that those variables should take for different values of the array index. The array index itself should be listed in the first column.

For example, if we have 10 different files to process, each with different read-group values (Section 19.2), you can specify the filenames and elements of the read group values in a TAB-delimited text file called files-and-groups.tsv that looks like this:

index   file_prefix LB  Flowcell    Lane
1   DPCh_plate1_A01_S1_L1_R Lib-1   HTYYCBBXX   1
2   DPCh_plate1_A01_S1_L2_R Lib-1   HTYYCBBXX   2
3   DPCh_plate1_A01_S1_L3_R Lib-1   HTYYCBBXX   3
4   DPCh_plate1_A01_S1_L4_R Lib-1   HTYYCBBXX   4
5   DPCh_plate1_A01_S1_L5_R Lib-1   HTYYCBBXX   5
6   DPCh_plate1_A01_S1_L6_R Lib-1   HTYYCBBXX   6
7   DPCh_plate1_A01_S1_L7_R Lib-1   HTYYCBBXX   7
8   DPCh_plate1_A01_S1_L8_R Lib-1   HTYYCBBXX   8
9   DPCh_plate1_A02_S2_L1_R Lib-1   HTYYCBBXX   1
10  DPCh_plate1_A02_S2_L2_R Lib-1   HTYYCBBXX   2

Then, when executing a job array script, this file can be used to convert the SLURM_ARRAY_TASK_ID into the necessary file names and other variables with a simple awk script and variable assignments within your job script that look like this:

# make a command line that holds assignments of values to shell variables
ASSIGN=$(awk -F"\t" -v N=$SLURM_ARRAY_TASK_ID '
  NR==1 {for(i=1;i<=NF;i++) vars[i]=$i; next}
  $1 == N {for(i=1;i<=NF;i++) printf("%s=%s; ", vars[i], $i)}
' files-and-groups.tsv)

# make those assignments
eval $ASSIGN

Then, when SLURM_ARRAY_TASK_ID is equal to 7, for example, the variable ASSIGN gets the string value:

index=7; file_prefix=DPCh_plate1_A01_S1_L7_R; LB=Lib-1; Flowcell=HTYYCBBXX; Lane=7;

and then evaluating that string with eval $ASSIGN will define for you the values of the shell variables index, file_prefix, LB, Flowcell, and Lane, as indicated above. These shell variables may be used in subsequent code in your script (via variable substitution with the $, like $file_prefix) to ensure that the proper files are processed, etc., for each different value of SLURM_ARRAY_TASK_ID.
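
For example, later lines in the same job script might use those variables to build file names and a read-group string. This is just a hedged sketch: the FASTQ naming convention and the read-group fields shown here are hypothetical, so adapt them to your own files.

# a hypothetical use of the variables defined by eval $ASSIGN
FQ1=${file_prefix}1.fq.gz    # e.g., DPCh_plate1_A01_S1_L7_R1.fq.gz when the task ID is 7
FQ2=${file_prefix}2.fq.gz
RG="@RG\tID:${Flowcell}.${Lane}\tLB:${LB}\tPL:ILLUMINA"

echo "Task $SLURM_ARRAY_TASK_ID would map $FQ1 and $FQ2 with read group $RG"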

8.4.3.4 SLURM Job Array hands-on

Here, you can run some small scripts to get your feet wet with defining SLURM job arrays. Create a directory in your scratch directory named job-array-fun and then cd into that directory. In that directory, save the following lines to a script named test-arrays.sh and then submit it via sbatch test-arrays.sh. It should produce three output files.

#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --output=my_output_%A_%a
#SBATCH --array=1-3

echo "The SLURM_ARRAY_JOB_ID is : $SLURM_ARRAY_JOB_ID"
echo "The SLURM_ARRAY_TASK_ID is: $SLURM_ARRAY_TASK_ID"
echo "The SLURM_JOB_ID is: $SLURM_JOB_ID"
echo
echo "You can refer to this individual SLURM array task as: ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
echo

sleep 180

Once this has been submitted, use your myjobs alias to see how it looks while it is queued. You should see something like this:

            JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     4499013_[1-3]      shas test-arr eriq@col PD       0:00      1 (Priority)

If it is taking a while to start this job, and you don’t have a myjobs alias, put one in your ~/.bashrc. Add these lines:

# aliases for SLURM stuff
alias myjobs='squeue -u username'

where you have to change username to your login name on the HPCC. Note that, for Colorado State University affiliates on SUMMIT, that username is of the form CSUeID@colostate.

If you would like to see each one of the tasks in a job array listed on its own line, you can use the -r, --array option to squeue. In terms of a myjobs alias that means issuing myjobs -r or myjobs --array (both are equivalent):

% myjobs -r
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         4499052_1      shas test-arr eriq@col PD       0:00      1 (Priority)
         4499052_2      shas test-arr eriq@col PD       0:00      1 (Priority)
         4499052_3      shas test-arr eriq@col PD       0:00      1 (Priority)

I want people to look closely at the variables SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID, that get written to stdout by each job into the files my_output_%A_%a (where %A and %a expand to the job array ID and the job array task ID, respectively) and to understand how those are different from SLURM_JOB_ID in this context. We will discuss that in class.

Unfortunately, it seems like it can take a long time for that to start on SUMMIT. So, here are the results:

==> my_output_27234_1 <==
The SLURM_ARRAY_JOB_ID is : 27234
The SLURM_ARRAY_TASK_ID is: 1
The SLURM_JOB_ID is: 27235

You can refer to this individual SLURM array task as: 27234_1


==> my_output_27234_2 <==
The SLURM_ARRAY_JOB_ID is : 27234
The SLURM_ARRAY_TASK_ID is: 2
The SLURM_JOB_ID is: 27236

You can refer to this individual SLURM array task as: 27234_2


==> my_output_27234_3 <==
The SLURM_ARRAY_JOB_ID is : 27234
The SLURM_ARRAY_TASK_ID is: 3
The SLURM_JOB_ID is: 27234

You can refer to this individual SLURM array task as: 27234_3

Next, we want to explore the possibility of overriding the #SBATCH --array option given inside test-arrays.sh with one given on the command line. Thus, try submitting the job with:

sbatch --array=100,200,300 test-arrays.sh

When that job has finished, go ahead and look at the output files and make sure it is clear how the SLURM_JOB_IDs and the SLURM_ARRAY_JOB_IDs and the SLURM_ARRAY_TASK_IDs vary.

8.4.3.5 More on converting Task IDs to variables from a tab delimited file

We want to return, for a moment, to the useful task of assigning shell variables from a line of values in a TAB-separated-values file, as was done in Section 8.4.3.3. This can be incredibly useful, so we will be a little more formal about it.

8.4.3.5.1 Make a numbered-units.tsv file

First, we will describe how to make a file called numbered-units.tsv from the file units.tsv that we have in our non-model-wgs-example-data directory. This file is just like units.tsv, except that it gets an additional first column, called index, that holds the row number of each row of values in the file. We do this with awk. To learn more about awk see Section 6.1.

# Do this within your non-model-wgs-example-data directory
awk 'BEGIN {OFS="\t"} NR==1 {print "index", $0; next} {print ++n, $0}' units.tsv > numbered-units.tsv

Now, the first few columns of file numbered-units.tsv look like this:

index   sample  unit    library flowcell        platform        lane    sample_id       barcode
1       s001    1       Lib-1   HY75HDSX2       ILLUMINA        1       T199967 AAGACCGT+CAATCGAC
2       s002    1       Lib-1   HY75HDSX2       ILLUMINA        1       T199968 CAGGTTCA+GGCGTTAT
3       s002    2       Lib-1   HY75HDSX2       ILLUMINA        2       T199968 CAGGTTCA+GGCGTTAT
4       s003    1       Lib-1   HY75HDSX2       ILLUMINA        2       T199969 TTACCGAC+CGTATTCG
5       s003    2       Lib-1   HY75HDSX2       ILLUMINA        3       T199969 TTACCGAC+CGTATTCG
6       s003    3       Lib-1   HTYYCBBXX       ILLUMINA        2       T199969 TTACCGAC+CGTATTCG
7       s004    1       Lib-1   HY75HDSX2       ILLUMINA        3       T199970 TCGTCTGA+TCAAGGAC
8       s004    2       Lib-2   HY75HDSX2       ILLUMINA        4       T199970 AGTGACCT+CTCCTAGA
9       s005    1       Lib-1   HY75HDSX2       ILLUMINA        2       T199971 TTCCAGGT+AAGCACTG
10      s005    2       Lib-1   HY75HDSX2       ILLUMINA        4       T199971 TTCCAGGT+AAGCACTG
11      s005    3       Lib-3   HTYYCBBXX       ILLUMINA        2       T199971 AGCCTATC+GTTACGCA
12      s006    1       Lib-1   HY75HDSX2       ILLUMINA        1       T199972 TACGGTCT+GCAATGGA
13      s006    2       Lib-1   HTYYCBBXX       ILLUMINA        3       T199972 TACGGTCT+GCAATGGA
14      s006    3       Lib-2   HY75HDSX2       ILLUMINA        2       T199972 TCATCTCC+CTAGCAAG
15      s006    4       Lib-2   HTYYCBBXX       ILLUMINA        4       T199972 TCATCTCC+CTAGCAAG
16      s007    1       Lib-1   HY75HDSX2       ILLUMINA        2       T199973 TAGGAGCT+GTTAAGGC
17      s007    2       Lib-1   HY75HDSX2       ILLUMINA        3       T199973 TAGGAGCT+GTTAAGGC
18      s007    3       Lib-2   HY75HDSX2       ILLUMINA        2       T199973 CCAGTATC+ATCTCGCT
19      s007    4       Lib-2   HY75HDSX2       ILLUMINA        3       T199973 CCAGTATC+ATCTCGCT
20      s008    1       Lib-1   HY75HDSX2       ILLUMINA        1       T199974 TACTCCAG+CCTATACC
21      s008    2       Lib-2   HY75HDSX2       ILLUMINA        1       T199974 TTGCGAGA+GTGCCATA
22      s008    3       Lib-3   HY75HDSX2       ILLUMINA        1       T199974 GAACGAAG+GGTGATTC

Having the row number in the first column is helpful if we want to pick out particular rows in the file.
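
For example, a one-liner to print just the row whose index is 10 (run from within the non-model-wgs-example-data directory):

# print the row of numbered-units.tsv whose first column (the index) is 10
awk -F"\t" '$1 == 10' numbered-units.tsv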

8.4.3.5.2 Make a line_assign.sh script and put it in your PATH

Next, we are going to wrap the awk code we used in Section 8.4.3.3 into a little shell script that we put in our PATH, so that we can use it from any directory we might be in.

Create a directory called bin in your home directory with:

mkdir -p ~/bin

Then open a file called line_assign.sh in the bin directory, using nano with this command:

nano ~/bin/line_assign.sh

Copy the following script into nano and save it and exit nano.

#!/bin/bash
# simple script.  Give it a TAB-delimited file as $2, with the first column
# holding indexes.  This will pick out the line whose index matches $1 and
# return a command line that assigns, to shell variables named after the
# column headers, the values found in that row of the file $2.

if [ $# -ne 2 ]; then
  echo "Wrong number of arguments in $0 
Expecting two arguments, with syntax like this:

    line_assign.sh IndexValue  TSV-file
    
Where 
    IndexValue is the number in the first column of the row you want to
       extract values from.
    TSV-file is a tab-separated-values file with the first column named
       index, and the other columns named according to shell variables whose
       values you want to set to the values in the row IndexValue
       
Typical usage:

    eval \$(line_assign.sh 7 numbered-units.tsv)

That will set, in the current shell, a series of shell variables whose
names are the column names to values in the
row IndexValue
" > /dev/stderr

exit 1

fi

awk -F"\t" -v LINE=$1 '
  $1 == "index" {for(i=1; i<=NF; i++) vars[i]=$i; next}
  $1 == LINE {for(i=1; i<=NF; i++) printf("%s=\"%s\"; ", vars[i], $i)}
' $2

Once you have saved and exited that file, you have to use chmod to make sure that it is executable:

chmod u+x ~/bin/line_assign.sh

Finally, we just have to make sure that your shell always knows to look in ~/bin for shell scripts. So, edit your .bashrc file with nano, using:

nano ~/.bashrc

to add the following line above the # >>> conda initialize >>> block:

export PATH=$PATH:~/bin

When you have made that change to your ~/.bashrc file, be sure to run

source ~/.bashrc

to make sure the change is applied to your current shell. (NOTE: you only need to do this to apply the change in currently open terminals. When you log in fresh, your full ~/.bashrc contents will be loaded.)

After this, you should be able to type

line_assign.sh

from any directory on your system, and get the usage message about how to use the line_assign.sh script.

Finally, in your non-model-wgs-example-data directory, run the script to see the command lines it returns:

line_assign.sh 10 numbered-units.tsv

See how that is a series of commands separated by semicolons. Each one is an assignment of a value to a shell variable.

Recalling what we learned about how to substitute the output of commands back on the command line with the $(comm) syntax (Section 5.4) and how to use the eval keyword to ensure the shell evaluates substituted values (Section 5.3.3), note that you can do this:

eval $(line_assign.sh 10 numbered-units.tsv)

and, all of a sudden, you will have shell variables named index, sample, unit, library, flowcell, platform, lane, sample_id, barcode, fq1, fq2, kb1, and kb2, and their values are all of the values in row index 10 of the file numbered-units.tsv.

Try it out by looking at the values of some of these. Evaluate all of the following:

echo "
sample will expand to: $sample
unit will expand to: $unit
library will expand to: $library
sample_id will expand to: $sample_id
"

Yes! This is just what we need to prepare a killer SLURM job array to map all of our trimmed reads in non-model-wgs-example-data/results/trimmed to our genome.
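
To give a flavor of how that might look, here is a minimal sketch of such a job-array script. It assumes it is submitted from within the non-model-wgs-example-data directory and that line_assign.sh is on your PATH; the time request is just a placeholder, and the mapping commands themselves are left as a comment.

#!/bin/bash
#SBATCH --array=1-22
#SBATCH --time=08:00:00
#SBATCH --output=results/slurm_logs/map/map-%A_%a.out
#SBATCH --error=results/slurm_logs/map/map-%A_%a.err

# turn this task's index into shell variables (sample, unit, library,
# flowcell, lane, fq1, fq2, etc.) from the corresponding row of the file
eval $(line_assign.sh $SLURM_ARRAY_TASK_ID numbered-units.tsv)

# ...then use those variables to build the read-group string and run
# bwa mem | samtools sort on ${fq1} and ${fq2}, much like the single
# mapping script from earlier in this chapter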

Let’s proceed to Section 27.6.3.

8.5 Keeping track of how many resources were used or are being used

8.5.1 Checking out resource use by currently running jobs

If you want to get more information about a currently running job you can use:

scontrol show job <job_id>

where <job_id> is the SLURM job ID.

This can be helpful if you see someone using all the cores on a node and you want to see what they are doing. You can log in to the node with ssh nodeXX (or whatever the node name is) and then run top or ps -ax to find their processes.
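
For example, the whole sequence of sleuthing steps might look like this (the job ID and node name here are hypothetical):

scontrol show job 4499099    # see which node(s) the job was allocated
ssh node34                   # hop onto one of those nodes
top                          # watch per-process CPU and memory use there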

You might find that someone (or even you yourself) has checked out 20 cores, but their job only ends up using one of them.

So, this can be helpful for guiding your own resource requests, and those of others.

8.5.2 How efficient were you with allocated cores and memory?

After a job has finished, some SLURM environments provide the seff command, which stands for SLURM efficiency. It gives a nice summary of how much CPU time you used as a fraction of how much you were allocated, and it does the same for memory.

The syntax is seff <job_id>. For example:

% seff 474963

Gives, in my case, output that looks like:

Job ID: 474963
Cluster: sedna
User/Group: eanderson/eanderson
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:10:30
CPU Efficiency: 99.21% of 00:10:35 core-walltime
Job Wall-clock time: 00:10:35
Memory Utilized: 443.09 MB
Memory Efficiency: 9.63% of 4.49 GB

8.6 Last year’s stuff. Now somewhat defunct

Finally, we want to contemplate strategies for turning the SLURM_ARRAY_TASK_ID into “directions” for the script about what to do (which files to process, etc.)

We discussed one approach, above, using a TAB-delimited file, but want to note here that depending on how your actual script is configured, any number of approaches might be suitable.

For an example germane to doing alignment to a reference genome in our homework chr-32-bioinformatics, consider that you have 1280 pairs of files to process, and that you will launch jobs to process 128 such pairs at a time. A script called run-single-job.sh that mapped the first 128 pairs of files looked like this:

#!/bin/bash

#SBATCH --time=05:00:00
#SBATCH --output=run-single-out-%j
#SBATCH --error=run-single-error-%j
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@email.edu


./map-N-files-from-K.sh 128 1

If you wanted to map the next 128 pairs of files, the final line in the above script should look like:

./map-N-files-from-K.sh 128 129

and to do the next 128 pairs of files after that, the line should look like:

./map-N-files-from-K.sh 128 257

and so forth.

So, if everyone has already run the first 128 pairs of files (as everyone should have finished for the homework) I contend that, if we wanted to modify this script so that the remaining 1152 pairs of files could be run as a job array, our new script could look like this:

#!/bin/bash

#SBATCH --time=02:00:00
#SBATCH --output=run-array-out-%A_%a
#SBATCH --error=run-array-error-%A_%a
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@email.edu
#SBATCH --array=1-9


./map-N-files-from-K.sh 128 $((SLURM_ARRAY_TASK_ID * 128 + 1))

Save that file as run-array-job.sh.

Note the changes:

  • output and error files modified to use %A_%a
  • Added the --array option, running from 1 to 9.
  • Changed the starting index from 1 to $((SLURM_ARRAY_TASK_ID * 128 + 1))

The last one deserves some explanation: it uses the integer arithmetic capabilities of the bash shell (Section 5.3.6).

So, what is that last line doing? You should check it for yourself. Doing so will emphasize an important point when it comes to testing your own job-array scripts: if you are testing the code in them interactively, you can simply assign any value you want to a variable named SLURM_ARRAY_TASK_ID and verify the results are what you would like them to be. So, try this:

SLURM_ARRAY_TASK_ID=1
echo ./map-N-files-from-K.sh 128 $((SLURM_ARRAY_TASK_ID * 128 + 1))

And to check them all, including when SLURM_ARRAY_TASK_ID=0 (which is the case that you already did for your homework, running it as a single job), try this:

for SLURM_ARRAY_TASK_ID in {0..9}; do
  echo ./map-N-files-from-K.sh 128 $((SLURM_ARRAY_TASK_ID * 128 + 1))
done

Once you have convinced yourself that your run-array-job.sh script will work correctly, and it is set for an array=1-9, and you have configured your email properly within it, then you should copy it to your chr-32-bioinformatics directory and submit it as a job to finish aligning the reads from all the remaining individuals, so we will have them for variant calling down the road.

THAT LAST STEP ABOVE IS PART OF YOUR HOMEWORK FOR NEXT WEEK THAT WILL BE DUE NEXT TUESDAY

8.7 PREPARATION INTERLUDE: An in-class exercise to make sure everything is configured correctly

Here, we run through a number of steps to start aligning sequence data to a reference genome. The main purpose of this exercise is to ensure that everyone has their systems configured correctly. Namely, we want to make sure that:

  • You can log in to your cluster (SUMMIT, Hummingbird, etc.)
  • You can use a few SLURM (job scheduler) commands.
  • You can find your way to your scratch space on SUMMIT (or Hummingbird). scratch is fast, largely unlimited short-term storage. It is where you will do the majority of your bioinformatics.
  • You can use rclone to transfer files from Google Drive to your cluster. Doing things this way will allow you to access your Lab’s Team Drive. The model for bioinformatic work is:
    1. Copy data from your Lab’s Google Team Drive to scratch on the cluster.
    2. Do analyses of the data (in scratch)
    3. Copy the results back to your Lab’s Google Team Drive before it gets deleted (by the system administrators) off of scratch. Note: today you are not using your lab’s Team Drive, you are using the “class” shared drive, and you won’t be copying anything back to it. But this is a start at least…
  • You have Miniconda installed, you have a conda environment with bioinformatic tools in it, and you can activate that environment and use the tools within it. Miniconda makes it easy to install software that you need to run analyses.

Being able to successfully do all these things will be necessary to complete the next homework assignment which involves aligning short reads to a reference genome and will be assigned after spring break.

  1. If you are not already in a tmux session, start a new tmux session called inclass on the login node.

    tmux new -s inclass
  2. Get a bash shell on a compute node

    # on summit:
    sinteractive
    
    # on hummingbird
    salloc --partition=Instruction --nodes=1 --time=02:00:00 --mem=4G --cpus-per-task=1
    ssh $SLURM_NODELIST
    
    # otherwise, you can try this (will work on Sedna)
    srun --pty /bin/bash
  3. Check what host you are on with hostname

    hostname
    
    # The hostname should be something like shasXXX. 
    #  Do not proceed if you are on a login node still (i.e., the hostname is loginXX)

    If you are still on a login node, try running sinteractive again.

  4. Use your alias myjobs (if you have created it) to see what jobs you have running.

    myjobs
    
    # or, if you don't have that aliased:
    squeue -u your_username
    
    # You should see that you have one job running.  That "job" is a bash shell
    # that you are interactively working with.
  5. Navigate to your scratch directory using cd. On SUMMIT that will look like this:

    cd /scratch/summit/wcfunk\@colostate.edu/

    You can make a symbolic link to that in your home directory like this:

    cd  # returns you to your home directory
    
    # this line makes a symbolic link called scratch in your home directory
    # that points to your scratch space at /scratch/summit/wcfunk\@colostate.edu/
    ln -s /scratch/summit/wcfunk\@colostate.edu/ scratch

    Just be sure to use your own user name instead of wcfunk. Once that symbolic link is set up, you can cd into it, just like it was your scratch directory:

    cd scratch

    But if you don’t want to set up a symbolic link just use the full path:

    cd /scratch/summit/wcfunk\@colostate.edu/

    On hummingbird you go to

    cd /hb/scratch/username/

    where username is your ucsc username. Note that if you want to make a symbolic link to that, then while in your home directory do this:

    ln -s /hb/scratch/username scratch
  6. Within your account’s scratch space, make a directory called chinook-play and cd into it:

    mkdir chinook-play
    cd chinook-play/
  7. Make sure you have followed the link in your email to the shared Google Drive folder I sent you last night. If the email went to an email address that is not associated with your rclone configuration for Google Drive, then request access (at Google Drive) for that email. (Or set up a new rclone config for your other Gmail address…)

  8. Now, use rclone to list the files in the folder I shared with you:

    rclone lsd --drive-shared-with-me gdrive-pers:CSU-con-gen-2020

    Note that gdrive-pers above is the name of my rclone configuration associated with the gmail account where I got an email (from myself at my CSU account) about the shared folder. You will have to change gdrive-pers to be appropriate to your config, depending on how you set it up. Note also how you access folders shared with you using --drive-shared-with-me.

  9. After you get the above command to work, you should see something like the following, though it will change as more things get added to this directory…

    -1 2020-03-09 17:51:06        -1 chr32-160-chinook
    -1 2020-03-05 07:33:21        -1 pre-indexed-chinook-genome
  10. Use rclone to download big-fastq-play.zip (about 170 MB). Once again you have to change the name of the rclone remote (gdrive-pers in this case) to the name of your Google Drive rclone remote.

    rclone copy -P  --drive-shared-with-me gdrive-pers:CSU-con-gen-2020/big-fastq-play.zip ./
  11. Unzip the file you just downloaded. Note: on Hummingbird, unzip is not installed by default. So I recommend doing conda install unzip to add it to your base Miniconda environment.

    unzip big-fastq-play.zip

    This creates a directory called big-fastq-play within your chinook-play directory.

  12. Create a directory called fastq and move the two paired-end FASTQ files from big-fastq-play into it:

    mkdir fastq 
    mv big-fastq-play/data/Battle_Creek_01_chinook_R* fastq/
    # then make sure that they got moved
    ls fastq
    # if the files were moved, then feel free to remove the big fastq stuff
    rm -r big-fastq-play*
  13. Download the Chinook salmon genome into a directory called genome

     mkdir genome
     cd genome
     wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_other/Oncorhynchus_tshawytscha/all_assembly_versions/GCA_002872995.1_Otsh_v1.0/GCA_002872995.1_Otsh_v1.0_genomic.fna.gz

    That dude is 760 MB, but it downloads really quickly (at least on SUMMIT; Hummingbird seems to be a different story…)

  14. Activate your bioinf conda environment to be able to use bwa:

    conda activate bioinf
  15. Index the genome with bwa in order to align reads to it:

    bwa index GCA_002872995.1_Otsh_v1.0_genomic.fna.gz 

    This will take about 10 to 15 minutes. During that time, do this:

    • If you are doing this within a tmux session, create a new tmux window (cntrl-b c). Note that this shell opens on the login node (the one that you started your tmux session on.)
    • Rename this new tmux window “browse” (cntrl-b ,)
    • Navigate to ~/scratch/chinook-play/genome/ and list the files there. These new files are being created by bwa.
    • Switch back to your original tmux window.
  16. If bwa is still running, ask me some questions. Or maybe I will tell you more about how conda works…

  17. If bwa is still running after that, ask me about nested tmux sessions.

  18. Truth be told, indexing this thing can take a long time (30 to 40 minutes on a single core, perhaps). So, instead, here is what we will do:

    1. Kill the current bwa job by issuing cntrl-c in the shell where it is running.
    2. Use rclone to get the genome and a bwa index for it from my Google Drive. We want to put the result in the chinook-play directory inside a directory called chinook-genome-idx. NOTE: Again, you will likely have to use your name for your rclone remote, rather than gdrive-pers.
    cd ../ # move up to the chinook-play directory
    rclone copy  -P  --drive-shared-with-me gdrive-pers:CSU-con-gen-2020/pre-indexed-chinook-genome chinook-genome-idx 

    That thing is 4.6 GB, but it can take less than a minute to transfer. (Let’s see if things slow down with everyone running this command at the same time.)

  19. Now when you do ls you should see something like this:

    (bioinf) [chinook-play]--% ls
    chinook-genome-idx  fastq  genome
  20. For a final hurrah, we will start aligning those fastqs to the chinook genome. We will put the result in a directory we create called sam. To make it clearer what we are doing, we will define some shell variables:

    mkdir sam
    GENOME=chinook-genome-idx/GCA_002872995.1_Otsh_v1.0_genomic.fna.gz
    FQB=fastq/Battle_Creek_01_chinook_R
    SAMOUT=$(basename $FQB).sam
    
    # for fun, try echoing each of those variables and make sure they look right.
    # for example:
    echo $FQB
    
    # ...
    
    # then, launch bwa mem to map those reads to the genome.
    # the syntax is:
    # Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]
    bwa mem $GENOME ${FQB}1.fq.gz ${FQB}2.fq.gz > sam/$SAMOUT 2>sam/$SAMOUT.error
  21. Once that has started, it is hard to know that anything is happening, because we have redirected both stdout and stderr to files. So, do what you did before: use tmux to go to another shell and then use tail or less to look at the output files, which are in sam/Battle_Creek_01_chinook_R.sam and sam/Battle_Creek_01_chinook_R.sam.error within the chinook-play directory in scratch. You might even count how many lines of alignments have been written:

    awk '!/^@/' sam/Battle_Creek_01_chinook_R.sam | wc -l
  22. Note that, in practice, we would not usually save the .sam file to disk. Rather, we would convert it into a compressed .bam file right as it comes off of bwa mem, and maybe even sort it as it is coming off (see the sketch just after this list)…

  23. Note that this process might not finish before the end of class. That is not a huge problem if you are running the job within tmux. You can logout and it will still keep running.
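
For the curious, here is a hedged sketch of that piped approach, reusing the shell variables defined above and assuming samtools is installed in your bioinf conda environment:

# convert to BAM and sort as the alignments stream out of bwa mem,
# instead of writing a big intermediate .sam file to disk
mkdir -p bam
bwa mem $GENOME ${FQB}1.fq.gz ${FQB}2.fq.gz 2> bam/$(basename $FQB).error | \
  samtools view -u - | \
  samtools sort -o bam/$(basename $FQB).bam -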

However, as said before, most of the time you will run batch jobs rather than interactive jobs. I just wanted to first give everyone a chance to step through these tasks in an interactive shell on a compute node, because that is crucial when developing your scripts, and we wanted to make sure that everyone has rclone and Miniconda installed and working.

8.8 More Boneyard…

Here is some nice stuff for summarizing all the information from the different runs from the chinook-wgs project:

qacct -o eriq -b 09271925 -j ml | tidy-qacct

Explain scratch space and how clusters are configured with respect to storage, etc.

Strategies—break names up with consistent characters:

  • dashes within population names
  • underscores for different groups of chromosomes
  • periods for catenating pairs of pops

etc. Basically, it just makes it much easier to split things up when the time comes.

8.9 The Queue (SLURM/SGE/UGE)

8.10 Modules package

8.11 Compiling programs without admin privileges

Inevitably you will want to use a piece of software that is not available as a module and is not otherwise installed on the system.

Typically these software programs have a frightful web of dependencies.

Unix/Linux distros typically maintain all these dependencies as libraries or packages that can be installed using a package manager like rpm or yum. However, that simple “plug-and-play” approach requires administrator privileges, because the software gets installed into one of the (typically protected) system paths (like /usr/bin).

But, you can use these tools to install packages into your home directory. Once you have done that, you need to let your system know where to look for these packages when it needs them (i.e., when running a program, or when linking to a library while compiling a program that uses it as a dependency).

Hoffman2 runs CentOS. Turns out that CentOS uses yum as a package manager.

Let’s see if we can install llvm using yum.

yum search all llvm # <- this got me to devtoolset-7-all.x86_64 : Package shipping all available toolsets.

# a little web searching made it look like llvm-toolset-7-5.0.1-4.el7.x86_64.rpm or devtoolset-7-llvm-7.0-5.el7.x86_64.rpm
# might be what we want.  The first is a dependency of the second...
mkdir ~/centos

Was using instructions at https://stackoverflow.com/questions/36651091/how-to-install-packages-in-linux-centos-without-root-user-with-automatic-depen

Couldn’t get yum downloader to download any packages. The whole thing looked like it was going to be a mess, so I thought I would try with miniconda.

I installed miniconda (python 2.7 version) into /u/nobackup/kruegg/eriq/programs/miniconda/ and then did this:

# probably could have listed them all at once, but wanted to watch them go 
# one at a time...
conda install numpy
conda install scipy
conda install pandas
conda install numba

# those all ran great.

conda install pysnptools

# that one didn't find a match, but I found on the web that I should try:
conda install -c bioconda pysnptools 

# that worked!

Also we want to touch briefly on LD_LIBRARY_PATH (for linking and loading failures; note that libraries are often named libxxx.a or libxxx.so) and CPATH (for failures to find xxxx.h), etc.
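
As a hedged sketch, if you had installed a library under a prefix like ~/local (a hypothetical location), the corresponding lines in your ~/.bashrc might look like:

# let the compiler find headers (fixes "cannot find xxxx.h" errors)
export CPATH=$HOME/local/include:$CPATH

# let the linker and the runtime loader find the libraries themselves
export LIBRARY_PATH=$HOME/local/lib:$LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/local/lib:$LD_LIBRARY_PATH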

8.12 Job arrays

Quick note: Redefine IFS to break on TABs so you can have full commands in there. This is super useful for parsing job-array COMMLINES files.

IFS=$'\t\n'; BOP=($(echo boing | awk '{printf("first\tsecond\tthird that is long\tfourth\n");}')); IFS=$' \t\n';

Definitely mention the eval keyword in bash for when you want to print command lines with redirects.

Show the routine for it, and develop a good approach to efficiently orchestrating redos. If you know the task IDs of the ones that failed, then it is pretty easy to write an awk script that picks out the corresponding commands and puts them in a new file. Actually, it is probably better to just cycle over the numbers and use the -t option to launch each; then there is no need to change the job-ids file.
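
A hypothetical SGE-flavored sketch of that cycling approach (the script name and task IDs are made up):

# re-launch just the failed array tasks, one qsub call per task ID
for t in 47 368 477; do
  qsub -t $t my-array-job.sh
done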

In fact, I am starting to think that the -t option is better than putting it into the file.

Question: if you give something on the command line, does that override the directive in the header of the file? If so, then you don’t even need to change the file. Note that using the qsub command line options instead of the directives really opens up a lot of possibilities for writing useful scripts that are flexible.

Also use short names for the jobs and have a system for naming the redos (append numbers so you know which round it is, too); possibly base the name on the way things failed the first time. Like, fsttf1 = “Fst run for things that failed due to time limits, 1”. Or structure things so that redos can just be done by invoking it with -t and the jobid.

8.13 Writing stdout and stderr to files

This is always good to do. Note that stdbuf is super useful here so that things don’t stay buffered for a long time. (PCAngsd doesn’t seem to write anything till the end…)
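
A hedged example of the kind of invocation meant here (the program and file names are placeholders):

# line-buffer stdout so progress lines show up in the log as they happen
stdbuf -oL my_long_running_program input.data > progress.log 2> progress.err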

8.14 Breaking stuff down

It is probably worth talking about how problems can be broken down into smaller ones. Maybe give an example, and then say that we will be talking about this for every step of the way in bioinformatic pipelines.

One thing to note—sometimes processes go awry for one reason or another. When things are in smaller chunks it is not such a huge investment to re-run it. (Unlike stuff that runs for two weeks before you realize that it ain’t working right).