Chapter 14 Managing Workflows with Snakemake
In the preceding sections we have covered a few approaches for using RStudio projects and RMarkdown to create reproducible research repositories with visually pleasing HTML outputs that can help readers quickly see what was required to generate different figures/tables/outputs. Such an approach is well suited to projects that are undertaken primarily on one’s own desktop or laptop computer, and which don’t take an inordinately long time to run. Unfortunately, that does not describe most bioinformatic analyses handling whole-genome sequencing or RAD sequencing data for conservation genetics. Most of those analyses will be run on a university computer cluster, and trying to manage all of that within an RMarkdown-based framework is at best rather difficult, and at worst it can become a shrieking horror.
For managing these sorts of bioinformatic workflows there is a Python-based framework called Snakemake that is well-supported, flexible, and incredibly powerful. Understanding how Snakemake works and becoming familiar with its many features involves a non-trivial learning curve; however, for anyone spending a sizable chunk of their graduate career managing bioinformatic analyses, or for anyone that is a lab bioinformatician running bioinformatic workflows across many species (for example), the benefits of learning Snakemake will continue to pay dividends for years to come.
Someone who has mastered all the topics in Part I of this handbook (Unix programming,
working on remote computers, etc.) certainly will have the skills to write
what I will call input-oriented, forward-marching workflows. I call these
“forward-marching” because they start from a set of
input files (for example files of sequences from a sequencer), and then
the workflow is defined as a series of sequential steps, one after the other.
They are “input-oriented” in the sense that such a workflow starts
with a series of inputs, but the workflow itself doesn’t really know
what it is trying to produce from those inputs until it has run all the
way through and actually turned those inputs into outputs.
For example, all the input
files might get trimmed or cleaned, then they would all get mapped to a genome,
and then those mapped sequences would be used to call variants, so as to
find SNPs in the data, etc. If you were writing this in an
input-oriented, forward-marching fashion, then, to deal with the fact that you had multiple files of
DNA sequences (for example, one for each individual bird or fish that had
been sampled), you might write Unix for
loops to cycle over all the input files
as in Section 5.7, or, you could define a SLURM job array to start a separate job
instance for each input file, as in Section 8.4.3. In each case,
you would have to do some extra programming to deal with the different input files, and
if one of the steps of your workflow failed on just one, or a few, of the files,
you might spend a large amount of time tracking those failures down and than
manually re-running that small number of jobs.
By contrast, Snakemake takes a different approach to managing workflows. We will call it an output-oriented, backward-looking approach. We call it that because workflows in Snakemake are defined first and foremost in terms of the output files that are desired, along with instructions on how to create those output files from necessary input files and bioinformatic programs. They are backward-looking in the sense that, once you have developed a Snakemake workflow, you get results by telling Snakemake which output files you want to create, and then it “looks backwards” through the workflow to determine which input files are needed to create the requested outputs. Sometimes it has to look backwards through several steps before it identifies all the necessary input files. Once it has found those necessary inputs, it then runs forward through the steps to create the output files. In this phase, the workflow looks like it is “forward-marching”, in the sense that outputs are being created from inputs. But, in order to get to that “forward-running” phase, Snakemake had to look backward to figure out what inputs to use.
The above constitutes some subtle points, that might not be clear upon first reading, but we will try to summarize it in a few pithy phrases:
- A workflow defined as a typical Unix shell script can be thought of as a process that runs forward. You give it a lot of input files and it just cranks through a bunch of steps until the output files are made.
- A workflow defined with Snakemake works differently. First you define the workflow in terms of a series of “rules.” Then, when you request any given set of output files, Snakemake will look backwards through the rules of the workflow and figure out exactly which steps must be performed, on which input files, in order to create the requested output files. Once it has determined that, it runs just those necessary steps.
There are many advantages to this output-oriented, backward-looking approach:
- If your workflow has many steps, and some of them have already been run, then Snakemake automatically recognizes that, and will not re-run steps in the workflow that have already been completed. In an input-oriented system (like a traditional Unix script), you would have to spend the time to figure out which steps had already been run, and then run your script only from that point forward. Doing so can be a hassle and can also be prone to errors.
- A workflow defined by Snakemake, being explicit about the inputs needed for each step, naturally defines “work units” that can be run independently of one another. Accordingly, Snakemake, itself, can break a huge bioinformatic workflow into a number of small jobs that can be run in parallel, saving you, the user, from having to write scripts to launch a series of SLURM job arrays.
- The fact that Snakemake automatically keeps track of which inputs already exist—and which might still need to be generated—provides huge benefits when some of your jobs fail. Anyone who has used a cluster has stories about jobs that inexplicably fail. Without a workflow management system like Snakemake, you can spend almost as much of your own time managing these failed jobs as it took to launch all the jobs in the first place.
In addition to these obvious advantages of the output-oriented approach, Snakemake also includes a number of features that make it easy to use your workflow on a variety of different platforms. It is tightly integrated with conda (Section 7.6.2), letting the user define conda environments for each step in the workflow. This means that if you move your whole workflow to a new cluster, you don’t have to spend any time coordinating the installation of the programs you need—if you set things up properly with Snakemake and conda, that will happen automatically. If you distribute your workflows to other people to use, this is particularly helpful, since you will spend far less time assisting them in setting up their computer environment to run your workflow. Snakemake can also be customized to work in your own cluster environment. Finally, there are interfaces to allow your Snakemake workflows to run seamlessly in the cloud.
Full documentation for Snakemake can be found at https://snakemake.readthedocs.io/en/stable/. This documentation is comprehensive, but can feel a little daunting at first. On the other hand, the developers of Snakemake also provide an excellent and accessible tutorial at https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html. If you have not run through that tutorial yet, you should do that before going through the next sections in this chapter.
In the following I present another short tutorial that focuses a little more on understanding how wildcards in Snakemake work, as this seems to be a stumbling block for many students. In addition, I try to offer a few tidbits of advice regarding input functions and simple ways to work interactively with Python while you are developing your Snakemake workflows. I hope this will be particularly helpful for students that aren’t super familiar with Python.
Before proceeding, you will need to follow the instructions
here to install Snakemake into a conda environment named snakemake
.
Once installed, activate the snakemake
environment and then add the tree
Unix utility to
it:
tree
is a utility that we will use in the following. It
provides a nice way of visualizing directories and their contents.
Once you have done that, make a directory called smk-tut
and
cd
into it to do this tutorial.
14.1 A Simple Snakemake Example
Here is a somewhat make-believe example that will let us grok out the basics of Snakemake and the logic of simple workflows without worrying about performing actual operations (like trimming or mapping) on files. This is a useful excercise because it can also be helpful when you are developing your own complex workflows to first make sure that the workflow logic (from Snakemake’s point of view) is correct, without having to spend a long time on the actual processing of files.
We will pretend that we have paired-end sequence files from a sequencing center and
we want to compare the outcome of using two different programs to trim leftover adapters
and low-quality sequence from these files. These two (utterly fictitious) programs
are TrimmoMcAwesome
and trim_stupendous
1.
We will pretend that both TrimmoMcAwesome
and trim_stupendous
work with this sort of syntax:
TrimmoMcAwesome file_read1.fq file_read2.fq outpath1 outpath2
# or
trim_stupendous file_read1.fq file_read2.fq outpath1 outpath2
which will trim the paired-end reads file_read1.fq
and file_read2.fq
and put the
results for read1 and read2, respectively, into the files outpath1
and outpath2
.
For now, let us assume that we have 10 input files: 2 paired-end read files
for each of 5 samples named 001
through 005
. Please create those files
(even though they will be empty) with this Unix command:
Please note that the use of curly braces here is an example of “brace expansion” within
the Unix shell. You will see lots of curly braces in the upcoming material, and you should
be aware that the curly braces that you will see in Snakemake code
(in the Snakefile
as it were) are different. Those are typically serving the
purpose of delimiting wildcards. (Though, they have other roles within Python as well.).
Once you have done that, use tree
to see what the directory structure in
smk-tut
looks like:
% tree .
.
└── data
└── raw
├── 001_R1.fq
├── 001_R2.fq
├── 002_R1.fq
├── 002_R2.fq
├── 003_R1.fq
├── 003_R2.fq
├── 004_R1.fq
├── 004_R2.fq
├── 005_R1.fq
└── 005_R2.fq
That handy little summary shows the names of 10 files inside a directory called
raw
inside a directory called data
.
14.1.1 Snakefile #1: A simple rule to TrimmoMcAwesome the files from 001
If you did the tutorial, you will recall that the instructions for the workflow
go into the file, Snakefile
. Make a Snakefile within the smk-tut
directory
with these contents, which give it the rule for running TrimmoMcAwesome on
the reads from sample 001
:
rule trim_awesome:
input:
"data/raw/001_R1.fq",
"data/raw/001_R2.fq"
output:
"results/awesome/001_R1.fq",
"results/awesome/001_R2.fq"
shell:
"TrimmoMcAwesome {input} {output}"
Notice that this rule takes the raw fastq files as input to TrimmoMcAwesome, and
puts the trimmed output files into a diretory results/awesome
.
Once you have that Snakefile, you can ask snakemake to show you what it
would do if you ran it. Give it the -n
option (dry run) and the -p
option
(print the shell commands) by putting them together, preceded by a single dash:
The result you get should look like this:
Building DAG of jobs...
Job counts:
count jobs
1 trim_awesome
1
[Mon Aug 9 11:43:31 2021]
rule trim_awesome:
input: data/raw/001_R1.fq, data/raw/001_R2.fq
output: results/awesome/001_R1.fq, results/awesome/001_R2.fq
jobid: 0
TrimmoMcAwesome data/raw/001_R1.fq data/raw/001_R2.fq results/awesome/001_R1.fq results/awesome/001_R2.fq
Job counts:
count jobs
1 trim_awesome
1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
It is worth sitting down and looking closely at this output. The first block shows how many
jobs will run. The second block shows the rule that will be run in the job, and it
explicitly shows the values of the input
and output
of the block. The line at the top
of the third block shows the shell command that would be run. It looks correct. Great!
Another thing worthy of note: the requested output files in this rule are
actual file names and they appear in the first rule of the Snakefile.
Because of this, we were able simplty to run snakemake -np
without
explicitly requesting any output files on the command line—Snakemake used
the two outputs from that first rule block.
Alternatively, we could have explicitly requested those output files on the command line, like:
# do this on your unix terminal in directory smk-tut
snakemake -np results/awesome/001_R1.fq results/awesome/001_R2.fq
Try that, and note that you get the same result.
But now, try requesting (on the command line) the TrimmoMcAwesome output for a different sample, like 002:
# do this on your unix terminal in directory smk-tut
snakemake -np results/awesome/002_R1.fq results/awesome/002_R2.fq
What happened when you did that? Well Snakemake should have replied with an error
message telling you, essentially, “What’s your problem asking for those output files? My Snakefile
doesn’t tell me how to make that output file!” And that is what it should tell you because
the rule trim_awesome
only gives instructions on how to make the output files
results/awesome/001_R1.fq
and results/awesome/001_R2.fq
So, what if we want a rule to make results/awesome/002_R1.fq
and results/awesome/002_R2.fq
and
awesome-trimmed output for all the other samples too. The naive,
and exceedingly inefficient, approach would be to
write a new rule block for every sample in your Snakefile till it looked like:
rule trim_awesome_001:
input:
"data/raw/001_R1.fq",
"data/raw/001_R2.fq"
output:
"results/awesome/001_R1.fq",
"results/awesome/001_R2.fq"
shell:
"TrimmoMcAwesome {input} {output}"
rule trim_awesome_002:
input:
"data/raw/002_R1.fq",
"data/raw/002_R2.fq"
output:
"results/awesome/002_R1.fq",
"results/awesome/002_R2.fq"
shell:
"TrimmoMcAwesome {input} {output}"
# and so forth, until
rule trim_awesome_005:
input:
"data/raw/005_R1.fq",
"data/raw/005_R2.fq"
output:
"results/awesome/005_R1.fq",
"results/awesome/005_R2.fq"
shell:
"TrimmoMcAwesome {input} {output}"
Boy! Doing that would really suck! It would require lots of typing, lots of chances to make an error, and if you wanted to change something in the shell block, for example, you would have to change all of the rules. What a hassle! Obviously, no sane person would do that. And, thankfully no one has to, because Snakemake allows a single block to apply to multiple instances of output (and hence input) by using wildcards. The next section takes us through that.
14.1.2 Snakefile #2: A single TrimmoMcAwesome rule using wildcards
The best way to start thinking about wildcards is to imagine all the different output files you would want from a rule, and then investigate the parts that change. With samples 001 through 005, the TrimmoMcAwesome output we would want would be these pairs of files:
results/awesome/001_R1.fq
results/awesome/001_R2.fq
results/awesome/002_R1.fq
results/awesome/002_R2.fq
results/awesome/003_R1.fq
results/awesome/003_R2.fq
results/awesome/004_R1.fq
results/awesome/004_R2.fq
results/awesome/005_R1.fq
results/awesome/005_R2.fq
OK! So, these are really similar, except that there is a part pertaining to the sample number that is changing between each pair of files.
In fact, we could write down the general pattern for all of these output file names in a sort of shorthand like this:
results/awesome/{sample}_R1.fq
results/awesome/{sample}_R2.fq
In other words every one of the 10 file names can be generated by replacing
{sample}
in
results/awesome/{sample}_R1.fq
results/awesome/{sample}_R2.fq
with a value. For example, if you replace {sample}
with 003
:
results/awesome/003_R1.fq
results/awesome/003_R2.fq
This shorthand is exactly the way that Snakemake specifies a whole family
of output files using wildcards. We will call
"results/awesome/{sample}_R1.fq", "results/awesome/{sample}_R2.fq"
the
wildcard phrase. The wildcard in that wildcard phrase
is{sample}
. Each wildcard in a wildcard phrase can take
values or instances. When wildcard phrases are used in a Snakemake rule
block, the values/instances of the wildcards within them are found when an output file is requested
that matches the wildcard phrase. This will become more clear as we proceed.
First, let us update our Snakefile to look like this:
rule trim_awesome:
input:
"data/raw/{sample}_R1.fq",
"data/raw/{sample}_R2.fq"
output:
"results/awesome/{sample}_R1.fq",
"results/awesome/{sample}_R2.fq"
shell:
"TrimmoMcAwesome {input} {output}"
So, go ahead and make that your Snakefile.
Now, see what happens if you do:
Aha! Snakemake gets upset at you because it is trying to run rule trim_awesome but the output requested in it is a wildcard phrase with no values or instances for the wilcards. Snakemake doesn’t really know how to run a rule to obtain a file that is a wilcard phrase in which the wildcards do not have explicit values.
However, if you explicitly ask for certain output files, like:
# do this on your unix terminal in directory smk-tut
snakemake -np results/awesome/002_R1.fq results/awesome/002_R2.fq
then Snakemake finds the rule to create those output files (because the requested files match the wildcard phrase with appropriate values of the wildcard) and it knows what the input files should be because it propagates wildcard values from the requested output into the wildcard phrases of the input. Diagrammatically that process looks like this:
The upshot of this is that we can now easily request all the sample output files like this:
# do this on your unix terminal in directory smk-tut
snakemake -np results/awesome/00{1..5}_R{1,2}.fq
Again, we note that these curly braces in the above shell command line
are different from the curly
braces you have seen in the Snakemake/Python wildcards. These are just
Unix brace-expansion curly braces. They merely expand into results/awesome/001_R1.fq results/awesome/001_R2.fq results/awesome/002_R1.fq results/awesome/002_R2.fq results/awesome/003_R1.fq results/awesome/003_R2.fq results/awesome/004_R1.fq results/awesome/004_R2.fq results/awesome/005_R1.fq results/awesome/005_R2.fq
Cool! Snakemake was able to figure out how to generate all of those output files by running 5 jobs!
Before we leave this section, let’s see what kind of error message Snakemake gives us when we ask for output that it doesn’t have what it needs to generate. Try this:
# do this on your unix terminal in directory smk-tut
snakemake -np results/awesome/006_R1.fq results/awesome/006_R2.fq
From the error message, we see that Snakemake propagated the wildcard
value {sample} = 006
to the input, but didn’t find the appropriate input
files:
Building DAG of jobs...
MissingInputException in line 4 of /Users/eriq/Desktop/smk-tut/Snakefile:
Missing input files for rule trim_awesome:
data/raw/006_R1.fq
data/raw/006_R2.fq
This is good. Snakemake tells us (in the form of an error message before quitting) when we are asking for something it can’t produce.
14.1.3 Snakefile #3: Add trim_stupendous
in there
Now that we know how to deploy wildcards in our output and input file names
we can use them when we make another block for our trim_stupendous
jobs:
Update your Snakefile so it looks like this:
rule trim_awesome:
input:
"data/raw/{sample}_R1.fq",
"data/raw/{sample}_R2.fq"
output:
"results/awesome/{sample}_R1.fq",
"results/awesome/{sample}_R2.fq"
shell:
"TrimmoMcAwesome {input} {output}"
rule trim_stupendous:
input:
fq1="data/raw/{sample}_R1.fq",
fq2="data/raw/{sample}_R2.fq"
output:
"results/stupendous/{sample}_R1.fq",
"results/stupendous/{sample}_R2.fq"
shell:
"trim_stupendous {input.fq1} {input.fq2} {output}"
Note that we have written the input
block a little bit differently
in that we named the different input file wildcard phrases. We named
them as fq1
and fq2
using this syntax:
input:
fq1="data/raw/{sample}_R1.fq",
fq2="data/raw/{sample}_R2.fq"
And then we also explicitly put those files into the shell command by accessing
the input
with a .fq1
and .fq2
, where appropriate, like this:
"trim_stupendous {input.fq1} {input.fq2} {output}"
The reasons for this are twofold:
- You should be familiar with the fact that you can give names to particular
parts of the
input
oroutput
files (or even theparams
orlog
files), because this makes it a lot easier to keep track of the inputs when there are many of them. - In a moment, we will be exploring how to use Snakemake’s
unpack()
function, and naming the files this way provides us an opportunity to use it.
For now, we don’t worry too much about it, except to note that we can request all the different TrimmoMcAwesome and trim_stupendous output files like this:
# do this on your unix terminal in directory smk-tut
snakemake -np results/awesome/00{1..5}_R{1,2}.fq results/stupendous/00{1..5}_R{1,2}.fq
and Snakemake can figure out that it needs to run 10 jobs.
14.2 Ugly, complex file names…a job for Snakemake input functions!
I hope that all of the above has been fine and well for everyone. But I suspect that
the same question is on everyone’s mind: “Hey! What kind of unrealistic BS is this? I
never get files named something like 001_R1.fq
from the sequencing center!!”
Indeed, more often than not, you will be dealing with FASTQ files that might be named
something like this:
kcr-wiwa-885261-L002-HGGXXX_R1.fastq.gz
kcr-wiwa-885261-L002-HGGXXX_R2.fastq.gz
kcr-wiwa-896753-L002-HGGXXX_R1.fastq.gz
kcr-wiwa-896753-L002-HGGXXX_R2.fastq.gz
plate2-WIWA67365-L002-HHHGYY_R1.fastq.gz
plate2-WIWA67365-L002-HHHGYY_R2.fastq.gz
where the R1 and R2 parts of the names are referring to read 1 and
read 2 of paired end reads, 885261
, 896753
, and
WIWA67365
are sample IDs
or field numbers for the animals you are sequencing, and the rest is
not only a bit gibberishy, but also is not necessarily consistent across the
different individuals. In this sort of case, you cannot easily
propagate wildcard values into an input file name that is specified
by a single, consistent, wildcard phrase. Not only that, but
we probably don’t want to specify a wildcard like {sample}
to take values like
885261
, 896753
, and WIWA67365
. We certainly could, but because the
these field numbers are somewhat long and complex, if we do so, our
output file names will end up being kind of ugly too.
But don’t fear! Snakemake allows for what are called input functions that make it possible to deal with these types of cases. An input function is merely a Python function that takes as input some wildcard values, and will map those to any arbitrary output values. And, used in conjunction with a simple table that gives clean sample names to the complex field_numbers, it is possible to keep the whole workflow much cleaner.
To prepare for this part of the lesson, we are going to want to create
files with those names, inside a directory called raw_nasty_names
. Do that
by evaluating this code in the shell in your smk-tut
directory:
mkdir -p data/raw_nasty_names
for i in kcr-wiwa-885261-L002-HGGXXX_R1.fastq.gz \
kcr-wiwa-885261-L002-HGGXXX_R2.fastq.gz \
kcr-wiwa-896753-L002-HGGXXX_R1.fastq.gz \
kcr-wiwa-896753-L002-HGGXXX_R2.fastq.gz \
plate2-WIWA67365-L002-HHHGYY_R1.fastq.gz \
plate2-WIWA67365-L002-HHHGYY_R2.fastq.gz; do
touch data/raw_nasty_names/$i;
done
Let’s step back and imagine that we have a CSV file that gives us some sample ID’s that go to each of the field_numbers which are also associated with the paths to the fastq files. It would look something like this:
sample,field_number,fastq1,fastq2
s001,885261,data/raw_nasty_names/kcr-wiwa-885261-L002-HGGXXX_R1.fastq.gz,data/raw_nasty_names/kcr-wiwa-885261-L002-HGGXXX_R2.fastq.gz
s002,896753,data/raw_nasty_names/kcr-wiwa-896753-L002-HGGXXX_R1.fastq.gz,data/raw_nasty_names/kcr-wiwa-896753-L002-HGGXXX_R2.fastq.gz
s003,WIWA67395,data/raw_nasty_names/plate2-WIWA67365-L002-HHHGYY_R1.fastq.gz,data/raw_nasty_names/plate2-WIWA67365-L002-HHHGYY_R2.fastq.gz
In fact, this file is super important. It tells you how the nice tidy
sample names like s001
and s002
correspond to the
field numbers of your animals, and also to the paths of the FASTQ files for them.
The next step is very important. Copy the above contents into a file called samples.csv
within your
smk-tut
directory. Once you have done that, we are ready to proceed.
Here is where we are going: we are going to read the samples.csv
table
into aPython variable that will
be accessible by Snakemake, and which
can be used by it to translate wildcards like {sample} = s001
into paths to FASTQ files.
In order to easily handle the samples.csv
file as a variable in Python we will rely on the Pandas
package. If you are new to Python, you can think of Pandas as the
Python equivalent of R’s ‘dplyr’ and ‘readr’ packages.
Let’s play with it a little bit to familiarize ourselves with it…
14.2.1 Playing with Pandas
The Pandas package is included in your snakemake
conda environment, and we are going to play with
it a little bit, so, open a new terminal window, change the directory in it to your
smk-tut
directory, activate your snakemake conda environment and then
start an interactive python session:
Now, you have an interactive Python session in that terminal window. And you can do the following
to import the Pandas package into the session and read in the samples.csv
file and print it:
# do this in your python interpreter...
# this imports the pandas package functionality in an object named pd
import pandas as pd
# this reads the CSV file and sets an index using the values in the "sample" column.
# For R people you can think of this "index" as serving the role of rownames.
samples_table = pd.read_csv("samples.csv").set_index("sample", drop=False)
# now print the table
samples_table
After running the last line, you should have seen something like:
sample field_number fastq1 fastq2
sample
s001 s001 885261 data/raw_nasty_names/kcr-wiwa-885261-L002-HGGX... data/raw_nasty_names/kcr-wiwa-885261-L002-HGGX...
s002 s002 896753 data/raw_nasty_names/kcr-wiwa-896753-L002-HGGX... data/raw_nasty_names/kcr-wiwa-896753-L002-HGGX...
s003 s003 WIWA67395 data/raw_nasty_names/plate2-WIWA67365-L002-HHH... data/raw_nasty_names/plate2-WIWA67365-L002-HHH...
OK, samples_table
is just a table of data. Importantly, it is
indexed by the sample name (s001, s002, etc.) This
means that if we want to access values in any particular row, we can find that row according to
the sample name, and then pick values out of particular columns within that row using this syntax:
# do this in your python interpreter...
# get the field number of sample s002
samples_table.loc["s002", "field_number"]
# get the path to fastq file 1 for sample s003:
samples_table.loc["s003", "fastq1"]
That is quite tasty! The .loc
is a method in Pandas to return the location (i.e., the row)
of a table according to its index. And then if you want to get the value from any particular
column, you just put the column name in there.
Note that if you want to get the values in multiple columns as a Python list you can do like this:
# do this in your python interpreter...
# get the paths to both fastq files for sample s003 in a list:
list(samples_table.loc["s003", ["fastq1", "fastq2"]])
This is totally sweet! This python code takes a value like s002
—which could be the value
of a wildcard—and it returns another value based on it. In fact, it can return the
path of a FASTQ file. This is just the sort of thing we need to convert wildcard values
to FASTQ file pahts in our Snakefile. However, in order to use this type of code
we have to wrap it up into a Snakemake input function.2
Before we go big and do that in our Snakefile, we are going to spoof it in our interactive Python session. This turns out to be a great way to develop and test your Snakemake input functions.
14.2.3 Snakefile #4: an input function for trim_awesome
Now, that we have experimented with writing input functions in our
Python interpreter, it is time to incorporate them into our Snakefile
to turn the wildcards into file paths for the trim_awesome
block.
To do so, we simply have to make sure that within the Snakefile,
before we do anything else, we define the input functions we need,
and then use those functions in the input
section of the
trim_awesome
block. So we must:
- read the
samples.csv
file into thesamples_table
variable, - define the functions
fq1_from_sample
andfq2_from_sample
. - Put those functions into the
trim_awesome
block in place of the input file names.
So, update your Snakefile so that it looks like this:
import pandas as pd
samples_table = pd.read_csv("samples.csv").set_index("sample", drop=False)
# fastq1 input function definition
def fq1_from_sample(wildcards):
return samples_table.loc[wildcards.sample, "fastq1"]
# fastq2 input function definition
def fq2_from_sample(wildcards):
return samples_table.loc[wildcards.sample, "fastq2"]
rule trim_awesome:
input:
fq1_from_sample, # <--- Note how the function names go here,
fq2_from_sample # <--- in place of the file names
output:
"results/awesome/{sample}_R1.fq",
"results/awesome/{sample}_R2.fq"
shell:
"TrimmoMcAwesome {input} {output}"
rule trim_stupendous:
input:
fq1="data/raw/{sample}_R1.fq",
fq2="data/raw/{sample}_R2.fq"
output:
"results/stupendous/{sample}_R1.fq",
"results/stupendous/{sample}_R2.fq"
shell:
"trim_stupendous {input.fq1} {input.fq2} {output}"
Then invoke that new Snakefile with:
# do this in your unix terminal in directory smk-tut
snakemake -np results/awesome/s00{1..3}_R{1,2}.fq
And, Behold! Look at the shell commands that get scheduled, and see that they properly include the raw_nasty_names files. Cool.
14.3 Input functions from Python Dictionaries
The previous section showed us how to produce input
files for trim_awesome from the wildcards using the input
functions fq1_from_sample
and fq2_from_sample
, and it worked!
For the trim_stupendous
block, however, note that we are using
named input files:
and the names get used explicitly in the shell command:
(see the {input.fq1}
and {input.fq2}
, there). We could
redefine the trim_stupendous
block to use our input functions like this:
rule trim_stupendous:
input:
fq1=fq1_from_sample,
fq2=fq2_from_sample
output:
"results/stupendous/{sample}_R1.fq",
"results/stupendous/{sample}_R2.fq"
shell:
"trim_stupendous {input.fq1} {input.fq2} {output}"
There would be nothing wrong with that at all!
However, there is another way to pass inputs to Snakemake rule blocks—by using Python Dictionaries. A Python Dictionary, or dict for short, is very much like a “named list” in R. The elements of a dictionary can be accessed using their names (called “keys” in Python). Let’s go back and evaluate some code in our Python interpreter to illustrate this:
# do this in your python interpreter...
# define a dictionary called boing:
boing = {"fq1": "bing", "fq2": "bong"}
# print it:
boing
# access the fq2 element from it:
boing["fq2"]
Cool. That just shows us how to define a dictionary in Python.
Next, let’s make a function that takes wildcards
as input and returns
a dictionary (with keys fq1
and fq2
) of file paths to the nasty-name
FASTQs.
# do this in your python interpreter...
# define an input function that returns a dict
def fq_dict_from_sample(wildcards):
return {
"fq1": samples_table.loc[wildcards.sample, "fastq1"],
"fq2": samples_table.loc[wildcards.sample, "fastq2"]
}
# make sure our spoofed wildcards variable is set:
wildcards.sample = "s003"
# Now see what that function returns on that wildcards input:
fq_dict_from_sample(wildcards)
OK! It returns a dict object. So, we should be able to use
that dict to return the proper paths for fq1
and fq2
.
BUT in order for this to work properly (i.e., the way we
expect it to) we need to wrap our input function inside
a special Snakemake function called unpack()
which turns the
dict into a collection of named inputs.
Replace the contents of your Snakefile with this:
import pandas as pd
samples_table = pd.read_csv("samples.csv").set_index("sample", drop=False)
# fastq1 input function definition
def fq1_from_sample(wildcards):
return samples_table.loc[wildcards.sample, "fastq1"]
# fastq2 input function definition
def fq2_from_sample(wildcards):
return samples_table.loc[wildcards.sample, "fastq2"]
# function to return both fastqs, keyed by `fq1` and `fq2`, in a dict.
# note that the order of the dict elements is not important here. We
# illustrate that by putting key `fq2` first.
def fq_dict_from_sample(wildcards):
return {
"fq2": samples_table.loc[wildcards.sample, "fastq2"],
"fq1": samples_table.loc[wildcards.sample, "fastq1"]
}
rule trim_awesome:
input:
fq1_from_sample, # <--- Note how the function names go here,
fq2_from_sample # <--- in place of the file names
output:
"results/awesome/{sample}_R1.fq",
"results/awesome/{sample}_R2.fq"
shell:
"TrimmoMcAwesome {input} {output}"
rule trim_stupendous:
input:
unpack(fq_dict_from_sample) # <--- see, it is wrapped in unpack()
output:
"results/stupendous/{sample}_R1.fq",
"results/stupendous/{sample}_R2.fq"
shell:
"trim_stupendous {input.fq1} {input.fq2} {output}"
Then, give it a whirl with:
# do this in your unix terminal in directory smk-tut
snakemake -np results/stupendous/s00{1..3}_R{1,2}.fq
Booyah! This can be quite helpful if you have a lot of named
arguments that you wish to create as a dict from a function.
Note that if you use an input function that returns a dict,
and you forget to wrap it in the unpack()
, then you will typically
see an error like this:
Function did not return str or list of str.
14.5 A Python Primer for Using Snakemake
Snakefiles are evaluated in a Python context, and any valid Python code can be placed within them, and it will be evaluated.
Stuff to talk about:
- Lists
- Cycling over list elements and testing them
- Building up a list with
for
- Adding lists together
- Testing if something is in a list.
- Getting the unique values in a list.
- Formatting strings.
- The Config variable, an OrderedDict
- Loading it with load_configfile
- Accessing different levels by specific names.
- Cycling over all members of a level and getting a list back
- Getting lists out of the keys() and vals()
14.6 Using Snakemake on a Computing Cluster
One key advantage of Snakemake is that it can allow intelligent use of the resources on a
high-performance computing cluster. This is done by supplying customization to Snakemake via its
--cluster
command line option. Here we focus on methods for running your Snakemake workflow
on an HPCC that uses the SLURM scheduler (See Chapter 8). Although getting Snakemake to schedule jobs
using SLURM’s sbatch
command can be as easy as invoking Snakemake with a command that includes an
option like --cluster sbatch
, making such a simple addition to the command line will not bring
all the Snakemake goodness to SLURM that you might hope. Rather than just a simple sbatch
argument
to the --cluster
option, the Snakemake documentation notes that you can provide more job-specific information
to the --cluster
option. For example, suppose you have rules that use different numbers of threads, specified
in the rule blocks like this:
rule easy:
input:
"small_input_{sample}.txt"
threads: 1
output:
"small_output_{sample}.txt"
rule hard:
input:
"big_input_{sample}.txt"
threads: 8
output:
"big_output_{sample}.txt"
Here the rule easy
uses just a single thread, while the rule hard
is multithreaded, and we
want to run it with 8 threads. If we were running this locally on a single workstation, Snakemake
would automatically supply 8 cores (if available) to instances of rule hard
that were running.
However, when running on a cluster, the --cluster
command argument would need to be
expanded to instruct sbatch
to request more cores for rule hard
than for rule easy
.
A Snakemake command that started like this would work:
In this case, the number of threads for each rule gets substituted for {threads}
, so that
it is used to set the --cpus-per-task
option from sbatch
. In this way, jobs scheduled for
rule easy
will request 1 CPU (core) from the cluster, while jobs scheduled for the
hard
rule will request 8 cores from the cluster.
The argument to the --cluster
option could be further decorated to allow Snakemake
to determine other options for sbatch
like where the slurm output and error files should be
written, which partition of the cluster to use, how much memory to request, whom should be
emailed when a job fails, and so forth (see Section 8.4.2 for a discussion of the options
available for sbatch
). Clearly, writing all of this additional information on the
command line for snakemake
would be a major hassle (and it would be easy to forget what
to put there). Who in their right mind would want to write
snakemake --cluster 'sbatch -p highmem --cpus-per-taks={threads} -o slurm_outputs/{rule}/{rule}_{wildcards}_%j.out -e logs_errors/{rule}/{rule}_{wildcards}_%j.err --mail-type=FAIL --mail-user=me@mail.com'
on the command line every time they launched something on the cluster with Snakemake?
Fortunately, Snakemake has a system whereby you can record commonly used options to Snakemake
in a profile. This profile system is not specific only to using Snakemake on a cluster—it can
be used to store any Snakemake options that you want to be applied to your
Snakemake command line, when the profile is invoked as the argument to the
--profile
Snakemake option. The profile system is currently recommended for configuring
Snakemake for use with clusters, and there are some pre-made and curated profiles that are recommended
by the creators of Snakemake for use on clusters. These profiles are available at
https://github.com/snakemake-profiles, a site that includes
a profile specifically for SLURM, at
https://github.com/Snakemake-Profiles/slurm.
Here we will explore the use of that profile for using Snakemake on SLURM.
One advantage of using such pre-built profiles is that they should be well tested, and
should work well. Additionally, the SLURM profile is designed to take advantage of
Snakemake’s --cluster-status
command line option. This option allows Snakemake to
use a script to keep track of the jobs submitted to SLURM, periodically
checking on them to determine if the job is still running or has failed. This is an
important feature when running in a cluster environment.
The profile available at https://github.com/Snakemake-Profiles/slurm for interfacing with SLURM is still being tweaked (as evidenced by discourse in the GitHub issues) but is quite mature, and, at the time of writing, it is working well for me! There are, however, a couple of features of this profile (and the profile system) that can be confusing at first glance. Primarily:
- The old Snakemake way of dealing with cluster execution via a “cluster-config” file, has been
deprecated by Snakemake. The cluster-config file was a YAML file that you could pass to Snakemake,
so that Snakemake could supply rule-specific decorations to the
sbatch
commands used to send jobs to the cluster. Now, however, the Snakemake-recommended SLURM profile includes the ability to supply a cluster-config file. It is important to understand that this cluster-config file is not read and used directly by Snakemake in the deprecated fashion of using the--cluster-config
command; rather, it is read and used by some Python script within the SLURM profile cluster. It is quite important for sensibly assigning paths to SLURM input and output files. - The SLURM snakemake profile is set up to allow the user to quickly make different versions of the profile using a system called “cookiecutter.” You might not be familiar with Cookiecutter, but fear not! It is quite straightforward and convenient.
- The “Cluster Status” script included in the profile may not work on your SLURM system depending on if
sacct
orscontrol
are available. Fortunately, one of these commands is typically available to typical users on SLURM.
In the next section we will focus on configuring the Snakemake SLURM profile for
yourself. After that, we will have a quick look at resources
and at
grouping jobs in Snakemake.
14.6.1 Installing and Configuring the Snakemake SLURM profile
The SLURM profile for snakemake is packaged up as a cookiecutter
template. cookiecutter
is
a Python project that makes it easy to create things like python packages from a template.
So, the SLURM profile on GitHub at https://github.com/Snakemake-Profiles/slurm is
a template for the profile, and cookiecutter
makes it easy to download and customize it for
yourself. cookiecutter
will first download the GitHub repo to a directory within ~/.cookiecutters
.
Once it has the repo there, it will run through a series of questions about how
you might want to customize the SLURM Snakemake profile. This could be done multiple times
to make multiple profiles, but we will just do one for now.
To make all this work, you must have cookiecutter
installed on your system.
And when you run some of the scripts in this profile, it appears that the
cookiecutter
libraries might be required. Just to be safe, be sure to add
cookiecutter
to your snakemake
conda environment:
Once that is done, run cookiecutter
with:
At which point cookiecutter
starts asking you questions about how you
want to customize your profile. I am going to set up a profile for use
on my favorite cluster, SEDNA. I don’t worry about setting up sbatch_defaults
,
and I don’t worry about cluster_name
because SEDNA is not a multi-cluster
environment. However, I do want to set up a path to a cluster-config file
that is relative to the slurm profile I will make (This is necessary
to have a reasonable system for placing the slurm output and error logs.)
Here is a transcript of my session:
profile_name [slurm]: slurm.sedna
sbatch_defaults []:
cluster_config []: cluster-config.yaml
Select advanced_argument_conversion:
1 - no
2 - yes
Choose from 1, 2 [1]: 1
cluster_name []:
You can do something similar, but you might name the profile something else,
like slurm.summit
, if you are making this for the SUMMIT cluster.
When that is done, what the hell just happened? If everything went according to plan,
this should have made a profile that is in a directory named, in my case,
slurm.sedna
in my current working directory. Here is what that directory looks like
for me:
slurm.sedna/
├── config.yaml
├── CookieCutter.py
├── settings.json
├── slurm-jobscript.sh
├── slurm-status.py
├── slurm-submit.py
└── slurm_utils.py
The file that includes the choices given to cookiecutter
are in the file
settings.json
. Here is what mine looks like:
{
"SBATCH_DEFAULTS": "",
"CLUSTER_NAME": "",
"CLUSTER_CONFIG": "cluster-config.yaml",
"ADVANCED_ARGUMENT_CONVERSION": "no"
}
Note that if, after using the SLURM profile you decide that you might
want to add some sbatch defaults, you can easily to that by
editing the settings.json
file in the profile—there is no need
to run cookiecutter
again.
Meanwhile, the actual config file that Snakemake will read to get a sense
for how to run in this SLURM context is config.yaml
and it looks like this:
restart-times: 3
jobscript: "slurm-jobscript.sh"
cluster: "slurm-submit.py"
cluster-status: "slurm-status.py"
max-jobs-per-second: 1
max-status-checks-per-second: 10
local-cores: 1
latency-wait: 60
The items in this YAML file are effectively command line options to Snakemake, in the sense that if you call this profile by invoking Snakemake with:
# this line assumes that the profile is in
# the current working directory and ./slurm.sedna
# is its relative path
snakemake --profile ./slurm.sedna
It has the effect of launching Snakemake with these options:
snakemake --restart-times 3 \
--jobscript slurm-jobscript.sh \
--cluster slurm-submit.py \
--cluster-status slurm-status.py \
--max-jobs-per-second 1 \
--max-status-checks-per-second 10 \
--local-cores 1 \
--latency-wait 60
In words, this command line says
- If a job fails, try restarting it up to three times. (Note that you can edit this in the config.yaml to remove, or comment out, that option, if you are testing scripts that you expect might fail…)
- use
slurm-jobscript.sh
(which is included in the profile you just made) as the SLURM jobscript. - Submit jobs to the cluster using the
slurm-submit.py
script (which is also included in the profile you just made) - …and so forth
It is worth mentioning at this point that, by copying or moving the entire
profile directory (slurm.sedna
in our example) you can store Snakemake profiles in a place
where Snakemake will look for them, so that you can have access to them regardless of
what directory you are working in. This location varies depending on what kind
of computer you are using. You can find it in the output from snakemake --help
. For
example, on my Mac, I see:
--profile PROFILE Name of profile to use for configuring Snakemake. Snakemake will search for a corresponding folder in
/Library/Application Support/snakemake and /Users/eriq/Library/Application Support/snakemake.
Alternatively, this can be an absolute or relative path. The profile folder has to contain a file
'config.yaml'. This file can be used to set default values for command line options in YAML format. For
example, '--cluster qsub' becomes 'cluster: qsub' in the YAML file. Profiles can be obtained from
https://github.com/snakemake-profiles. (default: None)
Whereas, on SEDNA’s Linux operating system, we have:
--profile PROFILE Name of profile to use for configuring Snakemake. Snakemake will search for a corresponding folder in /etc/xdg/snakemake and
/home/eanderson/.config/snakemake. Alternatively, this can be an absolute or relative path. The profile folder has to contain a file
'config.yaml'. This file can be used to set default values for command line options in YAML format. For example, '--cluster qsub' becomes
'cluster: qsub' in the YAML file. Profiles can be obtained from https://github.com/snakemake-profiles. (default: None)
We won’t put our profile in this search path because we want to play with it and modify it a little bit still.
We have one last thing to do to complete our SLURM profile:
recall that we told cookiecutter
that we would have a cluster-config file named cluster-config.yaml
. We have to
actually make that file and use it to make sure our slurm output and
error files go somewhere reasonable. So, create a
cluster-config.yaml
file in your profile directory with these contents.
__default__ :
ntasks: 1
cpus-per-task: "{threads}"
nodes: "1"
output: "slurm_logs/{rule}/{rule}.{wildcards}-%j.out"
error: "slurm_logs/{rule}/{rule}.{wildcards}-%j.err"
In the same way that the profiles config.yaml
file holds command line options that will be passed to
Snakemake, this file includes command line options that will be passed (by default in this case)
to the sbatch
command used to submit jobs to SLURM. The super cool thing about this
however, is that the brace notation can be used to substitute values for the submitted
jobs, like the rule name, the number of threads requested by the rule, and all the wildcards
for an instance of a rule.
For the output files, note that we also use SLURM’s notation of %j
to include the
job number in the name of the log file. This makes it easier to find the logs
from jobs that failed when you were emailed a fail report with the job number, for instance.
14.6.2 Playing with our new SLURM profile
14.6.2.1 A First Run
Now, let us make a silly Snakefile to test how Snakemake works and behaves when using this SLURM profile.
Make a directory called slurm-play
and put your slurm profile in it.
Then we will put a Snakefile in it that has a few rules in it that do nothing,
but which will let us see how things work with SLURM.
# a rule claiming 1 thread that does nothing for 15 seconds
rule t1s15:
input:
"data/{sample}.txt"
output:
"results/t1s15/{sample}.txt"
threads: 1
shell:
"sleep 15; "
" echo gratuitous output; "
" echo gratuitous error output > /dev/stderr; "
"echo {input} > {output}"
# a rule claiming 10 threads that does nothing for 15 seconds
rule t10s15:
input:
"data/{sample}.txt"
output:
"results/t10s15/{sample}.txt"
threads: 10
shell:
"sleep 15; "
"echo {input} > {output} "
Put some “input” files in there by touching them.
Now, if we tell it to produce 20 outputs in the t1s15
directory, utilizing
a single core it will take 15 seconds to produce each one.
There is no SLURM happening there. But now, let’s try running the command on the cluster, restricting ourselves to no more than 10 jobs running at once:
It never has more than 10 jobs going, but it does them all.
Here is what my jobs look like using squeue
(more on that later!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) CPUS MIN_MEMORY
51799 node snakejob eanderson R 0:10 1 node03 1 4700M
51800 node snakejob eanderson R 0:10 1 node02 1 4700M
51801 node snakejob eanderson R 0:10 1 node02 1 4700M
51802 node snakejob eanderson R 0:10 1 node09 1 4700M
51803 node snakejob eanderson R 0:08 1 node09 1 4700M
51804 node snakejob eanderson R 0:08 1 node09 1 4700M
51805 node snakejob eanderson R 0:08 1 node09 1 4700M
51806 node snakejob eanderson R 0:08 1 node09 1 4700M
51807 node snakejob eanderson R 0:08 1 node09 1 4700M
51808 node snakejob eanderson R 0:08 1 node09 1 4700M
10 jobs are running, and each one is using one core with 4.7 Gb of memory (which is the default amount of memory per core on SEDNA).
Not only that, but all the SLURM logs went where we wanted them to, and contain what they should:
(snakemake) [node09: slurm-play]--% tree slurm_logs/
slurm_logs/
└── t1s15
├── t1s15.sample=0001-52004.err
├── t1s15.sample=0001-52004.out
├── t1s15.sample=0002-51993.err
├── t1s15.sample=0002-51993.out
├── t1s15.sample=0003-52001.err
├── t1s15.sample=0003-52001.out
...
Their contents are also just what we expect them to: the have captured everythign from the processes that was written to stdout and to stderr but was not redirected:
(snakemake) [node09: slurm-play]--% head slurm_logs/*/*
==> slurm_logs/t1s15/t1s15.sample=0001-52004.err <==
gratuitous error output
[Thu Apr 22 09:56:23 2021]
Finished job 0.
1 of 1 steps (100%) done
==> slurm_logs/t1s15/t1s15.sample=0001-52004.out <==
gratuitous output
==> slurm_logs/t1s15/t1s15.sample=0002-51993.err <==
gratuitous error output
[Thu Apr 22 09:56:23 2021]
Finished job 0.
1 of 1 steps (100%) done
==> slurm_logs/t1s15/t1s15.sample=0002-51993.out <==
gratuitous output
...
Notice how the nanes of the slurm logs have the name of the rule,
all the values of the wildcards in key=value
format, and
the SLURM job number on them. That is RAD!!
14.6.2.2 Make the slurm job names more informative
The only thing that I am not super happy with is that the name of every
job is snakejob
. That really doesn’t supply me with a lot of intuition
about what the hell that job is doing. It would be awesome if we could
change the job name to something that was formed from the rule name and
the wildcards.
Keeping in mind that such names could get quite long, we can first write a function
called myjobs
in our ~/.bashrc
file to make it easy to change how much space is allotted
to the job name in SLURM’s squeue
report. To do this, on your cluster account,
add the following lines to your ~/.bashrc
(just above the conda initialization block
if you have that):
function myjobs {
if [ $# -ne 1 ]; then
JL=10;
else
JL=$1;
fi;
squeue -u $(whoami) -o "%.12i %.9P %.${JL}j %.10u %.2t %.15M %.6D %.18R %.5C %.12m";
}
And then source your new ~/.bashrc
like this:
With this bash function you can say myjobs
to get a list of all your slurm jobs,
and if you need more space for the job names, you can indicate that with a trailing
number, like:
will give you 50 characters for your job names.
Armed with our new myjobs
function, let’s modify our SLURM profile so that
the job names have the rule name and the wildcards in them. The job name
can be set using sbatch
’s --job-name
option. So, we should be able to set
that in our cluster-config.yaml
file in our slurm profile! So, update your
cluster-config.yaml
to add the bottom line in it like this:
__default__ :
ntasks: 1
cpus-per-task: "{threads}"
nodes: "1"
output: "slurm_logs/{rule}/{rule}.{wildcards}-%j.out"
error: "slurm_logs/{rule}/{rule}.{wildcards}-%j.err"
job-name: "{rule}.{wildcards}"
Then we can try running our jobs again:
And then with a quick myjobs 30
we see that our job names are much more informative. Booyah!
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) CPUS MIN_MEMORY
56770 node t1s15.sample=0009 eanderson R 0:15 1 node09 1 4700M
56771 node t1s15.sample=0015 eanderson R 0:15 1 node09 1 4700M
56772 node t1s15.sample=0018 eanderson R 0:15 1 node09 1 4700M
56773 node t1s15.sample=0001 eanderson R 0:12 1 node09 1 4700M
56774 node t1s15.sample=0008 eanderson R 0:12 1 node09 1 4700M
56775 node t1s15.sample=0014 eanderson R 0:12 1 node09 1 4700M
56776 node t1s15.sample=0004 eanderson R 0:12 1 node09 1 4700M
56777 node t1s15.sample=0016 eanderson R 0:12 1 node09 1 4700M
56778 node t1s15.sample=0010 eanderson R 0:12 1 node09 1 4700M
56779 node t1s15.sample=0003 eanderson R 0:12 1 node09 1 4700M
14.6.2.3 Naively launching jobs that claim more threads
OK, but now, let us try launching some jobs that claim more threads and (because of the way
we have set up cluster-config.yaml
) should request more CPUs per task.
We particularly want to see what --cores/--jobs
restricts.
When we do this. Snakemake launches no more than 3 slurm jobs at a time, but each one is allocated 10 cores.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) CPUS MIN_MEMORY
56780 node t10s15.sample=0014 eanderson R 0:02 1 node09 10 4700M
56781 node t10s15.sample=0019 eanderson R 0:02 1 node08 10 4700M
56782 node t10s15.sample=0020 eanderson R 0:02 1 node08 10 4700M
So, --cores
(which is identical to --jobs
) in the cluster context does not do what you might at first think. This
is super important to understand. If you are launching jobs on a cluster and you don’t want to eat up more
than X cores/CPUs on the cluster you cannot do that using the --cores
option if you have rules
that use more threads than 1. In order to allow Snakemake to properly control the maximum amount of resources
that your Snakemake workflow will request at any one time, you need to define the resources used
by each job and then define the maximum of those resources that you want you workflow to consume at
any particular time.
If you are happy to request as many resources as you possibly can and let the SLURM scheduler figure out how much to give you, this might not be super relevant. But if you share a cluster with a smallish number of people, and you don’t want to completely bomb the resource, it might be important.
14.6.3 Managing cluster resources with Snakemake
Whether or not you are worried about consuming too many resources on the cluster, you typically have to inform the SLURM scheduler about how many resources you need allocated for any job request. Most of the time, these requests will involve:
- How much time you expect your job to run.
- How much memory you expect your job to require.
- How many CPUs your job might need.
- How many nodes your job will require.
The SLURM snakemake profile appears to be able to handle each of these. With specific resource names getting mapped to particular SLURM options in the following manner:
RESOURCE_MAPPING = {
"time": ("time", "runtime", "walltime"),
"mem": ("mem", "mem_mb", "ram", "memory"),
"mem-per-cpu": ("mem-per-cpu", "mem_per_cpu", "mem_per_thread"),
"nodes": ("nodes", "nnodes"),
}
That is to say, that if you specify resources in a block for a rule like this:
Then, those values should get passed along to SLURM via the --time
and --mem-per-cpu
options to sbatch
.
14.6.4 Grouping short jobs to run on a single node
Within your workflow, you might have a point at which thousands of
short jobs must be run, with each one only requiring a couple of seconds.
Sending each one of these jobs to the cluster with a separate sbatch
command is a
very bad idea. There is a lot of overhead involved in scheduling jobs,
and, so, it is best if each job takes a non-trivial amount of time to run.
Not only that, but your thousands of jobs might have to wait a long time in the
queue to start. It would be much better if you could instruct Snakemake to send
the thousands of short jobs that arise as instances of a rule to a single node where
they can be run one after another. This way, there is only one job submission to
SLURM, but thousands of jobs get run.
Such behavior can be achieved by defining Job Groups within the Snakefile or
on the command line.
In its default configuration, job grouping in Snakemake is designed to allow fast-running jobs to “hitchhike” onto the same cluster job submission of long-running jobs. The idea is that if Rule A runs a long job for each value of sample, and then rule B takes the output for Rule A for each sample and runs 1 (or many) short jobs, then it might be good to, in effect, “tack the rule B job onto the end of the rule A job for each sample.” This is achieved by making rule A and rule B members of the same group.
Any rule can be assigned to a group by using the group
keyword in the
rule itself. For example, the following rule is considered to be part
a group called “group1”:
Here we extend our example above to a situation like that described above. In the following Snakefile our “long-running” jobs (rule t1s15) take 15 seconds, and for each {sample} value of that rule, the next rule (rule t1s0) almost instantaneously creates an output for each value of {sample} and also for every value of rep. But both of these rules have been put into the same group (“some_group”), so we can see how this would change the evaluation.
# a rule claiming 1 thread that does nothing for 15 seconds
rule t1s15:
input:
"data/{sample}.txt"
output:
"results/t1s15/{sample}.txt"
threads: 1
group:
"some_group"
shell:
"sleep 15; "
" echo gratuitous output; "
" echo gratuitous error output > /dev/stderr; "
"echo {input} > {output}"
# for each sample and rep (of which we will request 10 in the output)
# there is a single short job that gets run
rule t1s0:
input:
"results/t1s15/{sample}.txt"
output:
"results/t1s0/{sample}_{rep}.txt"
group:
"some_group"
shell:
"echo {input} > {output}"
We will pretend that we have 5 reps of rule t1s0
for each sample, and
we can easily specify that by the target output file we want:
# do this on the command line in the directory with the Snakefile
rm -r results
snakemake -np results/t1s0/{0001..0020}_r{01..05}.txt
Doing this without any cluster specification shows this for the number of jobs:
Job counts:
count jobs
100 t1s0
20 t1s15
That makes sense, we have 5 reps (r01, r02, …, r05) for each of 20 samples.
But, we certainly would not want to run each of those jobs separately in a cluster
environment. Let’s do a dry run with our slurm profile. In my case that looks like:
At the bottom, it reports the number of jobs as the same:
Job counts:
count jobs
100 t1s0
20 t1s15
120
But if we look at the jobs listed we see groups in it like this (note the second line):
[Thu Apr 29 08:28:00 2021]
group job some_group (jobs in lexicogr. order):
[Thu Apr 29 08:28:00 2021]
rule t1s0:
input: results/t1s15/0013.txt
output: results/t1s0/0013_r04.txt
jobid: 92
wildcards: sample=0013, rep=r04
echo results/t1s15/0013.txt > results/t1s0/0013_r04.txt
[Thu Apr 29 08:28:00 2021]
rule t1s0:
input: results/t1s15/0013.txt
output: results/t1s0/0013_r05.txt
jobid: 90
wildcards: sample=0013, rep=r05
echo results/t1s15/0013.txt > results/t1s0/0013_r05.txt
[Thu Apr 29 08:28:00 2021]
rule t1s0:
input: results/t1s15/0013.txt
output: results/t1s0/0013_r03.txt
jobid: 8
wildcards: sample=0013, rep=r03
echo results/t1s15/0013.txt > results/t1s0/0013_r03.txt
[Thu Apr 29 08:28:00 2021]
rule t1s0:
input: results/t1s15/0013.txt
output: results/t1s0/0013_r02.txt
jobid: 116
wildcards: sample=0013, rep=r02
echo results/t1s15/0013.txt > results/t1s0/0013_r02.txt
[Thu Apr 29 08:28:00 2021]
rule t1s0:
input: results/t1s15/0013.txt
output: results/t1s0/0013_r01.txt
jobid: 66
wildcards: sample=0013, rep=r01
echo results/t1s15/0013.txt > results/t1s0/0013_r01.txt
[Thu Apr 29 08:28:00 2021]
rule t1s15:
input: data/0013.txt
output: results/t1s15/0013.txt
jobid: 9
wildcards: sample=0013
sleep 15; echo gratuitous output; echo gratuitous error output > /dev/stderr; echo data/0013.txt > results/t1s15/0013.txt
[Thu Apr 29 08:28:00 2021]
group job some_group (jobs in lexicogr. order):
[Thu Apr 29 08:28:00 2021]
rule t1s0:
input: results/t1s15/0006.txt
output: results/t1s0/0006_r02.txt
jobid: 74
wildcards: sample=0006, rep=r02
...
So, all the jobs under a particular group job heading will be run together. Above, all the grouped jobs for sample=0013 are shown.
When we try running that with:
It sends each group off as a separate job.
There are some hiccups, however! In the group job context, there is not a single rule that is associated with each job submission to SLURM, so these lines in the cluster-config create some serious problems:
output: "slurm_logs/{rule}/{rule}.{wildcards}-%j.out"
error: "slurm_logs/{rule}/{rule}.{wildcards}-%j.err"
job-name: "{rule}.{wildcards}"
So, having those in the cluster config file basically caused the job to fail.
When those lines are removed, all 20 group jobs run (with names like snakejob.some_group.ca961
) and they spew their
output streams to standard slurm output files named like in the current working directory:
slurm-56801.out slurm-56804.out slurm-56807.out slurm-56810.out slurm-56813.out slurm-56816.out slurm-56819.out
slurm-56802.out slurm-56805.out slurm-56808.out slurm-56811.out slurm-56814.out slurm-56817.out slurm-56820.out
slurm-56803.out slurm-56806.out slurm-56809.out slurm-56812.out slurm-56815.out slurm-56818.out
Sigh…Clearly using group jobs requires some care when using the slurm profile the way that you might want to.
To lump multiple connected subgraphs together you can use the --group-components
option.
There are some subtleties with how job grouping works, however. First off, if a group consists of a single rule, then all instances of that rule (i.e., all invocations of that rule with different sets of wildcards) will be grouped together and sent to a single node for evaluation on the cores available to that node.
I will write more about this later. It all comes down to what the connected subgraphs of a Snakemake job look like. I won’t write about that until I have some pictures I can provide.
The astute reader will note that these names are similar to the actual programs
Trimmomatic
andtrim_galore
. Just having some fun…↩︎Note that you can write the code in a more “inline” fashion in your Snakefile by using what are called “lambda functions” in Python. However, it can be cleaner and easier to understand explicitly written input functions.↩︎