Chapter 27 An example full workflow from sequences to variants
The purpose of this chapter is to present an accessible example workflow showing all the steps of taking raw sequences in FASTQ files through to variants in VCF files. It assumes that some sort of reference genome is available already (be it a chromosome-level assembly or just a scaffold-level assembly). The workflow shown here is based on the GATK “Best Practices,” with some modifications to make it reasonable for non-model organisms that lack many of the genomic resources expected by the authors of the GATK “Best Practices” (like files of known variation in the species, etc.).
As an example data set we have created a small version of a typical NGS data set. The original data set included 384 Chinook salmon from the Yukon River basin, sequenced across 4 lanes of an Illumina NovaSeq machine. To create the subsetted data set, we extracted just the reads that map to a small part (about 10 megabases from each of four chromosomes, and then a variety of shorter scaffolds) of the latest (as of 2022) reference genome (Otsh_v2.0) for Chinook salmon. And we included data from only 8 of the 384 individual fish.
Included in the data set is a reference genome that has been downsized similarly, to include only those portions of the genome from which we extracted the reads from those 8 fish.
Finally, the reads from different individuals were massaged to make it appear that some individuals were sequenced on multiple lanes or flow cells and so that some individuals underwent several different library preps. This is an important feature to ensure that we learn how to properly deal with sequencing of individual libraries across multiple lanes when marking PCR duplicates.
The data set is available publicly as a .tar file at this link: https://drive.google.com/file/d/1LMK-DCkH1RKFAWTR2OKEJ_K9VOjJIZ1b/view?usp=sharing. If you view that site on a browser you can choose to download the file.
Downloading the file to a Linux server can be done using wget
on the command line. Because the file
is large, a little bit of extra trickery is required to download it. This (rather complex) command line
should work:
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1LMK-DCkH1RKFAWTR2OKEJ_K9VOjJIZ1b' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1LMK-DCkH1RKFAWTR2OKEJ_K9VOjJIZ1b" -O non-model-wgs-example-data.tar && rm -rf /tmp/cookies.txt
Once the file non-model-wgs-example-data.tar
is downloaded, that .tar archive can be untarred with
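# extract the archive in the directory where you downloaded it (a standard tar extraction)
tar -xvf non-model-wgs-example-data.tar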
That will create the directory non-model-wgs-example-data
, which contains all the files.
The contents of that directory look like this (in a tree
view):
non-model-wgs-example-data
├── README.Rmd
├── README.nb.html
├── data
│ └── messy_names
│ └── fastqs
│ ├── T199967_T2087_HY75HDSX2_L001_R1_001.fastq.gz
│ ├── T199967_T2087_HY75HDSX2_L001_R2_001.fastq.gz
│ ├── T199968_T2087_HY75HDSX2_L001_R1_001.fastq.gz
│ ├── T199968_T2087_HY75HDSX2_L001_R2_001.fastq.gz
│ ├── T199968_T2087_HY75HDSX2_L002_R1_001.fastq.gz
│ ├── T199968_T2087_HY75HDSX2_L002_R2_001.fastq.gz
│ ├── T199969_T2087_HTYYCBBXX_L002_R1_001.fastq.gz
│ ├── T199969_T2087_HTYYCBBXX_L002_R2_001.fastq.gz
│ ├── T199969_T2087_HY75HDSX2_L002_R1_001.fastq.gz
│ ├── T199969_T2087_HY75HDSX2_L002_R2_001.fastq.gz
│ ├── T199969_T2087_HY75HDSX2_L003_R1_001.fastq.gz
│ ├── T199969_T2087_HY75HDSX2_L003_R2_001.fastq.gz
│ ├── T199970_T2087_HY75HDSX2_L003_R1_001.fastq.gz
│ ├── T199970_T2087_HY75HDSX2_L003_R2_001.fastq.gz
│ ├── T199970_T2094_HY75HDSX2_L004_R1_001.fastq.gz
│ ├── T199970_T2094_HY75HDSX2_L004_R2_001.fastq.gz
│ ├── T199971_T2087_HY75HDSX2_L002_R1_001.fastq.gz
│ ├── T199971_T2087_HY75HDSX2_L002_R2_001.fastq.gz
│ ├── T199971_T2087_HY75HDSX2_L004_R1_001.fastq.gz
│ ├── T199971_T2087_HY75HDSX2_L004_R2_001.fastq.gz
│ ├── T199971_T2099_HTYYCBBXX_L002_R1_001.fastq.gz
│ ├── T199971_T2099_HTYYCBBXX_L002_R2_001.fastq.gz
│ ├── T199972_T2087_HTYYCBBXX_L003_R1_001.fastq.gz
│ ├── T199972_T2087_HTYYCBBXX_L003_R2_001.fastq.gz
│ ├── T199972_T2087_HY75HDSX2_L001_R1_001.fastq.gz
│ ├── T199972_T2087_HY75HDSX2_L001_R2_001.fastq.gz
│ ├── T199972_T2094_HTYYCBBXX_L004_R1_001.fastq.gz
│ ├── T199972_T2094_HTYYCBBXX_L004_R2_001.fastq.gz
│ ├── T199972_T2094_HY75HDSX2_L002_R1_001.fastq.gz
│ ├── T199972_T2094_HY75HDSX2_L002_R2_001.fastq.gz
│ ├── T199973_T2087_HY75HDSX2_L002_R1_001.fastq.gz
│ ├── T199973_T2087_HY75HDSX2_L002_R2_001.fastq.gz
│ ├── T199973_T2087_HY75HDSX2_L003_R1_001.fastq.gz
│ ├── T199973_T2087_HY75HDSX2_L003_R2_001.fastq.gz
│ ├── T199973_T2094_HY75HDSX2_L002_R1_001.fastq.gz
│ ├── T199973_T2094_HY75HDSX2_L002_R2_001.fastq.gz
│ ├── T199973_T2094_HY75HDSX2_L003_R1_001.fastq.gz
│ ├── T199973_T2094_HY75HDSX2_L003_R2_001.fastq.gz
│ ├── T199974_T2087_HY75HDSX2_L001_R1_001.fastq.gz
│ ├── T199974_T2087_HY75HDSX2_L001_R2_001.fastq.gz
│ ├── T199974_T2094_HY75HDSX2_L001_R1_001.fastq.gz
│ ├── T199974_T2094_HY75HDSX2_L001_R2_001.fastq.gz
│ ├── T199974_T2099_HY75HDSX2_L001_R1_001.fastq.gz
│ └── T199974_T2099_HY75HDSX2_L001_R2_001.fastq.gz
├── fastq
│ ├── s001---1_R1.fq.gz -> ../data/messy_names/fastqs/T199967_T2087_HY75HDSX2_L001_R1_001.fastq.gz
│ ├── s001---1_R2.fq.gz -> ../data/messy_names/fastqs/T199967_T2087_HY75HDSX2_L001_R2_001.fastq.gz
│ ├── s002---1_R1.fq.gz -> ../data/messy_names/fastqs/T199968_T2087_HY75HDSX2_L001_R1_001.fastq.gz
│ ├── s002---1_R2.fq.gz -> ../data/messy_names/fastqs/T199968_T2087_HY75HDSX2_L001_R2_001.fastq.gz
│ ├── s002---2_R1.fq.gz -> ../data/messy_names/fastqs/T199968_T2087_HY75HDSX2_L002_R1_001.fastq.gz
│ ├── s002---2_R2.fq.gz -> ../data/messy_names/fastqs/T199968_T2087_HY75HDSX2_L002_R2_001.fastq.gz
│ ├── s003---1_R1.fq.gz -> ../data/messy_names/fastqs/T199969_T2087_HY75HDSX2_L002_R1_001.fastq.gz
│ ├── s003---1_R2.fq.gz -> ../data/messy_names/fastqs/T199969_T2087_HY75HDSX2_L002_R2_001.fastq.gz
│ ├── s003---2_R1.fq.gz -> ../data/messy_names/fastqs/T199969_T2087_HY75HDSX2_L003_R1_001.fastq.gz
│ ├── s003---2_R2.fq.gz -> ../data/messy_names/fastqs/T199969_T2087_HY75HDSX2_L003_R2_001.fastq.gz
│ ├── s003---3_R1.fq.gz -> ../data/messy_names/fastqs/T199969_T2087_HTYYCBBXX_L002_R1_001.fastq.gz
│ ├── s003---3_R2.fq.gz -> ../data/messy_names/fastqs/T199969_T2087_HTYYCBBXX_L002_R2_001.fastq.gz
│ ├── s004---1_R1.fq.gz -> ../data/messy_names/fastqs/T199970_T2087_HY75HDSX2_L003_R1_001.fastq.gz
│ ├── s004---1_R2.fq.gz -> ../data/messy_names/fastqs/T199970_T2087_HY75HDSX2_L003_R2_001.fastq.gz
│ ├── s004---2_R1.fq.gz -> ../data/messy_names/fastqs/T199970_T2094_HY75HDSX2_L004_R1_001.fastq.gz
│ ├── s004---2_R2.fq.gz -> ../data/messy_names/fastqs/T199970_T2094_HY75HDSX2_L004_R2_001.fastq.gz
│ ├── s005---1_R1.fq.gz -> ../data/messy_names/fastqs/T199971_T2087_HY75HDSX2_L002_R1_001.fastq.gz
│ ├── s005---1_R2.fq.gz -> ../data/messy_names/fastqs/T199971_T2087_HY75HDSX2_L002_R2_001.fastq.gz
│ ├── s005---2_R1.fq.gz -> ../data/messy_names/fastqs/T199971_T2087_HY75HDSX2_L004_R1_001.fastq.gz
│ ├── s005---2_R2.fq.gz -> ../data/messy_names/fastqs/T199971_T2087_HY75HDSX2_L004_R2_001.fastq.gz
│ ├── s005---3_R1.fq.gz -> ../data/messy_names/fastqs/T199971_T2099_HTYYCBBXX_L002_R1_001.fastq.gz
│ ├── s005---3_R2.fq.gz -> ../data/messy_names/fastqs/T199971_T2099_HTYYCBBXX_L002_R2_001.fastq.gz
│ ├── s006---1_R1.fq.gz -> ../data/messy_names/fastqs/T199972_T2087_HY75HDSX2_L001_R1_001.fastq.gz
│ ├── s006---1_R2.fq.gz -> ../data/messy_names/fastqs/T199972_T2087_HY75HDSX2_L001_R2_001.fastq.gz
│ ├── s006---2_R1.fq.gz -> ../data/messy_names/fastqs/T199972_T2087_HTYYCBBXX_L003_R1_001.fastq.gz
│ ├── s006---2_R2.fq.gz -> ../data/messy_names/fastqs/T199972_T2087_HTYYCBBXX_L003_R2_001.fastq.gz
│ ├── s006---3_R1.fq.gz -> ../data/messy_names/fastqs/T199972_T2094_HY75HDSX2_L002_R1_001.fastq.gz
│ ├── s006---3_R2.fq.gz -> ../data/messy_names/fastqs/T199972_T2094_HY75HDSX2_L002_R2_001.fastq.gz
│ ├── s006---4_R1.fq.gz -> ../data/messy_names/fastqs/T199972_T2094_HTYYCBBXX_L004_R1_001.fastq.gz
│ ├── s006---4_R2.fq.gz -> ../data/messy_names/fastqs/T199972_T2094_HTYYCBBXX_L004_R2_001.fastq.gz
│ ├── s007---1_R1.fq.gz -> ../data/messy_names/fastqs/T199973_T2087_HY75HDSX2_L002_R1_001.fastq.gz
│ ├── s007---1_R2.fq.gz -> ../data/messy_names/fastqs/T199973_T2087_HY75HDSX2_L002_R2_001.fastq.gz
│ ├── s007---2_R1.fq.gz -> ../data/messy_names/fastqs/T199973_T2087_HY75HDSX2_L003_R1_001.fastq.gz
│ ├── s007---2_R2.fq.gz -> ../data/messy_names/fastqs/T199973_T2087_HY75HDSX2_L003_R2_001.fastq.gz
│ ├── s007---3_R1.fq.gz -> ../data/messy_names/fastqs/T199973_T2094_HY75HDSX2_L002_R1_001.fastq.gz
│ ├── s007---3_R2.fq.gz -> ../data/messy_names/fastqs/T199973_T2094_HY75HDSX2_L002_R2_001.fastq.gz
│ ├── s007---4_R1.fq.gz -> ../data/messy_names/fastqs/T199973_T2094_HY75HDSX2_L003_R1_001.fastq.gz
│ ├── s007---4_R2.fq.gz -> ../data/messy_names/fastqs/T199973_T2094_HY75HDSX2_L003_R2_001.fastq.gz
│ ├── s008---1_R1.fq.gz -> ../data/messy_names/fastqs/T199974_T2087_HY75HDSX2_L001_R1_001.fastq.gz
│ ├── s008---1_R2.fq.gz -> ../data/messy_names/fastqs/T199974_T2087_HY75HDSX2_L001_R2_001.fastq.gz
│ ├── s008---2_R1.fq.gz -> ../data/messy_names/fastqs/T199974_T2094_HY75HDSX2_L001_R1_001.fastq.gz
│ ├── s008---2_R2.fq.gz -> ../data/messy_names/fastqs/T199974_T2094_HY75HDSX2_L001_R2_001.fastq.gz
│ ├── s008---3_R1.fq.gz -> ../data/messy_names/fastqs/T199974_T2099_HY75HDSX2_L001_R1_001.fastq.gz
│ └── s008---3_R2.fq.gz -> ../data/messy_names/fastqs/T199974_T2099_HY75HDSX2_L001_R2_001.fastq.gz
├── resources
│ └── genome.fasta
├── samples.tsv
└── units.tsv
- The README.* files give some background on how this data set was whittled out of the larger, complete data set.
- The directory data/messy_names/fastqs contains all the FASTQ files as they might appear after downloading from the sequencing center. Characteristic of such files, they all have rather daunting names.
- The directory fastq at the top level of the directory contains symbolic links to the big files with messy names. These are there simply to provide students a simpler set of file names to deal with when doing the initial steps of a sequencing exercise. But, ultimately, we will explore ways of directly addressing the files with more complex names.
- The file samples.tsv just contains all the unique “tidy” sample names (like s001, s002, etc.) in a single column named sample.
- The file units.tsv contains crucial information about all the different pairs of FASTQ files. Each pair is considered a separate “unit” of one of the samples. Some of the information in the file pertains to which flowcell and lane the sample-unit was sequenced on, as well as which library the sample-unit was prepared in. All of this relates to the “journey” that was taken by each little piece of DNA that was sequenced (Section 19.1). It is worth having a look at the file.
- The file resources/genome.fasta contains the “genome” for this exercise.
For information about FASTQ and FASTA files see Sections 17.2 and 17.3.
27.1 An overview of the whole workflow
We provide here a simple(ish) flow diagram that shows the different steps/jobs in this workflow that have to be done. The diagram is made from a Snakemake workflow (see Chapter 14) that Eric has implemented for whole genome sequencing in non-model organisms. The directed acyclic graph (DAG) output from Snakemake has been condensed with the function condense_dag() from one of Eric’s R packages (SnakemakeDagR). Each step/job is a separate
box, and the arrows between the boxes show the dependence of output from the parent box
(at the tail end of the arrow) as input for the child box (at the head end of the arrow).
The numbers in the boxes and along the arrows show how many independent instances of each job
must be run (for different samples, units, chromosomes, etc.)
Burn this DAG into your memory. We will return to versions of it in each section below to give an indication of which part of the workflow we are tackling.
27.2 Let’s have a look at the reference genome
A number of people in the class had problems installing samtools on SUMMIT last week. The reason for this appears to be that the conda package management system migrated to a new version of the OpenSSL library, but samtools looks for an older version of the libcrypto library (perhaps because it uses some of the hashing functions from that library). This was a known problem when conda upgraded its OpenSSL dependencies, but it seems to have resurfaced with later versions of samtools. However, it can be fixed by installing version 1.11 of samtools.
27.2.1 First off, make sure that you have added mamba to your base environment
I was playing with a lot of things over the weekend, and it was clear that mamba was orders of magnitude faster than conda for installing environments. (It almost felt like conda merely stalled on some environments—that wouldn’t make for fun classtime…just waiting for software to be installed).
So, I am going to require that people use mamba
from here on out. First, test to see
if you have mamba in your base conda environment: activate your base conda environment
and see if you have mamba:
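# activate the base environment, then just type mamba to see whether it is there
conda activate base
mamba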
If this gives you a screen full of syntax about how to use mamba, like:
usage: mamba [-h] [-V] command ...
conda is a tool for managing and deploying applications, environments and packages.
Options:
positional arguments:
command
clean Remove unused packages and caches.
compare Compare packages between conda environments.
config Modify configuration values in .condarc. This is modeled after the git config command. Writes to the user .condarc file (/home/eriq@colostate.edu/.condarc) by
default.
create Create a new conda environment from a list of specified packages.
help Displays a list of available conda commands and their help strings.
info Display information about current conda install.
init Initialize conda for shell interaction. [Experimental]
install Installs a list of packages into a specified conda environment.
list List linked packages in a conda environment.
package Low-level conda package utility. (EXPERIMENTAL)
remove Remove a list of packages from a specified conda environment.
uninstall Alias for conda remove.
run Run an executable in a conda environment. [Experimental]
search Search for packages and display associated information. The input is a MatchSpec, a query language for conda packages. See examples below.
update Updates conda packages to the latest compatible version.
upgrade Alias for conda update.
repoquery Query repositories using mamba.
then you have mamba.
If, instead, it tells you it doesn’t know anything about the mamba command, then you need to install it. Make sure that you install it into your base conda environment, using this command:
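# install mamba into the base environment (it is distributed on the conda-forge channel)
conda install -n base -c conda-forge mamba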
Now, if you need to get a working samtools (version 1.11) into an environment called samtools, you can do this:
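# samtools is on the bioconda channel; pin version 1.11 as discussed above
mamba create -n samtools -c bioconda samtools=1.11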
To use samtools, you must then activate the environment:
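conda activate samtools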
27.2.2 Indexing the reference genome
Samtools has a wonderful subcommand, faidx, that indexes a fasta file.
We can index our example reference genome by running this command from the
top level of our non-model-wgs-example-data
directory.
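samtools faidx resources/genome.fasta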
This creates the file resources/genome.fasta.fai
. The extension
.fai
stands for “faidx index.” Look at resources/genome.fasta.fai
using the less
viewer. The first few lines look like:
CM031199.1 14000000 122 60 61
CM031200.1 12000000 14233578 60 61
CM031201.1 10000000 26433700 60 61
CM031202.1 6000000 36600489 60 61
JAEPEU010001976.1 276 42700570 60 61
JAEPEU010009262.1 333 42700932 60 61
JAEPEU010007942.1 367 42701352 60 61
JAEPEU010006675.1 416 42701807 60 61
JAEPEU010003074.1 440 42702311 60 61
Each line gives information about a different sequence in the reference fasta file, listed in the order that the sequences appear in the fasta file.
The five columns of each line hold this information:
- NAME: Name of this reference sequence
- LENGTH: Total length of this reference sequence, in bases
- OFFSET: Offset in the FASTA/FASTQ file of this sequence’s first base
- LINEBASES: The number of bases on each line
- LINEWIDTH: The number of bytes in each line, including the newline
This information can be used to rapidly find the parts of the file where
any position in the genome is found, for example, if you wanted to find
the 100th base in sequence JAEPEU010001976.1
you can see that you have
to go 42700570
bytes from the start of the file to get to base 1 of
JAEPEU010001976.1
, and then you would have to go forward in the file
for 99 bases. 59 bases forward would get you to the last base on the first
line, then you have an additional byte to eat up (the line ending) and then
40 more bytes would put you at position 100. So, you could reach
the desired point in the fasta file by just going
42700570 + 59 + 1 + 40
bytes forward in the file. This is the sort
of thing that can be done very quickly by a computer.
Fortunately you don’t have to do this arithmetic in your head to
find sequences at certain places in the reference genome. Rather, once
you have indexed your FASTA file, you can use samtools faidx
with any
number of optional genomic coordinates, to extract those sequences.
A genomic coordinate (or “range”) of a sequence has the form:
chromosome_or_sequence_name:start_position-end_position
For example:
CM031199.1:50151-50300
gives the genomic coordinates (or the “genome address”) of the 150 base pair
sequence whose first base is at position 50,151 in the genome, and whose
last is at position 50,300. When using samtools faidx
bases are numbered
so that the first one is 1
. This is called base-1 indexing. Other ways of addressing regions of genomes (like those used in BED files) use a different base-0 scheme.
You can think of base-1 schemes as those that give the address of each base starting from 1. To understand a base-0 scheme in that light, it makes sense to think about addressing each SNP according to individual “pickets” that separate each base. (Probably should draw a picture, but you can see it at this UCSC genome browser blog site).
At any rate, if you want to get the 500 bases starting from 1,000,001 on the first two chromosomes in our example reference genome you would do:
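# the first two chromosomes in this example genome are CM031199.1 and CM031200.1
samtools faidx resources/genome.fasta CM031199.1:1000001-1000500 CM031200.1:1000001-1000500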
Note that if you find it difficult to keep track of all those zeroes, you are free to put commas into the numbers anywhere that is helpful, for example:
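# commas in the coordinates are simply ignored
samtools faidx resources/genome.fasta CM031199.1:1,000,001-1,000,500 CM031200.1:1,000,001-1,000,500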
So, if you ever need to extract a certain sequence from a genome, this is one way you can do it!
27.3 A word on organizing workflows
Now, before we start rolling along with our example bioinformatic workflow,
let us take a moment and appreciate how clean and tidy our non-model-wgs-example-data
directory is. If I do ls on it I see this:
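README.Rmd  README.nb.html  data  fastq  resources  samples.tsv  units.tsv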
That is all right! There are only seven files or directories there and it is not a mess.
As we proceed with a bioinformatic pipeline/workflow, it will behoove us to
keep things super nice and clean. In fact I am going to recommend that nearly
everything we do in our workflow should go inside a single directory called
results
.
In general, your workflow will involve several types of large files. Some of those
are just things like genomes (.fasta files), genomic features (.gff) files, or known
variants, etc. In general these sorts of things that you might download from a
repository like GenBank should be placed into the resources
directory, possibly
in their own subdirectory within resources
. The other types of files are those that
get produced when you run bioinformatic software to analyze your data—for example, trimmed
versions of your fastqs after you have trimmed off the low quality bases, etc. These latter
types of files should all go into separate folders within a directory called results
.
The directory they go in should reflect what type of step in your bioinformatic process
produced them, for example: trimmed
, mapped
, variants
, etc.
Additionally, there should be a directory, results/logs
where you should redirect stderr
for
each sample that is run in a certain process. All this will become clear as we start
working through our example.
27.4 Step 1a. Quality Checking the Reads
The first thing we are going to do is Quality Check our fastq files. The goal of this step is to get a sense for how good (or bad) the data that we got back from the sequencing center is, in terms of number of reads per sample, base quality scores, distribution of A, C, G, and T bases, the number of missing bases, and the occurrence of overrepresented sequences (that could indicate contamination or untrimmed Illumina adapters, etc.)
The following figure shows where this fits into our overall workflow. We will be using FASTQC to profile each FASTQ file, and then we will summarize all those results with MULTIQC. There is a standard piece of Java software called FASTQC that does a reasonably comprehensive job of this. We will focus on running it interactively for now (we will talk about dispatching jobs via SLURM later). So, on the cluster, we will want to get a shell on a compute node. If you want to, on SUMMIT, you can get an interactive shell with:
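sinteractive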
But, to be honest, sinteractive
is a somewhat old-school shell function that uses
screen to connect you to the interactive shell. This causes some problems for TMUX. So,
I find a better way to get a shell from the oversubscribed interactive partition on
SUMMIT to be:
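# a sketch: the exact partition name (here shas-interactive) depends on the cluster configuration
srun --partition=shas-interactive -t 2:00:00 --pty /bin/bash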
When you do this, and then reconnect to such a session with TMUX, your scrollback shouldn’t
be totally messed up (Thanks Caitlin for reminding me of this sinteractive
atrocity!).
I’ve checked with the sys-admins on SUMMIT and they say this is a perfectly acceptable
way of getting an interactive shell on SUMMIT. Also, the -t 2:00:00
says
“give me this interactive shell for 2 hours,” which will give us more than
enough time to get through a whole class session.
The first thing we will need to do is get FASTQC using mamba. Put it in its own
environment named fastqc
.
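# fastqc is on the bioconda channel
mamba create -n fastqc -c bioconda fastqc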
Hooray for the wicked speed of mamba!
Once this is set up, we can get the instructions for using fastqc like this:
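conda activate fastqc
fastqc --help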
That is a fairly lengthy statement about the syntax of this program,
so I won’t copy it into the handbook here. But read through it and you
shall find that it looks like if we want to just
put all the fastq files on the command line
then fastqc
ought to gobble them right up and put the output in files that
are named intelligently. That sounds pretty convenient. Let’s test it out
on a couple of the smallest files that we have (about 15 Mb each):
# make some directories to put the results in
mkdir -p results/qc/fastqc results/logs/fastqc
# now, try running fastqc on two small files:
fastqc -o results/qc/fastqc --noextract fastq/s006---1_R1.fq.gz fastq/s006---1_R2.fq.gz
Note that fastqc spits out some output to tell you about the progress that it is making. It turns out that is going to both stderr (the percent complete lines) and to stdout (the “complete” lines). So, we probably should capture both of those streams in the future.
When it is done, look at the output in results/qc/fastqc
. See that there are files with
pretty self-explanatory names. For each FASTQ file there is an .html
file that has
a report file (with lots of pretty pictures) that can be opened in a web browser. Note that
you have to copy those to your laptop if you want to see them. You could, for example,
do that with Globus. The other output for each FASTQ file is a .zip
file of data
that we will use later.
That didn’t take too long. Let’s try the whole schmeer! We can do that easily with
globbing. Note the redirection of stdout to a log file and the 2>&1
which means, “send
stderr to the same place stdout is going to.”
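# run fastqc on all of the fastq files at once, sending stdout and stderr to a log file
fastqc -o results/qc/fastqc --noextract fastq/*.fq.gz > results/logs/fastqc/fastqc.log 2>&1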
Note that if you want to see how much progress is being made on the file
you can look at the bottom of the log file at results/logs/fastqc/fastqc.log
using the tail
command in another shell (opened for example with tmux, before you started
the last command). In fact, you can do one better than that by using tail -f
.
Since I had another shell going in tmux, I can do this with it:
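tail -f results/logs/fastqc/fastqc.log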
This continuously prints to stdout any new output that gets written to the fastqc.log file.
The whole thing should take about 5 to 10 minutes on an interactive shell
on SUMMIT.
27.4.1 A Little Looking at the HTMLs
Let’s just go through a few of the different pieces of information in here. It is relatively self-explanatory. I do want to call out the adapter content pane, and show that it found some instances of the Illumina Universal Adapter. These refer to Illumina’s TruSeq Version 3 adapters. So, we see that some of the reads include sequence from the adapters. Cool. We will come back to that.
27.4.2 Summarizing output from a lot of different files with MultiQC
As you probably noticed if you tried to open all the html reports for every FASTQ file, it is a bit laborious to look at every report…and this is just for a small number of files. Imagine if you had to do that for 768 different FASTQ files!
A good solution for summarizing such outputs is the Python-based program, MultiQC. This program offers an insightful system for summarizing log files and outputs from a wide range of different bioinformatics programs. It is capable of summarizing much more than just fastqc files, but that is how we will first use it.
Let’s get it and put it in a conda environment called multiqc.
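# multiqc is also on the bioconda channel
mamba create -n multiqc -c bioconda multiqc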
Once that is done (and if you use conda
instead of mamba
it might
never be done) you can activate the environment and get some information
about how to use multiqc with:
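conda activate multiqc
multiqc --help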
Reading the program syntax, it seems like we should be able to
point it to the results/qc/fastqc
directory and tell it to put its output
into a results/qc/multiqc
directory like this:
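multiqc -v -o results/qc/multiqc results/qc/fastqc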
Note that the command line above is using the -v
option, which gives us
verbose output. This is kind of fun because it shows us all the different kinds of
output file types that MultiQC can parse—a whole boatload of them. All we have
now are fastqc files. But we can see the summary report by downloading
results/qc/multiqc/multiqc_report.html
and opening it in your browser. Super cool!
27.5 Step 1b. Trimming the Reads
The fastqc reports showed that most of the sequences we have are of pretty good quality, but, as expected, the quality drops off a bit later in the read. Also, we saw that there were some adapter sequences stuck into 1 to 3% of the reads. We can use the tool Trimmomatic to:
- Trim off parts of reads that are of low quality.
- Chuck out entire reads that are of low quality.
- Remove adapter sequences that are improperly inserted into our reads.
To use Trimmomatic, we will have to get it. If you don’t have it in a conda environment yet, you know the drill:
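# trimmomatic is on bioconda; here we call the environment "trimmomatic"
mamba create -n trimmomatic -c bioconda trimmomatic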
Once that is installed you can get a terse and somewhat cryptic message about how to run the program like this:
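conda activate trimmomatic
trimmomatic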
Which tells us this:
Usage:
PE [-version] \
[-threads <threads>] \
[-phred33|-phred64] \
[-trimlog <trimLogFile>] \
[-summary <statsSummaryFile>] \
[-quiet] \
[-validatePairs] \
[-basein <inputBase> | <inputFile1> <inputFile2>] \
[-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] \
<trimmer1>...
or:
SE [-version] \
[-threads <threads>] \
[-phred33|-phred64] \
[-trimlog <trimLogFile>] \
[-summary <statsSummaryFile>] \
[-quiet] \
<inputFile> <outputFile> \
<trimmer1>...
or:
-version
Where I have added backslash-line-endings to make it a little easier to read.
Aha! We can run this dude in paired end mode (PE) or in single end mode (SE). Since we have paired-end data, we will run it in paired-end mode.
Full details of the syntax and the methodology can be found in the PDF Trimmomatic manual at http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf. This link seems to not work very well sometimes…
We will be running it on a single thread for now so the really relevant syntax is:
PE \
[-basein <inputBase> | <inputFile1> <inputFile2>] \
[-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] \
<trimmer1>...
This syntax message signifies that you can either give it a base file prefix
for the input and output files, like -basein fastq/s001-1 -baseout results/trimmed/s001-1
,
or you can explicitly give it the names of the input and output files. I tend to think that
the latter is a better way to do it (always good to be explicit).
So, for example, with our tidy-named files in the fastq directory, the command for the first pair of files might look like:
trimmomatic PE \
fastq/s001---1_R1.fq.gz fastq/s001---1_R2.fq.gz \
results/trimmed/s001---1_R1.fq.gz results/trimmed/s001---1_R1_unpaired.fq.gz \
results/trimmed/s001---1_R2.fq.gz results/trimmed/s001---1_R2_unpaired.fq.gz
So, the two input files are read 1 and read 2, and the four output files are in this order:
- Read 1 successfully trimmed.
- Read 1 that is no longer paired (because, for example, its mate was tossed out)
- Read 2 successfully trimmed
- Read 2 that is no longer paired.
The reads that you will use downstream, for mapping, will be the successfully trimmed ones.
Now, the more complex part of the command line syntax is the innocuous looking
<trimmer1>...
. That seemingly simple word belies the opportunity to specify all the
different types of read trimming modules and their parameters
(and the order in which they will be applied) that you want to use. Once again, the
PDF manual is the place to go, but a good place to start will be:
# get path to trimmomatic to find the path to the adapter files.
# It seems to look for them in a standard /usr/local/share location,
# which is not where they are if conda installed.
trim_path=$(which trimmomatic)
trim_dir=$(dirname $trim_path)
adapter_fasta=$trim_dir/../share/trimmomatic/adapters/TruSeq3-PE-2.fa
# make a results/trimmed directory to put the outputs into
mkdir -p results/trimmed
trimmomatic PE \
fastq/s001---1_R1.fq.gz fastq/s001---1_R2.fq.gz \
results/trimmed/s001---1_R1.fq.gz results/trimmed/s001---1_R1_unpaired.fq.gz \
results/trimmed/s001---1_R2.fq.gz results/trimmed/s001---1_R2_unpaired.fq.gz \
ILLUMINACLIP:${adapter_fasta}:2:30:10 \
LEADING:3 \
TRAILING:3 \
SLIDINGWINDOW:4:15 \
MINLEN:36
You can go ahead and evaluate those commands and see what happens when trimmomatic runs. And while it is running, we can try to unpack those last few lines:
- ILLUMINACLIP:${adapter_fasta}:2:30:10: This means remove Illumina adapters that match the sequences in the fasta file TruSeq3-PE-2.fa that is included with the Trimmomatic distribution. That file looks like this:
>PrefixPE/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>PE1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PE1_rc
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
>PE2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>PE2_rc
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
Those are the sequences (and the reverse complements) for the TruSeq version 3 adapters. It is pretty fun to look around in your fastq files to find these sequences…(We could play around with that if you want). This particular fasta file is set up to allow Trimmomatic to find spurious adapter sequences that arrive in your reads in a few different ways (see the Palindrome clipping section in the manual).
- LEADING:3: Remove bases at the beginning of the read that have a Phred-scaled quality score less than 3.
- TRAILING:3: Remove bases at the end of the read that have a Phred-scaled quality score less than 3.
- SLIDINGWINDOW:4:15: Scan the read with a 4-base sliding window and cut it once the average Phred-scaled quality score within the window drops below 15.
- MINLEN:36: Toss out any reads that end up shorter than 36 bp after the bad bases have been trimmed.
27.5.1 Cycling over all the files and trimming them
Ultimately, we are going to have to cycle over the fastq files in some way, so,
note that we could have written the above command line by defining a shell variable for the
file prefix. Let's clean this whole thing up with some shell variables, using s002---1 as an example:
# here is a variable to store all the trimmer directives:
TRIMMER="ILLUMINACLIP:${adapter_fasta}:2:30:10 \
LEADING:3 \
TRAILING:3 \
SLIDINGWINDOW:4:15 \
MINLEN:36"
# define a file prefix for s002---1
FP=s002---1
# now get the names of all the input files and output files
# and store them in a shell variable. Note the fun shell-fu here.
INFILES=$(echo fastq/${FP}_{R1,R2}.fq.gz)
OUTFILES=$(echo results/trimmed/${FP}_{R1,R2}{,_unpaired}.fq.gz)
# make a Log file too
LOG=results/logs/trimmomatic/$FP.log
# you might want to try "echo $INFILES" and "echo $OUTFILES"
# and "echo $LOG" to see what those look like.
# We need to make sure that we have directories to put all those
# output and log files into. We can do that with:
mkdir -p $(dirname $LOG $OUTFILES)
# now that we have defined those variables, we can write our
# command line like this:
trimmomatic PE $INFILES $OUTFILES $TRIMMER > $LOG 2>&1
Once we have figured out that whole collection of command lines,
and have verified that it works for
our “trial” value of FP
(i.e, s002---1
), it is time to cycle over all the
different possible values of FP
.
We can use globbing and the basename
utility to get all the file prefixes. Check this
out:
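# strip the directory and the _R1.fq.gz suffix from every read-1 file name
FILE_PREFIXES=$(basename -a -s _R1.fq.gz fastq/*R1.fq.gz)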
Evaluate that and then run echo $FILE_PREFIXES
to see what you have.
With that, we can then define a for loop to run over all 22 pairs of files:
# This stays outside of the for loop:
TRIMMER="ILLUMINACLIP:${adapter_fasta}:2:30:10 \
LEADING:3 \
TRAILING:3 \
SLIDINGWINDOW:4:15 \
MINLEN:36"
# get list of file prefixes
FILE_PREFIXES=$(basename -a -s _R1.fq.gz fastq/*R1.fq.gz)
for FP in $FILE_PREFIXES; do
INFILES=$(echo fastq/${FP}_{R1,R2}.fq.gz)
OUTFILES=$(echo results/trimmed/${FP}_{R1,R2}{,_unpaired}.fq.gz)
LOG=results/logs/trimmomatic/$FP.log
mkdir -p $(dirname $LOG $OUTFILES) # this could be done outside the for loop
trimmomatic PE $INFILES $OUTFILES $TRIMMER > $LOG 2>&1
done
You can run those, and they will likely take 15 to 20 minutes to get through. To finish them all, you would typically want to be in a tmux session or something similar so that the job is not killed when you log out.
Note that for most big and long jobs you wouldn’t run this in an interactive shell,
but, rather, you would run it through SLURM’s sbatch
. We will be covering that next
week when we start to talk about mapping. But for now, we just want to do some
interactive computing to get a flavor of how one can write and test scripts
on an interactive shell.
27.5.2 Multiqc your trimmomatics results
It turns out that Multiqc, which so nicely gobbled up and summarized our fastqc results, also knows how to make nice summaries of the log (stderr/stdout) output of Trimmomatic. So, once our Trimmomatic loop is done, we can run Multiqc and direct it to both the directory holding the fastqc output and the directory holding the Trimmomatic output, so as to get an HTML report that includes both. Witness:
conda activate multiqc
multiqc -v -f -o results/qc/multiqc results/qc/fastqc results/logs/trimmomatic
Note that the above command includes the -f
option to force it to overwrite existing
output.
As before, you get the results in results/qc/multiqc/multiqc_report.html, but you
have to get them on your laptop to view them easily.
27.6 Step 2. Mapping each unit to the genome
Mapping or aligning reads to a genome is one of the more computationally demanding steps in a typical WGS workflow. However, I find it somewhat more straightforward (less prone to errors and failures and bugs in the software) than some of the later computational steps (like creating a gVCF file from HaplotypeCaller).
Here is a figure showing where in the overall workflow this step fits in.
There are several different software programs available for alignment. I am a fan
of the bwa mem
algorithm, and it is commonly used in many production pipelines,
so that is what we will be using.
The output of bwa mem
can often be manipulated as it comes streaming out of the program
onto stdout. The standard tool for manipulating such streams is samtools. For this
. For this
reason, when we define a conda environment for mapping steps, we will typically
include both the bwa
program and the samtools
program in it. So, let’s make a
conda environment called bwasam
that has both tools in it:
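# bwa and samtools are both on bioconda; pinning samtools to 1.11 as discussed above
mamba create -n bwasam -c bioconda bwa samtools=1.11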
Once we have that environment, we can activate it and then look at the
syntax for bwa mem
by typing bwa mem
:
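conda activate bwasam
bwa mem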
The output of that is quite long. Thankfully, the default program values are often a good place to start, and typically work well. So, for our purposes, the most germane things to note are the main syntax statement:
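Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]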
This means you can give it options, and then you have to pass it the path to the genome (which is also the prefix used for the bwa indexes, as we shall see). And then you pass it a fastq file with read 1, and if you have paired end reads you pass it the corresponding read 2 fastq file.
The other option that we will use is described as:
-R STR read group header line such as '@RG\tID:foo\tSM:bar' [null]
This lets us add information to the output file about the origin of the collection of reads in the fastq files (like what sample and library prep they are from, what type of platform they were sequenced on, and the machine and lane).
To start out with, we simply want to make a SAM file that we can use
for further explorations of what a SAM/BAM file is, and what
it means for DNA to align to a reference genome. So, make sure that
you are in your non-model-wgs-example-data
directory, you have an
interactive compute shell, and that you have the bwasam
conda environment
activated. Here we go!
27.6.1 Indexing the reference genome
The reference genome we are working with is in resources/genome.fasta
.
In order to quickly and efficiently find places in the genome to which reads align,
it is necessary to index the reference genome. The bwa subcommand index
will
index the reference genome for bwa mem
to use.
To index the genome we do:
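bwa index resources/genome.fasta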
This takes about 40 seconds and creates the files:
resources/genome.fasta.amb resources/genome.fasta.bwt resources/genome.fasta.sa
resources/genome.fasta.ann resources/genome.fasta.pac
The particulars of these files are not too important. Just know that they
are required for bwa mem
to work. Note that the <idxbase>
in bwa mem
’s
syntax statement is just the prefix for these 5 files: resources/genome.fasta
.
27.6.2 A quick mapping run to stdout
Now that the genome is indexed let’s go ahead and map the first pair of files that we trimmed:
results/trimmed/s001---1_R1.fq.gz results/trimmed/s001---1_R2.fq.gz
and we will just let the output go streaming to our screen:
bwa mem resources/genome.fasta \
results/trimmed/s001---1_R1.fq.gz \
results/trimmed/s001---1_R2.fq.gz
You should see a whole lot of lines starting with @ come zooming by first, then you will see some things that look like progress messages. Eventually, a bunch of sequences start zooming by. You can hit cntrl-C to stop that.
OK! That was fun. Now, we are actually going to redirect that standard output
into a SAM file, but before we do that, we are going to want to add the -R
option
to define the read group for the sequences in this file (we will talk a lot more
about read groups in a moment).
So,
mkdir -p results/sam # a directory in which to put the output from stdout
mkdir -p results/logs/bwa # a directory in which to put the output from stderr
bwa mem \
-R '@RG\tID:s001_T199967_Lib-1_HY75HDSX2_1_AAGACCGT+CAATCGAC\tSM:T199967\tPL:ILLUMINA\tLB:Lib-1\tPU:HY75HDSX2.1.AAGACCGT+CAATCGAC' \
resources/genome.fasta \
results/trimmed/s001---1_R1.fq.gz results/trimmed/s001---1_R2.fq.gz > \
results/sam/s001---1.sam 2> results/logs/bwa/s001---1.log
While that is running (and it may run for a pretty good long while) let’s just note that the read-group string in there is a horrifying
bunch of textual gobbledygook. The information that goes into it is in the
file units.tsv
. More on read groups later.
For now, let’s start exploring how sequences can align, and the meaning of all the lines in the SAM file. This is documented in Section 17.4.
27.6.3 Mapping all 22 pairs of trimmed FASTQ files using a SLURM job array
Using all the tricks we learned in Section 8.4.3 we are ready to map these
things. In case you missed it in class, before you proceed with this section, you will
have to undertake all the steps (to create numbered-units.tsv
and the line_assign.sh
script) in Section 8.4.3.5. Otherwise things won’t work!
We can build on our mapping script in 8.4.2.6 to have
it use line_assign.sh
to choose different files to align to the genome in a
SLURM job array, according to the SLURM_ARRAY_TASK_ID. Hooray!
Here is what such a script looks like:
#!/bin/bash
#SBATCH --array=1-22
#SBATCH --time=02:00:00
#SBATCH --output=results/slurm_logs/map/map_s002---1-output-%A-%a
#SBATCH --error=results/slurm_logs/map/map_s002---1-error-%A-%a
#SBATCH --mail-type=ALL
#SBATCH --mail-user=yourname@yourmail.edu
source ~/.bashrc
conda activate bwasam
eval $(line_assign.sh $SLURM_ARRAY_TASK_ID numbered-units.tsv)
mkdir -p results/bam # a directory in which to put the output from stdout
mkdir -p results/logs/bwa # a directory in which to put the output from stderr
mkdir -p results/logs/samtools-sort
SU=${sample}---${unit}
RGID=${sample}_${sample_id}_${library}_${flowcell}_${lane}_${barcode}
RG="@RG\tID:${RGID}\tSM:${sample}\tPL:${platform}\tLB:${library}\tPU:${flowcell}.${lane}.${barcode}"
PREFIX=results/trimmed/$SU
OUT=results/bam/$SU.bam
BWALOG=results/logs/bwa/$SU.log
SAMSORTLOG=results/logs/samtools-sort/$SU.log
bwa mem \
-R $RG \
resources/genome.fasta \
${PREFIX}_R1.fq.gz ${PREFIX}_R2.fq.gz 2> $BWALOG | \
samtools view -u - | \
samtools sort -l 9 -o $OUT - 2> $SAMSORTLOG
As you can see, we were able to use most of our original test script. We just
need to define SU
and RG
a little differently, according to the
value of SLURM_ARRAY_TASK_ID
, and we added an #SBATCH --array=1-22
line. (Because there are 22 rows in our numbered-units.tsv
file!).
Also we named the slurm output and error files according to the %A_%a
nomenclature.
27.6.3.1 Step through this interactively to make sure it looks right
Before trying to run something like this, it is always good to “pretend”
it is being run as a SLURM array job by spoofing the variable
SLURM_ARRAY_TASK_ID
. In other words, on an interactive shell
you can just set a shell variable called SLURM_ARRAY_TASK_ID
to whatever value you want to (let’s choose 5 here!) like this:
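SLURM_ARRAY_TASK_ID=5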
then you can go through the above script line by line and make sure
that all the variables get set to appropriate values. After evaluating
each line in which a variable value gets set, check the value of that
variable using echo
.
If everything looks right, you could even evaluate all the 6 lines of the main command and let it run for a little while to make sure that what is coming into the output files and logs looks correct. But, you don’t have to do that now.
27.6.3.2 Put that script into map-array-job.sh
and run it with --test-only
Make a new script with nano by doing:
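mkdir -p scripts   # in case you don't have a scripts directory yet
nano scripts/map-array-job.sh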
paste the contents of the above script into it, change the email address for notification emails, and then save and exit.
Then run it under sbatch
but use the --test-only
option:
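sbatch --test-only scripts/map-array-job.sh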
That works, but note that --test-only
doesn’t seem to let you know that this
is an array job. Whatever!
27.6.3.3 Run just the first job
You should never launch a large job array without at least launching a few of them
first, to make sure that it doesn’t fail immediately. We can use --array=1-3
on the command line to launch just the first three jobs.
Note also that you always want to make sure that you have created any directories necessary for storing the slurm_log output.
Here we go!
# first, make sure the directories are there
mkdir -p results/slurm_logs/map/
# then launch the first three jobs
sbatch --array=1-3 scripts/map-array-job.sh
Then use your myjobs
alias to see when that starts. Once it has started, check the
file results/logs/bwa/s001---1.log
to make sure that stuff is coming out that looks
like what you expect.
27.7 Step 3. Mark duplicates in BAM files from each individual, and run samtools stats
on the results
I used to identify (and either remove or mark) duplicates with some of the subcommands
of the samtools
package. But it turns out that MarkDuplicates from
the Picard Tools package, from the Broad Institute, has a few
advantages. Chief among those is the fact that, with MarkDuplicates, it is not
necessary to merge all of the BAM files from a single individual and
library prep to mark duplicates. Although the documentation about this
is surprisingly spotty, Picard’s MarkDuplicates is
sufficiently aware of the read groups attached to different reads that it can
automatically only consider reads from the same library and individual
as PCR duplicates (so long as the read group of each group of reads has
been properly set, with the LB
read-group key set, as we did during
the mapping step, above).
In order to use MarkDuplicates, we first need to get the
Picard Tools package. So, our first task is going to be to get those
with mamba
. We will want version 2.26 (which is pretty
much the latest minor release as of this writing), so we do:
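# picard is on bioconda; the exact 2.26.x patch version available may differ slightly
mamba create -n picard -c bioconda picard=2.26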
Once that is installed we can activate the environment and check out the syntax for the MarkDuplicates tool.
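conda activate picard
picard MarkDuplicates --help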
The help file notes that duplicates are identified as reads in which each mate maps to the same location in the genome (i.e. they have the same 5’ position) and which have sequences that are similar. It also tells us that the read that is “retained” as the “non-duplicate” is chosen to have the highest base quality scores, on average.
We also see that we can provide multiple
BAM files, simultaneously, to MarkDuplicates, by giving the
-I
or --INPUT
command multiple times, following it, each time,
with the path to a different BAM file.
So, we need an easy way to write a command line that includes a
-I
option for each of the BAM files
that are associated with any single sample. We will do that
by making a TAB separated file from which we can pull values with
our line_assign.sh
script. This file will have three columns: index
will give
the row number in the file, sample
will give the name of the sample
(like s001
, s002
, etc.), and
bamopts
will hold a space-separated string of the BAM files that
hold reads from the sample, each preceded by the -I
option that
MarkDuplicates requires before each BAM path.
We can create such a file by processing our units.tsv
file with
the awk
utility.
You should go back and review the section on awk
’s use of associative (or “hash”)
arrays to better understand how the following combination of an awk script and the unix
sort
utility gives us a numbered TSV file that provides a list of bam file paths
that contain reads from a given sample. The comments
in the following script also give some detail.
awk -F"\t" '
BEGIN {
OFS="\t"; # separate output fields with TABs
}
NR==1 {next;} # ignore the first line
{ # in this block, store a space-separated string of "-I filepath.bam" for all the
# files that have reads from a given sample. The variable bams is an
# associative or "hash" array (like a "dictionary" in Python) that holds those
# strings.
bams[$1] = sprintf("%s -I results/bam/%s---%s.bam", bams[$1], $1, $2);
}
END { # at the end, print out the keys of bams and the paths associated with each
# in two columns
for(i in bams) print i, bams[i]
}
' units.tsv | \
sort | \
awk -F "\t" ' # print these lines out in a numbered TSV file.
BEGIN {printf("index\tsample\tbamopts\n");} # print a header line
{printf("%d\t%s\t%s\n", ++n, $1, $2, $3)} # print out each numbered line
' > numbered-sample-bams.tsv
The file, numbered-sample-bams.tsv
created by this short script looks like this:
index sample bamopts
1 s001 -I results/bam/s001---1.bam
2 s002 -I results/bam/s002---1.bam -I results/bam/s002---2.bam
3 s003 -I results/bam/s003---1.bam -I results/bam/s003---2.bam -I results/bam/s003---3.bam
4 s004 -I results/bam/s004---1.bam -I results/bam/s004---2.bam
5 s005 -I results/bam/s005---1.bam -I results/bam/s005---2.bam -I results/bam/s005---3.bam
6 s006 -I results/bam/s006---1.bam -I results/bam/s006---2.bam -I results/bam/s006---3.bam -I results/bam/s006---4.bam
7 s007 -I results/bam/s007---1.bam -I results/bam/s007---2.bam -I results/bam/s007---3.bam -I results/bam/s007---4.bam
8 s008 -I results/bam/s008---1.bam -I results/bam/s008---2.bam -I results/bam/s008---3.bam
This shows that there are 8 distinct samples, and each one of those samples is associated with between 1 and 4 BAM files of mapped reads.
Since these files are not large and running MarkDuplicates does
not take as much time as mapping reads, we will simply cycle over these different
8 samples using the line_assign.sh
script in an interactive shell
on a compute node. MarkDuplicates writes results out to a single BAM
file, and also writes out a “metrics” file that gives information on
how many duplicates were found. It also allows an optional --TAGGING_POLICY
option to instruct the program to make a note of whether each duplicate
is a PCR duplicate or an optical duplicate. We will invoke that
option.
Additionally, MarkDuplicates can create an index for the BAM file it produces.
This index is much like the one created by the samtools index
subcommand; however,
rather than creating an index file named file-prefix.bam.bai
, it creates
file-prefix.bai
. So, we will also instruct MarkDuplicates to create an index for the BAM file, as that will save us a step later on!
conda activate picard
mkdir -p results/mkdup-merged results/qc/mkdup-merged results/logs/mkdup-merged
for LINE in {1..8}; do
eval $(line_assign.sh $LINE numbered-sample-bams.tsv)
echo "Marking duplicates for sample $sample"
OUT=results/mkdup-merged/$sample.bam
MET=results/qc/mkdup-merged/$sample.metrics
LOG=results/logs/mkdup-merged/$sample.log
picard MarkDuplicates $bamopts \
--OUTPUT $OUT \
--METRICS_FILE $MET \
--TAGGING_POLICY All \
--CREATE_INDEX > $LOG 2>&1
done
That should only take a few minutes.
Once that is done we will look at the resulting bamfiles with samtools
,
and compare them to the BAM files before duplicates were marked, just to see
that some things have been marked as duplicates. Recall from
MarkDuplicates’ help information that duplicates have the 0x400 bit set
in their SAM flags. So, we can look at the sequences marked as duplicates
with samtools view
telling it, with the -f 0x400
option to only print
sequences having that 0x400 bit set.
For example:
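# the bwasam environment has samtools in it
conda activate bwasam
# show only the reads that have the 0x400 (duplicate) bit set in their SAM flag
samtools view -f 0x400 results/mkdup-merged/s001.bam | less -S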
Note that the final column on each row has some information tagged with DT
.
This lets us know whether the duplicate was a PCR duplicate (formed during
library prep) or an optical duplicate (a cluster on the sequencing machine
that was incorrectly split into two spots).
We can use awk to count up the number of different DT
tags in the
last column of each row having a duplicated sequence in it, like this:
# here we can count how many of each tag we see
samtools view -f 0x400 results/mkdup-merged/s001.bam | \
awk '{n[$NF]++} END {for(i in n) print i, n[i]}'
The DT
tag either has an LB
if it is a PCR-duplicate (it is a consequence
of the library, hence the LB
) or it has an SQ
if it is an optical
duplicate (it is a consequence of the sequencing platform, hence SQ
).
If you look at any of the original BAM files (before marking duplicates) you will find that none of them have any reads marked as duplicates (which is not surprising; that is why we ran them through MarkDuplicates).
When reads are marked as duplicates, they will be automatically ignored during variant calling (which we start in the next step!) by GATK’s variant-calling programs. They can also be ignored in ANGSD and most other variant calling programs.
27.7.1 Run samtools stats on the results
Each sample is now in a single BAM file. For quality checking these, it can be
a good idea to run samtools stats
on them. That utility returns a fairly
comprehensive summary of any BAM file.
This doesn’t take long to run, and we can easily do it with a for loop over filenames:
# make a directory for the results
mkdir -p results/qc/samtools-stats
conda activate bwasam # to have samtools
for FILE in results/mkdup-merged/*.bam; do
j=$(basename $FILE);
echo $j;
samtools stats $FILE > results/qc/samtools-stats/${j}_stats
done
When that is done, it is worth looking through all the different summaries. For example:
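# for example, page through the stats for sample s001
less -S results/qc/samtools-stats/s001.bam_stats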
They give the distribution of a lot of different features of the mapped reads.
MultiQC has some capability to plot some of the features in the samtools stats
summaries. You can see those by doing:
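# point MultiQC at the samtools-stats directory along with the directories it already knows about
conda activate multiqc
multiqc -v -f -o results/qc/multiqc results/qc/fastqc results/logs/trimmomatic results/qc/samtools-stats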
and then copying the multiqc_report.html to your laptop and opening it in a browser. With as little data as we have in these files, it is not super interesting to look at them.
MultiQC (at least a suitably new version of it) can also process the “metrics” files from MarkDuplicates. So you can do:
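# something like this: add the directory holding the MarkDuplicates metrics files
multiqc -v -f -o results/qc/multiqc results/qc/fastqc results/logs/trimmomatic \
    results/qc/samtools-stats results/qc/mkdup-merged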
For that to work, I had to get a newer version of MultiQC than I had, which meant I needed to do this first.
conda env remove --name multiqc
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
mamba create -n multiqc -c bioconda multiqc=1.12
Now we have 8 BAM files. Each one has all the reads from a single individual. Duplicates in them have been marked, and they each have an index. Therefore, it is time to proceed to the variant calling steps.
27.8 Step 4. Creating a gVCF file for each sample
Yay! We are finally going to start turning our alignments into variants. But don’t get too excited just yet, because the GATK gVCF workflow involves three separate steps to take us from our duplicate-marked BAM files to a VCF file with variants. Here we are doing the first step: creating one gVCF file for each of our samples.
Here is our DAG showing what steps we are talking about in this section.
This is the first step where we are starting to use the tools of the GATK package, so we will have to create a conda environment with GATK. That will be our first step. It is important at this time to get GATK version 4.2.5.0, which is the latest no-architecture version available on conda as of 3/29/22. The previous version, 4.2.4.1, had some bugs that caused problems when genotyping gVCFs from the genomics databases for some of my non-human data (details here). There might be some residual problems left in 4.2.5.0 (see here), but it should work for our small example data set today.
By now we know the drill:
# on scompile
mamba create -n gatk4.2.5.0 -c bioconda gatk4=4.2.5.0
# when that is done, activate it
conda activate gatk4.2.5.0
Since this is our first time with GATK, it is worth getting a glimpse of all the tools available from the GATK package:
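# list all of the tools available in the GATK package
gatk --list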
Holy Moly! That is a lot of different programs!
To get the syntax and documentation for any particular tool you can use
gatk ToolName --help
, so, for our case:
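gatk HaplotypeCaller --help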
Whoa! That spits out a lot of options. At this juncture, it is worth noting that the GATK tools have comprehensive documentation available on the web, too. Sometimes it is easier to read the docs there than on your Unix terminal. A Google search of GATK toolname, like “GATK HaplotypeCaller” will typically return the documentation as the first hit. Here is a link to the HaplotypeCaller documentation.
The documentation on the web also provides a nice example for how HaplotypeCaller should be called for this first step in the gVCF workflow. It looks like this:
Single-sample GVCF calling (outputs intermediate GVCF)
gatk --java-options "-Xmx4g" HaplotypeCaller \
-R Homo_sapiens_assembly38.fasta \
-I input.bam \
-O output.g.vcf.gz \
-ERC GVCF
So, we need as input:
- The reference genome the reads were mapped to
- The input BAM which must be sorted and preferably has been duplicate marked. It
also needs to be indexed, i.e. there must be a
.bai
file associated with each. - The name of the output file
- The option
-ERC GVCF
. This tells HaplotypeCaller to make a gVCF, rather than to attempt to go “direct-to-VCF.”
27.8.1 That’s a dict thing to do!
What isn’t mentioned in the documentation is that HaplotypeCaller also requires that
the reference genome be indexed in two different ways. It must have a .fai
file (for
example created using samtools faidx
) and it must have a .dict
or “dictionary” file
for the genome. This can also be created with samtools using samtools dict
. These
steps are visible in our workflow DAG. Let's make sure that we have both of those
necessary indexes by doing this:
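# the bwasam environment has samtools in it
conda activate bwasam
samtools faidx resources/genome.fasta                        # makes resources/genome.fasta.fai (already done above)
samtools dict resources/genome.fasta > resources/genome.dict # makes the .dict file that GATK expects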
27.8.2 Setting up a SLURM job array for this
We are going to run these jobs using a SLURM job array.
Each task will be given one CPU. Because of this, we also
want to tell HaplotypeCaller to only use a single thread for
its pair hidden Markov model (by default it uses 4 threads, which is not
helpful if we have allocated only a single core for the job). So, the
command line will include the line --native-pair-hmm-threads 1
.
Additionally, since all the GATK tools are implemented in Java, they run under the Java virtual machine on our computers. We can specify how much memory (RAM) is allocated to the Java virtual machine that hosts GATK. We have about 4848 Mb of RAM per core on SUMMIT, so we will give 4000 Mb, or about 4 Gb, of RAM to the Java virtual machine. This should be more than enough, but it is worth noting that sometimes you can run out of memory, in which case you would have to re-run the job and give it more memory. To provide a certain amount of memory to the Java virtual machine, or, in fact, to supply any particular options to Java, we use the --java-options option to gatk. This goes before the tool name, like this: gatk --java-options "quoted string of options to pass to Java" ToolName. In our case, that quoted string of options to pass to Java will look like "-Xmx4g", which basically says “give me 4 Gb of memory.”
We have 8 jobs to run (one for each of our samples). We first make a simple tab-separated file called numbered-samples.tsv that we can read with our line_assign.sh script. Here is some Unix fun to make such a file from our units.tsv file:
cut -f 1 units.tsv | uniq | \
awk '
NR==1 {printf("index\tsample\n"); next}
{printf("%d\t%s\n", ++n, $1);}
' > numbered-samples.tsv
Then we need to have a script that we can pass to sbatch. The following should be copied into scripts/gvcf-array-job.sh.
#!/bin/bash
#SBATCH --array=1-8
#SBATCH --time=02:00:00
#SBATCH --output=results/slurm_logs/gvcf/make-gvcf-output-%A-%a
#SBATCH --error=results/slurm_logs/gvcf/make-gvcf-error-%A-%a
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=yourname@yourmail.edu
source ~/.bashrc
conda activate gatk4.2.5.0
# this sets the shell variable "sample" to s001, s002, etc.
eval $(line_assign.sh $SLURM_ARRAY_TASK_ID numbered-samples.tsv)
outdir=results/gvcf # a directory in which to put the output gVCF file
logdir=results/logs/haplotype-caller # a directory in which to put the output from stderr
# make sure those directories exist
mkdir -p $outdir
mkdir -p $logdir
gatk --java-options "-Xmx4g" HaplotypeCaller \
-R resources/genome.fasta \
-I results/mkdup-merged/$sample.bam \
-O $outdir/$sample.g.vcf.gz \
--native-pair-hmm-threads 1 \
-ERC GVCF > $logdir/$sample.stdout 2> $logdir/$sample.stderr
27.8.3 Running interactively to watch the output
If you haven’t run any GATK tools before, it is worth doing a short run to watch the progress meter log go by (this is what gets written to stderr when you run GATK tools).
So, get an interactive shell, then spoof the SLURM_ARRAY_TASK_ID (i.e.
just make a variable named SLURM_ARRAY_TASK_ID
and set it to 1), and
run one iteration as follows:
conda activate gatk4.2.5.0
SLURM_ARRAY_TASK_ID=1
# this sets the shell variable "sample" to s001, s002, etc.
eval $(line_assign.sh $SLURM_ARRAY_TASK_ID numbered-samples.tsv)
outdir=results/gvcf # a directory in which to put the output gVCF file
logdir=results/logs/haplotype-caller # a directory in which to put the output from stderr
# make sure those directories exist
mkdir -p $outdir
mkdir -p $logdir
gatk --java-options "-Xmx4g" HaplotypeCaller \
-R resources/genome.fasta \
-I results/mkdup-merged/$sample.bam \
-O $outdir/$sample.g.vcf.gz \
--native-pair-hmm-threads 1 \
-ERC GVCF
Note that we are not redirecting stdout or stderr to any files, so the output just comes to our screen.
The “ProgressMeter” that comes flying out of GATK tools can be quite helpful for estimating how long a particular run on a particular data set might require.
When you have seen enough, kill that job with ctrl-C and then we will launch our jobs for real.
27.8.4 Launch the first 3 jobs of our array
As always, it is wise to launch just the first few jobs of an array to make sure things are set up correctly.
Before we start this, we must create the directories that we are redirecting the SLURM logs to:
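Given the #SBATCH --output and --error paths in the script above, that is just:
mkdir -p results/slurm_logs/gvcf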
Then, start the first three:
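One way to do that is to override the #SBATCH --array=1-8 directive in the script from the sbatch command line:
sbatch --array=1-3 scripts/gvcf-array-job.sh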
Once you are convinced that those are doing just fine, you can start up the remaining 5:
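Again overriding the array range on the command line:
sbatch --array=4-8 scripts/gvcf-array-job.sh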
While those are running, it would be a good time to go back and review what a gVCF file is all about (Section 20.2.2).
27.9 Step 5. Importing the gVCF files into a GenomicsDatabase
Section 20.2.3 provides some background on the genomics data base used by GATK, and it also outlines all of the options to the GATK program GenomicsDBImport. Here we will launch a series of jobs in a SLURM array, each one requesting 2 CPUs (so we can use 2 reader threads, effectively), to import all of our gVCF files made in the last step into a genomics data base.
As mentioned in Section 20.2.3, these jobs parallelize over chromosomes; however, for non-model organisms with more fragmented reference genomes, we parallelize the unplaced scaffolds over scaffold groups. These are just groups of short scaffolds that have been lumped together so that they can be merged and treated by GenomicsDBImport as a single “pseudo-chromosome.”
The lengths of all the chromosomes and the unplaced scaffolds in the reference genome are easily found in the .fai file created by samtools faidx. We have one of those in resources/genome.fasta.fai. It is fairly straightforward to operate on this file with R to create a file of chromosomes and a file of scaffold groups. In fact, a simple R script that does that can be viewed here.
Rather than go through that process, we are just going to download the resulting files chromosomes.tsv and scaffold_groups.tsv, in order to use them. To download these to our example workflow directory, you can give these two simple commands (from that example workflow directory!):
wget https://raw.githubusercontent.com/eriqande/mega-non-model-wgs-snakeflow/main/.test/config/chromosomes.tsv
wget https://raw.githubusercontent.com/eriqande/mega-non-model-wgs-snakeflow/main/.test/config/scaffold_groups.tsv
Look at those files to see how they are structured.
The documentation for GenomicsDBImport can be gotten using gatk GenomicsDBImport --help, or it can be found on the web.
27.9.1 Job array script for the chromosomes
First, we need to turn chromosomes.tsv into something that we can use with line_assign.sh. So let’s do that:
awk 'NR == 1 {printf("index\tchrom\n"); next} {printf("%d\t%s\n", ++n, $1);}' chromosomes.tsv > numbered-chromosomes.tsv
That gives us numbered-chromosomes.tsv, which looks like this:
index chrom
1 CM031199.1
2 CM031200.1
3 CM031201.1
4 CM031202.1
First we prepare a job array script for the chromosomes that uses numbered-chromosomes.tsv. We can just copy the one for making gVCFs, and then modify it as appropriate. After reading and understanding the following chunk of code, use a text editor like nano to paste it into the file scripts/genomics-db-array.sh.
#!/bin/bash
#SBATCH --array=1-4
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=2
#SBATCH --output=results/slurm_logs/gdb_import/output-%A-%a
#SBATCH --error=results/slurm_logs/gdb_import/error-%A-%a
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=yourname@yourmail.edu
source ~/.bashrc
conda activate gatk4.2.5.0
# this sets the shell variable "chrom" to the chromosome names etc.
eval $(line_assign.sh $SLURM_ARRAY_TASK_ID numbered-chromosomes.tsv)
outdir=results/genomics_db/$chrom # a directory in which the genomics DB will go
logdir=results/logs/genomics_db_import # a directory in which to put the output from stderr and stdout
# make sure those directories exist
mkdir -p $(dirname $outdir) # make the directory for this output directory to go into
mkdir -p $logdir
# here are the standard options that you will typically use
standard_opts=" --batch-size 50 --reader-threads 2 --genomicsdb-shared-posixfs-optimizations "
gatk --java-options "-Xmx4g" GenomicsDBImport \
$standard_opts \
$(awk 'NR>1 {printf(" -V results/gvcf/%s.g.vcf.gz ", $2);}' numbered-samples.tsv) \
--intervals $chrom \
--genomicsdb-workspace-path $outdir > $logdir/stdout_$chrom.txt 2> $logdir/stderr_$chrom.txt
Note the call to awk: it puts all the samples on the command line, with a -V before each one.
Now! Before we launch that, we must create the directories for the slurm_logs. Otherwise, SLURM will kill the jobs immediately upon submission, without any sort of reasonable error message indicating why they failed.
mkdir -p results/slurm_logs/gdb_import
# normally we would test a few, but since there are only
# four jobs in this array, let's launch them all...
sbatch scripts/genomics-db-array.sh
At this juncture, it is worth mentioning that the output of GenomicsDBImport goes
into a directory (named $outdir
in the above script). If that directory already
exists, and you call GenomicsDBImport with the --genomicsdb-workspace-path
option,
then you will get an error. The tool is careful not to overwrite an
existing data base. So, if you tried a run that failed after creating the output
directory, you will need to remove that before you try to relaunch the job.
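For example, if the import for chromosome CM031199.1 failed partway through, something like this (being careful, as always, with rm -r) would clear the way for a relaunch:
rm -r results/genomics_db/CM031199.1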
27.9.2 Job array script for the scaffold groups
Recall that GenomicsDBImport can use the --merge-contigs-into-num-partitions 1 “trick” to put many short scaffolds into a single genomics data base. We still have to do that here. Making a script for it involves two steps that differ from what we did with the chromosomes, above:
- We need to specify all the scaffolds in each scaffold group
- We need to use the --merge-contigs-into-num-partitions 1 option
Note that we have a lot of scaffolds in some scaffold groups, so that if we
tried to specify them all on the command line, we would end up with a very long
command line. Although Unix, today, is reasonably tolerant of long and convoluted
command lines, some flavors of Unix will hiccup or barf on very long command lines.
Accordingly, we will use a GATK feature that lets us specify genomic regions in a file—so
long as it is named with the extension .list
—rather than on the command line for this job.
First, we need to make a file numbered-scaff-groups.tsv
that will translate
a SLURM_ARRAY_TASK_ID
into the name of a scaffold group. We can do that with a little
Unix and awk:
awk 'NR>1 {print $1}' scaffold_groups.tsv | uniq | awk 'BEGIN {printf("index\tsg\n")} {printf("%d\t%s\n", ++n, $1)}' > numbered-scaff-groups.tsv
If you look at the file created there, you will see that it looks like this:
index sg
1 scaff_group_001
2 scaff_group_002
So, we can use it to turn the SLURM_ARRAY_TASK_ID into a shell variable sg that holds the name of the scaffold group we are dealing with. The following code block contains a shell script that you should paste into a new file called scripts/genomics-db-scaffold-array.sh:
#!/bin/bash
#SBATCH --array=1-2
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=2
#SBATCH --output=results/slurm_logs/gdb_import_scaff_groups/output-%A-%a
#SBATCH --error=results/slurm_logs/gdb_import_scaff_groups/error-%A-%a
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=yourname@yourmail.edu
source ~/.bashrc
conda activate gatk4.2.5.0
# this sets the shell variable "sg" to the scaffold group name
eval $(line_assign.sh $SLURM_ARRAY_TASK_ID numbered-scaff-groups.tsv)
outdir=results/genomics_db/$sg # a directory in which the genomics DB will go
logdir=results/logs/genomics_db_import # a directory in which to put the output from stderr and stdout
# make sure those directories exist
mkdir -p $(dirname $outdir) # make the directory for this output directory to go into
mkdir -p $logdir
# here are the standard options that you will typically use
standard_opts=" --batch-size 50 --reader-threads 2 --genomicsdb-shared-posixfs-optimizations "
# now, some new stuff for the scaffold group lists. We are going to make a file
# called results/sg_lists/$sg.list that contains the names of all the scaffolds in this
# scaffold group $sg.
mkdir -p results/sg_lists # make sure that this directory exists
sglist=results/sg_lists/$sg.list # file name for the scaffold names
awk -v sg=$sg 'NR >1 && $1==sg {print $2}' scaffold_groups.tsv > $sglist
# now, we call GenomicsDBImport, giving it the .list file of scaffold names as the intervals
gatk --java-options "-Xmx4g" GenomicsDBImport \
$standard_opts \
$(awk 'NR>1 {printf(" -V results/gvcf/%s.g.vcf.gz ", $2);}' numbered-samples.tsv) \
--intervals $sglist \
--merge-contigs-into-num-partitions 1 \
--genomicsdb-workspace-path $outdir > $logdir/stdout_$sg.txt 2> $logdir/stderr_$sg.txt
Then, since there are only two scaffold groups, we can run that as an array from 1 to 2. Of course, as always, be sure to create the slurm log output directories before you launch the job:
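Something like this should do it:
# make sure the slurm logs directory exists!!
mkdir -p results/slurm_logs/gdb_import_scaff_groups
# there are only two jobs in this array, so launch them both
sbatch scripts/genomics-db-scaffold-array.sh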
27.9.3 Hey! Let’s try updating the genomic data base
Recall from Section 20.2.4 that it is possible to add new individuals to a genomics data base. It is worth giving that a whirl. What we shall do here is create the data bases with the first six samples, and then update them by adding the last two samples to the existing data bases.
Before we do this, we might note that it is a bit of a hassle that we are starting to see such a proliferation of different “array” scripts in this section. We already have one script for chromosomes and one script for scaffold groups. Geez! If we now also have to make a separate script for updating chromosome and scaffold group data bases, that is going to be four separate scripts. Not only that, but the names of the files of numbered chromosomes and numbered scaffold groups that they use are hard-wired into the scripts, and if we are going to be building data bases and then, later, updating them, we might want to be able to tell the script which numbered-XX file to use.
So, here we will write a single, general script that can deal with these different use cases by taking options and positional parameters. It will make the script a little longer, but it will be nice to have it all in one place. This is a general theme in computer programming: if you can reduce the number of places (i.e., files or scripts) the same block of code appears (by using variables and options) then it is typically a lot easier to maintain the code base.
So, here is our big general script. You should paste this into:
scripts/general-genomics-db-array.sh
#!/bin/bash
#SBATCH --array=1-2
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=2
#SBATCH --output=results/slurm_logs/gdb_import_general/output-%A-%a
#SBATCH --error=results/slurm_logs/gdb_import_general/error-%A-%a
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=yourname@yourmail.edu
### Define default values for variables here. By default we are going to
### be doing "by chromosome", but the options we give can update these
OUTPARENT=results/genomics_db
DOING_SCAFF=0 # default is to do chromosomes
UPDATING=0 # default is to create a new data base
EXTRA=" "
logdir=results/logs/genomics_db_import # a directory in which to put the output from stderr and stdout
### Here is a lot of stuff for usage "help" messages and options processing
function usage {
echo Syntax:
echo " $(basename $0) [-h -u -s ScaffGroupFile -o OutputDirectory -l LogDirectory] NumberedSeqsFile NumberedSamplesFile "
echo
echo " -h: print help and exit."
echo " -l: put the resulting logs (from stderr and stdout) in directory LogDirectory."
echo " Default is results/logs/genomics_db_import"
echo " -o: put the genomicsDB within OutputDirectory (and make that dir if needed)."
echo " Default is results/genomics_db"
echo " -s: process scaffold groups given in ScaffGroupFile."
echo " -u: update the genomics data base instead of creating it."
echo
echo " ScaffGroupFile is like scaffold_groups.tsv"
echo " NumberedSeqsFile is either like numbered-chromosomes.tsv or numbered-scaff-groups.tsv"
echo " NumberedSamplesFile is like numbered-samples.tsv"
echo
}
if [ $# -eq 0 ]; then
usage;
exit 1;
fi;
while getopts "hl:o:s:u" opt; do
case $opt in
h )
usage
exit 1
;;
l )
logdir=$OPTARG;
;;
o )
OUTPARENT=$OPTARG;
;;
s )
SCAFF_GROUP_FILE=$OPTARG;
DOING_SCAFF=1
;;
u )
UPDATING=1;
;;
esac
done
shift $((OPTIND-1));
# check that we got the right number of required args
if [ $# -ne 2 ]; then
usage;
exit 1;
fi
NUMBERED_FILE=$1
NUMBERED_SAMPLES_FILE=$2
#### Done processing options and getting down to work
source ~/.bashrc
conda activate gatk4.2.5.0
# make sure the directories exist
mkdir -p $OUTPARENT # make the directory for this output directory to go into
mkdir -p $logdir
# this sets the shell variable "sg" to the scaffold group name
# or the variable "chrom" to the chromosome name
eval $(line_assign.sh $SLURM_ARRAY_TASK_ID $NUMBERED_FILE)
### Now we process things differently if we are doing scaff_groups or chromosomes
if [ $DOING_SCAFF = 0 ]; then
outdir=$OUTPARENT/$chrom
INTERVALS=" --intervals $chrom "
STDOUT=$logdir/stdout_$chrom.txt
STDERR=$logdir/stderr_$chrom.txt
else
outdir=$OUTPARENT/$sg
mkdir -p results/sg_lists # make sure that this directory exists
sglist=results/sg_lists/$sg.list # file name for the scaffold names
awk -v sg=$sg 'NR >1 && $1==sg {print $2}' $SCAFF_GROUP_FILE > $sglist
INTERVALS=" --intervals $sglist "
EXTRA=" --merge-contigs-into-num-partitions 1 "
STDOUT=$logdir/stdout_$sg.txt
STDERR=$logdir/stderr_$sg.txt
fi
### Now, change options if updating vs creating a data base
if [ $UPDATING = 0 ]; then
DB_OPT=" --genomicsdb-workspace-path "
else
DB_OPT=" --genomicsdb-update-workspace-path "
INTERVALS=""
fi
# here are the standard options that you will typically use
standard_opts=" --batch-size 50 --reader-threads 2 --genomicsdb-shared-posixfs-optimizations "
# here is where we make the call to GenomicsDBImport
gatk --java-options "-Xmx4g" GenomicsDBImport \
$standard_opts \
$(awk 'NR>1 {printf(" -V results/gvcf/%s.g.vcf.gz ", $2);}' $NUMBERED_SAMPLES_FILE) $INTERVALS $EXTRA \
$DB_OPT $outdir > $STDOUT 2> $STDERR
Once we have that script, we are going to use it to create genomics data bases from the first six samples for chromosomes and scaffold groups, and then we are going to update each of those data bases by adding the remaining two individuals to them.
We start by making some numbered-samples type of files.
awk 'BEGIN {OFS="\t"} NR==1 {print; next} $1 <= 6 {print}' numbered-samples.tsv > numbered-first-six-samples.tsv
awk 'BEGIN {OFS="\t"} NR==1 {print; next} $1 > 6 {printf("%d\t%s\n", ++n, $2);}' numbered-samples.tsv > numbered-last-two-samples.tsv
Then we start our jobs to create the data bases for the four chromosomes and for the two scaffold groups with the first six samples. We will put these into the directory results/stepwise_genomics_db.
# make sure the slurm logs directory exists!!
mkdir -p results/slurm_logs/gdb_import_general
# create 4 chromosome databases with the first six samples
sbatch --array=1-4 scripts/general-genomics-db-array.sh \
-l results/logs/stepwise_gdb_import/first-six \
-o results/stepwise_genomics_db \
numbered-chromosomes.tsv \
numbered-first-six-samples.tsv
# create 2 scaffold-group databases with the first six samples
sbatch --array=1-2 scripts/general-genomics-db-array.sh \
-l results/logs/stepwise_gdb_import/first-six \
-o results/stepwise_genomics_db \
-s scaffold_groups.tsv \
numbered-scaff-groups.tsv \
numbered-first-six-samples.tsv
We have to wait until each of those runs has completed. Once they do, check that all the genomics data bases have been created, by, for example:
# making sure that all the log files are there:
ls results/logs/stepwise_gdb_import/first-six/
# looking at the last ten lines of each of the log files
tail results/logs/stepwise_gdb_import/first-six/stderr*
# listing the directories of the data bases:
ls results/stepwise_genomics_db
# listing all the contents of those data bases
ls -R results/stepwise_genomics_db
If that all looks good, then it is time to update each of those data bases by adding the last two individuals to them. To do so, we give our script the -u option and give it the path to the directory holding the existing data bases with the -o option. And we also change the log file directory (with the -l option) so that the new logs go to a new place.
# update 4 chromosome databases by adding the last two samples to them
sbatch --array=1-4 scripts/general-genomics-db-array.sh \
-l results/logs/stepwise_gdb_import/last-two \
-o results/stepwise_genomics_db \
-u \
numbered-chromosomes.tsv numbered-last-two-samples.tsv
# update 2 scaffold-group databases by adding the last two samples to them
sbatch --array=1-2 scripts/general-genomics-db-array.sh \
-l results/logs/stepwise_gdb_import/last-two \
-o results/stepwise_genomics_db \
-s scaffold_groups.tsv \
-u \
numbered-scaff-groups.tsv numbered-last-two-samples.tsv
When those are done, you can check the log files to make sure that they look like they ran all right:
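For example, following the same pattern we used after the first-six run:
tail results/logs/stepwise_gdb_import/last-two/stderr*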
If they say something like:
08:31:11.156 INFO GenomicsDBImport - Importing batch 1 with 2 samples
08:31:19.147 INFO GenomicsDBImport - Done importing batch 1/1
08:31:19.155 INFO GenomicsDBImport - Import of all batches to GenomicsDB completed!
08:31:19.157 INFO GenomicsDBImport - Shutting down engine
[April 7, 2022 at 8:31:19 AM MDT] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.15 minutes.
Runtime.totalMemory()=385875968
then it has all worked according to plan!
27.10 Step 6. Calling variants and creating VCF files from the GenomicsDatabase
The final step to getting a VCF file with variants for all your samples is to run GenotypeGVCFs. This GATK tool can take a variety of things as input, but, for our purposes, the most useful is that it can take a genomics data base as input.
Recall that we now have six genomics data bases: one for each of the four chromosomes in our reference genome file, and one for each of the two scaffold groups. Each of those scaffold groups includes a large number of the unplaced scaffolds that exist in our reference genome.
Each of those data bases knows exactly which chromosomes or scaffolds are included in it—and also it has information about all the samples. In other words, it has all the information it needs to make a VCF file for the chromosome or scaffold group that it encompasses. Accordingly, it is pretty easy to run the GenotypeGVCFs tool on a genomics data base.
This step corresponds to the colored node in our now-familiar workflow DAG.
The webpage describing how this tool works can be found here.
The basic command, for the chromosome CM031202.1
(the shortest one!) in our example workflow looks like:
gatk --java-options "-Xmx4g" GenotypeGVCFs \
-R resources/genome.fasta \
-V gendb://results/genomics_db/CM031202.1 \
-O results/vcf_parts/CM031202.1.vcf
There are some other options, but none that we will explore today. The main
thing to note in the command line is that the path to the genomics data base
is preceded by gendb://
which is the way you would specify the location of a
data base as a URL.
Run the above command in an interactive shell on a compute node to see what happens. It is set up to use the genomics data bases that we created in results/genomics_db.
So, all we need to do now is make some simple array scripts. We could make a general purpose one that can be used with both the scaffold groups or the chromosomes, or just write two separate scripts. We will do the latter here.
27.10.1 Array script for the four chromosomes
The contents of our first script are below and should be copied into scripts/gdb2vcf-chromo-array.sh:
#!/bin/bash
#SBATCH --array=1-4
#SBATCH --time=01:00:00
#SBATCH --output=results/slurm_logs/gdb2vcf/output-%A-%a
#SBATCH --error=results/slurm_logs/gdb2vcf/error-%A-%a
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=yourname@yourmail.edu
source ~/.bashrc
conda activate gatk4.2.5.0
# this sets the shell variable "chrom" to the chromosome name
eval $(line_assign.sh $SLURM_ARRAY_TASK_ID numbered-chromosomes.tsv)
outdir=results/vcf_parts # a directory in which to put the output VCF file
logdir=results/logs/genotype-gvcfs # a directory in which to put the output from stderr
# make sure those directories exist
mkdir -p $outdir
mkdir -p $logdir
gatk --java-options "-Xmx4g" GenotypeGVCFs \
-R resources/genome.fasta \
-V gendb://results/genomics_db/$chrom \
-O $outdir/$chrom.vcf.gz > $logdir/stdout.$chrom 2> $logdir/stderr.$chrom
Note that when you name the output file with a .gz
extension, GATK will automatically
gzip the output.
Launch that job array.
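That might look like this (remembering, as always, that the SLURM log directory named in the #SBATCH lines must exist before you submit):
mkdir -p results/slurm_logs/gdb2vcf
sbatch scripts/gdb2vcf-chromo-array.sh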
27.10.2 Array script for the two scaffold groups
The contents of our scaffold group script are below and should be copied into scripts/gdb2vcf-sg-array.sh:
#!/bin/bash
#SBATCH --array=1-2
#SBATCH --time=01:00:00
#SBATCH --output=results/slurm_logs/gdb2vcf/output-%A-%a
#SBATCH --error=results/slurm_logs/gdb2vcf/error-%A-%a
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=yourname@yourmail.edu
source ~/.bashrc
conda activate gatk4.2.5.0
# this sets the shell variable "sg" to the scaffold group name
eval $(line_assign.sh $SLURM_ARRAY_TASK_ID numbered-scaff-groups.tsv)
outdir=results/vcf_parts # a directory in which to put the output VCF file
logdir=results/logs/genotype-gvcfs # a directory in which to put the output from stderr
# make sure those directories exist
mkdir -p $outdir
mkdir -p $logdir
gatk --java-options "-Xmx4g" GenotypeGVCFs \
-R resources/genome.fasta \
-V gendb://results/genomics_db/$sg \
-O $outdir/$sg.vcf.gz > $logdir/stdout.$sg 2> $logdir/stderr.$sg
Launch that job array.
# create slurm logs directory (we already did, above, but
# it is always good to think about this before launching a slurm job)
mkdir -p results/slurm_logs/gdb2vcf
# try just the first job
sbatch --array=1 scripts/gdb2vcf-sg-array.sh
# if it looks like that is working (check the stderr log output)
# then launch the remaining ones
sbatch --array=2 scripts/gdb2vcf-sg-array.sh
27.10.3 Have a look at the products
You can take a look at the resulting VCF files like this:
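For example, since they are gzipped, something like this will page through one of them (the choice of chromosome here is arbitrary):
zcat results/vcf_parts/CM031202.1.vcf.gz | less -S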
Have a look at them and see if it seems everything is in order!
27.10.4 Homework
Modify the two array scripts used above so that they use the genomics data bases that we put in results/stepwise_genomics_db while testing the sequential updating of the data bases. Put the output files into results/stepwise_vcf_parts. See if you get the same variants out of it (you should!).
27.10.5 Catenate all the vcf.gz files into a single VCF file and index it
After running the above steps, we have six different vcf.gz files: one for each of the four chromosomes, and one for each of the two scaffold groups. Sometimes it is nice to keep the VCF files for different chromosomes separate; however, because indexing a VCF file allows for rapid, indexed access to any place in the genome, there isn’t much reason to keep all the pieces separate unless you are going to be parsing each file with non-standard tools (i.e., reading each chromosome into R, which, I must say, is really NOT recommended—there are specialized tools for handling VCF files and you should become comfortable with them).
In order to catenate these six vcf.gz files into a single one, and to index it, we will use bcftools. We will get the latest version of bcftools available on conda at the time this was written, version 1.15.1, and put it in its own conda environment named bcftools.
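A command along these lines should create that environment (the exact channel ordering here is my assumption; the essential parts are the bcftools=1.15.1 pin and the environment name):
# on scompile, as before
mamba create -n bcftools -c conda-forge -c bioconda bcftools=1.15.1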
(Note the use of -c conda-forge in the above. This is present because some of the dependencies for bcftools are not found on the bioconda channel; rather, they are on the conda-forge channel. If your conda/mamba environment is not set up to search conda-forge by default, then the -c conda-forge is required to get all the dependencies.)
When that has finished, activate the environment and then type bcftools to get a list of all the different possible bcftools subcommands:
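That is simply:
conda activate bcftools
bcftools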
Right now we will be using bcftools concat to catenate the files, and then we will use bcftools index to index the result. We will put the result into results/vcf.
Note that if we order our chromosome names and our scaffold groups alphabetically they are in the correct (genomic) order, so it is pretty easy to specify them in that order on the command line, like this:
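A sketch of that, assuming we write the catenated file to results/vcf/all.vcf.gz (that output path and name are my choice):
mkdir -p results/vcf  # a directory for the final VCF
# the shell glob expands the six vcf.gz files in alphabetical (= genomic) order
bcftools concat -O z -o results/vcf/all.vcf.gz results/vcf_parts/*.vcf.gz
# index the result (-t makes a .tbi index)
bcftools index -t results/vcf/all.vcf.gz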
Note the -O z option, which makes the program produce compressed VCF format (hence we gave it the .vcf.gz extension).
There is another type of format common for VCF files, and that is “BCF” format. We
will also make a compressed BCF version (using the -O b
option). You should be
comfortable working with either format. BCF is simply a binary format of the same
information. It is not directly human readable, but that is not a problem, since
most of the time you will look at these files using bcftools view
(which is to VCF/BCF
files what samtools view
is to SAM/BAM files).
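For example, a BCF version of the same file could be made and indexed like this (again, the file names are my choices):
bcftools view -O b -o results/vcf/all.bcf results/vcf/all.vcf.gz
bcftools index results/vcf/all.bcf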
Now, we are going to use these files to learn all about bcftools
in Section 22.