Chapter 27 An example full workflow from sequences to variants

The purpose of this chapter is to present an accessible example workflow showing all the steps of taking raw sequences in FASTQ files through to variants in VCF files. It assumes that some sort of reference genome is available already (be it a chromosome-level assembly or just a scaffold-level assembly). The workflow shown here is based on the GATK “Best Practices,” with some modifications to make the workflow reasonable for non-model organisms that lack many of the genomic resources expected by the authors of the GATK “Best Practices” (like files of known variation in the species, etc.)

As an example data set we have created a small version of a typical NGS data set. The original data set included 384 Chinook salmon from the Yukon River basin, sequenced across 4 lanes of an Illumina NovaSeq machine. To create the subsetted data set, we extracted just the reads that map to a small part (about 10 megabases from each of four chromosomes, and then a variety of shorter scaffolds) of the latest (as of 2022) reference genome (Otsh_v2.0). for Chinook salmon. And we included data from only 8 of the 384 individual fish.

Included in the data set is a reference genome that has been downsized similarly, to include only those portions of the genome from which we extracted the read from those 8 fish.

Finally, the reads from different individuals were massaged to make it appear that some individuals were sequenced on multiple lanes or flow cells and so that some individuals underwent several different library preps. This is an important feature to ensure that we learn how to properly deal with sequencing of individual libraries across multiple lanes when marking PCR duplicates.

The data set is available publicly as a .tar file at this link: If you view that site on a browser you can choose to download the file.

Downloading the file to a Linux server can be done using wget on the command line. Because the file is large, a little bit of extra trickery is required to download it. This (rather complex) command line should work:

wget --load-cookies /tmp/cookies.txt "$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate '' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1LMK-DCkH1RKFAWTR2OKEJ_K9VOjJIZ1b" -O non-model-wgs-example-data.tar && rm -rf /tmp/cookies.txt

Once the file non-model-wgs-example-data.tar is downloaded, that .tar archive can be untarred with

tar -xvf non-model-wgs-example-data.tar

That will create the directory non-model-wgs-example-data, which contains all the files.

The contents of that directory look like this (in a tree view):

├── README.Rmd
├── README.nb.html
├── data
│   └── messy_names
│       └── fastqs
│           ├── T199967_T2087_HY75HDSX2_L001_R1_001.fastq.gz
│           ├── T199967_T2087_HY75HDSX2_L001_R2_001.fastq.gz
│           ├── T199968_T2087_HY75HDSX2_L001_R1_001.fastq.gz
│           ├── T199968_T2087_HY75HDSX2_L001_R2_001.fastq.gz
│           ├── T199968_T2087_HY75HDSX2_L002_R1_001.fastq.gz
│           ├── T199968_T2087_HY75HDSX2_L002_R2_001.fastq.gz
│           ├── T199969_T2087_HTYYCBBXX_L002_R1_001.fastq.gz
│           ├── T199969_T2087_HTYYCBBXX_L002_R2_001.fastq.gz
│           ├── T199969_T2087_HY75HDSX2_L002_R1_001.fastq.gz
│           ├── T199969_T2087_HY75HDSX2_L002_R2_001.fastq.gz
│           ├── T199969_T2087_HY75HDSX2_L003_R1_001.fastq.gz
│           ├── T199969_T2087_HY75HDSX2_L003_R2_001.fastq.gz
│           ├── T199970_T2087_HY75HDSX2_L003_R1_001.fastq.gz
│           ├── T199970_T2087_HY75HDSX2_L003_R2_001.fastq.gz
│           ├── T199970_T2094_HY75HDSX2_L004_R1_001.fastq.gz
│           ├── T199970_T2094_HY75HDSX2_L004_R2_001.fastq.gz
│           ├── T199971_T2087_HY75HDSX2_L002_R1_001.fastq.gz
│           ├── T199971_T2087_HY75HDSX2_L002_R2_001.fastq.gz
│           ├── T199971_T2087_HY75HDSX2_L004_R1_001.fastq.gz
│           ├── T199971_T2087_HY75HDSX2_L004_R2_001.fastq.gz
│           ├── T199971_T2099_HTYYCBBXX_L002_R1_001.fastq.gz
│           ├── T199971_T2099_HTYYCBBXX_L002_R2_001.fastq.gz
│           ├── T199972_T2087_HTYYCBBXX_L003_R1_001.fastq.gz
│           ├── T199972_T2087_HTYYCBBXX_L003_R2_001.fastq.gz
│           ├── T199972_T2087_HY75HDSX2_L001_R1_001.fastq.gz
│           ├── T199972_T2087_HY75HDSX2_L001_R2_001.fastq.gz
│           ├── T199972_T2094_HTYYCBBXX_L004_R1_001.fastq.gz
│           ├── T199972_T2094_HTYYCBBXX_L004_R2_001.fastq.gz
│           ├── T199972_T2094_HY75HDSX2_L002_R1_001.fastq.gz
│           ├── T199972_T2094_HY75HDSX2_L002_R2_001.fastq.gz
│           ├── T199973_T2087_HY75HDSX2_L002_R1_001.fastq.gz
│           ├── T199973_T2087_HY75HDSX2_L002_R2_001.fastq.gz
│           ├── T199973_T2087_HY75HDSX2_L003_R1_001.fastq.gz
│           ├── T199973_T2087_HY75HDSX2_L003_R2_001.fastq.gz
│           ├── T199973_T2094_HY75HDSX2_L002_R1_001.fastq.gz
│           ├── T199973_T2094_HY75HDSX2_L002_R2_001.fastq.gz
│           ├── T199973_T2094_HY75HDSX2_L003_R1_001.fastq.gz
│           ├── T199973_T2094_HY75HDSX2_L003_R2_001.fastq.gz
│           ├── T199974_T2087_HY75HDSX2_L001_R1_001.fastq.gz
│           ├── T199974_T2087_HY75HDSX2_L001_R2_001.fastq.gz
│           ├── T199974_T2094_HY75HDSX2_L001_R1_001.fastq.gz
│           ├── T199974_T2094_HY75HDSX2_L001_R2_001.fastq.gz
│           ├── T199974_T2099_HY75HDSX2_L001_R1_001.fastq.gz
│           └── T199974_T2099_HY75HDSX2_L001_R2_001.fastq.gz
├── fastq
│   ├── s001---1_R1.fq.gz -> ../data/messy_names/fastqs/T199967_T2087_HY75HDSX2_L001_R1_001.fastq.gz
│   ├── s001---1_R2.fq.gz -> ../data/messy_names/fastqs/T199967_T2087_HY75HDSX2_L001_R2_001.fastq.gz
│   ├── s002---1_R1.fq.gz -> ../data/messy_names/fastqs/T199968_T2087_HY75HDSX2_L001_R1_001.fastq.gz
│   ├── s002---1_R2.fq.gz -> ../data/messy_names/fastqs/T199968_T2087_HY75HDSX2_L001_R2_001.fastq.gz
│   ├── s002---2_R1.fq.gz -> ../data/messy_names/fastqs/T199968_T2087_HY75HDSX2_L002_R1_001.fastq.gz
│   ├── s002---2_R2.fq.gz -> ../data/messy_names/fastqs/T199968_T2087_HY75HDSX2_L002_R2_001.fastq.gz
│   ├── s003---1_R1.fq.gz -> ../data/messy_names/fastqs/T199969_T2087_HY75HDSX2_L002_R1_001.fastq.gz
│   ├── s003---1_R2.fq.gz -> ../data/messy_names/fastqs/T199969_T2087_HY75HDSX2_L002_R2_001.fastq.gz
│   ├── s003---2_R1.fq.gz -> ../data/messy_names/fastqs/T199969_T2087_HY75HDSX2_L003_R1_001.fastq.gz
│   ├── s003---2_R2.fq.gz -> ../data/messy_names/fastqs/T199969_T2087_HY75HDSX2_L003_R2_001.fastq.gz
│   ├── s003---3_R1.fq.gz -> ../data/messy_names/fastqs/T199969_T2087_HTYYCBBXX_L002_R1_001.fastq.gz
│   ├── s003---3_R2.fq.gz -> ../data/messy_names/fastqs/T199969_T2087_HTYYCBBXX_L002_R2_001.fastq.gz
│   ├── s004---1_R1.fq.gz -> ../data/messy_names/fastqs/T199970_T2087_HY75HDSX2_L003_R1_001.fastq.gz
│   ├── s004---1_R2.fq.gz -> ../data/messy_names/fastqs/T199970_T2087_HY75HDSX2_L003_R2_001.fastq.gz
│   ├── s004---2_R1.fq.gz -> ../data/messy_names/fastqs/T199970_T2094_HY75HDSX2_L004_R1_001.fastq.gz
│   ├── s004---2_R2.fq.gz -> ../data/messy_names/fastqs/T199970_T2094_HY75HDSX2_L004_R2_001.fastq.gz
│   ├── s005---1_R1.fq.gz -> ../data/messy_names/fastqs/T199971_T2087_HY75HDSX2_L002_R1_001.fastq.gz
│   ├── s005---1_R2.fq.gz -> ../data/messy_names/fastqs/T199971_T2087_HY75HDSX2_L002_R2_001.fastq.gz
│   ├── s005---2_R1.fq.gz -> ../data/messy_names/fastqs/T199971_T2087_HY75HDSX2_L004_R1_001.fastq.gz
│   ├── s005---2_R2.fq.gz -> ../data/messy_names/fastqs/T199971_T2087_HY75HDSX2_L004_R2_001.fastq.gz
│   ├── s005---3_R1.fq.gz -> ../data/messy_names/fastqs/T199971_T2099_HTYYCBBXX_L002_R1_001.fastq.gz
│   ├── s005---3_R2.fq.gz -> ../data/messy_names/fastqs/T199971_T2099_HTYYCBBXX_L002_R2_001.fastq.gz
│   ├── s006---1_R1.fq.gz -> ../data/messy_names/fastqs/T199972_T2087_HY75HDSX2_L001_R1_001.fastq.gz
│   ├── s006---1_R2.fq.gz -> ../data/messy_names/fastqs/T199972_T2087_HY75HDSX2_L001_R2_001.fastq.gz
│   ├── s006---2_R1.fq.gz -> ../data/messy_names/fastqs/T199972_T2087_HTYYCBBXX_L003_R1_001.fastq.gz
│   ├── s006---2_R2.fq.gz -> ../data/messy_names/fastqs/T199972_T2087_HTYYCBBXX_L003_R2_001.fastq.gz
│   ├── s006---3_R1.fq.gz -> ../data/messy_names/fastqs/T199972_T2094_HY75HDSX2_L002_R1_001.fastq.gz
│   ├── s006---3_R2.fq.gz -> ../data/messy_names/fastqs/T199972_T2094_HY75HDSX2_L002_R2_001.fastq.gz
│   ├── s006---4_R1.fq.gz -> ../data/messy_names/fastqs/T199972_T2094_HTYYCBBXX_L004_R1_001.fastq.gz
│   ├── s006---4_R2.fq.gz -> ../data/messy_names/fastqs/T199972_T2094_HTYYCBBXX_L004_R2_001.fastq.gz
│   ├── s007---1_R1.fq.gz -> ../data/messy_names/fastqs/T199973_T2087_HY75HDSX2_L002_R1_001.fastq.gz
│   ├── s007---1_R2.fq.gz -> ../data/messy_names/fastqs/T199973_T2087_HY75HDSX2_L002_R2_001.fastq.gz
│   ├── s007---2_R1.fq.gz -> ../data/messy_names/fastqs/T199973_T2087_HY75HDSX2_L003_R1_001.fastq.gz
│   ├── s007---2_R2.fq.gz -> ../data/messy_names/fastqs/T199973_T2087_HY75HDSX2_L003_R2_001.fastq.gz
│   ├── s007---3_R1.fq.gz -> ../data/messy_names/fastqs/T199973_T2094_HY75HDSX2_L002_R1_001.fastq.gz
│   ├── s007---3_R2.fq.gz -> ../data/messy_names/fastqs/T199973_T2094_HY75HDSX2_L002_R2_001.fastq.gz
│   ├── s007---4_R1.fq.gz -> ../data/messy_names/fastqs/T199973_T2094_HY75HDSX2_L003_R1_001.fastq.gz
│   ├── s007---4_R2.fq.gz -> ../data/messy_names/fastqs/T199973_T2094_HY75HDSX2_L003_R2_001.fastq.gz
│   ├── s008---1_R1.fq.gz -> ../data/messy_names/fastqs/T199974_T2087_HY75HDSX2_L001_R1_001.fastq.gz
│   ├── s008---1_R2.fq.gz -> ../data/messy_names/fastqs/T199974_T2087_HY75HDSX2_L001_R2_001.fastq.gz
│   ├── s008---2_R1.fq.gz -> ../data/messy_names/fastqs/T199974_T2094_HY75HDSX2_L001_R1_001.fastq.gz
│   ├── s008---2_R2.fq.gz -> ../data/messy_names/fastqs/T199974_T2094_HY75HDSX2_L001_R2_001.fastq.gz
│   ├── s008---3_R1.fq.gz -> ../data/messy_names/fastqs/T199974_T2099_HY75HDSX2_L001_R1_001.fastq.gz
│   └── s008---3_R2.fq.gz -> ../data/messy_names/fastqs/T199974_T2099_HY75HDSX2_L001_R2_001.fastq.gz
├── resources
│   └── genome.fasta
├── samples.tsv
└── units.tsv
  • The README.* files give some background on how this data set was whittled out of the larger, complete data set.
  • The directory data/messy_names/fastqs, contains all the FASTQ files as they might appear after downloading from the sequencing center. Characteristic of such files, they all have rather daunting names.
  • The directory fastq at the top level of the directory contains symbolic links to the big files with messy names. These are there simply to provide students a simpler variety of names of files to deal with when doing the initial steps of a sequencing exercise. But, ultimately, we will explore ways of directly addressing the files with more complex names.
  • The file samples.tsv just contains all the unique “tidy” sample names (like s001, s002, etc.) in a single column named sample.
  • The file units.tsv contains crucial information about all the different pairs of FASTQ files. Each pair is considered a separate “unit” of one of the samples.
    Some of the information in the file pertains to which flowcell and lane the sample-unit was sequenced on, as well as which library the sample-unit was prepared in. All of this relates to the “journey” that was taken by each little piece of DNA that was sequenced (Section 19.1). It is worth having a look at the file.
  • The file resources/genome.fasta contains the “genome” for this exercise.

For information about FASTQ and FASTA files see Sections 17.2 and 17.3.

27.1 An overview of the whole workflow

We provide here a simple(ish) flow diagram that shows the different steps/jobs in this workflow that have to be done. The diagram is made from a Snakemake workflow (see Chapter 14) that Eric has implemented for whole genome sequencing in non-model organisms. The acyclic directed graph (DAG) output from Snakemake has been condensed with the function,condense_dag(), from one of Eric’s R packages (SnakemakeDagR). Each step/job is a separate box, and the arrows between the boxes show the dependence of output from the parent box (at the tail end of the arrow) as input for the child box (at the head end of the arrow). The numbers in the boxes and along the arrows show how many independent instances of each job must be run (for different samples, units, chromosomes, etc.)
A DAG showing all the separate steps or jobs (called "rules" in Snakemake terminology) that we will be exploring while working through the example WGS workflow in this chapter.

FIGURE 27.1: A DAG showing all the separate steps or jobs (called “rules” in Snakemake terminology) that we will be exploring while working through the example WGS workflow in this chapter.

Burn this DAG into your memory. We will return to versions of it in each section below to give an indication of which part of the workflow we are tackling.

27.2 Let’s have a look at the reference genome

A number of people in the class had problems installing samtools on SUMMIT last week. The reason for this appears to be that the conda package management system migrated to a new version of the SSH library, but samtools looks for an older version of the libcrypto library (perhaps because it uses some of the hashing functions from that library). This was a known problem when conda upgraded its SSH dependencies, but seems to have resurfaced with later versions of samtools. However, it can be fixed by installing version 1.11 of samtools.

27.2.1 First off, make sure that you have added mamba to your base environment

I was playing with a lot of things over the weekend, and it was clear that mamba was orders of magnitude faster than conda for installing environments. (It almost felt like conda merely stalled on some environments—that wouldn’t make for fun classtime…just waiting for software to be installed).

So, I am going to require that people use mamba from here on out. First, test to see if you have mamba in your base conda environment: activate your base conda environment and see if you have mamba:

conda activate base

If this gives you a screen full of syntax about how to use mamba, like:

usage: mamba [-h] [-V] command ...

conda is a tool for managing and deploying applications, environments and packages.


positional arguments:
    clean        Remove unused packages and caches.
    compare      Compare packages between conda environments.
    config       Modify configuration values in .condarc. This is modeled after the git config command. Writes to the user .condarc file (/home/ by
    create       Create a new conda environment from a list of specified packages.
    help         Displays a list of available conda commands and their help strings.
    info         Display information about current conda install.
    init         Initialize conda for shell interaction. [Experimental]
    install      Installs a list of packages into a specified conda environment.
    list         List linked packages in a conda environment.
    package      Low-level conda package utility. (EXPERIMENTAL)
    remove       Remove a list of packages from a specified conda environment.
    uninstall    Alias for conda remove.
    run          Run an executable in a conda environment. [Experimental]
    search       Search for packages and display associated information. The input is a MatchSpec, a query language for conda packages. See examples below.
    update       Updates conda packages to the latest compatible version.
    upgrade      Alias for conda update.
    repoquery    Query repositories using mamba.

then you have mamba.

If, instead, it tells you it doesn’t know anything about the mamba command, then you need to install it. Make sure that you install it into your base conda environment, using this command:

conda install mamba -n base -c conda-forge

Now, if you need to get a working samtools (version 1.11) into an environment called samtools, you can do this:

mamba create -n samtools -c bioconda samtools=1.11

To use samtools you then must activate the environment

conda activate samtools

27.2.2 Indexing the reference genome

Samtools has wonderful subcommand, faidx that indexes a fasta file. We can index our example reference genome by running this command from the top level of our non-model-wgs-example-data directory.

samtools faidx resources/genome.fasta

This creates the file resources/genome.fasta.fai. The extension .fai stands for “faidx index.” Look at resources/genome.fasta.fai using the less viewer. The first few lines look like:

  CM031199.1      14000000        122     60      61
CM031200.1      12000000        14233578        60      61
CM031201.1      10000000        26433700        60      61
CM031202.1      6000000 36600489        60      61
JAEPEU010001976.1       276     42700570        60      61
JAEPEU010009262.1       333     42700932        60      61
JAEPEU010007942.1       367     42701352        60      61
JAEPEU010006675.1       416     42701807        60      61
JAEPEU010003074.1       440     42702311        60      61

Each line gives information about a differet sequence in the referenc fasta file, listed in the order that the sequences appear in the fasta file.

The five columns of each line hold this information:

  • NAME: Name of this reference sequence
  • LENGTH: Total length of this reference sequence, in bases
  • OFFSET: Offset in the FASTA/FASTQ file of this sequence’s first base
  • LINEBASES: The number of bases on each line
  • LINEWIDTH: The number of bytes in each line, including the newline

This information can be used to rapidly find the parts of the file where any position in the genome is found, for example, if you wanted to find the 100th base in sequence JAEPEU010001976.1 you can see that you have to go 42700570 bytes from the start of the file to get to base 1 of `JAEPEU010001976.1, and then you would have to go forward in the file for 99 bases. 59 bases forward would get you to the last base on the first line, then you have an additional byte to eat up (the line ending) and then 40 more bytes would put you at position 100. So, you could reach the desired point in the fasta file by just going 42700570 + 59 + 1 + 40 bytes forward in the file. This is the sort of thing that can be done very quickly by a computer.

Fortunately you don’t have to do this arithmetic in your head to find sequences at certain places in the reference genome. Rather, once you have indexed your FASTA file, you can use samtools faidx with any number of optional genomic coordinates, to extract those seqences.

A genomic coordinate (or “range”) of a sequence has the form:


For example:


gives the genomic coordinates (or the “genome address”) of the 150 base pair sequence whose first base is at position 50,151 in the genome, and whose last is at position 50,300. When using samtools faidx bases are numbered so that the first one is 1. This is called base-1 indexing. Other ways of addressing regions of genomes (like those used in BED files, use a different base-0 scheme.)

You can think of base-1 schemes as those that give the address of each base starting from 1. To understand a base-0 scheme in that light, it makes sense to think about addressing each SNP according to individual “pickets” that separate each base. (Probably should draw a picture, but you can see it at this UCSC genome browser blog site).

At any rate, if you want to get the 500 bases starting from 1,000,001 on the first two chromosomes in our example reference genome you would do:

samtools faidx resources/genome.fasta CM031199.1:1000001-1000500 CM031200.1:1000001-1000500

Note that if you find it difficult to keep track of all those zeroes, you are free to put commas into the numbers anywhere that is helpful, for example:

samtools faidx resources/genome.fasta CM031199.1:1,000,001-1,000,500 CM031200.1:1,000,001-1,000,500

So, if you ever need to extract a certain sequence from a genome, this is one way you can do it!

27.3 A word on organizing workflows

Now, before we start rolling along with our example bioinformatic workflow, let us take a moment and appreciate how clean and tidy our non-model-wgs-example-data directory is. If I do ls on it I see this:

% ls
data  fastq  README.nb.html  README.Rmd  resources  samples.tsv  units.tsv

That is all right! There are only seven files or directories there and it is not a mess.

As we proceed with a bioinformatic pipeline/workflow, it will behoove us to keep things super nice and clean. In fact I am going to recommend that nearly everything we do in our workflow should go inside a single directory called results.

In general, your workflow will involve several types of large files. Some of those are just things like genomes (.fasta files), genomic features (.gff) files, or known variants, etc. In general these sorts of things that you might download from a repository like GenBank should be placed into the resources directory, possibly in their own subdirectory within resources. The other types of files are those that get produced when you run bioinformatic software to analyze your data—for example, trimmed versions of your fastqs after you have trimmed off the low quality bases, etc. These latter types of files should all go into separate folders within a directory called results. The directory they go in should reflect what type of step in your bioinformatic process produced them, for example: trimmed, mapped, variants, etc.

Additionally, there should be a directory, results/logs where you should redirect stderr for each sample that is run in a certain process. All this will become clear as we start working through our example.

27.4 Step 1a. Quality Checking the Reads

The first thing we are going to do is Quality Check our fastq files. The goal of this step is to get a sense for how good (or bad) the data that we got back from the sequencing center is, in terms of number of reads per sample, base quality scores, distribution of A, C, G, and T bases, the number of missing bases, and the occurrence of overrepresented sequences (that could indicate contamination or untrimmed Illumina adapters, etc.)

The following figure shows where this fits into our overall workflow. We will be using FASTQC to profile each FASTQ file, and then we will summarize all those results with MULTIQC.
FASTQC steps and the associated MULTIQC step of this section shown in color.

FIGURE 27.2: FASTQC steps and the associated MULTIQC step of this section shown in color.

There is a standard piece of Java software called FASTQC that does a reasonably comprehensive job of this. We will focus on running it interactively for now (we will talk about dispatching jobs via SLURM later). So, on the cluster, we will want to get a shell on a compute node. If you want to, on SUMMIT, you can get an interactive shell with:


But, to be honest, sinteractive is a somewhat old-school shell function that uses screen to connect you to the interactive shell. This causes some problems for TMUX. So, I find a better way to get a shell from the oversubscriped interactive partition on SUMMIT to be:

srun --partition=shas-interactive -t 2:00:00 --pty /bin/bash

When you do this, and then reconnect to such a session with TMUX, your scrollback shouldn’t be totally messed up (Thanks Caitlin for reminding me of this sinteractive atrocity!). I’ve checked with the sys-admins on SUMMIT and they say this is a perfectly acceptable way of getting an interactive shell on SUMMIT. Also, the -t 2:00:00 says “give me this interactive shell for 2 hours,” which will give us more than enough time to get through a whole class session.

The first thing we will need to do is get FASTQC using mamba. Put it in its own environment named fastqc.

mamba create -n fastqc -c bioconda fastqc

Hooray for the wicked speed of mamba!

Once this is set up, we can get the instructions for using fastqc like this:

conda activate fastqc
fastqc --help

That is a fairly lengthy statement about the syntax of this program, so I won’t copy it into the handbook, here, But read through it and you shall find that it looks like if we want to just put all the fastq files on the command line then fastqc ought to gobble them right up and put the output in files that are named intelligently. That sounds pretty convenient. Let’s test it out on a couple of the smallest files that we have (about 15 Mb each):

# make some directories to put the results in
mkdir -p results/qc/fastqc results/logs/fastqc

# now, try running fastqc on two small files:
fastqc -o results/qc/fastqc --noextract fastq/s006---1_R1.fq.gz fastq/s006---1_R2.fq.gz

Note that fastqc spits out some output to tell you about the progress that it is making. It turns out that is going to both stderr (the percent complete lines) and to stdout (the “complete” lines). So, we probably should capture both of those streams in the future.

When it is done, look at the output in results/qc/fastqc. See that there are files with pretty self-explanatory names. For each FASTQ file there is an .html file that has a report file (with lots of pretty pictures) that can be opened in a web browser. Note that you have to copy those to your laptop if you want to see them. You could, for example, do that with Globus. The other output for each FASTQ file is a .zip file of data that we will use later.

That didn’t take too long. Let’s try the whole schmeer! We can do that easily with globbing. Note the redirection of stdout to a log file and the 2>&1 which means, “send stderr to the same place stdout is going to.”

fastqc -o results/qc/fastqc --noextract fastq/*.fq.gz  > results/logs/fastqc/fastqc.log 2>&1

Note that if you want to see how much progress is being made on the file you can look at the bottom of the log file at results/logs/fastqc/fastqc.log using the tail command in another shell (opened for example with tmux, before you started the last command). In fact, you can do one better than that by using tail -f.

Since I had another shell going in tmux, I can do this with it:

tail -f results/logs/fastqc/fastqc.log

this continuously spits new output to the fastqc.log file to stdout. The whole thing should take about 5 to 10 minutes on an interactive shell on SUMMIT.

27.4.1 A Little Looking at the HTMLs

Let’s just go through a few of the different pieces of information in here. It is relatively self-explanatory. I do want to call out the adapter content pane, and show that it found some instances of the Illumina Universal Adapter. These are refer to Illumina’s TruSeq Version 3 adapters. So, we see that some of the reads include sequence from the adapters. Cool. We will come back to that.

27.4.2 Summarizing output from a lot of different files with MultiQC

As you probably noticed if you tried to open all the html reports for every FASTQ file, it is a bit laborious to look at every report…and this is just for a small number of files. Imagine if you had to do that for 768 different FASTQ files!

A good solution for summarizing such outputs is the Java-based program, MultiQC. This program offers an insightful system for summarizing log files and outputs from a wide range of different bioinformatics programs. It is capable of summarizing much more than just fastqc files, but that is how we will first use it.

Let’s get it and put it in a conda environment called multiqc.

mamba create -n multiqc -c bioconda multiqc

Once that is done (and if you use conda instead of mamba it might never be done) you can activate the environment and get some information about how to use multiqc with:

conda activate multiqc
multiqc --help 

Reading the program syntax, it seems like we should be able to point it to the results/qc/fastqc directory and tell it to put its output into a results/qc/multiqc directory like this:

multiqc -v -o results/qc/multiqc results/qc/fastqc

Note that the command line above is using the -v option, which gives us verbose output. This is kind of fun because it shows us all the different kinds of output file types that MultiQC can parse—a whole boatload of them. All we have now our fastqc files. But we can see the summary report by downloading results/qc/multiqc/multiqc_report.html and opening it in your browser. Super cool!

27.5 Step 1b. Trimming the Reads

The fastqc reports showed that most of the sequences we have are of pretty good quality, but, as expected, the quality drops off a bit later in the read. Also, we saw that there were some adapter sequences stuck into 1 to 3% of the reads. We can use the tool TrimmoMatic to:

  1. Trim off parts of reads that are of low quality.
  2. Chuck out entire reads that are of low quality.
  3. Remove adapter sequences that are improperly inserted into our reads.
This step corresponds to the colored node in our now-familiar workflow DAG:
Running Trimmomatic to clean up our sequencing reads by removing adapter cruft and discarding parts of the reads with low quality.

FIGURE 27.3: Running Trimmomatic to clean up our sequencing reads by removing adapter cruft and discarding parts of the reads with low quality.

To use Trimmomatic, we will have to get it. If you don’t have it in a conda environment yet, you know the drill:

mamba create -n trimmo -c bioconda trimmomatic

Once that is installed you can get a terse and somewhat cryptic message about how to run the program like this:

conda activate trimmo

Which tells us this:

       PE [-version] \
          [-threads <threads>] \
          [-phred33|-phred64] \
          [-trimlog <trimLogFile>] \
          [-summary <statsSummaryFile>] \
          [-quiet] \
          [-validatePairs] \
          [-basein <inputBase> | <inputFile1> <inputFile2>] \
          [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] \
       SE [-version] \
          [-threads <threads>] \
          [-phred33|-phred64] \
          [-trimlog <trimLogFile>] \
          [-summary <statsSummaryFile>] \
          [-quiet] \
          <inputFile> <outputFile> \

Where I have added backslash-line-endings to make it a little easier to read.

Aha! We can run this dude in paired end mode (PE) or in single end mode (SE). Since we have paired-end data, we will run it in paired-end mode.

Full details of the syntax and the methodology can be found in the PDF Trimmomatic manual at This link seems to not work very well sometimes…

We will be running it on a single thread for now so the really relevant syntax is:

       PE \
          [-basein <inputBase> | <inputFile1> <inputFile2>] \
          [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] \

This syntax message signifies that you can either give it a base file prefix for the input and output files, like -basein fastq/s001-1 -baseout results/trimmed/s001-1, or you can explicitly give it the names of the input and output files. I tend to think that the latter is a better way to do it (always good to be explicit).

So, for example, with our tidy named files in the fastq directory, the first two lines there might look like:

trimmomatic PE \
  fastq/s001---1_R1.fq.gz  fastq/s001---1_R2.fq.gz   \
  results/trimmed/s001---1_R1.fq.gz  results/trimmed/s001---1_R1_unpaired.fq.gz  \
  results/trimmed/s001---1_R2.fq.gz  results/trimmed/s001---1_R2_unpaired.fq.gz 

So, the two input files are read 1 and read2, and the four output files are in this order:

  1. Read 1 successfully trimmed.
  2. Read 1 that is no longer paired (because, for example, its mate was tossed out)
  3. Read 2 successfully trimmed
  4. Read 2 that is no longer paired.

The reads that you will use downstream, for mapping, will be the successfully trimmed ones.

Now, the more complex part of the command line syntax is the innocouous looking <trimmer1>.... That seemingly simple word belies the opportunity to specify all the different types of read trimming modules and their parameters (and the order in which they will be applied) that you want to use. Once again, the PDF manual is the place to go, but a good place to start will be:

# get path to trimmomatic to find the path to the adapter files.
# It seems to look for them in a standard /usr/local/share location,
# which is not where they are if conda installed.
trim_path=$(which trimmomatic)
trim_dir=$(dirname $trim_path)

# make a results/trimmed directory to put the outputs into
mkdir -p results/trimmed

trimmomatic PE \
  fastq/s001---1_R1.fq.gz  fastq/s001---1_R2.fq.gz   \
  results/trimmed/s001---1_R1.fq.gz  results/trimmed/s001---1_R1_unpaired.fq.gz \
  results/trimmed/s001---1_R2.fq.gz  results/trimmed/s001---1_R2_unpaired.fq.gz  \
  ILLUMINACLIP:${adapter_fasta}:2:30:10 \

You can go ahead and evaluate those commands and see what happens when trimmomatic runs. And while it is running, we can try to unpack those last few lines:

  • ILLUMINACLIP:${adapter_fasta}:2:30:10: This means remove Illumina adapters that match the sequences in the fasta file TruSeq3-PE-2.fa that is included with the Trimmomatic distribution. That file looks like this:

Those are the sequences (and the reverse complements) for the TruSeq version3 adapters. It is pretty fun to look around in your fastq files to find there sequences…(We could play around with that if you want). This particular fasta file is set up to allow Trimmomatic to find spurious adapter sequences when they arrived in your reads by a few different things (see the Palindrome clipping section in the manual).

  • LEADING:3: Remove bases at the beginning of the read that have a Phred-scaled quality score less than 3.
  • TRAILING:3: Remove bases at the end of the read that have a Phred-scaled quality score less than 3.
  • SLIDINGWINDOW:4:15: Remove any bases within a sliding window of 4 bases that have an average Phred-scaled quality score of less than 15.
  • MINLEN:36: Toss out any reads that are only 36 bp long after the bad bases have been trimmed.

27.5.1 Cycling over all the files and trimming them

Ultimately, we are going to have to cycle over the fastq files in some way, so, note that we could have written the above command line by defining a shell variable for the file prefix. Lets clean this whole thing up with some shell variables, using s002---1 as an example

# here is a variable to store all the trimmer directives:
TRIMMER="ILLUMINACLIP:${adapter_fasta}:2:30:10 \

# define a file prefix for s002---1

# now get the names of all the input files and output files
# and store them in a shell variable.  Not the fun shell-fu here.
INFILES=$(echo fastq/${FP}_{R1,R2}.fq.gz)
OUTFILES=$(echo results/trimmed/${FP}_{R1,R2}{,_unpaired}.fq.gz)

# make a Log file too

# you might want to try "echo $INFILES" and "echo $OUTFILES" 
# and "echo $LOG" to see what those look like.

# We need to make sure that we have directories to put all those
# output and log files into.  We can do that with:
mkdir -p $(dirname $LOG $OUTFILES)

# now that we have defined those variables, we can write our 
# command line like this:

trimmomatic PE $INFILES $OUTFILES $TRIMMER > $LOG 2>&1

Once we have figured out that whole collection of command lines, and have verified that it works for our “trial” value of FP (i.e, s002---1), it is time to cycle over all the different possible values of FP.

We can use globbing and the basename utility to get all the file prefixes. Check this out:

FILE_PREFIXES=$(basename -a -s _R1.fq.gz fastq/*R1.fq.gz)

Evaluate that and then run echo $FILE_PREFIXES to see what you have.

With that, we can then define a for loop to run over all 22 pairs of files:

# This stays outside of the for loop:
TRIMMER="ILLUMINACLIP:${adapter_fasta}:2:30:10 \

# get list of file prefixes
FILE_PREFIXES=$(basename -a -s _R1.fq.gz fastq/*R1.fq.gz)

for FP in $FILE_PREFIXES; do
  INFILES=$(echo fastq/${FP}_{R1,R2}.fq.gz)
  OUTFILES=$(echo results/trimmed/${FP}_{R1,R2}{,_unpaired}.fq.gz)
  mkdir -p $(dirname $LOG $OUTFILES) # this could be done outside the for loop
  trimmomatic PE $INFILES $OUTFILES $TRIMMER > $LOG 2>&1


You can run those, and they will likely take 15 to 20 minutues to get through. To finish them all, you would typically want to be in a tmux session or something similar so that the job is not killed when you logout.

Note that for most big and long jobs you wouldn’t run this in an interactive shell, but, rather, you would run it through SLURM’s sbatch. We will be covering that next week when we start to talk about mapping. But for now, just want to do some interactive computing to get a flavor of how one can write and test scripts on an interactive shell.

27.5.2 Multiqc your trimmomatics results

It turns out that Multiqc, which so nicely gobbled up and summarized our fastqc results, also knows how to make nice summaries of the log (stderr/stdout) output of Trimmomatic. So, once our Trimmomatic loop is done, we can run Multiqc and direct it to both the directory holding the fastqc output and the directory holding the Trimmomatic output, so as to get an HTML report that includes both. Witness:

conda activate multiqc
multiqc -v -f -o results/qc/multiqc results/qc/fastqc results/logs/trimmomatic

Note that the above command includes the -f option to force it to overwrite existing output.

As before, you get the results in results/logs/trimmomatic/multiqc_report.html, but you have to get them on your laptop to view them easily.

27.6 Step 2. Mapping each unit to the genome

Mapping or aligning reads to a genome is one of the more computationally demanding steps in a typical WGS workflow. However, I find it somewhat more straightforward (less prone to errors and failures and bugs in the software) than some of the later computational steps (like creating a gVCF file from HaplotypeCaller).

Here is a figure showing where in the overall workflow this step fits in:
Mapping step (using bwa mem) of this section shown in color.

FIGURE 27.4: Mapping step (using bwa mem) of this section shown in color.

There are several different software programs available for alignment. I am a fan of the bwa mem algorithm, and it is commonly used in many production pipelines, so that is what we will be using.

The output of bwa mem can often be manipulated as it comes streaming out of the program onto stdout. The standard tool for manipulation such streams is samtools. For this reason, when we define a conda environment for mapping steps, we will typically include both the bwa program and the samtools program in it. So, let’s make a conda environment called bwasam that has both tools in it:

mamba create -n bwasam -c bioconda bwa=0.7.17 samtools=1.11

Once we have that environment, we can activate it and then look at the syntax for bwa mem by typing bwa mem:

conda activate bwasam
bwa mem

The output of that is quite long. Thankfully, the default program values are often a good place to start, and typically work well. So, for our purposes, the most germane things to note are the main syntax statement:

Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]

This means you can give it options, and then you have to pass it the path to the genome (which is also the prefix used for the bwa indexes, as we shall see). And then you pass it a fastq file with read 1, and if you have paired end reads you pass it the corresponding read 2 fastq file.

The other option that we will use is described as:

-R STR        read group header line such as '@RG\tID:foo\tSM:bar' [null]

This lets us add information to the output file about the origin of the collection of reads in the fastq files (like what sample and library prep they are from, what type of platform they were sequenced on, and the machine and lane).

To start out with, we simply want to make a SAM file that we can use for further explorations of what a SAM/BAM file is, and what it means for DNA to align to a reference genome. So, make sure that you are in your non-model-wgs-example-data directory, you have an interactive compute shell, and that you have the bwasam conda environment activated. Here we go!

27.6.1 Indexing the reference genome

The reference genome we are working with is in resources/genome.fasta. In order to quickly and efficiently find places in the genome to which reads align, it is necessary to index the reference genome. The bwa subcommand index will index the reference genome for bwa mem to use.

To index the genome we do:

bwa index resources/genome.fasta

This takes about 40 seconds and creates the files:

resources/genome.fasta.amb  resources/genome.fasta.bwt  resources/
resources/genome.fasta.ann  resources/genome.fasta.pac

The particulars of these files are not too important. Just know that they are required for bwa mem to work. Note that the <idxbase> in bwa mem’s syntax statement is just prefix for these 5 files: resources/genome.fasta.

27.6.2 A quick mapping run to stdout

Now that the genome is indexed let’s go ahead and map the first pair of files that we trimmed:

results/trimmed/s001---1_R1.fq.gz  results/trimmed/s001---1_R2.fq.gz

and we will just let the output go streaming to our screen:

bwa mem resources/genome.fasta \
   results/trimmed/s001---1_R1.fq.gz  \

You should see a whole lot of lines starting with @ come zooming by first, then you will see some things that look like progress messages. Eventually, a bunch of sequences start zooming by. You can hit cntrl-C to stop that.

OK! That was fun. Now, we are actually going to redirect that standard output into a SAM file, but before we do that, we are going to want to add the -R option to define the read group for the sequences in this file (we will talk a lot more about read groups in a moment).


mkdir -p results/sam  # a directory in which to put the output from stdout
mkdir -p results/logs/bwa  # a directory in which to put the output from stderr
bwa mem \
resources/genome.fasta \
results/trimmed/s001---1_R1.fq.gz  results/trimmed/s001---1_R2.fq.gz > \
results/sam/s001---1.sam 2> results/logs/bwa/s001---1.log

While that is running (and it may run for a pretty good long while) let’s just note that the read-group string in there is a horrifying bunch of textual gobbledygook. The information that goes into it is in the file units.tsv. More on read groups later.

For now, let’s start exploring how sequences can align, and the meaning of all the lines in the SAM file. This is documented in Section 17.4.

27.6.3 Mapping all 22 pairs of trimmed FASTQ files using a SLURM job array

Using all the tricks we learned in Section 8.4.3 we are ready to map these things. In case you missed it in class, before you proceed with this section, you will have to undertake all the steps (to create numbered-units.tsv and the script) in Section Otherwise things won’t work!

We can build on our mapping script in to have it use to choose different files to align to the genome in a SLURM job array, according to the SLURM_ARRAY_TASK_ID. Hooray!

Here is what such a script looks like:


#SBATCH --array=1-22
#SBATCH --time=02:00:00
#SBATCH --output=results/slurm_logs/map/map_s002---1-output-%A-%a
#SBATCH --error=results/slurm_logs/map/map_s002---1-error-%A-%a 
#SBATCH --mail-type=ALL 

source ~/.bashrc
conda activate bwasam
eval $( $SLURM_ARRAY_TASK_ID numbered-units.tsv) 

mkdir -p results/bam  # a directory in which to put the output from stdout
mkdir -p results/logs/bwa  # a directory in which to put the output from stderr
mkdir -p results/logs/samtools-sort




bwa mem \
-R $RG \
resources/genome.fasta \
${PREFIX}_R1.fq.gz  ${PREFIX}_R2.fq.gz 2> $BWALOG |  \
samtools view -u - |  \
samtools sort -l 9 -o $OUT - 2> $SAMSORTLOG

As you can see, we were able to use most of our original test script. We just need to define SU and RG a little differently, according to the value of SLURM_ARRAY_TASK_ID, and we added an #SBATCH --array=1-22 line. (Because there are 22 rows in our numbered-units.tsv file!). Also we named the slurm output and error files according to the %A_%a nomenclature. Step through this interactively to make sure it looks right

Before trying to run something like this, it is always good to “pretend” it is being run as a SLURM array job by spoofing the variable SLURM_ARRAY_TASK_ID. In other words, on an interactive shell you can just set a shell variable called SLURM_ARRAY_TASK_ID to whatever value you want to (let’s choose 5 here!) like this:


then you can go through the above script line by line and make sure that all the variables get set to appropriate values. After evaluating each line in which a variable value gets set, check the value of that variable using echo.

If everything looks right, you could even evaluate all the 6 lines of the main command and let it run for a little while to make sure that what is coming into the output files and logs looks correct. But, you don’t have to do that now. Put that script into and run it with --test-only

Make a new script with nano by doing:

nano scripts/

paste the contents of the above script into it, change the email address for notification emails, and the save and exit.

Then run it under sbatch but use the --test-only option:

sbatch --test-only scripts/

That works, but note that --test-only doesn’t see to let you know that this is an array job. Whatever! Run just the first job

You should never launch a large job array without at least launching a few of them first, to make sure that it doesn’t fail immediately. We can use --array=1-3 on the command launch just the first three jobs.

Note also that you always want to make sure that you have created any directories necessary for storing the slurm_log output.

Here we go!

# first, make sure the directories are there
mkdir -p results/slurm_logs/map/

# then launch the first three jobs
sbatch --array=1-3 scripts/

Then use your myjobs alias to see when that starts. Once it has started, check the file results/logs/bwa/s001---1.log to make sure that stuff is coming out that looks like what you expect. If the last step looked good, then launch the remaining jobs

Once again, the power of overriding the internal SBATCH directives in the script come in handy here, setting --array=4-22 to do the remaining 19 jobs.

sbatch --array=4-22 scripts/

That should map all those trimmed reads to the genome. Woo-hoo!

27.7 Step 3. Mark duplicates in BAM files from each individual, and run samtools stats on the results

I used to identify (and either remove or mark) duplicates with some of the subcommands of the samtools package. But it turns out that MarkDuplicates from the Picard Tools package, from the Broad Institute, has a few advantages. Chief among those is the fact that, with MarkDuplicates, it is not necessary to merge all of the BAM files from a single individual and library prep to mark duplicates. Although the documentation about this is surprisingly spotty, Picard’s MarkDuplicates is sufficiently aware of the read groups attached to different reads that it can automatically only consider reads from the same library and individual as PCR duplicates (so long as the read group of each group of reads has been properly set, with the LB read-group key set, as we did during the mapping step, above).

Here is our DAG showing what steps we are talking about in this section:
Marking PCR and optical duplicates and also running samtools stats described in this section.

FIGURE 27.5: Marking PCR and optical duplicates and also running samtools stats described in this section.

In order to use MarkDuplicates, we first need to get the Picard Tools package. So, our first task is going to be to get those with mamba. We will want version 2.26 (which is pretty much the latest minor release as of this writing), so we do:

mamba create -n picard -c bioconda picard=2.26

Once that is installed we can activate the environment and check out the syntax for the MarkDuplicates tool.

conda activate picard
picard MarkDuplicates --help

The help file notes that duplicates are identified as reads in which each mate maps to the same location in the genome (i.e. they have the same 5’ position) and which have sequences that are similar. It also tells us that the read that is “retained” as the “non-duplicate” is chosen to have the highest base quality scores, on average.

We also see that we can provide multiple BAM files, simultaneously, to MarkDuplicates, by giving the -I or --INPUT command multiple times, following it, each time, with the path to a different BAM file.

So, we need an easy way to write a command line that includes a -I option for each of the BAM files that are associated with any single sample. We will do that by making a TAB separated file from which we can pull values with our script. This file will have three columns: index will give the row number in the file, sample will give the name of the sample (like s001, s002, etc.), and bamopts will hold a space-separated string of the BAM files that hold reads from the sample, each preceded by the -I option that MarkDuplicates requires before each BAM path.

We can create such a file by processing our units.tsv file with the awk utility. You should go back and review the section on awk’s use of associative (or “hash”) arrays to better understand how the following combination of an awk script and the unix sort utility gives us a numbered TSV file that provides a list of bam file paths that contain reads from a given sample. The comments in the following script also give some detail.

awk -F"\t" '
    OFS="\t";     # separate output fields with TABs 
  NR==1 {next;}   # ignore the first line
  { # in this block, store a space-separated string of "-I filpath.bam" for all the
    # files that have reads from a give sample.  The variable bams is an
    # associative or "hash" array (like a "dictionary" in Python) that holds those
    # strings.
    bams[$1] = sprintf("%s -I results/bam/%s---%s.bam", bams[$1], $1, $2);
  END { # at the end, print out the keys of bams and the paths associated with each
        # in two columns
    for(i in bams) print i, bams[i]
' units.tsv | \
sort |  \
awk -F "\t" ' # print these lines out in a numbered TSV file.
  BEGIN {printf("index\tsample\tbamopts\n");}  # print a header line
  {printf("%d\t%s\t%s\n", ++n, $1, $2, $3)}  # print out each numbered line
' > numbered-sample-bams.tsv

The file, numbered-sample-bams.tsv created by this short script looks like this:

index   sample  bamopts
1   s001     -I results/bam/s001---1.bam
2   s002     -I results/bam/s002---1.bam -I results/bam/s002---2.bam
3   s003     -I results/bam/s003---1.bam -I results/bam/s003---2.bam -I results/bam/s003---3.bam
4   s004     -I results/bam/s004---1.bam -I results/bam/s004---2.bam
5   s005     -I results/bam/s005---1.bam -I results/bam/s005---2.bam -I results/bam/s005---3.bam
6   s006     -I results/bam/s006---1.bam -I results/bam/s006---2.bam -I results/bam/s006---3.bam -I results/bam/s006---4.bam
7   s007     -I results/bam/s007---1.bam -I results/bam/s007---2.bam -I results/bam/s007---3.bam -I results/bam/s007---4.bam
8   s008     -I results/bam/s008---1.bam -I results/bam/s008---2.bam -I results/bam/s008---3.bam

This shows that there are 8 distinct samples, and each one of those samples is associated with between 1 and 4 BAM files of mapped reads.

Since these files are not large and running MarkDuplicates does not take as much time as mapping reads, we will simply cycle over these different 8 samples using the script in an interactive shell on a compute node. MarkDuplicates writes results out to a single BAM file, and also writes out a “metrics” file that gives information on how many duplicates were found. It also allows an optional --TAGGING_POLICY option to instruct the program to make a note of whether each duplicate is a PCR duplicate or an optical duplicate. We will invoke that option.

Additionally, MarkDuplicates can create an index for the BAM file it produces. This index is much like the one created by the samtools index subcommand; however, rather than creating an index file named file-prefix.bam.bai, it creates file-prefix.bai. So, we will also instruct MarkDuplicates to create a directory for the BAM file, as that will save us a step later on!

conda activate picard
mkdir -p results/mkdup-merged results/qc/mkdup-merged results/logs/mkdup-merged
for LINE in {1..8}; do
  eval $( $LINE numbered-sample-bams.tsv)
  echo "Marking duplicates for sample $sample"
  picard MarkDuplicates $bamopts \
      --OUTPUT $OUT \
      --TAGGING_POLICY All \
      --CREATE_INDEX > $LOG 2>&1

That should only take a few minutes.

Once that is done we will look at the resulting bamfiles with samtools, and compare them to the BAM files before duplicates were marked, just to see that some things have been marked as duplicates. Recall from MarkDuplicates’ help information that duplicates have the 0x400 bit set in their SAM flags. So, we can look at the sequences marked as duplicates with samtools view telling it, with the -f 0x400 option to only print sequences having that 0x400 bit set.

For example:

conda activate bwasam
samtools view -f 0x400 results/mkdup-merged/s001.bam | less

Note that the final column on each row has some information tagged with DT. This let’s us know whether the duplicate was a PCR duplicate (formed during library prep) or an optical duplicate (a cluster on the sequencing machine that was incorrectly split into two spots).

We can use awk to count up the number of different DT tags in the last column of each row having a duplicated sequence in it, like this:

# here we can count how many of each tag we see
samtools view -f 0x400 results/mkdup-merged/s001.bam | \
  awk '{n[$NF]++} END {for(i in n) print i, n[i]}'

The DT tag either has an LB if it is a PCR-duplicate (it is a consequence of the library, hence the LB) or it has an SQ if it is an optical duplicate (it is a consequence of the sequencing platform, hence SQ).

If you look at any of the original BAM files (before marking duplicates) you will find that none of them have any reads marked as duplicates (which is not surpring—that is why we ran them through MarkDuplicates.)

samtools view -f 0x400 results/bam/s001---1.bam | head

When reads are marked as duplicates, they will be automatically ignored during variant calling (which we start in the next step!) by GATK’s variant-calling programs. They can also be ignored in ANGSD and most other variant calling programs.

27.7.1 Run samtools stats on the results

Each sample is now in a single BAM file. For quality checking these, it can be a good idea to run samtools stats on them. That utility returns a fairly comprehensive summary of any BAM file.

This doesn’t take long to run, and we can easily do it with a for loop over filenames:

# make a directory for the results
mkdir -p results/qc/samtools-stats

conda activate bwasam  # to have samtools
for FILE in results/mkdup-merged/*.bam; do 
  j=$(basename $FILE); 
  echo $j; 
  samtools stats $FILE > results/qc/samtools-stats/${j}_stats  

When that is done, it is worth looking through all the different summaries. For example:

less results/qc/samtools-stats/s008.bam_stats

They give the distribution of a lot of different features of the mapped reads.

MultiQC has some capability to plot some of the features in the samtools stats summaries. You can see those by doing.

conda activate multiqc
multiqc results/qc/samtools-stats/*.bam_stats

and then copying the multiqc_report.html to your laptop and opening it in a browser. With as little data as we have in these files, it is not super interesting to look at them.

MultiQC (at least a suitable new version of it) can also process the “metrics” files from MarkDuplicates. So you can do:

multiqc -f results/qc/samtools-stats/*.bam_stats results/qc/mkdup-merged/*.metrics

For that to work, I had to get a newer version of MultiQC than I had, which meant I needed to do this first.

conda env remove --name multiqc
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
mamba create -n multiqc -c bioconda multiqc=1.12

Now we have 8 BAM files. Each one has all the reads from a single individual. Duplicates in them have been marked, and they each have an index. Therefore, it is time to proceed to the variant calling steps.

27.8 Step 4. Creating a gVCF file for each sample

Yay! We are finally going to start turning our alignments into variants. But don’t get too excited just yet, because the GATK gVCF workflow involves three separate steps to take us from our duplicate-marked BAM files to a VCF file with variants. Here we are doing the first step: creating one gVCF file for each of our samples.

Here is our DAG showing what steps we are talking about in this section:
Running HaplotypeCaller to create one gVCF file for each sample.

FIGURE 27.6: Running HaplotypeCaller to create one gVCF file for each sample.

This is the first step where we are starting to use the tools of the GATK package, so we will have to create a conda environment with GATK. That will be our first step. It is important at this time to get GATK version, which is the latest no-architecture version available on conda as of 3/29/22. The previous version, had some bugs that caused problems when genotyping gVCFs from the genomics data bases for some of my non-human data (details here). There might be some residual problems left in (see here), but it should work for our small example data set, today.

By now we know the drill:

# on scompile
mamba create -n gatk4.2.5.0 -c bioconda gatk4=

# when that is done, activate it
conda activate gatk4.2.5.0

Since this is our first time with GATK, it is worth getting a glimpse of all the tools available from the GATK package:

gatk --list

Holy Moly! That is a lot of different programs!

To get the syntax and documentation for any particular tool you can use gatk ToolName --help, so, for our case:

gatk HaplotypeCaller --help

Whoa! That spits out a lot of options. At this juncture, it is worth noting that the GATK tools have comprehensive documentation available on the web, too. Sometimes it is easier to read the docs there than on your Unix terminal. A Google search of GATK toolname, like “GATK HaplotypeCaller” will typically return the documentation as the first hit. Here is a link to the HaplotypeCaller documentation.

The documentation on the web also provides a nice example for how HaplotypeCaller should be called for this first step in the gVCF workflow. It looks like this:

Single-sample GVCF calling (outputs intermediate GVCF)

 gatk --java-options "-Xmx4g" HaplotypeCaller  \
   -R Homo_sapiens_assembly38.fasta \
   -I input.bam \
   -O output.g.vcf.gz \

So, we need as input:

  1. The reference genome the reads were mapped to
  2. The input BAM which must be sorted and preferably has been duplicate marked. It also needs to be indexed, i.e. there must be a .bai file associated with each.
  3. The name of the output file
  4. The option -ERC GVCF. This tells HaplotypeCaller to make a gVCF, rather than to attempt to got “direct-to-VCF.”

27.8.1 That’s a dict thing to do!

What isn’t mentioned in the documentation is that HaplotypeCaller also requires that the reference genome be indexed in two different ways. It must have a .fai file (for example created using samtools faidx) and it must have a .dict or “dictionary” file for the genome. This can also be create with samtools using samtools dict. These steps are visible in our workflow DAG. Let’s make sure that we both of those necessary indexes by doing this:

conda activate bwasam

samtools faidx resources/genome.fasta
samtools dict resources/genome.fasta > resources/genome.dict

# and when done, reactivate our gatk4 environment
conda activate gatk4.2.5.0

27.8.2 Setting up a SLURM job array for this

We are going to run these jobs using a SLURM job array. Each task will be given one CPU. Because of this, we also want to tell HaplotypeCaller to only use a single thread for its pair hidden Markov model (by default it uses 4 threads, which is not helpful if we have allocated only a single core for the job). So, the command line will include the line --native-pair-hmm-threads 1.

Additionally, since all the GATK tools are implemented in Java, they run under the Java virtual machine on our computers. We can specify how much memory (RAM) is allocated to the Java virtual machine in order to host GATK. We have about 4848 Mb of RAM per core on SUMMIT, so we will give 4000 Mb, or 4Gb of RAM to the Java virtual machine. This should be more than enough. But it is worth noting that sometimes you can run out of memory, in which case you would have to re-run the job and give it more memory. To provide a certain amount of memory to the Java virtual machine, or, in fact, to supply any particular options to Java, we will use the --java-opts option to gatk. This goes before the tool name, like this: gatk --java-opts "quoted string of options to pass to Java" ToolName. In our case, that quoted string of options to pass to Java will look like: "-Xmx4g" which basically says “give me 4 Gb of memory.”

We have 8 jobs to run (one for each of our samples). We first make a simple tab-separated file called numbered-samples.tsv that we can read with our script. Here is some Unix fun to make such a file from our units.tsv file:

cut -f 1 units.tsv | uniq | \
awk '
    NR==1 {printf("index\tsample\n"); next} 
    {printf("%d\t%s\n", ++n, $1);}
' > numbered-samples.tsv

Then we need to have a script that we can pass to sbatch. The following should be copied into scripts/


#SBATCH --array=1-8
#SBATCH --time=02:00:00
#SBATCH --output=results/slurm_logs/gvcf/make-gvcf-output-%A-%a
#SBATCH --error=results/slurm_logs/gvcf/make-gvcf-error-%A-%a 
#SBATCH --mail-type=FAIL

source ~/.bashrc
conda activate gatk4.2.5.0

# this sets the shell variable "sample" to s001, s002, etc.
eval $( $SLURM_ARRAY_TASK_ID numbered-samples.tsv) 

outdir=results/gvcf                      # a directory in which to put the output gVFC file
logdir=results/logs/haplotype-caller     # a directory in which to put the output from stderr

# make sure those directories exist
mkdir -p $outdir  
mkdir -p $logdir  

gatk --java-options "-Xmx4g" HaplotypeCaller \
  -R resources/genome.fasta \
  -I results/mkdup-merged/$sample.bam \
  -O $outdir/$sample.g.vcf.gz \
  --native-pair-hmm-threads 1 \
  -ERC GVCF > $logdir/$sample.stdout  2> $logdir/$sample.stderr

27.8.3 Running interactively to watch the output

If you haven’t run any GATK tools before, it is worth doing a short run to watch the progress meter log go by (this is what gets written to stderr when you run GATK tools).

So, get an interactive shell, then spoof the SLURM_ARRAY_TASK_ID (i.e. just make a variable named SLURM_ARRAY_TASK_ID and set it to 1), and run one iteration as follows:

conda activate gatk4.2.5.0


# this sets the shell variable "sample" to s001, s002, etc.
eval $( $SLURM_ARRAY_TASK_ID numbered-samples.tsv) 

outdir=results/gvcf                      # a directory in which to put the output gVFC file
logdir=results/logs/haplotype-caller     # a directory in which to put the output from stderr

# make sure those directories exist
mkdir -p $outdir  
mkdir -p $logdir  

gatk --java-options "-Xmx4g" HaplotypeCaller \
  -R resources/genome.fasta \
  -I results/mkdup-merged/$sample.bam \
  -O $outdir/$sample.g.vcf.gz \
  --native-pair-hmm-threads 1 \

Note that we are not redirecting stdout or stderr to any files, so it just come to our screen.

The “ProgressMeter” that comes flying out of GATK tools can be quite helpful for estimating how long a particular run on a particular data set might require.

When you have seen enough, kill that job with cntrl-c and then we will launch our jobs for real.

27.8.4 Launch the first 3 jobs of our array

As always, it is wise to launch just the first few jobs of an array to make sure things are set up correctly.

Before we start this, we must create the directories that we are redirecting the SLURM logs to:

mkdir -p results/slurm_logs/gvcf

Then, start the first three:

sbatch --array=1-3 scripts/

Once you are convinced that those are doing just fine, you can start up the remaining 5:

sbatch --array=4-8 scripts/

While those are running would be a good time to go back and review what a gVCF file is all about (Section 20.2.2).

27.9 Step 5. Importing the gVCF files into a GenomicsDatabase

Section 20.2.3 provides some background on the genomics data base used by GATK, and it also outlined all of the options to the GATK program GenomicsDBImport. Here we will launch a series of jobs in a SLURM array, each one requesting 2 CPUs (so we can use 2 reader threads, effectively) to import all of our gVCF files made in the last step into a genomics data base.

This step corresponds to the colored nodes in our now-familiar workflow DAG:
Importing gVCF files into the genomics data bases.

FIGURE 27.7: Importing gVCF files into the genomics data bases.

As mentioned in Section 20.2.3, these jobs parallelize over chromosomes; however, for non-model organisms with more fragmented reference genomes, for the unplaced scaffolds we will be parallelizing over the scaffold groups. These are just groups of short scaffolds that have been lumped together so that they can be merged together and treated by GenomicsDBImport as a single “pseudo-chromosome.”

The length of all the chromosomes and the unplaced scaffolds in the reference genome are easily found in the .fai file created by samtools faidx. We have one of those in resources/genome.fasta.fai. It is fairly straightforward to operate on this file with R to create a file of chromsomes and a file of scaffold groups. In fact, a simple R script that does that can be viewed here. Rather than go through that process, we are just going to download the resulting files chromosomes.tsv and scaffold_groups.tsv, in order to use them. To download these to our example workflow directory, you can give these two simple commands (from that example workflow directory!):


Look at those files to see how they are structured.

The documentation for GenomicsDBImport can be gotten using gatk GenomicsDBImport --help, or it can be found on the web.

27.9.1 Job array script for the chromosomes

First, we need to turn chromosomes.tsv into something that we can use with So let’s do that:

awk 'NR == 1 {printf("index\tchrom\n"); next} {printf("%d\t%s\n", ++n, $1);}' chromosomes.tsv  > numbered-chromosomes.tsv

That gives us numbered-chromosomes.tsv, that looks like this:

index   chrom
1   CM031199.1
2   CM031200.1
3   CM031201.1
4   CM031202.1

First we prepare a job array script for the chromosomes that uses numbered-chromosomes.tsv. We can just copy the one for making gVCFs, and then modify it as appropriate. After reading and understanding the following chunk of code, use a text editor like nano to paste it into the file scripts/


#SBATCH --array=1-4
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=2
#SBATCH --output=results/slurm_logs/gdb_import/output-%A-%a
#SBATCH --error=results/slurm_logs/gdb_import/error-%A-%a 
#SBATCH --mail-type=FAIL

source ~/.bashrc
conda activate gatk4.2.5.0

# this sets the shell variable "chrom" to the chromosome names etc.
eval $( $SLURM_ARRAY_TASK_ID numbered-chromosomes.tsv) 

outdir=results/genomics_db/$chrom                      # a directory in which the genomics DB will go
logdir=results/logs/genomics_db_import                   # a directory in which to put the output from stderr and stdout

# make sure those directories exist
mkdir -p $(dirname $outdir)  # make the directory for this output directory to go into
mkdir -p $logdir  

# here are the standard options that you will typically use
standard_opts=" --batch-size 50 --reader-threads 2 --genomicsdb-shared-posixfs-optimizations "

gatk --java-options "-Xmx4g" GenomicsDBImport \
      $standard_opts \
      $(awk 'NR>1 {printf(" -V results/gvcf/%s.g.vcf.gz ", $2);}' numbered-samples.tsv) \
      --intervals $chrom \
      --genomicsdb-workspace-path $outdir > $logdir/stdout_$chrom.txt  2> $logdir/stderr_$chrom.txt

Note that call to awk. That puts all the samples on the command line with a -V before each one.

Now! Before we launch that, we must create the directories for the slurm_logs. Otherwise, SLURM will kill the jobs immediately upon submitting them without any sort of reasonable error message indicating why it failed.

mkdir -p results/slurm_logs/gdb_import

# normally we would test a few, but since there are only
# four jobs in this array, let's launch them all...
sbatch  scripts/

At this juncture, it is worth mentioning that the output of GenomicsDBImport goes into a directory (named $outdir in the above script). If that directory already exists, and you call GenomicsDBImport with the --genomicsdb-workspace-path option, then you will get an error. The tool is careful not to overwrite an existing data base. So, if you tried a run that failed after creating the output directory, you will need to remove that before you try to relaunch the job.

27.9.2 Job array script for the scaffold groups

Recall that GenomicsDBImport can use the --merge-contigs-into-num-partitions 1 “trick” to put many short scaffolds into a single genomics data base. We still have to do that. Making a script for that involves two steps that were different than what we did with the chromosomes, above:

  1. We need to specify all the scaffolds in each scaffold group
  2. We need to use the --merge-contigs-into-num-partitions 1 option

Note that we have a lot of scaffolds in some scaffold groups, so that if we tried to specify them all on the command line, we would end up with a very long command line. Although Unix, today, is reasonably tolerant of long and convoluted command lines, some flavors of Unix will hiccup or barf on very long command lines. Accordingly, we will use a GATK feature that lets us specify genomic regions in a file—so long as it is named with the extension .list—rather than on the command line for this job.

First, we need to make a file numbered-scaff-groups.tsv that will translate a SLURM_ARRAY_TASK_ID into the name of a scaffold group. We can do that with a little Unix and awk:

awk 'NR>1 {print $1}' scaffold_groups.tsv | uniq | awk 'BEGIN {printf("index\tsg\n")} {printf("%d\t%s\n", ++n, $1)}' > numbered-scaff-groups.tsv

If you look at the file created there, you will see that it looks like this:

index   sg
1   scaff_group_001
2   scaff_group_002

so, we can use it to turn the SLURM_ARRAY_TASK_ID into a shell variable sg that holds the name of the scaffold group we are dealing with.

The following code block contains a shell script that you should paste into a new file called scripts/


#SBATCH --array=1-2
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=2
#SBATCH --output=results/slurm_logs/gdb_import_scaff_groups/output-%A-%a
#SBATCH --error=results/slurm_logs/gdb_import_scaff_groups/error-%A-%a 
#SBATCH --mail-type=FAIL

source ~/.bashrc
conda activate gatk4.2.5.0

# this sets the shell variable "sg" to the scaffold group name
eval $( $SLURM_ARRAY_TASK_ID numbered-scaff-groups.tsv) 

outdir=results/genomics_db/$sg                # a directory in which the genomics DB will go
logdir=results/logs/genomics_db_import        # a directory in which to put the output from stderr and stdout

# make sure those directories exist
mkdir -p $(dirname $outdir)  # make the directory for this output directory to go into
mkdir -p $logdir  

# here are the standard options that you will typically use
standard_opts=" --batch-size 50 --reader-threads 2 --genomicsdb-shared-posixfs-optimizations "

# now, some new stuff for the scaffold group lists.  We are going to make a file
# called results/sg_lists/$sg.list that contains the names of all the scaffolds in this
# scaffold group $sg.
mkdir -p results/sg_lists # make sure that this directory exists
sglist=results/sg_lists/$sg.list  # file name for the scaffold names
awk -v sg=$sg 'NR >1 && $1==sg {print $2}' scaffold_groups.tsv > $sglist 

# now, we call this with the name of the file having 
gatk --java-options "-Xmx4g" GenomicsDBImport \
      $standard_opts \
      $(awk 'NR>1 {printf(" -V results/gvcf/%s.g.vcf.gz ", $2);}' numbered-samples.tsv) \
      --intervals $sglist \
      --merge-contigs-into-num-partitions 1 \
      --genomicsdb-workspace-path $outdir > $logdir/stdout_$sg.txt  2> $logdir/stderr_$sg.txt

Then, since there are only two scaffold groups, we can run that as an array from 1 to 2. Of course, as always, be sure to create the slurm log output directories before you launch the job:

mkdir -p results/slurm_logs/gdb_import_scaff_groups
sbatch --array=1-2 scripts/

27.9.3 Hey! Let’s try updating the genomic data base

Recall from Section 20.2.4, that it is possible to add new individuals to a genomics data base. It is worth giving that a whirl. What we shall do here is create the data base with the first six samples, and then update it by adding the last two sample to the existing data bases.

Before we do this, we might note that it is a bit of a hassle that we are starting to see such a proliferation of different “array” scripts in this section. We already have one script for chromosomes and one script for scaffold groups. Geez! If we now also have to make a separate script for updating chromosome and scaffold group data bases, that is going to be four separate scripts. Not only that, but the names of the files of numbered chromsomes and numbered scaffold groups that they use are hard-wired into the scripts, and if we are going to be doing some building of data bases and then, later, updating them, we might want to be able to tell the script which numbered-XX file to use.

So, here we will write a single, general script that can deal with these different use cases by taking options and positional parameters. It will make the script a little longer, but it will be nice to have it all in one place. This is a general theme in computer programming: if you can reduce the number of places (i.e., files or scripts) the same block of code appears (by using variables and options) then it is typically a lot easier to maintain the code base.

So, here is our big general script. You should paste this into: scripts/


#SBATCH --array=1-2
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=2
#SBATCH --output=results/slurm_logs/gdb_import_general/output-%A-%a
#SBATCH --error=results/slurm_logs/gdb_import_general/error-%A-%a
#SBATCH --mail-type=FAIL

### Define default values for variables here.  By default we are going to
### Be doing "by chromosome", but the options we give can update these
DOING_SCAFF=0 # default is to do chromosomes
UPDATING=0    # default is to create a new data base
logdir=results/logs/genomics_db_import        # a directory in which to put the output from stderr and stdout

### Here is a lot of stuff for usage "help" messages and options processing
function usage {
      echo Syntax:
      echo "  $(basename $0) [-h -u -s ScaffGroupFile -o OutputDirectory -l LogDirectory] NumberedSeqsFile NumberedSamplesFile "
      echo "    -h:  print help and exit."
      echo "    -l:  put the resulting logs (from stderr and stdout) in directory LogDirectory."
      echo "         Default is results/logs/genomics_db_import"
      echo "    -o:  put the genomicsDB within OutputDirectory (and make that dir if needed)."
      echo "         Default is results/genomics_db"
      echo "    -s:  process scaffold groups given in ScaffGroupFile."
      echo "    -u:  update the genomics data base instead of creating it."
      echo "   ScaffGroupFile is like scaffold_groups.tsv"
      echo "   NumberedSeqsFile is either like numbered-chromosomes.tsv or numbered-scaff-groups.tsv"
      echo "   NumberedSamplesFile is like numbered-samples.tsv"

if [ $# -eq 0 ]; then
    exit 1;

while getopts "hl:o:s:u" opt; do
    case $opt in
    h    )
        exit 1
    l    )
    o    )
    s    )
    u    )


shift $((OPTIND-1));

# uncomment to test for right number of required args
if [ $# -ne 2 ]; then
    exit 1;


#### Done processing options and getting down to work

source ~/.bashrc
conda activate gatk4.2.5.0

# make sure the directories exist
mkdir -p $OUTPARENT  # make the directory for this output directory to go into
mkdir -p $logdir

# this sets the shell variable "sg" to the scaffold group name
# or the variable "chrom" to the chromosome name

### Now we process things differently if we are doing scaff_groups or chromosomes
if [ $DOING_SCAFF = 0 ]; then
    INTERVALS=" --intervals $chrom "
    mkdir -p results/sg_lists # make sure that this directory exists
    sglist=results/sg_lists/$sg.list  # file name for the scaffold names
    awk -v sg=$sg 'NR >1 && $1==sg {print $2}' $SCAFF_GROUP_FILE > $sglist
    INTERVALS=" --intervals $sglist "
    EXTRA=" --merge-contigs-into-num-partitions 1 "

### Now, change options if updating vs creating a data base
if [ $UPDATING = 0 ]; then
    DB_OPT=" --genomicsdb-workspace-path "
    DB_OPT=" --genomicsdb-update-workspace-path "

# here are the standard options that you will typically use
standard_opts=" --batch-size 50 --reader-threads 2 --genomicsdb-shared-posixfs-optimizations "

# here is where we make the call
# now, we call this with the name of the file having
gatk --java-options "-Xmx4g" GenomicsDBImport \
  $standard_opts \
  $(awk 'NR>1 {printf(" -V results/gvcf/%s.g.vcf.gz ", $2);}' $NUMBERED_SAMPLES_FILE) $INTERVALS $EXTRA \
  $DB_OPT $outdir > $STDOUT  2> $STDERR

Once we have that script, we are going to use it to create genomics data bases from the the first six samples for chromosomes and scaffold groups, and then we are going to update each of those data bases by adding the next two individuals into the data bases.

We start by making some numbered-samples type of files.

awk 'BEGIN {OFS="\t"} NR==1 {print; next} $1 <= 6 {print}' numbered-samples.tsv > numbered-first-six-samples.tsv
awk 'BEGIN {OFS="\t"} NR==1 {print; next} $1 > 6 {printf("%d\t%s\n", ++n, $2);}' numbered-samples.tsv > numbered-last-two-samples.tsv

Then we start our jobs to create the data bases for the four chromosomes and for the two scaffold groups with the first six samples. We will put these into the directory results/stepwise_genomics_db

# make sure the slurm logs directory exists!!
mkdir -p results/slurm_logs/gdb_import_general

# create 4 chromosome databases with the first six samples
sbatch --array=1-4 scripts/  \
  -l results/logs/stepwise_gdb_import/first-six  \
  -o results/stepwise_genomics_db \
  numbered-chromosomes.tsv \

# create 2 scaffold-group databases with the first six samples
sbatch --array=1-2 scripts/  \
  -l results/logs/stepwise_gdb_import/first-six  \
  -o results/stepwise_genomics_db \
  -s scaffold_groups.tsv \
  numbered-scaff-groups.tsv \

We have to wait until each of those runs has completed. Once they do, check that all the genomics data bases have been created, by, for example:

# making sure that all the log files are there:
ls  results/logs/stepwise_gdb_import/first-six/

# looking at the last ten lines of each of the log files
tail results/logs/stepwise_gdb_import/first-six/stderr*

# listing the directories of the data bases:
ls results/stepwise_genomics_db

# listing all the contents of those data bases
ls -R results/stepwise_genomics_db

If that all looks good, then it is time to update each of those data bases by adding the last two individuals to them. To do so, we give our script the -u option, give it the path to the existing data base directory with the -o option. And we also change the log file directory (with the -l option) so that the new logs go to a new place.

# update 4 chromosome databases by adding the last two samples to them
sbatch --array=1-4 scripts/ \
  -l results/logs/stepwise_gdb_import/last-two \
  -o results/stepwise_genomics_db \
  -u \
  numbered-chromosomes.tsv numbered-last-two-samples.tsv

# update 2 scaffold-group databases by adding the last two samples to them
sbatch --array=1-2 scripts/ \
  -l results/logs/stepwise_gdb_import/last-two \
  -o results/stepwise_genomics_db \
  -s scaffold_groups.tsv \
  -u \
  numbered-scaff-groups.tsv numbered-last-two-samples.tsv

When those are done, you can check the log files to make sure that they look like they ran all right:

tail results/logs/stepwise_gdb_import/last-two/stderr_*

If they say something like:

08:31:11.156 INFO  GenomicsDBImport - Importing batch 1 with 2 samples
08:31:19.147 INFO  GenomicsDBImport - Done importing batch 1/1
08:31:19.155 INFO  GenomicsDBImport - Import of all batches to GenomicsDB completed!
08:31:19.157 INFO  GenomicsDBImport - Shutting down engine
[April 7, 2022 at 8:31:19 AM MDT] done. Elapsed time: 0.15 minutes.

then it has all worked according to plan!

27.10 Step 6. Calling variants and creating VCF files from the GenomicsDatabase

The final step to getting a VCF file with variants for all your samples is to run GenotypeGVCFs. This GATK tool can take a variety of things as input, but, for our purposes, the most useful is that it can take a genomics data base as input.

Recall that we now have six genomics data bases: one for each of the four chromosomes in our reference genome file, and two for each of the scaffold groups. Each of these scaffold groups includes a large number of unplaced scaffolds that exist in our reference genome.

Each of those data bases knows exactly which chromosomes or scaffolds are included in it—and also it has information about all the samples. In other words, it has all the information it needs to make a VCF file for the chromosome or scaffold group that it encompasses. Accordingly, it is pretty easy to run the GenotypeGVCFs tool on a genomics data base.

This step corresponds to the colored node in our now-familiar workflow DAG:
Joint genotyping: making a VCF file from a genomics database/

FIGURE 27.8: Joint genotyping: making a VCF file from a genomics database/

The webpage describing how this tools works can be found here.

The basic command, for the chromosome CM031202.1 (the shortest one!) in our example workflow looks like:

gatk --java-options "-Xmx4g" GenotypeGVCFs \
   -R resources/genome.fasta \
   -V gendb://results/genomics_db/CM031202.1 \
   -O results/vcf_parts/CM031202.1.vcf

There are some other options, but none that we will explore today. The main thing to note in the command line is that the path to the genomics data base is preceded by gendb:// which is the way you would specify the location of a data base as a URL.

Run the above command on a compute shell to see what happens. It is set up to use the genomics data bases that we created in results/genomics_db.

So, all we need to do now is make some simple array scripts. We could make a general purpose one that can be used with both the scaffold groups or the chromosomes, or just write two separate scripts. We will do the latter here.

27.10.1 Array script for the four chromosomes

The contents of our first script is below and should be copied into scripts/


#SBATCH --array=1-4
#SBATCH --time=01:00:00
#SBATCH --output=results/slurm_logs/gdb2vcf/output-%A-%a
#SBATCH --error=results/slurm_logs/gdb2vcf/error-%A-%a 
#SBATCH --mail-type=FAIL

source ~/.bashrc
conda activate gatk4.2.5.0

# this sets the shell variable "chrom" to the chromosome name
eval $( $SLURM_ARRAY_TASK_ID numbered-chromosomes.tsv) 

outdir=results/vcf_parts                      # a directory in which to put the output gVFC file
logdir=results/logs/genotype-gvcfs     # a directory in which to put the output from stderr

# make sure those directories exist
mkdir -p $outdir  
mkdir -p $logdir  

gatk --java-options "-Xmx4g" GenotypeGVCFs \
   -R resources/genome.fasta \
   -V gendb://results/genomics_db/$chrom \
   -O $outdir/$chrom.vcf.gz > $logdir/stdout.$chrom 2> $logdir/stderr.$chrom 

Note that when you name the output file with a .gz extension, GATK will automatically gzip the output.

Launch that job array.

# create slurm logs directory
mkdir -p results/slurm_logs/gdb2vcf

# try just the first job
sbatch --array=1 scripts/

# if it looks like that is working (check the stderr log output)
# then launch the remaining ones
sbatch --array=2-4 scripts/

27.10.2 Array script for the two scaffold groups

The contents of our scaffold group script is below and should be copied into scripts/


#SBATCH --array=1-2
#SBATCH --time=01:00:00
#SBATCH --output=results/slurm_logs/gdb2vcf/output-%A-%a
#SBATCH --error=results/slurm_logs/gdb2vcf/error-%A-%a 
#SBATCH --mail-type=FAIL

source ~/.bashrc
conda activate gatk4.2.5.0

# this sets the shell variable "sg" to the chromosome name
eval $( $SLURM_ARRAY_TASK_ID numbered-scaff-groups.tsv) 

outdir=results/vcf_parts                      # a directory in which to put the output gVFC file
logdir=results/logs/genotype-gvcfs     # a directory in which to put the output from stderr

# make sure those directories exist
mkdir -p $outdir  
mkdir -p $logdir  

gatk --java-options "-Xmx4g" GenotypeGVCFs \
   -R resources/genome.fasta \
   -V gendb://results/genomics_db/$sg \
   -O $outdir/$sg.vcf.gz > $logdir/stdout.$sg 2> $logdir/stderr.$sg 

Launch that job array.

# create slurm logs directory (we already did, above, but
# it is always good to think about this before launching a slurm job)
mkdir -p results/slurm_logs/gdb2vcf

# try just the first job
sbatch --array=1 scripts/

# if it looks like that is working (check the stderr log output)
# then launch the remaining ones
sbatch --array=2 scripts/

27.10.3 Have a look at the products

You can take a look at the resulting VCF files like this:

zcat results/vcf_parts/CM031199.1.vcf.gz | less

Have a look at them and see if it seems everything is in order!

27.10.4 Homework

Modify the two array scripts used above so that they use the genomics data bases that we put in results/stepwise_genomics_db while testing the sequential updating of the data bases. Put the output files into results/stepwise_vcf_parts. See if you get the same variants out of it (you should!).

27.10.5 Catenate all the vcf.gz files into a single VCF file and index it

After running the above steps, we have six different vcf.gz files, four for each of four chromosomes, and two for each of two scaffold groups. Sometimes it is nice to keep the VCF files for different chromosomes separate; however because indexing a VCF file allows for rapid, indexed access to any place in the genome, there isn’t much reason to keep all the pieces separate unless you are going to be parsing each file with non-standard tools (i.e., reading each chromosome into R, which, must say, is really NOT recommended—there are specialized tools for handling VCF files and you should become comfortable with them).

In order to catenate these 7 vcf.gz files into a single one, and to index it, we will use bcftools. We will get the latest version of bcftools available on conda at time time this was written, version 1.15.1, and put it in its own conda environment named bcftools.

mamba create -n bcftools -c conda-forge -c bioconda bcftools=1.15.1

(Note the use of -c conda-forge in the above. This is present because some of the dependencies for bcftools are not found on the bioconda channel. Rather they are on the conda-forge channel. If you conda/mamba environment is not set up to search conda-forge by default, then the -c conda-forge is required to get all the dependencies.)

When that has finished, activate the environment and then type bcftools to get a list of all the different possible bcftools subcommands:

conda activate bcftools

Right now we will be using bcftools concat to catenate the files, and then we will use bcftools index to index the result. We will put the result into results/vcf.

# first, do this to get the help information for bcftools concat
bcftools concat

Note that if we order our chromosome names and our scaffold groups alphabetically they are in the correct (genomic) order, so it is pretty easy to specify the in that order on the command line, like this:

mkdir -p results/vcf
bcftools concat -O z results/vcf_parts/*.vcf.gz > results/vcf/all.vcf.gz

Note the -O z option which makes the program produce compressed VCF format (hence we gave it the .vcf.gz extension).

There is another type of format common for VCF files, and that is “BCF” format. We will also make a compressed BCF version (using the -O b option). You should be comfortable working with either format. BCF is simply a binary format of the same information. It is not directly human readable, but that is not a problem, since most of the time you will look at these files using bcftools view (which is to VCF/BCF files what samtools view is to SAM/BAM files).

bcftools concat -O b results/vcf_parts/*.vcf.gz > results/vcf/all.bcf

Now, we are going to use these files to learn all about bcftools in Section 22.

27.11 Step 7. Filtering variants

27.12 Step 8. Base Quality Score Recalibration