Chapter 22 Basic Handling of VCF files
As we saw in the section on bioinformatic formats, VCF files can be large and unwieldy. The format specification is also such that fields might have different numbers of subfields, depending, for example, on the number of alleles found at a variant. Both of these features make it hard to directly read a VCF file into a, say, R, or some other program that may wish to treat it purely as tabular data.
This is not to say that you couldn’t just read a VCF file into R directly as a TAB delimited text file, and then start splitting fields up on it. However, there are specialized tools for doing operations of VCF files, and becoming familiar with them can relieve a lot of the pain of dealing with VCF files.
To have an example indexed-and-bgzipped VCF file to play with, you can
download them from your Unix workstation with the following commands.
I’d recommend putting them into a directory called bcftools-play
in
scratch
somewhere.
wget https://www.dropbox.com/s/dzwixabvy3wwfdr/chinook-32-3Mb.vcf.gz?dl=1
wget https://www.dropbox.com/s/d303fi7en71p8ug/chinook-32-3Mb.vcf.gz.csi?dl=1
# after which you might need to rename them, if wget retained the `?dl=1`
# in the resulting file names:
mv chinook-32-3Mb.vcf.gz\?dl\=1 chinook-32-3Mb.vcf.gz
mv chinook-32-3Mb.vcf.gz.csi\?dl\=1 chinook-32-3Mb.vcf.gz.csi
# after that, make sure you are ready to use bcftools by doing
conda activate bioinf
There are two main programs for handling VCF files: vcftools
and bcftools
. Both
of these grew out of the 1000 Genomes effort starting about a decade ago. It seems
that vcftools
may have been developed first, but, currently, bcftools
is being
more actively developed, with new versions and new features being added to it
regularly. vcftools
provides some very specific commands for particular analyses
or operations, some of which are not available from bcftools
. On the other hand,
bcftools
provides a more general interface for operations on VCF files. By this
interface, a great number of the operations done in vcftools
are available, but
a little extra knowledge is required to implement them. That said, the range of
possible operations seems much larger in bcftools
than in vcftools
.
22.1 bcftools
Here, we just want to give everyone the chance to run through a few
operations with bcftools
, to start to get familiar with its interface.
The first thing to note is that, like samtools
(which is maintained by
the same group of people), bcftools
possesses a number of different
subcommands. So, the syntax is always like:
bcftools
subcommand options file(s)
Also like samtools
, bcftools
will take input from stdin rather than
from a file—you just pass it -
instead of a file name.
The full documentation/manual for bcftools
is maintained at:
http://samtools.github.io/bcftools/bcftools.html. It is well worth reading through this whole documentation, though it
can be quite terse and intimidating. A friendlier “tutorial” introduction to the
software can be found at https://samtools.github.io/bcftools/howtos/index.html.
Here we are going to get our feet with with a few operations.
22.1.1 Tell me about my VCF file!
VCF files are a little daunting. Especially when they are gzipped they
can seem particularly opaque—learning anything about them in the traditional
manner of uncompressing them and then searching for lines within them
or counting up the number of records can be time consuming. Here are some
bcftools
solutions to a few different questions you might have.
Who is in this file? You can always try to find the last header line
in a VCF file using grep or awk and parse the individuals out yourself, but
it turns out to be faster and safer to use the query
subcommand from bcftools
with the -l
option. Do it here:
bcftools query -l chinook-32-3Mb.vcf.gz
Then read about it on the manual page. Find the part that talks describes it.
How many variants are in this file? This question can be answered quickly with
bcftools stats
, which also returns to you a plethora of information about the variants.
bcftools stats chinook-32-3Mb.vcf.gz
The top part of the output tells you how many SNPs and indels (and other types of variants) there are. Then it tells you about Ts/Tv ratios, allele frequencies and read depths.
Where are these variants? There are several ways to answer this question. One might be simply to print the CHROM and the POS for each row in the VCF file:
bcftools query -f '%CHROM\t%POS\n' chinook-32-3Mb.vcf.gz
If you want to see where it starts and where it finishes you can do:
bcftools query -f '%CHROM\t%POS\n' chinook-32-3Mb.vcf.gz | head
bcftools query -f '%CHROM\t%POS\n' chinook-32-3Mb.vcf.gz | tail
From that we see that our example file runs from 4 Mb to 7 Mb or so
on chromosome NC_037124.1. This shows one use of the subcommand query
,
which is quite useful.
Give me a glimpse of the file You can use bcftools view
for a number
of things, but at its simplest, you can merely look at the file in VCF format.
(In this manner, it behaves much like samtools view
for VCF files).
# show the whole file from the top (use show just the first 10 lines)
bcftools view chinook-32-3Mb.vcf.gz | head
# show just the header with -h. Here look at just the last 10 lines of the header
bcftools view -h chinook-32-3Mb.vcf.gz | tail
# show the variants themslves (no header) with -H
bcftools view -H chinook-32-3Mb.vcf.gz | head
When we did that, we see that there is a lot of information in the INFO field. What if we wanted to extract that?
22.1.2 Get fragments/parts of my VCF file
Extract keyed values from the INFO field This is another job for bcftools query
. You pass a format string to the -f
option that tells the program which fields you want
to extract and how you want to format it. Check out some examples:
# extract CHROM POS and BaseQRankSum, separated by TABs
bcftools query -f '%CHROM\t%POS\t%INFO/BaseQRankSum\n' chinook-32-3Mb.vcf.gz | head
# extract CHROM POS and total read depth DP
bcftools query -f '%CHROM\t%POS\t%INFO/DP\n' chinook-32-3Mb.vcf.gz | head
View data from specified regions
What if we want to look at variants only in two 10 Kb regions, like
NC_037124.1:5000000-5010000
and NC_037124.1:6000000-6010000
? Pass those,
separated by commas, to the -r
option (which is an option that applies to many
of the subcommands):
bcftools view -H -r NC_037124.1:5000000-5010000,NC_037124.1:6000000-6010000 chinook-32-3Mb.vcf.gz
You can also specify those regions in a file with -R
option.
View data from specified individuals
You can give the sample names (comma separated) to the -s
option:
bcftools view -H -s DPCh_plate1_G05_S77,DPCh_plate1_G06_S78 chinook-32-3Mb.vcf.gz | head
Or, if you wanted to view all but those two individuals, precede them with a ^
:
bcftools view -H -s ^DPCh_plate1_G05_S77,DPCh_plate1_G06_S78 chinook-32-3Mb.vcf.gz | head
You can also supply a text file with sample names (one-per-line) to the capital letter -S
option.
You can combine options, like -r
and -s
, as well.
22.1.3 Combine VCF files in various ways
Catenate VCF files
If you have VCF files called from the same reference genome filled with
the same samples, it is easy to catenate the together with bcftools concat
:
# make two files from different regions
bcftools view -Oz -r NC_037124.1:4000000-5000000 chinook-32-3Mb.vcf.gz > A.vcf.gz
bcftools view -Oz -r NC_037124.1:6000000-7000000 chinook-32-3Mb.vcf.gz > B.vcf.gz
# how many variants in each of those?
bcftools stats A.vcf.gz | awk '/^SN/'
bcftools stats B.vcf.gz | awk '/^SN/'
# catenate the back together
bcftools concat -Oz A.vcf.gz B.vcf.gz > CAT.vcf.gz
# how many variants in that?
bcftools stats CAT.vcf.gz | awk '/^SN/'
Note the use of the -O
(capital “o”) option to specify the output
type: v
= VCF, b
= BCF, u
= uncompressed BCF, z
= bgzipped VCF.
Merge VCF files
If you have files with different samples in them you can easily combine them:
# make file with first three samples
bcftools view -Oz -s DPCh_plate1_A05_S5,DPCh_plate1_A06_S6,DPCh_plate1_A11_S11 chinook-32-3Mb.vcf.gz > first3.vcf.gz
# make another with the last three samples
bcftools view -Oz -s DPCh_plate1_H06_S90,DPCh_plate1_H11_S95,DPCh_plate1_H12_S96 chinook-32-3Mb.vcf.gz > last3.vcf.gz
# merging requires that the files be indexed
bcftools index first3.vcf.gz
bcftools index last3.vcf.gz
# merge those into a file with 6 samples
bcftools merge -Oz first3.vcf.gz last3.vcf.gz > 6-samples.vcf.gz
22.1.4 Filter out variants for a variety of reasons
There are a lot of ways to filter out variants. bcftools
leaves things
very general here, and so just about anything is possible. Some simple ones:
Just the biallelic SNPs please Get things with no more than 2 alleles and no fewer than two alleles, and of a type = SNP:
# do it and see the result all in one line:
bcftools view -Ou -m 2 -M 2 --types=snps chinook-32-3Mb.vcf.gz | bcftools stats - | awk '/^SN/'
Just the biallelic indels please
# do it and see the result all in one line:
bcftools view -Ou -m 2 -M 2 --types=indels chinook-32-3Mb.vcf.gz | bcftools stats - | awk '/^SN/'
Note the use of -Ou
in order to pipe uncompressed BCF output directly into
bcftools stats
using the -
for a filename.
Fraction of missing sites less than X
If you want to make sure that 60% of your individuals have at least one read at the genotype, you can do this:
bcftools view -i 'F_MISSING < 0.4' chinook-32-3Mb.vcf.gz | bcftools stats - | awk '/^SN/'
Play with setting the F_MISSING
to different values and see how that affects
the number of variants retained.
This has just scratched the surface of what you can do with bcftools.