Chapter 22 Basic Handling of VCF files
As we saw in the section on bioinformatic formats, VCF files can be large and unwieldy. The format specification also allows fields to have different numbers of subfields, depending, for example, on the number of alleles found at a variant. Both of these features make it hard to read a VCF file directly into, say, R, or some other program that treats it purely as tabular data.
This is not to say that you couldn't just read a VCF file into R as a TAB-delimited text file and then start splitting fields up. However, there are specialized tools for doing operations on VCF files, and becoming familiar with them can relieve a lot of the pain of dealing with VCF files.
To have an example VCF file to play with, you can download one to your Unix workstation with the following commands. You should put these files in your scratch directory somewhere, perhaps creating a directory called bcftools-play to put them into. All the following commands assume that the two example files are in the current working directory.
First we download the vcf.gz version of the VCF file, called all.vcf.gz. This is the unfiltered VCF file created by our example WGS workflow in Section 27:
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1zgFyYbfWU85O4JzmOX7-MOKQs4oXv4fm' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1zgFyYbfWU85O4JzmOX7-MOKQs4oXv4fm" -O all.vcf.gz && rm -rf /tmp/cookies.txt
After that, we also download the BCF version of the same file, just so that everyone gets familiar with the fact that BCF files can be treated equivalently to VCF files with bcftools.
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1iUP_UstnmuLSGuBIfb0EYXs1Gw1z-z0o' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1iUP_UstnmuLSGuBIfb0EYXs1Gw1z-z0o" -O all.bcf && rm -rf /tmp/cookies.txt
To repeat, BCF and vcf.gz are two formats serving similar purposes. I will be using both all.bcf and all.vcf.gz interchangeably in the following examples.
There are two main, well-known programs for handling VCF files: vcftools and bcftools. Both of these grew out of the 1000 Genomes effort starting about a decade ago. It seems that vcftools may have been developed first, but, currently, bcftools is being more actively developed, with new versions and new features being added to it regularly. vcftools provides some very specific commands for particular analyses or operations, some of which are not available from bcftools. On the other hand, bcftools provides a more general interface for operations on VCF files. Through this interface, a great number of the operations done in vcftools are available, but a little extra knowledge is required to implement them. That said, the range of possible operations seems much larger in bcftools than in vcftools.
Further, bcftools behaves like a typical Unix utility: data can be streamed to stdout, and data can be streamed into bcftools from stdin (by using - as the input file name). This lets you pipe output into it the way you can with most Unix tools, which makes it far more versatile than vcftools.
22.1 bcftools
If you don't already have bcftools, you can use mamba to create a conda environment called bcftools that has it:
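Something along these lines should do it (the environment name and channel order are just reasonable choices; bcftools itself lives on the bioconda channel):
mamba create -n bcftools -c conda-forge -c bioconda bcftools
Then activate the environment with conda activate bcftools before running the commands below.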
(Note the use of -c conda-forge in the above. This is present because some of the dependencies for bcftools are not found on the bioconda channel; rather, they are on the conda-forge channel. If your conda/mamba environment is not set up to search conda-forge by default, then the -c conda-forge is required to get all the dependencies.)
Here, we just want to give everyone the chance to run through a few operations with bcftools, to start to get familiar with its interface.
The first thing to note is that, like samtools (which is maintained by the same group of people), bcftools possesses a number of different subcommands. So, the syntax is always like:
bcftools subcommand options file(s)
Also like samtools, bcftools will take input from stdin rather than from a file—you just pass it - instead of a file name.
The full documentation/manual for bcftools is maintained at http://samtools.github.io/bcftools/bcftools.html. It is well worth reading through this whole documentation, though it can be quite terse and intimidating. A friendlier “tutorial” introduction to the software can be found at https://samtools.github.io/bcftools/howtos/index.html.
Here we are going to get our feet wet with a few operations.
First, we will look at the “onboard” documentation. By just entering bcftools with no arguments, you get a list of all the subcommands that are available. If you want the onboard documentation for any particular subcommand, you can just give a naked bcftools subcommand command, as sketched below for a couple of the subcommands.
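For example (query and view are arbitrary picks here; any subcommand behaves the same way), the first of these prints a fairly short usage summary, and the second a more daunting set of documentation:
bcftools query
bcftools view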
22.1.1 Index my VCF file!
The first thing we are going to do is index our VCF files. We create the default index, a coordinate-sorted index which has the .csi extension. If your VCF file is not in coordinate-sorted order, you might have to sort it before you do this. However, all.vcf.gz and all.bcf are already sorted. So:
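Something like this, run once for each file, should do it (bcftools index writes a .csi index by default):
bcftools index all.vcf.gz
bcftools index all.bcf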
Use ls to see the new files created by that operation.
The index allows for rapid access to different parts of the files that correspond to specific locations in the genome.
22.1.2 Tell me about my VCF file!
VCF files are a little daunting. Especially when they are gzipped (or are BCF files!) they can seem particularly opaque—learning anything about them in the traditional manner of uncompressing them and then searching for lines within them or counting up the number of records can be time consuming. Here are some bcftools solutions to a few different questions you might have.
Who is in this file? You can always try to find the last header line in a VCF file using grep or awk and parse the individuals out yourself, but it turns out to be faster and safer to use the query subcommand from bcftools with the -l option. Do it here:
bcftools query -l all.vcf.gz
# And, of course, you can do the same with the BCF file
bcftools query -l all.bcf
Then read about it on the manual page. Find the part that describes it.
How many variants are in this file? This question can be answered quickly with bcftools stats, which also returns to you a plethora of information about the variants.
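For example (piping to less, since there is a lot of output):
bcftools stats all.vcf.gz | less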
The top part of the output tells you how many SNPs and indels (and other types of variants) there are. Then it tells you about Ts/Tv ratios, then it essentially gives histogram summaries for allele frequencies, variant quality scores (QUAL), insertion-deletion sizes, substitution types, read depths, etc.
Where are these variants? There are several ways to answer this question. One way might be simply to print the CHROM and the POS for each row in the VCF file:
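For instance, something like:
bcftools query -f '%CHROM\t%POS\n' all.vcf.gz | less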
If you want to see where it starts and where it finishes you can do:
bcftools query -f '%CHROM\t%POS\n' all.vcf.gz | head
bcftools query -f '%CHROM\t%POS\n' all.vcf.gz | tail
If we wanted to quickly see how many variants were on each of the chromosomes/scaffolds, sorted by number of variants, we could do:
bcftools query -f '%CHROM\t%POS\n' all.vcf.gz | awk '{n[$1]++} END {for(i in n) print i, n[i]}' | sort -nbr -k 2 | less
This shows one use of the subcommand query, which is quite useful. Even though it is named query, its main purpose is simply extracting fields of information from a VCF file and spitting them out in a new, user-specified, typically tabular format.
Give me a glimpse of the file You can use bcftools view for a number of things, but at its simplest, you can merely look at the file in VCF format. (In this manner, it behaves much like samtools view, but for VCF files.)
# show the whole file from the top
bcftools view all.bcf | less
# of course, this works with either bcf or vcf or vcf.gz
bcftools view all.vcf.gz | less
# show just the header with -h. Here we look at just the last 10 lines of the header
bcftools view -h all.bcf | tail
# show the variants themselves (no header) with -H
bcftools view -H all.vcf.gz | head
Just like you can with samtools view, you can convert formats with bcftools view. Pipe a VCF into it and then use the -O (Big-O, not a zero) option:
-O z : bgzipped VCF (vcf.gz)
-O v : uncompressed VCF (the default)
-O u : uncompressed BCF
-O b : compressed BCF
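For example, here is one way to convert the bgzipped VCF to BCF, either directly or by piping it in on stdin (the output file name is just a placeholder):
bcftools view -O b all.vcf.gz > converted.bcf
cat all.vcf.gz | bcftools view -O b - > converted.bcf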
22.1.3 Rename the samples/individuals in the file
We saw above that the names of the samples in the file are like s001, s002. This was not actually what I had intended! The names in here are set by the SM field of the read groups in the BAM files from which variants are called. In Section 27.6.3, where those BAM files were made, I screwed up and used the wrong column from the numbered-units.tsv file to set the SM value in the read groups. I wanted to use sample_id not sample.
However, all is not lost. We don't have to go all the way back to the beginning and remap everything and call variants. We simply rename the samples in the file. For this we can use bcftools reheader. First, look at the documentation for that, both on the web, and with:
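That is, the naked subcommand, which prints its onboard usage message:
bcftools reheader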
Aha! We see that the web-based documentation is a little more complete, and it tells us what format to use for a sample-renaming file for the -s option.
Copy the following contents (using nano, for example) into a file called sample-renames.txt:
s001 T199967
s002 T199968
s003 T199969
s004 T199970
s005 T199971
s006 T199972
s007 T199973
s008 T199974
Then we can make a renamed bcf file, and a renamed vcf.gz file, as sketched below.
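Something like the following should work (the output file names here are just ones I made up):
bcftools reheader -s sample-renames.txt all.bcf > renamed-all.bcf
bcftools reheader -s sample-renames.txt all.vcf.gz > renamed-all.vcf.gz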
In this case, the type of output file (bcf or vcf.gz) is the same as the type of the input file.
Exercise Use bcftools view and tail to see that the names have really been changed. Then use bcftools query to do the same.
22.1.4 Get fragments/parts of my VCF file
There are lots of ways to extract desired bits of information from a VCF file into a more manageable format.
Extract keyed values from the INFO field When we looked at the variant lines earlier (with bcftools view -H), we saw that there is a lot of information in the INFO field. What if we wanted to extract that? It looks like it could be hard to parse because the fields are in semicolon-separated key-value pairs.
This is another job for bcftools query. You pass a format string to the -f option that tells the program which fields you want to extract and how you want them formatted. In general, the values are preceded by a % and subfields of the INFO column are named like %INFO/SUBFIELD. You can ask for TABs between fields with \t and for line endings with \n. In general you need to wrap all of these format specifications in single quotes so that the shell does not get confused by them. Check out some examples:
# extract CHROM POS and BaseQRankSum, separated by TABs
bcftools query -f '%CHROM\t%POS\t%INFO/BaseQRankSum\n' all.vcf.gz | less
# extract CHROM POS and total read depth DP
bcftools query -f '%CHROM\t%POS\t%INFO/DP\n' all.bcf | less
You can even extract information from each of the genotype columns. If you want to print CHROM and POS and then all of the PHRED-scaled genotype likelihoods for all the samples, separated by TABs, you can do:
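That looks something like this (PL is the FORMAT field holding the PHRED-scaled genotype likelihoods):
bcftools query -f '%CHROM\t%POS\t[%PL\t]\n' all.vcf.gz | less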
Note that FORMAT values (i.e., those in the genotype columns) must be wrapped in [ ] to get all the values to be printed out.
EXERCISE Extract the CHROM, POS, and Maximum Likelihood-estimated Allele Frequency (MLEAF in the INFO column) for each variant, along with the allele depths (AD) of each individual, all TAB separated, from the file all.vcf.gz.
View data from specified regions
What if we want to look at variants only in two small regions, like CM031199.1:1-10000 and CM031200.1:1000000-1005000? Pass those, separated by commas, to the -r option (which is an option that applies to many of the subcommands):
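For example, something like:
bcftools view -r CM031199.1:1-10000,CM031200.1:1000000-1005000 all.bcf | less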
You can also specify those regions in a file with the -R option.
View data from specified individuals
You can give the sample names (comma separated) to the -s option. Or, if you want to view all but those individuals, precede the list with a ^. Both are sketched below.
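For example, keeping just two of the samples, and then keeping everyone except those two:
# just samples s001 and s002
bcftools view -s s001,s002 all.vcf.gz | less
# everyone except s001 and s002
bcftools view -s ^s001,s002 all.vcf.gz | less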
You can also supply a text file with sample names (one-per-line) to the capital letter -S option. You can combine options, like -r and -s, as well.
22.1.5 Combine VCF files in various ways
Catenate VCF files
If you have VCF files called from the same reference genome filled with the same samples, it is easy to catenate them together with bcftools concat:
# make two files from different regions
bcftools view -O z -r CM031199.1:1-10000 all.vcf.gz > A.vcf.gz
bcftools view -O z -r CM031200.1:1000000-1005000 all.vcf.gz > B.vcf.gz
# how many variants in each of those?
bcftools stats A.vcf.gz | awk '/^SN/'
bcftools stats B.vcf.gz | awk '/^SN/'
# catenate them back together
bcftools concat -Oz A.vcf.gz B.vcf.gz > CAT.vcf.gz
# how many variants in that?
bcftools stats CAT.vcf.gz | awk '/^SN/'
Note that when using the -O (capital “o”) option to specify the output type (v = VCF, b = BCF, u = uncompressed BCF, z = bgzipped VCF), you don't need a space after the -O.
Merge VCF files
If you have files with different samples in them you can easily combine them:
# make file with first three samples
bcftools view -Oz -s s001,s002,s003 all.vcf.gz > first3.vcf.gz
# make another with the last three samples
bcftools view -Oz -s s006,s007,s008 all.bcf > last3.vcf.gz
# merging requires that the files be indexed
bcftools index first3.vcf.gz
bcftools index last3.vcf.gz
# merge those into a file with 6 samples
bcftools merge -Oz first3.vcf.gz last3.vcf.gz > 6-samples.vcf.gz
22.1.6 Filter out variants for a variety of reasons
There are a lot of ways to filter out variants. bcftools leaves things very general here, and so just about anything is possible. Some simple ones appear below. Remember, we are piping the result to bcftools stats just so that we can see the result. If you really wanted to make a filtered file, you would typically just redirect it to a file.
Just the biallelic SNPs please Get things with no more than 2 alleles and no fewer than two alleles, and of a type = SNP:
# do it and summarize the result to look at it, all in one line:
bcftools view -Ou -m 2 -M 2 --types=snps all.bcf | bcftools stats - | awk '/^SN/'
Just the biallelic indels please
# do it and see the result all in one line:
bcftools view -Ou -m 2 -M 2 --types=indels all.vcf.gz | bcftools stats - | awk '/^SN/'
Note the use of -Ou in order to pipe uncompressed BCF output directly into bcftools stats, using the - for a filename.
Fraction of missing sites less than X
If you want to make sure that 60% of your individuals have at least one read at the genotype, you can do this:
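One way to do that is with the F_MISSING tag in an inclusion expression (F_MISSING is the fraction of samples with missing genotypes, so keeping sites with F_MISSING < 0.4 means at least 60% of the individuals have a called genotype):
bcftools view -i 'F_MISSING < 0.4' all.bcf | bcftools stats - | awk '/^SN/'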
Play with setting the F_MISSING threshold to different values and see how that affects the number of variants retained. (Not much with this example data set, it turns out, because there is not much missing data.)
Exclude based on various features of the data
You can use the -e option to bcftools view or bcftools filter to exclude sites that meet certain criteria. (You can use -i to include those sites and no others.) For a terrifyingly terse and dense description of what sorts of expressions can be used to describe the criteria, see the web manual section on expressions: http://samtools.github.io/bcftools/bcftools.html#expressions.
For example, to only keep things with a maximum-likelihood-estimated allele frequency between 0.4 and 0.6:
bcftools view -i 'INFO/MLEAF >= 0.4 && INFO/MLEAF <= 0.6' all.bcf | bcftools query -f '%INFO/MLEAF\n' | less
Note we are piping the result to bcftools query in order to see what the actual MLEAFs are after filtering. For the most part, this has worked, except for cases in which there are more than two allele frequencies. If we wanted to filter those out, we could filter to only biallelic sites, or, for the sake of illustration, we could retain only those sites at which the MLEAF value for the first alternate allele is between 0.4 and 0.6:
bcftools view -i 'INFO/MLEAF[0] >= 0.4 && INFO/MLEAF[0] <= 0.6' all.bcf | bcftools query -f '%INFO/MLEAF\n' | less
Cool!
How about excluding those sites in which any individual had a DP less than 5? We can test each of the DP columns in the FORMAT columns; we name these FMT/DP. Note that each test (from each sample) is combined with an OR by default.
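So, excluding any site at which at least one sample has DP below 5 might look like this (again piping to stats just to count what survives):
bcftools view -e 'FMT/DP < 5' all.bcf | bcftools stats - | awk '/^SN/'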
To make it easier to see what the DPs are there, let’s print them:
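For example, by piping the filtered output to bcftools query and printing the per-sample DP values:
bcftools view -e 'FMT/DP < 5' all.bcf | bcftools query -f '%CHROM\t%POS\t[%DP\t]\n' | less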
More playing
Here is one that might be interesting. Suppose that we want to filter our data set down to variant sites at which more than two of the eight individuals had more than 10 reads of the first alternate allele?
bcftools view -i 'COUNT(FMT/AD[:1] > 10) > 2' all.bcf | bcftools query -f '%CHROM\t%POS\t[%AD\t]\n' | less
Whoa! Cool.
22.2 How about some more arcane filtering tasks? Let’s dream some up!
This has only scratched the surface of what is possible with bcftools.
22.3 Reconstituting a VCF file from genotype data
There are many other formats for storing genotype data from high-throughput sequencing, although few, if any, of them allow for the richness of variant types that can be encoded in a VCF file, and most of them also don’t allow for all the auxiliary information (like read depths, allele depths, genotype likelihoods, etc.) to also be easily stored, indexed, and accessed.
Some of the different formats are specialized to certain types of data. For example, the program PLINK 2.0 has a binary format that is incredibly efficient for storing biallelic SNP data. PLINK also has facilities for converting between its format and VCF. Some formats, however, don't seem to have any associated programs for converting to VCF.
There can be times when you receive variant data from other people or sources that are not in VCF format. Data may be archived or placed in public repositories in a variety of formats. For example, some of the variant data from a recent pearl millet sequencing project has been made public at https://cegresources.icrisat.org/data_public/PM_SNPs/SNP_calls/, in the form of .genotypes files.
Here, you can download the first 10000 lines of the file WP.pgchr1.genotype.gz at this link: https://eriqande.github.io/eca-bioinf-handbook/downloads/WP.pgchr1_10K.genotype.gz, and you can get the explanation of the columns in the file (including the sample names) from this link: https://eriqande.github.io/eca-bioinf-handbook/downloads/genotype.readme
or on your cluster, you could use:
wget https://eriqande.github.io/eca-bioinf-handbook/downloads/WP.pgchr1_10K.genotype.gz
wget https://eriqande.github.io/eca-bioinf-handbook/downloads/genotype.readme
These files give the genotypes of individuals using IUPAC nucleotide codes for the bases. Each diploid sample gets a column, and a single letter gives the types of the two bases at that site in a sample.
The meaning of each of the columns is given in the genotype.readme file:
Column Description/SampleID
1 Chromosome
2 Position
3 Reference genotype
4 PE00838
5 PE01458
6 PE05720
7 PE05722
8 PE05724
9 PE08084
...
So, we have the equivalent of a CHROM, a POS, and a REF column, and then we get one column for each sample.
The .genotypes files look something like this, showing only the first few columns:
chr1 48 A - - G G R G - - G - - G G G G G G - R G - G R - G A G G G -
chr1 79 T T T T T T T - T T - - T T T T T T T T T - T T T T T T T T T
chr1 83 C C C C C C C - C C - - C C C C C C C C C - C C C C C Y C C C
chr1 95 C C Y Y T C C - C C - - C C C C C Y C C C - C C C C C C C C C
chr1 118 C C C C C C C - C C C C C C C C C C C C C C C C C C C C C C C
chr1 119 C C M M M C C - C C C C C C C C C M M C C C C C C M C C C C C
Importantly, here we see that missing data is denoted by a -.
Some of these sites are multiallelic. We can see that by looking at the distinct genotypes in each row. Here we can use awk to look at the distinct genotypes in a few of the multiallelic positions found in the WP.pgchr1_10K.genotype.gz file we just downloaded:
gzcat WP.pgchr1_10K.genotype.gz | \
awk '
{
ref=$3;
for(i=4;i<=NF;i++) {
if($i!="-" && $i!=ref && !($i in n)) n[$i]++
}
printf("%s %s %s", $1, $2, ref);
for(i in n) printf(" %s", i);
printf("\n");
delete n
}
' | awk 'NF>5'
That gives us some output that looks like:
chr1 2118 G A R S
chr1 4023 T A C W Y
chr1 4404 G A C R S
chr1 22272 C A M T W
chr1 22843 C A M T
chr1 50267 C A M T
chr1 52108 C A M T Y
chr1 52610 C A M T Y
chr1 54107 A C G M R
chr1 54192 G A C M R S
...
This is good to know! If we want to convert this to a VCF, we will have to pay attention to the fact that some sites have multiple alternate alleles.
22.3.1 Why create a VCF file?
You may ask, why do you need a VCF file? Well, perhaps you wish to annotate these variants, attaching information about their possible effects, given information about coding sequences that they might occur in. This can be done by identifying where each position sits with respect to genomic features like mRNA transcript boundaries, exons, and coding sequence.
Information about those sorts of genomic features can be found in a .gff file. The extension .gff stands for “General Feature Format.” The pearl millet GFF for the genome that the variants in the .genotypes files were mapped against can be found at: https://cegresources.icrisat.org/data_public/PearlMillet_Genome/v1.0/pm_assembly_v1.0.gff.
The first few lines of that file look like:
The first few lines of that file look like:
chr1 GLEAN mRNA 29490538 29492233 0.670065 + . ID=Pgl_GLEAN_10008614;
chr1 GLEAN CDS 29490538 29490605 . + 0 Parent=Pgl_GLEAN_10008614;
chr1 GLEAN CDS 29491108 29491390 . + 1 Parent=Pgl_GLEAN_10008614;
chr1 GLEAN CDS 29491994 29492233 . + 0 Parent=Pgl_GLEAN_10008614;
chr1 GLEAN mRNA 244026666 244030597 0.57488 + . ID=Pgl_GLEAN_10020091;
chr1 GLEAN CDS 244026666 244027096 . + 0 Parent=Pgl_GLEAN_10020091;
chr1 GLEAN CDS 244027627 244028075 . + 1 Parent=Pgl_GLEAN_10020091;
chr1 GLEAN CDS 244030287 244030597 . + 2 Parent=Pgl_GLEAN_10020091;
chr1 GLEAN mRNA 224685509 224689330 0.653208 + . ID=Pgl_GLEAN_10021721;
chr1 GLEAN CDS 224685509 224685562 . + 0 Parent=Pgl_GLEAN_10021721;
chr1 GLEAN CDS 224687184 224687617 . + 0 Parent=Pgl_GLEAN_10021721;
chr1 GLEAN CDS 224687878 224688208 . + 1 Parent=Pgl_GLEAN_10021721;
chr1 GLEAN CDS 224688317 224688388 . + 0 Parent=Pgl_GLEAN_10021721;
chr1 GLEAN CDS 224688463 224688664 . + 0 Parent=Pgl_GLEAN_10021721;
chr1 GLEAN CDS 224688789 224689330 . + 2 Parent=Pgl_GLEAN_10021721;
Each row shows the start and stop position of some feature in the genome. Each of these features may themselves be sub-features of some other parent feature. For example, the first feature is an mRNA transcript with the ID Pgl_GLEAN_10008614. Next there are several coding sequences (CDS) that are subfeatures of the Pgl_GLEAN_10008614 transcript, which can be seen because they have Pgl_GLEAN_10008614 as their parent.
Trying to figure out whether a variant occurs in a position within a coding sequence that will create an amino acid change can be done by figuring out whether the variant is in a coding sequence and whether the sequence change causes a non-synonymous change in the codon that it is in, etc. Doing so by writing your own scripts is hard and subject to a lot of error. (I know, because I spent a good several days doing that some years ago, and then found I had made a lot of mistakes when I went to check the interesting sites I found on the GenBank genome browser!) There is a lot to figure out because the coding sequences are translated from either the plus (+) or the minus (-) strand relative to the reference genome, etc.
It is much easier to use a program like snpEff to do that for you. However, snpEff takes VCF as the input format. Hence, one might want to convert a .genotypes file to VCF.
22.3.2 Converting the .genotypes file to VCF
There are lots of ways we could do this. Today we are going to use awk, because it is fast and can process the file one line at a time, so it has a very small memory footprint.
Obviously, the .genotypes file really only has information about the genotypes themselves, so when we make a VCF file out of it, it is going to be pretty stripped down. In fact, this is really going to become a minimal VCF file. There is so little information in it that the header for it will be just the minimal header:
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
and then it will have the #CHROM header line:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PE00838 PE01458 PE05720 PE05722 ...
And then we will start with the variant lines that will look like this:
chr1 48 . A G 100 PASS . GT ./. ./. 1/1 1/1 0/1 ...
For each variant, we have a CHROM and a POS. We don't have an ID, so those will all be set to . (a period). We have a REF and an ALT, and for the QUAL column we will just arbitrarily put in a 100, and in the FILTER column we will give each variant a PASS. There is no other information about these variants, so we list the INFO with . (missing data). Then, the FORMAT column lists only the single field, GT, which is described in our minimal header, and then the actual genotypes get listed.
To make such a VCF file, we write an awk script that first prints the header lines, and then, for each variant, it does two passes through the information in the .genotypes file:
- A first pass to figure out what the reference and all the different alternate alleles are. Numbers are assigned to the different alternate alleles at this step.
- A second pass to convert each column to something that looks like 0/1 or 1/1, etc.
Before we proceed, let's confirm that only IUPAC codes corresponding to one or two nucleotides are used (i.e., no three- or four-way ambiguity codes), which makes sense, because these are diploids. Here we tally the codes seen in the first 100,000 rows of the full WP.pgchr1.genotype.gz file:
gzcat WP.pgchr1.genotype.gz | awk '{for(i=3;i<=NF;i++) n[$i]++} NR>100000 {for(i in n) print i, n[i]; exit}'
A 5240464
C 7949796
G 7852659
K 85767
M 83639
R 263234
S 78251
T 5274252
W 79988
Y 265687
- 10526640
So, we need a way of converting IUPAC codes to alleles. Below, for each IUPAC-code-genotype given in the first column, the first and second alleles are given in the second and third columns:
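Such a table, using the standard IUPAC codes for one- and two-base genotypes (this is the same information that appears in the awk-generated assignments just below), would look like this:
A A A
C C C
G G G
T T T
R A G
Y C T
S G C
W A T
K G T
M A C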
If we had such a table stored in a file extra/iupac.txt, we could use awk to write a block of awk code that defines some awk arrays giving the values of the first and second allele for each. That would look like this:
awk '{printf("iupac[\"%s\",1] = \"%s\";\n", $1, $2); printf("iupac[\"%s\",2] = \"%s\";\n\n", $1, $3);}' extra/iupac.txt
iupac["A",1] = "A";
iupac["A",2] = "A";
iupac["C",1] = "C";
iupac["C",2] = "C";
iupac["G",1] = "G";
iupac["G",2] = "G";
iupac["T",1] = "T";
iupac["T",2] = "T";
iupac["R",1] = "A";
iupac["R",2] = "G";
iupac["Y",1] = "C";
iupac["Y",2] = "T";
iupac["S",1] = "G";
iupac["S",2] = "C";
iupac["W",1] = "A";
iupac["W",2] = "T";
iupac["K",1] = "G";
iupac["K",2] = "T";
iupac["M",1] = "A";
iupac["M",2] = "C";
Then we can put that code into a little awk script that we will call genotypes2vcf.awk. You can copy the code in the following block and put it into a file called script/genotypes2vcf.awk (i.e., it is called genotypes2vcf.awk inside a directory you made called script):
# we begin by creating arrays that tell us the first and second allele
# of each IUPAC-coded genotype
BEGIN {
iupac["A",1] = "A";
iupac["A",2] = "A";
iupac["C",1] = "C";
iupac["C",2] = "C";
iupac["G",1] = "G";
iupac["G",2] = "G";
iupac["T",1] = "T";
iupac["T",2] = "T";
iupac["R",1] = "A";
iupac["R",2] = "G";
iupac["Y",1] = "C";
iupac["Y",2] = "T";
iupac["S",1] = "G";
iupac["S",2] = "C";
iupac["W",1] = "A";
iupac["W",2] = "T";
iupac["K",1] = "G";
iupac["K",2] = "T";
iupac["M",1] = "A";
iupac["M",2] = "C";
# Now, print the minimal header lines
printf("##fileformat=VCFv4.2\n")
printf("##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">\n")
printf("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT")
}
# we catenate the sample names, each on a line starting with SAMPLE
# to the input, and then print them all here
$1=="SAMPLE" {
printf("\t%s", $NF);
next
}
/xxxxxxxxxxxxxx/ {printf("\n"); next}
# then, for each row, we make a first pass and define integers for the
# different alternate alleles. Then we make a second pass and print
# the VCF line
{
CHROM=$1;
POS=$2
delete a_ints; # clear these arrays
delete alleles;
a_idx = 0; # start the allele index at 0 (the REF)
alleles[$3] = a_idx; # reference allele gets a 0
a_ints[a_idx] = $3; # keep this array to have alleles in sorted order,
# indexed from 0 to the number of alleles - 1
# cycle over all the columns, and the two alleles within each genotype
# in that column, and add any new alleles found to the alleles hash array
for(i=4;i<=NF;i++) {
for(a=1;a<=2;a++) if($i != "-") {
alle = iupac[$i, a];
if(!(alle in alleles)) { # if we have not seen this allele before
alleles[alle] = ++a_idx;
a_ints[a_idx] = alle
}
}
}
# Now we can print the VCF line
# print CHROM, POS, ID, and REF columns
printf("%s\t%s\t.\t%s", $1, $2, $3)
# print the ALT field, including comma-separated alleles if multiallelic
if(a_idx == 0) { # if there are no alternate alleles ALT gets a .
printf("\t.")
} else {
printf("\t%s", a_ints[1])
for(a=2;a<=a_idx;a++)
printf(",%s", a_ints[a])
}
# Set all the QUALs to 100, the FILTERs to PASS and the INFO to .
printf("\t100\tPASS\t.")
# make the FORMAT column. It just has GT
printf("\tGT")
# now, cycle over the individuals and print their genotypes
for(i=4;i<=NF;i++) {
if($i=="-") {
printf("\t./.");
} else {
a = iupac[$i, 1];
b = iupac[$i, 2];
printf("\t%s/%s", alleles[a], alleles[b])
}
}
printf("\n")
}
In order to use that awk script we will use awk's -f option to read the script, and we will also pass to awk via stdin the genotypes file with the sample names on top of it, followed by a line with an xxxxxxxxxxxxxxxxx:
(
awk 'NR>=5 {printf("SAMPLE %s\n", $2);}' genotype.readme;
echo xxxxxxxxxxxxxxxxx;
gzcat WP.pgchr1_10K.genotype.gz
) | awk -f script/genotypes2vcf.awk > WP.pgchr1_10K.vcf
That creates the VCF file, WP.pgchr1_10K.vcf. Woo-hoo! On my old laptop, that takes about 10 seconds, so about 1 second per 1000 variants. That means that if you have 3.6 million variants, it will take about an hour to convert them all to this VCF format. But once they are in VCF, you can compress and index them and manipulate them quite quickly.
22.3.3 Reheader the VCF
It is good practice to have the reference genome information in the VCF header. We can do that with the bcftools reheader subcommand. To do so, you need to have the .fai file made by running samtools faidx on the reference genome. You can download the .fai file from: https://eriqande.github.io/eca-bioinf-handbook/downloads/pearl_millet_v1.1.fa.fai
or, on your cluster, use:
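That is, mirroring the earlier downloads:
wget https://eriqande.github.io/eca-bioinf-handbook/downloads/pearl_millet_v1.1.fa.fai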
Once we have that, we can reheader the VCF file like this (it is piped to less just so you can view the result without it bombing your screen).
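A minimal version of that, using the .fai we just downloaded, looks like:
bcftools reheader -f pearl_millet_v1.1.fa.fai WP.pgchr1_10K.vcf | less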
While we are at it, however, we are going to go one step further and add some fields to the INFO column. We are going to calculate the alternate allele count (AC), the total number of alleles of all types (AN), the alternate allele frequency (AF), and the minor allele frequency (MAF) for each variant. To add these INFO tags we will use the fill-tags plug-in for bcftools. This plug-in comes standard with bcftools. You can see the help information for it like this:
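That is, running the plugin without giving it a VCF file, which makes it print its help message:
bcftools +fill-tags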
For a detailed list of the available tags to fill you can do:
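Here the plugin's -l (--list-tags) option should do the trick:
bcftools +fill-tags -- -l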
So, we can add the genome to the header and add AN, AC, AF, and MAF fields to the INFO, and compress the file into a vcf.gz, all in one fell swoop with the following, and then we can index it.
bcftools reheader -f pearl_millet_v1.1.fa.fai WP.pgchr1_10K.vcf | \
bcftools +fill-tags - -Oz -o WP.pgchr1_10K.vcf.gz -- -t AN,AC,AF,MAF
# then we could index twice to get both a .tbi and a .csi index
# like this
bcftools index WP.pgchr1_10K.vcf.gz # makes .csi
bcftools index -t WP.pgchr1_10K.vcf.gz # makes .tbi
And then you could look at it, compute statistics for it, or pull out only the multiallelic sites, along the lines sketched below.
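Here is one way to do each of those (the -m 3 option to bcftools view asks for sites with at least three alleles, i.e., the multiallelic ones):
# look at it
bcftools view WP.pgchr1_10K.vcf.gz | less
# compute statistics for it
bcftools stats WP.pgchr1_10K.vcf.gz | less
# look at only the multiallelic sites
bcftools view -H -m 3 WP.pgchr1_10K.vcf.gz | less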
Wow! Playing around with this makes it clear that the VCF format is pretty awesome, and that if you get good at using bcftools, then there are so many wonderful possibilities.
22.3.4 Where to from here?
A good exercise (especially for anyone studying pearl millet!) would be to write a script that cycles over the different chromosomes and makes a VCF for each one, then uses bcftools concat to concatenate them into a single VCF file and adds a few info tags. Once that is done, you would have a VCF file ready for annotating some variants with snpEff.
To automate all of this on the cluster, it is worth noting that the .genotype files can be downloaded like so:
wget --no-check-certificate https://cegresources.icrisat.org/data_public/PM_SNPs/SNP_calls/WP.pgchr7.genotype.gz
Perhaps on Thursday we can review a Snakefile that would do all this.