Chapter 22 Basic Handling of VCF files

As we saw in the section on bioinformatic formats, VCF files can be large and unwieldy. The format specification is also such that fields might have different numbers of subfields, depending, for example, on the number of alleles found at a variant. Both of these features make it hard to directly read a VCF file into a, say, R, or some other program that may wish to treat it purely as tabular data.

This is not to say that you couldn’t just read a VCF file into R directly as a TAB delimited text file, and then start splitting fields up on it. However, there are specialized tools for doing operations of VCF files, and becoming familiar with them can relieve a lot of the pain of dealing with VCF files.

To have an example VCF file to play with, you can download one to your Unix workstation with the following commands. You should put these files in your scratch directory somewhere, perhaps creating a directory called bcftools-play to put them into. All the following commands assume that the two example files are in the current working directory.

First we download the vcf.gz version of the VCF file, called all.vcf.gz. This is the unfiltered VCF file created by our example WGS workflow in Section 27:

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1zgFyYbfWU85O4JzmOX7-MOKQs4oXv4fm' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1zgFyYbfWU85O4JzmOX7-MOKQs4oXv4fm" -O all.vcf.gz && rm -rf /tmp/cookies.txt

After that, we also download the BCF version of the same file, just so that everyone gets familiar with the fact that BCF files can be treated equivalently to VCF files with bcftools.

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1iUP_UstnmuLSGuBIfb0EYXs1Gw1z-z0o' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1iUP_UstnmuLSGuBIfb0EYXs1Gw1z-z0o" -O all.bcf && rm -rf /tmp/cookies.txt

To repeat, BCF and vcf.gz are two formats serving similar purposes. I will be using both all.bcf and all.vcf.gz interchangeably in the following examples.

There are two main, well-known programs for handling VCF files: vcftools and bcftools. Both of these grew out of the 1000 Genomes effort starting about a decade ago. It seems that vcftools may have been developed first, but, currently, bcftools is being more actively developed, with new versions and new features being added to it regularly. vcftools provides some very specific commands for particular analyses or operations, some of which are not available from bcftools. On the other hand, bcftools provides a more general interface for operations on VCF files. By this interface, a great number of the operations done in vcftools are available, but a little extra knowledge is required to implement them. That said, the range of possible operations seems much larger in bcftools than in vcftools.

Further, bcftools behaves like a typical Unix utility, allowing data to be streamed to stdout, and data can be streamed into bcftools (by using the - as the input file name) from stdin. This lets you pipe output into it the way you can with most Unix tools. This makes it far more versatile than vcftools.

22.1 bcftools

If you don’t already have bcftools, you can use mamba to create a conda environment called bcftools that has it:

mamba create -n bcftools -c conda-forge -c bioconda bcftools=1.15.1
conda activate bcftools

(Note the use of -c conda-forge in the above. This is present because some of the dependencies for bcftools are not found on the bioconda channel. Rather they are on the conda-forge channel. If you conda/mamba environment is not set up to search conda-forge by default, then the -c conda-forge is required to get all the dependencies.)

Here, we just want to give everyone the chance to run through a few operations with bcftools, to start to get familiar with its interface. The first thing to note is that, like samtools (which is maintained by the same group of people), bcftools possesses a number of different subcommands. So, the syntax is always like:

bcftools subcommand options file(s)

Also like samtools, bcftools will take input from stdin rather than from a file—you just pass it - instead of a file name.

The full documentation/manual for bcftools is maintained at: http://samtools.github.io/bcftools/bcftools.html. It is well worth reading through this whole documentation, though it can be quite terse and intimidating. A friendlier “tutorial” introduction to the software can be found at https://samtools.github.io/bcftools/howtos/index.html.

Here we are going to get our feet with with a few operations.

First, we will look at the “onboard” documentation. By just entering bcftools you get a list of all the subcommands that are available:

bcftools

If you want the onboard documentation for any of the particular subcommands, you can just give a naked bctools subcommand command, like:

bcftools index

or, for a more daunting set of documentation:

bcftools roh

22.1.1 Index my VCF file!

The first thing we are going to do is index our VCF files. We create the default index, a coordinate sorted index which has the .csi extension. If your VCF file is not in coordinate-sorted order, you might have to sort it before you do this. However, all.vcf.gz and all.bcf are already sorted. So:

bcftools index all.vcf.gz
bcftools index all.bcf

Use ls to see the new files created by that operation.

The index allows for rapid access to different parts of the files that correspond to specific locations in the genome.

22.1.2 Tell me about my VCF file!

VCF files are a little daunting. Especially when they are gzipped (or are BCF files!) they can seem particularly opaque—learning anything about them in the traditional manner of uncompressing them and then searching for lines within them or counting up the number of records can be time consuming. Here are some bcftools solutions to a few different questions you might have.

Who is in this file? You can always try to find the last header line in a VCF file using grep or awk and parse the individuals out yourself, but it turns out to be faster and safer to use the query subcommand from bcftools with the -l option. Do it here:

bcftools query -l all.vcf.gz

# And, of course, you can do the same with the BCF file
bcftools query -l all.bcf

Then read about it on the manual page. Find the part that describes it.

How many variants are in this file? This question can be answered quickly with bcftools stats, which also returns to you a plethora of information about the variants.

bcftools stats all.vcf.gz | less

The top part of the output tells you how many SNPs and indels (and other types of variants) there are. Then it tells you about Ts/Tv ratios, then it essentially gives histogram summaries for allele frequencies, variant quality scores (QUAL), insertion-deletion sizes, substitution types, read depths, etc.

Where are these variants? There are several ways to answer this question. One might be simply to print the CHROM and the POS for each row in the VCF file:

bcftools query -f '%CHROM\t%POS\n' all.vcf.gz

If you want to see where it starts and where it finishes you can do:

bcftools query -f '%CHROM\t%POS\n' all.vcf.gz | head 
bcftools query -f '%CHROM\t%POS\n' all.vcf.gz | tail

If we wanted to quickly see how many variants were on each of the chromosomes/scaffolds, sorted by number of variants, we could do:

bcftools query -f '%CHROM\t%POS\n' all.vcf.gz | awk '{n[$1]++} END {for(i in n) print i, n[i]}' | sort -nbr -k 2 | less

This shows one use of the subcommand query, which is quite useful. Even though it is named query its main purpose is simply extracting fields of information from a VCF file and spitting them out in a new, user-specified, typically tabular format.

Give me a glimpse of the file You can use bcftools view for a number of things, but at its simplest, you can merely look at the file in VCF format. (In this manner, it behaves much like samtools view for VCF files).

# show the whole file from the top
bcftools view all.bcf | less

# of course, this works with either bcf or vcf or vcf.gz
bcftools view all.vcf.gz | less

# show just the header with -h.  Here we look at just the last 10 lines of the header
bcftools view -h all.bcf  | tail

# show the variants themselves (no header) with -H
bcftools view -H all.vcf.gz | head

Just like you can with samtools view you can convert formats with bcftools view. Pipe a VCF into it and then use the -O (Big-O, not a zero) option:

-O z: bgzipped VCF (vcf.gz)
-O v: uncompressed VCF (the default)
-O u: uncompressed BCF
-O b: compressed BCF

22.1.3 Rename the samples/individuals in the file

We saw above that the names of the samples in the file are like s001, s002. This was not actually what I had intended! The names in here are set by the SM field of the read groups in the BAM files from which variants are called. In Section 27.6.3, where those BAM files were made, I screwed up and used the wrong column from the numbered-units.tsv file to set the SM value in the read groups. I wanted to use sample_id not sample. However, all is not lost. We don’t have to go all the way back to the beginning and remap everything and call variants. We simply rename the samples in the file. For this we can use bcftools reheader. First, look at the documentation for that, both on the web, and with:

bcftools reheader

Aha! we see that the web-based documentation is a little more complete, and it tells us what format to use for a sample-renaming file for the -s option. Copy the following contents (using nano, for example) into a file called sample-renames.txt

s001    T199967
s002    T199968
s003    T199969
s004    T199970
s005    T199971
s006    T199972
s007    T199973
s008    T199974

Then we can make a renamed bcf file with:

bcftools reheader -s sample-renames.txt all.bcf  > renamed-all.bcf

and a renamed vcf.gz file with:

bcftools reheader -s sample-renames.txt all.vcf.gz  > renamed-all.vcf.gz

In this case, the type of output file (bcf or vcf.gz) is the same as the type of the input file.

Exercise Use bcftools view and tail to see that the names have really been changed. Then use bcftools query to do the same.

22.1.4 Get fragments/parts of my VCF file

There are lots of ways to extract desired bits of information from a VCF file into a more manageable format.

Extract keyed values from the INFO field When we did this:

bcftools view -H | less

we saw that there is a lot of information in the INFO field. What if we wanted to extract that? It looks like it could be hard to parse because the fields are in semi-colon-separated key-value pairs.

This is another job for bcftools query. You pass a format string to the -f option that tells the program which fields you want to extract and how you want to format it. In general, the values are preceded by a % and subfields of the INFO column can be named described like %INFO/SUBFIELD. You can ask for TABs between fields with \t and for line endings with \n. In general you need to wrap all of these format specifications in single quotes so that the shell does not get confused by them.

Check out some examples:

# extract CHROM POS and BaseQRankSum, separated by TABs
bcftools query -f '%CHROM\t%POS\t%INFO/BaseQRankSum\n' all.vcf.gz | less

# extract CHROM POS and total read depth DP
bcftools query -f '%CHROM\t%POS\t%INFO/DP\n' all.bcf | less

You can even extract information from each of the genotype columns. If you want to print CHROM and POS and then all of the PHRED-scaled genotype likelihoods for all the samples, separated by TABs, you can do:

bcftools query -f '%CHROM\t%POS\t[%PL\t]\n' all.bcf | less

Note that FORMAT values (i.e., those in the genotype columns) must be wrapped in [ ] to get all the values to be printed out.

EXERCISE Extract the CHROM, POS, Maximum Likelihood-estimated Allele Frequency (MLEAF in the INFO column) for each variant, along with the allele depths (AD) of each individual, all TAB separated, from the file all.vcf.gz.

View data from specified regions

What if we want to look at variants only in two 10 Kb regions, like CM031199.1:1-10000 and CM031200.1:1000000-1005000? Pass those, separated by commas, to the -r option (which is an option that applies to many of the subcommands):

bcftools view -H -r CM031199.1:1-10000,CM031200.1:1000000-1005000 all.vcf.gz | less

You can also specify those regions in a file with the -R option.

View data from specified individuals

You can give the sample names (comma separated) to the -s option:

bcftools view -H -s s001,s002,s003 all.vcf.gz | less

Or, if you wanted to view all but those two individuals, precede them with a ^:

bcftools view -H -s ^s001,s002,s003 all.vcf.gz | less

You can also supply a text file with sample names (one-per-line) to the capital letter -S option.

You can combine options, like -r and -s, as well.

22.1.5 Combine VCF files in various ways

Catenate VCF files

If you have VCF files called from the same reference genome filled with the same samples, it is easy to catenate them together with bcftools concat:

# make two files from different regions
bcftools view -O z -r CM031199.1:1-10000 all.vcf.gz  > A.vcf.gz
bcftools view -O z -r CM031200.1:1000000-1005000 all.vcf.gz  > B.vcf.gz

# how many variants in each of those?
bcftools stats A.vcf.gz | awk '/^SN/'
bcftools stats B.vcf.gz | awk '/^SN/'

# catenate the back together
bcftools concat -Oz  A.vcf.gz B.vcf.gz > CAT.vcf.gz

# how many variants in that?
bcftools stats CAT.vcf.gz | awk '/^SN/'

Note that when using the -O (capital “o”) option to specify the output type: v = VCF, b = BCF, u = uncompressed BCF, z = bgzipped VCF, you don’t need a space after the -O.

Merge VCF files

If you have files with different samples in them you can easily combine them:

# make file with first three samples
bcftools view -Oz -s s001,s002,s003 all.vcf.gz > first3.vcf.gz

# make another with the last three samples
bcftools view -Oz -s s006,s007,s008 all.bcf > last3.vcf.gz

# merging requires that the files be indexed
bcftools index first3.vcf.gz
bcftools index last3.vcf.gz

# merge those into a file with 6 samples
bcftools merge -Oz first3.vcf.gz last3.vcf.gz > 6-samples.vcf.gz

22.1.6 Filter out variants for a variety of reasons

There are a lot of ways to filter out variants. bcftools leaves things very general here, and so just about anything is possible. Some simple ones appear below. Remember, we are piping the result to bcftools stats just so that we can see the result. If you really wanted to make a filtered file, you would typically just redirect it to a file.

Just the biallelic SNPs please Get things with no more than 2 alleles and no fewer than two alleles, and of a type = SNP:

# do it and summarize the result to look at it, all in one line:
bcftools view -Ou -m 2 -M 2 --types=snps all.bcf | bcftools stats - | awk '/^SN/'

Just the biallelic indels please

# do it and see the result all in one line:
bcftools view -Ou -m 2 -M 2 --types=indels all.vcf.gz | bcftools stats - | awk '/^SN/'

Note the use of -Ou in order to pipe uncompressed BCF output directly into bcftools stats using the - for a filename.

Fraction of missing sites less than X

If you want to make sure that 60% of your individuals have at least one read at the genotype, you can do this:

bcftools view -i 'F_MISSING < 0.4' all.vcf.gz | bcftools stats - | awk '/^SN/'

Play with setting the F_MISSING to different values and see how that affects the number of variants retained. (Not much with this example data set, it turns out, because there is not much missing data.

Exclude based on various features of the data

You can use the -e option to bcftools view or bcftools filter to exclude sites that meet certain criteria. (You can use -i to include those sites and no others).

For a terrifyingly terse and dense description of what sorts of expressions can be used to describe the criteria, see the web manual section on expressions: http://samtools.github.io/bcftools/bcftools.html#expressions.

For example, to only keep things with a maximum-likelihood-estimated allele frequency between 0.4 and 0.6:

bcftools view -i 'INFO/MLEAF >= 0.4 && INFO/MLEAF <= 0.6' all.bcf | bcftools query -f '%INFO/MLEAF\n' | less

Note we are piping the result to bcftools query in order to see what the actual MLEAFs are after filtering. For the most part, this has worked, except for cases in which there are more than two allele freqencies. If we wanted to filter those out, we could filter to only biallelic sites, or, for the sake of illustration, we could retain only those sites at which the MLEAF value for the first alternate allele is between 0.4 and 0.6:

bcftools view -i 'INFO/MLEAF[0] >= 0.4 && INFO/MLEAF[0] <= 0.6' all.bcf | bcftools query -f '%INFO/MLEAF\n' | less

Cool!

How about excluding those sites in which any individual had a DP less than 5. We can test each of the DP columns in the FORMAT columns. We name these FMT/DP. Note that each test (from each sample) is combined with an OR by default, so:

bcftools view -H -e 'FMT/DP < 5' all.bcf | less

To make it easier to see what the DPs are there, let’s print them:

bcftools view -e 'FMT/DP < 5' all.bcf | bcftools query -f '%CHROM\t%POS\t[%DP\t]\n' | less

More playing

Here is one that might be interesting. Suppose that we want to filter our data set down to variant sites at which at least two of the eight individuals had at least 10 reads of the first alternate allele?

bcftools view -i 'COUNT(FMT/AD[:1] > 10) > 2' all.bcf | bcftools query -f '%CHROM\t%POS\t[%AD\t]\n' | less

Whoa! Cool.

22.2 How about some more arcane filtering tasks? Let’s dream some up!

This has only scratched the surface of what is possible with bcftools.

22.3 Reconstituting a VCF file from genotype data

There are many other formats for storing genotype data from high-throughput sequencing, although few, if any, of them allow for the richness of variant types that can be encoded in a VCF file, and most of them also don’t allow for all the auxiliary information (like read depths, allele depths, genotype likelihoods, etc.) to also be easily stored, indexed, and accessed.

Some of the different formats are specialized to certain types of data. For example, the program PLINK 2.0 has a binary format that is incredibly efficient for storing biallelic SNP data. PLINK also has facilities in the program for converting to and from plink format to VCF. Some formats, however, don’t seem to have any associated programs for converting to VCF.

There can be times when you receive variant data from other people or sources that are not in VCF format. Data may be archived or placed in public repositories in a variety of formats. For example, some of the variant data from a recent pearl millet sequencing project has been made public at https://cegresources.icrisat.org/data_public/PM_SNPs/SNP_calls/, in the form of .genotypes files.

Here, you can download the first 10000 lines of the file WP.pgchr1.genotype.gz at this link: https://eriqande.github.io/eca-bioinf-handbook/downloads/WP.pgchr1_10K.genotype.gz, and you can get the explanation of the columns in the file (including the sample names) from this link: https://eriqande.github.io/eca-bioinf-handbook/downloads/genotype.readme

or on your cluster, you could use:

wget https://eriqande.github.io/eca-bioinf-handbook/downloads/WP.pgchr1_10K.genotype.gz
wget https://eriqande.github.io/eca-bioinf-handbook/downloads/genotype.readme

These files give the genotypes of individuals using IUPAC nucleotide codes for the bases. Each diploid sample gets a column, and a single letter gives the types of the two bases at that site in a sample.

The meaning of each of the columns is given in the genotype.readme file:

Column  Description/SampleID
1   Chromosome
2   Position
3   Reference genotype
4   PE00838
5   PE01458
6   PE05720
7   PE05722
8   PE05724
9   PE08084
...

So, we have the equivalent of a CHROM, and and POS and a REF column, then, we get one column for each sample.

The .genotypes files look something like this, showing only the first few columns:

chr1    48      A       - - G G R G - - G - - G G G G G G - R G - G R - G A G G G -
chr1    79      T       T T T T T T - T T - - T T T T T T T T T - T T T T T T T T T
chr1    83      C       C C C C C C - C C - - C C C C C C C C C - C C C C C Y C C C
chr1    95      C       C Y Y T C C - C C - - C C C C C Y C C C - C C C C C C C C C
chr1    118     C       C C C C C C - C C C C C C C C C C C C C C C C C C C C C C C
chr1    119     C       C M M M C C - C C C C C C C C C M M C C C C C C M C C C C C

Importantly, here we see that missing data is denoted by a -.

Some of these sites are multiallelic. We can see that by looking at the distinct genotypes in each row. Here we can use awk to look at the distinct genotypes in a few of the multiallelic positions found in the WP.pgchr1_10K.genotype.gz we just downloaded:

gzcat WP.pgchr1_10K.genotype.gz | \
  awk '
    {
      ref=$3; 
      for(i=4;i<=NF;i++) {
         if($i!="-" && $i!=ref && !($i in n)) n[$i]++
      } 
      printf("%s %s %s", $1, $2, ref); 
      for(i in n) printf(" %s", i); 
      printf("\n");  
      delete n
    }
  ' | awk 'NF>5'

That gives use some output that looks like:

chr1 2118 G A R S
chr1 4023 T A C W Y
chr1 4404 G A C R S
chr1 22272 C A M T W
chr1 22843 C A M T
chr1 50267 C A M T
chr1 52108 C A M T Y
chr1 52610 C A M T Y
chr1 54107 A C G M R
chr1 54192 G A C M R S
...

This is good to know! If we want to convert this to a VCF, we will have to pay attention to the fact that some sites have multiple alternate alleles.

22.3.1 Why create a VCF file?

You may ask, why do you need a VCF file? Well, perhaps you wish to annotate these variants, attaching information about their possible effects, given information about coding sequences that they might occur in. This can be done by identifying where each position sits with respect to genomic features like mRNA transcript boundaries, exons, and coding sequence.

Information about those sorts of genomic features can be found in a .gff file. The extension .gff stands for “Genomics Features File.” The pearl millet GFF for the genome that the variants in the .genotypes were mapped against can be found at: https://cegresources.icrisat.org/data_public/PearlMillet_Genome/v1.0/pm_assembly_v1.0.gff.

The first few lines of that file look like:

chr1    GLEAN   mRNA    29490538    29492233    0.670065    +   .   ID=Pgl_GLEAN_10008614;
chr1    GLEAN   CDS 29490538    29490605    .   +   0   Parent=Pgl_GLEAN_10008614;
chr1    GLEAN   CDS 29491108    29491390    .   +   1   Parent=Pgl_GLEAN_10008614;
chr1    GLEAN   CDS 29491994    29492233    .   +   0   Parent=Pgl_GLEAN_10008614;
chr1    GLEAN   mRNA    244026666   244030597   0.57488 +   .   ID=Pgl_GLEAN_10020091;
chr1    GLEAN   CDS 244026666   244027096   .   +   0   Parent=Pgl_GLEAN_10020091;
chr1    GLEAN   CDS 244027627   244028075   .   +   1   Parent=Pgl_GLEAN_10020091;
chr1    GLEAN   CDS 244030287   244030597   .   +   2   Parent=Pgl_GLEAN_10020091;
chr1    GLEAN   mRNA    224685509   224689330   0.653208    +   .   ID=Pgl_GLEAN_10021721;
chr1    GLEAN   CDS 224685509   224685562   .   +   0   Parent=Pgl_GLEAN_10021721;
chr1    GLEAN   CDS 224687184   224687617   .   +   0   Parent=Pgl_GLEAN_10021721;
chr1    GLEAN   CDS 224687878   224688208   .   +   1   Parent=Pgl_GLEAN_10021721;
chr1    GLEAN   CDS 224688317   224688388   .   +   0   Parent=Pgl_GLEAN_10021721;
chr1    GLEAN   CDS 224688463   224688664   .   +   0   Parent=Pgl_GLEAN_10021721;
chr1    GLEAN   CDS 224688789   224689330   .   +   2   Parent=Pgl_GLEAN_10021721;

Each row shows the start and stop position of some feature in the genome. Each of these features may themelves be sub-features of some other parent feature. For example, the first feature is an mRNA transcript with the ID Pgl_GLEAN_10008614. Next there are several coding sequences (CDS) that are subfeatures of the Pgl_GLEAN_10008614 transcript, which can be seen because they have Pgl_GLEAN_10008614 as their parent.

Trying to figure out whether a variant occurs in a position within a coding sequence that will create an amino acid change can be done by figuring out whether the variant is in a coding sequence and whether the sequence change causes a non-synonymous change in the codon that it is in, etc. Doing so by writing your own scripts is hard and subject to a lot of error. (I know, because I spent a good several days doing that some years ago, and then found I had made a lot of mistakes when I went to check the interesting sites I found on the GenBank genome browser!) There is a lot to figure out because the coding sequences are translated from either the plus (+) or the minus (-) strand relative to the reference genome, etc.

It is much easier to use a program like snpEff to do that for you. However, snpEff takes VCF as the input format. Hence, one might want to convert a .genotypes file to VCF.

22.3.2 Converting the `.genotypes` file to VCF

There are lots of ways we could do this. Today we are going to use awk, because it is fast and can process the file one line at a time, so it has a very small memory footprint.

Obviously, the .genotypes file really only has information about the genotypes themselves, so when we make a VCF file out of it, it is going to be pretty stripped down. In fact, this is really going to become a minimal VCF file. There is so little information in it, that the header for it will be just the minimal header:

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

and then it will have the #CHROM header line:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  PE00838 PE01458 PE05720 PE05722 ...

And then we will start with the variant lines that will look like this:

chr1    48  .   A   G   100 PASS    .   GT  ./. ./. 1/1 1/1 0/1 ...

For each variant, we have a CHROM and a POS. We don’t have an ID, so those will all be set to .. We have a REF and an ALT, and for the QUAL column we will just arbitrarily put in a 100, and in the FILTER column we will give each variant a PASS. There is no other information about these variants so we list the INFO with . (missing data). Then, the FORMAT column lists only the single field, GT, which is described in our minimal header, and then the actual genotypes get listed.

To make such a VCF file, we write an awk script that first prints the header lines, and then, for each variant, it does two passes through the information in the .genotypes file:

A first pass to figure out what the reference and all the different alternate alleles are. Numbers are assigned to the different alternate alleles at this step.
A second pass to convert each column to something that looks like 0/1 or 1/1, etc.

Before we proceed, let’s confirm that only the 2-nucleotide IUPAC codes are used, which makes sense, because these are diploids.

gzcat WP.pgchr1.genotype_10K.gz | awk '{for(i=3;i<=NF;i++) n[$i]++} NR>100000 {for(i in n) print i, n[i]; exit}'
A 5240464
C 7949796
G 7852659
K 85767
M 83639
R 263234
S 78251
T 5274252
W 79988
Y 265687
- 10526640

So, we need a way of converting IUPAC codes to alleles. Below, for each IUPAC-code-genotype given in the first column, the first and second alleles are given in the second and third columns:

A A A
C C C
G G G
T T T
R A G
Y C T
S G C
W A T
K G T
M A C

If we had such a table stored in a file extra/iupac.txt, we could use awk to write a block of awk code that defines some awk arrays giving the values of the first and second allele for each. That would look like this:

 awk '{printf("iupac[\"%s\",1] = \"%s\";\n", $1, $2); printf("iupac[\"%s\",2] = \"%s\";\n\n", $1, $3);}' extra/iupac.txt
iupac["A",1] = "A";
iupac["A",2] = "A";

iupac["C",1] = "C";
iupac["C",2] = "C";

iupac["G",1] = "G";
iupac["G",2] = "G";

iupac["T",1] = "T";
iupac["T",2] = "T";

iupac["R",1] = "A";
iupac["R",2] = "G";

iupac["Y",1] = "C";
iupac["Y",2] = "T";

iupac["S",1] = "G";
iupac["S",2] = "C";

iupac["W",1] = "A";
iupac["W",2] = "T";

iupac["K",1] = "G";
iupac["K",2] = "T";

iupac["M",1] = "A";
iupac["M",2] = "C";

Then we can put that code into a little awk script that we will call genotypes2vcf.awk. You can copy the code in the following block and put it into a file called script/genotypes2vcf.awk (i.e., it is called genotypes2vcf.awk inside a directory you made called script):

# we begin by creating arrays that tell us the first and second allele
# of each IUPAC-coded genotype
BEGIN {
  iupac["A",1] = "A";
  iupac["A",2] = "A";

  iupac["C",1] = "C";
  iupac["C",2] = "C";

  iupac["G",1] = "G";
  iupac["G",2] = "G";

  iupac["T",1] = "T";
  iupac["T",2] = "T";

  iupac["R",1] = "A";
  iupac["R",2] = "G";

  iupac["Y",1] = "C";
  iupac["Y",2] = "T";

  iupac["S",1] = "G";
  iupac["S",2] = "C";

  iupac["W",1] = "A";
  iupac["W",2] = "T";

  iupac["K",1] = "G";
  iupac["K",2] = "T";

  iupac["M",1] = "A";
  iupac["M",2] = "C";

  # Now, print the minimal header lines
  printf("##fileformat=VCFv4.2\n")
  printf("##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">\n")
  printf("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT")

}


# we catenate the sample names, each on a line starting with SAMPLE
# to the input, and then print them all here
$1=="SAMPLE" {
  printf("\t%s", $NF);
  next
}

/xxxxxxxxxxxxxx/ {printf("\n"); next}



# then, for each row, we make a first pass and define integers for the
# different alternate alleles. Then we make a second pass and print
# the VCF line
{
  CHROM=$1;
  POS=$2
  delete a_ints; # clear these arrays
  delete alleles;
  a_idx = 0;  # start the allele index at 0 (the REF)
  alleles[$3] = a_idx; # reference allele gets a 0
  a_ints[a_idx] = $3;  # keep this array to have alleles in sorted order,
                       # indexed from 0 to the number of alleles - 1

  # cycle over all the columns, and the two alleles within each genotype
  # in that column, and add any new alleles found to the alleles hash array
  for(i=4;i<=NF;i++) {
    for(a=1;a<=2;a++) if($i != "-") {
      alle = iupac[$i, a];
      if(!(alle in alleles)) {  # if we have not seen this allele before
        alleles[alle] = ++a_idx;
        a_ints[a_idx] = alle
      }
    }
  }

  # Now we can print the VCF line
  # print CHROM, POS, ID, and REF columns
  printf("%s\t%s\t.\t%s", $1, $2, $3) 

  # print the ALT field, including comma-separated alleles if multiallelic
  if(a_idx == 0) { # if there are no alternate alleles ALT gets a .
    printf("\t.")
  } else {
    printf("\t%s", a_ints[1])
    for(a=2;a<=a_idx;a++)
      printf(",%s", a_ints[a])
  }

  # Set all the QUALs to 100, the FILTERs to PASS and the INFO to .
  printf("\t100\tPASS\t.")

  # make the FORMAT column.  It just has GT
  printf("\tGT")

  # now, cycle over the individuals and print their genotypes
  for(i=4;i<=NF;i++) {
    if($i=="-") {
      printf("\t./.");
    } else {
      a = iupac[$i, 1];
      b = iupac[$i, 2];
      printf("\t%s/%s", alleles[a], alleles[b])
    }
  }
  
  printf("\n")
}

In order to use that awk script we will use awk’s -f option to read the script, and we will also pass to awk via stdin the genotypes file with the sample names on top of it followed by a line with an xxxxxxxxxxxxxxxxx

(
  awk 'NR>=5 {printf("SAMPLE %s\n", $2);}' genotype.readme; 
  echo xxxxxxxxxxxxxxxxx; 
  gzcat WP.pgchr1_10K.genotype.gz
) | awk -f script/genotypes2vcf.awk > WP.pgchr1_10K.vcf

That creates the VCF file, WP.pgchr1_10K.vcf. Woo-hoo! On my old laptop, that takes about 10 seconds. So, about 1 second per 1000 variants. That means that if you have 3.6 million variants, it will take about an hour to convert them all to this VCF format. But once they are in VCF, you can compress and index them and manipulate them quite quickly.

22.3.3 Reheader the VCF

It is good practice to have the reference genome information in the VCF header. We can do that with the bcftools reheader subcommand. To do so, you need to have the .fai file made by running samtools faidx on the reference genome. You can download the .fai file from: https://eriqande.github.io/eca-bioinf-handbook/downloads/pearl_millet_v1.1.fa.fai or, on your cluster, use:

wget https://eriqande.github.io/eca-bioinf-handbook/downloads/pearl_millet_v1.1.fa.fai

Once we have that, we can reheader the VCF file like this (it is piped to less just so you can view the result without it bombing your screen).

bcftools reheader -f pearl_millet_v1.1.fa.fai WP.pgchr1_10K.vcf | less

While we are at it, however, we are going to go one step further and add some fields to the INFO column. We are going to calculate the alternate allele count (AC) the total number of alleles of all types (AN) the alternate allele frequency (AF) and the minor allele frequency (MAF) for each variant. To add these INFO tags we will use the fill-tags plug-in for bcftools. This plug-in comes standard with bcftools. You can see the help information for it like this:

bcftools +fill-tags -h

For a detailed list of the available tags to fill you can do:

bcftools +fill-tags -- -l

So, we can add the genome to the header and add AN, AC, AF, and MAF fields to the INFO, and compress the file into a vcf.gz, all in one fell swoop with the following, and then we can index it.

bcftools reheader -f pearl_millet_v1.1.fa.fai WP.pgchr1_10K.vcf | \
    bcftools +fill-tags - -Oz -o WP.pgchr1_10K.vcf.gz -- -t AN,AC,AF,MAF

# then we could index twice to get both a .tbi and a .csi index
# like this
bcftools index WP.pgchr1_10K.vcf.gz  # makes .csi
bcftools index -t WP.pgchr1_10K.vcf.gz  # makes .tbi

And then you could look it at like this:

bcftools view WP.pgchr1_10K.vcf.gz | less

Or you could compute statistics for it like this:

bcftools stats WP.pgchr1_10K.vcf.gz | less

And, if you wanted to look at only the multiallelic sites you could do this:

bcftools view -m 3 WP.pgchr1_10K.vcf.gz | less

Wow! Playing around with this makes it clear that the VCF format is pretty awesome, and that if you get good at using bcftools, then there are so many wonderful possibilities.

22.3.4 Where to from here?

A good exercise, (especially for anyone studying pearl millet!), would be to write a script that cycled over the different chromosomes and made a VCF for each one, then use bcftools concat to concatenate them into a single VCF file and add a few info tags. Once that is done, you would have a VCF file ready to annotate some variants with snpEff.

To automate this all on all the cluster it is worth noting that the .genotype files can be downloaded like so:

wget --no-check-certificate https://cegresources.icrisat.org/data_public/PM_SNPs/SNP_calls/WP.pgchr7.genotype.gz

Perhaps on Thursday we can review a Snakefile that would do all this.