The GRIDSS software suite is something that we might be able to use to find structural variants. I am just going to try to throw the merged BAMs in as if they were for single individuals, since large numbers of individuals will apparently bog this down.

This is not super well documented, but let’s start giving it a whirl.

I am doing all this on the SEDNA cluster.

Download and “install”

# in /home/eanderson/Documents/git-repos/thompson-et-al-chinook-salmon-2020
mkdir -p gridss
cd gridss
wget https://github.com/PapenfussLab/gridss/releases/download/v2.8.3/gridss-2.8.3.tar.gz
gunzip gridss-2.8.3.tar.gz
tar -xvf gridss-2.8.3.tar
cd ..

Index the genome with bwa

bwa index genome/Otsh_V1_genomic.fna

Make a directory to play with this stuff, and try some things

# get 20 cores
srun --pty -c 20 /bin/bash

mkdir gridss-work-area
cd gridss-work-area

# make a configuration file for one changed option:
echo 'useReadGroupSampleNameCategoryLabel=false' > my_config.R

# load modules
module load aligners/bwa
module load bio/samtools
module load R



# now, try a command to launch it:

../gridss/gridss.sh \
  --reference ../genome/Otsh_V1_genomic.fna \
  --output first-test.vcf.gz \
  --assembly first-test.bam \
  --threads 20 \
  --jar ../gridss/gridss-2.8.3-gridss-jar-with-dependencies.jar \
  --workingdir working_dir \
  --jvmheap 25g \
  --configuration my_config.R \
  --labels fall,spring \
  ../intermediates/201/merged_bams/fall.2Mb.bam \
  ../intermediates/201/merged_bams/spring.2Mb.bam

That runs pretty quickly. At the end, we can copy the resulting VCF file (first-test.vcf.gz) to stored_results/201 to analyse on the laptop in R.

Onto the laptop

I have the full VCF in intermediates/201. Now, let us analyze it.

library(tidyverse)
library(vcfR)
ftv <- read.vcfR("intermediates/201/first-test.vcf.gz")

## Scanning file to determine attributes.
## File attributes:
##   meta lines: 14130
##   header_line: 14131
##   variant count: 4215
##   column count: 11
## 
Meta line 1000 read in.
Meta line 2000 read in.
Meta line 3000 read in.
Meta line 4000 read in.
Meta line 5000 read in.
Meta line 6000 read in.
Meta line 7000 read in.
Meta line 8000 read in.
Meta line 9000 read in.
Meta line 10000 read in.
Meta line 11000 read in.
Meta line 12000 read in.
Meta line 13000 read in.
Meta line 14000 read in.
Meta line 14130 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 4215
##   Character matrix gt cols: 11
##   skip: 0
##   nrows: 4215
##   row_num: 0
## 
Processed variant 1000
Processed variant 2000
Processed variant 3000
Processed variant 4000
Processed variant: 4215
## All variants processed

ftvt <- vcfR2tidy(ftv)


#View(ftvt$gt)
#View(ftvt$fix)

Let’s get the breakend position in the ALT field and filter things down to only those variants that “span” the RoSA, and then arrange them by the absolute difference of the breakend positions from each pair, just for a quick look

candidates <- ftvt$fix %>%
  filter(CHROM == "NC_037124.1" & str_detect(ALT, "NC_037124.1")) %>%
  mutate(ALT_POS = as.integer(str_match(ALT, ":([0-9]+)")[,2]),
         BE_ABS_DIFF = abs(ALT_POS - POS)) %>%
  select(ChromKey:POS, ALT_POS, BE_ABS_DIFF, everything()) %>%
  filter(pmin(POS, ALT_POS) < 12.26e6 & pmax(POS, ALT_POS) > 12.29e6) %>%
  arrange(desc(BE_ABS_DIFF))

candidates

That shows two candidates. The first four rows are the characteristic pattern you expect from breakend notation for an inversion (see https://samtools.github.io/hts-specs/VCFv4.3.pdf). In this case, it would be a ~1 Mb inversion spanning from 11.67 Mb to 12.61 Mb. However, this is a Low Quality Variant with no assembly. Furthermore, if we look at the genotype information for the aggregated spring run and fall-run alignments, we don’t see evidence that the variants are specific to ecotypes:

One_Mb_candi <- ftvt$gt %>%
  filter(POS %in% candidates$POS[1:4])
One_Mb_candi

In particular, we see that there are not substantial differences between the spring-run and the fall-run fish in the count of read pairs spanning any of the breakends supporting the reference allele (gt_REFPAIR column). Likewise, similar (and very low) numbers of fragments supporting the alternate allele (i.e. an inversion) are observed from spring-run and fall-run fish (gt_VF column). These records certainly do not provide evidence that this is an inversion fixed for different forms between spring run and fall run.

The final two rows of candidates suggest a 387 Kb inversion from 12.17 to 12.55 Mb; however this is also considered a low quality variant with support from just a single assembly, and only includes half of the breakends expected for an inversion. Here is the genotype information for that variant:

other_candi <- ftvt$gt %>%
  filter(POS %in% candidates$POS[5:6])
other_candi

Investigating that we see that both spring run and fall run fish have comparable numbers of read pairs supporting the reference at one of the breakends (gt_REFPAIR). No fragments support the alternate allele in spring run, while only four fragments support the alternate allele in the fall run, but the genome reference is a fall-run fish, so one would expect the spring-run fish to have the inverted form. Furthermore, an inversion from 12.17 to 12.55 Mb should suppress recombination between the RoSA and the distal region, but recombination has clearly occurred there. The verdict here is that neither of these records provide credible support for an inversion in the region.

Try running it on indiv samples

I suspect that it will not work, but let’s try giving it all the springs and all the falls. Basically 128 individuals, or so.

# get 20 cores
srun --pty -c 20 /bin/bash

mkdir gridss-work-area
cd gridss-work-area

# make a configuration file for one changed option:
echo 'useReadGroupSampleNameCategoryLabel=false' > my_config.R

# load modules
module load aligners/bwa
module load bio/samtools
module load R


# get all the fall then all the spring bams
INDIV_BAMS=$(ls -l ../intermediates/201/single_bams/fall/*.bam ../intermediates/201/single_bams/spring/*.bam | awk '{printf("%s ", $NF);}')

# now, try a command to launch it:

../gridss/gridss.sh \
  --reference ../genome/Otsh_V1_genomic.fna \
  --output as-indivs-test.vcf.gz \
  --assembly as-indivs-test.bam \
  --threads 20 \
  --jar ../gridss/gridss-2.8.3-gridss-jar-with-dependencies.jar \
  --workingdir as-indivs-working-dir \
  --jvmheap 25g \
  --configuration my_config.R \
  $INDIV_BAMS

That ended up failing with a “too many files” open error. While assembling Chunk 157, which included the RoSA. However, I am satisfied with the analysis of the aggregated data.

Session Info

sessioninfo::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.2 (2019-12-12)
##  os       macOS Sierra 10.12.6        
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/Denver              
##  date     2020-05-08                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version  date       lib source        
##  ape           5.3      2019-03-17 [1] CRAN (R 3.6.0)
##  assertthat    0.2.1    2019-03-21 [1] CRAN (R 3.6.0)
##  backports     1.1.6    2020-04-05 [1] CRAN (R 3.6.2)
##  broom         0.5.6    2020-04-20 [1] CRAN (R 3.6.2)
##  cellranger    1.1.0    2016-07-27 [1] CRAN (R 3.6.0)
##  cli           2.0.2    2020-02-28 [1] CRAN (R 3.6.0)
##  cluster       2.1.0    2019-06-19 [2] CRAN (R 3.6.2)
##  colorspace    1.4-1    2019-03-18 [1] CRAN (R 3.6.0)
##  crayon        1.3.4    2017-09-16 [1] CRAN (R 3.6.0)
##  DBI           1.1.0    2019-12-15 [1] CRAN (R 3.6.0)
##  dbplyr        1.4.3    2020-04-19 [1] CRAN (R 3.6.2)
##  digest        0.6.25   2020-02-23 [1] CRAN (R 3.6.0)
##  dplyr       * 0.8.5    2020-03-07 [1] CRAN (R 3.6.0)
##  ellipsis      0.3.0    2019-09-20 [1] CRAN (R 3.6.0)
##  evaluate      0.14     2019-05-28 [1] CRAN (R 3.6.0)
##  fansi         0.4.1    2020-01-08 [1] CRAN (R 3.6.0)
##  forcats     * 0.5.0    2020-03-01 [1] CRAN (R 3.6.0)
##  fs            1.4.1    2020-04-04 [1] CRAN (R 3.6.2)
##  generics      0.0.2    2018-11-29 [1] CRAN (R 3.6.0)
##  ggplot2     * 3.3.0    2020-03-05 [1] CRAN (R 3.6.0)
##  glue          1.4.0    2020-04-03 [1] CRAN (R 3.6.2)
##  gtable        0.3.0    2019-03-25 [1] CRAN (R 3.6.0)
##  haven         2.2.0    2019-11-08 [1] CRAN (R 3.6.0)
##  hms           0.5.3    2020-01-08 [1] CRAN (R 3.6.0)
##  htmltools     0.4.0    2019-10-04 [1] CRAN (R 3.6.0)
##  httr          1.4.1    2019-08-05 [1] CRAN (R 3.6.0)
##  jsonlite      1.6.1    2020-02-02 [1] CRAN (R 3.6.0)
##  knitr         1.28     2020-02-06 [1] CRAN (R 3.6.0)
##  lattice       0.20-38  2018-11-04 [2] CRAN (R 3.6.2)
##  lifecycle     0.2.0    2020-03-06 [1] CRAN (R 3.6.0)
##  lubridate     1.7.8    2020-04-06 [1] CRAN (R 3.6.2)
##  magrittr      1.5      2014-11-22 [1] CRAN (R 3.6.0)
##  MASS          7.3-51.6 2020-04-26 [1] CRAN (R 3.6.2)
##  Matrix        1.2-18   2019-11-27 [2] CRAN (R 3.6.2)
##  memuse        4.1-0    2020-02-17 [1] CRAN (R 3.6.0)
##  mgcv          1.8-31   2019-11-09 [2] CRAN (R 3.6.2)
##  modelr        0.1.6    2020-02-22 [1] CRAN (R 3.6.0)
##  munsell       0.5.0    2018-06-12 [1] CRAN (R 3.6.0)
##  nlme          3.1-142  2019-11-07 [2] CRAN (R 3.6.2)
##  permute       0.9-5    2019-03-12 [1] CRAN (R 3.6.0)
##  pillar        1.4.3    2019-12-20 [1] CRAN (R 3.6.0)
##  pinfsc50      1.1.0    2016-12-02 [1] CRAN (R 3.6.0)
##  pkgconfig     2.0.3    2019-09-22 [1] CRAN (R 3.6.0)
##  purrr       * 0.3.4    2020-04-17 [1] CRAN (R 3.6.2)
##  R6            2.4.1    2019-11-12 [1] CRAN (R 3.6.0)
##  Rcpp          1.0.4    2020-03-17 [1] CRAN (R 3.6.0)
##  readr       * 1.3.1    2018-12-21 [1] CRAN (R 3.6.0)
##  readxl        1.3.1    2019-03-13 [1] CRAN (R 3.6.0)
##  reprex        0.3.0    2019-05-16 [1] CRAN (R 3.6.0)
##  rlang         0.4.5    2020-03-01 [1] CRAN (R 3.6.0)
##  rmarkdown     2.1      2020-01-20 [1] CRAN (R 3.6.0)
##  rstudioapi    0.11     2020-02-07 [1] CRAN (R 3.6.0)
##  rvest         0.3.5    2019-11-08 [1] CRAN (R 3.6.0)
##  scales        1.1.0    2019-11-18 [1] CRAN (R 3.6.0)
##  sessioninfo   1.1.1    2018-11-05 [1] CRAN (R 3.6.0)
##  stringi       1.4.6    2020-02-17 [1] CRAN (R 3.6.0)
##  stringr     * 1.4.0    2019-02-10 [1] CRAN (R 3.6.0)
##  tibble      * 3.0.1    2020-04-20 [1] CRAN (R 3.6.2)
##  tidyr       * 1.0.2    2020-01-24 [1] CRAN (R 3.6.0)
##  tidyselect    1.0.0    2020-01-27 [1] CRAN (R 3.6.0)
##  tidyverse   * 1.3.0    2019-11-21 [1] CRAN (R 3.6.0)
##  vcfR        * 1.10.0   2020-02-06 [1] CRAN (R 3.6.0)
##  vctrs         0.2.4    2020-03-10 [1] CRAN (R 3.6.0)
##  vegan         2.5-6    2019-09-01 [1] CRAN (R 3.6.0)
##  viridisLite   0.3.0    2018-02-01 [1] CRAN (R 3.6.0)
##  withr         2.2.0    2020-04-20 [1] CRAN (R 3.6.2)
##  xfun          0.13     2020-04-13 [1] CRAN (R 3.6.2)
##  xml2          1.3.2    2020-04-23 [1] CRAN (R 3.6.2)
##  yaml          2.2.1    2020-02-01 [1] CRAN (R 3.6.0)
## 
## [1] /Users/eriq/Library/R/3.6/library
## [2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

Running Time

Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed:

Sys.time() - start_time

## Time difference of 3.986401 secs

Use GRIDSS to search for structural variation

Eric C. Anderson

Last Updated: 2020-05-08