Chapter 6 Week Five Meeting

6.1 Some things regarding people’s repositories

6.1.1 Data Compression

We have been endorsing .csv as a good format for data, and it is, because it is human-readable and easily parsed into tibbles. However, when you have very long tibbles, it is not necessarily the most space-efficient format. Large .csv files can take up much more space on your hard drive than they should.

What do we mean by “than they should?” in that context? This has to do with how much information is in the file, where information is used in the context of information theory. Often tibbles will have columns that have fairly “redundant” information—for example, in a mutlispecies salmon data set, one column might have entries that are either “Chinook”, “coho”, or “steelhead”. It takes a few bytes to store each one of those words, and if they are used in a column that has millions of rows, that can add up to a lot of space on your hard drive. Colloquially, data compression is the art of finding ways of using short “code-names” for things or patterns that occur frequently in a file, and in so doing, reducing the overall file size.

The consequences of data compression can be profound and wonderful. You can reduce the size of a file, sometimes by an order or magnitude or more. There are a few good choices available for compressing your data (making it smaller.). Note that doing so often makes it a little harder to edit your data set; however, if your data set is not going to change, and it is large, then it makes sense to compress it—especially if it is so big you would rather not (or can’t) put it on GitHub.

6.1.1.1 gzip

If you are working on a Mac, you have the Unix utility gzip. We will illustrate its use on Katie’s big salmon data set, ASL_merged.csv. Let’s see how big that is. We can use the Unix utility du (stands for “disk usage”).

# give this command on the Terminal in the directory where the file lives:
du -h ASL_merged.csv 

The results comes back:

322M    ASL_merged.csv

Whoa! This file is 332 Megabytes. That is quite large!

However, we can compress it like this:

gzip ASL_merged.csv 

When we do that, it compresses the file and renames it to have a .gz extension on it: ASL_merged.csv.gz. We can then see how big that file is:

du -h ASL_merged.csv.gz 

tells us:

11M ASL_merged.csv.gz

Whoa! We went from 332 Megabytes to 11. It is just 3% of its original size (and small enough that you can safely put it on GitHub).

One very nice feature is that gzipped files can be read in directly by the functions of the readr package. So, for example, read_csv() works just fine on the gzipped version of Katie’s massive salmon data set:

# this works the same as it would on the ungzipped file
salmon <- read_csv("data/ASL_merged.csv.gz")

6.1.1.2 xz comression with saveRDS()

Another method that can be even more efficient with tibbles is to store them as R objects using the saveRDS() function with the xz compression option. This has the nice advantage that all the data types of the variables (for example, if you had made factors out of some) will be preserved exactly as they are in the tibble when you save it.

Let’s imagine we have read the tibble into the variable salmon, and all the column types were as we wanted. Then, we could save that tibble directly to a compressed file like this:

saveRDS(salmon, file = "ASL_xz.rds", compress = "xz")

Note that compress = "xz" option. Let’s see how that did using du on the Unix terminal:

du -h ASL_xz.rds

tells us:

3.7M    ASL_xz.rds

Holy Smokes! Only 3.7 Megabytes. That is only 1.1% of its original size. Lovely!

In order to read that tibble back into a variable (named my_var, say) in R, you would use readRDS() like this:

my_var <- readRDS(file = "ASL_xz.rds")

Voila!

6.2 A quick aside about missing data

Garrett and Hadley note that “missing values are ‘contagious’: almost any operation involving an unknown value will also be unknown.” This is true for the most part, but there is a vexing inconsistency. Observe

This gives us NA as we would hope it would

NA == 0 | NA == 1
## [1] NA

However, this one returns FALSE. What gives?

NA %in% c(0, 1)
## [1] FALSE

6.3 Brief Highlights of the Joins Chapter

You will likely end up using joins all the time. As noted in the book, the left_join() is what you will likely use all the time. In this case you have a “focal” data frame with all the rows (“cases” or “observations”) in the x table that you are going to be wanting to add some columns to. Those columns live in the y table (along with the matching keys).

The left join should be your default join: use it unless you have a strong reason to prefer one of the others.

You an do the same things with base R’s merge() function, but it is slower and somewhat harder to express your intent with it. (I’ve always really disliked the merge() function…)

6.3.1 A few thoughts on keys

It is always a worthwhile exercise to go through and figure out what the primary key is in a tibble you are working with. It might be that the primary key is a compound key: it defines unique observations by a combination of several variables. Sometimes there is no explicit primary key! It is worthwhile to add a surrogate key in that case.

On the flip side of these issues: when you are compiling your own data set, you might want to spend some time making sure that units that might be relevant to an analysis are explicitly identified in a single column. Here is an example: the NOAA Observer program takes tissue samples from bycatch for genetic analysis. There is a primary key tissue_sample for every tissue sample. However, under some circumstances they take multiple tissue samples from the same individual. But they don’t have a column in the data set with an indvividual ID. So, when the send their samples to people who will genotype them, a lot of individuals are unwittingly genotyped twice. Their response: “Well, isn’t it obvious that if a tissue sample is taken from a fish that was caught on the same vessel on the same day and in the same haul, and is of the same species and has the same recorded length is the same individual?” My response: “NO! It isn’t!. Don’t make people use a whole lot of columns to identify things that should be identified in a single column!”

6.4 An example of using some joins

Let’s walk through a simple case that should be familiar to those in the group who have worked with genetic data and have had to deal with the problem of attaching meta data to genetic data coming off a genotyping instrument.

Typically those data come out in a form that can be made into a tibble. Let’s read in a toy example:

library(tidyverse)
genos <- read_csv("inputs/toy_geno.csv")
genos
## # A tibble: 8 x 5
##     bird locus1_a locus1_b locus2_a locus2_b
##    <chr>    <int>    <int>    <int>    <int>
## 1 wiwa01        1        2        3        3
## 2 wiwa02        2        2        4        4
## 3 wiwa03        2        2        3        4
## 4 wiwa04        1        1        4        4
## 5 wiwa05        2        2        3        4
## 6 wiwa06        1        1        4        4
## 7 wiwa07        1        2        3        4
## 8 wiwa08        1        1        3        4

There is a single column (bird in this case) that is the primary key that uniquely identifies individuals. Then each locus gets two columns of data (one for each gene copy in a diploid).

That is all well and good. But now, consider this problem: for a particular analysis we are going to do, we need to have the latitude and longitude coordinates where each bird was sampled. Let’s say that we got these samples from friendly collectors who provided a meta data file that gave us the collection location and the name of the collector. Let’s look at that:

meta <- read_csv("inputs/toy_meta.csv")
meta
## # A tibble: 10 x 3
##    field_id location collector
##       <chr>    <chr>     <chr>
##  1   wiwa01     swAZ       joe
##  2   wiwa02     nwAZ      mary
##  3   wiwa03     soCA       ted
##  4   wiwa05     soCA       ted
##  5   wiwa06     swAZ       joe
##  6   wiwa08     noCA      erin
##  7   wiwa09     noCA      erin
##  8   wiwa10     noCA      erin
##  9   wiwa11     noCA      erin
## 10   wiwa12     noCA      erin

Notice: there are some birds in this data set that we don’t have in the genos tibble. Not only that, but if you look closely, it is missing some birds in genos: they are “wiwa04” and “wiwa07”. Also, note that the column that holds the ID of each bird is called field_id not bird.

Finally, the network of bird sample collectors maintains a data base of all their location codes that looks like this:

locations <- read_csv("inputs/toy_locations.csv")
locations
## # A tibble: 6 x 3
##   location   lat   long
##      <chr> <dbl>  <dbl>
## 1     swAZ  32.1 -112.8
## 2     nwAZ  36.8 -113.5
## 3     soCA  33.2 -117.1
## 4     noCA  40.6 -123.9
## 5     soOR  42.9 -123.7
## 6     noOR  45.9 -123.1

Aha! So, what we need to do is associate with each bird a location, and then once we have done with that, we need to associate a lat and a long with those locations. This is the perfect job for a join (two of them, actually).

Note that we are focused on our birds, here, so we want to keep them all around and not add any information where we don’t have a bird. Hence left_join() is our go-to friend there (as it almost always will be).

Here is what the first step looks like:

genos %>%
  left_join(., meta, by = c("bird" = "field_id"))
## # A tibble: 8 x 7
##     bird locus1_a locus1_b locus2_a locus2_b location collector
##    <chr>    <int>    <int>    <int>    <int>    <chr>     <chr>
## 1 wiwa01        1        2        3        3     swAZ       joe
## 2 wiwa02        2        2        4        4     nwAZ      mary
## 3 wiwa03        2        2        3        4     soCA       ted
## 4 wiwa04        1        1        4        4     <NA>      <NA>
## 5 wiwa05        2        2        3        4     soCA       ted
## 6 wiwa06        1        1        4        4     swAZ       joe
## 7 wiwa07        1        2        3        4     <NA>      <NA>
## 8 wiwa08        1        1        3        4     noCA      erin

Notice that we have some NAs for birds that are not in the meta data. That is the behavior we expect from left_join(): it is not going to discard some of your birds, just because they don’t appear in the meta data.

Note also that when we explicitly give the names of the keys (which differ in the different tibbles) the one on the left corresponds to the x argument to left_join(). Also, notice that these key names must be quoted!! (It is easy to forget that, because you so seldom need quotation marks around things in the tidyverse.)

Now, we can add the lat-longs on there. We will show how that is done by chaining onto the previous command:

genos %>%
  left_join(., meta, by = c("bird" = "field_id")) %>%
  left_join(., locations)
## Joining, by = "location"
## # A tibble: 8 x 9
##     bird locus1_a locus1_b locus2_a locus2_b location collector   lat
##    <chr>    <int>    <int>    <int>    <int>    <chr>     <chr> <dbl>
## 1 wiwa01        1        2        3        3     swAZ       joe  32.1
## 2 wiwa02        2        2        4        4     nwAZ      mary  36.8
## 3 wiwa03        2        2        3        4     soCA       ted  33.2
## 4 wiwa04        1        1        4        4     <NA>      <NA>    NA
## 5 wiwa05        2        2        3        4     soCA       ted  33.2
## 6 wiwa06        1        1        4        4     swAZ       joe  32.1
## 7 wiwa07        1        2        3        4     <NA>      <NA>    NA
## 8 wiwa08        1        1        3        4     noCA      erin  40.6
## # ... with 1 more variables: long <dbl>

Voila! That is what we wanted. Now, we could filter out those NAs and drop the collector column, if desired.

Notice that, since the location column was named location in tibble, left_join() just used that.

6.4.1 When would I use right_join()?

The only time I use this is when I want to add columns to the beginning of a tibble, but I want to preserve all the keys in the table that is going to end up with its columns on the right hand side of the table. And even then I usually just use select after a left_join(). Perhaps an example will be best: for some purposes it is best to keep the genotype data all together on the right hand side of the tibble (often genotype data can take up lots of columns and you might want to be able to see the columns you have joined on without using View() and scrolling way over). In this case you can do this:

genos %>%
  right_join(meta, ., by = c("field_id" = "bird")) %>%
  right_join(locations, .) 
## Joining, by = "location"
## # A tibble: 8 x 9
##   location   lat   long field_id collector locus1_a locus1_b locus2_a
##      <chr> <dbl>  <dbl>    <chr>     <chr>    <int>    <int>    <int>
## 1     swAZ  32.1 -112.8   wiwa01       joe        1        2        3
## 2     nwAZ  36.8 -113.5   wiwa02      mary        2        2        4
## 3     soCA  33.2 -117.1   wiwa03       ted        2        2        3
## 4     <NA>    NA     NA   wiwa04      <NA>        1        1        4
## 5     soCA  33.2 -117.1   wiwa05       ted        2        2        3
## 6     swAZ  32.1 -112.8   wiwa06       joe        1        1        4
## 7     <NA>    NA     NA   wiwa07      <NA>        1        2        3
## 8     noCA  40.6 -123.9   wiwa08      erin        1        1        3
## # ... with 1 more variables: locus2_b <int>

Notice that when you do this, the column name of the x variable, field_id is the one that gets retained.

6.4.2 What about that inner_join()

If you knew ahead of time that you couldn’t use any birds that you didn’t have lat-longs for, you could start with an inner_join(), because that would discard birds that don’t have an entry in the meta data:

genos %>%
  inner_join(., meta, by = c("bird" = "field_id"))
## # A tibble: 6 x 7
##     bird locus1_a locus1_b locus2_a locus2_b location collector
##    <chr>    <int>    <int>    <int>    <int>    <chr>     <chr>
## 1 wiwa01        1        2        3        3     swAZ       joe
## 2 wiwa02        2        2        4        4     nwAZ      mary
## 3 wiwa03        2        2        3        4     soCA       ted
## 4 wiwa05        2        2        3        4     soCA       ted
## 5 wiwa06        1        1        4        4     swAZ       joe
## 6 wiwa08        1        1        3        4     noCA      erin

But, it will probably be easier to follow if you explicitly discard those birds using filter() or by doing a filtering join.

6.4.3 An anti_join example

If you want to get all the genotype data for birds that don’t occur in the meta data you can use the filtering anti_join():

genos %>%
  anti_join(., meta, by = c("bird" = "field_id"))
## # A tibble: 2 x 5
##     bird locus1_a locus1_b locus2_a locus2_b
##    <chr>    <int>    <int>    <int>    <int>
## 1 wiwa07        1        2        3        4
## 2 wiwa04        1        1        4        4

And to return only those rows for birds that are in the meta data, you could use semi_join():

genos %>%
  semi_join(., meta, by = c("bird" = "field_id"))
## # A tibble: 6 x 5
##     bird locus1_a locus1_b locus2_a locus2_b
##    <chr>    <int>    <int>    <int>    <int>
## 1 wiwa01        1        2        3        3
## 2 wiwa02        2        2        4        4
## 3 wiwa03        2        2        3        4
## 4 wiwa05        2        2        3        4
## 5 wiwa06        1        1        4        4
## 6 wiwa08        1        1        3        4

Note that we could turn it around in order to see which birds are in the meta data, but which don’t occur in the genos:

genos %>%
  anti_join(meta, ., by = c("field_id" = "bird"))
## # A tibble: 4 x 3
##   field_id location collector
##      <chr>    <chr>     <chr>
## 1   wiwa12     noCA      erin
## 2   wiwa11     noCA      erin
## 3   wiwa10     noCA      erin
## 4   wiwa09     noCA      erin

Dammit Erin! You always forget to send us the friggin’ samples! Clearly smokin’ too much of the kind green there in noCA. (Note. These names really are totally fictitious.)

Quick quiz: why would this not work:

genos %>%
   anti_join(meta, ., by = c("bird" = "field_id"))

6.4.4 Just for fun, let’s see a full_join()

This bad boy makes a row with NAs for values in either tibble that are not matched in the other:

genos %>%
  full_join(., meta, by = c("bird" = "field_id")) %>%
  full_join(., locations) %>%
  print(n = 20) # so all rows print
## Joining, by = "location"
## # A tibble: 14 x 9
##      bird locus1_a locus1_b locus2_a locus2_b location collector   lat
##     <chr>    <int>    <int>    <int>    <int>    <chr>     <chr> <dbl>
##  1 wiwa01        1        2        3        3     swAZ       joe  32.1
##  2 wiwa02        2        2        4        4     nwAZ      mary  36.8
##  3 wiwa03        2        2        3        4     soCA       ted  33.2
##  4 wiwa04        1        1        4        4     <NA>      <NA>    NA
##  5 wiwa05        2        2        3        4     soCA       ted  33.2
##  6 wiwa06        1        1        4        4     swAZ       joe  32.1
##  7 wiwa07        1        2        3        4     <NA>      <NA>    NA
##  8 wiwa08        1        1        3        4     noCA      erin  40.6
##  9 wiwa09       NA       NA       NA       NA     noCA      erin  40.6
## 10 wiwa10       NA       NA       NA       NA     noCA      erin  40.6
## 11 wiwa11       NA       NA       NA       NA     noCA      erin  40.6
## 12 wiwa12       NA       NA       NA       NA     noCA      erin  40.6
## 13   <NA>       NA       NA       NA       NA     soOR      <NA>  42.9
## 14   <NA>       NA       NA       NA       NA     noOR      <NA>  45.9
## # ... with 1 more variables: long <dbl>

This is not typically what we want! Sometimes it is…but usually you will be using left_join().

6.5 Working with R Notebooks

R Notebooks are totally awesome! They combine the nice features of working at the R console (namely having access to variables that remain in your .GlobalEnvironment), with the beauty of being able to document things in an easy to read and digest RMarkdown format.

Here, you can download an example of a short Notebook from one of the projects that Kristen and Eric are working on: choosing-snps

R notebooks are RMarkdown documents using the html_document option. They are great for doing and explaining analyses. This is where I end up doing most of my analyses these days. Typically I put them in a directory called R-main in my project. This is a more appropriate place to put long analyses than in README.Rmd. The README.Rmd file should be reserved for describing your data and giving people instructions on how to conduct the analysis, e.g. “Open ./R-main/01-clean-data.Rmd and run it all. Then run all the code in 02-compute-statistics.Rmd, etc.”

6.5.1 Open your own R Notebook

This is easy in RStudio: File -> New File -> R Notebook. This gives you a simple template that lets you see how things work and into which you can insert your own thoughts and writings.

For quick help with formatting: Help -> Markdown Quick Reference

For more Markdown info: Help -> Cheatsheets -> R Markdown Cheat Sheet

6.5.2 Working with R Notebooks

Big difference from “regular” R Markdown documents: there is no “Knit” button.

Instead, to get results from the code, you must evaluate it, then “Preview”.

Ways of evaluating code blocks:

  1. right-facing triangle – evaluate current block
  2. down-facing gray triangle – evaluate all the blocks above this one.

Or you can use keyboard shortcuts, or use the “Run” button in the upper right of the document.

Results from code blocks get presented in the Notebook.

To get something like “Knit”: do this:

Run -> Restart R and Run All Chunks

6.5.3 Some caveats about notebooks

  1. You should restart and Run All occasionally to make sure it is reproducible.
  2. When evaluating code within an R notebook, by default the working directory for R is set to the directory that the R notebook live in, not the root directory of the project. So, it can give different results (for example when doing file acess) than the R Console. This can be hugely frustrating.
  3. your variables all live in the GlobalEnvironment, so they are at risk of getting overwritten if you use the same variable name in another Notebook that you are working on at the moment. For this reason, to check reproducibility, occasionaly check that Run -> Restart R and Run All Chunks works for you.

The .nb.html files don’t play very well with GitHub. If you want to share them, they are great for emailing to people (but tell them to download it and view it as a file—the gmail viewer does a crappy job of rendering it.)

6.5.4 Opening a .nb.html file in Rstudio

Doing this “reconstitutes” the .Rmd file that made it (along with the results that are saved in it). Which is sort of cool. However, it does not reconstitute all the data, etc. that went into it. So, y’all wouldn’t be able to run 02-choosing-96-SNPs.Rmd.

6.6 For Next Week:

Read Chapter 3: Data Visualization