Chapter 6 Week Five Meeting
6.1 Some things regarding people’s repositories
6.1.1 Data Compression
We have been endorsing
.csv as a good format for data, and it is, because it is human-readable and easily parsed into tibbles. However, when you have very long tibbles, it is not necessarily the most space-efficient format. Large
.csv files can take up much more space on your hard drive than they should.
What do we mean by “than they should?” in that context? This has to do with how much information is in the file, where information is used in the context of information theory. Often tibbles will have columns that have fairly “redundant” information—for example, in a mutlispecies salmon data set, one column might have entries that are either “Chinook”, “coho”, or “steelhead”. It takes a few bytes to store each one of those words, and if they are used in a column that has millions of rows, that can add up to a lot of space on your hard drive. Colloquially, data compression is the art of finding ways of using short “code-names” for things or patterns that occur frequently in a file, and in so doing, reducing the overall file size.
The consequences of data compression can be profound and wonderful. You can reduce the size of a file, sometimes by an order or magnitude or more. There are a few good choices available for compressing your data (making it smaller.). Note that doing so often makes it a little harder to edit your data set; however, if your data set is not going to change, and it is large, then it makes sense to compress it—especially if it is so big you would rather not (or can’t) put it on GitHub.
If you are working on a Mac, you have the Unix utility
gzip. We will illustrate its use on Katie’s big salmon data set,
ASL_merged.csv. Let’s see how big that is. We can use the Unix utility
du (stands for “disk usage”).
# give this command on the Terminal in the directory where the file lives: du -h ASL_merged.csv
The results comes back:
Whoa! This file is 332 Megabytes. That is quite large!
However, we can compress it like this:
When we do that, it compresses the file and renames it to have a
.gz extension on it:
ASL_merged.csv.gz. We can then see how big that file is:
du -h ASL_merged.csv.gz
Whoa! We went from 332 Megabytes to 11. It is just 3% of its original size (and small enough that you can safely put it on GitHub).
One very nice feature is that gzipped files can be read in directly by the functions of the
readr package. So, for example,
read_csv() works just fine on the gzipped version of Katie’s massive salmon data set:
# this works the same as it would on the ungzipped file salmon <- read_csv("data/ASL_merged.csv.gz")
xz comression with
Another method that can be even more efficient with tibbles is to store them as R objects using the
saveRDS() function with the
xz compression option. This has the nice advantage that all the data types of the variables (for example, if you had made factors out of some) will be preserved exactly as they are in the tibble when you save it.
Let’s imagine we have read the tibble into the variable
salmon, and all the column types were as we wanted. Then, we could save that tibble directly to a compressed file like this:
saveRDS(salmon, file = "ASL_xz.rds", compress = "xz")
compress = "xz" option. Let’s see how that did using
du on the Unix terminal:
du -h ASL_xz.rds
Holy Smokes! Only 3.7 Megabytes. That is only 1.1% of its original size. Lovely!
In order to read that tibble back into a variable (named
my_var, say) in R, you would use
readRDS() like this:
my_var <- readRDS(file = "ASL_xz.rds")
6.2 A quick aside about missing data
Garrett and Hadley note that “missing values are ‘contagious’: almost any operation involving an unknown value will also be unknown.” This is true for the most part, but there is a vexing inconsistency. Observe
This gives us NA as we would hope it would
NA == 0 | NA == 1
##  NA
However, this one returns FALSE. What gives?
NA %in% c(0, 1)
##  FALSE
6.3 Brief Highlights of the Joins Chapter
You will likely end up using joins all the time. As noted in the book, the
left_join() is what you will likely use all the time. In this case you have a “focal” data frame with all the rows (“cases” or “observations”) in the
x table that you are going to be wanting to add some columns to. Those columns live in the
y table (along with the matching keys).
“The left join should be your default join: use it unless you have a strong reason to prefer one of the others.”
You an do the same things with base R’s
merge() function, but it is slower and somewhat harder to express your intent with it. (I’ve always really disliked the
6.3.1 A few thoughts on keys
It is always a worthwhile exercise to go through and figure out what the primary key is in a tibble you are working with. It might be that the primary key is a compound key: it defines unique observations by a combination of several variables. Sometimes there is no explicit primary key! It is worthwhile to add a surrogate key in that case.
On the flip side of these issues: when you are compiling your own data set, you might want to spend some time making sure that units that might be relevant to an analysis are explicitly identified in a single column. Here is an example: the NOAA Observer program takes tissue samples from bycatch for genetic analysis. There is a primary key
tissue_sample for every tissue sample. However, under some circumstances they take multiple tissue samples from the same individual. But they don’t have a column in the data set with an indvividual ID. So, when the send their samples to people who will genotype them, a lot of individuals are unwittingly genotyped twice. Their response: “Well, isn’t it obvious that if a tissue sample is taken from a fish that was caught on the same
vessel on the same
day and in the same
haul, and is of the same
species and has the same recorded
length is the same individual?” My response: “NO! It isn’t!. Don’t make people use a whole lot of columns to identify things that should be identified in a single column!”
6.4 An example of using some joins
Let’s walk through a simple case that should be familiar to those in the group who have worked with genetic data and have had to deal with the problem of attaching meta data to genetic data coming off a genotyping instrument.
Typically those data come out in a form that can be made into a tibble. Let’s read in a toy example:
library(tidyverse) genos <- read_csv("inputs/toy_geno.csv") genos
## # A tibble: 8 x 5 ## bird locus1_a locus1_b locus2_a locus2_b ## <chr> <int> <int> <int> <int> ## 1 wiwa01 1 2 3 3 ## 2 wiwa02 2 2 4 4 ## 3 wiwa03 2 2 3 4 ## 4 wiwa04 1 1 4 4 ## 5 wiwa05 2 2 3 4 ## 6 wiwa06 1 1 4 4 ## 7 wiwa07 1 2 3 4 ## 8 wiwa08 1 1 3 4
There is a single column (
bird in this case) that is the primary key that uniquely identifies individuals. Then each locus gets two columns of data (one for each gene copy in a diploid).
That is all well and good. But now, consider this problem: for a particular analysis we are going to do, we need to have the latitude and longitude coordinates where each bird was sampled. Let’s say that we got these samples from friendly collectors who provided a meta data file that gave us the collection location and the name of the collector. Let’s look at that:
meta <- read_csv("inputs/toy_meta.csv") meta
## # A tibble: 10 x 3 ## field_id location collector ## <chr> <chr> <chr> ## 1 wiwa01 swAZ joe ## 2 wiwa02 nwAZ mary ## 3 wiwa03 soCA ted ## 4 wiwa05 soCA ted ## 5 wiwa06 swAZ joe ## 6 wiwa08 noCA erin ## 7 wiwa09 noCA erin ## 8 wiwa10 noCA erin ## 9 wiwa11 noCA erin ## 10 wiwa12 noCA erin
Notice: there are some birds in this data set that we don’t have in the
genos tibble. Not only that, but if you look closely, it is missing some birds in
genos: they are “wiwa04” and “wiwa07”. Also, note that the column that holds the ID of each bird is called
Finally, the network of bird sample collectors maintains a data base of all their location codes that looks like this:
locations <- read_csv("inputs/toy_locations.csv") locations
## # A tibble: 6 x 3 ## location lat long ## <chr> <dbl> <dbl> ## 1 swAZ 32.1 -112.8 ## 2 nwAZ 36.8 -113.5 ## 3 soCA 33.2 -117.1 ## 4 noCA 40.6 -123.9 ## 5 soOR 42.9 -123.7 ## 6 noOR 45.9 -123.1
Aha! So, what we need to do is associate with each
location, and then once we have done with that, we need to associate a
lat and a
long with those locations. This is the perfect job for a join (two of them, actually).
Note that we are focused on our birds, here, so we want to keep them all around and not add any information where we don’t have a bird. Hence
left_join() is our go-to friend there (as it almost always will be).
Here is what the first step looks like:
genos %>% left_join(., meta, by = c("bird" = "field_id"))
## # A tibble: 8 x 7 ## bird locus1_a locus1_b locus2_a locus2_b location collector ## <chr> <int> <int> <int> <int> <chr> <chr> ## 1 wiwa01 1 2 3 3 swAZ joe ## 2 wiwa02 2 2 4 4 nwAZ mary ## 3 wiwa03 2 2 3 4 soCA ted ## 4 wiwa04 1 1 4 4 <NA> <NA> ## 5 wiwa05 2 2 3 4 soCA ted ## 6 wiwa06 1 1 4 4 swAZ joe ## 7 wiwa07 1 2 3 4 <NA> <NA> ## 8 wiwa08 1 1 3 4 noCA erin
Notice that we have some NAs for birds that are not in the meta data. That is the behavior we expect from
left_join(): it is not going to discard some of your birds, just because they don’t appear in the meta data.
Note also that when we explicitly give the names of the keys (which differ in the different tibbles) the one on the left corresponds to the
x argument to
left_join(). Also, notice that these key names must be quoted!! (It is easy to forget that, because you so seldom need quotation marks around things in the tidyverse.)
Now, we can add the lat-longs on there. We will show how that is done by chaining onto the previous command:
genos %>% left_join(., meta, by = c("bird" = "field_id")) %>% left_join(., locations)
## Joining, by = "location"
## # A tibble: 8 x 9 ## bird locus1_a locus1_b locus2_a locus2_b location collector lat ## <chr> <int> <int> <int> <int> <chr> <chr> <dbl> ## 1 wiwa01 1 2 3 3 swAZ joe 32.1 ## 2 wiwa02 2 2 4 4 nwAZ mary 36.8 ## 3 wiwa03 2 2 3 4 soCA ted 33.2 ## 4 wiwa04 1 1 4 4 <NA> <NA> NA ## 5 wiwa05 2 2 3 4 soCA ted 33.2 ## 6 wiwa06 1 1 4 4 swAZ joe 32.1 ## 7 wiwa07 1 2 3 4 <NA> <NA> NA ## 8 wiwa08 1 1 3 4 noCA erin 40.6 ## # ... with 1 more variables: long <dbl>
Voila! That is what we wanted. Now, we could filter out those NAs and drop the
collector column, if desired.
Notice that, since the
location column was named
location in tibble,
left_join() just used that.
6.4.1 When would I use
The only time I use this is when I want to add columns to the beginning of a tibble, but I want to preserve all the keys in the table that is going to end up with its columns on the right hand side of the table. And even then I usually just use
select after a
left_join(). Perhaps an example will be best: for some purposes it is best to keep the genotype data all together on the right hand side of the tibble (often genotype data can take up lots of columns and you might want to be able to see the columns you have joined on without using View() and scrolling way over). In this case you can do this:
genos %>% right_join(meta, ., by = c("field_id" = "bird")) %>% right_join(locations, .)
## Joining, by = "location"
## # A tibble: 8 x 9 ## location lat long field_id collector locus1_a locus1_b locus2_a ## <chr> <dbl> <dbl> <chr> <chr> <int> <int> <int> ## 1 swAZ 32.1 -112.8 wiwa01 joe 1 2 3 ## 2 nwAZ 36.8 -113.5 wiwa02 mary 2 2 4 ## 3 soCA 33.2 -117.1 wiwa03 ted 2 2 3 ## 4 <NA> NA NA wiwa04 <NA> 1 1 4 ## 5 soCA 33.2 -117.1 wiwa05 ted 2 2 3 ## 6 swAZ 32.1 -112.8 wiwa06 joe 1 1 4 ## 7 <NA> NA NA wiwa07 <NA> 1 2 3 ## 8 noCA 40.6 -123.9 wiwa08 erin 1 1 3 ## # ... with 1 more variables: locus2_b <int>
Notice that when you do this, the column name of the
field_id is the one that gets retained.
6.4.2 What about that
If you knew ahead of time that you couldn’t use any birds that you didn’t have lat-longs for, you could start with an
inner_join(), because that would discard birds that don’t have an entry in the meta data:
genos %>% inner_join(., meta, by = c("bird" = "field_id"))
## # A tibble: 6 x 7 ## bird locus1_a locus1_b locus2_a locus2_b location collector ## <chr> <int> <int> <int> <int> <chr> <chr> ## 1 wiwa01 1 2 3 3 swAZ joe ## 2 wiwa02 2 2 4 4 nwAZ mary ## 3 wiwa03 2 2 3 4 soCA ted ## 4 wiwa05 2 2 3 4 soCA ted ## 5 wiwa06 1 1 4 4 swAZ joe ## 6 wiwa08 1 1 3 4 noCA erin
But, it will probably be easier to follow if you explicitly discard those birds using
filter() or by doing a filtering join.
6.4.3 An anti_join example
If you want to get all the genotype data for birds that don’t occur in the meta data you can use the filtering
genos %>% anti_join(., meta, by = c("bird" = "field_id"))
## # A tibble: 2 x 5 ## bird locus1_a locus1_b locus2_a locus2_b ## <chr> <int> <int> <int> <int> ## 1 wiwa07 1 2 3 4 ## 2 wiwa04 1 1 4 4
And to return only those rows for birds that are in the meta data, you could use
genos %>% semi_join(., meta, by = c("bird" = "field_id"))
## # A tibble: 6 x 5 ## bird locus1_a locus1_b locus2_a locus2_b ## <chr> <int> <int> <int> <int> ## 1 wiwa01 1 2 3 3 ## 2 wiwa02 2 2 4 4 ## 3 wiwa03 2 2 3 4 ## 4 wiwa05 2 2 3 4 ## 5 wiwa06 1 1 4 4 ## 6 wiwa08 1 1 3 4
Note that we could turn it around in order to see which birds are in the meta data, but which don’t occur in the genos:
genos %>% anti_join(meta, ., by = c("field_id" = "bird"))
## # A tibble: 4 x 3 ## field_id location collector ## <chr> <chr> <chr> ## 1 wiwa12 noCA erin ## 2 wiwa11 noCA erin ## 3 wiwa10 noCA erin ## 4 wiwa09 noCA erin
Dammit Erin! You always forget to send us the friggin’ samples! Clearly smokin’ too much of the kind green there in noCA. (Note. These names really are totally fictitious.)
Quick quiz: why would this not work:
genos %>% anti_join(meta, ., by = c("bird" = "field_id"))
6.4.4 Just for fun, let’s see a full_join()
This bad boy makes a row with NAs for values in either tibble that are not matched in the other:
genos %>% full_join(., meta, by = c("bird" = "field_id")) %>% full_join(., locations) %>% print(n = 20) # so all rows print
## Joining, by = "location"
## # A tibble: 14 x 9 ## bird locus1_a locus1_b locus2_a locus2_b location collector lat ## <chr> <int> <int> <int> <int> <chr> <chr> <dbl> ## 1 wiwa01 1 2 3 3 swAZ joe 32.1 ## 2 wiwa02 2 2 4 4 nwAZ mary 36.8 ## 3 wiwa03 2 2 3 4 soCA ted 33.2 ## 4 wiwa04 1 1 4 4 <NA> <NA> NA ## 5 wiwa05 2 2 3 4 soCA ted 33.2 ## 6 wiwa06 1 1 4 4 swAZ joe 32.1 ## 7 wiwa07 1 2 3 4 <NA> <NA> NA ## 8 wiwa08 1 1 3 4 noCA erin 40.6 ## 9 wiwa09 NA NA NA NA noCA erin 40.6 ## 10 wiwa10 NA NA NA NA noCA erin 40.6 ## 11 wiwa11 NA NA NA NA noCA erin 40.6 ## 12 wiwa12 NA NA NA NA noCA erin 40.6 ## 13 <NA> NA NA NA NA soOR <NA> 42.9 ## 14 <NA> NA NA NA NA noOR <NA> 45.9 ## # ... with 1 more variables: long <dbl>
This is not typically what we want! Sometimes it is…but usually you will be using
6.5 Working with R Notebooks
R Notebooks are totally awesome! They combine the nice features of working at the R console (namely having access to variables that remain in your .GlobalEnvironment), with the beauty of being able to document things in an easy to read and digest RMarkdown format.
Here, you can download an example of a short Notebook from one of the projects that Kristen and Eric are working on: choosing-snps
R notebooks are RMarkdown documents using the
html_document option. They are great for doing and explaining analyses. This is where I end up doing most of my analyses these days. Typically I put them in a directory called
R-main in my project. This is a more appropriate place to put long analyses than in README.Rmd. The README.Rmd file should be reserved for describing your data and giving people instructions on how to conduct the analysis, e.g. “Open ./R-main/01-clean-data.Rmd and run it all. Then run all the code in 02-compute-statistics.Rmd, etc.”
6.5.1 Open your own R Notebook
This is easy in RStudio: File -> New File -> R Notebook. This gives you a simple template that lets you see how things work and into which you can insert your own thoughts and writings.
For quick help with formatting: Help -> Markdown Quick Reference
For more Markdown info: Help -> Cheatsheets -> R Markdown Cheat Sheet
6.5.2 Working with R Notebooks
Big difference from “regular” R Markdown documents: there is no “Knit” button.
Instead, to get results from the code, you must evaluate it, then “Preview”.
Ways of evaluating code blocks:
- right-facing triangle – evaluate current block
- down-facing gray triangle – evaluate all the blocks above this one.
Or you can use keyboard shortcuts, or use the “Run” button in the upper right of the document.
Results from code blocks get presented in the Notebook.
To get something like “Knit”: do this:
Run -> Restart R and Run All Chunks
6.5.3 Some caveats about notebooks
- You should restart and Run All occasionally to make sure it is reproducible.
- When evaluating code within an R notebook, by default the working directory for R is set to the directory that the R notebook live in, not the root directory of the project. So, it can give different results (for example when doing file acess) than the R Console. This can be hugely frustrating.
- your variables all live in the GlobalEnvironment, so they are at risk of getting overwritten if you use the same variable name in another Notebook that you are working on at the moment. For this reason, to check reproducibility, occasionaly check that Run -> Restart R and Run All Chunks works for you.
The .nb.html files don’t play very well with GitHub. If you want to share them, they are great for emailing to people (but tell them to download it and view it as a file—the gmail viewer does a crappy job of rendering it.)
6.5.4 Opening a .nb.html file in Rstudio
Doing this “reconstitutes” the .Rmd file that made it (along with the results that are saved in it). Which is sort of cool. However, it does not reconstitute all the data, etc. that went into it. So, y’all wouldn’t be able to run 02-choosing-96-SNPs.Rmd.
6.6 For Next Week:
Read Chapter 3: Data Visualization