Chapter 5 Week Four Meeting

This was a bit of a free-form discussion on a variety of topics.

5.1 Knit your README.Rmd files

First thing we talked about was the fact that GitHub will render a file to an html web page that is nice and easy to read. It will sort of render a README.Rmd file, but it won’t do everything to it. Namely:

  1. The YAML header block comes out as a table.
  2. It will not evaluate all the R code and deliver the results.

Rather, it is necessary to locally knit the README.Rmd file to create a, then this file must be committed to the repository and pushed. This needs to happen each time you have updated the README.Rmd file.

Note that your README.Rmd file should start with the following:

cat(readLines("inputs/readme-header.txt"), sep = '\n')
title: "NameOfPackage"
date: "`r format(Sys.time(), '%d %B, %Y')`"
    toc: true

<!-- is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE}
  collapse = TRUE,
  comment = "#>",
  fig.path = "readme-figs/"

Then start adding your text here...

5.2 Changing to factors

Mikki had a question: she wanted to have a column that contained 1’s and 2’s as factors. her data set had several entries that were “1,2”. She wanted to convert those to 2’s and then make them all factors. We discussed how this could be done with dplyr. The important message was that dplyr does not change the original input variable, but in the output, you can “mutate over the top of an existing variable” (i.e. in the output the column will have been changed, but not in the original input data frame).

5.3 The group_by() function

We spent a bit of time going over how to think about what the group_by() function does. Eric likes to think of it as breaking your original tibble up into a lot of different tibbles, according to the grouping variables, after which, each little tibble gets sent to the following verb (summarise(), mutate(), filter(), etc.)

We talked about that fact that while it is quite natural to think about using the group_by() function in conjunction with summarise(), it is also very powerful to be able to use it in conjuntion with mutate().

When you do a summarise, only the grouping variables and the newly-created summary variables get returned in the output tibble, and the rows are arranged by the grouping variables. When you do a group_by() and then mutate() all of the columns get returned and there is no automatic arranging that goes on.

5.4 How do I learn about all the vectorized functions I can use in mutate() and summarize()?

There was consensus in the class that even once we have learned the mechanics of using mutate() and summarise(), we might still be at a loss as to how to use them, or with which functions. Admittedly, there are many, many vectorized functions in R that you might apply within a mutate() or summarise() function, and learning about all of those, and having them at your fingertips when you need them is part of the never-ending journey of gaining experience with R.

However, there are a few things that can help with that journey. Here are my two favorite suggestions:

  1. Get the RStudio dplyr cheatsheet. In RStudio, Go to Help–>Cheatsheets–>Data Manipulation-with-dplyr,-tidyr. While you are at it. Check out their other cheatsheets.
  2. Review Hadley’s Advanced R recommended vocabulary.
    This is a nice, compact list of R functions that you should be familiar with, or at least aware of.

5.5 Next Week’s Assignment

For Week 5, we are going to talk about joins. This is a very important topic for combining data from different data sets. Thus, everyone should read Chapter 13: Relational Data in the R for Data Science Book. This is an amazing chapter, and will go a long way in helping people understand how to make their lives easier when it comes to combining multiple tibbles of information (for example, a tibble of metadata for each individual and a tibble of genotype information, for the same individauls, etc.). Try to work through all the examples and do the exercises.