Case Studies in Reproducible Research: a spring seminar at UCSC
Eric C. Anderson, Kristen C. Ruegg, Tina Cheng, and the students of EEB 295
Chapter 1 Course Overview
This is the home of the notes for a proposed course in data analysis and reproducible research using R, Rstudio, and GitHub.
The seminar is called, “Case Studies in Reproducible Research,” but we utter that title with the caveat that, although the organizers have quite a few case studies they could spin up for this course, the case studies we will be studying in this course are going to be actual research projects that you—the participants—are working on. You’re gonna bring ’em, and we are going to collectively help you wrassle them into a reasonable and reproducible data analysis. In the process we will touch on a number of elements of data analysis with R.
We will be working through a healthy chunk of the material in Garrett Grolemund and Hadley Wickham’s book, R for Data Science, which is readable for free at the link above. We intend to use a handful of our own data sets each week to illustrate points from the book and show the material in action on real data sets.
This is not intended as a “first course in R”. Students coming to the course should have at least a modicum of familiarity with using R, and we will launch more directly into using the tools of the tidyverse. EEB students with little or no experience in R might be interested in sitting in with Giacomo Bernardi’s lab group on Mondays at 3PM in the COH library. They are conducting a Bio 295 seminar, working through “a super basic book that takes the very first steps into R.”
1.1 Meeting Times, Location, Requirements
Intended to be Friday afternoons, 1:45–3:15 PM in the library/conference room at Long Marine Lab.
Students must bring a laptop to do examples during the seminar, and all students are expected to have a data set that they are in the midst of analyzing (or upon which they hope to commence analysis soon!) for a research project. We will
1.2 The origin of this seminar
The idea for this course was floated by Tina Cheng who was planning to lead a seminar in spring 2017 based in part on Eric C. Anderson’s “Reproducible Research Course”, taught at the Southwest Fisheries Science Center in the fall of 2014. Although going over those notes might have been a reasonable exercise, it turns out that a lot has changed in the world of data analysis since fall 2014, and the notes from that course are, today, a little bit dated.
We have been particularly excited by the ascendancy of Hadley Wickham’s tidyverse approach to data analysis, and the tremendous development of a variety of tools developed by RStudio for integrating report generation and data analysis into reproducible workflows. In fact, Eric has been saying for the last year that if he were to teach another course on data analysis it would be structured completely differently than his “Reproducible Research Course”. So, it was clearly time for him to stop talking and help put together an updated and different course.
At the same time, in working on our own projects and in helping others, we have consistently found that the most effective way for anyone to learn data analysis is to ensure that it is immediately relevant to whatever ongoing research project is currently consuming them. Therefore, in the current seminar, we are hoping to spend at least half of our time “workshopping” the data sets that seminar participants are actually involved in analyzing. Together we will help students wrestle their data, analyses, and ideas into a single, well-organized RStudio project under version control with git. Therefore, every student should come to this course with a data set and an associated analysis project.
1.3 Course Organizers
- Kristen C. Ruegg
Kristen is a conservation geneticist who specializes in the application of genome-wide data to understand population level processes and inform management, with a particular focus on migratory birds. She has has been enlightened to the powers of the “tidyverse” over the last couple of years (mostly through the constant insistence of her enthusiastic husband Eric Anderson) and is looking forward to becoming more fluid in its application over the course of the quarter. Her main role in this course will be to help with the course design and logistics and help reign Eric in when he has started to orbit into some obscure realm of statistical nuance.
- Eric C. Anderson
Eric trained as a statistician who specializes in genetic data. Since 2003 he has worked at the NMFS Southwest Fisheries Science Center in Santa Cruz. Although much of his statistical research involves the development of computationally intensive methods for specialized analyses of genetic data, he has been involved in a variety of data analysis projects at NMFS and with collaborators worldwide. Eric was an early adherent to reproducible research principles and continues, as such, performing most of his research and data analysis in the open and publicly available on GitHub (find his GitHub page here). In 2014, he taught the “Reproducible Research Course” at NMFS, and is excited to provide an updated version, focusing more, this time, on the recently developed “tidyverse”.
- Tina Cheng
Tina is a graduate student in EEB. She is going to be leading the session during the first week of the course when Kristen and Eric are still on spring break, and then she is going to be joining in on the fun with us for the remainder of the quarter until she has to travel off to Baja, TA-ing the “supercourse” during the last four weeks of the quarter.
1.4 Course Goals
The goal of this course is for scientists, researchers, and students to learn to:
- properly store, manage, and distribute their data in a tidy format
- consolidate their digital research materials and analyses into well-organized RStudio projects.
- use the tools of the tidyverse to manipulate and analyze those data sets
- integrate data analysis with report generation and article preparation using the Rmarkdown format and using R Notebooks
- use git version control software and GitHub to effectively manage data and source code, collaborate efficiently with other researchers, and neatly package their research.
By the end of the course, the hope is that we will all have mastered strategies allowing us to use the above-listed, freely-available and open-source tools for conducting research in a reproducible fashion. The ideal we will be striving for is to be able to start from a raw data set and then write a computer program that conducts all the cleaning, manipulation, and analysis of the data, and presentation of the results, in an automated fashion. Carrying out analysis and report-generation in this way carries a number of advantages to the researcher:
- Newly-collected data can be integrated easily into your analysis.
- If a mistake is found in one section of your analysis, it is not terribly onerous to correct it and then re-run all the downstream analyses.
- Revising a manuscript to address referee comments can be done quickly.
- Years after publication, the exact steps taken to analyze the data will still be available should anyone ask you how, exactly, you did an analysis!
- If you have to conduct similar analyses and produce similar reports on a regular bias with new data each time, you might be able to do this readily by merely updating your data and then automatically producing the entire report.
- If someone finds an error in your work, they can fix it and then easily show you exactly what they did to fix it.
Additionally, packaging one’s research in a reproducible fashion is beneficial to the research community. Others that would like to confirm your results can do so easily. If someone has concerns about exactly how a particular analysis was carried out, they can find the precise details in the code that you wrote to do it. Someone wanting to apply your methods to their own data can easily do so, and, finally, if we are all transparent and open about the methods that we use, then everyone can learn more quickly from their colleagues.
In many fields today, publication of research requires the submission of the original data to a publicly-available data repository. Currently, several journals require that all analyses be packaged in a clear and transparent fashion for easy reproduction of the results, and I predict that trend will continue until most, if not all, journals will require that data analyses be available in easily reproduced formats. This course will help scientists prepare themselves for this eventuality. In the process, you will probably find that conducting your research in a reproducible fashion helps you work more efficiently (and perhaps even more enjoyably!)
1.5 Weekly Syllabus
1.5.1 Week 1 — Introduction and Getting Your Workspace Set Up
- At the end of this session we want to make sure that everyone has R, RStudio, and Git installed on their systems, and that they are working as expected.
- Additionally, everyoneshould have a free account on GitHub.
- And finally we need everyone’s email address.
Some things to do:
- Get Rstudio cheat Sheets!
- Assemble data into a project
- Get private GitHub repos
Eric! You need to make an example project repo.
1.5.2 Week 2 — RStudio project organization; using git and GitHub; Quick RMarkdown
After this, students are going to have to put their own data into their own repositories and write a README.Rmd and make a README.md out of it.
1.5.3 Week 3 — Tibbles. Reading data in. Data rectangling
- Reading data into the data frames.
- read.table and read.csv
- The readr package
- Data types in the different columns and quick data sanity checks.
- A few different gotcha’s
- Saving and reading data in R formats.