Chapter 4 Week Three Meeting

4.1 Git Basics

Goals for Today:

  • Explain what git is (and how it is different than GitHub)
  • Introduce the sha-1 hash (for fun!)
  • Get familiar with RStudio’s very convenient interface to git
    • staging files
    • unstaging file
    • viewing differences between staged and unstaged files
    • committing files
    • viewing the commit history

4.1.1 An overview of Version Control Systems (VCS)

  • Git is a type of VCS
  • At its crudest, a VCS is a system that provides a way of saving and restoring earlier versions of a file.

4.1.1.1 A typical VCS for a non-computer programmer

  • Start writing my_manuscript.doc.
  • At some point worry that MS Word is going to eat your file, so,
    • Make a “backup” called my_manuscript_A.doc
  • Then, before overhauling the discussion, save the current file as my_manuscript_B.doc.
  • Email it to your coauthors and then have a series of files with other extensions such as the initials of their names when they edit them and send them back.
  • Etc.
  • Disadvantages:
    • Hard to find a good record of what is in each version. (Wait! I liked the introduction I wrote three weeks ago…where is that now?)
    • A terrible system if you have multiple files that are dependent on one another (for example, figures in your document, or scripts and data sets if you have a programming project.)
    • If you decide that you want to merge the changes you made to the discussion in version _C with the edits on the introduction in version _K, it is hard.

4.1.1.3 The Git model — Distributed Version Control

  • Git stores “snapshots” of your collection of files in a repository
  • For our work, the “collection of files” will be “the stuff in your RStudio project”
    • Another reason it is nice to keep everything you need for a project together in a “project directory”
    • Though git scans your project directory for new additions and changes, it will not add a new file, or add new changes, to the repository until you stage and subsequently commit the file.
  • When you clone a repository, you get the whole version history
  • When someone else clones that repository, they also get the whole version history.
  • Git has well-developed features for merging changes made in different repositories
    • But, for today, we will talk mostly of a single user interacting with git.

4.1.2 Git versus GitHub

  • They are not the same thing!
  • Git is software that you can run on your own machine for doing version control on a repository.
    • It can be entirely local. i.e. only on your hard drive and nowhere else.
    • This is super-useful for any project, because solid version control is great to have.
  • GitHub is a website, with tools powered by Git (and many that they brewed up themselves) that makes it very, very easy to share git repositories with people all over the place.

4.1.2.1 Is everything on GitHub public?

  • No! Many companies use GitHub to host their proprietary code
    • They just have to pay for that…
  • By default, you can put anything on GitHub for free as long as it is under a fairly free-use (open source type) license and it is available to anyone
  • If you want a private repository, as an academic affiliate you just have to ask and GitHub will give you unlimited private repos for free.
  • And if you are a student and you have not yet done so, go to https://education.github.com/pack to sign up for your free student pack.

4.1.3 Using git through RStudio

  • Now we can do a few things together to see how this works.
  • Most of the action is in the Git Pane…
  • Today we will talk about:
    • Staging files (preparing them to be committed)
    • Committing files (putting them into the repository)
    • viewing differences between staged and unstaged files
    • committing files
    • viewing the commit history

4.1.3.1 Two final configurations before starting:

  • Open the shell (Tools->Shell…) and issue these two commands, replacing the name “John Doe” with yours, and his email with yours.
    • You may as well use the email address that you gave to GitHub, though it doesn’t necessarily have to be

      git config --global user.name "John Doe"
      git config --global user.email johndoe@example.com
      You only need to do this once.
  • Finally, if you are using a Mac, configure it to cache your GitHub credentials so you needn’t give your password every time you push to it:

    git config --global credential.helper osxkeychain

4.1.3.2 The status/staging panel

  • RStudio keeps git constantly scanning the project directory to find any files that have changed or which are new.
  • By clicking a file’s little “check-box” you can stage it.
  • Some symbols:
    • Blue-M: a file that is already under version control that has been modified.
    • Yellow-?: a file that is not under version control (yet…)
    • Green-A: a file that was not under version control, but which has been staged to be committed.
    • Red-D: a file under version control has been deleted. To make it really disappear, you have to stage its disappearance and commit. Note that it still lives on, but you have to dig back into your history to find it.
    • Purple-R a file that was renamed. (Note that git in Rstudio seems to be figuring this out on its own.)

4.1.3.3 The Diff window

  • Shows what has changed between the last committed version of a file and its current state.
  • Holy smokes this is convenient
  • (Note: all this output is available from the command line, but the Rstudio interface is very nice, IMHO)

4.1.3.4 Making a Commit

  • Super easy:
    • After staging the files you want to commit…
    • Write a brief message (first line short, then as much after that as you want) and hit the commit button.

4.1.3.5 The History window

  • Easy inspection of past commits.
  • See what changes were made at each commit.

4.1.4 Go for it everyone!

  • Make some changes and commit them yourselves.
  • Add some new files to the project, and commit those.
  • Get familiar with the diff window.
  • Check the history after a few commits.

4.1.5 How does git store and keep track of things

  • Everything is stored in the .git folder inside the RStudio project.
  • The “working copy” gets checkout out of there
  • Committed changes are recorded to the directory

4.1.5.1 What is inside of the .git directory?

We can use R to list the files. My rep-res-course repository that hold all the materials for a course like this one looks like:

## check out this file-system command in R
dir(path = ".git", all.files = TRUE, recursive = TRUE)

The output from that command looks something like this:

  [1] "#MERGE_MSG#"                                       "COMMIT_EDITMSG"                                   
  [3] "COMMIT_EDITMSG~"                                   "config"                                           
  [5] "description"                                       "FETCH_HEAD"                                       
  [7] "HEAD"                                              "hooks/applypatch-msg.sample"                      
  [9] "hooks/commit-msg.sample"                           "hooks/post-update.sample"                         
 [11] "hooks/pre-applypatch.sample"                       "hooks/pre-commit.sample"                          
 [13] "hooks/pre-push.sample"                             "hooks/pre-rebase.sample"                          
 [15] "hooks/prepare-commit-msg.sample"                   "hooks/update.sample"                              
 [17] "index"                                             "info/exclude"                                     
 [19] "logs/HEAD"                                         "logs/refs/heads/gh-pages"                         
 [21] "logs/refs/heads/master"                            "logs/refs/remotes/origin/gh-pages"                
 [23] "logs/refs/remotes/origin/master"                   "objects/00/906f99e192ff64b4e9e9a0e5745b0a4f841cbd"
 [25] "objects/01/ab18d4ce04fb06532bb06ed579218fef89d478" "objects/02/74554e0b574b9beb2144f26ad3925830056870"
 [27] "objects/03/2d224bf78798e8b9765af6d8768ade14694a9d" "objects/04/03d552ab37b0bcaeebed0ac3068d669261c456"
 [29] "objects/04/4a12f8ccc12a4a5ba84ab2bf5a1ae751feea6f" "objects/04/9ec3065bb0434ded671fa83af5ade803bc11a1"
 [31] "objects/04/ea8efb1367727b081dea87e63818be0a4d02f0" "objects/05/b22ecc373d5058e36d7ca773a4475c46daef77"
 [33] "objects/07/8831b46c9b63e8c2d50b79304ed05de9274c28" "objects/07/b57af2a0cbd0545a6cd3e93f10cc5d768e42ba"
 [35] "objects/08/674e6e4d534b3424e2629510d20bb6d1b0be94" "objects/08/8b282d5b978dc1ff6eef3871d3fb3a9256246a"
 [37] "objects/09/565dcd10d7adc0551783b443e8fd71486b3997" "objects/09/6828a0cfb96f30d6e99cfc04a5c1686b9e318c"
 [39] "objects/0a/30fe678abc342c58daab0ad42163b371babda0" "objects/0a/b71109dae6e5711755feddfb06b81b13766496"
 [41] "objects/0b/442fdfc183783537985c17151ae3483fa00cf6" "objects/0b/c0451fd0e7081a7db05fdd38b12870bdcabd13"
 [43] "objects/0c/0f7cf8c73d901795dad4bd5f504c53c3bf2093" "objects/0d/14a7a2a19ffc3b9820f011e3270c965a5fefce"
 [45] "objects/0e/35cfb4d55e52d27083b8d2eccab9296b920d76" "objects/0e/7bba5882077a8b00a76d3eabe6b23cadc658e6"
 [47] "objects/0e/8abf4cc0885a727ee2459fdbb272828e267cc4" "objects/0f/1f3f7be7787d5d44dc1155f3b7a44eddc9f0be"
 [49] "objects/10/54d2e7a9baf61618521c522b15db40855b3431" "objects/11/7d874e1616500b5fba51b9f0ee1e8d0fbe1dc2"
 [51] "objects/11/c33cc1d5c8de7c7cbf7257b7d32f7ca3d458ef" "objects/14/1ccea514e106e20eef47b791a23e036d1fa1d2"
 [53] "objects/15/cc3a6f15dadb3446ad0af34a3ecde8d81d65f9" "objects/15/ddaf45bf00c3ef2d8f499ebd6dc3a86bf9c3ab"
 [55] "objects/16/0c9386dfa9707d81fbbbcc52f0c7638703f9a9" "objects/16/8ee93b6a4612dbd76bc06a49460df9f9f6c41b"

Yikes!

4.1.5.2 How does git know a file has changed?

  • Does it just look at the modification date?
  • NO! It “fingerprints” every file, so it knows when it has changed from the most recent committed version.
    • Demonstration. Change a file. Save, then undo the change and save again…Git knows the file has been changed back to its “former self”
  • SHA-1 hashes. We will learn more about those later.
  • You will see things like ed00c10ae6cf7bcc35d335d2edad7e71bc0f6770 all over in Git-land.
  • You can treat them as very specific names for different commits.

4.1.5.3 What should I keep under version control?

  • General rule: don’t keep derived products.
    • i.e. If you have an Rmd file that creates an html file, there isn’t much need to put the html file under version control with git, because you can just regenerate it by Knitting the Rmd file.
  • Do keep data, source code, etc.
  • Sometimes certain outputs and intermediate results from long calculations can be committed so that you don’t have to run a 4 hour analysis to start where you were before.
  • For such results, consider saveRDS() with the compress = "xz" option (and its companion readRDS().

4.1.5.4 How can I make git ignore certain files?

  • The .gitignore file!
  • File names (and patterns) in the .gitignore file are ignored recursively (down into subdirectories), by default.
  • Files won’t be ignored if they are already in the repository.
  • Example: *.html

4.2 Pushing and Pulling With GitHub

4.2.1 Creating a Repository on GitHub and the initial push

When you have an RStudio project under git version control on your laptop or desktop computer, creating a remote repository on GitHub is quite easy. A few steps:

  1. Upper right corner: “create new” button (a “+” with a little triangle.) Choose “New Repository”
  2. Give it a name. It makes most sense to name it the same as the RStudio project you want to push up there. So, for example, if my project file was boing.Rproj, I would name the repository boing.
  3. Add a 5 or 6 word description if you want.
  4. Choose public or private
  5. DO NOT choose to “Initialize this repository with a README”. You likely already have a README. Initializing the repository with one will create headaches.
  6. Also, don’t add a .gitignore or a license (select “none”, which should be the default, for both of those)
  7. Click the green “Create Repository” button

That will take you to another screen. In the middle find the code box below the heading, …or push an existing repository from the command line.

  1. Copy that two lines of code from you web browser. It will look something like this:

    git remote add origin https://github.com/eriqande/boing.git
    git push -u origin master
    but it will be specific to the repository you just made, so the URL and name of the repo will be different than what you see above. Note, you can copy the lines by clicking the “copy this text” icon on the right side of the page.
  2. Go to RStudio, in the project that you want to push to GitHub, and choose “Tools-Shell”. That will give you a terminal window. Paste the commands you copied into that terminal window and hit return.
  3. It might ask you for your GitHub username and password.

Voila!

4.2.2 Subsequent pushes

Once you have pushed the repo up there. Try making some changes on your laptop, committing them, and then hitting the “Push” button on the git panel…

4.2.3 Assign collaborators

From the repository page on GitHub, choose “Settings” (on the upper right) and find the “collaborators” link (on the left).

If you have a private repository, you can add GitHub user eriqande (that’s me…) to it and I will be able to view it and give comments and suggestions.

4.3 Next Week’s Assignment

  • With luck, we will get everyone’s projects up to GitHub before the end of our session today. However, if we don’t, please get that done ASAP.
  • The big assignment is to read the Data Transformation chapter in “R for Data Science.” Warning: This is a long and meaty chapter, so get an early start! The chapter goes through what you need to know to leverage all the dplyr goodness. Please work all the examples, and do the exercises. In fact, when you are working through the examples with the nycflights data set, you should, after each example, try to do the same type of operation on your own data set.