Reproducible Research Course by Eric C. Anderson for (NOAA/SWFSC)


Comments and thoughts on Homework #1 (Trial Homework)

Preliminaries

First off!

  1. Woo-hoo! Way to go everyone who got those in!
  2. Woo-hoo! Way to go everyone who is still working on it!

I’m pumped by how many people made their first pull request.

What does a pull request look like to me?

  • Check it out!
    • I get an email, and Gmail is GitHub-aware
    • I can see the changes that you have made
    • I can comment, etc.
  • You can all do this too! Just go to https://github.com/eriqande/rep-res-course and find the pull requests button.
    • In fact, if you aren’t sure how to do the homework or what the best answer is, feel free to browse what other people have done and get ideas.
      • I don’t consider this cheating—especially if you view everyone’s responses with a scientific attitude. You’ll be learning about GitHub and reviewing lots of R code.
      • Keep in mind that some suggested answers you see from other people might not be optimal.
      • If you see that someone has made a mistake and want to let them know, just comment on their commit.
  • Note, please keep your pull requests Open. That way my scripts can fetch your work easily.
  • I will Close them when we are done with them. You can always Re-open them.

What if I want to change an answer?

  • By all means, feel free. This is where GitHub really excels.
  • Just make your changes, commit them, and push them up and the pull request should be automatically updated (I think…)

My responses database

Show it to them. View(ans)

General comments from what I saw

It is great to have everyone’s responses. Here are some comments that should be helpful to everyone.

Strive for Economy of characters

When you are writing code, shorter is usually (but not always) going to be

  1. easier to read
  2. easier to debug
  3. easier to maintain

That holds as long as the shorter code still clearly expresses the intent of the program.

Along those lines (intermixed with some of my OCD code-style ideas), some guidelines are:

  1. You don’t have to define intermediate variables. Sometimes it is helpful to break up a long calculation with some intermediates, but not always. So:

    # this is preferred
    gnames > "github"
    
    # this makes unnecessary variable assignments
    a <- "github"
    b <- a < gnames
    b
    
    # this also makes unnecessary variable assignments
    y <- c("github")
    x <- gnames > y
    x

    The important take-home message is that an expression behaves like a variable anywhere in R: wherever you could use a variable, you can use the expression that computes its value.
  2. Character vectors don’t have to be a single character, so you can say what you want!

    # this is preferred
    gnames > "github"
    
    # this is not so precise.  Might work in a certain
    # problem, but is not general:
    gnames > "g"
  3. You don’t have to repeat the question in the answer:

    # here are some github names of people taking the course
    gnames <- c("cpetrik", "wildflowermt", "mad4mocha", "sjohnson216", "okisutch99", "sczTWilliams", "rbeas", "mtarjan", "aaronmams", "lslefebvre")
    
    # return a logical vector that gives TRUE for each name that comes after
    # the word "github" alphabetically
    submit_answer({
    gnames <- c("cpetrik", "wildflowermt", "mad4mocha", "sjohnson216", "okisutch99", "sczTWilliams", "rbeas", "mtarjan", "aaronmams", "lslefebvre")
    b <- c("github")
    gnames > b
    })
  4. If doing comparisons, put the variable on the left and the constant (if there is one) on the right:

    gnames > "github"  # eric prefers this
    "github" < gnames  # rather than this
  5. Some things aren’t necessary. They aren’t wrong, but they are not economical and make code harder to read. The top few from the last homework:
    1. If it is a vector, you don’t have to put c() around it to make it a vector:

      y <- c(gnames[x])  # gnames[x] is a vector already.
      y <- gnames[x]     # same thing as above, but preferred

      The c() function is for concatenating (combining) vectors, but beware of “growing vectors” (see below).
    2. Logical vectors index as logical vectors. They don’t have to be wrapped in which(). The function which(LL) returns the indexes for which the logical vector LL is TRUE. Many people wrap their logical vectors in it. Don’t.

      gnames[which(gnames > "github")] <- "zzz"  # unnecessary which
      gnames[gnames > "github"] <- "zzz"   # same thing and simpler
    3. Also, if it is a logical vector, you need not coerce it to a logical—it already is:

      as.logical(gnames > "github")  # unnecessary coercion
      gnames > "github" # the > comparison operator returns a logical vector anyway
    4. Get comfortable with precedence

      isAfterGithub <- (gnames > "github") # parentheses unnecessary
      isAfterGithub <- gnames > "github"   # same as above but easier to read
      gnames > "github"  # best: no intermediate assignment when not needed
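
One way to convince yourself that the longer forms in the guidelines above buy you nothing is to compare them against the short forms with identical(). This is only a sketch: the names direct, same_c, same_which, same_logical, and same_flipped are made up for illustration and are not part of the homework.

```r
# the github names from the homework question
gnames <- c("cpetrik", "wildflowermt", "mad4mocha", "sjohnson216", "okisutch99",
            "sczTWilliams", "rbeas", "mtarjan", "aaronmams", "lslefebvre")

# the preferred, direct form
direct <- gnames > "github"

# wrapping something that is already a vector in c() changes nothing
same_c <- identical(c(gnames[direct]), gnames[direct])

# which() returns integer positions; with no NAs present (as here),
# indexing by them selects the same elements as the logical vector
same_which <- identical(gnames[which(direct)], gnames[direct])

# coercing a logical vector to logical gives back the same vector
same_logical <- identical(as.logical(direct), direct)

# the comparison gives the same answer written from either side
same_flipped <- identical("github" < gnames, direct)

# each of these should be TRUE
c(same_c, same_which, same_logical, same_flipped)
```

Since every check agrees, the choice between the forms comes down to readability, which is the point of the guidelines.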

Don’t use a for loop if the vectorized operation will get you there

This was one of the hardest things for me as a C programmer, and I suspect that Python programmers might find it difficult too.

  • Remember: R is a vectorized language. If you give it a vector, it wants to operate elementwise on that vector. This means that quite often you needn’t write the for loops that you would have to write in C or Python.

    # this is concise and precise (and computationally efficient)
    gnames > "github"
    
    # this is how a C programmer thinks about it:
    x <- c()   # make an empty vector
    for (name in gnames) { # let name cycle over the values in gnames
        if (name > "github")  # test each value
            x <- c(x, TRUE) # if it is TRUE, "grow" x with a TRUE
        else x <- c(x, FALSE) # if it is FALSE, "grow" x with a FALSE
    }
    x # return x

    The latter is clearly harder to write, harder to maintain, and easier to hide bugs in than the former.
  • BUT, did you know that it is also orders of magnitude slower in R?
    • Try this at home, comparing 10^5 numbers:
    x <- rnorm(n = 10^5, mean=1.0, sd=5)  # make 10^5 numbers
    
    # test, element by element, which numbers are greater than 2.
    
    # the fast, vectorized way
    g2_fast <- x > 2
    
    # slower for-loop way
    gt_slow <- c()  # start with an empty vector and grow it
    for (i in seq_along(x)) {  # seq_along(x) also behaves if x is empty
      gt_slow <- c(gt_slow, x[i] > 2)
    }
    
    # see that you get the same result with either method:
    all(g2_fast == gt_slow)
    # but clearly the vectorized operation is faster

    The much-maligned “slowness” of R is sometimes attributable to not using vectorized operations.
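
To put rough numbers on the comparison above, you could wrap each approach in system.time(). The sketch below is an assumption-laden illustration, not a benchmark: it uses 10^4 numbers rather than 10^5 so the slow loop finishes quickly, the seed is arbitrary, and the names time_fast, time_slow, time_prealloc, and gt_pre are made up here. The third variant, a loop that fills a preallocated vector, is not from the homework; it is included only to separate the cost of "growing" a vector from the cost of the loop itself. Actual timings will vary by machine.

```r
set.seed(1)  # arbitrary seed, only so that reruns use the same numbers
x <- rnorm(n = 10^4, mean = 1.0, sd = 5)  # fewer numbers than in the text

# the fast, vectorized way
time_fast <- system.time(g2_fast <- x > 2)

# the slow way: grow the result one element at a time
time_slow <- system.time({
  gt_slow <- c()
  for (i in seq_along(x)) {
    gt_slow <- c(gt_slow, x[i] > 2)
  }
})

# still a loop, but filling a vector that was allocated up front
time_prealloc <- system.time({
  gt_pre <- logical(length(x))
  for (i in seq_along(x)) {
    gt_pre[i] <- x[i] > 2
  }
})

# all three approaches should agree on the answer
identical(g2_fast, gt_slow)
identical(g2_fast, gt_pre)

# elapsed seconds for each approach
c(fast     = time_fast[["elapsed"]],
  slow     = time_slow[["elapsed"]],
  prealloc = time_prealloc[["elapsed"]])
```

If you raise n back to 10^5, expect the growing loop to take noticeably longer, because each call to c() copies everything accumulated so far.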

