Data frames and factors

Goals of this lecture:
1. Introduce data frames! (Possibly the most widely-used and useful data structure in R)
  1. What is a data frame?
  2. Making data frames
  3. Viewing data frames in RStudio
  4. Indexing data frames
  5. Reading in data frames
  6. Writing data frames
2. Introduce, very briefly, factors (A tricky little data structure that probably causes more problems than anything else in R.)
  1. What they are / what they look like.
  2. Why we talk about them with data frames
  3. How they behave.
  4. Ways that they are useful.

Data frames basics

What’s a data frame?

A data frame is a list that:
- has the class data.frame
- has components that are all atomic vectors of the same length.
Think of a data frame as a table of data, where:
- The rows are records and
- The columns are the atomic vectors that contain values of variables.
Probably 90% of the time (or more), what someone might call a data set is something that can be represented in R as a data frame.

Example:

d <- data.frame(
  age = c(4, 6, 3, 4), 
  sex = c("MALE", "FEMALE", "FEMALE", "MALE"), 
  height.inches = c(40, 49, 38, 42), 
  favorite.sport.or.activity = c("soccer", "soccer", "martial_arts", "ballet")
)
# now, print it to the screen
d
#>   age    sex height.inches favorite.sport.or.activity
#> 1   4   MALE            40                     soccer
#> 2   6 FEMALE            49                     soccer
#> 3   3 FEMALE            38               martial_arts
#> 4   4   MALE            42                     ballet

This thing is shaped like a matrix and can be indexed in special ways (below), but at its core it is a list.
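
One way to convince yourself of that list-like core is to ask R directly. The sketch below uses a small made-up data frame (toy and its columns are hypothetical, not part of the lecture data) and checks that it is a list whose components all have the same length:

```r
# A small made-up data frame; `toy` and its columns are illustrative only
toy <- data.frame(id = 1:3, weight = c(1.5, 2.0, 3.25))

is.list(toy)        # TRUE: a data frame is a list underneath
is.data.frame(toy)  # TRUE: ...that carries the class "data.frame"
length(toy)         # 2: as a list, it has one component per column
lengths(toy)        # every component (column) has length 3
```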

The data.frame() function

Syntactically, data.frame() works like the list() function: it takes “key = value” pairs.
- For example, the first component has the “key” age and the “value” c(4, 6, 3, 4).
- The keys become the names attribute of the data frame.
Unlike list(), however, it returns a data.frame:
```
class(d)
#> [1] "data.frame"
```

The names / colnames of a data frame

The names attribute of a data frame holds the “column headers”:

names(d)
#> [1] "age"                        "sex"                       
#> [3] "height.inches"              "favorite.sport.or.activity"

These can also be accessed as the colnames (column names):

colnames(d)
#> [1] "age"                        "sex"                       
#> [3] "height.inches"              "favorite.sport.or.activity"

Which raises the question: are there rownames of a data frame? Let’s try:

rownames(d)
#> [1] "1" "2" "3" "4"

The rownames of a data frame

You can assign names to the rows of a data frame.

Use the rownames() function. For example:

rownames(d) <- c("Jon", "Scarlett", "Nancy", "Terry")
# then print it out again:
d
#>          age    sex height.inches favorite.sport.or.activity
#> Jon        4   MALE            40                     soccer
#> Scarlett   6 FEMALE            49                     soccer
#> Nancy      3 FEMALE            38               martial_arts
#> Terry      4   MALE            42                     ballet

rownames have to be unique!

rownames(d) <- c("Jon", "Scarlett", "Nancy", "Jon")
#> Warning: non-unique value when setting 'row.names': 'Jon'
#> Error in `row.names<-.data.frame`(`*tmp*`, value = value): duplicate 'row.names' are not allowed

…and the right length, too:

rownames(d) <- c("Jon", "Scarlett")
#> Error in `row.names<-.data.frame`(`*tmp*`, value = value): invalid 'row.names' length

If you don’t provide rownames, they default to the integers 1:nrow(d) (returned as character strings, as seen above).
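
If you have set rownames and want the automatic ones back, assigning NULL restores them. A minimal sketch on a made-up data frame (toy is a hypothetical name):

```r
# Made-up data frame for illustration
toy <- data.frame(x = c(10, 20, 30))

rownames(toy)                      # "1" "2" "3" (the automatic rownames)
rownames(toy) <- c("a", "b", "c")  # assign custom rownames
rownames(toy) <- NULL              # discard them again
rownames(toy)                      # back to "1" "2" "3"
```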

Dimensions of a data frame

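A useful summary of the extent of a data frame is dim(), which returns the number of rows and the number of columns. Likewise, nrow() and ncol() return each of those separately: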
```
dim(d)
#> [1] 4 4
nrow(d)
#> [1] 4
ncol(d)
#> [1] 4
```

Data frame indexing

Data frames can be indexed like lists or like matrices.

Data frame indexing like a list

The single-chomp extractor [ ] with a single vector and no commas picks out columns and returns them as another data.frame:

# index with integers
d[c(1,3)]
#>          age height.inches
#> Jon        4            40
#> Scarlett   6            49
#> Nancy      3            38
#> Terry      4            42

# index with colnames
d[c("age", "sex")]
#>          age    sex
#> Jon        4   MALE
#> Scarlett   6 FEMALE
#> Nancy      3 FEMALE
#> Terry      4   MALE

Note that the rownames get carried along with the result.

Two-chomp extractor $ returns the vector itself. (Naked, not as part of a data frame)
```
d$age
#> [1] 4 6 3 4

d$height.inches
#> [1] 40 49 38 42
```
The two-chomp extractor [[ ]] does the same as $, but doesn’t do prefix-matching:
```
d[["age"]]
#> [1] 4 6 3 4
```
The rownames don’t come along with the result.
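
To see what prefix-matching means in practice, here is a small sketch on a made-up data frame (toy is hypothetical). It assumes a plain base-R data.frame; other table classes may behave differently:

```r
# Made-up data frame for illustration
toy <- data.frame(height.inches = c(40, 49), age = c(4, 6))

toy$hei                 # `$` completes the unique prefix "hei" to height.inches
toy[["hei"]]            # NULL: `[[` requires the exact name by default
toy[["height.inches"]]  # the full name works with either extractor
```

Prefix-matching saves typing at the console, but spelling names out in full is the safer habit in scripts.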

Matrix-like indexing of data frames

This is a new thing! Subset with two vectors separated by a comma,
i.e., [rows, cols] where:
- rows is an indexing vector for the rows: row indices, rownames, or logical values
- cols is an indexing vector for the columns: column indices, colnames, or logical values
- And…(big note!) the absence of rows or cols means “give me all of them”, e.g. d[1:2, ]
rows and cols can be:
- positive integer vectors,
- negative integer vectors,
- character vectors of names,
- logical vectors,
- (or a mixture of types, i.e. rows as one type and cols as another)

Examples:

d[,]  # the whole data frame
#>          age    sex height.inches favorite.sport.or.activity
#> Jon        4   MALE            40                     soccer
#> Scarlett   6 FEMALE            49                     soccer
#> Nancy      3 FEMALE            38               martial_arts
#> Terry      4   MALE            42                     ballet

d[,1:3] # all rows, first three columns
#>          age    sex height.inches
#> Jon        4   MALE            40
#> Scarlett   6 FEMALE            49
#> Nancy      3 FEMALE            38
#> Terry      4   MALE            42

d[c(1,4), ] # first and fourth rows, all columns
#>       age  sex height.inches favorite.sport.or.activity
#> Jon     4 MALE            40                     soccer
#> Terry   4 MALE            42                     ballet

d[-1, -2] # all rows except 1 and all columns except 2
#>          age height.inches favorite.sport.or.activity
#> Scarlett   6            49                     soccer
#> Nancy      3            38               martial_arts
#> Terry      4            42                     ballet

d[d$sex == "MALE", c("age", "favorite.sport.or.activity")] # age and favorite activities of MALES
#>       age favorite.sport.or.activity
#> Jon     4                     soccer
#> Terry   4                     ballet

d[d$sex == "FEMALE", c(1,3)] # ages and heights of  FEMALES
#>          age height.inches
#> Scarlett   6            49
#> Nancy      3            38

d[d$age == 3, ] # all columns from the one three-year-old
#>       age    sex height.inches favorite.sport.or.activity
#> Nancy   3 FEMALE            38               martial_arts

Whoa! What happens when [rows, cols] picks out a single column?

Beware: if your [rows, cols] extractor picks out just a single column, then by default R will return an (unnamed) vector, not a data frame!
```
# ages of Jon and Terry... What! Where's my data frame?
d[c("Jon", "Terry"), "age"]  
#> [1] 4 4
```

When you want to get a one-column data frame rather than a naked vector, do this:

d[c("Jon", "Terry"), "age", drop = FALSE]
#>       age
#> Jon     4
#> Terry   4

This is super-important if you are writing functions that grab variable numbers of columns out of data frames (or matrices)
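
As a sketch of why this matters inside functions, consider a hypothetical helper pick_columns() (not part of the course code) that should hand back a data frame no matter how many columns the caller asks for:

```r
# Hypothetical helper: always returns a data frame, even for a single column
pick_columns <- function(df, cols) {
  df[, cols, drop = FALSE]
}

# Made-up data frame for illustration
toy <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))

one <- pick_columns(toy, "a")
is.data.frame(one)          # TRUE: still a data frame
is.data.frame(toy[, "a"])   # FALSE: without drop = FALSE we get a naked vector
```

Code downstream of pick_columns() can then call nrow(), names(), and so on without special-casing the one-column result.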

Replacement form indexing

All these indexing methods have replacement forms:

# change Terry's favorite activity to paint-ball
d["Terry", 4] <- "paint-ball"
#> Warning in `[<-.factor`(`*tmp*`, iseq, value = "paint-ball"): invalid
#> factor level, NA generated
d # print it
#>          age    sex height.inches favorite.sport.or.activity
#> Jon        4   MALE            40                     soccer
#> Scarlett   6 FEMALE            49                     soccer
#> Nancy      3 FEMALE            38               martial_arts
#> Terry      4   MALE            42                       <NA>

# what if we tried to change it to "mushroom hunting"?
d["Terry", 4] <- "mushroom hunting"
#> Warning in `[<-.factor`(`*tmp*`, iseq, value = "mushroom hunting"):
#> invalid factor level, NA generated
d
#>          age    sex height.inches favorite.sport.or.activity
#> Jon        4   MALE            40                     soccer
#> Scarlett   6 FEMALE            49                     soccer
#> Nancy      3 FEMALE            38               martial_arts
#> Terry      4   MALE            42                       <NA>

Surprise! What happened? (Wait till we talk about factors later.)

Assigning values to columns will recycle to the right length:

# make them all five years old...
d$age <- 5
d
#>          age    sex height.inches favorite.sport.or.activity
#> Jon        5   MALE            40                     soccer
#> Scarlett   5 FEMALE            49                     soccer
#> Nancy      5 FEMALE            38               martial_arts
#> Terry      5   MALE            42                       <NA>
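
Recycling only works when the number of rows is a whole multiple of the length of the value you assign. A small sketch on a made-up data frame (toy and its column names are hypothetical):

```r
# Made-up data frame with four rows
toy <- data.frame(id = 1:4)

toy$flag <- TRUE          # length 1: recycled to all four rows
toy$grp  <- c("a", "b")   # length 2: recycled twice (a, b, a, b)

# Length 3 does not divide 4 evenly, so the assignment fails with an error
outcome <- tryCatch({
  toy$bad <- 1:3
  "assigned"
}, error = function(e) "error")
outcome
```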

Reading, viewing, and writing data frames

Hooray! We are finally learning what to do to get our own data into R!
We’ll use some data from Big Creek for examples
- You should pull the master branch of https://github.com/eriqande/rep-res-course.git to get a file in the data directory.
- Then go ahead and open up RStudio in that repository if you want to follow along.
I have the first 100 lines of the big-creek data set in the data directory in both
- .xlsx format (Ahhh! This is just here if you want to see it. Remember, never house and manipulate the sole copy of your data in Excel!)
- .csv format (comma-separated values: a decent format for reading into R)
Rather than opening .csv files in Excel to look at them, it’s possible to just look at them if they are on GitHub. Try this link.

read.table()

A function that reads in “table-shaped” data and returns a data frame.
read.table() is a rather generic function that lets you specify:
- file : the name of the file
- header : TRUE/FALSE depending on whether the file has a header row for the columns
- sep : the character used to separate columns
- row.names : column number holding the values to be used for rownames
- na.strings : what strings signify values that should be read as NA
And many, many others. Do ?read.table for the complete list.
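
To try these arguments without needing a file on disk, read.table() also accepts a text argument. The sketch below feeds it a few made-up, semicolon-separated lines (the data and the name raw_lines are invented for illustration) in which "---" marks a missing value:

```r
# Made-up records: semicolon-separated, header row, "---" for missing values
raw_lines <- c("id;len;sex",
               "F1;50;male",
               "F2;---;female")

tab <- read.table(text = raw_lines,
                  header = TRUE,       # first line holds the column names
                  sep = ";",           # columns are separated by semicolons
                  row.names = 1,       # use the first column (id) as rownames
                  na.strings = "---")  # read "---" as NA

rownames(tab)  # the ids, now used as rownames
tab$len        # the second value was read as NA
```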

read.csv()

A function identical to read.table() except that the default values are set up to read in CSV files (like those produced by Excel…)
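
As a rough check of that claim, the sketch below reads the same made-up CSV lines both ways. The arguments passed to read.table() are the defaults that ?read.csv documents (header, sep, quote, fill, comment.char); csv_lines is an invented example:

```r
# Made-up CSV content for illustration
csv_lines <- c("id,len", "F1,50", "F2,61")

from_csv <- read.csv(text = csv_lines)

from_table <- read.table(text = csv_lines,
                         header = TRUE, sep = ",", quote = "\"",
                         fill = TRUE, comment.char = "")

identical(from_csv, from_table)  # TRUE: same data frame either way
```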

Let’s try it:

bc <- read.csv("data/big_creek_excerpt.csv", stringsAsFactors = FALSE, na.strings = c(""))

We are using two extra options:
- stringsAsFactors = FALSE (see next lecture)
- na.strings = c("") : This means count empty cells as missing data
Did that work? Check the dim of bc:
```
dim(bc)
#> [1] 100  55
```
Sweet!

Looking at our data frame

To figure out what is in our data frame, there are several options.

Just print it: bc. If the data frame is large, this produces a bunch of hard-to-read output.
- All rows at as many columns as can fit on the screen…then the next set of columns, etc.
Use the head function, i.e., head(bc). By default it prints just the first 6 rows. With lots of columns, this is hard to read too.
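
head() returns the first six rows by default, and its n argument changes that; tail() does the same from the other end. A quick sketch on a made-up data frame (toy is hypothetical):

```r
# Made-up data frame with 20 rows
toy <- data.frame(i = 1:20, sq = (1:20)^2)

nrow(head(toy))          # 6: the default
nrow(head(toy, n = 10))  # 10: ask for more rows with n
tail(toy, n = 3)         # the last three rows
```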

Use indexing to look at just a small part: i.e.:

bc[1:5, 1:4]
#>   NMFS_DNA_ID BOX_ID BOX_POSITION         SAMPLE_ID
#> 1     M035484   M355           1A  5-21-2008-UBC-98
#> 2     M035485   M355           1B 5-21-2008-UBC-102
#> 3     M035486   M355           1C 5-21-2008-UBC-103
#> 4     M035487   M355           1D 5-21-2008-UBC-116
#> 5     M035488   M355           1E 5-21-2008-UBC-226

Look at the names:

names(bc)
#>  [1] "NMFS_DNA_ID"         "BOX_ID"              "BOX_POSITION"       
#>  [4] "SAMPLE_ID"           "TK"                  "BATCH_ID"           
#>  [7] "PROJECT_NAME"        "GENUS"               "SPECIES"            
#> [10] "LENGTH"              "WEIGHT"              "SEX"                
#> [13] "AGE"                 "REPORTED_LIFE_STAGE" "PHENOTYPE"          
#> [16] "HATCHERY_MARK"       "TAG_NUMBER"          "COLLECTION_DATE"    
#> [19] "ESTIMATED_DATE"      "PICKER"              "PICK_DATE"          
#> [22] "LEFTOVER_SAMPLE"     "SAMPLE_COMMENTS"     "NMFS_DNA_ID.1"      
#> [25] "STATE_F"             "COUNTY_F"            "WATERSHED"          
#> [28] "TRIB_1"              "TRIB_2"              "WATER_NAME"         
#> [31] "REACH_SITE"          "HATCHERY"            "STRAIN"             
#> [34] "LATITUDE_F"          "LONGITUDE_F"         "LOCATION_COMMENTS_F"
#> [37] "NMFS_DNA_ID.2"       "SNPplate"            "Plate_POS"          
#> [40] "BOX_ID.1"            "DilPlate"            "SNPplateorder"      
#> [43] "SNPorder"            "DilSampleOrder"      "DNAbox"             
#> [46] "DNABoxSampleOrder"   "CONCAT_ID"           "Omy_AldA"           
#> [49] "Omy_AldA.1"          "SexID"               "SexID.1"            
#> [52] "SH95489.423"         "SH95489.423.1"       "SH100771.63"        
#> [55] "SH100771.63.1"

That is a little cumbersome

Perhaps the most information-rich way of looking at it is with the str function, which gives you the structure (hence “str”) of an R object:

str(bc)
#> 'data.frame':    100 obs. of  55 variables:
#>  $ NMFS_DNA_ID        : chr  "M035484" "M035485" "M035486" "M035487" ...
#>  $ BOX_ID             : chr  "M355" "M355" "M355" "M355" ...
#>  $ BOX_POSITION       : chr  "1A" "1B" "1C" "1D" ...
#>  $ SAMPLE_ID          : chr  "5-21-2008-UBC-98" "5-21-2008-UBC-102" "5-21-2008-UBC-103" "5-21-2008-UBC-116" ...
#>  $ TK                 : chr  "UBC05210898" "UBC052108102" "UBC052108103" "UBC052108116" ...
#>  $ BATCH_ID           : int  3038 3038 3038 3038 3038 3038 3038 3038 3038 3038 ...
#>  $ PROJECT_NAME       : chr  NA NA NA NA ...
#>  $ GENUS              : chr  "Oncorhynchus" "Oncorhynchus" "Oncorhynchus" "Oncorhynchus" ...
#>  $ SPECIES            : chr  "mykiss" "mykiss" "mykiss" "mykiss" ...
#>  $ LENGTH             : int  50 177 50 150 49 48 205 59 61 60 ...
#>  $ WEIGHT             : num  1.3 78.9 1.1 48.1 1.2 ...
#>  $ SEX                : chr  NA NA NA NA ...
#>  $ AGE                : logi  NA NA NA NA NA NA ...
#>  $ REPORTED_LIFE_STAGE: logi  NA NA NA NA NA NA ...
#>  $ PHENOTYPE          : logi  NA NA NA NA NA NA ...
#>  $ HATCHERY_MARK      : logi  NA NA NA NA NA NA ...
#>  $ TAG_NUMBER         : num  NA 1.52e+08 NA 1.52e+08 NA ...
#>  $ COLLECTION_DATE    : chr  "5/21/08" "5/21/08" "5/21/08" "5/21/08" ...
#>  $ ESTIMATED_DATE     : logi  NA NA NA NA NA NA ...
#>  $ PICKER             : chr  "AC" "AC" "AC" "AC" ...
#>  $ PICK_DATE          : chr  "7/13/09" "7/13/09" "7/13/09" "7/13/09" ...
#>  $ LEFTOVER_SAMPLE    : logi  NA NA NA NA NA NA ...
#>  $ SAMPLE_COMMENTS    : chr  "Gender ID samples, Notes: M-R, Database number: 964, Upper caudal clip" "Gender ID samples, Notes: M-R, Database number: 7171" "Gender ID samples, Notes: M-R, Database number: 965, Upper caudal clip" "Gender ID samples, Notes: M-R, Database number: 7178" ...
#>  $ NMFS_DNA_ID.1      : chr  "M035484" "M035485" "M035486" "M035487" ...
#>  $ STATE_F            : chr  "California" "California" "California" "California" ...
#>  $ COUNTY_F           : chr  "Monterey" "Monterey" "Monterey" "Monterey" ...
#>  $ WATERSHED          : chr  "Big Creek" "Big Creek" "Big Creek" "Big Creek" ...
#>  $ TRIB_1             : logi  NA NA NA NA NA NA ...
#>  $ TRIB_2             : logi  NA NA NA NA NA NA ...
#>  $ WATER_NAME         : chr  "Big Creek" "Big Creek" "Big Creek" "Big Creek" ...
#>  $ REACH_SITE         : chr  "upper" "upper" "upper" "upper" ...
#>  $ HATCHERY           : logi  NA NA NA NA NA NA ...
#>  $ STRAIN             : logi  NA NA NA NA NA NA ...
#>  $ LATITUDE_F         : logi  NA NA NA NA NA NA ...
#>  $ LONGITUDE_F        : logi  NA NA NA NA NA NA ...
#>  $ LOCATION_COMMENTS_F: chr  "Running distance from bottom of reach: 225m" "Running distance from bottom of reach: 225m" "Running distance from bottom of reach: 250m" "Running distance from bottom of reach: 275m" ...
#>  $ NMFS_DNA_ID.2      : chr  "M035484" "M035485" "M035486" "M035487" ...
#>  $ SNPplate           : chr  "MPQ" "MPQ" "MPQ" "MPQ" ...
#>  $ Plate_POS          : chr  "1A" "1B" "1C" "1D" ...
#>  $ BOX_ID.1           : chr  "M355" "M355" "M355" "M355" ...
#>  $ DilPlate           : chr  "MPS" "MPQ" "MPQ" "MPQ" ...
#>  $ SNPplateorder      : chr  "7D" "1B" "1C" "1D" ...
#>  $ SNPorder           : int  41 11 23 35 47 59 71 83 1 12 ...
#>  $ DilSampleOrder     : int  52 2 3 4 5 6 7 8 9 10 ...
#>  $ DNAbox             : chr  "M355" "M355" "M355" "M355" ...
#>  $ DNABoxSampleOrder  : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ CONCAT_ID          : chr  "MPQ_M355_1A" "MPQ_M355_1B" "MPQ_M355_1C" "MPQ_M355_1D" ...
#>  $ Omy_AldA           : int  3 0 3 3 4 3 3 4 4 3 ...
#>  $ Omy_AldA.1         : int  4 0 4 4 4 4 4 4 4 4 ...
#>  $ SexID              : int  0 7 6 7 7 6 7 0 6 6 ...
#>  $ SexID.1            : int  0 6 6 6 6 6 6 0 6 6 ...
#>  $ SH95489.423        : int  3 0 3 3 3 3 3 1 3 1 ...
#>  $ SH95489.423.1      : int  1 0 1 3 1 1 1 1 1 1 ...
#>  $ SH100771.63        : int  4 4 1 4 1 1 4 4 4 4 ...
#>  $ SH100771.63.1      : int  1 1 1 1 1 1 1 4 4 1 ...

Finally, RStudio offers the very useful View function. Try this: View(bc)
- You can even pop that out into a separate window.
- They really ought to find a way to keep the headers visible when scrolling.

Writing a data frame back out to a .csv file

There is a write.table function, much like read.table.
And there is a write.csv function that is similar.
Here we pick out just the fish between 60 and 100 mm and write the resulting data frame back to a .csv file:
```
bc2 <- bc[ bc$LENGTH >= 60 & bc$LENGTH <= 100, ]

write.csv(bc2, file = "~/Desktop/bc-bits.csv")
```
and you can open that with Excel, even.
- Note that the numeric rownames are in there by default with no header.
- If you read it back in, you would want to use row.names = 1.
- Read ?write.table for more info.
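
The rowname behavior is easiest to see with a round trip. This sketch writes a made-up data frame to a temporary file (so nothing lands on your Desktop) and reads it back with and without row.names = 1; toy and tmp are hypothetical names:

```r
# Made-up data frame for illustration
toy <- data.frame(len = c(50, 80), site = c("upper", "lower"))

tmp <- tempfile(fileext = ".csv")
write.csv(toy, file = tmp)   # rownames are written as an unnamed first column

back_plain <- read.csv(tmp)
names(back_plain)            # "X" "len" "site": the rownames became a column called X

back_named <- read.csv(tmp, row.names = 1)
names(back_named)            # "len" "site": the first column was used as rownames
```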

A tiny blurb about factors

In read.csv we used the option stringsAsFactors = FALSE
- What does that mean, and why did I use it?
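In all the read.table family of functions, columns with character data (i.e. text strings) get converted to an object of class factor by default. (That was the default up through R 3.6; since R 4.0.0 the default is stringsAsFactors = FALSE.)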
In R you will see factors everywhere.
The name derives from the idea of factors in experimental design, which is a shame (I think) since factors in R are useful in many ways.
My suggestion: when you see “factor”, think “vector of categories”.
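
To see the difference the option makes, the sketch below reads the same made-up CSV lines twice, setting stringsAsFactors explicitly each time so the result does not depend on your R version's default (csv_lines and the data are invented):

```r
# Made-up CSV content for illustration
csv_lines <- c("id,size", "F1,Small", "F2,Large", "F3,Small")

as_chr <- read.csv(text = csv_lines, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_lines, stringsAsFactors = TRUE)

class(as_chr$size)  # "character"
class(as_fct$size)  # "factor"
```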

Factors are vectors that record discrete categories

Anything measured on a discrete scale can be said to fall into one of a set of categories.
The discrete scale could be a summary of a continuous scale.
- For example, the categories of Small, Medium, and Large are (likely) summaries of a continuous variable like weight or height.

If you have measured fish and put them into Small, Medium, and Large categories, you might have them in a data frame like this:

set.seed(17)
sml <- data.frame(ID = paste("Fish", 1:15, sep="_"),
                  SizeCategory = sample(c("Small", "Medium", "Large"), size = 15, replace = TRUE),
                  stringsAsFactors = TRUE  # explicit: R >= 4.0.0 no longer makes factors by default
                  )

# when you print it out it looks pretty normal
sml                 
#>         ID SizeCategory
#> 1   Fish_1        Small
#> 2   Fish_2        Large
#> 3   Fish_3       Medium
#> 4   Fish_4        Large
#> 5   Fish_5       Medium
#> 6   Fish_6       Medium
#> 7   Fish_7        Small
#> 8   Fish_8        Small
#> 9   Fish_9        Large
#> 10 Fish_10        Small
#> 11 Fish_11       Medium
#> 12 Fish_12        Small
#> 13 Fish_13        Large
#> 14 Fish_14        Large
#> 15 Fish_15        Large

Underlying structure of a factor

The “SizeCategory” column looks like a vector of strings (a character vector), but it isn’t.
A factor is a class that contains:
1. A levels attribute that maps N categories to the integers 1, …, N
  - (This sounds more complex than it is. It is just a character vector that gives an ordered collection of category names)
2. An integer vector of values between 1 and N used to describe the occurrence of the categories.
What? If that’s not clear, continuing with the sml example from above should help clarify things

sml data frame’s SizeCategory

We can access the levels attribute of sml$SizeCategory like this:
```
levels(sml$SizeCategory)
#> [1] "Large"  "Medium" "Small"
```
The order of the levels tells us that:
- 1 = “Large”
- 2 = “Medium”
- 3 = “Small”

And the integer vector part of sml$SizeCategory can be visualized by attaching it on the right side of the sml data frame like this:

cbind(sml, underlying_integer_vector = unclass(sml$SizeCategory))
#>         ID SizeCategory underlying_integer_vector
#> 1   Fish_1        Small                         3
#> 2   Fish_2        Large                         1
#> 3   Fish_3       Medium                         2
#> 4   Fish_4        Large                         1
#> 5   Fish_5       Medium                         2
#> 6   Fish_6       Medium                         2
#> 7   Fish_7        Small                         3
#> 8   Fish_8        Small                         3
#> 9   Fish_9        Large                         1
#> 10 Fish_10        Small                         3
#> 11 Fish_11       Medium                         2
#> 12 Fish_12        Small                         3
#> 13 Fish_13        Large                         1
#> 14 Fish_14        Large                         1
#> 15 Fish_15        Large                         1

(Note that, by default, if categories are named by characters, R sorts them alphabetically to give them an order in the levels of the factor.)
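
If alphabetical order is not the order you want, factor() lets you supply the levels yourself. A minimal sketch with a made-up vector (sizes, alpha, and by_size are hypothetical names):

```r
# Made-up category values for illustration
sizes <- c("Small", "Large", "Medium", "Small")

alpha <- factor(sizes)   # default: levels sorted alphabetically
levels(alpha)            # "Large" "Medium" "Small"
as.integer(alpha)        # 3 1 2 3

by_size <- factor(sizes, levels = c("Small", "Medium", "Large"))
levels(by_size)          # "Small" "Medium" "Large"
as.integer(by_size)      # 1 3 2 1
```

The category labels are identical in both versions; only the mapping from labels to the underlying integers changes.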

Factors are immensely useful, but tricky

We will continue talking about factors on Thursday.
Before that class, please download The R Inferno and read the Preface on page 8 and the first few paragraphs of Chapter 1 (because it is fun to do so—we have all been in R hell at one time or another), then read from section 8.2 through 8.2.8, which covers factor hell.

Your mission

In lieu of homework on this topic, everyone should just do the following while this is fresh in your mind:

Read ?read.table
Go get your own data sets that you want to work with (or are working with) and read them into R and have a look around them.
- Look over their structure
- print them to the console in various ways
- View() them.
- Change some values
- Extract just a few, non-adjacent columns
- Then save those non-adjacent columns to a new csv file.
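
As a rough template for that checklist, here is the whole sequence on a made-up data frame (toy, keep, and out_file are hypothetical names; substitute your own data and paths):

```r
# Made-up data standing in for your own data set
toy <- data.frame(id   = c("F1", "F2", "F3"),
                  len  = c(50, 61, 177),
                  wt   = c(1.3, 2.2, 78.9),
                  site = c("upper", "upper", "lower"))

str(toy)                       # look over the structure
toy[2, "len"] <- 60            # change a value
keep <- toy[, c("id", "wt")]   # extract two non-adjacent columns

# Save to a temporary location; row.names = FALSE leaves the rownames out
out_file <- file.path(tempdir(), "toy-subset.csv")
write.csv(keep, file = out_file, row.names = FALSE)
```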

If you don’t have your own data and want some practice, play with more files that I put in the data directory of the course repo:

# parentage assignments of hatchery salmon
pbt <- read.table("data/snppit_output_ParentageAssignments.txt", header = TRUE, na.strings = "---")
dim(pbt)
#> [1] 7837   23

# candidate genes involved in avian song development
bird_genes <- read.table("data/candidate-genes.txt", header = TRUE, sep = "\t")

Reproducible Research Course by Eric C. Anderson for (NOAA/SWFSC)
