Data frames and factors
- Goals of this lecture:
- Introduce data frames! (Possibly the most widely-used and useful data structure in R)
- What is a data frame?
- Making data frames
- Viewing data frames in RStudio
- Indexing data frames
- Reading in data frames
- Writing data frames
- Introduce, very briefly, factors (A tricky little data structure that probably causes more problems than anything else in R.)
- What they are / what they look like.
- Why we talk about them with data frames
- How they behave.
- Ways that they are useful.
Data frames basics
What’s a data frame?
- A data frame is a list that:
- has the class data.frame
- has components that are all atomic vectors of the same length.
- Think of them as a table of data. Where:
- The rows are records and
- The columns are the atomic vectors that contain values of variables.
- Probably 90% of the time (or more), what someone might call a data set is something that can be represented in R as a data frame.
Example:
d <- data.frame(
  age = c(4, 6, 3, 4),
  sex = c("MALE", "FEMALE", "FEMALE", "MALE"),
  height.inches = c(40, 49, 38, 42),
  favorite.sport.or.activity = c("soccer", "soccer", "martial_arts", "ballet")
)
# now, print it to the screen
d
#>   age    sex height.inches favorite.sport.or.activity
#> 1   4   MALE            40                     soccer
#> 2   6 FEMALE            49                     soccer
#> 3   3 FEMALE            38               martial_arts
#> 4   4   MALE            42                     ballet
This thing is shaped like a matrix and can be indexed in special ways (below), but at its core it is a list.
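To see the list underneath the table shape, you can simply ask R. This is a minimal sketch on a two-row stand-in called d_small (a name and values made up for illustration); the stringsAsFactors = FALSE argument is added here only so the result is the same on old and new versions of R.

```r
# A two-row stand-in for the data frame above (illustrative values)
d_small <- data.frame(
  age = c(4, 6),
  sex = c("MALE", "FEMALE"),
  stringsAsFactors = FALSE
)

is.list(d_small)         # a data frame really is a list
typeof(d_small)          # the underlying type is "list"
class(d_small)           # the class attribute is "data.frame"
length(d_small)          # list length = number of columns
sapply(d_small, class)   # each component is an atomic vector
```

The length of a data frame is its number of columns, not its number of rows, which is exactly what you would expect from a list of column vectors.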
The data.frame() function
- Syntactically, this is like the list() function, taking “key=value” pairs.
- For example, the first component has the “key” age and the “value” c(4, 6, 3, 4).
- The keys become the names attribute of the data frame.
But it returns a data.frame:
class(d)
#> [1] "data.frame"
The names / colnames of a data frame
The names attribute of a data frame holds the “column headers”:
names(d)
#> [1] "age"                        "sex"
#> [3] "height.inches"              "favorite.sport.or.activity"
These can also be accessed as the colnames (column names):
colnames(d)
#> [1] "age"                        "sex"
#> [3] "height.inches"              "favorite.sport.or.activity"
Which raises the question: are there row names of a data frame? Let’s try:
rownames(d)
#> [1] "1" "2" "3" "4"
The rownames of a data frame
- You can assign names to the rows of a data frame. Use the rownames() function. For example:
rownames(d) <- c("Jon", "Scarlett", "Nancy", "Terry")
# then print it out again:
d
#>          age    sex height.inches favorite.sport.or.activity
#> Jon        4   MALE            40                     soccer
#> Scarlett   6 FEMALE            49                     soccer
#> Nancy      3 FEMALE            38               martial_arts
#> Terry      4   MALE            42                     ballet
rownames have to be unique!
rownames(d) <- c("Jon", "Scarlett", "Nancy", "Jon")
#> Warning: non-unique value when setting 'row.names': 'Jon'
#> Error in `row.names<-.data.frame`(`*tmp*`, value = value): duplicate 'row.names' are not allowed
…and the right length, too:
rownames(d) <- c("Jon", "Scarlett")
#> Error in `row.names<-.data.frame`(`*tmp*`, value = value): invalid 'row.names' length
If you don’t provide them, they will be the integers 1:nrow(df).
Dimensions of a data frame
A useful summary of the extent of a data frame is dim. Likewise ncol and nrow:
dim(d)
#> [1] 4 4
nrow(d)
#> [1] 4
ncol(d)
#> [1] 4
Data frame indexing
- data frames can be indexed like lists or like matrices
Data frame indexing like a list
Single-chomp extractor [ ] with a single vector and no commas picks out the columns, and returns them as another data.frame:
# index with integers
d[c(1,3)]
#>          age height.inches
#> Jon        4            40
#> Scarlett   6            49
#> Nancy      3            38
#> Terry      4            42
# index with colnames
d[c("age", "sex")]
#>          age    sex
#> Jon        4   MALE
#> Scarlett   6 FEMALE
#> Nancy      3 FEMALE
#> Terry      4   MALE
Note that the rownames get carried along with the result.
Two-chomp extractor $ returns the vector itself. (Naked, not as part of a data frame)
d$age
#> [1] 4 6 3 4
d$height.inches
#> [1] 40 49 38 42
Two-chomp extractor [[ ]] does the same as the $ but doesn’t do prefix-matching:
d[["age"]]
#> [1] 4 6 3 4
The rownames don’t come along with the result.
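The three list-style extractors can be compared side by side. This is a small sketch on a made-up data frame called kids (not one of the lecture's objects), and it also shows what "prefix-matching" means in practice.

```r
# Made-up data for illustration
kids <- data.frame(
  age = c(4, 6, 3),
  height.inches = c(40, 49, 38),
  row.names = c("Jon", "Scarlett", "Nancy")
)

one_col_df <- kids["age"]      # single chomp: still a data frame
age_vec    <- kids[["age"]]    # two chomps: the bare numeric vector
same_vec   <- kids$age         # $ gives the same bare vector

class(one_col_df)
class(age_vec)
identical(age_vec, same_vec)

# $ accepts an unambiguous prefix of a column name; [[ ]] does not
prefix_dollar  <- kids$hei         # partial match to height.inches
prefix_bracket <- kids[["hei"]]    # exact matching only, so this is NULL
```

Prefix matching is convenient at the console but easy to get bitten by in scripts, which is one reason to prefer [[ ]] with the full column name when writing code you intend to keep.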
Matrix-like indexing of data frames
- This is a new thing! Subset with two vectors separated by a comma!
- i.e., [rows, cols] where:
- rows is an indexing vector for the rows: row indices, rownames, or logical values
- cols is an indexing vector for the columns: column indices, colnames, or logical values
- And…(big note!) the absence of rows or cols means “give me all of them”, as in d[1:2,]
- rows and cols can be:
- positive integer vectors,
- negative integer vectors,
- character vectors of names,
- logical vectors
- (or mixtures of these, i.e. rows as one kind and cols as another)
Examples:
d[,] # the whole data frame
#>          age    sex height.inches favorite.sport.or.activity
#> Jon        4   MALE            40                     soccer
#> Scarlett   6 FEMALE            49                     soccer
#> Nancy      3 FEMALE            38               martial_arts
#> Terry      4   MALE            42                     ballet
d[,1:3] # all rows, first three columns
#>          age    sex height.inches
#> Jon        4   MALE            40
#> Scarlett   6 FEMALE            49
#> Nancy      3 FEMALE            38
#> Terry      4   MALE            42
d[c(1,4), ] # first and fourth rows, all columns
#>       age  sex height.inches favorite.sport.or.activity
#> Jon     4 MALE            40                     soccer
#> Terry   4 MALE            42                     ballet
d[-1, -2] # all rows except 1 and all columns except 2
#>          age height.inches favorite.sport.or.activity
#> Scarlett   6            49                     soccer
#> Nancy      3            38               martial_arts
#> Terry      4            42                     ballet
d[d$sex == "MALE", c("age", "favorite.sport.or.activity")] # age and favorite activities of MALES
#>       age favorite.sport.or.activity
#> Jon     4                     soccer
#> Terry   4                     ballet
d[d$sex == "FEMALE", c(1,3)] # ages and heights of FEMALES
#>          age height.inches
#> Scarlett   6            49
#> Nancy      3            38
d[d$age == 3, ] # all columns from the one three-year-old
#>       age    sex height.inches favorite.sport.or.activity
#> Nancy   3 FEMALE            38               martial_arts
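The examples above use logical vectors only for the rows. As a sketch of the same idea applied to columns, the code below builds a logical vector with sapply() and is.numeric() and uses it in the cols position. The kids data frame is made up for illustration.

```r
# Made-up data for illustration
kids <- data.frame(
  name = c("Jon", "Scarlett", "Nancy"),
  age = c(4, 6, 3),
  height.inches = c(40, 49, 38),
  stringsAsFactors = FALSE
)

# TRUE for each column that holds numbers, FALSE otherwise
is_num <- sapply(kids, is.numeric)

# logical rows AND logical columns in a single extraction
tall_numeric <- kids[kids$height.inches >= 40, is_num]
tall_numeric
```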
Whoa! What happens when [rows, cols] picks out a single column?
Beware, if your [rows, cols] extractor picks out just a single column, then by default, R will just return an (unnamed) vector, not a data frame!
# ages of Jon and Terry... What! Where's my data frame?
d[c("Jon", "Terry"), "age"]
#> [1] 4 4
When you want to get a one-column data frame rather than a naked vector, do this:
d[c("Jon", "Terry"), "age", drop = FALSE]
#>       age
#> Jon     4
#> Terry   4
This is super-important if you are writing functions that grab variable numbers of columns out of data frames (or matrices)
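Here is a minimal sketch of why this matters inside a function. The helper first_cols() is hypothetical, written only for this illustration: it returns the first n columns of a data frame, and without drop = FALSE it would quietly hand back a bare vector whenever n is 1.

```r
# Hypothetical helper: return the first n columns of a data frame
first_cols <- function(df, n) {
  df[, seq_len(n), drop = FALSE]   # always a data frame, even when n is 1
}

# Made-up data for illustration
kids <- data.frame(age = c(4, 6, 3), height.inches = c(40, 49, 38))

one <- first_cols(kids, 1)
two <- first_cols(kids, 2)
class(one)
class(two)
class(kids[, 1])   # without drop = FALSE: a bare numeric vector
```

Code downstream of such a function can then call nrow(), names(), and so on, without having to special-case the one-column result.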
Replacement form indexing
All these indexing methods have replacement forms:
# change Terry's favorite activity to paint-ball
d["Terry", 4] <- "paint-ball"
#> Warning in `[<-.factor`(`*tmp*`, iseq, value = "paint-ball"): invalid
#> factor level, NA generated
d # print it
#>          age    sex height.inches favorite.sport.or.activity
#> Jon        4   MALE            40                     soccer
#> Scarlett   6 FEMALE            49                     soccer
#> Nancy      3 FEMALE            38               martial_arts
#> Terry      4   MALE            42                       <NA>
# what if we tried to change it to "mushroom hunting"?
d["Terry", 4] <- "mushroom hunting"
#> Warning in `[<-.factor`(`*tmp*`, iseq, value = "mushroom hunting"):
#> invalid factor level, NA generated
d
#>          age    sex height.inches favorite.sport.or.activity
#> Jon        4   MALE            40                     soccer
#> Scarlett   6 FEMALE            49                     soccer
#> Nancy      3 FEMALE            38               martial_arts
#> Terry      4   MALE            42                       <NA>
Surprise! What happened? (Wait till we talk about factors later.) Note that this output comes from a version of R before 4.0.0, in which data.frame() converted character columns to factors by default; from R 4.0.0 on, the columns stay character and the same assignment simply succeeds.
Assigning values to columns will recycle to the right length:
# make them all five years old...
d$age <- 5
d
#>          age    sex height.inches favorite.sport.or.activity
#> Jon        5   MALE            40                     soccer
#> Scarlett   5 FEMALE            49                     soccer
#> Nancy      5 FEMALE            38               martial_arts
#> Terry      5   MALE            42                       <NA>
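The NA in Terry's row arises because that column is a factor. As a sketch of how replacement behaves when text columns are plain character vectors, the code below builds a made-up data frame called kids with stringsAsFactors = FALSE (an explicit choice for this illustration) and then uses three replacement forms.

```r
# Made-up data for illustration
kids <- data.frame(
  age = c(4, 6, 3),
  activity = c("soccer", "soccer", "ballet"),
  row.names = c("Jon", "Scarlett", "Nancy"),
  stringsAsFactors = FALSE   # keep text as plain character vectors
)

kids["Nancy", "activity"] <- "paint-ball"   # replace a single cell
kids[kids$age > 3, "age"] <- 10             # one value recycled over two rows
kids$height.inches <- NA                    # new column, recycled to 3 rows
kids
```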
Reading, viewing, and writing data frames
- Hooray! We are finally learning what to do to get our own data into R!
- We’ll use some data from Big Creek for examples
- You should pull the master branch of https://github.com/eriqande/rep-res-course.git to get a file in the data directory.
- Then go ahead and open up RStudio in that repository if you want to follow along.
- I have the first 100 lines of the big-creek data set in the data directory in both:
- .xlsx format (Ahhh! This is just here if you want to see it. Remember, never house and manipulate the sole copy of your data in Excel!)
- .csv format (comma separated values, a decent format for reading into R)
Rather than opening .csv files in Excel to look at them, it’s possible to just look at them if they are on GitHub. Try this link.
read.table()
- A function that reads in “table-shaped” data and returns a data frame
- read.table() is a rather generic function that lets you specify:
- file: the name of the file
- header: TRUE/FALSE depending on whether the file has a header row for the columns
- sep: the character used to separate columns
- row.names: column number holding the values to be used for rownames
- na.strings: what strings signify values that should be read as NA
- And many, many others. Do ?read.table for the complete list.
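To see those arguments in action without needing any particular file, here is a minimal sketch that first writes a tiny tab-separated file to a temporary location and then reads it back. The file contents and the name fish are made up for illustration.

```r
# Write a small tab-separated file to a temporary location (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "id\tlength\tsex",
  "fish1\t50\tmale",
  "fish2\t---\tfemale",
  "fish3\t61\t---"
), tmp)

fish <- read.table(
  tmp,
  header = TRUE,            # the first line holds the column names
  sep = "\t",               # columns are separated by tabs
  row.names = 1,            # use column 1 (id) as the rownames
  na.strings = "---",       # treat "---" as missing data
  stringsAsFactors = FALSE  # keep text columns as character
)
fish
str(fish)
```

Because the id column was used for the rownames, it no longer counts as a column of the resulting data frame.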
read.csv()
- A function identical to read.table() except that the default values are set up to read in CSV files (like those produced by Excel…)
Let’s try it:
bc <- read.csv("data/big_creek_excerpt.csv", stringsAsFactors = FALSE, na.strings = c(""))
- We are using two extra options:
- stringsAsFactors = FALSE (see next lecture)
- na.strings = c(""): This means count empty cells as missing data
Did that work? Check the dim of bc:
dim(bc)
#> [1] 100 55
Sweet!
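The effect of na.strings = c("") is easiest to see on a tiny example. This sketch reads the same made-up CSV text twice, once with that option and once without, using the text argument that read.csv() passes along to read.table(). None of these values come from the Big Creek file.

```r
# Made-up CSV content with two empty cells
csv_text <- "ID,LENGTH,SEX
M01,50,
M02,,female
M03,61,male"

with_na    <- read.csv(text = csv_text, stringsAsFactors = FALSE, na.strings = c(""))
without_na <- read.csv(text = csv_text, stringsAsFactors = FALSE)

is.na(with_na$SEX)      # the empty text cell is read as missing
without_na$SEX == ""    # by default it is an empty string instead
is.na(with_na$LENGTH)   # an empty numeric cell is NA either way
```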
Looking at our data frame
- To figure out what is in our data frame, there are several options.
- Just print it: bc. If the data frame is large, this produces a bunch of hard-to-read output
- All rows and as many columns as can fit on the screen…then the next set of columns, etc.
- Use the head function, i.e., head(bc). Prints just the first 6 rows. With lots of columns, this is hard to read too.
Use indexing to look at just a small part, i.e.:
bc[1:5, 1:4]
#>   NMFS_DNA_ID BOX_ID BOX_POSITION         SAMPLE_ID
#> 1     M035484   M355           1A  5-21-2008-UBC-98
#> 2     M035485   M355           1B 5-21-2008-UBC-102
#> 3     M035486   M355           1C 5-21-2008-UBC-103
#> 4     M035487   M355           1D 5-21-2008-UBC-116
#> 5     M035488   M355           1E 5-21-2008-UBC-226
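As a side note on head(): the number of rows it shows can be changed with its n argument, and tail() does the same job for the end of the data frame. A small sketch on a made-up data frame called big:

```r
# Made-up data for illustration
big <- data.frame(id = 1:20, value = (1:20)^2)

nrow(head(big))      # head() returns 6 rows unless told otherwise
head(big, n = 3)     # ask for a different number of rows
tail(big, n = 2)     # the matching function for the last rows
```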
Look at the names:
names(bc)
#>  [1] "NMFS_DNA_ID"         "BOX_ID"              "BOX_POSITION"
#>  [4] "SAMPLE_ID"           "TK"                  "BATCH_ID"
#>  [7] "PROJECT_NAME"        "GENUS"               "SPECIES"
#> [10] "LENGTH"              "WEIGHT"              "SEX"
#> [13] "AGE"                 "REPORTED_LIFE_STAGE" "PHENOTYPE"
#> [16] "HATCHERY_MARK"       "TAG_NUMBER"          "COLLECTION_DATE"
#> [19] "ESTIMATED_DATE"      "PICKER"              "PICK_DATE"
#> [22] "LEFTOVER_SAMPLE"     "SAMPLE_COMMENTS"     "NMFS_DNA_ID.1"
#> [25] "STATE_F"             "COUNTY_F"            "WATERSHED"
#> [28] "TRIB_1"              "TRIB_2"              "WATER_NAME"
#> [31] "REACH_SITE"          "HATCHERY"            "STRAIN"
#> [34] "LATITUDE_F"          "LONGITUDE_F"         "LOCATION_COMMENTS_F"
#> [37] "NMFS_DNA_ID.2"       "SNPplate"            "Plate_POS"
#> [40] "BOX_ID.1"            "DilPlate"            "SNPplateorder"
#> [43] "SNPorder"            "DilSampleOrder"      "DNAbox"
#> [46] "DNABoxSampleOrder"   "CONCAT_ID"           "Omy_AldA"
#> [49] "Omy_AldA.1"          "SexID"               "SexID.1"
#> [52] "SH95489.423"         "SH95489.423.1"       "SH100771.63"
#> [55] "SH100771.63.1"
That is a little cumbersome.
Perhaps the most information-rich way of looking at it is with the str function, which gives you the structure of an R object:
str(bc)
#> 'data.frame':    100 obs. of  55 variables:
#>  $ NMFS_DNA_ID        : chr  "M035484" "M035485" "M035486" "M035487" ...
#>  $ BOX_ID             : chr  "M355" "M355" "M355" "M355" ...
#>  $ BOX_POSITION       : chr  "1A" "1B" "1C" "1D" ...
#>  $ SAMPLE_ID          : chr  "5-21-2008-UBC-98" "5-21-2008-UBC-102" "5-21-2008-UBC-103" "5-21-2008-UBC-116" ...
#>  $ TK                 : chr  "UBC05210898" "UBC052108102" "UBC052108103" "UBC052108116" ...
#>  $ BATCH_ID           : int  3038 3038 3038 3038 3038 3038 3038 3038 3038 3038 ...
#>  $ PROJECT_NAME       : chr  NA NA NA NA ...
#>  $ GENUS              : chr  "Oncorhynchus" "Oncorhynchus" "Oncorhynchus" "Oncorhynchus" ...
#>  $ SPECIES            : chr  "mykiss" "mykiss" "mykiss" "mykiss" ...
#>  $ LENGTH             : int  50 177 50 150 49 48 205 59 61 60 ...
#>  $ WEIGHT             : num  1.3 78.9 1.1 48.1 1.2 ...
#>  $ SEX                : chr  NA NA NA NA ...
#>  $ AGE                : logi  NA NA NA NA NA NA ...
#>  $ REPORTED_LIFE_STAGE: logi  NA NA NA NA NA NA ...
#>  $ PHENOTYPE          : logi  NA NA NA NA NA NA ...
#>  $ HATCHERY_MARK      : logi  NA NA NA NA NA NA ...
#>  $ TAG_NUMBER         : num  NA 1.52e+08 NA 1.52e+08 NA ...
#>  $ COLLECTION_DATE    : chr  "5/21/08" "5/21/08" "5/21/08" "5/21/08" ...
#>  $ ESTIMATED_DATE     : logi  NA NA NA NA NA NA ...
#>  $ PICKER             : chr  "AC" "AC" "AC" "AC" ...
#>  $ PICK_DATE          : chr  "7/13/09" "7/13/09" "7/13/09" "7/13/09" ...
#>  $ LEFTOVER_SAMPLE    : logi  NA NA NA NA NA NA ...
#>  $ SAMPLE_COMMENTS    : chr  "Gender ID samples, Notes: M-R, Database number: 964, Upper caudal clip" "Gender ID samples, Notes: M-R, Database number: 7171" "Gender ID samples, Notes: M-R, Database number: 965, Upper caudal clip" "Gender ID samples, Notes: M-R, Database number: 7178" ...
#>  $ NMFS_DNA_ID.1      : chr  "M035484" "M035485" "M035486" "M035487" ...
#>  $ STATE_F            : chr  "California" "California" "California" "California" ...
#>  $ COUNTY_F           : chr  "Monterey" "Monterey" "Monterey" "Monterey" ...
#>  $ WATERSHED          : chr  "Big Creek" "Big Creek" "Big Creek" "Big Creek" ...
#>  $ TRIB_1             : logi  NA NA NA NA NA NA ...
#>  $ TRIB_2             : logi  NA NA NA NA NA NA ...
#>  $ WATER_NAME         : chr  "Big Creek" "Big Creek" "Big Creek" "Big Creek" ...
#>  $ REACH_SITE         : chr  "upper" "upper" "upper" "upper" ...
#>  $ HATCHERY           : logi  NA NA NA NA NA NA ...
#>  $ STRAIN             : logi  NA NA NA NA NA NA ...
#>  $ LATITUDE_F         : logi  NA NA NA NA NA NA ...
#>  $ LONGITUDE_F        : logi  NA NA NA NA NA NA ...
#>  $ LOCATION_COMMENTS_F: chr  "Running distance from bottom of reach: 225m" "Running distance from bottom of reach: 225m" "Running distance from bottom of reach: 250m" "Running distance from bottom of reach: 275m" ...
#>  $ NMFS_DNA_ID.2      : chr  "M035484" "M035485" "M035486" "M035487" ...
#>  $ SNPplate           : chr  "MPQ" "MPQ" "MPQ" "MPQ" ...
#>  $ Plate_POS          : chr  "1A" "1B" "1C" "1D" ...
#>  $ BOX_ID.1           : chr  "M355" "M355" "M355" "M355" ...
#>  $ DilPlate           : chr  "MPS" "MPQ" "MPQ" "MPQ" ...
#>  $ SNPplateorder      : chr  "7D" "1B" "1C" "1D" ...
#>  $ SNPorder           : int  41 11 23 35 47 59 71 83 1 12 ...
#>  $ DilSampleOrder     : int  52 2 3 4 5 6 7 8 9 10 ...
#>  $ DNAbox             : chr  "M355" "M355" "M355" "M355" ...
#>  $ DNABoxSampleOrder  : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ CONCAT_ID          : chr  "MPQ_M355_1A" "MPQ_M355_1B" "MPQ_M355_1C" "MPQ_M355_1D" ...
#>  $ Omy_AldA           : int  3 0 3 3 4 3 3 4 4 3 ...
#>  $ Omy_AldA.1         : int  4 0 4 4 4 4 4 4 4 4 ...
#>  $ SexID              : int  0 7 6 7 7 6 7 0 6 6 ...
#>  $ SexID.1            : int  0 6 6 6 6 6 6 0 6 6 ...
#>  $ SH95489.423        : int  3 0 3 3 3 3 3 1 3 1 ...
#>  $ SH95489.423.1      : int  1 0 1 3 1 1 1 1 1 1 ...
#>  $ SH100771.63        : int  4 4 1 4 1 1 4 4 4 4 ...
#>  $ SH100771.63.1      : int  1 1 1 1 1 1 1 4 4 1 ...
- Finally, RStudio offers the very useful View function. Try this: View(bc)
- You can even pop that out into a separate window.
- They really ought to find a way to keep the headers visible when scrolling.
Writing a data frame back out to a .csv file
- There is a write.table function much like read.table
- And there is a write.csv function that is similar
Here we pick out just the fish between 60 and 100 mm and write the resulting data frame back to a .csv file:
bc2 <- bc[bc$LENGTH >= 60 & bc$LENGTH <= 100, ]
write.csv(bc2, file = "~/Desktop/bc-bits.csv")
and you can open that with Excel, even.
- Note that the numeric rownames are in there by default with no header.
- If you read it back in, you would want to use row.names = 1.
- Read ?write.table for more info.
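The round trip can be sketched with a temporary file, so nothing lands on your Desktop. The kids data frame is made up for illustration. The second half shows one alternative: leaving the rownames out of the file with row.names = FALSE.

```r
# Made-up data for illustration
kids <- data.frame(age = c(4, 6, 3), height.inches = c(40, 49, 38))
out <- tempfile(fileext = ".csv")

# Default: the rownames are written as a first column with an empty header
write.csv(kids, file = out)
back <- read.csv(out, row.names = 1)   # tell R that column 1 holds rownames

# Alternative: leave the rownames out of the file entirely
write.csv(kids, file = out, row.names = FALSE)
back2 <- read.csv(out)

names(back)
names(back2)
```

Either way you end up with the same two columns; reading the first file without row.names = 1 would instead give you an extra column named X.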
A tiny blurb about factors
- In read.csv we used the option stringsAsFactors = FALSE
- What does that mean, and why did I use it?
- In versions of R before 4.0.0, all the read.table family of functions convert columns with character data (i.e. text strings) to an object of class factor by default. (Since R 4.0.0 the default is stringsAsFactors = FALSE, so you get factors only if you ask for them.)
- In R you will see factors everywhere.
- The name derives from the idea of factors in experimental design, which is a shame (I think) since factors in R are useful in many ways.
- My suggestion: when you see factor think vector of categories
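A quick way to see what the option does is to build the same column both ways and compare the classes. This is a minimal sketch with made-up values; setting the option explicitly in both calls means the result does not depend on which version of R you are running.

```r
sizes <- c("Small", "Large", "Medium")   # made-up values

as_chr <- data.frame(size = sizes, stringsAsFactors = FALSE)
as_fct <- data.frame(size = sizes, stringsAsFactors = TRUE)

class(as_chr$size)   # text kept as a character vector
class(as_fct$size)   # text converted to a factor
```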
Factors are vectors that record discrete categories
- Anything measured on a discrete scale can be said to fall into one of a set of categories.
- The discrete scale could be a summary of a continuous scale
- For example, the categories of Small, Medium, and Large are (likely) summaries of a continuous variable like weight or height.
If you have measured fish and put them into Small, Medium, and Large categories, you might have them in a data frame like this:
set.seed(17)
sml <- data.frame(ID = paste("Fish", 1:15, sep = "_"),
                  SizeCategory = sample(c("Small", "Medium", "Large"), size = 15, replace = TRUE),
                  stringsAsFactors = TRUE  # explicit, so SizeCategory is a factor on any version of R
)
# when you print it out it looks pretty normal
sml
#>         ID SizeCategory
#> 1   Fish_1        Small
#> 2   Fish_2        Large
#> 3   Fish_3       Medium
#> 4   Fish_4        Large
#> 5   Fish_5       Medium
#> 6   Fish_6       Medium
#> 7   Fish_7        Small
#> 8   Fish_8        Small
#> 9   Fish_9        Large
#> 10 Fish_10        Small
#> 11 Fish_11       Medium
#> 12 Fish_12        Small
#> 13 Fish_13        Large
#> 14 Fish_14        Large
#> 15 Fish_15        Large
Underlying structure of a factor
- The “SizeCategory” column looks like a vector of strings (a character vector), but it isn’t.
- A factor is a class that contains:
- A levels attribute that maps N categories to the integers 1, …, N
- (This sounds more complex than it is. It is just a character vector that gives an ordered collection of category names)
- An integer vector of values between 1 and N used to describe the occurrence of the categories.
- What? If that’s not clear, continuing with the sml example from above should help clarify things.
sml data frame’s SizeCategory
We can access the levels attribute of sml$SizeCategory like this:
levels(sml$SizeCategory)
#> [1] "Large"  "Medium" "Small"
- The order these are in the levels tells us that:
- 1 = “Large”
- 2 = “Medium”
- 3 = “Small”
And the integer vector part of sml$SizeCategory can be visualized by attaching it on the right side of the sml data frame like this:
cbind(sml, underlying_integer_vector = unclass(sml$SizeCategory))
#>         ID SizeCategory underlying_integer_vector
#> 1   Fish_1        Small                         3
#> 2   Fish_2        Large                         1
#> 3   Fish_3       Medium                         2
#> 4   Fish_4        Large                         1
#> 5   Fish_5       Medium                         2
#> 6   Fish_6       Medium                         2
#> 7   Fish_7        Small                         3
#> 8   Fish_8        Small                         3
#> 9   Fish_9        Large                         1
#> 10 Fish_10        Small                         3
#> 11 Fish_11       Medium                         2
#> 12 Fish_12        Small                         3
#> 13 Fish_13        Large                         1
#> 14 Fish_14        Large                         1
#> 15 Fish_15        Large                         1
(Note that, by default, if categories are named by characters, R sorts them alphabetically to give them an order in the levels of the factor.)
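The alphabetical default is rarely the order you want for sizes. As a sketch of the two pieces described above (the levels and the integer codes), the code below builds the same made-up values into a factor twice: once with the default levels and once with the levels supplied explicitly through the levels argument of factor().

```r
sizes <- c("Small", "Large", "Medium", "Large")   # made-up values

f_default <- factor(sizes)   # levels sorted alphabetically
f_ordered <- factor(sizes, levels = c("Small", "Medium", "Large"))  # levels given explicitly

levels(f_default)
as.integer(f_default)   # codes refer to positions in levels(f_default)
levels(f_ordered)
as.integer(f_ordered)   # same values, different codes
table(f_ordered)        # counts are reported in level order
```

The printed values of the two factors are identical; only the mapping between category names and integers differs, and that mapping is what controls the order in tables and plots.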
Factors are immensely useful, but tricky
- We will continue talking about factors on Thursday.
- Before that class, please download The R Inferno and read the Preface on page 8 and the first few paragraphs of Chapter 1 (because it is fun to do so—we have all been in R hell at one time or another), then read from section 8.2 through 8.2.8, which covers factor hell.
Your mission
In lieu of homework on this topic, everyone should just do the following while this is fresh in your mind:
- Read ?read.table
- Go get your own data sets that you want to work with (or are working with) and read them into R and have a look around them.
- Look over their structure
- Print them to the console in various ways
- View() them.
- Change some values
- Extract just a few, non-adjacent columns
- Then save those non-adjacent columns to a new csv file.
If you don’t have your own data and want some practice, play with more files that I put in the data directory of the course repo:
# parentage assignments of hatchery salmon
pbt <- read.table("data/snppit_output_ParentageAssignments.txt", header = TRUE, na.strings = "---")
dim(pbt)
#> [1] 7837 23
# candidate genes involved in avian song development
bird_genes <- read.table("data/candidate-genes.txt", header = TRUE, sep = "\t")