Factors
- Goals of this lecture:
- Go into moderate detail on factors (A tricky little data structure that probably causes more problems than anything else in R.)
- What they are / what they look like.
- Why we talk about them with data frames
- How they behave.
- Ways that they are useful.
- In the process we will look at the table() function and have some examples from the world of genetic assignment of birds.
Factor basics
Let’s reiterate some points/examples from the previous session.
Factors are vectors that record discrete categories
- Anything measured on a discrete scale can be said to fall into one of a set of categories.
- The discrete scale could be a summary of a continuous scale
- For example, the categories of Small, Medium, and Large are (likely) summaries of a continuous variable like weight or height.
If you have measured fish and put them into Small, Medium, and Large categories, you might have them in a data frame like this:
set.seed(17)
sml <- data.frame(ID = paste("Fish", 1:15, sep = "_"),
                  SizeCategory = sample(c("Small", "Medium", "Large"), size = 15, replace = TRUE),
                  stringsAsFactors = TRUE  # needed in R >= 4.0.0, where the default is FALSE
                  )
# when you print it out it looks pretty normal
sml
#>         ID SizeCategory
#> 1   Fish_1        Small
#> 2   Fish_2        Large
#> 3   Fish_3       Medium
#> 4   Fish_4        Large
#> 5   Fish_5       Medium
#> 6   Fish_6       Medium
#> 7   Fish_7        Small
#> 8   Fish_8        Small
#> 9   Fish_9        Large
#> 10 Fish_10        Small
#> 11 Fish_11       Medium
#> 12 Fish_12        Small
#> 13 Fish_13        Large
#> 14 Fish_14        Large
#> 15 Fish_15        Large
Underlying structure of a factor
- The “SizeCategory” column looks like a vector of strings (a character vector), but it isn’t.
- A factor is a class that contains:
- A levels attribute that maps N categories to the integers 1, …, N
- (This sounds more complex than it is. It is just a character vector that gives an ordered collection of category names)
- An integer vector of values between 1 and N used to describe the occurrence of the categories.
- What? If that’s not clear, continuing with the sml example from above should help clarify things.
sml data frame’s SizeCategory
We can access the levels attribute of sml$SizeCategory like this:
levels(sml$SizeCategory)
#> [1] "Large"  "Medium" "Small"
- The order these are in the levels tells us that:
- 1 = “Large”
- 2 = “Medium”
- 3 = “Small”
And the integer vector part of sml$SizeCategory can be visualized by attaching it on the right side of the sml data frame like this:
cbind(sml, underlying_integer_vector = unclass(sml$SizeCategory))
#>         ID SizeCategory underlying_integer_vector
#> 1   Fish_1        Small                         3
#> 2   Fish_2        Large                         1
#> 3   Fish_3       Medium                         2
#> 4   Fish_4        Large                         1
#> 5   Fish_5       Medium                         2
#> 6   Fish_6       Medium                         2
#> 7   Fish_7        Small                         3
#> 8   Fish_8        Small                         3
#> 9   Fish_9        Large                         1
#> 10 Fish_10        Small                         3
#> 11 Fish_11       Medium                         2
#> 12 Fish_12        Small                         3
#> 13 Fish_13        Large                         1
#> 14 Fish_14        Large                         1
#> 15 Fish_15        Large                         1
(Note that, by default, if categories are named by characters, R sorts them alphabetically to give them an order in the levels of the factor.)
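The two parts of a factor can also be seen on a tiny example. The following is a minimal sketch; the vector sizes is a hypothetical toy example, not part of the lecture data.

```r
# Hypothetical toy data, not from the lecture
sizes <- factor(c("Small", "Large", "Medium", "Large"))

# The levels attribute: category names, sorted alphabetically by default
levels(sizes)

# The underlying integer codes: position of each value within the levels
as.integer(sizes)

# Indexing the levels by the codes maps the integers back to the names
levels(sizes)[as.integer(sizes)]
```

Here "Large" comes first in the levels, so every "Large" observation is stored as the integer 1.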
How R prints factors
- R prints factors by showing the values as the strings that they are.
- And, at the bottom it prints out the levels
- Or if there are lots of levels (i.e. categories) then it prints a few of them
It looks like this:
sml$SizeCategory
#>  [1] Small  Large  Medium Large  Medium Medium Small  Small  Large  Small
#> [11] Medium Small  Large  Large  Large
#> Levels: Large Medium Small
So, when you print something and it says Levels: on the last line, you know you are dealing with a factor.
A different example
Another example Data Frame
We can make some bogus data
set.seed(1)
bogus <- data.frame(
  students = rep(c("Devon", "Martha", "Hilary"), 3),
  tests = rep(c("Sep", "Oct", "Nov"), each = 3),
  scores = as.integer(runif(9, min = 55, max = 98)),
  stringsAsFactors = TRUE  # needed in R >= 4.0.0, where the default is FALSE
)
bogus # look at it
#> students tests scores
#> 1 Devon Sep 66
#> 2 Martha Sep 71
#> 3 Hilary Sep 79
#> 4 Devon Oct 94
#> 5 Martha Oct 63
#> 6 Hilary Oct 93
#> 7 Devon Nov 95
#> 8 Martha Nov 83
#> 9 Hilary Nov 82
str(bogus) # see what the types are. Hey there are factors!
#> 'data.frame': 9 obs. of 3 variables:
#> $ students: Factor w/ 3 levels "Devon","Hilary",..: 1 3 2 1 3 2 1 3 2
#> $ tests : Factor w/ 3 levels "Nov","Oct","Sep": 3 3 3 2 2 2 1 1 1
#> $ scores : int 66 71 79 94 63 93 95 83 82
Important Note
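- In versions of R before 4.0.0, the default behavior of data.frame() is to convert character vectors to factors when putting them into a data frame. (Since R 4.0.0 the default is stringsAsFactors = FALSE, so you have to ask for the conversion explicitly.)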
The column you get in bogus$students is the same as is returned by
factor(rep(c("Devon", "Martha", "Hilary"), 3))
#> [1] Devon  Martha Hilary Devon  Martha Hilary Devon  Martha Hilary
#> Levels: Devon Hilary Martha
So, the function factor() takes a vector and makes a factor vector out of it.
What a factor consists of in R
- Somewhat more tersely and technically than before:
- A factor is a vector with class attribute of factor and with another attribute called levels.
For a factor f:
levels(f)     # returns the levels of f
levels(f) <-  # can be used to set/modify the levels attribute of f
- levels(f) is a character vector, that will be sorted by default.
- The values of the factor variable itself are integers.
- The i-th element of the factor vector tells us which level (or category) the i-th observation falls into.
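To make the description above concrete, here is a hedged sketch that assembles a factor by hand from an integer vector, a levels attribute, and a class attribute, and compares it with what factor() returns. The names by_hand and usual are made up for illustration; in real code you would simply call factor().

```r
# Build a factor "by hand": integer codes + levels attribute + class attribute
by_hand <- structure(c(1L, 3L, 2L),
                     levels = c("Devon", "Hilary", "Martha"),
                     class = "factor")

# The usual way of making the same thing
usual <- factor(c("Devon", "Martha", "Hilary"))

is.factor(by_hand)     # R treats the hand-built object as a factor
typeof(usual)          # storage type of the values
attributes(usual)      # the levels and class attributes
as.character(by_hand)  # codes translated back through the levels
```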
What a Factor Looks like Under the Hood
One can use the unclass() function to see the actual parts of an R object without having them printed in a way that is specific to the object's class attribute.
bogus$students  # printed as a factor
#> [1] Devon  Martha Hilary Devon  Martha Hilary Devon  Martha Hilary
#> Levels: Devon Hilary Martha
unclass(bogus$students)  # printed generically
#> [1] 1 3 2 1 3 2 1 3 2
#> attr(,"levels")
#> [1] "Devon"  "Hilary" "Martha"
bogus$tests  # printed as a factor
#> [1] Sep Sep Sep Oct Oct Oct Nov Nov Nov
#> Levels: Nov Oct Sep
unclass(bogus$tests)  # printed generically
#> [1] 3 3 3 2 2 2 1 1 1
#> attr(,"levels")
#> [1] "Nov" "Oct" "Sep"
Issues and such with factors
You can make R not create factors of character data in data frames
- The data.frame() function, as well as the read.table() family of functions, accept a stringsAsFactors parameter. Setting it to FALSE leaves character columns as character vectors.
- This can be a reasonable thing to do, since you can always explicitly make certain columns factors if you want to, using the factor() function later.
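As a minimal sketch of that workflow (the data frame df and its columns are invented for illustration), one can keep everything as character data at first and then convert only the column that really is categorical:

```r
# Hypothetical data: keep character columns as plain character vectors
df <- data.frame(name = c("Ann", "Bo", "Ann"),
                 site = c("north", "south", "north"),
                 stringsAsFactors = FALSE)
str(df)  # both columns are chr

# Later, explicitly turn just the 'site' column into a factor
df$site <- factor(df$site)
str(df)  # 'name' is still chr, 'site' is now a Factor
```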
Why does R use factors?
- The idea of factors is central to the fitting of various statistical models.
- However R seems to go overboard by wanting to squash any character vector into a factor in a data frame.
- Some of this relates to the fact that prior to a fairly late version of R, coding character vectors as factors was more space efficient.
- There are numerous hassles and headaches involved in dealing with factors, but factors are here to stay in R, so we had better get comfortable with them.
- There are also many good things about factors (see later).
Factors, once made, restrict allowable levels
Example:
studentsf <- bogus$students # this is a factor variable
studentsf # print it and see its values and levels
#> [1] Devon Martha Hilary Devon Martha Hilary Devon Martha Hilary
#> Levels: Devon Hilary Martha
studentsf[c(1,4,7)] # return all the Devon values.
#> [1] Devon Devon Devon
#> Levels: Devon Hilary Martha
# note that the levels are still all three names
# Now, what if we want to change the name "Devon" to "The Dude"?
studentsf[c(1,4,7)] <- "The Dude" # R gets upset when you do this!
#> Warning in `[<-.factor`(`*tmp*`, c(1, 4, 7), value = "The Dude"): invalid
#> factor level, NA generated
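It may help to see what is left behind after that warning. The sketch below uses a small stand-alone factor (f2 is a hypothetical name, not the lecture's studentsf) and wraps the assignment in suppressWarnings() only so that the example runs quietly:

```r
# A small stand-alone factor for illustration
f2 <- factor(c("Devon", "Martha", "Hilary"))

# Assigning a value that is not among the levels produces NA (with a warning)
suppressWarnings(f2[1] <- "The Dude")

f2          # the first element is now <NA>
is.na(f2)   # TRUE for the element we tried to overwrite
levels(f2)  # the levels are unchanged: "The Dude" was never added
```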
How can you change values of factors?
- Two main ways:
Modify the levels. In this example we will change “Devon” to “The Dude”
# Look at bogus$students
bogus$students
#> [1] Devon  Martha Hilary Devon  Martha Hilary Devon  Martha Hilary
#> Levels: Devon Hilary Martha
# Confirm that Devon is the first element of the levels
levels(bogus$students)
#> [1] "Devon"  "Hilary" "Martha"
# Change that to "The Dude" using assignment-form indexing
levels(bogus$students)[1] <- "The Dude"
# Now look at the factor
bogus$students
#> [1] The Dude Martha   Hilary   The Dude Martha   Hilary   The Dude Martha
#> [9] Hilary
#> Levels: The Dude Hilary Martha
Coerce the factor to a character vector, modify, and re-factor() it
# let's change "Martha" to "Martha A"
# what happens when we coerce to character?
as.character(bogus$students)
#> [1] "The Dude" "Martha"   "Hilary"   "The Dude" "Martha"   "Hilary"
#> [7] "The Dude" "Martha"   "Hilary"
# OK, so make a variable of that and then modify it
tmp <- as.character(bogus$students)
tmp[tmp == "Martha"] <- "Martha A"  # change every occurrence of "Martha" to "Martha A"
# When we turn tmp back into a factor, what does it look like?
factor(tmp)
#> [1] The Dude Martha A Hilary   The Dude Martha A Hilary   The Dude Martha A
#> [9] Hilary
#> Levels: Hilary Martha A The Dude
# OK, cool, we can assign that to bogus$students
bogus$students <- factor(tmp)
# Look at the result:
bogus
#>   students tests scores
#> 1 The Dude   Sep     66
#> 2 Martha A   Sep     71
#> 3   Hilary   Sep     79
#> 4 The Dude   Oct     94
#> 5 Martha A   Oct     63
#> 6   Hilary   Oct     93
#> 7 The Dude   Nov     95
#> 8 Martha A   Nov     83
#> 9   Hilary   Nov     82
Catenating two factors
What if we have this scenario:
# imagine you have two factors
boys_f <- factor(c("Joe", "Ted", "Fred", "Joe"))
girls_f <- factor(c("Anne", "Louise", "Louise", "Lucy", "Louise"))
and, further, imagine you want to bung them together into a factor of kids_f.
In versions of R before 4.1.0 this fails spectacularly (from R 4.1.0 on, c() combines factors and merges their levels):
kids_f <- c(boys_f, girls_f)
kids_f
#> [1] 2 3 1 2 1 2 2 3 2
Yikes! It has just catenated the underlying integer vectors!
- To get what you want:
- coerce each to character
- catenate
- re-factor it
i.e.:
kids_f <- factor(c(as.character(boys_f), as.character(girls_f)))
kids_f
#> [1] Joe    Ted    Fred   Joe    Anne   Louise Louise Lucy   Louise
#> Levels: Anne Fred Joe Louise Lucy Ted
# check out the levels:
levels(kids_f)
#> [1] "Anne"   "Fred"   "Joe"    "Louise" "Lucy"   "Ted"
What about adding rows to data frames?
Fortunately, if you want to add rows to a data frame, you can do that with rbind() and it will update the factor columns appropriately:
extra <- rbind(bogus,
               data.frame(students = c("Hilary", "Eve"),
                          tests = c("Jan", "Sep"),
                          scores = c(88, 97)))
# what was the result?
extra
#>    students tests scores
#> 1  The Dude   Sep     66
#> 2  Martha A   Sep     71
#> 3    Hilary   Sep     79
#> 4  The Dude   Oct     94
#> 5  Martha A   Oct     63
#> 6    Hilary   Oct     93
#> 7  The Dude   Nov     95
#> 8  Martha A   Nov     83
#> 9    Hilary   Nov     82
#> 10   Hilary   Jan     88
#> 11      Eve   Sep     97
# what do the levels look like:
levels(extra$students)
#> [1] "Hilary"   "Martha A" "The Dude" "Eve"
levels(extra$tests)
#> [1] "Nov" "Oct" "Sep" "Jan"
Factor levels stick around
Even if you delete all occurrences of a level in a factor vector, the levels do not automatically change:
no.dude <- bogus[bogus$students != "The Dude", ]  # drop Devon (The Dude) and his dudeliness
no.dude  # print it out...no "The Dude"
#>   students tests scores
#> 2 Martha A   Sep     71
#> 3   Hilary   Sep     79
#> 5 Martha A   Oct     63
#> 6   Hilary   Oct     93
#> 8 Martha A   Nov     83
#> 9   Hilary   Nov     82
no.dude$students  # print that column of students
#> [1] Martha A Hilary   Martha A Hilary   Martha A Hilary
#> Levels: Hilary Martha A The Dude
# whoa-ho! The Dude is still a level...The Dude abides!
# check again
levels(no.dude$students)
#> [1] "Hilary"   "Martha A" "The Dude"
If you have subsetted a data frame and you want to get rid of the extra levels of all the factors, you can do so with droplevels():
no.dude2 <- droplevels(no.dude)
no.dude2  # print it
#>   students tests scores
#> 2 Martha A   Sep     71
#> 3   Hilary   Sep     79
#> 5 Martha A   Oct     63
#> 6   Hilary   Oct     93
#> 8 Martha A   Nov     83
#> 9   Hilary   Nov     82
# check the levels
levels(no.dude2$students)  # no The Dude!
#> [1] "Hilary"   "Martha A"
In many contexts you will want the factor levels to stick around. In others you don’t.
Numeric/Character/Factor Disasters
The most common disaster that can happen with factors occurs when you think you can get back to a numeric vector by calling as.numeric() directly on a factor:
# here are some integers
my.nums <- c(1,4,8,10,1,8,8,8,10)
# make them a factor
numf <- factor(my.nums)
# try to recover the original integers
as.numeric(numf) # disaster
#> [1] 1 2 3 4 1 3 3 3 4
# 2 "correct" ways of doing it
as.numeric(as.character(numf)) # coerce to character first, then to numeric
#> [1] 1 4 8 10 1 8 8 8 10
as.numeric(levels(numf)[numf]) # slurp out the levels by numf and coerce
#> [1] 1 4 8 10 1 8 8 8 10
Why factors are super useful!
- I am going to go through just one example that involves counting up occurrences of different categories.
- When counting categories you usually will want to:
- Record a zero for known categories that had no observations
- List the categories in a particular order
- Both of these desires can be accommodated by judicious use of factors!
- Because levels “stick around” categories will be counted (as 0) even if there are no observations of them
- The levels of a factor can be put in any order desired, and that order will be used in reporting from many different functions.
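Both points can be sketched with a few made-up observations (the vector sizes below is hypothetical, echoing the fish size categories from earlier). The levels are given explicitly, so "Medium" is a known category even though nothing was observed in it:

```r
# Hypothetical observations: no "Medium" fish were seen
sizes <- factor(c("Small", "Large", "Small"),
                levels = c("Small", "Medium", "Large"))

# table() reports every level, in level order, including the empty one
counts <- table(sizes)
counts
```

The count for "Medium" is reported as 0, and the categories appear in the order Small, Medium, Large rather than alphabetically.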
The table() function
table(x) gives the number of occurrences of each unique category in x.
set.seed(2)
x <- sample(letters, size = 100, replace = TRUE)
x  # print it
#>   [1] "e" "s" "o" "e" "y" "y" "d" "v" "m" "o" "o" "g" "t" "e" "k" "w" "z"
#>  [18] "f" "l" "b" "r" "k" "v" "d" "j" "m" "d" "j" "z" "d" "a" "e" "v" "w"
#>  [35] "n" "q" "v" "h" "r" "d" "z" "h" "c" "e" "y" "u" "z" "j" "n" "v" "a"
#>  [52] "a" "r" "y" "h" "v" "u" "z" "p" "s" "u" "x" "q" "g" "w" "l" "k" "l"
#>  [69] "f" "b" "h" "i" "b" "e" "e" "t" "h" "w" "k" "o" "j" "r" "a" "k" "f"
#>  [86] "w" "z" "i" "t" "i" "z" "k" "j" "o" "m" "f" "l" "c" "c" "l"
# count the number of each occurrence
table(x)
#> x
#> a b c d e f g h i j k l m n o p q r s t u v w x y z
#> 4 3 3 5 7 4 2 5 3 5 6 5 3 2 5 1 2 4 2 3 3 6 5 1 4 7
It also can count the number of occurrences of pairs of categories:
set.seed(20)
x <- sample(letters[1:3], size = 10, replace = TRUE)
y <- sample(LETTERS[1:3], size = 10, replace = TRUE)
cbind(x, y)  # think of lining up x and y together
#>       x   y
#>  [1,] "c" "C"
#>  [2,] "c" "C"
#>  [3,] "a" "A"
#>  [4,] "b" "C"
#>  [5,] "c" "A"
#>  [6,] "c" "B"
#>  [7,] "a" "A"
#>  [8,] "a" "A"
#>  [9,] "a" "A"
#> [10,] "b" "C"
# how often do you see the combination a,A or a,B or c,B etc.
table(list(x, y))
#>    .2
#> .1  A B C
#>   a 4 0 0
#>   b 0 0 2
#>   c 1 1 2
Some sample data from birds
- Example from Mapping migration in a songbird …
- 393 birds from various locations in the breeding range of Wilson’s warbler
- These were genotyped, and locations were lumped into regions
- Then we asked how well we could use the genetic data to assign individual birds from each location to the correct region
Here is what the output looks like (a data frame)
# stringsAsFactors = TRUE is needed in R >= 4.0.0, where the default is FALSE
wiwa <- read.csv("data/bird-self-assignments.csv", row.names = 1, stringsAsFactors = TRUE)
head(wiwa)
#>         PopulationOfOrigin NumberOfLoci AK.EastBC.AB Wa.To.NorCalCoast
#> wAKDE01              wAKDE           95      0.99999           0.00002
#> wAKDE02              wAKDE           94      0.99622           0.00327
#> wAKDE03              wAKDE           95      1.00000           0.00000
#> wAKDE04              wAKDE           95      0.99999           0.00000
#> wAKDE05              wAKDE           95      0.99984           0.00015
#> wAKDE06              wAKDE           95      0.99999           0.00000
#>         CentCalCoast CalSierra Basin.Rockies Eastern    MaxColumn
#> wAKDE01            0         0         0e+00       0 AK.EastBC.AB
#> wAKDE02            0         0         5e-04       0 AK.EastBC.AB
#> wAKDE03            0         0         0e+00       0 AK.EastBC.AB
#> wAKDE04            0         0         0e+00       0 AK.EastBC.AB
#> wAKDE05            0         0         0e+00       0 AK.EastBC.AB
#> wAKDE06            0         0         0e+00       0 AK.EastBC.AB
#>         MaxPosterior
#> wAKDE01      0.99999
#> wAKDE02      0.99622
#> wAKDE03      1.00000
#> wAKDE04      0.99999
#> wAKDE05      0.99984
#> wAKDE06      0.99999
# here are the different locations
levels(wiwa$PopulationOfOrigin)
#>  [1] "eNBFR" "eONHI" "eQCCM" "wABCA" "wAKDE" "wAKJU" "wAKUG" "wAKYA"
#>  [9] "wBCMH" "wCABS" "wCACL" "wCAEU" "wCAHM" "wCAHU" "wCASL" "wCATE"
#> [17] "wCOGM" "wCOPP" "wMTHM" "wOREL" "wORHA" "wORMB" "wWADA"
# here are the different regions
levels(wiwa$MaxColumn)
#> [1] "AK.EastBC.AB"      "Basin.Rockies"     "CalSierra"
#> [4] "CentCalCoast"      "Eastern"           "Wa.To.NorCalCoast"
Counting up self-assignments
We can count how many birds from each location were assigned to which regions using table():
table(list(wiwa$PopulationOfOrigin, wiwa$MaxColumn))
#>        .2
#> .1      AK.EastBC.AB Basin.Rockies CalSierra CentCalCoast Eastern
#>   eNBFR            0             0         0            0       4
#>   eONHI            0             0         0            0       4
#>   eQCCM            0             0         0            0      17
#>   wABCA           24             1         0            0       0
#>   wAKDE           29             0         0            0       0
#>   wAKJU           10             0         0            0       0
#>   wAKUG           26             0         0            0       0
#>   wAKYA           21             0         0            0       0
#>   wBCMH            9             4         0            0       0
#>   wCABS            0             0         1           11       0
#>   wCACL            0             0        12            2       0
#>   wCAEU            0             0         3            0       0
#>   wCAHM            0             0         1           14       0
#>   wCAHU            0             0        15            0       0
#>   wCASL            0             0         3           19       0
#>   wCATE            0             0        21            2       0
#>   wCOGM            0            11         0            0       0
#>   wCOPP            0            19         0            0       0
#>   wMTHM            1             2         0            0       0
#>   wOREL            6            19         0            0       0
#>   wORHA            0             0         0            3       0
#>   wORMB            0             0         1            1       0
#>   wWADA            2             0         0            1       0
#>        .2
#> .1      Wa.To.NorCalCoast
#>   eNBFR                 0
#>   eONHI                 0
#>   eQCCM                 0
#>   wABCA                 0
#>   wAKDE                 0
#>   wAKJU                 0
#>   wAKUG                 0
#>   wAKYA                 0
#>   wBCMH                 0
#>   wCABS                 3
#>   wCACL                 1
#>   wCAEU                15
#>   wCAHM                 2
#>   wCAHU                 1
#>   wCASL                 1
#>   wCATE                 2
#>   wCOGM                 0
#>   wCOPP                 0
#>   wMTHM                 0
#>   wOREL                 0
#>   wORHA                20
#>   wORMB                20
#>   wWADA                 9
- That is all right, but the locations and regions are not ordered very sensibly.
- They are ordered alphabetically.
- It would be better to order them geographically
- We can do this by resetting the levels in the order we want:
First, get vectors that have all the categories you want in the order you want them in
# a vector of regions in a geographically sensible order
regions_ordered <- c("AK.EastBC.AB", "Wa.To.NorCalCoast", "CentCalCoast",
                     "CalSierra", "Basin.Rockies", "Eastern")
# get a vector of locations in a good order
locations_ordered <- c("wAKDE", "wAKYA", "wAKUG", "wAKJU", "wABCA", "wBCMH",
                       "wWADA", "wORHA", "wORMB", "wCAEU", "wCAHM", "wCABS",
                       "wCASL", "wCATE", "wCACL", "wCAHU", "wMTHM", "wOREL",
                       "wCOPP", "wCOGM", "eQCCM", "eONHI", "eNBFR")
Then, this is the magical step: reset the levels to be the ordered vectors of categories you want. You do this by passing in the ordered vector to the levels argument of the factor() function:
# order the levels of the regions nicely
wiwa$MaxColumn <- factor(wiwa$MaxColumn, levels = regions_ordered)
# order the levels of the locations nicely
wiwa$PopulationOfOrigin <- factor(wiwa$PopulationOfOrigin, levels = locations_ordered)
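WARNING DO NOT DO THIS!
# wrong: this relabels the existing levels instead of reordering them
levels(wiwa$MaxColumn) <- regions_ordered
levels(wiwa$PopulationOfOrigin) <- locations_ordered
Assigning to levels() only renames the levels in place; it does not move any observation to a different integer code. To change the order of the levels you have to reconstitute the column as a factor with factor(). Otherwise you can get totally wrong values.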
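The difference between the two approaches is easy to check on a toy factor. This is only a sketch; f, good, and bad are hypothetical names introduced here for illustration:

```r
# Toy factor; default levels are "a" "b" "c"
f <- factor(c("b", "a", "c", "a"))

# Right: rebuild the factor with the levels in the desired order.
# The values are unchanged; only the order of the levels differs.
good <- factor(f, levels = c("c", "b", "a"))
as.character(good)  # "b" "a" "c" "a"

# Wrong: overwrite the levels attribute.
# Level 1 ("a") is renamed "c" and level 3 ("c") is renamed "a",
# so the data themselves are silently changed.
bad <- f
levels(bad) <- c("c", "b", "a")
as.character(bad)   # "b" "c" "a" "c"
```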
Then use table again, and note the ordering of the categories in the output:
table(list(wiwa$PopulationOfOrigin, wiwa$MaxColumn))
#>        .2
#> .1      AK.EastBC.AB Wa.To.NorCalCoast CentCalCoast CalSierra
#>   wAKDE           29                 0            0         0
#>   wAKYA           21                 0            0         0
#>   wAKUG           26                 0            0         0
#>   wAKJU           10                 0            0         0
#>   wABCA           24                 0            0         0
#>   wBCMH            9                 0            0         0
#>   wWADA            2                 9            1         0
#>   wORHA            0                20            3         0
#>   wORMB            0                20            1         1
#>   wCAEU            0                15            0         3
#>   wCAHM            0                 2           14         1
#>   wCABS            0                 3           11         1
#>   wCASL            0                 1           19         3
#>   wCATE            0                 2            2        21
#>   wCACL            0                 1            2        12
#>   wCAHU            0                 1            0        15
#>   wMTHM            1                 0            0         0
#>   wOREL            6                 0            0         0
#>   wCOPP            0                 0            0         0
#>   wCOGM            0                 0            0         0
#>   eQCCM            0                 0            0         0
#>   eONHI            0                 0            0         0
#>   eNBFR            0                 0            0         0
#>        .2
#> .1      Basin.Rockies Eastern
#>   wAKDE             0       0
#>   wAKYA             0       0
#>   wAKUG             0       0
#>   wAKJU             0       0
#>   wABCA             1       0
#>   wBCMH             4       0
#>   wWADA             0       0
#>   wORHA             0       0
#>   wORMB             0       0
#>   wCAEU             0       0
#>   wCAHM             0       0
#>   wCABS             0       0
#>   wCASL             0       0
#>   wCATE             0       0
#>   wCACL             0       0
#>   wCAHU             0       0
#>   wMTHM             2       0
#>   wOREL            19       0
#>   wCOPP            19       0
#>   wCOGM            11       0
#>   eQCCM             0      17
#>   eONHI             0       4
#>   eNBFR             0       4
Many, many functions use the order of the levels of a factor to determine what order to output things in (like drawing legends on plots, etc.). So knowing how to set the order of the levels with factor(my.factor, levels = my.ord) is very useful.