Factors

• Goals of this lecture:
1. Go into moderate detail on factors (a tricky little data structure that probably causes more problems than anything else in R):
   1. What they are / what they look like.
   2. Why we talk about them with data frames.
   3. How they behave.
   4. Ways that they are useful.
2. In the process we will look at the table function and have some examples from the world of genetic assignment of birds.

Factor basics

Let’s reiterate some points/examples from the previous session.

Factors are vectors that record discrete categories

• Anything measured on a discrete scale can be said to fall into one of a set of categories.
• The discrete scale could be a summary of a continuous scale.
• For example, the categories of Small, Medium, and Large are (likely) summaries of a continuous variable like weight or height.
• If you have measured fish and put them into Small, Medium, and Large categories, you might have them in a data frame like this:

set.seed(17)
sml <- data.frame(
  ID = paste("Fish", 1:15, sep = "_"),
  SizeCategory = sample(c("Small", "Medium", "Large"), size = 15, replace = TRUE),
  stringsAsFactors = TRUE  # needed in R >= 4.0.0, where the default is FALSE
)

# when you print it out it looks pretty normal
sml
#>         ID SizeCategory
#> 1   Fish_1        Small
#> 2   Fish_2        Large
#> 3   Fish_3       Medium
#> 4   Fish_4        Large
#> 5   Fish_5       Medium
#> 6   Fish_6       Medium
#> 7   Fish_7        Small
#> 8   Fish_8        Small
#> 9   Fish_9        Large
#> 10 Fish_10        Small
#> 11 Fish_11       Medium
#> 12 Fish_12        Small
#> 13 Fish_13        Large
#> 14 Fish_14        Large
#> 15 Fish_15        Large

Underlying structure of a factor

• The “SizeCategory” column looks like a vector of strings (a character vector), but it isn’t.
• A factor is a class that contains:
1. A levels attribute that maps N categories to the integers 1, …, N
• (This sounds more complex than it is. It is just a character vector that gives an ordered collection of category names)
2. An integer vector of values between 1 and N used to describe the occurrence of the categories.
• What? If that’s not clear, continuing with the sml example from above should help clarify things
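The two pieces described above can be inspected directly. This is a minimal sketch on a small made-up vector (the name sizes is an illustration, not part of the lecture's data):

```r
# A small illustrative factor (hypothetical data, not the sml data frame)
sizes <- factor(c("Small", "Large", "Medium", "Large"))

class(sizes)       # the class attribute: "factor"
typeof(sizes)      # the underlying storage type: "integer"
levels(sizes)      # the levels attribute, a character vector of category names
as.integer(sizes)  # the integer codes, each one an index into levels(sizes)
```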

sml data frame’s SizeCategory

• We can access the levels attribute of sml$SizeCategory like this:

levels(sml$SizeCategory)
#> [1] "Large"  "Medium" "Small"
• The order these are in the levels tells us that:
• 1 = “Large”
• 2 = “Medium”
• 3 = “Small”
• And the integer vector part of sml$SizeCategory can be visualized by attaching it on the right side of the sml data frame like this:

cbind(sml, underlying_integer_vector = unclass(sml$SizeCategory))
#>         ID SizeCategory underlying_integer_vector
#> 1   Fish_1        Small                         3
#> 2   Fish_2        Large                         1
#> 3   Fish_3       Medium                         2
#> 4   Fish_4        Large                         1
#> 5   Fish_5       Medium                         2
#> 6   Fish_6       Medium                         2
#> 7   Fish_7        Small                         3
#> 8   Fish_8        Small                         3
#> 9   Fish_9        Large                         1
#> 10 Fish_10        Small                         3
#> 11 Fish_11       Medium                         2
#> 12 Fish_12        Small                         3
#> 13 Fish_13        Large                         1
#> 14 Fish_14        Large                         1
#> 15 Fish_15        Large                         1
• (Note that, by default, if categories are named by characters, R sorts them alphabetically to give them an order in the levels of the factor.)
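If alphabetical order is not what you want, the levels argument of factor() lets you state the order yourself. A minimal sketch, assuming a made-up character vector (sizes_chr and sizes_f are illustrative names):

```r
# Hypothetical size categories, stored as plain strings
sizes_chr <- c("Small", "Large", "Medium", "Large")

# Default: levels are sorted alphabetically
levels(factor(sizes_chr))

# Explicit: levels follow the order given in the levels argument
sizes_f <- factor(sizes_chr, levels = c("Small", "Medium", "Large"))
levels(sizes_f)
as.integer(sizes_f)  # codes follow the new order: Small = 1, Medium = 2, Large = 3
```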

How R prints factors

• R prints factors by showing the values as the strings that they are.
• And, at the bottom it prints out the levels
• Or if there are lots of levels (i.e. categories) then it prints a few of them
• It looks like this:

sml$SizeCategory
#>  [1] Small  Large  Medium Large  Medium Medium Small  Small  Large  Small
#> [11] Medium Small  Large  Large  Large
#> Levels: Large Medium Small

• So, when you print something and it says Levels: on the last line, you know you are dealing with a factor.

A different example

Another example Data Frame

We can make some bogus data

set.seed(1)
bogus <- data.frame(
  students = rep(c("Devon", "Martha", "Hilary"), 3),
  tests = rep(c("Sep", "Oct", "Nov"), each = 3),
  scores = as.integer(runif(9, min = 55, max = 98)),
  stringsAsFactors = TRUE  # needed in R >= 4.0.0, where the default is FALSE
)

bogus  # look at it
#>   students tests scores
#> 1    Devon   Sep     66
#> 2   Martha   Sep     71
#> 3   Hilary   Sep     79
#> 4    Devon   Oct     94
#> 5   Martha   Oct     63
#> 6   Hilary   Oct     93
#> 7    Devon   Nov     95
#> 8   Martha   Nov     83
#> 9   Hilary   Nov     82

str(bogus)  # see what the types are. Hey there are factors!
#> 'data.frame':    9 obs. of  3 variables:
#>  $ students: Factor w/ 3 levels "Devon","Hilary",..: 1 3 2 1 3 2 1 3 2
#>  $ tests   : Factor w/ 3 levels "Nov","Oct","Sep": 3 3 3 2 2 2 1 1 1
#>  $ scores  : int  66 71 79 94 63 93 95 83 82

Important Note

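• Before R 4.0.0, the default behavior of R was to convert character vectors to factors when putting them into a data frame. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so you have to pass stringsAsFactors = TRUE to data.frame() to get the behavior shown here.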
• The column you get in bogus$students is the same as is returned by

factor(rep(c("Devon", "Martha", "Hilary"), 3))
#> [1] Devon  Martha Hilary Devon  Martha Hilary Devon  Martha Hilary
#> Levels: Devon Hilary Martha

• So, the function factor() takes a vector and makes a factor vector out of it.

What a factor consists of in R

• Somewhat more tersely and technically than before:
• A factor is a vector with class attribute of factor and with another attribute called levels
• For a factor f:

levels(f)     # returns the levels of f
levels(f) <-  # can be used to set/modify the levels attribute of f

• levels(f) is a character vector, that will be sorted by default.
• The values of the factor variable itself are integers.
• The i-th element of the factor vector tells us which level (or category) the i-th observation falls into.

What a Factor Looks like Under the Hood

• One can use the unclass function to see the actual parts of an R object without having them printed in a way that is specific to the object’s class attribute.

bogus$students  # printed as a factor
#> [1] Devon  Martha Hilary Devon  Martha Hilary Devon  Martha Hilary
#> Levels: Devon Hilary Martha

unclass(bogus$students)  # printed generically
#> [1] 1 3 2 1 3 2 1 3 2
#> attr(,"levels")
#> [1] "Devon"  "Hilary" "Martha"

bogus$tests   # printed as a factor
#> [1] Sep Sep Sep Oct Oct Oct Nov Nov Nov
#> Levels: Nov Oct Sep

unclass(bogus$tests)  # printed generically
#> [1] 3 3 3 2 2 2 1 1 1
#> attr(,"levels")
#> [1] "Nov" "Oct" "Sep"

Issues and such with factors

You can make R not create factors of character data in data frames

• The data.frame function, as well as the read.table family of functions, accepts a stringsAsFactors parameter. Setting it to FALSE (the default from R 4.0.0 onward) keeps character data as character vectors.
• This can be a reasonable thing to do, since you can always explicitly make certain columns factors if you want to, using the factor function later.

Why does R use factors?

• The idea of factors is central to the fitting of various statistical models.
• However, R (before version 4.0.0) seemed to go overboard by wanting to squash any character vector into a factor in a data frame.
• Some of this relates to the fact that prior to a fairly late version of R, coding character vectors as factors was more space efficient.
• There are numerous hassles and headaches involved in dealing with factors, but factors are here to stay in R, so we had better get comfortable with them.
• There are also many good things about factors (see later).

Factors, once made, restrict allowable levels

Example:

studentsf <- bogus$students  # this is a factor variable

studentsf # print it and see its values and levels
#> [1] Devon  Martha Hilary Devon  Martha Hilary Devon  Martha Hilary
#> Levels: Devon Hilary Martha

studentsf[c(1,4,7)] # return all the Devon values.
#> [1] Devon Devon Devon
#> Levels: Devon Hilary Martha
# note that the levels are still all three names

# Now, what if we want to change the name "Devon" to "The Dude"?
studentsf[c(1,4,7)] <- "The Dude"  # R gets upset when you do this!
#> Warning in `[<-.factor`(`*tmp*`, c(1, 4, 7), value = "The Dude"): invalid
#> factor level, NA generated
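
The warning arises because "The Dude" is not one of the existing levels, so R stores NA instead. One possible workaround, sketched here on a fresh copy of the names (the vector f is illustrative, not an object from the lecture), is to append the new level first and only then assign it:

```r
# Illustrative factor with the same names as bogus$students
f <- factor(rep(c("Devon", "Martha", "Hilary"), 3))

# Append the new level, then assign it to the chosen positions
levels(f) <- c(levels(f), "The Dude")
f[c(1, 4, 7)] <- "The Dude"

f          # positions 1, 4 and 7 now hold "The Dude", with no NAs
levels(f)  # "Devon" is still listed as a (now unused) level
```

Note that this leaves "Devon" behind as an unused level, which may or may not be what you want.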

How can you change values of factors?

• Two main ways:
1. Modify the levels. In this example we will change “Devon” to “The Dude”

# Look at bogus$students
bogus$students
#> [1] Devon  Martha Hilary Devon  Martha Hilary Devon  Martha Hilary
#> Levels: Devon Hilary Martha

# Confirm that Devon is the first element of the levels
levels(bogus$students)
#> [1] "Devon"  "Hilary" "Martha"

# Change that to "The Dude" using assignment-form indexing
levels(bogus$students)[1] <- "The Dude"

# Now look at the factor
bogus$students
#> [1] The Dude Martha   Hilary   The Dude Martha   Hilary   The Dude Martha
#> [9] Hilary
#> Levels: The Dude Hilary Martha

2. Coerce the factor to a character vector, modify, and re-factor() it

# let's change "Martha" to "Martha A"
# what happens when we coerce to character?
as.character(bogus$students)
#> [1] "The Dude" "Martha"   "Hilary"   "The Dude" "Martha"   "Hilary"
#> [7] "The Dude" "Martha"   "Hilary"

# OK, so make a variable of that and then modify it
tmp <- as.character(bogus$students)
tmp[tmp == "Martha"] <- "Martha A"  # change every occurrence of "Martha" to "Martha A"

# When we turn tmp back into a factor, what does it look like?
factor(tmp)
#> [1] The Dude Martha A Hilary   The Dude Martha A Hilary   The Dude Martha A
#> [9] Hilary
#> Levels: Hilary Martha A The Dude

# OK, cool, we can assign that to bogus$students
bogus$students <- factor(tmp)

# Look at the result:
bogus
#>   students tests scores
#> 1 The Dude   Sep     66
#> 2 Martha A   Sep     71
#> 3   Hilary   Sep     79
#> 4 The Dude   Oct     94
#> 5 Martha A   Oct     63
#> 6   Hilary   Oct     93
#> 7 The Dude   Nov     95
#> 8 Martha A   Nov     83
#> 9   Hilary   Nov     82

Catenating two factors

• What if we have this scenario:

# imagine you have two factors
boys_f <- factor(c("Joe", "Ted", "Fred", "Joe"))
girls_f <- factor(c("Anne", "Louise", "Louise", "Lucy", "Louise"))

and, further, imagine you want to bung them together into a factor of kids_f.

• This fails spectacularly (in R versions before 4.1.0; from 4.1.0 onward, c() on factors returns a factor with the combined levels):

kids_f <- c(boys_f, girls_f)
kids_f
#> [1] 2 3 1 2 1 2 2 3 2

Yikes! It has just catenated the underlying integer vectors!

• To get what you want:
1. coerce each to character
2. catenate
3. re-factor it

i.e.:

kids_f <- factor(c(as.character(boys_f), as.character(girls_f)))
kids_f
#> [1] Joe    Ted    Fred   Joe    Anne   Louise Louise Lucy   Louise
#> Levels: Anne Fred Joe Louise Lucy Ted

# check out the levels:
levels(kids_f)
#> [1] "Anne"   "Fred"   "Joe"    "Louise" "Lucy"   "Ted"

What about adding rows to data frames?

• Fortunately, if you want to add rows to a data frame, you can do that with rbind() and it will update the factor columns appropriately:

extra <- rbind(bogus,
               data.frame(students = c("Hilary", "Eve"),
                          tests = c("Jan", "Sep"),
                          scores = c(88, 97)))

# what was the result?
extra
#>    students tests scores
#> 1  The Dude   Sep     66
#> 2  Martha A   Sep     71
#> 3    Hilary   Sep     79
#> 4  The Dude   Oct     94
#> 5  Martha A   Oct     63
#> 6    Hilary   Oct     93
#> 7  The Dude   Nov     95
#> 8  Martha A   Nov     83
#> 9    Hilary   Nov     82
#> 10   Hilary   Jan     88
#> 11      Eve   Sep     97

# what do the levels look like:
levels(extra$students)
#> [1] "Hilary"   "Martha A" "The Dude" "Eve"

levels(extra$tests)
#> [1] "Nov" "Oct" "Sep" "Jan"

Factor levels stick around

• Even if you delete all occurrences of a level in a factor vector, the levels do not automatically change:

no.dude <- bogus[ bogus$students != "The Dude", ]  # drop Devon (The Dude) and his dudeliness
no.dude  # print it out...no "The Dude"
#>   students tests scores
#> 2 Martha A   Sep     71
#> 3   Hilary   Sep     79
#> 5 Martha A   Oct     63
#> 6   Hilary   Oct     93
#> 8 Martha A   Nov     83
#> 9   Hilary   Nov     82

no.dude$students  # print that column of students
#> [1] Martha A Hilary   Martha A Hilary   Martha A Hilary
#> Levels: Hilary Martha A The Dude
# whoa-ho!  The Dude is still a level...The Dude abides!

# check again
levels(no.dude$students)
#> [1] "Hilary"   "Martha A" "The Dude"
• If you have subsetted a data frame and you want to get rid of the unused levels of all its factors, you can do it with droplevels():

no.dude2 <- droplevels(no.dude)

no.dude2  # print it
#>   students tests scores
#> 2 Martha A   Sep     71
#> 3   Hilary   Sep     79
#> 5 Martha A   Oct     63
#> 6   Hilary   Oct     93
#> 8 Martha A   Nov     83
#> 9   Hilary   Nov     82

# check the levels
levels(no.dude2$students)  # no The Dude!
#> [1] "Hilary"   "Martha A"

• In many contexts you will want the factor levels to stick around. In others you don’t.

Numeric/Character/Factor Disasters

The most common disaster that can happen with factors occurs when you think you can get back to a numeric vector by coercing a factor with as.numeric():

# here are some integers
my.nums <- c(1,4,8,10,1,8,8,8,10)

# make them a factor
numf <- factor(my.nums)

# try to recover the original integers
as.numeric(numf)  # disaster
#> [1] 1 2 3 4 1 3 3 3 4

# 2 "correct" ways of doing it
as.numeric(as.character(numf))  # coerce to character first, then to numeric
#> [1]  1  4  8 10  1  8  8  8 10

as.numeric(levels(numf)[numf])  # slurp out the levels by numf and coerce
#> [1]  1  4  8 10  1  8  8  8 10

Why factors are super useful!

• I am going to go through just one example that involves counting up occurrences of different categories.
• When counting categories you usually will want to:
1. Record a zero for known categories that had no observations
2. List the categories in a particular order
• Both of these desires can be accommodated by judicious use of factors!
1. Because levels “stick around”, categories will be counted (as 0) even if there are no observations of them
2. The levels of a factor can be put in any order desired, and that order will be used in reporting from many different functions.

The table() function

• table(x) gives the number of occurrences of each unique category in x.

set.seed(2)
x <- sample(letters, size = 100, replace = TRUE)
x  # print it
#>   [1] "e" "s" "o" "e" "y" "y" "d" "v" "m" "o" "o" "g" "t" "e" "k" "w" "z"
#>  [18] "f" "l" "b" "r" "k" "v" "d" "j" "m" "d" "j" "z" "d" "a" "e" "v" "w"
#>  [35] "n" "q" "v" "h" "r" "d" "z" "h" "c" "e" "y" "u" "z" "j" "n" "v" "a"
#>  [52] "a" "r" "y" "h" "v" "u" "z" "p" "s" "u" "x" "q" "g" "w" "l" "k" "l"
#>  [69] "f" "b" "h" "i" "b" "e" "e" "t" "h" "w" "k" "o" "j" "r" "a" "k" "f"
#>  [86] "w" "z" "i" "t" "i" "z" "k" "j" "o" "m" "f" "l" "c" "c" "l"

# count the number of each occurrence
table(x)
#> x
#> a b c d e f g h i j k l m n o p q r s t u v w x y z
#> 4 3 3 5 7 4 2 5 3 5 6 5 3 2 5 1 2 4 2 3 3 6 5 1 4 7

• It also can count the number of occurrences of pairs of categories:

set.seed(20)
x <- sample(letters[1:3], size = 10, replace = TRUE)
y <- sample(LETTERS[1:3], size = 10, replace = TRUE)
cbind(x,y)  # think of lining up x and y together
#>       x   y
#>  [1,] "c" "C"
#>  [2,] "c" "C"
#>  [3,] "a" "A"
#>  [4,] "b" "C"
#>  [5,] "c" "A"
#>  [6,] "c" "B"
#>  [7,] "a" "A"
#>  [8,] "a" "A"
#>  [9,] "a" "A"
#> [10,] "b" "C"

# how often do you see the combination a,A or a,B or c,B etc.
table(list(x, y))
#>    .2
#> .1  A B C
#>   a 4 0 0
#>   b 0 0 2
#>   c 1 1 2

Some sample data from birds

• These were genotyped, and locations were lumped into regions
• Then we asked how well we could use the genetic data to assign individual birds from each location to the correct region
• Here is what the output looks like (a data frame)

# stringsAsFactors = TRUE is needed in R >= 4.0.0 to get factor columns
wiwa <- read.csv("data/bird-self-assignments.csv", row.names = 1, stringsAsFactors = TRUE)
head(wiwa)
#>         PopulationOfOrigin NumberOfLoci AK.EastBC.AB Wa.To.NorCalCoast
#> wAKDE01              wAKDE           95      0.99999           0.00002
#> wAKDE02              wAKDE           94      0.99622           0.00327
#> wAKDE03              wAKDE           95      1.00000           0.00000
#> wAKDE04              wAKDE           95      0.99999           0.00000
#> wAKDE05              wAKDE           95      0.99984           0.00015
#> wAKDE06              wAKDE           95      0.99999           0.00000
#>         CentCalCoast CalSierra Basin.Rockies Eastern    MaxColumn
#> wAKDE01            0         0         0e+00       0 AK.EastBC.AB
#> wAKDE02            0         0         5e-04       0 AK.EastBC.AB
#> wAKDE03            0         0         0e+00       0 AK.EastBC.AB
#> wAKDE04            0         0         0e+00       0 AK.EastBC.AB
#> wAKDE05            0         0         0e+00       0 AK.EastBC.AB
#> wAKDE06            0         0         0e+00       0 AK.EastBC.AB
#>         MaxPosterior
#> wAKDE01      0.99999
#> wAKDE02      0.99622
#> wAKDE03      1.00000
#> wAKDE04      0.99999
#> wAKDE05      0.99984
#> wAKDE06      0.99999

# here are the different locations
levels(wiwa$PopulationOfOrigin)
#>  [1] "eNBFR" "eONHI" "eQCCM" "wABCA" "wAKDE" "wAKJU" "wAKUG" "wAKYA"
#>  [9] "wBCMH" "wCABS" "wCACL" "wCAEU" "wCAHM" "wCAHU" "wCASL" "wCATE"
#> [17] "wCOGM" "wCOPP" "wMTHM" "wOREL" "wORHA" "wORMB" "wWADA"

# here are the different regions
levels(wiwa$MaxColumn)
#> [1] "AK.EastBC.AB"      "Basin.Rockies"     "CalSierra"
#> [4] "CentCalCoast"      "Eastern"           "Wa.To.NorCalCoast"

Counting up self-assignments

• We can count how many birds from each location were assigned to which regions using table()

table(list(wiwa$PopulationOfOrigin, wiwa$MaxColumn))
#>        .2
#> .1      AK.EastBC.AB Basin.Rockies CalSierra CentCalCoast Eastern
#>   eNBFR            0             0         0            0       4
#>   eONHI            0             0         0            0       4
#>   eQCCM            0             0         0            0      17
#>   wABCA           24             1         0            0       0
#>   wAKDE           29             0         0            0       0
#>   wAKJU           10             0         0            0       0
#>   wAKUG           26             0         0            0       0
#>   wAKYA           21             0         0            0       0
#>   wBCMH            9             4         0            0       0
#>   wCABS            0             0         1           11       0
#>   wCACL            0             0        12            2       0
#>   wCAEU            0             0         3            0       0
#>   wCAHM            0             0         1           14       0
#>   wCAHU            0             0        15            0       0
#>   wCASL            0             0         3           19       0
#>   wCATE            0             0        21            2       0
#>   wCOGM            0            11         0            0       0
#>   wCOPP            0            19         0            0       0
#>   wMTHM            1             2         0            0       0
#>   wOREL            6            19         0            0       0
#>   wORHA            0             0         0            3       0
#>   wORMB            0             0         1            1       0
#>   wWADA            2             0         0            1       0
#>        .2
#> .1      Wa.To.NorCalCoast
#>   eNBFR                 0
#>   eONHI                 0
#>   eQCCM                 0
#>   wABCA                 0
#>   wAKDE                 0
#>   wAKJU                 0
#>   wAKUG                 0
#>   wAKYA                 0
#>   wBCMH                 0
#>   wCABS                 3
#>   wCACL                 1
#>   wCAEU                15
#>   wCAHM                 2
#>   wCAHU                 1
#>   wCASL                 1
#>   wCATE                 2
#>   wCOGM                 0
#>   wCOPP                 0
#>   wMTHM                 0
#>   wOREL                 0
#>   wORHA                20
#>   wORMB                20
#>   wWADA                 9

• That is all right, but the locations and regions are not ordered very sensibly.
• They are ordered alphabetically.
• It would be better to order them geographically.
• We can do this by resetting the levels in the order we want:

1. First, get vectors that have all the categories you want in the order you want them in

# a vector of regions in a geographically sensible order
regions_ordered <- c("AK.EastBC.AB", "Wa.To.NorCalCoast", "CentCalCoast",
                     "CalSierra", "Basin.Rockies", "Eastern")

# get a vector of locations in a good order
locations_ordered <- c("wAKDE", "wAKYA", "wAKUG", "wAKJU", "wABCA", "wBCMH",
                       "wWADA", "wORHA", "wORMB", "wCAEU", "wCAHM", "wCABS",
                       "wCASL", "wCATE", "wCACL", "wCAHU", "wMTHM", "wOREL",
                       "wCOPP", "wCOGM", "eQCCM", "eONHI", "eNBFR")

2. Then, this is the magical step: reset the levels to be the ordered vectors of categories you want. You do this by passing in the ordered vector to the levels argument of the factor() function:

# order the levels of the regions nicely
wiwa$MaxColumn <- factor(wiwa$MaxColumn, levels = regions_ordered)

# order the levels of the locations nicely
wiwa$PopulationOfOrigin <- factor(wiwa$PopulationOfOrigin, levels = locations_ordered)

• WARNING DO NOT DO THIS!

levels(wiwa$MaxColumn) <- regions_ordered
levels(wiwa$PopulationOfOrigin) <- locations_ordered

Assigning to levels() only relabels the existing levels in place; it does not reorder them. You have to reconstitute the column as a factor with the levels in the new order. Otherwise you can get totally wrong values.

3. Then use table again, and note the ordering of the categories in the output:

table(list(wiwa$PopulationOfOrigin, wiwa$MaxColumn))
#>        .2
#> .1      AK.EastBC.AB Wa.To.NorCalCoast CentCalCoast CalSierra
#>   wAKDE           29                 0            0         0
#>   wAKYA           21                 0            0         0
#>   wAKUG           26                 0            0         0
#>   wAKJU           10                 0            0         0
#>   wABCA           24                 0            0         0
#>   wBCMH            9                 0            0         0
#>   wWADA            2                 9            1         0
#>   wORHA            0                20            3         0
#>   wORMB            0                20            1         1
#>   wCAEU            0                15            0         3
#>   wCAHM            0                 2           14         1
#>   wCABS            0                 3           11         1
#>   wCASL            0                 1           19         3
#>   wCATE            0                 2            2        21
#>   wCACL            0                 1            2        12
#>   wCAHU            0                 1            0        15
#>   wMTHM            1                 0            0         0
#>   wOREL            6                 0            0         0
#>   wCOPP            0                 0            0         0
#>   wCOGM            0                 0            0         0
#>   eQCCM            0                 0            0         0
#>   eONHI            0                 0            0         0
#>   eNBFR            0                 0            0         0
#>        .2
#> .1      Basin.Rockies Eastern
#>   wAKDE             0       0
#>   wAKYA             0       0
#>   wAKUG             0       0
#>   wAKJU             0       0
#>   wABCA             1       0
#>   wBCMH             4       0
#>   wWADA             0       0
#>   wORHA             0       0
#>   wORMB             0       0
#>   wCAEU             0       0
#>   wCAHM             0       0
#>   wCABS             0       0
#>   wCASL             0       0
#>   wCATE             0       0
#>   wCACL             0       0
#>   wCAHU             0       0
#>   wMTHM             2       0
#>   wOREL            19       0
#>   wCOPP            19       0
#>   wCOGM            11       0
#>   eQCCM             0      17
#>   eONHI             0       4
#>   eNBFR             0       4
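
The warning against assigning to levels() directly can be illustrated on a small made-up factor. This is only a sketch; the names f, renamed, and reordered are hypothetical and not part of the bird data:

```r
f <- factor(c("b", "a", "c", "a"))   # levels are "a", "b", "c"

# Assigning to levels() relabels the codes: every "a" becomes "c", and so on
renamed <- f
levels(renamed) <- c("c", "a", "b")
as.character(renamed)    # the values themselves have changed

# factor(..., levels = ) keeps the values and only changes the level order
reordered <- factor(f, levels = c("c", "a", "b"))
as.character(reordered)  # the values are unchanged
levels(reordered)        # but the levels are in the requested order
```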
• Many, many functions use the order of the levels of a factor to determine what order to output things in (like drawing legends on plots, etc.). So knowing how to set the order of the levels with factor(my.factor, levels = my.ord) is very useful.
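
Both benefits claimed for factors when counting (zero counts for unobserved categories, and a chosen category order) can be seen in a minimal sketch on made-up observations (obs and obs_f are illustrative names):

```r
# Hypothetical observations: no "Medium" fish were seen
obs <- c("Small", "Large", "Small", "Small", "Large")

# Without a factor, table() reports only the categories that occur, alphabetically
table(obs)

# With explicit levels, "Medium" is counted as 0 and the order is as given
obs_f <- factor(obs, levels = c("Small", "Medium", "Large"))
counts <- table(obs_f)
counts
```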
