Factors

Goals of this lecture:
1. Go into moderate detail on factors (A tricky little data structure that probably causes more problems than anything else in R.)
  1. What they are / what they look like.
  2. Why we talk about them with data frames
  3. How they behave.
  4. Ways that they are useful.
2. In the process we will look at the table function and have some examples from the world of genetic assignment of birds.

Factor basics

Let’s reiterate some points/examples from the previous session.

Factors are vectors that record discrete categories

Anything measured on a discrete scale can be said to fall into one of a set of categories.
The discrete scale could be a summary of a continuous scale
- For example, the categories of Small, Medium, and Large are (likely) summaries of a continuous variable like weight or height.
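
As a rough sketch of how a continuous measurement might be summarized into such categories, base R's cut() bins a numeric vector and returns a factor. The weights and the break points (50 and 100) below are made up for illustration; they are not from the fish example that follows.

```r
# Hypothetical weights in grams (illustrative values only)
weight_g <- c(12, 48, 95, 150, 33, 210)

# Bin the continuous scale into three categories;
# the break points are arbitrary assumptions
size_cat <- cut(weight_g,
                breaks = c(0, 50, 100, Inf),
                labels = c("Small", "Medium", "Large"))

size_cat          # a factor whose levels are Small, Medium, Large
table(size_cat)   # number of fish in each category
```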

If you have measured fish and put them into Small, Medium, and Large categories, you might have them in a data frame like this:

set.seed(17)
sml <- data.frame(ID = paste("Fish", 1:15, sep="_"),
                  SizeCategory = sample(c("Small", "Medium", "Large"), size = 15, replace = TRUE),
                  stringsAsFactors = TRUE  # needed in R >= 4.0.0, where the default is FALSE
                  )

# when you print it out it looks pretty normal
sml                 
#>         ID SizeCategory
#> 1   Fish_1        Small
#> 2   Fish_2        Large
#> 3   Fish_3       Medium
#> 4   Fish_4        Large
#> 5   Fish_5       Medium
#> 6   Fish_6       Medium
#> 7   Fish_7        Small
#> 8   Fish_8        Small
#> 9   Fish_9        Large
#> 10 Fish_10        Small
#> 11 Fish_11       Medium
#> 12 Fish_12        Small
#> 13 Fish_13        Large
#> 14 Fish_14        Large
#> 15 Fish_15        Large

Underlying structure of a factor

The “SizeCategory” column looks like a vector of strings (a character vector), but it isn’t.
A factor is a class that contains:
1. A levels attribute that maps N categories to the integers 1, …, N
  - (This sounds more complex than it is. It is just a character vector that gives an ordered collection of category names)
2. An integer vector of values between 1 and N used to describe the occurrence of the categories.
What? If that’s not clear, continuing with the sml example from above should help clarify things.
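
To make the two parts concrete, here is a small sketch that assembles a factor by hand from an integer vector and a levels attribute. This is for illustration only (the names codes, lev and by_hand are made up); in practice you would just call factor().

```r
# The two ingredients of a factor
codes <- c(3L, 1L, 2L, 1L)               # integers between 1 and N
lev   <- c("Large", "Medium", "Small")   # the N = 3 category names, in order

# Glue them together: an integer vector with a levels attribute and class "factor"
by_hand <- structure(codes, levels = lev, class = "factor")
by_hand

# Compare with what factor() builds from the corresponding strings
identical(by_hand, factor(c("Small", "Large", "Medium", "Large")))
```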

sml data frame’s SizeCategory

We can access the levels attribute of sml$SizeCategory like this:
```
levels(sml$SizeCategory)
#> [1] "Large"  "Medium" "Small"
```
The order of the levels tells us that:
- 1 = “Large”
- 2 = “Medium”
- 3 = “Small”

And the integer vector part of sml$SizeCategory can be visualized by attaching it on the right side of the sml data frame like this:

cbind(sml, underlying_integer_vector = unclass(sml$SizeCategory))
#>         ID SizeCategory underlying_integer_vector
#> 1   Fish_1        Small                         3
#> 2   Fish_2        Large                         1
#> 3   Fish_3       Medium                         2
#> 4   Fish_4        Large                         1
#> 5   Fish_5       Medium                         2
#> 6   Fish_6       Medium                         2
#> 7   Fish_7        Small                         3
#> 8   Fish_8        Small                         3
#> 9   Fish_9        Large                         1
#> 10 Fish_10        Small                         3
#> 11 Fish_11       Medium                         2
#> 12 Fish_12        Small                         3
#> 13 Fish_13        Large                         1
#> 14 Fish_14        Large                         1
#> 15 Fish_15        Large                         1

(Note that, by default, if categories are named by characters, R sorts them alphabetically to give them an order in the levels of the factor.)
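
A quick sketch of that default, next to the alternative of giving factor() an explicit levels argument. The vector sizes is a made-up example.

```r
sizes <- c("Small", "Large", "Medium", "Small")

# Default: levels are sorted alphabetically
f_default <- factor(sizes)
levels(f_default)

# Explicit: levels appear in the order you supply
f_ordered <- factor(sizes, levels = c("Small", "Medium", "Large"))
levels(f_ordered)

# The underlying integers change accordingly
as.integer(f_default)
as.integer(f_ordered)
```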

How R prints factors

R prints factors by showing the values as the strings that they are.
And, at the bottom it prints out the levels
- Or if there are lots of levels (i.e. categories) then it prints a few of them

It looks like this:

sml$SizeCategory
#>  [1] Small  Large  Medium Large  Medium Medium Small  Small  Large  Small 
#> [11] Medium Small  Large  Large  Large 
#> Levels: Large Medium Small

So, when you print something and it says Levels: on the last line, you know you are dealing with a factor.
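
If you would rather check in code than by eye, is.factor() and class() give the same answer. A minimal sketch with a made-up vector:

```r
f <- factor(c("Small", "Large", "Small"))
chr <- c("Small", "Large", "Small")

is.factor(f)     # TRUE for a factor
class(f)         # "factor"
is.factor(chr)   # FALSE for a plain character vector
```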

A different example

Another example Data Frame

We can make some bogus data

set.seed(1)
bogus <- data.frame(
  students = rep(c("Devon", "Martha", "Hilary"), 3),
  tests = rep(c("Sep","Oct", "Nov"), each = 3),
  scores = as.integer(runif(9, min = 55, max = 98)),
  stringsAsFactors = TRUE  # needed in R >= 4.0.0, where the default is FALSE
  )

bogus # look at it
#>   students tests scores
#> 1    Devon   Sep     66
#> 2   Martha   Sep     71
#> 3   Hilary   Sep     79
#> 4    Devon   Oct     94
#> 5   Martha   Oct     63
#> 6   Hilary   Oct     93
#> 7    Devon   Nov     95
#> 8   Martha   Nov     83
#> 9   Hilary   Nov     82

str(bogus) # see what the types are. Hey there are factors!
#> 'data.frame':    9 obs. of  3 variables:
#>  $ students: Factor w/ 3 levels "Devon","Hilary",..: 1 3 2 1 3 2 1 3 2
#>  $ tests   : Factor w/ 3 levels "Nov","Oct","Sep": 3 3 3 2 2 2 1 1 1
#>  $ scores  : int  66 71 79 94 63 93 95 83 82

Important Note

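Before R 4.0.0, the default behavior of R was to convert character vectors to factors when putting them into a data frame. Since R 4.0.0 the default is stringsAsFactors = FALSE, so you get this conversion only if you ask for it with stringsAsFactors = TRUE.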

The column you get in bogus$students is the same as is returned by

factor(rep(c("Devon", "Martha", "Hilary"), 3))
#> [1] Devon  Martha Hilary Devon  Martha Hilary Devon  Martha Hilary
#> Levels: Devon Hilary Martha

So, the function factor() takes a vector and makes a factor vector out of it.

What a factor consists of in R

Somewhat more tersely and technically than before:
- A factor is a vector with class attribute of factor and with another attribute called levels
- For a factor f:
```
levels(f)   # returns the levels of f
levels(f) <- new_levels  # sets/modifies the levels attribute of f (new_levels: a character vector you supply)
```
levels(f) is a character vector that will be sorted by default.
The values of the factor variable itself are integers.
- The i-th element of the factor vector tells us which level (or category) the i-th observation falls into.
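
Put differently, the integers are just indexes into the levels. A small sketch (the factor f here is a made-up example):

```r
f <- factor(c("Devon", "Martha", "Hilary", "Devon"))

codes <- as.integer(f)   # which level each observation falls into
codes

# Looking the codes up in the levels recovers the category names
levels(f)[codes]
as.character(f)          # same thing
```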

What a Factor Looks like Under the Hood

One can use the unclass function to see the actual parts of an R object without having them printed in a way that is specific to the object’s class attribute.

bogus$students  # printed as a factor
#> [1] Devon  Martha Hilary Devon  Martha Hilary Devon  Martha Hilary
#> Levels: Devon Hilary Martha

unclass(bogus$students)  # printed generically
#> [1] 1 3 2 1 3 2 1 3 2
#> attr(,"levels")
#> [1] "Devon"  "Hilary" "Martha"


bogus$tests   # printed as a factor
#> [1] Sep Sep Sep Oct Oct Oct Nov Nov Nov
#> Levels: Nov Oct Sep

unclass(bogus$tests)  # printed generically
#> [1] 3 3 3 2 2 2 1 1 1
#> attr(,"levels")
#> [1] "Nov" "Oct" "Sep"

Issues and such with factors

You can make R not create factors of character data in data frames

The data.frame function, as well as the read.table family of functions, accepts a stringsAsFactors parameter; set stringsAsFactors = FALSE to keep character data as character vectors.
This can be a reasonable thing to do, since you can always explicitly make certain columns factors later, using the factor function.
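
A minimal sketch of that workflow, using a made-up data frame df: keep strings as characters when the data frame is built, then convert just the column you want.

```r
df <- data.frame(students = c("Devon", "Martha", "Hilary"),
                 scores = c(66, 71, 79),
                 stringsAsFactors = FALSE)

class_before <- class(df$students)   # character: nothing was converted
class_before

# Later, make the column a factor explicitly
df$students <- factor(df$students)
class(df$students)
levels(df$students)
```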

Why does R use factors?

The idea of factors is central to the fitting of various statistical models.
However R seems to go overboard by wanting to squash any character vector into a factor in a data frame.
- Some of this relates to the fact that prior to a fairly late version of R, coding character vectors as factors was more space efficient.
There are numerous hassles and headaches involved in dealing with factors, but factors are here to stay in R, so we had better get comfortable with them.
There are also many good things about factors (see later).

Factors, once made, restrict allowable levels

Example:

studentsf <- bogus$students # this is a factor variable

studentsf # print it and see its values and levels
#> [1] Devon  Martha Hilary Devon  Martha Hilary Devon  Martha Hilary
#> Levels: Devon Hilary Martha

studentsf[c(1,4,7)] # return all the Devon values.
#> [1] Devon Devon Devon
#> Levels: Devon Hilary Martha
                    # note that the levels are still all three names

# Now, what if we want to change the name "Devon" to "The Dude"?
studentsf[c(1,4,7)] <- "The Dude"  # R gets upset when you do this!
#> Warning in `[<-.factor`(`*tmp*`, c(1, 4, 7), value = "The Dude"): invalid
#> factor level, NA generated
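
The restriction only bites for values that are not already levels. Here is a small sketch with made-up factors f and g; suppressWarnings() is used only to keep the output quiet.

```r
f <- factor(c("Devon", "Martha", "Hilary"))

# Assigning an existing level is fine
f[1] <- "Martha"
f

# Assigning a value that is not a level gives NA (with a warning)
g <- f
suppressWarnings(g[2] <- "The Dude")
g
is.na(g[2])
```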

How can you change values of factors?

Two main ways:

Modify the levels. In this example we will change “Devon” to “The Dude”

# Look at bogus$students
bogus$students
#> [1] Devon  Martha Hilary Devon  Martha Hilary Devon  Martha Hilary
#> Levels: Devon Hilary Martha

# Confirm that Devon is the first element of the levels
levels(bogus$students)
#> [1] "Devon"  "Hilary" "Martha"

# Change that to "The Dude" using assignment-form indexing
levels(bogus$students)[1] <- "The Dude"

# Now look at the factor
bogus$students
#> [1] The Dude Martha   Hilary   The Dude Martha   Hilary   The Dude Martha  
#> [9] Hilary  
#> Levels: The Dude Hilary Martha
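
Indexing the levels by position works, but you have to know where the level sits. As a sketch of a slightly safer variant, you can pick the level by its name instead (f is a made-up factor):

```r
f <- factor(c("Devon", "Martha", "Hilary", "Devon"))

# Rename the level by matching its name rather than its position
levels(f)[levels(f) == "Devon"] <- "The Dude"

f
levels(f)
```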

Coerce the factor to a character vector, modify, and re-factor() it

# let's change "Martha" to "Martha A"
# what happens when we coerce to character?
as.character(bogus$students)
#> [1] "The Dude" "Martha"   "Hilary"   "The Dude" "Martha"   "Hilary"  
#> [7] "The Dude" "Martha"   "Hilary"

# OK, so make a variable of that and then modify it
tmp <- as.character(bogus$students)
tmp[tmp == "Martha"] <- "Martha A"  # change every occurrence of "Martha" to "Martha A"

# When we turn tmp back into a factor, what does it look like?
factor(tmp)
#> [1] The Dude Martha A Hilary   The Dude Martha A Hilary   The Dude Martha A
#> [9] Hilary  
#> Levels: Hilary Martha A The Dude

# OK, cool, we can assign that to bogus$students
bogus$students <- factor(tmp)

# Look at the result:
bogus
#>   students tests scores
#> 1 The Dude   Sep     66
#> 2 Martha A   Sep     71
#> 3   Hilary   Sep     79
#> 4 The Dude   Oct     94
#> 5 Martha A   Oct     63
#> 6   Hilary   Oct     93
#> 7 The Dude   Nov     95
#> 8 Martha A   Nov     83
#> 9   Hilary   Nov     82

Catenating two factors

What if we have this scenario:

# imagine you have two factors
boys_f <- factor(c("Joe", "Ted", "Fred", "Joe"))
girls_f <- factor(c("Anne", "Louise", "Louise", "Lucy", "Louise"))

and, further, imagine you want to bung them together into a factor of kids_f.

This fails spectacularly:
```
kids_f <- c(boys_f, girls_f)
kids_f
#> [1] 2 3 1 2 1 2 2 3 2
```
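Yikes! It has just catenated the underlying integer vectors! (This is the behavior of R before version 4.1.0; from 4.1.0 on, c() applied to factors returns a factor with the combined levels.)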

To get what you want:

1. Coerce each to character
2. Catenate
3. Re-factor it, i.e.:

kids_f <- factor(c(as.character(boys_f), as.character(girls_f)))
kids_f
#> [1] Joe    Ted    Fred   Joe    Anne   Louise Louise Lucy   Louise
#> Levels: Anne Fred Joe Louise Lucy Ted

# check out the levels:
levels(kids_f)
#> [1] "Anne"   "Fred"   "Joe"    "Louise" "Lucy"   "Ted"
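
If you do this often, the three steps can be wrapped in a small helper. The function name combine_factors is made up for this sketch; it is not part of base R.

```r
# Hypothetical helper: catenate any number of factors via their character values
combine_factors <- function(...) {
  parts <- lapply(list(...), as.character)  # coerce each to character
  factor(unlist(parts))                     # catenate, then re-factor
}

boys_f  <- factor(c("Joe", "Ted", "Fred", "Joe"))
girls_f <- factor(c("Anne", "Louise", "Louise", "Lucy", "Louise"))

kids_f <- combine_factors(boys_f, girls_f)
kids_f
```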

What about adding rows to data frames?

Fortunately, if you want to add rows to a data frame, you can do that with rbind() and it will update the factor columns appropriately:

extra <- rbind(bogus,
               data.frame(students = c("Hilary", "Eve"), 
                          tests = c("Jan", "Sep"),
                          scores = c(88, 97)
                          )
               )

# what was the result?
extra
#>    students tests scores
#> 1  The Dude   Sep     66
#> 2  Martha A   Sep     71
#> 3    Hilary   Sep     79
#> 4  The Dude   Oct     94
#> 5  Martha A   Oct     63
#> 6    Hilary   Oct     93
#> 7  The Dude   Nov     95
#> 8  Martha A   Nov     83
#> 9    Hilary   Nov     82
#> 10   Hilary   Jan     88
#> 11      Eve   Sep     97

# what do the levels look like:
levels(extra$students)
#> [1] "Hilary"   "Martha A" "The Dude" "Eve"

levels(extra$tests)
#> [1] "Nov" "Oct" "Sep" "Jan"

Factor levels stick around

Even if you delete all occurrences of a level in a factor vector, the levels do not automatically change:

no.dude <- bogus[ bogus$students != "The Dude", ]  # drop Devon (The Dude) and his dudeliness
no.dude  # print it out...no "The Dude"
#>   students tests scores
#> 2 Martha A   Sep     71
#> 3   Hilary   Sep     79
#> 5 Martha A   Oct     63
#> 6   Hilary   Oct     93
#> 8 Martha A   Nov     83
#> 9   Hilary   Nov     82

no.dude$students   # print that column of students
#> [1] Martha A Hilary   Martha A Hilary   Martha A Hilary  
#> Levels: Hilary Martha A The Dude

# whoa-ho!  The Dude is still a level...The Dude abides!
# check again
levels(no.dude$students)
#> [1] "Hilary"   "Martha A" "The Dude"

If you have subsetted a data frame and you want to get rid of the extra levels of all the factors, you can do so with droplevels():

no.dude2 <- droplevels(no.dude)

no.dude2  # print it
#>   students tests scores
#> 2 Martha A   Sep     71
#> 3   Hilary   Sep     79
#> 5 Martha A   Oct     63
#> 6   Hilary   Oct     93
#> 8 Martha A   Nov     83
#> 9   Hilary   Nov     82

# check the levels
levels(no.dude2$students)  # no The Dude!
#> [1] "Hilary"   "Martha A"

In many contexts you will want the factor levels to stick around. In others you don’t.
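
droplevels() also works on a single factor, and calling factor() on a factor has the same effect of discarding unused levels. A small sketch with a made-up factor:

```r
f <- factor(c("a", "b", "c"))
sub <- f[f != "c"]          # no "c" values left...

levels(sub)                 # ...but "c" is still a level
levels(droplevels(sub))     # unused level removed
levels(factor(sub))         # re-factoring does the same
```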

Numeric/Character/Factor Disasters

The most common disaster that can happen with factors occurs when you think you can get back to a numeric vector by coercing a factor with as.numeric():

# here are some numbers
my.nums <- c(1,4,8,10,1,8,8,8,10)

# make them a factor
numf <- factor(my.nums)

# try to recover the original integers
as.numeric(numf)  # disaster
#> [1] 1 2 3 4 1 3 3 3 4

# 2 "correct" ways of doing it
as.numeric(as.character(numf))  # coerce to character first, then to numeric
#> [1]  1  4  8 10  1  8  8  8 10


as.numeric(levels(numf)[numf])  # slurp out the levels by numf and coerce
#> [1]  1  4  8 10  1  8  8  8 10
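
To avoid retyping the idiom (and getting it wrong), you could wrap it in a helper. The name factor_to_numeric is made up for this sketch; it converts the levels once and then indexes them by the underlying integer codes.

```r
# Hypothetical helper: recover the numbers that a factor's labels spell out
factor_to_numeric <- function(f) {
  as.numeric(levels(f))[as.integer(f)]
}

numf <- factor(c(1, 4, 8, 10, 1, 8))

as.numeric(numf)          # the integer codes, not the original numbers
factor_to_numeric(numf)   # the original numbers
```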

Why factors are super useful!

I am going to go through just one example that involves counting up occurrences of different categories.
When counting categories you usually will want to:
1. Record a zero for known categories that had no observations
2. List the categories in a particular order
Both of these desires can be accommodated by judicious use of factors!
1. Because levels “stick around” categories will be counted (as 0) even if there are no observations of them
2. The levels of a factor can be put in any order desired, and that order will be used in reporting from many different functions.
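
Both points can be seen in a tiny sketch with made-up data: there are no Medium fish, yet Medium is counted (as 0), and the categories come out in the order given to levels.

```r
sizes <- factor(c("Small", "Small", "Large"),
                levels = c("Small", "Medium", "Large"))

counts <- table(sizes)
counts
```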

The table() function

table(x) gives the number of occurrences of each unique category in x.

set.seed(2)
x <- sample(letters, size = 100, replace = TRUE)
x  # print it
#>   [1] "e" "s" "o" "e" "y" "y" "d" "v" "m" "o" "o" "g" "t" "e" "k" "w" "z"
#>  [18] "f" "l" "b" "r" "k" "v" "d" "j" "m" "d" "j" "z" "d" "a" "e" "v" "w"
#>  [35] "n" "q" "v" "h" "r" "d" "z" "h" "c" "e" "y" "u" "z" "j" "n" "v" "a"
#>  [52] "a" "r" "y" "h" "v" "u" "z" "p" "s" "u" "x" "q" "g" "w" "l" "k" "l"
#>  [69] "f" "b" "h" "i" "b" "e" "e" "t" "h" "w" "k" "o" "j" "r" "a" "k" "f"
#>  [86] "w" "z" "i" "t" "i" "z" "k" "j" "o" "m" "f" "l" "c" "c" "l"

# count the number of occurrences of each letter
table(x)
#> x
#> a b c d e f g h i j k l m n o p q r s t u v w x y z 
#> 4 3 3 5 7 4 2 5 3 5 6 5 3 2 5 1 2 4 2 3 3 6 5 1 4 7

It also can count the number of occurrences of pairs of categories:

set.seed(20)
x <- sample(letters[1:3], size = 10, replace = TRUE)
y <- sample(LETTERS[1:3], size = 10, replace = TRUE)


cbind(x,y)  # think of lining up x and y together
#>       x   y  
#>  [1,] "c" "C"
#>  [2,] "c" "C"
#>  [3,] "a" "A"
#>  [4,] "b" "C"
#>  [5,] "c" "A"
#>  [6,] "c" "B"
#>  [7,] "a" "A"
#>  [8,] "a" "A"
#>  [9,] "a" "A"
#> [10,] "b" "C"

# how often do you see the combination a,A or a,B or c,B  etc.
table(list(x, y)) 
#>    .2
#> .1  A B C
#>   a 4 0 0
#>   b 0 0 2
#>   c 1 1 2
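
The .1 and .2 labels appear because the list passed to table() has no names. One way to get readable labels, sketched here with made-up vectors, is to pass the vectors as named arguments; individual cells can then be pulled out by category name.

```r
x <- c("a", "b", "a", "c")
y <- c("A", "B", "A", "A")

# Named arguments become the names of the table's dimensions
tab <- table(lower = x, upper = y)
tab

tab["a", "A"]   # how often a and A occur together
```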

Some sample data from birds

Example from Mapping migration in a songbird …
393 birds from various locations in the breeding range of Wilson’s warbler

(Image: Wilson’s warbler)

These were genotyped, and locations were lumped into regions
Then we asked how well we could use the genetic data to assign individual birds from each location to the correct region

Here is what the output looks like (a data frame)

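# stringsAsFactors = TRUE is needed in R >= 4.0.0 so the character columns are read as factors
wiwa <- read.csv("data/bird-self-assignments.csv", row.names = 1, stringsAsFactors = TRUE)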

head(wiwa)
#>         PopulationOfOrigin NumberOfLoci AK.EastBC.AB Wa.To.NorCalCoast
#> wAKDE01              wAKDE           95      0.99999           0.00002
#> wAKDE02              wAKDE           94      0.99622           0.00327
#> wAKDE03              wAKDE           95      1.00000           0.00000
#> wAKDE04              wAKDE           95      0.99999           0.00000
#> wAKDE05              wAKDE           95      0.99984           0.00015
#> wAKDE06              wAKDE           95      0.99999           0.00000
#>         CentCalCoast CalSierra Basin.Rockies Eastern    MaxColumn
#> wAKDE01            0         0         0e+00       0 AK.EastBC.AB
#> wAKDE02            0         0         5e-04       0 AK.EastBC.AB
#> wAKDE03            0         0         0e+00       0 AK.EastBC.AB
#> wAKDE04            0         0         0e+00       0 AK.EastBC.AB
#> wAKDE05            0         0         0e+00       0 AK.EastBC.AB
#> wAKDE06            0         0         0e+00       0 AK.EastBC.AB
#>         MaxPosterior
#> wAKDE01      0.99999
#> wAKDE02      0.99622
#> wAKDE03      1.00000
#> wAKDE04      0.99999
#> wAKDE05      0.99984
#> wAKDE06      0.99999

# here are the different locations
levels(wiwa$PopulationOfOrigin)
#>  [1] "eNBFR" "eONHI" "eQCCM" "wABCA" "wAKDE" "wAKJU" "wAKUG" "wAKYA"
#>  [9] "wBCMH" "wCABS" "wCACL" "wCAEU" "wCAHM" "wCAHU" "wCASL" "wCATE"
#> [17] "wCOGM" "wCOPP" "wMTHM" "wOREL" "wORHA" "wORMB" "wWADA"

# here are the different regions
levels(wiwa$MaxColumn)
#> [1] "AK.EastBC.AB"      "Basin.Rockies"     "CalSierra"        
#> [4] "CentCalCoast"      "Eastern"           "Wa.To.NorCalCoast"

Counting up self-assignments

We can count how many birds from each location were assigned to which regions using table()

table(list(wiwa$PopulationOfOrigin, wiwa$MaxColumn))
#>        .2
#> .1      AK.EastBC.AB Basin.Rockies CalSierra CentCalCoast Eastern
#>   eNBFR            0             0         0            0       4
#>   eONHI            0             0         0            0       4
#>   eQCCM            0             0         0            0      17
#>   wABCA           24             1         0            0       0
#>   wAKDE           29             0         0            0       0
#>   wAKJU           10             0         0            0       0
#>   wAKUG           26             0         0            0       0
#>   wAKYA           21             0         0            0       0
#>   wBCMH            9             4         0            0       0
#>   wCABS            0             0         1           11       0
#>   wCACL            0             0        12            2       0
#>   wCAEU            0             0         3            0       0
#>   wCAHM            0             0         1           14       0
#>   wCAHU            0             0        15            0       0
#>   wCASL            0             0         3           19       0
#>   wCATE            0             0        21            2       0
#>   wCOGM            0            11         0            0       0
#>   wCOPP            0            19         0            0       0
#>   wMTHM            1             2         0            0       0
#>   wOREL            6            19         0            0       0
#>   wORHA            0             0         0            3       0
#>   wORMB            0             0         1            1       0
#>   wWADA            2             0         0            1       0
#>        .2
#> .1      Wa.To.NorCalCoast
#>   eNBFR                 0
#>   eONHI                 0
#>   eQCCM                 0
#>   wABCA                 0
#>   wAKDE                 0
#>   wAKJU                 0
#>   wAKUG                 0
#>   wAKYA                 0
#>   wBCMH                 0
#>   wCABS                 3
#>   wCACL                 1
#>   wCAEU                15
#>   wCAHM                 2
#>   wCAHU                 1
#>   wCASL                 1
#>   wCATE                 2
#>   wCOGM                 0
#>   wCOPP                 0
#>   wMTHM                 0
#>   wOREL                 0
#>   wORHA                20
#>   wORMB                20
#>   wWADA                 9

That is all right, but the locations and regions are not ordered very sensibly.
- They are ordered alphabetically,
- It would be better to order them geographically

We can do this by resetting the levels in the order we want:

First, get vectors that have all the categories you want in the order you want them in

# a vector of regions in a geographically sensible order
regions_ordered <- c("AK.EastBC.AB", "Wa.To.NorCalCoast", "CentCalCoast", "CalSierra", "Basin.Rockies", "Eastern")

# get a vector of locations in a good order
locations_ordered <- c("wAKDE", "wAKYA", "wAKUG", "wAKJU", "wABCA", "wBCMH", "wWADA", 
"wORHA", "wORMB", "wCAEU", "wCAHM", "wCABS", "wCASL", "wCATE", "wCACL", "wCAHU",
"wMTHM", "wOREL", "wCOPP", "wCOGM", "eQCCM", "eONHI", "eNBFR"
)

Then, this is the magical step: reset the levels to be the ordered vectors of categories you want. You do this by passing in the ordered vector to the levels argument of the factor() function:

# order the levels of the regions nicely
wiwa$MaxColumn <- factor(wiwa$MaxColumn, levels = regions_ordered)

# order the levels of the locations nicely
wiwa$PopulationOfOrigin <- factor(wiwa$PopulationOfOrigin, levels = locations_ordered)

WARNING DO NOT DO THIS!
```
levels(wiwa$MaxColumn) <- regions_ordered
levels(wiwa$PopulationOfOrigin) <- locations_ordered
```
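Assigning to levels() does not reorder anything: it renames the existing levels position by position, so the observations end up carrying the wrong labels. To change the order, you have to reconstitute the variable as a factor with factor(..., levels = ...), as above. Otherwise you can get totally wrong values.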
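
Here is a small sketch of what goes wrong, using a made-up factor of months rather than the bird data:

```r
f <- factor(c("Nov", "Oct", "Sep", "Sep"))   # levels are Nov, Oct, Sep
wanted <- c("Sep", "Oct", "Nov")

# WRONG: this renames level 1 (Nov) to Sep and level 3 (Sep) to Nov
bad <- f
levels(bad) <- wanted
as.character(bad)    # the data have been relabeled

# RIGHT: rebuild the factor; values are untouched, only the level order changes
good <- factor(f, levels = wanted)
as.character(good)
levels(good)
```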

Then use table again, and note the ordering of the categories in the output:

table(list(wiwa$PopulationOfOrigin, wiwa$MaxColumn))
#>        .2
#> .1      AK.EastBC.AB Wa.To.NorCalCoast CentCalCoast CalSierra
#>   wAKDE           29                 0            0         0
#>   wAKYA           21                 0            0         0
#>   wAKUG           26                 0            0         0
#>   wAKJU           10                 0            0         0
#>   wABCA           24                 0            0         0
#>   wBCMH            9                 0            0         0
#>   wWADA            2                 9            1         0
#>   wORHA            0                20            3         0
#>   wORMB            0                20            1         1
#>   wCAEU            0                15            0         3
#>   wCAHM            0                 2           14         1
#>   wCABS            0                 3           11         1
#>   wCASL            0                 1           19         3
#>   wCATE            0                 2            2        21
#>   wCACL            0                 1            2        12
#>   wCAHU            0                 1            0        15
#>   wMTHM            1                 0            0         0
#>   wOREL            6                 0            0         0
#>   wCOPP            0                 0            0         0
#>   wCOGM            0                 0            0         0
#>   eQCCM            0                 0            0         0
#>   eONHI            0                 0            0         0
#>   eNBFR            0                 0            0         0
#>        .2
#> .1      Basin.Rockies Eastern
#>   wAKDE             0       0
#>   wAKYA             0       0
#>   wAKUG             0       0
#>   wAKJU             0       0
#>   wABCA             1       0
#>   wBCMH             4       0
#>   wWADA             0       0
#>   wORHA             0       0
#>   wORMB             0       0
#>   wCAEU             0       0
#>   wCAHM             0       0
#>   wCABS             0       0
#>   wCASL             0       0
#>   wCATE             0       0
#>   wCACL             0       0
#>   wCAHU             0       0
#>   wMTHM             2       0
#>   wOREL            19       0
#>   wCOPP            19       0
#>   wCOGM            11       0
#>   eQCCM             0      17
#>   eONHI             0       4
#>   eNBFR             0       4

Many, many functions use the order of the levels of a factor to determine what order to output things in (like drawing legends on plots, etc.). So knowing how to set the order of the levels with factor(my.factor, levels = my.ord) is very useful.
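
Two base R functions that follow the level order, shown as a quick sketch with a made-up factor of months: sort() and table().

```r
months <- factor(c("Nov", "Sep", "Oct", "Sep"),
                 levels = c("Sep", "Oct", "Nov"))

# sort() orders by level, not alphabetically
sorted <- as.character(sort(months))
sorted

# table() reports categories in level order too
names(table(months))
```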

Reproducible Research Course by Eric C. Anderson for (NOAA/SWFSC)
