Reproducible Research Course by Eric C. Anderson for (NOAA/SWFSC)


Functions and lapply

Intro

R is known as a “functional” language in the sense that every operation it does can be be thought of a function that operates on arguments and returns a value. For the casual user of R, it is not clear whether thinking about this is helpful. BUT what is helpful to any user of R is the ability to understand how functions in R:

  1. Are called,
  2. Parse their arguments,
  3. Can be defined by the user (yes! you can make your own functions in R),
  4. Can be applied iteratively over elements of lists or vectors.

Once you get comfortable writing your own functions, you can save yourself a lot of time by:

  • recognizing what parts of your code essesntially do the same things
  • wrapping those into a function
  • calling that function with different arguments.

This can be particularly useful if you want to apply the same analysis to multiple different data sets. You can write a function that will perform the analysis on a single, “generic” data set like one that you have, and then you can apply that function to multiple data sets that you might have.

Specific goals

  1. Show how you define functions
  2. Discuss parameters and arguments, and R’s system for default values and parsing of argument lists.
  3. Take a brief sojourn into the world of overloaded functions and R’s S3 object system.
  4. Show how you can apply a function to every member of a list with lapply(), and give an actual example.

User defined functions

We could start off talking about functions, generally, but it will be more fun, and more accessible to just start writing our own functions. In the process we will learn a lot about function conventions.

the function function

  • There is a function in R called function() whose job is to return a function.
  • Let’s use it, and then talk about it.
  • Here we make a function called evensum that adds up the elements in positions 2, 4, 6, … of a vector:
    # define it 
    evensum <- function(x) sum(x[seq(2, length(x), by = 2)])

    # use it!
    evensum(x = 10:20)
#> [1] 75
  • Wow! that was easy! How did that work?

The anatomy of function()

  • The syntax of the function() function is relatively straightforward:
    1. It takes arguments which are the names of the variables that act as placeholders for the values that you will pass into the function. These are called parameters.
      • In the case of evensum() there is one parameter, x.
      • In the body of the function, which is the expression that comes after function(x) we say what we want to do to x (namely add up every other element of it, starting with the second element).
    2. When we call the function, that value that we “pass in for x” (which was the vector 10:20 in our example) gets substituted for every occurrence of x in the body of the function.
    3. When the function is exectuted it returns whatever value the expression that is its body returns.

Compound expressions

  • Most functions are going to be more complex than just a single statement like sum(x[seq(2, length(x), by = 2)]), so how do we execute muliple statements in the body of a function and return the value that we want?
  • Answer: Use a compound expression which is merely a bunch of expressions wrapped up in curly braces.
  • Here is an example of how we could have written evensum with multiple expressions and intermediate value assignments:
    # define it
    evensum <- function(x) {
      idx <- seq(2, length(x), by = 2)
      y <- x[idx]
      sum(y)
    }

    # use it!
    evensum(10:20)
#> [1] 75

    # hooray! we got the same thing as last time

The value of a compound expression

  • It is important to understand that if you have a compound expression like:

    {
     statement_1
     statement_2
     statement_3
    }
    then the value of this expression will be the value returned by the last expression inside the compound expression (in the above case, the value returned by statement_3)
  • Thus, when the body of a function is a compound expression, the value that the function returns will just be the value of the last expression in the body of the function.
  • You can also be explicit about it and wrap it in return():

    # define it using return
    evensum <- function(x) {
      idx <- seq(2, length(x), by = 2)
      y <- x[idx]
      return(sum(y))
    }
    
    # use it
    evensum(10:20)
#> [1] 75

Multiple parameters

  • Having multiple parameters that your function understands is straightforward. You just put them in the argument list of function()!
  • Imagine that we wanted to make a more general function of which evensum was a special case. Perhaps we want to be able to easily take any vector, start on the Start-th element and then sum it and every other element in steps of Step. Here goes:
    skip_sum <- function(x, Start, Step) sum(x[seq(Start, length(x), by = Step)])
  • Now, we can call that with values for x, Start and Step.
    # Try this:
    skip_sum(x = 10:20, Start = 4, Step = 3)
#> [1] 48

    # this will give us the same results as evensum:
    skip_sum(x = 10:20, Start = 2, Step = 2)
#> [1] 75
  • Notice that in the two invocations of skip_sum above, we are using named parameters. i.e., we are saying x = 10:20 and Start = 2, etc. We are being very explicit about which arguments belong to which parameters.

In class exercise

OK, everyone, you have 5 minutes to write your own function called addmult that takes two vectors, a and b and returns a named list in which the first component named “Add” is equal to a+b and the second component named “Mult” is equal to a*b.

Positional parameters and named parameters

  • R is rather interesting in that you don’t have to give it named parameters. It can figure out what you mean as long as the order of arguments you give it is in the order of the parameters in the function definition:
    # this:
    skip_sum(10:20, 4, 3)
#> [1] 48
  
    # will be the same as this:
    skip_sum(x = 10:20, Start = 4, Step = 3)
#> [1] 48
  • But, if the argument list is long, it is often easier to read (and less error-prone) to use named parameters.
  • Note that named arguments don’t have to be in any particular order, if they are named!
    # these will all be the same
    skip_sum(x = 10:20, Start = 4, Step = 3)
#> [1] 48
    skip_sum(Start = 4, Step = 3, x = 10:20)
#> [1] 48
    skip_sum(Start = 4, x = 10:20, Step = 3)
#> [1] 48

Mixing named and positional parameters

This is far out. You can mix named and positional parameters. R’s rule is this:

  1. First match all the named parameters to named arguments and then move them off the argument list and the parameter list.
  2. Then match the remaining arguments to the remaining parameters positionally.
    # for example these will all be the same
    skip_sum(x = 10:20, Start = 4, Step = 3)
#> [1] 48
    skip_sum(10:20, Start = 4, Step = 3)
#> [1] 48
    skip_sum(Start = 4, Step = 3, 10:20)
#> [1] 48
    skip_sum(Start=4, 10:20, 3)
#> [1] 48

Defining default parameter values

  • What if we realized that most the time we are using skip_sum we just want it set with Start = 2 and Step = 2 so that it behaves like evensum.
  • This is a job for defaults!
  • You can set default values for parameters by using an = sign in the function definition. If no arguments are passed in for those parameters, then the defaults are applied.
    # new definition with defaults
    skip_sum <- function(x, Start = 2, Step = 2) sum(x[seq(Start, length(x), by = Step)])

    # now see how it behaves:
    skip_sum(10:20)
#> [1] 75

    skip_sum(10:20, 2, 2)
#> [1] 75

    skip_sum(10:20, Step = 2)
#> [1] 75

    skip_sum(10:20, 4, 5)  # this overrides the defaults
#> [1] 31

The Curious “…”" parameter

  • Sometimes, it would be nice to be able to pass other named parameters to a function that is called from inside your own function. This is what the ... parameter is for.
  • You have probably seen ... in the help files for certain functions. It just means “any other named parameters that make sense”.
  • Of course, they only make sense if your function takes whatever else was passed in and does something appropriate with them.
  • Example: our skip_sum function won’t do so well with NAs:
    vec <- 10:20
    vec[c(3,5,8,9)] <- NA
    
    # here is what vec looks like
    vec
#>  [1] 10 11 NA 13 NA 15 16 NA NA 19 20

    # try it:
    skip_sum(vec)
#> [1] NA

    # Wait, I want it to treat NAs as zeroes
  • If you want it to treat NAs as zeroes you can redefine skip_sum like this:
    skip_sum <- function(x, Start = 2, Step = 2, ...) sum(x[seq(Start, length(x), by = Step)], ...)
  • Note the “…” in the argument list and in the body. It is being passed in as an argument to the sum function.
  • Try it:
    skip_sum(vec, na.rm = TRUE)  # we pass in a named argument that does not match Start, or Step.
#> [1] 58
    
    # that gave us the same as:
    skip_sum({vec[is.na(vec)]<-0; vec})  # note use of compound expression...
#> [1] 58
    
    # If we don't pass in na.rm = TRUE then it doesn't get passed to sum:
    skip_sum(vec)
#> [1] 58

Listing a function

  • If you type a function name, without the parentheses, R will list the code that went into the function definition
  • Try it:
    # list evensum
    evensum
#> function(x) {
#>       idx <- seq(2, length(x), by = 2)
#>       y <- x[idx]
#>       return(sum(y))
#>     }

    # list skip_sum
    skip_sum
#> function(x, Start = 2, Step = 2, ...) sum(x[seq(Start, length(x), by = Step)], ...)

    # have a look at read.csv, which is just read.table with some defaults:
    read.csv
#> function (file, header = TRUE, sep = ",", quote = "\"", dec = ".", 
#>     fill = TRUE, comment.char = "", ...) 
#> read.table(file = file, header = header, sep = sep, quote = quote, 
#>     dec = dec, fill = fill, comment.char = comment.char, ...)
#> <bytecode: 0x7fda9c595d58>
#> <environment: namespace:utils>

A Brief Journey into S3 function overloading

What’s this UseMethod stuff?

  • Sometimes, when you list a function, like print for example, you get something like this:
    print
#> function (x, ...) 
#> UseMethod("print")
#> <bytecode: 0x7fda9e3a2d48>
#> <environment: namespace:base>
  • Hey! I don’t see any code listing there! What is going on!
  • The UseMethod call in there actually is the function inside the print function.
  • UseMethod is the (somewhat klugie and limited) way that R overloads its functions.
  • What is overloading? It means that if you pass something to the print function, it will behave differently depending on what the class of the first argument is.
  • Aha! This is why data frames print out differently than lists, etc.

How this works

  • The print function has been defined so that when it is called it looks to see what class the first argument (corresponding to the named parameter x). let’s say that argument is of class weird. In that case, the print function looks for another function called print.weird(). If it finds it, it will execute that function on the argument.

How many class-specific print functions are there?

You can list them with methods()

methods(print)
#>   [1] print.acf*                                   
#>   [2] print.AES*                                   
#>   [3] print.anova*                                 
#>   [4] print.aov*                                   
#>   [5] print.aovlist*                               
#>   [6] print.ar*                                    
#>   [7] print.Arima*                                 
#>   [8] print.arima0*                                
#>   [9] print.AsIs                                   
#>  [10] print.aspell*                                
#>  [11] print.aspell_inspect_context*                
#>  [12] print.bibentry*                              
#>  [13] print.Bibtex*                                
#>  [14] print.browseVignettes*                       
#>  [15] print.by                                     
#>  [16] print.changedFiles*                          
#>  [17] print.check_code_usage_in_package*           
#>  [18] print.check_compiled_code*                   
#>  [19] print.check_demo_index*                      
#>  [20] print.check_depdef*                          
#>  [21] print.check_dotInternal*                     
#>  [22] print.check_make_vars*                       
#>  [23] print.check_nonAPI_calls*                    
#>  [24] print.check_package_code_assign_to_globalenv*
#>  [25] print.check_package_code_attach*             
#>  [26] print.check_package_code_data_into_globalenv*
#>  [27] print.check_package_code_startup_functions*  
#>  [28] print.check_package_code_syntax*             
#>  [29] print.check_package_code_unload_functions*   
#>  [30] print.check_package_compact_datasets*        
#>  [31] print.check_package_CRAN_incoming*           
#>  [32] print.check_package_datasets*                
#>  [33] print.check_package_depends*                 
#>  [34] print.check_package_description*             
#>  [35] print.check_package_description_encoding*    
#>  [36] print.check_package_license*                 
#>  [37] print.check_packages_used*                   
#>  [38] print.check_po_files*                        
#>  [39] print.check_Rd_contents*                     
#>  [40] print.check_Rd_line_widths*                  
#>  [41] print.check_Rd_metadata*                     
#>  [42] print.check_Rd_xrefs*                        
#>  [43] print.check_so_symbols*                      
#>  [44] print.check_T_and_F*                         
#>  [45] print.check_vignette_index*                  
#>  [46] print.checkDocFiles*                         
#>  [47] print.checkDocStyle*                         
#>  [48] print.checkFF*                               
#>  [49] print.checkRd*                               
#>  [50] print.checkReplaceFuns*                      
#>  [51] print.checkS3methods*                        
#>  [52] print.checkTnF*                              
#>  [53] print.checkVignettes*                        
#>  [54] print.citation*                              
#>  [55] print.codoc*                                 
#>  [56] print.codocClasses*                          
#>  [57] print.codocData*                             
#>  [58] print.colorConverter*                        
#>  [59] print.compactPDF*                            
#>  [60] print.condition                              
#>  [61] print.connection                             
#>  [62] print.data.frame                             
#>  [63] print.Date                                   
#>  [64] print.default                                
#>  [65] print.dendrogram*                            
#>  [66] print.density*                               
#>  [67] print.difftime                               
#>  [68] print.dist*                                  
#>  [69] print.DLLInfo                                
#>  [70] print.DLLInfoList                            
#>  [71] print.DLLRegisteredRoutines                  
#>  [72] print.dummy_coef*                            
#>  [73] print.dummy_coef_list*                       
#>  [74] print.ecdf*                                  
#>  [75] print.factanal*                              
#>  [76] print.factor                                 
#>  [77] print.family*                                
#>  [78] print.fileSnapshot*                          
#>  [79] print.findLineNumResult*                     
#>  [80] print.formula*                               
#>  [81] print.ftable*                                
#>  [82] print.function                               
#>  [83] print.getAnywhere*                           
#>  [84] print.glm*                                   
#>  [85] print.hclust*                                
#>  [86] print.help_files_with_topic*                 
#>  [87] print.hexmode                                
#>  [88] print.HoltWinters*                           
#>  [89] print.hsearch*                               
#>  [90] print.htest*                                 
#>  [91] print.html*                                  
#>  [92] print.infl*                                  
#>  [93] print.integrate*                             
#>  [94] print.isoreg*                                
#>  [95] print.kmeans*                                
#>  [96] print.knitr_kable*                           
#>  [97] print.Latex*                                 
#>  [98] print.LaTeX*                                 
#>  [99] print.libraryIQR                             
#> [100] print.listof                                 
#> [101] print.lm*                                    
#> [102] print.loadings*                              
#> [103] print.loess*                                 
#> [104] print.logLik*                                
#> [105] print.ls_str*                                
#> [106] print.medpolish*                             
#> [107] print.MethodsFunction*                       
#> [108] print.mtable*                                
#> [109] print.NativeRoutineList                      
#> [110] print.news_db*                               
#> [111] print.nls*                                   
#> [112] print.noquote                                
#> [113] print.numeric_version                        
#> [114] print.object_size*                           
#> [115] print.octmode                                
#> [116] print.packageDescription*                    
#> [117] print.packageInfo                            
#> [118] print.packageIQR*                            
#> [119] print.packageStatus*                         
#> [120] print.pairwise.htest*                        
#> [121] print.PDF_Array*                             
#> [122] print.PDF_Dictionary*                        
#> [123] print.pdf_doc*                               
#> [124] print.pdf_fonts*                             
#> [125] print.PDF_Indirect_Reference*                
#> [126] print.pdf_info*                              
#> [127] print.PDF_Keyword*                           
#> [128] print.PDF_Name*                              
#> [129] print.PDF_Stream*                            
#> [130] print.PDF_String*                            
#> [131] print.person*                                
#> [132] print.POSIXct                                
#> [133] print.POSIXlt                                
#> [134] print.power.htest*                           
#> [135] print.ppr*                                   
#> [136] print.prcomp*                                
#> [137] print.princomp*                              
#> [138] print.proc_time                              
#> [139] print.raster*                                
#> [140] print.Rd*                                    
#> [141] print.recordedplot*                          
#> [142] print.restart                                
#> [143] print.RGBcolorConverter*                     
#> [144] print.rle                                    
#> [145] print.roman*                                 
#> [146] print.sessionInfo*                           
#> [147] print.shiny.tag*                             
#> [148] print.shiny.tag.list*                        
#> [149] print.simple.list                            
#> [150] print.smooth.spline*                         
#> [151] print.socket*                                
#> [152] print.srcfile                                
#> [153] print.srcref                                 
#> [154] print.stepfun*                               
#> [155] print.stl*                                   
#> [156] print.StructTS*                              
#> [157] print.subdir_tests*                          
#> [158] print.summarize_CRAN_check_status*           
#> [159] print.summary.aov*                           
#> [160] print.summary.aovlist*                       
#> [161] print.summary.ecdf*                          
#> [162] print.summary.glm*                           
#> [163] print.summary.lm*                            
#> [164] print.summary.loess*                         
#> [165] print.summary.manova*                        
#> [166] print.summary.nls*                           
#> [167] print.summary.packageStatus*                 
#> [168] print.summary.ppr*                           
#> [169] print.summary.prcomp*                        
#> [170] print.summary.princomp*                      
#> [171] print.summary.table                          
#> [172] print.summaryDefault                         
#> [173] print.table                                  
#> [174] print.tables_aov*                            
#> [175] print.terms*                                 
#> [176] print.ts*                                    
#> [177] print.tskernel*                              
#> [178] print.TukeyHSD*                              
#> [179] print.tukeyline*                             
#> [180] print.tukeysmooth*                           
#> [181] print.undoc*                                 
#> [182] print.vignette*                              
#> [183] print.warnings                               
#> [184] print.xgettext*                              
#> [185] print.xngettext*                             
#> [186] print.xtabs*                                 
#> 
#>    Non-visible functions are asterisked

OK! That is a bunch. But notice that there is not a print.weird function.

Make a new print function

I am going to make a print function that will be invoked on objects of class weird:

print.weird <- function(x) {
  print("Object of class weird")
  print(paste("Length:", length(x)))
  print(head(x))
}

Our print.weird function doesn’t do much, it just shows the length and the first few lines, and lets the user know the object is of class weird.

Let’s watch it in action:

boing <- 1:100

print(boing) # does what we expect
#>   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
#>  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
#>  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
#>  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
#>  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
#>  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

# make boing an object of class weird
class(boing) <- "weird"

print(boing)
#> [1] "Object of class weird"
#> [1] "Length: 100"
#> [1] 1 2 3 4 5 6

It is good keep in mind that R is full of overloaded functions — that is functions that behave differently depending on the class of their arguments.

Applying functions to list elements

One of the great things about understanding how to define your own functions is that it lets you harness the power of the lapply() function which takes two main arguments:

  • a list (really any vector…)
  • a function

And it cycles over the list and applies the function to each of the list’s components, and returns the results in a list!

Here is its usage from its help file: lapply(X, FUN, ...)

An example: baby rockfish

To motivate our discussion of lapply() I have a simple example. In the directory data/rockfish_larvae there are 7 files, each with the genotypes of 96 larval rockfish that are the offspring of a single female. Here is what one file looks like:

cat(readLines("data/rockfish_larvae/K17larvae.txt")[1:10], sep = "\n")
#> Loc1_a Loc1_b Loc2_a Loc2_b Loc3_a Loc3_b Loc4_a Loc4_b Loc5_a Loc5_b Loc6_a Loc6_b Loc7_a Loc7_b
#> 227 267 127 127 155 187 128 133 184 184 85 87 275 278
#> 231 267 123 127 159 169 128 133 184 184 85 85 275 278
#> 217 223 127 131 159 169 123 133 184 188 0 0 275 278
#> 217 219 127 127 187 187 128 133 184 184 85 87 275 275
#> 217 227 127 131 187 187 128 133 184 186 85 85 275 278
#> 231 267 123 127 187 193 123 123 184 184 85 85 275 278
#> 217 219 127 127 155 187 128 133 184 186 85 89 275 275
#> 217 223 123 127 187 193 123 133 184 184 85 85 275 275
#> 0 0 127 127 187 187 128 133 184 184 0 0 275 278

Each pair of columns are the genotypes at a single location (a locus) in the genome. The numbers refer to different alleles. 0’s denote missing data.

One quick and dirty way of detecting whether a rockfish mother has mated with more than one male is to see if any loci have more than 4 alleles amongst the female’s offspring.

So, our goal is to cycle over the 7 files, read them in, cycle over the loci, and for each locus, count the number of each allele observed, and ultimately count up the number of alleles.

Here goes…

Step 1: Get a named vector of file names

  • We can use the dir() function to get the paths to the files we want.
  • We will throw some regex foo in there to name the elements of the vector the way we want:
    library(stringr)
    fpaths <- dir("data/rockfish_larvae", full.names = TRUE)
    
    # they look like this
    fpaths
#> [1] "data/rockfish_larvae/K17larvae.txt"
#> [2] "data/rockfish_larvae/K18larvae.txt"
#> [3] "data/rockfish_larvae/K20larvae.txt"
#> [4] "data/rockfish_larvae/K22larvae.txt"
#> [5] "data/rockfish_larvae/K23larvae.txt"
#> [6] "data/rockfish_larvae/K24larvae.txt"
#> [7] "data/rockfish_larvae/K26larvae.txt"

    # grab the "nice" part to be their names
    names(fpaths) <- str_match(fpaths, "data.*/(K[12][0-9]larvae)")[,2]

    # here is what we now have:
    fpaths
#>                            K17larvae                            K18larvae 
#> "data/rockfish_larvae/K17larvae.txt" "data/rockfish_larvae/K18larvae.txt" 
#>                            K20larvae                            K22larvae 
#> "data/rockfish_larvae/K20larvae.txt" "data/rockfish_larvae/K22larvae.txt" 
#>                            K23larvae                            K24larvae 
#> "data/rockfish_larvae/K23larvae.txt" "data/rockfish_larvae/K24larvae.txt" 
#>                            K26larvae 
#> "data/rockfish_larvae/K26larvae.txt"

Step 2: Read the files into a list of 7 data frames

Here we can use lapply().

Here is one way to do it:

# note we are lapplying over a character vector. But the result
# still comes back as a list!
dframes <- lapply(fpaths, function(x) read.table(x, header = TRUE, na.strings = "0")) 

And this is another way we could do it, using the … to pass the extra named parameters to read.table

dframes <- lapply(fpaths, read.table, header = TRUE, na.strings = "0")

Two important points to make:

  1. function(x) read.table(x, header = TRUE, na.strings = "0") returns a function which is used as lapply()’s FUN argument. That expression is as good as passing in the name of a function. Pretty cool…
    • You might see this sort of construction where a function is defined but not returned into a variable called an anonymous function.
  2. Defining a function and being explicit about passing the argument in is more flexible than passing the name of a function and extra named parameters.

Step 3: Make sure we have what we want

Note that dframes is a list of length 7, and it has names that are appropriate:

length(dframes)
#> [1] 7
names(dframes)
#> [1] "K17larvae" "K18larvae" "K20larvae" "K22larvae" "K23larvae" "K24larvae"
#> [7] "K26larvae"

This shows that lapply() propagates names to the list that it returns.

It would be nice to make sure that every component of it was a data frame of the correct size.

Class exercise: Use lapply to quickly compute the dimensions of every data frame that was just read in.

I want to pause for a moment and reiterate that each component of the list dframes is a data frame! Remember that a list can store any object of any class or structure. It is easy to forget that….But when you remember it, you can have fun throwing all manner of objects into lists if need be.

Step 4: Prepare to count alleles of different types

  • These data frames are not particularly tidy. It is particularly annoying that data for each locus are in two separate columns.
  • There are lots of ways we could deal with this. One would be to reshape the data into one big tidy data frame with five colums (Mother, Larva, Locus, GeneCopy (a or b), Allele).
  • But since we are working with lapply, we will do it differently.
  • We are just counting up the alleles, so we can just stack the first and second columns for each locus on top of each other. Then all the alleles are in a single vector. Here we will do this.
  • We can experiment with a single component first. It is convenient to call it x.
    x <- dframes[[1]]
    first_cols <- x[, c(T,F)]  # pick out the first columns of each locus
    second_cols <- x[, c(F,T)] # second columns of each locus

    # now, name the colums so they are the same, and just refer to locus
    # this involves a stringr function
    names(first_cols) <- str_replace(names(first_cols), "_a", "")
    names(second_cols) <- str_replace(names(second_cols), "_b", "")

    # now, stack those up:
    tmp <- rbind(first_cols, second_cols)

    # see how big it is and what it looks like
    dim(tmp)
#> [1] 192   7

    head(tmp)
#>   Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7
#> 1  227  127  155  128  184   85  275
#> 2  231  123  159  128  184   85  275
#> 3  217  127  159  123  184   NA  275
#> 4  217  127  187  128  184   85  275
#> 5  217  127  187  128  184   85  275
#> 6  231  123  187  123  184   85  275
  • OK! That works, but it was only for a single component that we had named x. We want to apply the same series of operations to every single component of dframes.
    • This is a job for lapply and a function!
  • Here we define a function and apply it:
    # define a function of x (see how useful it was to call that thing x when we were experimenting?)
    stack_em <- function(x) {
      first_cols <- x[, c(T,F)]
      second_cols <- x[, c(F,T)]
      names(first_cols) <- str_replace(names(first_cols), "_a", "")
      names(second_cols) <- str_replace(names(second_cols), "_b", "")
      rbind(first_cols, second_cols)
    }

    # now, lapply that function over dframes:
    dframes_stacked <- lapply(dframes, stack_em)

    # note that we also could have done:
    dframes_stacked <- lapply(dframes, function(x) {
      first_cols <- x[, c(T,F)]
      second_cols <- x[, c(F,T)]
      names(first_cols) <- str_replace(names(first_cols), "_a", "")
      names(second_cols) <- str_replace(names(second_cols), "_b", "")
      rbind(first_cols, second_cols)
      }
    )

Step 5: Count up the occurrences of each allele in each column of dframes_stacked

  • Recall that a data frame is just a special kind of a list. The columns of the data frame are the components of the list. So you can lapply over them.
  • We will apply the table function to each column of each component of dframes_stacked.
  • Far out! That requires nested lapply()’s:
    alle_counts <- lapply(dframes_stacked, function(x) lapply(x, table))

    # see what the first component of the result looks like:
    alle_counts[1]
#> $K17larvae
#> $K17larvae$Loc1
#> 
#> 217 219 223 227 231 267 
#>  50  24  20  33  17  44 
#> 
#> $K17larvae$Loc2
#> 
#> 123 127 131 
#>  19 123  48 
#> 
#> $K17larvae$Loc3
#> 
#> 155 159 169 187 193 
#>  28  20  48  76  16 
#> 
#> $K17larvae$Loc4
#> 
#> 123 128 133 
#>  63  75  50 
#> 
#> $K17larvae$Loc5
#> 
#> 184 186 188 
#> 144  28  20 
#> 
#> $K17larvae$Loc6
#> 
#>  85  87  89 
#> 117  44  23 
#> 
#> $K17larvae$Loc7
#> 
#> 275 278 
#> 138  50

Step 6: Count up the total number of alleles at each locus

  • The above result shows clear evidence of having more than four alleles, at least among some loci.
  • But it is hard to look at. Can we summarize it further? Yes!
  • We can take the length of each component to see how many distinct alleles there were:
    list_result <- lapply(alle_counts, function(x) lapply(x, length))

    # what does it look like?
    list_result[1]
#> $K17larvae
#> $K17larvae$Loc1
#> [1] 6
#> 
#> $K17larvae$Loc2
#> [1] 3
#> 
#> $K17larvae$Loc3
#> [1] 5
#> 
#> $K17larvae$Loc4
#> [1] 3
#> 
#> $K17larvae$Loc5
#> [1] 3
#> 
#> $K17larvae$Loc6
#> [1] 3
#> 
#> $K17larvae$Loc7
#> [1] 2
  • OK, that is nice, but it is hard to look at as a list.

Step 6 do-over: using sapply

  • Because each locus yields just a single number, and there are exactly 7 loci per mother, we could simplify all these results into a table that is easier to look at.
  • A nice function for this is sapply() which does this:
    1. Run lapply
    2. Pass the result through the simplify2array function
  • simplify2array() looks at a list, and if it could be simply expressed as an array (like a matrix) without losing any information, then it does that, whilst preserving the names in a sensible fashion.
  • Check this out:
    array_result <- sapply(alle_counts, function(x) sapply(x, length))
  
    array_result
#>      K17larvae K18larvae K20larvae K22larvae K23larvae K24larvae K26larvae
#> Loc1         6         4         4         4         5         4         5
#> Loc2         3         5         4         4         3         4         3
#> Loc3         5         5         4         3         3         5         4
#> Loc4         3         3         3         3         3         3         3
#> Loc5         3         3         3         3         2         3         2
#> Loc6         3         2         2         2         3         2         3
#> Loc7         2         2         2         2         2         2         2

Wrap up

  • we’ve just scratched the surface of a whole family of apply-like functions that are present in R.
  • If you want to get more into them, I recommend Hadley Wickham’s plyr package.
  • If you are careful about keeping your data in a tidy format, then you can probably just use Hadley’s dplyr package which is phenomenally cool.
  • What you’ve learned here about functions will be useful all over the R world.

comments powered by Disqus