Reproducible Research Course by Eric C. Anderson for (NOAA/SWFSC)


Atomic Data Types and Coercion

Basic Data “Modes” of R

There are four main “modes” of scalar data, in order from least to most general:

  1. logical can take two values: TRUE and FALSE, which can be abbreviated as T and F when you type them.
  2. The numeric mode comes in two flavors: “integer” and “double” (real numbers; R reports the class of these as “numeric”). Examples: 1, 3.14, 8.2, 10, etc.
  3. complex: these are complex numbers of the form a + bi where a and b are real numbers and $i=\sqrt{-1}.$ Examples: 3.2+7.3i, 4+0i
  4. character: these take values that are often called “strings” in other languages. Examples: "fred", "foo", "bar", "boing".

There is also a raw mode, which refers to raw bytes of data, but we won’t concern ourselves with that for now.
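
To see these modes for yourself, here is a small sketch using the base R functions mode() and typeof(). The particular values are just illustrations; typeof() is included because it reveals which flavor of numeric you have.

```r
# mode() reports the mode; typeof() reports the underlying storage type
mode(TRUE)
#> [1] "logical"
mode(3.14)
#> [1] "numeric"
mode(10L)       # the L suffix asks for an integer
#> [1] "numeric"
typeof(10L)
#> [1] "integer"
typeof(10)      # without the L, a number is stored as a double
#> [1] "double"
mode(3.2+7.3i)
#> [1] "complex"
mode("fred")
#> [1] "character"
```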

Atomic Vectors

The atomic vector is a fundamental data structure in R: a vector in which every element is of the same mode. For example:

x <- c(1,2,3,5,7)
x
#> [1] 1 2 3 5 7

Pretty basic stuff, until you start accidentally or intentionally mixing modes.

x <- c(1,2,3,5,7,"11")
x
#> [1] "1"  "2"  "3"  "5"  "7"  "11"

The mode of everything is coerced to the mode of the element with the most general mode, and this can really bite you in the rear if you don’t watch out!

Coercion

  • All the data in an atomic vector must be of the same mode
  • If data are added so that modes are mixed, then the whole vector gets changed so that everything is of the most general mode
  • Example:

    # simple atomic vector of mode numeric
    x <- 1:6
    x
    #> [1] 1 2 3 4 5 6
    
    # now change one to mode character and see what happens
    x[1] <- "tweezer"
    x
    #> [1] "tweezer" "2"       "3"       "4"       "5"       "6"
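
One way to catch this kind of surprise is to check mode() before and after the assignment. The sketch below repeats the example and then tries to coerce back; the name y is just an illustrative variable, and suppressWarnings() is used only to keep the output tidy.

```r
x <- 1:6
mode(x)
#> [1] "numeric"
sum(x)
#> [1] 21

x[1] <- "tweezer"
mode(x)
#> [1] "character"

# sum(x) would now stop with an error, because x holds character data.
# Coercing back recovers the numbers, but the replaced element becomes NA.
y <- suppressWarnings(as.numeric(x))
y
#> [1] NA  2  3  4  5  6
```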

Coercion Up One Step

  • logical to numeric:
    • TRUE ==> 1
    • FALSE ==> 0
  • numeric to complex:
    • 6.4 ==> 6.4+0i
    • 5 ==> 5+0i
  • complex to character:
    • 6.4+0i ==> "6.4+0i"

Coercion Up Two Or More Steps

Note that the coercion sometimes “jumps over the intermediate steps”:

  • logical to complex
    • TRUE ==> 1+0i
    • FALSE ==> 0+0i
  • logical to character (it does not go FALSE ==> 0 ==> “0”)
    • TRUE ==> "TRUE"
    • FALSE ==> "FALSE"
  • numeric to character
    • 7 ==> "7"
    • 3.1415 ==> "3.1415"
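
The “jumping over” is easiest to see by comparing a direct coercion with one that is forced through an intermediate mode. In this sketch (values chosen only for illustration), the nested c() call makes the logical become numeric first, so the final character result differs.

```r
c(TRUE, 1+0i)          # logical to complex
#> [1] 1+0i 1+0i
c(TRUE, "a")           # logical straight to character
#> [1] "TRUE" "a"
c(c(TRUE, 0), "a")     # logical to numeric first, then to character
#> [1] "1" "0" "a"
c(3.1415, "pi")        # numeric to character
#> [1] "3.1415" "pi"
```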

Coercion Down One Step

Sometimes things get coerced “downwards” (i.e., toward less general data types).

If the coercion doesn’t make sense, you end up with NA, which is how R denotes missing data.

  • numeric to logical (0 ==> FALSE, anything else ==> TRUE); always “makes sense”
    • 0 ==> FALSE
    • 1 ==> TRUE
    • 78.2 ==> TRUE
    • 0.0001 ==> TRUE
    • -563.3 ==> TRUE
  • complex to numeric (discards the imaginary part and warns about it!)
    • 3.4+0i ==> 3.4
    • 5.6+7.6i ==> 5.6 (+ a warning)

      # witness a warning:
      as.numeric(7.4+5i)
      #> Warning: imaginary parts discarded in coercion
      #> [1] 7.4
  • character to complex
    • "3.4+4i" ==> 3.4+4i
    • "a" ==> NA (you can’t coerce "a" to any number, reasonably)
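
Here is a short sketch of a few of these downward coercions done explicitly. The name z is just an illustrative variable, and suppressWarnings() hides the “NAs introduced by coercion” warning that the last conversion would otherwise print.

```r
as.logical(c(0, 1, 78.2, 0.0001, -563.3))   # numeric to logical
#> [1] FALSE  TRUE  TRUE  TRUE  TRUE
as.complex("3.4+4i")                        # character to complex
#> [1] 3.4+4i
z <- suppressWarnings(as.complex("a"))      # no sensible number, so NA
z
#> [1] NA
```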

Coercion Down More Than One Step

Important point: it doesn’t necessarily go through intermediate steps:

  • complex to logical (0 ==> FALSE, anything else ==> TRUE)
    • 0+0i ==> FALSE
    • 0+2i ==> TRUE
    • 5+0i ==> TRUE
    • 5+9i ==> TRUE
  • character to logical
    • "TRUE" ==> TRUE
    • "FALSE" ==> FALSE
    • "1" ==> NA (yikes! if it went through numeric you’d get something different!)
    • "0" ==> NA
  • character to numeric
    • "56.764" ==> 56.764
    • "4+8i" ==> NA (with a warning; it does not go through complex, where the imaginary part would be dropped to give 4)
    • "fred" ==> NA
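
The character-to-logical case is worth trying yourself. The sketch below contrasts the direct coercion with a detour through numeric, and shows which spellings as.logical() recognizes; the test strings are just examples.

```r
as.logical("1")               # character straight to logical
#> [1] NA
as.logical(as.numeric("1"))   # detour through numeric
#> [1] TRUE
as.logical(5+9i)              # complex to logical
#> [1] TRUE
as.logical(c("TRUE", "true", "T", "yes"))
#> [1] TRUE TRUE TRUE   NA
```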

Functions For Explicit Coercion

There is a whole family of functions for coercing objects between different modes (or different types) that take the form as.something:

  • as.logical(x)
  • as.numeric(x)
  • as.integer(x) # integer is not a mode of its own; it is a type within the numeric mode
  • as.complex(x)
  • as.character(x)

As expected, these are vectorized—they coerce every element of the vector to the desired mode.
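
A quick sketch of that vectorized behavior, with made-up inputs. Note in particular that as.integer() truncates toward zero rather than rounding.

```r
as.integer(c("1", "2", "3"))
#> [1] 1 2 3
as.integer(c(3.9, -3.9))     # truncates toward zero
#> [1]  3 -3
as.character(c(1.5, 2, 3))
#> [1] "1.5" "2"   "3"
as.logical(c(0, 1, 2))
#> [1] FALSE  TRUE  TRUE
```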

Missing Data and Special Values in R

We saw NA up above. That means “Not Available” and it denotes missing data.

There are also two more interesting values:

  1. Inf (-Inf) means ∞ (or −∞) and arises from things like 1/0 (which gives Inf) or log(0) (which gives -Inf).
  2. NaN means “Not a Number” and it arises from situations where you can’t evaluate something and it doesn’t have an obvious limit. Like 0/0 or Inf/-Inf or 0*Inf.
  • If you wish to test whether something is NA or NaN, you have is.na(x) and is.nan(x), which return logical vectors.
  • The same goes for testing whether things are finite or infinite, with is.finite(x) and is.infinite(x):

    x <- c(NA, 2, Inf, 4, NaN, 6)
    
    is.nan(x) # only the NaN
    #> [1] FALSE FALSE FALSE FALSE  TRUE FALSE
    
    is.na(x) # both NA and NaN
    #> [1]  TRUE FALSE FALSE FALSE  TRUE FALSE
    
    is.infinite(x) # only Inf or -Inf
    #> [1] FALSE FALSE  TRUE FALSE FALSE FALSE
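
To round this out, here is a small sketch showing how the special values arise and how is.finite() treats the same vector. The final line is just one illustrative use: keeping only the ordinary numbers before summarizing.

```r
1/0
#> [1] Inf
log(0)
#> [1] -Inf
0/0
#> [1] NaN
Inf - Inf
#> [1] NaN

x <- c(NA, 2, Inf, 4, NaN, 6)
is.finite(x)   # FALSE for NA, NaN, Inf and -Inf
#> [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE
mean(x[is.finite(x)])
#> [1] 4
```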

Modes of Missing Data

Here is something to be aware of: missing values, like non-missing values, carry around their mode. Try this:

x <- c(1, 2, NA, 4, "5")
x
#> [1] "1" "2" NA  "4" "5"

x[3]  # this extracts the third element of x
#> [1] NA

c(10,20,30,x[3])
#> [1] "10" "20" "30" NA

c(10, 20, 30, NA)  # this is a "fresh" NA, no coercion
#> [1] 10 20 30 NA
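
You can make the hidden mode of an NA visible with typeof(). In this sketch, NA_character_ is one of R's built-in typed missing-value constants (NA_integer_, NA_real_ and NA_complex_ are the others).

```r
typeof(NA)              # a "fresh" NA is logical, the least general mode
#> [1] "logical"
typeof(NA_character_)   # a missing value that is already character
#> [1] "character"

x <- c(1, 2, NA, 4, "5")
typeof(x[3])            # this NA was coerced along with everything else
#> [1] "character"
typeof(c(10, 20, 30, NA))
#> [1] "double"
```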

Vectorization

  • In R, the term vectorization refers to the fact that, in many cases, when you apply a function to a vector, it applies the function to every element of the vector.
  • This is apparent in many of the operators and we will see it in plenty of other functions, too.
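
A few base R functions make the idea concrete. This is only a sketch with made-up inputs; each call returns one result per element of the input vector.

```r
sqrt(c(1, 4, 9, 16))
#> [1] 1 2 3 4
nchar(c("fred", "foo", "boing"))
#> [1] 4 3 5
toupper(c("fred", "foo"))
#> [1] "FRED" "FOO"
```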

Most Operators are Vectorized

This is incredibly important! All the mathematical operators (like +, -, and *), the logical operators (like & (AND) and | (OR)), and the comparison operators (like < and >) are hungry to operate element-wise on every element of a vector. Example:

fish.lengths <- c(121, 95, 87, 142)
fish.weights <- c(1011, 505, 702, 900)
fish.fatness <- fish.weights / fish.lengths
fish.fatness
#> [1] 8.355372 5.315789 8.068966 6.338028
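
The comparison and logical operators behave the same way. The sketch below reuses the fish vectors; the cutoffs of 100 and 950 and the name long.and.heavy are arbitrary choices for illustration.

```r
fish.lengths <- c(121, 95, 87, 142)
fish.weights <- c(1011, 505, 702, 900)

fish.lengths > 100                   # one comparison per element
#> [1]  TRUE FALSE FALSE  TRUE

long.and.heavy <- (fish.lengths > 100) & (fish.weights > 950)
long.and.heavy                       # element-wise AND
#> [1]  TRUE FALSE FALSE FALSE

fish.lengths[fish.lengths > 100]     # a logical vector can pick out elements
#> [1] 121 142
```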

Vectorization is so important…

That we are going to open up a whole new lecture that starts with it.

