Atomic Data Types and Coercion
Basic Data “Modes” of R
There are four main “modes” of scalar data, in order from least to most general:
- logical can take two values: TRUE and FALSE, which can be abbreviated, when you type them, as T and F.
- The numeric mode comes in two flavors: “integer” and “numeric” (real numbers). Examples: 1, 3.14, 8.2, 10, etc.
- complex: these are complex numbers of the form a + bi where a and b are real numbers and $i=\sqrt{-1}$. Examples: 3.2+7.3i, 4+0i
- character: these take values that are often called “strings” in other languages. Examples: "fred", "foo", "bar", "boing".
There is also a raw mode which refers to raw bytes of data, but we won’t concern ourselves with that for now.
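To see the modes listed above for yourself, here is a small sketch using mode() and typeof(). The list `vals` is just an illustration; note that a bare 1 is stored as a double, so the example writes 1L to get an integer.

```r
# mode() reports the mode; typeof() reports the underlying storage type.
vals <- list(TRUE, 1L, 3.14, 3.2+7.3i, "fred")  # illustrative values only

modes <- sapply(vals, mode)
types <- sapply(vals, typeof)

modes
#> [1] "logical"   "numeric"   "numeric"   "complex"   "character"
types
#> [1] "logical"   "integer"   "double"    "complex"   "character"
```

Both the integer and the real number report mode "numeric"; only typeof() tells the two flavors apart.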
Atomic Vectors
The atomic vector is a fundamental data structure in R: a vector in which every element is of the same mode. For example:
x <- c(1,2,3,5,7)
x
#> [1] 1 2 3 5 7
Pretty basic stuff, until you start accidentally, or intentionally, mixing modes.
x <- c(1,2,3,5,7,"11")
x
#> [1] "1" "2" "3" "5" "7" "11"
The mode of everything is coerced to the mode of the element with the most general mode, and this can really bite you in the rear if you don’t watch out!
Coercion
- All the data in an atomic vector must be of the same mode
- If data are added so that modes are mixed, then the whole vector gets changed so that everything is of the most general mode
Example:
# simple atomic vector of mode numeric
x <- 1:6
x
#> [1] 1 2 3 4 5 6
# now change one to mode character and see what happens
x[1] <- "tweezer"
x
#> [1] "tweezer" "2" "3" "4" "5" "6"
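The "most general mode wins" rule can be checked directly with c(). This is a minimal sketch; the vector `readings` is made up, and the is.numeric() check is only one possible way to catch an accidental coercion before doing arithmetic.

```r
# c() coerces its arguments to the most general mode among them.
m1 <- mode(c(TRUE, 2))               # logical + numeric
m2 <- mode(c(TRUE, 2, 3i))           # ... + complex
m3 <- mode(c(TRUE, 2, 3i, "four"))   # ... + character
c(m1, m2, m3)
#> [1] "numeric"   "complex"   "character"

# A cheap guard before arithmetic: one stray string changes the whole vector.
readings <- c(1, 2, 3, 5, 7, "11")   # hypothetical data
is.numeric(readings)
#> [1] FALSE
```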
Coercion Up One Step
- logical to numeric:
  - TRUE ==> 1
  - FALSE ==> 0
- numeric to complex:
  - 6.4 ==> 6.4+0i
  - 5 ==> 5+0i
- complex to character:
  - 6.4+0i ==> "6.4+0i"
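The logical-to-numeric step is the one you will lean on most often, because it lets you count and average TRUEs. A small sketch, where the vector `caught` is invented for illustration:

```r
caught <- c(TRUE, FALSE, TRUE, TRUE)  # hypothetical: was each fish caught?

n_caught <- sum(caught)      # TRUE counts as 1, FALSE as 0
n_caught
#> [1] 3
prop_caught <- mean(caught)  # proportion of TRUEs
prop_caught
#> [1] 0.75

# The other single steps, done explicitly:
as.complex(6.4)
#> [1] 6.4+0i
as.character(6.4+0i)
#> [1] "6.4+0i"
```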
Coercion Up Two Or More Steps
Note that the coercion sometimes “jumps over the intermediate steps”.
- logical to complex
  - TRUE ==> 1+0i
  - FALSE ==> 0+0i
- logical to character (it does not go FALSE ==> 0 ==> "0")
  - TRUE ==> "TRUE"
  - FALSE ==> "FALSE"
- numeric to character
  - 7 ==> "7"
  - 3.1415 ==> "3.1415"
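A quick way to convince yourself that the intermediate steps are skipped is to compare the direct coercion with one that is forced through numeric by hand. A minimal sketch:

```r
direct <- as.character(TRUE)              # logical straight to character
direct
#> [1] "TRUE"
detour <- as.character(as.numeric(TRUE))  # forced through numeric first
detour
#> [1] "1"

# c() takes the direct route for each element:
mixed <- c(TRUE, 7, "x")
mixed
#> [1] "TRUE" "7"    "x"

as.complex(TRUE)
#> [1] 1+0i
```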
Coercion Down One Step
Sometimes things get coerced “downwards” (i.e., toward less general data types).
If the coercion doesn’t make sense you end up with NA, which is how R denotes missing data.
- numeric to logical (0 ==> FALSE, anything else ==> TRUE); always “makes sense”
  - 0 ==> FALSE
  - 1 ==> TRUE
  - 78.2 ==> TRUE
  - 0.0001 ==> TRUE
  - -563.3 ==> TRUE
- complex to numeric (discards the imaginary part and warns about it when that part is non-zero!)
  - 3.4+0i ==> 3.4
  - 5.6+7.6i ==> 5.6 (+ a warning)
# witness a warning:
as.numeric(7.4+5i)
#> Warning: imaginary parts discarded in coercion
#> [1] 7.4
- character to complex
  - "3.4+4i" ==> 3.4+4i
  - "a" ==> NA (you can’t coerce "a" to any number, reasonably)
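The downward coercions above can be tried out with the as.* functions. This is only a sketch: suppressWarnings() is used here to keep the output quiet, and Re() is shown as one way to take the real part of a complex number on purpose, without a warning.

```r
lg <- as.logical(c(0, 1, 78.2, 0.0001, -563.3))
lg
#> [1] FALSE  TRUE  TRUE  TRUE  TRUE

# character to complex: a string that is not a number becomes NA
z <- suppressWarnings(as.complex(c("3.4+4i", "a")))
is.na(z)
#> [1] FALSE  TRUE

# If you really do want just the real part, ask for it explicitly:
re_part <- Re(7.4+5i)
re_part
#> [1] 7.4
```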
Coercion Down More Than One Step
Important point: it doesn’t necessarily go through intermediate steps:
- complex to logical (0 ==> FALSE, anything else ==> TRUE)
  - 0+0i ==> FALSE
  - 0+2i ==> TRUE
  - 5+0i ==> TRUE
  - 5+9i ==> TRUE
- character to logical
  - "TRUE" ==> TRUE
  - "FALSE" ==> FALSE
  - "1" ==> NA (yikes! if it went through numeric you’d get something different!)
  - "0" ==> NA
- character to numeric
  - "56.764" ==> 56.764
  - "4+8i" ==> NA (with a warning; the string is not read as a complex number first, so use Re(as.complex("4+8i")) if you want the 4)
  - "fred" ==> NA
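The "yikes" case above is easy to reproduce. A minimal sketch comparing the direct coercion of "1" with a hand-made detour through numeric:

```r
direct <- as.logical("1")              # character straight to logical
direct
#> [1] NA
detour <- as.logical(as.numeric("1"))  # detour through numeric
detour
#> [1] TRUE

# complex to logical: FALSE only when the whole number is zero
cl <- as.logical(c(0+0i, 0+2i, 5+0i, 5+9i))
cl
#> [1] FALSE  TRUE  TRUE  TRUE
```

If your data codes TRUE and FALSE as the strings "1" and "0", convert to numeric first, as in the second line.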
Functions For Explicit Coercion
There is a whole family of functions for coercing objects between different modes (or different types) that take the form as.something:
- as.logical(x)
- as.numeric(x)
- as.integer(x) (integer is not a mode; it is a subclass of the numeric mode)
- as.complex(x)
- as.character(x)
As expected, these are vectorized—they coerce every element of the vector to the desired mode.
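Here is a short sketch of that element-by-element behavior. The vector `raw_input` is made up, and suppressWarnings() is only there to hide the "NAs introduced by coercion" warning that the unparseable element would otherwise trigger.

```r
raw_input <- c("12", "7.5", "n/a")   # hypothetical values read in as text

nums <- suppressWarnings(as.numeric(raw_input))
nums
#> [1] 12.0  7.5   NA

# as.integer() drops the fractional part (it truncates toward zero):
ints <- as.integer(c(3.9, -3.9))
ints
#> [1]  3 -3
```

Each element is converted on its own, so one bad value yields a single NA rather than spoiling the whole vector.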
Missing Data and Special Values in R
We saw NA up above. That means “Not Available” and it denotes missing data.
There are also two more interesting values:
- Inf (-Inf) means ∞ (or −∞) and arises from things like 1/0 (which gives Inf) or log(0) (which gives -Inf).
- NaN means “Not a Number” and it arises from situations where you can’t evaluate something and it doesn’t have an obvious limit, like 0/0 or Inf/-Inf or 0*Inf.
- If you wish to test whether something is NaN or NA, you have is.na(x) and is.nan(x), which return logical vectors. The same goes for testing whether things are finite or infinite, with is.finite(x) and is.infinite(x):
x <- c(NA, 2, Inf, 4, NaN, 6)
is.nan(x) # only the NaN
#> [1] FALSE FALSE FALSE FALSE TRUE FALSE
is.na(x) # both NA and NaN
#> [1] TRUE FALSE FALSE FALSE TRUE FALSE
is.infinite(x) # only Inf or -Inf
#> [1] FALSE FALSE TRUE FALSE FALSE FALSE
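These tests are most useful for filtering. The sketch below reuses the same x and keeps only the ordinary numbers; using is.finite() as the filter is one option among several.

```r
x <- c(NA, 2, Inf, 4, NaN, 6)

ok <- is.finite(x)   # FALSE for NA, NaN, Inf and -Inf
ok
#> [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE

clean <- x[ok]       # keep only the finite values
clean
#> [1] 2 4 6
mean(clean)
#> [1] 4

n_missing <- sum(is.na(x))   # counts both NA and NaN
n_missing
#> [1] 2
```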
Modes of Missing Data
Here is something to be aware of: missing values, like non-missing values, carry around their mode. Try this:
x <- c(1, 2, NA, 4, "5")
x
#> [1] "1" "2" NA "4" "5"
x[3] # this extracts the third element of x
#> [1] NA
c(10,20,30,x[3])
#> [1] "10" "20" "30" NA
c(10, 20, 30, NA) # this is a "fresh" NA, no coercion
#> [1] 10 20 30 NA
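You can inspect the mode that an NA is carrying with typeof(). R also provides typed missing-value constants (NA_integer_, NA_real_, NA_character_). The sketch below shows them, along with one possible way to stop a character NA from dragging a numeric vector along with it.

```r
typeof(NA)              # a "fresh" NA is logical, the least general mode
#> [1] "logical"
typeof(NA_real_)
#> [1] "double"
typeof(NA_character_)
#> [1] "character"

x <- c(1, 2, NA, 4, "5")
typeof(x[3])            # this NA was coerced along with its neighbors
#> [1] "character"

# Coerce the NA back before combining, and the result stays numeric:
fixed <- c(10, 20, 30, as.numeric(x[3]))
fixed
#> [1] 10 20 30 NA
```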
Vectorization
- In R, the term vectorization refers to the fact that, in many cases, when you apply a function to a vector, it applies the function to every element of the vector.
- This is apparent in many of the operators and we will see it in plenty of other functions, too.
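A few built-in functions illustrate the point. This is just a sketch; the inputs are arbitrary.

```r
roots <- sqrt(c(1, 4, 9, 16))    # one square root per element
roots
#> [1] 1 2 3 4

lens <- nchar(c("fred", "foo", "bar", "boing"))   # one length per string
lens
#> [1] 4 3 3 5

up <- toupper(c("foo", "bar"))
up
#> [1] "FOO" "BAR"
```

No loop is needed in any of these calls: the function is handed the whole vector and returns a vector of the same length.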
Most Operators are Vectorized
This is incredibly important! All the mathematical operators, like +, -, *, and the logical operators, like & (AND), | (OR), and the comparison operators, like < and > are hungry to operate element-wise on every element of a vector. Example:
fish.lengths <- c(121, 95, 87, 142)
fish.weights <- c(1011, 505, 702, 900)
fish.fatness <- fish.weights / fish.lengths
fish.fatness
#> [1] 8.355372 5.315789 8.068966 6.338028
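The comparison and logical operators work on the same fish vectors in the same element-wise way. In this sketch the cutoffs (100, 800, 700) are arbitrary numbers chosen for illustration.

```r
fish.lengths <- c(121, 95, 87, 142)
fish.weights <- c(1011, 505, 702, 900)

long <- fish.lengths > 100               # one comparison per fish
long
#> [1]  TRUE FALSE FALSE  TRUE

long_and_heavy <- fish.lengths > 100 & fish.weights > 800
long_and_heavy
#> [1]  TRUE FALSE FALSE  TRUE

long_or_heavy <- fish.lengths > 100 | fish.weights > 700
long_or_heavy
#> [1]  TRUE FALSE  TRUE  TRUE

# A logical vector can pick out the matching elements:
fish.weights[long_and_heavy]
#> [1] 1011  900
```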
Vectorization is so important…
That we are going to open up a whole new lecture that starts with it.