String manipulation with R
Intro
R comes with some fairly well-developed tools for string manipulation and the application of regular expressions. If you are a fan of Perl or Python, or you have a penchant for Unix utilities like sed and awk, then you might find that you prefer those languages/workflows for doing your text processing. However, you can do quite a lot with R, and if all of your other analyses are done in R, there is some advantage to also doing your text processing within R.
If you are not yet familiar with any other text-processing systems and R is your first real programming language, then you might as well learn string manipulation in R!
If you are used to a “line-by-line” loop-based processing paradigm like that used in awk, then R’s focus on vectors of strings can take some getting used to. However, once you get comfortable with it and the sorts of operations that are useful within that framework, it can be quite powerful.
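For instance, here is a minimal sketch (not from the original) of that vectorized mindset: rather than looping line by line, you operate on a whole character vector at once.

lines <- c("alpha 1", "beta 22", "gamma 3")
# an awk-ish workflow would loop over lines; in R we transform the whole vector
nchar(lines)       # number of characters in every line at once
#> [1] 7 7 7
grepl("22", lines) # logical vector: which lines contain "22"
#> [1] FALSE  TRUE FALSE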
The base R package comes with a full set of string manipulation functions that are quite versatile. However, they don't have consistent interfaces and they come with a lot of options that most people will never need to learn. So, while we are learning this stuff, we will focus on using the functions in Hadley Wickham's package stringr, which preserves about 95% of the functionality of the base R string functions but doesn't carry around a lot of extra function-parameter shrubbery that will trip up a lot of users. For a quick intro to stringr, check out the package's introductory vignette.
To illustrate these ideas we will use a real example from my own work: processing a text file containing data specifications for the coded wire tag data base.
Prerequisites
- To work through the examples you will need the stringr package. Please download/install the necessary packages before coming to class:

install.packages(c("stringr", "zoo"))

- Pull the most recent version of the rep-res-course repo just before coming to class.
Goals for this hour:
- Present example text-processing problem
- Briefly cover the idea of regular expressions
- Show how to use functions in stringr to solve the problem
Example problem
RMIS
- The Regional Mark Processing Center (RMPC) maintains data bases having to do with catch, sampling, release and recovery of salmon and coded wire tags. These data bases are known collectively as the Regional Mark Information System (RMIS).
- You can query the data bases and get information from http://www.rmpc.org
- Many of the fields in the data base use codes to denote certain values. For example, in the species column in each data base, 1 = Chinook salmon, 2 = Coho, 3 = Steelhead, etc.
- If you are going to be analyzing these data in R, or plotting them, or just looking at them, it might be nice to recode such columns as factors with levels that are the actual species names. (This beats trying to remember all sorts of codes).
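As a quick illustration (a sketch, not the document's code), recoding such a column with factor() might look like this, using the species codes above:

# hypothetical species column holding RMIS codes
species_code <- c(1, 2, 1, 3)
species <- factor(species_code, levels = c(1, 2, 3),
                  labels = c("Chinook", "Coho", "Steelhead"))
species
#> [1] Chinook   Coho      Chinook   Steelhead
#> Levels: Chinook Coho Steelhead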
The PSC data specification
- First, the good news:
- A comprehensive list of codes and values is available at http://www.rmpc.org/files/PSC_V41_Specification.pdf
- Now the bad news:
- The RMPC does not maintain a listing of these codes in an easily parseable (i.e. reproducible!) format.
- This means that you have to extract all the codes from a PDF file, which is a rather ugly proposition.
- The folks who run the RMPC are pretty cool, though, and were able to supply me with the Microsoft Word document from which the PDF file was created.
- MS Word, as we've all learned in this course, is not a reproducible or easily-parseable format, but we are going to do our best with it.
PSC Spec in text
- I was able to copy and paste the specification, make sure it had Unix (rather than Mac) line endings, and save it.
- A copy is in the course repo at:
data/PSC-Specification-ver-4.1_7-1-14.txt
- Go ahead and have a look at it in a suitably good text editor. If you are using TextWrangler on your Mac, you can choose View -> Text Display -> Show Invisibles to see which characters are tabs and which are spaces, etc.
- Here is a little section of the 2,855 lines in the file:
CHAPTER 2
Release Data
PSC PSC Common Name Max Reqd Format /Use Description & Validation Rules.......................................................................................................................................
Fld # and Data Field Name Cols
1 Record Code 1 Yes Lookup Code to indicate the CWT data file classification (class) of this individual record. Must match one of the following:
record_code ’T’ =Tagged Release record
’N’ =Non-Associated Release record
See chapter 16 for further discussion of the use of this field.
2 Format Version 4 Yes ’4.1’ Format version used to report data
format_version Must have the value: ’4.1’
3 Submission Date 8 Yes YYYYMMDD Date of submission for this set of records. Date should be close to actual date when this row is sent to the Mark Center
submission_date Must have the same value for all records in this data submission
Should match submission_date in corresponding Description file
4 Reporting Agency 10 Yes Lookup Abbreviation for reporting agency of this dataset for this data exchange
reporting_agency Must contain an agency code defined in chapter 8
Must be the same for all records
5 Release Agency 10 Yes Lookup Abbreviations for tagging agencies
release_agency Must contain an agency code defined in chapter 8
6 Coordinator 2 Yes Lookup Reporting coordinator for the release group of this individual record
coordinator Must match one of the following:
’01’ =ADFG (S.E. Alaska)
’02’ =NMFS – Alaska
’03’ =CDFO
’04’ =WDFW
’05’ =ODFW
’06’ =NMFS – Columbia River
’07’ =USFWS
’08’ =CDFG
’09’ =BCFW
’10’ =IDFG
’11’ =YAKA
’12’ =ADFG (S. Central AK)
’13’ =MIC (Metlakatla, AK)
’14’ =NWIFC
’15’ =CRITFC
‘16’ =NEZP
‘17’ =QDNR
‘18’ =STIL
- What a freakshow! I can’t believe that we have to do this! (But on the other hand it is kind of fun…)
A closer look at the format
The specification is organized into a series of chapters that start with lines that look like:
CHAPTER 2
Chapters 2 through 6 have information about the five data bases I’ll be working with:
Chapter 2 – Release Data
Chapter 3 – Recovery Data
Chapter 4 – Catch/Sample Data
Chapter 5 – Catch & Effort Data
Chapter 6 – Location Data
Within every chapter there are multiple entries that give the names of the fields (columns). For example:
4 Reporting Agency 10 Yes Lookup Abbreviation for reporting agency of this dataset for this data exchange
reporting_agency Must contain an agency code defined in chapter 8
Must be the same for all records
5 Release Agency 10 Yes Lookup Abbreviations for tagging agencies
release_agency Must contain an agency code defined in chapter 8

This shows that field 4 is reporting_agency, field 5 is release_agency, etc. Some of the fields have lists of codes that look like this:

6 Coordinator 2 Yes Lookup Reporting coordinator for the release group of this individual record
coordinator Must match one of the following:
’01’ =ADFG (S.E. Alaska)
’02’ =NMFS – Alaska
’03’ =CDFO
’04’ =WDFW
’05’ =ODFW
’06’ =NMFS – Columbia River
’07’ =USFWS
’08’ =CDFG
’09’ =BCFW
’10’ =IDFG
’11’ =YAKA
’12’ =ADFG (S. Central AK)
’13’ =MIC (Metlakatla, AK)
’14’ =NWIFC
’15’ =CRITFC
‘16’ =NEZP
‘17’ =QDNR
‘18’ =STIL
These codes take the form of a string (usually a single character or a number) wrapped in special "smart" quotes that are sometimes upside down and sometimes right side up; then a TAB; then an equals sign followed by another string (the value) with no space after the equals sign.
Our mission
- For every field (in each of the five data bases) that has specialized codes, we want to know what the codes are and what they mean.
- In the end we will want to have a tidy data frame of these that looks something like this:
chapter | field | code | value |
---|---|---|---|
Release Data | coordinator | “01” | ADFG (S.E. Alaska) |
Release Data | coordinator | “02” | NMFS – Alaska |
Release Data | coordinator | “03” | CDFO |
Release Data | coordinator | “04” | WDFW |
- If we wanted to be old school about it, we could just spend two weeks copying and pasting everything, and hope that the spec doesn't change any time soon, because we wouldn't want to waste another two weeks doing it all again (even if our wrists were not totally trashed after the first round).
- But we are going to try to automate this.
Regular expressions
Before we can get rolling on this we are going to have to cover some regular expressions.
What is a regular expression?
- You can think of a regular expression as a "fuzzy find" pattern that lets you match text strings while including "wildcards" and "character classes", and maybe even capturing the parts of the text that match certain pieces of your pattern.
- Here is a simple example.
library(stringr) # load the package
# now make a character vector
vec <- c("fred", "foo", "food", "bed", "reefed")
# see if any matches "fred"
str_detect(vec, "fred")
#> [1] TRUE FALSE FALSE FALSE FALSE
# see if any have an "f" followed by a "d"
# after 0 or more other characters
str_detect(vec, "f.*d")
#> [1] TRUE FALSE TRUE FALSE TRUE
# see if any start with an "f"
str_detect(vec, "^f")
#> [1] TRUE TRUE TRUE FALSE FALSE
# see if any end with an "o"
str_detect(vec, "o$")
#> [1] FALSE TRUE FALSE FALSE FALSE
Holy cow! That looks funky
- Yes, there is a whole pantheon of regular expression “metacharacters” that you will want to get familiar with.
- Entire books have been written on regular expressions, so I won’t attempt to teach you all of it now, but I will point out some things that you will want to know.
- The simplest pattern is a single character. So, a regular expression of "a" says: match anything with an "a" in it.
- You can specify how many times a pattern must occur to be considered a match. For example, maybe you had strings of DNA and you wanted to detect runs of 7 "A"'s in a row. The regular expression that matches that would be "A{7}".
- A pattern can be a wildcard character, which can take many different forms, like:
  - . = anything
  - [a-z] = a character class consisting of all lowercase letters
  - [[:lower:]] = another way of specifying the same thing as [a-z], but one that respects the locale.
- You can group patterns together by wrapping them in ( and ). Then you can treat those grouped things as patterns that must occur a certain number of times to be considered a match, etc. (see the sketch below).
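Here is a small sketch (not from the original) of grouping plus a quantifier in action:

library(stringr)
# the group (ha) must occur exactly 3 times, anchored to the whole string
str_detect(c("hahaha", "haha", "hahahaha"), "^(ha){3}$")
#> [1]  TRUE FALSE FALSE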
We will see examples of these things as we solve our problem at hand. For now, just know that you can build up regular expressions to search for very complex patterns in strings using this regex syntax that can seem quite baffling or even daunting at first.
For now, here are some important things to know
# special meaning is given to these characters:
. \ | ( ) [ ] { } ^ $ * + ?
. # the dot matches any character (or number or punctuation, etc)
\ # backslash is often used to "escape" metacharacters (and in R that could mean two backslashes)
( ) # these guys are used to group patterns and also
# to capture the parts of a string that match grouped subpatterns
[ ] # you can specify character classes (like "any lower case letter" inside these)
? # The preceding item is optional and will be matched at most once.
* # The preceding item will be matched zero or more times.
+ # The preceding item will be matched one or more times.
{n} # The preceding item is matched exactly n times.
{n,} # The preceding item is matched n or more times.
{n,m} # The preceding item is matched at least n times, but not more than m times.
For a more complete description, do ?regex and see the "Extended Regular Expressions" section.
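One point worth a concrete example (a sketch, not from the original) is the escaping business: the regex engine needs \. to match a literal dot, and writing that inside an R string requires two backslashes.

library(stringr)
vec <- c("file.txt", "fileXtxt")
str_detect(vec, "file.txt")   # the unescaped dot matches any character
#> [1] TRUE TRUE
str_detect(vec, "file\\.txt") # "\\." escapes the dot: only a literal "." matches
#> [1]  TRUE FALSE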
Getting down to business
Quick note on stringr functions: most of them have the syntax of
str_something( string, pattern )
where something is like "detect" or "match" or "replace", and pattern is a regular expression passed in as a quoted string (or even a character vector in some cases).
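For example (a quick sketch, not from the original):

str_replace("food", "o+", "_") # replace the first match of the pattern
#> [1] "f_d"
str_match("food", "f(o+)d")    # whole match, then each captured group
#>      [,1]   [,2]
#> [1,] "food" "oo"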
Step 1: read the strings in!
The delightful readLines() function will let us read each line of PSC-Specification-ver-4.1_7-1-14.txt into a character vector:
spec <- readLines("data/PSC-Specification-ver-4.1_7-1-14.txt")
length(spec)
#> [1] 2855
head(spec)
#> [1] ""
#> [2] "CHAPTER 1"
#> [3] "INTRODUCTION, DEFINITIONS, AND RULES"
#> [4] ""
#> [5] "CWT data must be exchanged in the form of a PSC Format Version 4.1 dataset."
#> [6] "The definition and specification of PSC Format Version 4.1 is described in this set of documents."
Step 2: figure out which chapter we are in
Here is what our strategy is going to be for dealing with this
- Start with a vector of NA’s of length(spec)
- Put values in where each “CHAPTER” is found
- Propagate those values forward, overwriting any NAs with the value that came before.
In this way we get a vector telling us, for each line of the file, which chapter we are in.
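If na.locf() from the zoo package is new to you, here is a tiny sketch (not from the original) of what that rolling forward looks like:

library(zoo)
x <- c(NA, "ch1", NA, NA, "ch2", NA)
na.locf(x, na.rm = FALSE) # each NA is replaced by the last non-NA value before it
#> [1] NA    "ch1" "ch1" "ch1" "ch2" "ch2"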
Let’s do it for the chapter names:
chapter <- rep(NA, length(spec)) # make a vector of NAs
chap_positions <- which(str_detect(spec, "^CHAPTER"))
# see what that looks like:
chap_positions
#> [1] 2 159 492 835 1132 1280 1406 1466 1666 2009 2094 2189 2262 2589
#> [15] 2698 2775
# now, the titles of the chapters are on the line following "^CHAPTER"
# so let's get those and put them in and then roll them forward
chapter[chap_positions] <- spec[chap_positions + 1]
library(zoo) # if not already loaded; provides na.locf()
chapter_names <- na.locf(chapter, na.rm = FALSE) # rolls the last value forward over NAs
# if you are following along, do this to see what we have done:
# cbind(chapter_names)
Step 2.5: Let’s get the chapter numbers while we are at it
Now, we want to extract the number of each chapter. We can use a regular expression that finds strings in spec that:
- first have to match “^CHAPTER”
- but then also have at least one space
- and then one or more digits
AND we can use the str_match function to pull out the part that matches the 3rd part of the expression (i.e. the one or more digits). Check out ?str_match.
Here we go!
# grab them
chap_nums <- str_match(spec, "^CHAPTER +([[:digit:]]+)")
# notice that we have wrapped the part that matches one or more digits in parentheses!
# note that str_match returns a matrix:
head(chap_nums)
#> [,1] [,2]
#> [1,] NA NA
#> [2,] "CHAPTER 1" "1"
#> [3,] NA NA
#> [4,] NA NA
#> [5,] NA NA
#> [6,] NA NA
# roll them forward
chapter_numbers <- na.locf(chap_nums[,2], na.rm = FALSE)
# see what we have done so far looks like
head(cbind(chapter_numbers, chapter_names))
#> chapter_numbers chapter_names
#> [1,] NA NA
#> [2,] "1" "INTRODUCTION, DEFINITIONS, AND RULES"
#> [3,] "1" "INTRODUCTION, DEFINITIONS, AND RULES"
#> [4,] "1" "INTRODUCTION, DEFINITIONS, AND RULES"
#> [5,] "1" "INTRODUCTION, DEFINITIONS, AND RULES"
#> [6,] "1" "INTRODUCTION, DEFINITIONS, AND RULES"
Step 3: Figure out which field each line of the file is in
Our main concern is going to be extracting codes that have to do with fields that we can identify according to which chapter they are in (which corresponds to the data base) and then which field name we are in.
Happily, it is easy to find the beginnings of field heading sections. They are the only lines in the file I see that start with one or two digits, followed immediately by a TAB. (At least that is the case for the five chapters we are interested in.)
Recall that these things look like:
2 Format Version 4 Yes ’4.1’ Format version used to report data
format_version Must have the value: ’4.1’
3 Submission Date 8 Yes YYYYMMDD Date of submission for this set of records. Date should be close to actual date when this row is sent to the Mark Center
submission_date Must have the same value for all records in this data submission
Should match submission_date in corresponding Description file
So, we are going to get those positions, and then try to extract the field names, which appear on the following lines, stuck between TABS:
field_starts <- which(str_detect(spec, "^[[:digit:]]{1,2}\t"))
field_names <- str_match(spec[field_starts + 1], "^\t([_[:alnum:]]+)\t")
field_names[,2]
#> [1] "record_code" "format_version"
#> [3] "submission_date" "reporting_agency"
#> [5] "release_agency" "coordinator"
#> [7] "tag_code_or_release_id" "tag_type"
#> [9] "first_sequential_number" "last_sequential_number"
#> [11] "related_group_type" "related_group_id"
#> [13] "species" "run"
#> [15] "brood_year" "first_release_date"
#> [17] "last_release_date" "release_location_code"
#> [19] "hatchery_location_code" "stock_location_code"
#> [21] "release_stage" "rearing_type"
#> [23] "study_type" "release_strategy"
#> [25] "avg_weight" "avg_length"
#> [27] "study_integrity" "cwt_1st_mark"
#> [29] "cwt_1st_mark_count" "cwt_2nd_mark"
#> [31] "cwt_2nd_mark_count" "non_cwt_1st_mark"
#> [33] "non_cwt_1st_mark_count" "non_cwt_2nd_mark"
#> [35] "non_cwt_2nd_mark_count" "counting_method"
#> [37] "tag_loss_rate" "tag_loss_days"
#> [39] "tag_loss_sample_size" "tag_reused"
#> [41] "comments" "record_code"
#> [43] "format_version" "submission_date"
#> [45] "reporting_agency" "sampling_agency"
#> [47] "recovery_id" "species"
#> [49] "run_year" "recovery_date"
#> [51] "recovery_date_type" "period_type"
#> [53] "period" "fishery"
#> [55] "gear" "adclip_selective_fishery"
#> [57] "estimation_level" "recovery_location_code"
#> [59] "sampling_site" "recorded_mark"
#> [61] "sex" "weight"
#> [63] "weight_code" "weight_type"
#> [65] "length" "length_code"
#> [67] "length_type" "detection_method"
#> [69] "tag_status" "tag_code"
#> [71] "tag_type" "sequential_number"
#> [73] "Number" "sequential_row_number"
#> [75] "catch_sample_id" "sample_type"
#> [77] "sampled_maturity" "sampled_run"
#> [79] "sampled_length_range" "sampled_sex"
#> [81] "sampled_mark" "estimated_number"
#> [83] "record_code" "format_version"
#> [85] "submission_date" "reporting_agency"
#> [87] "sampling_agency" "catch_sample_id"
#> [89] "species" "catch_year"
#> [91] "period_type" "period"
#> [93] "first_period" "last_period"
#> [95] "fishery" "adclip_selective_fishery"
#> [97] "estimation_level" NA
#> [99] "detection_method" "sample_type"
#> [101] "sampled_maturity" "sampled_run"
#> [103] "sampled_length_range" "sampled_sex"
#> [105] "sampled_mark" "number_caught"
#> [107] NA "number_sampled"
#> [109] "number_estimated" "number_recovered_decoded"
#> [111] "number_recovered_no_cwts" "CWTs"
#> [113] "Unreadable" "Unresolved"
#> [115] NA "PseudoTags"
#> [117] "mr_1st_partition_size" "mr_1st_sample_size"
#> [119] NA "mr_1st_sample_obs_adclips"
#> [121] "mr_2nd_partition_size" "mr_2nd_sample_size"
#> [123] "Status" "mr_2nd_sample_obs_adclips"
#> [125] "mark_rate" "awareness_factor"
#> [127] NA "Adclips"
#> [129] "record_code" "format_version"
#> [131] "submission_date" "reporting_agency"
#> [133] "catch_effort_id" "catch_year"
#> [135] "period_type" "period"
#> [137] "landing_status" "catch_location_code"
#> [139] "harvest" "fisher"
#> [141] "catch_gear_group" "catch_gear"
#> [143] "species" "grade"
#> [145] "number_tickets" "catch_weight"
#> [147] "number_caught" "effort_type"
#> [149] "effort_quantity" "Adclip_selective_fishery"
#> [151] "record_code" "format_version"
#> [153] "submission_date" "reporting_agency"
#> [155] "location_code" "name"
#> [157] "latitude" "longitude"
#> [159] "psc_basin" "psc_region"
#> [161] "description" "record_code"
#> [163] "format_version" "submission_date"
#> [165] "reporting_agency" "submission_status"
#> [167] "file_type" "file_status"
#> [169] "first_year" "last_year"
#> [171] "description" NA
#> [173] NA NA
#> [175] NA NA
#> [177] NA NA
#> [179] NA NA
#> [181] NA NA
#> [183] NA NA
#> [185] NA NA
#> [187] NA NA
#> [189] NA NA
#> [191] NA NA
#> [193] NA NA
#> [195] NA NA
#> [197] NA NA
#> [199] NA NA
#> [201] NA NA
#> [203] NA NA
#> [205] NA NA
#> [207] NA NA
#> [209] NA NA
#> [211] NA NA
#> [213] NA NA
#> [215] NA NA
#> [217] NA NA
#> [219] NA NA
#> [221] NA NA
#> [223] NA NA
#> [225] NA NA
Hmmm, some of these are not looking right. The NAs in the early numbers are not so good, and the ones with uppercase letters look wrong. Let’s get the line numbers and look at the original file:
linenos_to_investigate <- field_starts[which(is.na(field_names[,2]) | str_detect(field_names[,2], "[[:upper:]]"))] + 1
linenos_to_investigate
#> [1] 750 942 1009 1039 1043 1047 1051 1055 1069 1097 1122 1127 1267 1688
#> [15] 1689 1690 1736 1737 1738 1743 1744 1830 1831 1832 1833 1834 1835 2208
#> [29] 2209 2210 2211 2212 2216 2217 2218 2219 2224 2225 2226 2227 2231 2232
#> [43] 2233 2234 2238 2239 2240 2241 2242 2246 2247 2251 2252 2253 2254 2258
#> [57] 2259 2260 2261 2575 2576 2577 2758 2761 2763 2766 2769 2771
When we take a look at all those line numbers in the original file, we see in one case a typo in the specification, and in a lot of other places the problem is due to lines being broken where they probably weren't supposed to be, etc. In other words, the sort of inconsistent formatting that is probably an unavoidable consequence of trying to scrape data out of a medium (i.e. an MS Word doc) that is unsuited to actually storing data.
This mirrors one of the themes of this course that we’ve seen a few times: If your stuff is in a Word file or an Excel file, you should not expect your research to be reproducible.
It would be preferable if those PSC data specifications lived in their own SQL table within the RMIS data base.
But anyway, from the looks of it, we can safely ignore those inconsistent spots, because those fields don’t actually have codes that we are going to be scraping out in the next step.
So, now we have to put those field_names into a fields vector and roll them forward over the NAs:
fields_0 <- rep(NA, length(spec))
fields_0[field_starts] <- field_names[,2]
fields <- na.locf(fields_0, na.rm = FALSE)
Now that this is done, let's see what we are working on so far. We have something looking like this:
head(na.omit(data.frame(chapter_numbers, chapter_names, fields, stringsAsFactors = FALSE)))
#> chapter_numbers chapter_names fields
#> 164 2 Release Data record_code
#> 165 2 Release Data record_code
#> 166 2 Release Data record_code
#> 167 2 Release Data record_code
#> 168 2 Release Data record_code
#> 169 2 Release Data format_version
Rippin! That looks tidy!
Step 4: Extract the codes and values
Now we just have to extract the lines that have the code definitions on them!
When we find occurrences of these things, it always seems that they follow a pattern of:
TAB + Quote + AlphaNumericOrDash + Quote + TAB + EqualsSign + ValueWithNoSpaceAfterEqualsSign
So, we can write a regex pattern for that. Some things must be noticed though:
- The single quotes are not simple: they can be upside down or right side up, and they are not consistent.
- AlphaNumericOrDash is risky. Let's start by matching one or more of any character between those quotation marks.
- All we can really do is plan to grab the value all the way out to the end of the line.
- Sometimes they do have a space after the equals sign.
Here is what our basic regex to capture each of those things looks like in the str_detect() function:
gotem <- str_detect(spec, "\t[‘’].+[‘’]\t=.*$")
# note that the quotes might not look different in the font in the HTML.
# see the actual code in RStudio to see what they look like.
sum(gotem) # how many lines of the spec match?
#> [1] 317
head(spec[gotem]) # what do some of them look like?
#> [1] "\trecord_code\t\t\t’T’\t=Tagged Release record"
#> [2] "\t\t\t\t’N’\t=Non-Associated Release record"
#> [3] "\t\t\t\t’01’\t=ADFG (S.E. Alaska)"
#> [4] "\t\t\t\t’02’\t=NMFS – Alaska"
#> [5] "\t\t\t\t’03’\t=CDFO"
#> [6] "\t\t\t\t’04’\t=WDFW"
That is all fine and well, but what we really want to do is capture particular parts of those lines that match certain subpatterns. So, prepare yourself for a parenthesis bonanza.
Recall that parentheses group patterns, and we can extract the parts that match those sub-patterns explicitly with str_match():
codes_and_values <- str_match(spec, "\t[‘’](.+)[‘’]\t=(.*)$")
The above grabs the "1 or more occurrences of any character" between the quotes, and also all the characters after the equals sign. str_match returns a matrix: the first column is the whole match, and each successive column is a successive grouping (a different set of parentheses). If there isn't a match then you get NAs.
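To see the matrix-with-NAs behavior on a toy input (a sketch, not from the original):

str_match(c("\t’01’\t=ADFG", "no codes here"), "[‘’](.+)[‘’]\t=(.*)$")
#>      [,1]          [,2] [,3]
#> [1,] "’01’\t=ADFG" "01" "ADFG"
#> [2,] NA            NA   NA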
Step 5: Trim whitespace and bung these all into a data frame
We now just need to stick these all into a data frame, but we will want to trim leading and trailing whitespace off of all of the codes and values in case someone left some in there… So we can use str_trim():
# note: current stringr versions drop matrix dimensions, so trim column-by-column
codes_and_values <- apply(codes_and_values, 2, str_trim)
Then we make the data frame:
code_frame <- data.frame(chapter_numbers,
chapter_names,
fields,
codes = codes_and_values[,2],
values = codes_and_values[,3],
stringsAsFactors = FALSE)
Finally, remember that we only want to grab the codes for Chapters 2 through 6 inclusive. And, of course, we can drop any rows that have NAs since there were no codes or values on those lines. So
# chapter_numbers is a character vector, so convert to numeric for a safe comparison
final_codes <- na.omit(subset(code_frame, as.numeric(chapter_numbers) >= 2 & as.numeric(chapter_numbers) <= 6))
Phew! How does it look? Let’s look at the first 100 lines of it:
library(knitr) # provides kable()
# i have to change some #'s because the renderer confuses them with links
kable(str_replace_all(string = as.matrix(final_codes[1:100,]), pattern = "#", "--"), row.names = FALSE)
chapter_numbers | chapter_names | fields | codes | values |
---|---|---|---|---|
2 | Release Data | record_code | T | Tagged Release record |
2 | Release Data | record_code | N | Non-Associated Release record |
2 | Release Data | coordinator | 01 | ADFG (S.E. Alaska) |
2 | Release Data | coordinator | 02 | NMFS – Alaska |
2 | Release Data | coordinator | 03 | CDFO |
2 | Release Data | coordinator | 04 | WDFW |
2 | Release Data | coordinator | 05 | ODFW |
2 | Release Data | coordinator | 06 | NMFS – Columbia River |
2 | Release Data | coordinator | 07 | USFWS |
2 | Release Data | coordinator | 08 | CDFG |
2 | Release Data | coordinator | 09 | BCFW |
2 | Release Data | coordinator | 10 | IDFG |
2 | Release Data | coordinator | 11 | YAKA |
2 | Release Data | coordinator | 12 | ADFG (S. Central AK) |
2 | Release Data | coordinator | 13 | MIC (Metlakatla, AK) |
2 | Release Data | coordinator | 14 | NWIFC |
2 | Release Data | coordinator | 15 | CRITFC |
2 | Release Data | tag_type | 0 | Standard binary (1mm) |
2 | Release Data | tag_type | 1 | Half tags (H type) |
2 | Release Data | tag_type | 2 | Half tags (B type) |
2 | Release Data | tag_type | 3 | 6 word half length tags |
2 | Release Data | tag_type | 4 | X-ray binary ( tag_code_or_release_id must be ’XX0500’) |
2 | Release Data | tag_type | 5 | Standard color |
2 | Release Data | tag_type | 6 | Solid color (—-) |
2 | Release Data | tag_type | 7 | Striped color ($$) |
2 | Release Data | tag_type | 8 | Rare Earth |
2 | Release Data | tag_type | 9 | Repeating series |
2 | Release Data | tag_type | 10 | Sequential 6 word binary; |
2 | Release Data | tag_type | 11 | Length & ½ Binary (1.5mm) |
2 | Release Data | tag_type | 12 | Standard Alphanumeric, includes Decimal (1 mm) |
2 | Release Data | tag_type | 13 | Length & ½ Alphanumeric, includes Decimal (1.5 mm) |
2 | Release Data | tag_type | 14 | Sequential Alphanumeric, includes Decimal |
2 | Release Data | tag_type | 15 | Half length Alphanumeric, includes Decimal (0.5mm) |
2 | Release Data | tag_type | 16 | Pseudo tag, blank wire |
2 | Release Data | related_group_type | D | Double index tag groups |
2 | Release Data | related_group_type | O | Other related groups |
2 | Release Data | species | 1 | Chinook |
2 | Release Data | species | 2 | Coho |
2 | Release Data | species | 3 | Steelhead |
2 | Release Data | species | 4 | Sockeye |
2 | Release Data | species | 5 | Chum |
2 | Release Data | species | 6 | Pink |
2 | Release Data | species | 7 | Masu |
2 | Release Data | species | 8 | Cutthroat |
2 | Release Data | species | 9 | Atlantic Salmon |
2 | Release Data | run | 1 | Spring |
2 | Release Data | run | 2 | Summer |
2 | Release Data | run | 3 | Fall (includes type S Coho) |
2 | Release Data | run | 4 | Winter |
2 | Release Data | run | 5 | Hybrid |
2 | Release Data | run | 6 | Landlocked |
2 | Release Data | run | 7 | Late Fall (includes type N Coho) |
2 | Release Data | run | 8 | Late Fall Upriver Bright Chinook |
2 | Release Data | release_stage | Z | Zygote (eyed eggs) |
2 | Release Data | release_stage | E | Emergent fry |
2 | Release Data | release_stage | F | Fed fry |
2 | Release Data | release_stage | G | Fingerling |
2 | Release Data | release_stage | V | Advanced fingerling |
2 | Release Data | release_stage | Y | Yearling |
2 | Release Data | release_stage | P | Pre-smolt |
2 | Release Data | release_stage | S | Smolt |
2 | Release Data | release_stage | A | Adult |
2 | Release Data | release_stage | M | Multiple release stages |
2 | Release Data | rearing_type | H | Hatchery reared fish (includes any wild fish reared in the hatchery) |
2 | Release Data | rearing_type | W | Wild fish |
2 | Release Data | rearing_type | M | Mixed hatchery & wild (downstream migrant or marine tagging) |
2 | Release Data | rearing_type | U | Unknown (unavailable from release agency) |
2 | Release Data | study_type | E | Experimental |
2 | Release Data | study_type | P | Production |
2 | Release Data | study_type | B | Both experimental and production |
2 | Release Data | study_type | O | Other |
2 | Release Data | study_type | K | PSC key indicator stocks |
2 | Release Data | study_type | I | Other index streams |
2 | Release Data | release_strategy | FR | Forced release |
2 | Release Data | release_strategy | MX | Mixed release strategies |
2 | Release Data | release_strategy | VO | Volitional release |
2 | Release Data | study_integrity | N | Normal range expected |
2 | Release Data | study_integrity | D | Fish destroyed; zero survival assumed |
2 | Release Data | study_integrity | W | Warning flag for serious problems |
2 | Release Data | counting_method | B | Book estimates |
2 | Release Data | counting_method | C | Actual physical counts |
2 | Release Data | counting_method | P | Petersen estimates |
2 | Release Data | counting_method | W | Weight derived estimates |
2 | Release Data | counting_method | V | Volumetric Conversion |
2 | Release Data | counting_method | F | Feed Conversion Estimates |
3 | Recovery Data | record_code | R | Recovery record |
3 | Recovery Data | species | 1 | Chinook |
3 | Recovery Data | species | 2 | Coho |
3 | Recovery Data | species | 3 | Steelhead |
3 | Recovery Data | species | 4 | Sockeye |
3 | Recovery Data | species | 5 | Chum |
3 | Recovery Data | species | 6 | Pink |
3 | Recovery Data | species | 7 | Masu |
3 | Recovery Data | species | 8 | Cutthroat |
3 | Recovery Data | species | 9 | Atlantic Salmon |
3 | Recovery Data | recovery_date_type | R | Reported date |
3 | Recovery Data | recovery_date_type | C | Calculated date |
3 | Recovery Data | period_type | 1 | Escapement period (across years possible) |
3 | Recovery Data | period_type | 2 | Bi-weekly (statistical 2 week) |
3 | Recovery Data | period_type | 3 | Semi-monthly (calendar) |
Booyah! It looks great. (There are some issues, though, with location codes, which are hierarchical and thus require special treatment.)
Of course, it would be better if we didn’t have to go through all this rigmarole to get there. But what can you do?
Note that the above format is tidy, and if we had any of the data bases loaded as data frames then it would be pretty straightforward to replace the codes with their values and keep them as factors.
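For example, a hedged sketch (assuming a hypothetical releases data frame with a coded species column):

# hypothetical release records with coded species
releases <- data.frame(species = c("1", "2", "1", "3"), stringsAsFactors = FALSE)
# pull the relevant code/value pairs from our scraped table
sp <- subset(final_codes, chapter_names == "Release Data" & fields == "species")
# recode the column as a factor whose levels are the actual species names
releases$species <- factor(releases$species, levels = sp$codes, labels = sp$values)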