Reproducible Research Course by Eric C. Anderson (NOAA/SWFSC)


String manipulation with R

Intro

R comes with some fairly well-developed tools for string manipulation and the application of regular expressions. If you are a fan of Perl or Python, or you have a penchant for Unix utilities like sed and awk, then you might find that you prefer those languages/workflows for doing your text processing. However, you can do quite a lot with R, and if all of your other analyses are done in R, there is some advantage to also doing your text processing within R.

If you are not yet familiar with any other text-processing systems and R is your first real programming language, then you might as well learn string manipulation in R!

If you are used to a “line-by-line” loop-based processing paradigm like that used in awk, then R’s focus on vectors of strings can take some getting used to. However, once you get comfortable with it and the sorts of operations that are useful within that framework, it can be quite powerful.

The base R package comes with a full set of string manipulation functions that are quite versatile. However, they don’t have consistent interfaces and they come with a lot of options that most people will never need to learn. So, while we are learning this stuff, we will focus on using the functions in Hadley Wickham’s package stringr, which preserves about 95% of the functionality of the base R string functions but doesn’t carry around a lot of extra function-parameter shrubbery that will trip up a lot of users. For a quick intro to stringr, check out the package’s introductory vignette.
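To give a flavor of what “consistent interface” means, here is a quick, made-up comparison of base R and stringr calls (we will load stringr for real in the examples below):

library(stringr)

# base R: the argument order bounces around between functions
substr("reproducible", 1, 5)       # the string comes first here...
#> [1] "repro"
grepl("repro", "reproducible")     # ...but the pattern comes first here
#> [1] TRUE

# stringr: the string is always the first argument
str_sub("reproducible", 1, 5)
#> [1] "repro"
str_detect("reproducible", "repro")
#> [1] TRUE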

To illustrate these ideas we will use a real example from my own work: processing a text file containing data specifications for the coded wire tag data base.

Prerequisites

  • To work through the examples you will need the stringr and zoo packages.
  • Please download/install these before coming to class:
    1. install necessary packages:

      install.packages(c("stringr", "zoo"))
    2. Pull the most recent version of the rep-res-course repo just before coming to class.

Goals for this hour:

  • Present example text-processing problem
  • Briefly cover the idea of regular expressions
  • Show how to use functions in stringr to solve the problem

Example problem

RMIS

  • The Regional Mark Processing Center (RMPC) maintains data bases having to do with catch, sampling, release and recovery of salmon and coded wire tags. These data bases are known collectively as the Regional Mark Information System (RMIS).
  • You can query the data bases and get information from http://www.rmpc.org
  • Many of the fields in the data base use codes to denote certain values. For example, in the species column in each data base, 1 = Chinook salmon, 2 = Coho, 3 = Steelhead, etc.
  • If you are going to be analyzing these data in R, or plotting them, or just looking at them, it is nice to recode such columns as factors whose levels are the actual species names. (This beats trying to remember all sorts of codes; a minimal sketch of that kind of recoding follows below.)
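Here is that minimal sketch, using a made-up vector of species codes. (The real data bases come later; this is just to show the kind of recoding we are after.)

# a hypothetical column of RMIS-style species codes
species_code <- c(1, 2, 2, 3, 1)

# recode it as a factor whose levels are the actual species names
species <- factor(species_code, levels = 1:3,
                  labels = c("Chinook", "Coho", "Steelhead"))
species
#> [1] Chinook   Coho      Coho      Steelhead Chinook  
#> Levels: Chinook Coho Steelhead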

The PSC data specification

  • First, the good news: all of these codes are documented in the PSC data specification.
  • Now the bad news:
    • The RMPC does not maintain a listing of these codes in an easily parseable (i.e. reproducible!) format.
    • This means that you have to extract all the codes from a PDF file, which is a rather ugly proposition.
  • The folks who run the RMPC are pretty cool, though, and were able to supply me with the Microsoft Word document from which the PDF file was created.
  • MS Word, as we’ve all learned in this course, is not a reproducible or easily-parseable format, but we are going to do our best with it.

PSC Spec in text

  • I was able to copy and paste the specification, make sure it had Unix line endings instead of Mac ones, and save it.
  • A copy is in the course repo at: data/PSC-Specification-ver-4.1_7-1-14.txt
  • Go ahead and have a look at it in a suitably good text editor. If you are using TextWrangler on your Mac, you can use View -> Text Display -> Show Invisibles to see which characters are tabs and which are spaces, etc.
  • Here is a little section of the 2,855 lines in the file:
CHAPTER 2
Release Data

PSC   PSC Common Name   Max Reqd    Format /Use Description & Validation Rules.......................................................................................................................................
Fld #   and Data Field Name Cols            
1   Record Code 1   Yes Lookup  Code to indicate the CWT data file classification (class) of this individual record.  Must match one of the following:
    record_code         ’T’ =Tagged Release record
                ’N’ =Non-Associated Release record
                    See chapter 16 for further discussion of the use of this field.
                    
2   Format Version  4   Yes ’4.1’   Format version used to report data
    format_version              Must have the value:   4.13   Submission Date 8   Yes YYYYMMDD    Date of submission for this set of records.  Date should be close to actual date when this row is sent to the Mark Center
    submission_date             Must have the same value for all records in this data submission
                    Should match submission_date in corresponding Description file
                    
4   Reporting Agency    10  Yes Lookup  Abbreviation for reporting agency of this dataset for this data exchange
    reporting_agency                Must contain an agency code defined in chapter 8
                    Must be the same for all records
                    
5   Release Agency  10  Yes Lookup  Abbreviations for tagging agencies
    release_agency              Must contain an agency code defined in chapter 8
                    
6   Coordinator 2   Yes Lookup  Reporting coordinator for the release group of this individual record
    coordinator             Must match one of the following:
                01’    =ADFG (S.E. Alaska)
                ’02’    =NMFS – Alaska
                ’03’    =CDFO
                ’04’    =WDFW
                ’05’    =ODFW
                ’06’    =NMFS – Columbia River
                ’07’    =USFWS
                ’08’    =CDFG
                ’09’    =BCFW
                ’10’    =IDFG
                ’11’    =YAKA
                ’12’    =ADFG (S. Central AK)
                ’13’    =MIC (Metlakatla, AK)
                ’14’    =NWIFC
                ’15’    =CRITFC
                ‘16’              =NEZP
                ‘17’              =QDNR
                ‘18’              =STIL             
  • What a freakshow! I can’t believe that we have to do this! (But on the other hand it is kind of fun…)

A closer look at the format

  • The specification is organized into a series of chapters that start with lines that look like:

    CHAPTER 2

    Chapters 2 through 6 have information about the five data bases I’ll be working with:

    Chapter 2 – Release Data
    Chapter 3 – Recovery Data
    Chapter 4 – Catch/Sample Data
    Chapter 5 – Catch & Effort Data
    Chapter 6 – Location Data
  • Within every chapter there are multiple entries that give the names of the fields (columns). For example:

    4   Reporting Agency    10  Yes Lookup  Abbreviation for reporting agency of this dataset for this data exchange
        reporting_agency                Must contain an agency code defined in chapter 8
                    Must be the same for all records
    
    5   Release Agency  10  Yes Lookup  Abbreviations for tagging agencies
        release_agency              Must contain an agency code defined in chapter 8
    This shows that field 4 is reporting_agency, etc.
  • Some of the fields have lists of codes that look like this:

    6  Coordinator  2   Yes Lookup  Reporting coordinator for the release group of this individual record
              coordinator               Must match one of the following:
                    01’    =ADFG (S.E. Alaska)
                    ’02’    =NMFS – Alaska
                    ’03’    =CDFO
                    ’04’    =WDFW
                    ’05’    =ODFW
                    ’06’    =NMFS – Columbia River
                    ’07’    =USFWS
                    ’08’    =CDFG
                    ’09’    =BCFW
                    ’10’    =IDFG
                    ’11’    =YAKA
                    ’12’    =ADFG (S. Central AK)
                    ’13’    =MIC (Metlakatla, AK)
                    ’14’    =NWIFC
                    ’15’    =CRITFC
                    ‘16’              =NEZP
                    ‘17’              =QDNR
                    ‘18’              =STIL             

    These codes take the form of a string (usually a single character or a number) quoted in special quotes that are sometimes upside down and sometimes right side up; then a TAB; then an equals sign followed by the value string, usually with no space after the equals sign.

Our mission

  • For every field (in each of the five data bases) that has specialized codes, we want to know what the codes are and what they mean.
  • In the end we will want to have a tidy data frame of these that looks something like this:
chapter       field        code  value
Release Data  coordinator  “01”  ADFG (S.E. Alaska)
Release Data  coordinator  “02”  NMFS – Alaska
Release Data  coordinator  “03”  CDFO
Release Data  coordinator  “04”  WDFW
  • If we wanted to be old school about it we could just spend two weeks copying and pasting everything, and hope that the spec doesn’t change any time soon, because we wouldn’t want to waste another two weeks doing it again (even if our wrists weren’t totally trashed by the first round).
  • But we are going to try to automate this.

Regular expressions

Before we can get rolling on this we are going to have to cover some regular expressions.

What is a regular expression?

  • You can think of a regular expression as a “fuzzy find” pattern that lets you match text strings using “wildcards” and “character classes”, and even capture the parts of the text that match particular pieces of your pattern.
  • Here is a simple example.
library(stringr)  # load the package

# now make a character vector
vec <- c("fred", "foo", "food", "bed", "reefed")

# see if any matches "fred"
str_detect(vec, "fred")
#> [1]  TRUE FALSE FALSE FALSE FALSE

# see if any have an "f" followed by a "d"
# after 0 or more other characters
str_detect(vec, "f.*d")
#> [1]  TRUE FALSE  TRUE FALSE  TRUE

# see if any start with an "f"
str_detect(vec, "^f")
#> [1]  TRUE  TRUE  TRUE FALSE FALSE

# see if any end with an "o"
str_detect(vec, "o$")
#> [1] FALSE  TRUE FALSE FALSE FALSE

Holy cow! That looks funky

  • Yes, there is a whole pantheon of regular expression “metacharacters” that you will want to get familiar with.
  • Entire books have been written on regular expressions, so I won’t attempt to teach you all of it now, but I will point out some things that you will want to know.
  1. The simplest pattern is a single character. So, a regular expression of “a” says: match anything with an “a” in it.
  2. You can specify how many times a pattern must occur to be considered a match. For example, maybe you had strings of DNA and you wanted to detect runs of 7 “A”’s in a row. The regular expression that matches that would be "A{7}".
  3. A pattern can be a wildcard character, which can take many different forms, like:
    • . = any single character
    • [a-z] = a character class consisting of all lowercase letters
    • [[:lower:]] = another way of specifying the same thing as [a-z], but one that respects the locale.
  4. You can group patterns together by wrapping them in ( and ). Then you can treat those groups as patterns that must occur a certain number of times to be considered a match, etc. (There is a quick demo of points 2 through 4 right after this list.)
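To make those last few points concrete, here is a quick demo on some made-up strings (stringr is already loaded from the example above):

str_detect(c("TTAAAAAAAGG", "TTAAAGG"), "A{7}")   # point 2: seven A's in a row
#> [1]  TRUE FALSE

str_detect(c("abc", "ABC"), "^[a-z]+$")           # point 3: only lowercase letters, start to end
#> [1]  TRUE FALSE

str_detect(c("banana", "bane"), "(na){2}")        # point 4: the group "na", twice in a row
#> [1]  TRUE FALSE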

We will see examples of these things as we solve our problem at hand. For now, just know that you can build up regular expressions to search for very complex patterns in strings using this regex syntax that can seem quite baffling or even daunting at first.

For now, here are some important things to know

# special meaning is given to these characters:
. \ | ( ) [ ] { } ^ $ * + ?

.  # the dot matches any character (or number or punctuation, etc)

\  # backslash is often used to "escape" metacharacters (in an R string you must type it as a double backslash, e.g. "\\." to match a literal dot)

( )  # these guys are used to group patterns and also
     # to capture the parts of a string that match grouped subpatterns

[ ]  # you can specify character classes (like "any lowercase letter") inside these

?    # The preceding item is optional and will be matched at most once.

*    # The preceding item will be matched zero or more times.
  
+    # The preceding item will be matched one or more times.
  
{n}  # The preceding item is matched exactly n times.

{n,} # The preceding item is matched n or more times.

{n,m} # The preceding item is matched at least n times, but not more than m times.

For a more complete description do ?regex and see the “Extended Regular Expressions” section.

Getting down to business

A quick note on stringr functions. Most of them have the syntax:

str_something( string,  pattern )

where something is like “detect” or “match” or “replace”, and pattern is a regular expression passed in as a quoted string (or even a character vector in some cases).
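For example, with something = “replace”:

# replace the first run of o's in each string with a "0"
str_replace(c("foo", "food", "bed"), "o+", "0")
#> [1] "f0"  "f0d" "bed"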

Step 1: read the strings in!

The delightful readLines() function will let us read each line of PSC-Specification-ver-4.1_7-1-14.txt into a character vector:

spec <- readLines("data/PSC-Specification-ver-4.1_7-1-14.txt")
length(spec)
#> [1] 2855
head(spec)
#> [1] ""                                                                                                 
#> [2] "CHAPTER 1"                                                                                        
#> [3] "INTRODUCTION, DEFINITIONS, AND RULES"                                                             
#> [4] ""                                                                                                 
#> [5] "CWT data must be exchanged in the form of a PSC Format Version 4.1 dataset."                      
#> [6] "The definition and specification of PSC Format Version 4.1 is described in this set of documents."

Step 2: figure out which chapter we are in

Here is what our strategy is going to be for dealing with this:

  1. Start with a vector of NA’s of length(spec)
  2. Put values in where each “CHAPTER” is found
  3. Propagate those values forward, overwriting any NAs with the value that came before.

In this way we get a vector that tells us, for example, which chapter each line of the file is in.
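The rolling-forward in step 3 is handled by the na.locf() function (“last observation carried forward”) from the zoo package. In case you have not seen it before, here is a tiny made-up example of what it does:

library(zoo)  # provides na.locf()

x <- c(NA, "ch2", NA, NA, "ch3", NA)
na.locf(x, na.rm = FALSE)  # each NA gets the last non-NA value before it
#> [1] NA    "ch2" "ch2" "ch2" "ch3" "ch3"

Note that the leading NA stays NA, since there is nothing before it to carry forward; na.rm = FALSE keeps it in place rather than dropping it.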

Let’s do it for the chapter names:

chapter <- rep(NA, length(spec))  # make a vector of NAs
chap_positions <- which(str_detect(spec, "^CHAPTER"))

# see what that looks like:
chap_positions
#>  [1]    2  159  492  835 1132 1280 1406 1466 1666 2009 2094 2189 2262 2589
#> [15] 2698 2775

# now, the titles of the chapters are on the line following "^CHAPTER"
# so let's get those and put them in and then roll them forward
chapter[chap_positions] <- spec[chap_positions + 1]
chapter_names <- na.locf(chapter, na.rm = FALSE) # from the zoo package.  It rolls the last value forward over NAs

# if you are following along, do this to see what we have done:
#   cbind(chapter_names)

Step 2.5: Let’s get the chapter numbers while we are at it

Now, we want to extract the number of each chapter. We can use a regular expression that finds strings in spec that

  1. first have to match “^CHAPTER”
  2. but then also have at least one space
  3. and then one or more digits

AND we can use the str_match function to pull out the part that matches the 3rd part of the expression (i.e. the one or more digits). Check out ?str_match

Here we go!

# grab them
chap_nums <- str_match(spec, "^CHAPTER +([[:digit:]]+)")

# notice that we have wrapped the part that matches one or more digits in parentheses!

# note that str_match returns a matrix:
head(chap_nums)
#>      [,1]        [,2]
#> [1,] NA          NA  
#> [2,] "CHAPTER 1" "1" 
#> [3,] NA          NA  
#> [4,] NA          NA  
#> [5,] NA          NA  
#> [6,] NA          NA

# roll them forward
chapter_numbers <- na.locf(chap_nums[,2], na.rm = FALSE)

# see what we have done so far looks like
head(cbind(chapter_numbers, chapter_names))
#>      chapter_numbers chapter_names                         
#> [1,] NA              NA                                    
#> [2,] "1"             "INTRODUCTION, DEFINITIONS, AND RULES"
#> [3,] "1"             "INTRODUCTION, DEFINITIONS, AND RULES"
#> [4,] "1"             "INTRODUCTION, DEFINITIONS, AND RULES"
#> [5,] "1"             "INTRODUCTION, DEFINITIONS, AND RULES"
#> [6,] "1"             "INTRODUCTION, DEFINITIONS, AND RULES"

Step 3: Figure out which field each line of the file is in

Our main concern is going to be extracting codes that belong to particular fields, which we can identify by which chapter (i.e., which data base) they are in, and then by field name.

Happily, it is easy to find the beginnings of the field sections. They are the only lines in the file, as far as I can see, that start with one or two digits followed immediately by a TAB. (At least that is the case for the five chapters we are interested in.)

Recall that these things look like:

2  Format Version   4   Yes ’4.1’   Format version used to report data
    format_version              Must have the value:   4.13   Submission Date 8   Yes YYYYMMDD    Date of submission for this set of records.  Date should be close to actual date when this row is sent to the Mark Center
    submission_date             Must have the same value for all records in this data submission
                    Should match submission_date in corresponding Description file

So, we are going to get those positions, and then try to extract the field names, which appear on the following lines, stuck between TABs:

field_starts <- which(str_detect(spec, "^[[:digit:]]{1,2}\t"))
field_names <- str_match(spec[field_starts + 1], "^\t([_[:alnum:]]+)\t")
field_names[,2]
#>   [1] "record_code"               "format_version"           
#>   [3] "submission_date"           "reporting_agency"         
#>   [5] "release_agency"            "coordinator"              
#>   [7] "tag_code_or_release_id"    "tag_type"                 
#>   [9] "first_sequential_number"   "last_sequential_number"   
#>  [11] "related_group_type"        "related_group_id"         
#>  [13] "species"                   "run"                      
#>  [15] "brood_year"                "first_release_date"       
#>  [17] "last_release_date"         "release_location_code"    
#>  [19] "hatchery_location_code"    "stock_location_code"      
#>  [21] "release_stage"             "rearing_type"             
#>  [23] "study_type"                "release_strategy"         
#>  [25] "avg_weight"                "avg_length"               
#>  [27] "study_integrity"           "cwt_1st_mark"             
#>  [29] "cwt_1st_mark_count"        "cwt_2nd_mark"             
#>  [31] "cwt_2nd_mark_count"        "non_cwt_1st_mark"         
#>  [33] "non_cwt_1st_mark_count"    "non_cwt_2nd_mark"         
#>  [35] "non_cwt_2nd_mark_count"    "counting_method"          
#>  [37] "tag_loss_rate"             "tag_loss_days"            
#>  [39] "tag_loss_sample_size"      "tag_reused"               
#>  [41] "comments"                  "record_code"              
#>  [43] "format_version"            "submission_date"          
#>  [45] "reporting_agency"          "sampling_agency"          
#>  [47] "recovery_id"               "species"                  
#>  [49] "run_year"                  "recovery_date"            
#>  [51] "recovery_date_type"        "period_type"              
#>  [53] "period"                    "fishery"                  
#>  [55] "gear"                      "adclip_selective_fishery" 
#>  [57] "estimation_level"          "recovery_location_code"   
#>  [59] "sampling_site"             "recorded_mark"            
#>  [61] "sex"                       "weight"                   
#>  [63] "weight_code"               "weight_type"              
#>  [65] "length"                    "length_code"              
#>  [67] "length_type"               "detection_method"         
#>  [69] "tag_status"                "tag_code"                 
#>  [71] "tag_type"                  "sequential_number"        
#>  [73] "Number"                    "sequential_row_number"    
#>  [75] "catch_sample_id"           "sample_type"              
#>  [77] "sampled_maturity"          "sampled_run"              
#>  [79] "sampled_length_range"      "sampled_sex"              
#>  [81] "sampled_mark"              "estimated_number"         
#>  [83] "record_code"               "format_version"           
#>  [85] "submission_date"           "reporting_agency"         
#>  [87] "sampling_agency"           "catch_sample_id"          
#>  [89] "species"                   "catch_year"               
#>  [91] "period_type"               "period"                   
#>  [93] "first_period"              "last_period"              
#>  [95] "fishery"                   "adclip_selective_fishery" 
#>  [97] "estimation_level"          NA                         
#>  [99] "detection_method"          "sample_type"              
#> [101] "sampled_maturity"          "sampled_run"              
#> [103] "sampled_length_range"      "sampled_sex"              
#> [105] "sampled_mark"              "number_caught"            
#> [107] NA                          "number_sampled"           
#> [109] "number_estimated"          "number_recovered_decoded" 
#> [111] "number_recovered_no_cwts"  "CWTs"                     
#> [113] "Unreadable"                "Unresolved"               
#> [115] NA                          "PseudoTags"               
#> [117] "mr_1st_partition_size"     "mr_1st_sample_size"       
#> [119] NA                          "mr_1st_sample_obs_adclips"
#> [121] "mr_2nd_partition_size"     "mr_2nd_sample_size"       
#> [123] "Status"                    "mr_2nd_sample_obs_adclips"
#> [125] "mark_rate"                 "awareness_factor"         
#> [127] NA                          "Adclips"                  
#> [129] "record_code"               "format_version"           
#> [131] "submission_date"           "reporting_agency"         
#> [133] "catch_effort_id"           "catch_year"               
#> [135] "period_type"               "period"                   
#> [137] "landing_status"            "catch_location_code"      
#> [139] "harvest"                   "fisher"                   
#> [141] "catch_gear_group"          "catch_gear"               
#> [143] "species"                   "grade"                    
#> [145] "number_tickets"            "catch_weight"             
#> [147] "number_caught"             "effort_type"              
#> [149] "effort_quantity"           "Adclip_selective_fishery" 
#> [151] "record_code"               "format_version"           
#> [153] "submission_date"           "reporting_agency"         
#> [155] "location_code"             "name"                     
#> [157] "latitude"                  "longitude"                
#> [159] "psc_basin"                 "psc_region"               
#> [161] "description"               "record_code"              
#> [163] "format_version"            "submission_date"          
#> [165] "reporting_agency"          "submission_status"        
#> [167] "file_type"                 "file_status"              
#> [169] "first_year"                "last_year"                
#> [171] "description"               NA                         
#> [173] NA                          NA                         
#> [175] NA                          NA                         
#> [177] NA                          NA                         
#> [179] NA                          NA                         
#> [181] NA                          NA                         
#> [183] NA                          NA                         
#> [185] NA                          NA                         
#> [187] NA                          NA                         
#> [189] NA                          NA                         
#> [191] NA                          NA                         
#> [193] NA                          NA                         
#> [195] NA                          NA                         
#> [197] NA                          NA                         
#> [199] NA                          NA                         
#> [201] NA                          NA                         
#> [203] NA                          NA                         
#> [205] NA                          NA                         
#> [207] NA                          NA                         
#> [209] NA                          NA                         
#> [211] NA                          NA                         
#> [213] NA                          NA                         
#> [215] NA                          NA                         
#> [217] NA                          NA                         
#> [219] NA                          NA                         
#> [221] NA                          NA                         
#> [223] NA                          NA                         
#> [225] NA                          NA

Hmmm, some of these are not looking right. The NAs among the earlier entries are not so good, and the ones with uppercase letters look wrong. Let’s get the line numbers and look at the original file:

linenos_to_investigate <- field_starts[which(is.na(field_names[,2]) | str_detect(field_names[,2], "[[:upper:]]"))] + 1 
linenos_to_investigate
#>  [1]  750  942 1009 1039 1043 1047 1051 1055 1069 1097 1122 1127 1267 1688
#> [15] 1689 1690 1736 1737 1738 1743 1744 1830 1831 1832 1833 1834 1835 2208
#> [29] 2209 2210 2211 2212 2216 2217 2218 2219 2224 2225 2226 2227 2231 2232
#> [43] 2233 2234 2238 2239 2240 2241 2242 2246 2247 2251 2252 2253 2254 2258
#> [57] 2259 2260 2261 2575 2576 2577 2758 2761 2763 2766 2769 2771

When we take a look at all those line numbers in the original file, we see in one case a typo in the specification, and in a lot of other places the problem is due to lines being broken where they probably weren’t supposed to be, etc. In other words, it is the sort of inconsistent formatting that is probably an unavoidable consequence of trying to scrape data out of a medium (i.e. an MS Word doc) that is unsuited to actually storing data.

This mirrors one of the themes of this course that we’ve seen a few times: If your stuff is in a Word file or an Excel file, you should not expect your research to be reproducible.
It would be preferable if those PSC data specifications lived in their own SQL table within the RMIS data base.

But anyway, from the looks of it, we can safely ignore those inconsistent spots, because those fields don’t actually have codes that we are going to be scraping out in the next step.

So, now we have to put those field_names into a fields vector and roll them forward over the NAs:

fields_0 <- rep(NA, length(spec))
fields_0[field_starts] <- field_names[,2]
fields <- na.locf(fields_0, na.rm = FALSE)

Now that this is done, let’s see what we are working with so far. We have something that looks like this:

head(na.omit(data.frame(chapter_numbers, chapter_names, fields, stringsAsFactors = FALSE)))
#>     chapter_numbers chapter_names         fields
#> 164               2  Release Data    record_code
#> 165               2  Release Data    record_code
#> 166               2  Release Data    record_code
#> 167               2  Release Data    record_code
#> 168               2  Release Data    record_code
#> 169               2  Release Data format_version

Rippin! That looks tidy!

Step 4: Extract the codes and values

Now we just have to extract the lines that have the code definitions on them!

When we find occurrences of these things, they always seem to follow the pattern:

TAB + Quote + AlphaNumericOrDash + Quote + TAB + EqualsSign + ValueWithNoSpaceAfterEqualsSign

So, we can write a regex pattern for that. Some things must be noted, though:

  1. The single quotes are not simple: they can be upside down or right side up, and they are not used consistently.
  2. AlphaNumericOrDash is risky. Let’s start by matching one or more of any character between those quotation marks.
  3. All we can really do is plan to grab the value all the way out to the end of the line.
  4. Sometimes they do have a space after the equals sign.

Here is what our basic regex to capture each of those things looks like in the str_detect function:

gotem <- str_detect(spec, "\t[‘’].+[‘’]\t=.*$")

# note that the quotes might not look different in the font in the HTML.
# see the actual code in RStudio to see what they look like.

sum(gotem)  # how many lines of the spec match?
#> [1] 317
head(spec[gotem])  # what do some of them look like?
#> [1] "\trecord_code\t\t\t’T’\t=Tagged Release record"
#> [2] "\t\t\t\t’N’\t=Non-Associated Release record"   
#> [3] "\t\t\t\t’01’\t=ADFG (S.E. Alaska)"             
#> [4] "\t\t\t\t’02’\t=NMFS – Alaska"                  
#> [5] "\t\t\t\t’03’\t=CDFO"                           
#> [6] "\t\t\t\t’04’\t=WDFW"

That is all fine and well, but what we really want to do is capture particular parts of those lines that match certain subpatterns. So, prepare yourself for a parenthesis bonanza.

Recall that parentheses group patterns and we can extract the parts that match those sub-patterns explicitly with str_match():

codes_and_values <- str_match(spec, "\t[‘’](.+)[‘’]\t=(.*)$")

The above grabs the “1 or more occurrences of any character” between the quotes and also all the characters after the equals sign. str_match returns a matrix: the first column is the whole match, and each successive column is a successive group (a different set of parentheses).

If there isn’t a match then you get NAs.

Step 5: Trim whitespace and bung these all into a data frame

We now just need to stick these all into a data frame, but we will want to trim leading and trailing whitespace off of all of the codes and values in case someone left some in there… So we can use str_trim():

codes_and_values[] <- str_trim(codes_and_values)  # assign into [] so the matrix keeps its dimensions

Then make that data frame:

code_frame <- data.frame(chapter_numbers, 
                         chapter_names,
                         fields,
                         codes = codes_and_values[,2],
                         values = codes_and_values[,3],
                         stringsAsFactors = FALSE)

Finally, remember that we only want to grab the codes for Chapters 2 through 6, inclusive. And, of course, we can drop any rows that have NAs, since there were no codes or values on those lines. So:

# chapter_numbers is stored as character, so convert it before comparing numerically
final_codes <- na.omit(subset(code_frame, as.integer(chapter_numbers) %in% 2:6))

Phew! How does it look? Let’s look at the first 100 lines of it:

# i have to change some #'s because the renderer confuses them with links
m <- as.matrix(final_codes[1:100, ])
m[] <- str_replace_all(m, "#", "--")  # assign into [] so the matrix keeps its shape
knitr::kable(m, row.names = FALSE)
chapter_numbers  chapter_names  fields              codes  values
2                Release Data   record_code         T      Tagged Release record
2                Release Data   record_code         N      Non-Associated Release record
2                Release Data   coordinator         01     ADFG (S.E. Alaska)
2                Release Data   coordinator         02     NMFS – Alaska
2                Release Data   coordinator         03     CDFO
2                Release Data   coordinator         04     WDFW
2                Release Data   coordinator         05     ODFW
2                Release Data   coordinator         06     NMFS – Columbia River
2                Release Data   coordinator         07     USFWS
2                Release Data   coordinator         08     CDFG
2                Release Data   coordinator         09     BCFW
2                Release Data   coordinator         10     IDFG
2                Release Data   coordinator         11     YAKA
2                Release Data   coordinator         12     ADFG (S. Central AK)
2                Release Data   coordinator         13     MIC (Metlakatla, AK)
2                Release Data   coordinator         14     NWIFC
2                Release Data   coordinator         15     CRITFC
2                Release Data   tag_type            0      Standard binary (1mm)
2                Release Data   tag_type            1      Half tags (H type)
2                Release Data   tag_type            2      Half tags (B type)
2                Release Data   tag_type            3      6 word half length tags
2                Release Data   tag_type            4      X-ray binary ( tag_code_or_release_id must be ’XX0500’)
2                Release Data   tag_type            5      Standard color
2                Release Data   tag_type            6      Solid color (—-)
2                Release Data   tag_type            7      Striped color ($$)
2                Release Data   tag_type            8      Rare Earth
2                Release Data   tag_type            9      Repeating series
2                Release Data   tag_type            10     Sequential 6 word binary;
2                Release Data   tag_type            11     Length & ½ Binary (1.5mm)
2                Release Data   tag_type            12     Standard Alphanumeric, includes Decimal (1 mm)
2                Release Data   tag_type            13     Length & ½ Alphanumeric, includes Decimal (1.5 mm)
2                Release Data   tag_type            14     Sequential Alphanumeric, includes Decimal
2                Release Data   tag_type            15     Half length Alphanumeric, includes Decimal (0.5mm)
2                Release Data   tag_type            16     Pseudo tag, blank wire
2                Release Data   related_group_type  D      Double index tag groups
2                Release Data   related_group_type  O      Other related groups
2                Release Data   species             1      Chinook
2                Release Data   species             2      Coho
2                Release Data   species             3      Steelhead
2                Release Data   species             4      Sockeye
2                Release Data   species             5      Chum
2                Release Data   species             6      Pink
2                Release Data   species             7      Masu
2                Release Data   species             8      Cutthroat
2                Release Data   species             9      Atlantic Salmon
2                Release Data   run                 1      Spring
2                Release Data   run                 2      Summer
2                Release Data   run                 3      Fall (includes type S Coho)
2                Release Data   run                 4      Winter
2                Release Data   run                 5      Hybrid
2                Release Data   run                 6      Landlocked
2                Release Data   run                 7      Late Fall (includes type N Coho)
2                Release Data   run                 8      Late Fall Upriver Bright Chinook
2                Release Data   release_stage       Z      Zygote (eyed eggs)
2                Release Data   release_stage       E      Emergent fry
2                Release Data   release_stage       F      Fed fry
2                Release Data   release_stage       G      Fingerling
2                Release Data   release_stage       V      Advanced fingerling
2                Release Data   release_stage       Y      Yearling
2                Release Data   release_stage       P      Pre-smolt
2                Release Data   release_stage       S      Smolt
2                Release Data   release_stage       A      Adult
2                Release Data   release_stage       M      Multiple release stages
2                Release Data   rearing_type        H      Hatchery reared fish (includes any wild fish reared in the hatchery)
2                Release Data   rearing_type        W      Wild fish
2                Release Data   rearing_type        M      Mixed hatchery & wild (downstream migrant or marine tagging)
2                Release Data   rearing_type        U      Unknown (unavailable from release agency)
2                Release Data   study_type          E      Experimental
2                Release Data   study_type          P      Production
2                Release Data   study_type          B      Both experimental and production
2                Release Data   study_type          O      Other
2                Release Data   study_type          K      PSC key indicator stocks
2                Release Data   study_type          I      Other index streams
2                Release Data   release_strategy    FR     Forced release
2                Release Data   release_strategy    MX     Mixed release strategies
2                Release Data   release_strategy    VO     Volitional release
2                Release Data   study_integrity     N      Normal range expected
2                Release Data   study_integrity     D      Fish destroyed; zero survival assumed
2                Release Data   study_integrity     W      Warning flag for serious problems
2                Release Data   counting_method     B      Book estimates
2                Release Data   counting_method     C      Actual physical counts
2                Release Data   counting_method     P      Petersen estimates
2                Release Data   counting_method     W      Weight derived estimates
2                Release Data   counting_method     V      Volumetric Conversion
2                Release Data   counting_method     F      Feed Conversion Estimates
3                Recovery Data  record_code         R      Recovery record
3                Recovery Data  species             1      Chinook
3                Recovery Data  species             2      Coho
3                Recovery Data  species             3      Steelhead
3                Recovery Data  species             4      Sockeye
3                Recovery Data  species             5      Chum
3                Recovery Data  species             6      Pink
3                Recovery Data  species             7      Masu
3                Recovery Data  species             8      Cutthroat
3                Recovery Data  species             9      Atlantic Salmon
3                Recovery Data  recovery_date_type  R      Reported date
3                Recovery Data  recovery_date_type  C      Calculated date
3                Recovery Data  period_type         1      Escapement period (across years possible)
3                Recovery Data  period_type         2      Bi-weekly (statistical 2 week)
3                Recovery Data  period_type         3      Semi-monthly (calendar)

Booyah! It looks great. (There are some issues, though, with location codes, which are hierarchical and thus require special treatment.)

Of course, it would be better if we didn’t have to go through all this rigmarole to get there. But what can you do?

Note that the above format is tidy, and if we had any of the data bases loaded as data frames, it would be pretty straightforward to replace the codes with their values and keep them as factors, as sketched below.
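For instance, here is a sketch of how that recoding might look, assuming (hypothetically) that we had the Release Data loaded as a data frame called releases with a species column of raw codes:

# hypothetical data frame standing in for the real Release Data
releases <- data.frame(species = c("1", "2", "5", "2"),
                       stringsAsFactors = FALSE)

# pull out just the species codes for the Release Data chapter
spp <- subset(final_codes, chapter_names == "Release Data" & fields == "species")

# recode the column as a factor whose levels are the actual species names
releases$species <- factor(releases$species, levels = spp$codes,
                           labels = spp$values)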

