construct(...)
?`%:)%`
, and indented.Returns a character vector of the class regexr
. The attributes
of the returned object retain the original name and comment properties.
This function is used to construct human readable regular expressions from chunks. The user may provide additional meta information about each chunk. This meta information is an optional name and comment for the chunk. This allows one to write regular expressions in a fashion similar to writing code, that is the regular expression is written top to bottom, the syntax is broken up into manageable chunks, the expression can be indented to give structural insight such as nested groups, and the chunks can be commented to provide linguistic grounding for more complex chunks.
## Minimal Example minimal <- construct("a", "b", "c") minimal[1] "abc"unglue(minimal)[[1]] [1] "a" [[2]] [1] "b" [[3]] [1] "c"comments(minimal)[[1]] NULL [[2]] NULL [[3]] NULLregex(minimal)[[1]] [1] "a" [[2]] [1] "b" [[3]] [1] "c"test(minimal)$regex [1] TRUE $chunks [1] TRUE TRUE TRUEsummary(minimal)abc === REGEX 1: a NAME : COMMENT: REGEX 2: b NAME : COMMENT: REGEX 3: c NAME : COMMENT:## Example 1 m <- construct( space = "\\s+" %:)%"I see", simp = "(?<=(foo))", or = "(;|:)\\s*" %:)%"comment on what this does", "[a]s th[atey]" ) m[1] "\\s+(?<=(foo))(;|:)\\s*[a]s th[atey]"unglue(m)$space [1] "\\s+" $simp [1] "(?<=(foo))" $or [1] "(;|:)\\s*" [[4]] [1] "[a]s th[atey]"summary(m)\s+(?<=(foo))(;|:)\s*[a]s th[atey] ================================== REGEX 1: \s+ NAME : space COMMENT: "I see" REGEX 2: (?<=(foo)) NAME : simp COMMENT: REGEX 3: (;|:)\s* NAME : or COMMENT: "comment on what this does" REGEX 4: [a]s th[atey] NAME : COMMENT:regex(m)$space [1] "\\s+" $simp [1] "(?<=(foo))" $or [1] "(;|:)\\s*" [[4]] [1] "[a]s th[atey]"comments(m)$space [1] "I see" $simp NULL $or [1] "comment on what this does" [[4]] NULLregex(m)[4] <- "(F{O}2)|(BAR)" summary(m)\s+(?<=(foo))(;|:)\s*(F{O}2)|(BAR) ================================== REGEX 1: \s+ NAME : space COMMENT: "I see" REGEX 2: (?<=(foo)) NAME : simp COMMENT: REGEX 3: (;|:)\s* NAME : or COMMENT: "comment on what this does" REGEX 4: (F{O}2)|(BAR) NAME : COMMENT:test(m)$regex [1] TRUE $chunks space simp or TRUE TRUE TRUE TRUEregex(m)[5:7] <- c("(", "([A-Z]|(\\d{5})", ")") test(m)Warning message: The concatenated regex is not valid \s+(?<=(foo))(;|:)\s*(F{O}2)|(BAR)(([A-Z]|(\d{5})) Warning message: The following regex chunks are not valid in isolation: (1) ( (2) ([A-Z]|(\d{5}) (3) )$regex [1] FALSE $chunks space simp or TRUE TRUE TRUE TRUE FALSE FALSE FALSElibrary(qdapRegex) explain(m)NODE EXPLANATION -------------------------------------------------------------------------------- \\s whitespace (\n, \r, \t, \f, and " ") -------------------------------------------------------------------------------- ' ' -------------------------------------------------------------------------------- (?<= look behind to see if there is: -------------------------------------------------------------------------------- ( group and capture to \\1: -------------------------------------------------------------------------------- foo 'foo' -------------------------------------------------------------------------------- ) end of \\1 -------------------------------------------------------------------------------- ) end of look-behind -------------------------------------------------------------------------------- ( group and capture to \\2: -------------------------------------------------------------------------------- ; ';' -------------------------------------------------------------------------------- | OR -------------------------------------------------------------------------------- : ':' -------------------------------------------------------------------------------- ) end of \\2 -------------------------------------------------------------------------------- \\s* whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible)) -------------------------------------------------------------------------------- ( group and capture to \\3: -------------------------------------------------------------------------------- F{O}2 'F{O}2' -------------------------------------------------------------------------------- ) end of \\3 -------------------------------------------------------------------------------- | OR -------------------------------------------------------------------------------- ( group and capture to \\4: -------------------------------------------------------------------------------- BAR 'BAR' -------------------------------------------------------------------------------- ) end of \\4 -------------------------------------------------------------------------------- ( group and capture to \\5: -------------------------------------------------------------------------------- ( group and capture to \\6: -------------------------------------------------------------------------------- [A-Z] any character of: 'A' to 'Z' -------------------------------------------------------------------------------- | OR -------------------------------------------------------------------------------- ( group and capture to \\7: -------------------------------------------------------------------------------- \\d{5} digits (0-9) (5 times) -------------------------------------------------------------------------------- ) end of \\7 -------------------------------------------------------------------------------- ) end of \\6 -------------------------------------------------------------------------------- ) end of \\5## Example 2 (Twitter Handle) twitter <- construct( neg_lookbehing = "(?<![@\\w])" %:)%"Make sure the string doesn't start with @ or a word", at = "(@)" %:)%"capture starting with @ symbol", s_gr1 = "(" %:)%"Opening parenthesis group 1", handle = "([a-z0-9_]{1,15})" %:)%"Mix of 15 letters, numbers, or underscores", boundary = "\\b", e_gr1 = ")" %:)%"Closing parenthesis group 1" ) twitter[1] "(?<![@\\w])(@)(([a-z0-9_]{1,15})\\b)"unglue(twitter)$neg_lookbehing [1] "(?<![@\\w])" $at [1] "(@)" $s_gr1 [1] "(" $handle [1] "([a-z0-9_]{1,15})" $boundary [1] "\\b" $e_gr1 [1] ")"comments(twitter)$neg_lookbehing [1] "Make sure the string doesn't start with @ or a word" $at [1] "capture starting with @ symbol" $s_gr1 [1] "Opening parenthesis group 1" $handle [1] "Mix of 15 letters, numbers, or underscores" $boundary NULL $e_gr1 [1] "Closing parenthesis group 1"regex(twitter)$neg_lookbehing [1] "(?<![@\\w])" $at [1] "(@)" $s_gr1 [1] "(" $handle [1] "([a-z0-9_]{1,15})" $boundary [1] "\\b" $e_gr1 [1] ")"summary(twitter)(?<![@\w])(@)(([a-z0-9_]{1,15})\b) ================================== REGEX 1: (?<![@\w]) NAME : neg_lookbehing COMMENT: "Make sure the string doesn't start with @ or a word" REGEX 2: (@) NAME : at COMMENT: "capture starting with @ symbol" REGEX 3: ( NAME : s_gr1 COMMENT: "Opening parenthesis group 1" REGEX 4: ([a-z0-9_]{1,15}) NAME : handle COMMENT: "Mix of 15 letters, numbers, or underscores" REGEX 5: \b NAME : boundary COMMENT: REGEX 6: ) NAME : e_gr1 COMMENT: "Closing parenthesis group 1"test(twitter)Warning message: The following regex chunks are not valid in isolation: (1) ( (2) )$regex [1] TRUE $chunks neg_lookbehing at s_gr1 handle boundary e_gr1 TRUE TRUE FALSE TRUE TRUE FALSEexplain(twitter)NODE EXPLANATION -------------------------------------------------------------------------------- (?<! look behind to see if there is not: -------------------------------------------------------------------------------- [@\\w] any character of: '@', word characters (a-z, A-Z, 0-9, _) -------------------------------------------------------------------------------- ) end of look-behind -------------------------------------------------------------------------------- ( group and capture to \\1: -------------------------------------------------------------------------------- @ '@' -------------------------------------------------------------------------------- ) end of \\1 -------------------------------------------------------------------------------- ( group and capture to \\2: -------------------------------------------------------------------------------- ( group and capture to \\3: -------------------------------------------------------------------------------- [a-z0-9_]{1,15} any character of: 'a' to 'z', '0' to '9', '_' (between 1 and 15 times (matching the most amount possible)) -------------------------------------------------------------------------------- ) end of \\3 -------------------------------------------------------------------------------- \\b the boundary between a word char (\\w) and something that is not a word char -------------------------------------------------------------------------------- ) end of \\2x <- c("@hadley I like #rstats for #ggplot2 work.", "Difference between #magrittr and #pipeR, both implement pipeline operators for #rstats: http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio", "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1", "tyler.rinker@gamil.com is my email", "A non valid Twitter is @abcdefghijklmnopqrstuvwxyz" ) library(qdapRegex) rm_default(x, pattern = twitter, extract = TRUE)[[1]] [1] "@hadley" [[2]] [1] "@timelyportfolio" [[3]] [1] "@ramnath_vaidya" [[4]] [1] NA [[5]] [1] NA## Example 3 (Modular Chunks) combined <- construct( twitter = twitter %:)%"Twitter regex created previously", or = "|" %:)%"Join the twitter handle regex and a hash tag regex", hash = grab("@rm_hash") %:)%"Twitter hash tag regex" ) combined[1] "(?<![@\\w])(@)(([a-z0-9_]{1,15})\\b)|(?<!/)((#)(\\w+))"unglue(combined)$twitter [1] "(?<![@\\w])(@)(([a-z0-9_]{1,15})\\b)" attr(,"regex") attr(,"regex")$neg_lookbehing [1] "(?<![@\\w])" attr(,"regex")$at [1] "(@)" attr(,"regex")$s_gr1 [1] "(" attr(,"regex")$handle [1] "([a-z0-9_]{1,15})" attr(,"regex")$boundary [1] "\\b" attr(,"regex")$e_gr1 [1] ")" attr(,"comments") attr(,"comments")$neg_lookbehing [1] "Make sure the string doesn't start with @ or a word" attr(,"comments")$at [1] "capture starting with @ symbol" attr(,"comments")$s_gr1 [1] "Opening parenthesis group 1" attr(,"comments")$handle [1] "Mix of 15 letters, numbers, or underscores" attr(,"comments")$boundary NULL attr(,"comments")$e_gr1 [1] "Closing parenthesis group 1" $or [1] "|" $hash [1] "(?<!/)((#)(\\w+))"comments(combined)$twitter [1] "Twitter regex created previously" $or [1] "Join the twitter handle regex and a hash tag regex" $hash [1] "Twitter hash tag regex"regex(combined)$twitter [1] "(?<![@\\w])(@)(([a-z0-9_]{1,15})\\b)" attr(,"regex") attr(,"regex")$neg_lookbehing [1] "(?<![@\\w])" attr(,"regex")$at [1] "(@)" attr(,"regex")$s_gr1 [1] "(" attr(,"regex")$handle [1] "([a-z0-9_]{1,15})" attr(,"regex")$boundary [1] "\\b" attr(,"regex")$e_gr1 [1] ")" attr(,"comments") attr(,"comments")$neg_lookbehing [1] "Make sure the string doesn't start with @ or a word" attr(,"comments")$at [1] "capture starting with @ symbol" attr(,"comments")$s_gr1 [1] "Opening parenthesis group 1" attr(,"comments")$handle [1] "Mix of 15 letters, numbers, or underscores" attr(,"comments")$boundary NULL attr(,"comments")$e_gr1 [1] "Closing parenthesis group 1" $or [1] "|" $hash [1] "(?<!/)((#)(\\w+))"summary(combined)(?<![@\w])(@)(([a-z0-9_]{1,15})\b)|(?<!/)((#)(\w+)) =================================================== REGEX 1: (?<![@\w])(@)(([a-z0-9_]{1,15})\b) NAME : twitter COMMENT: "Twitter regex created previously" REGEX 2: | NAME : or COMMENT: "Join the twitter handle regex and a hash tag regex" REGEX 3: (?<!/)((#)(\w+)) NAME : hash COMMENT: "Twitter hash tag regex"test(combined)$regex [1] TRUE $chunks twitter or hash TRUE TRUE TRUEexplain(combined)NODE EXPLANATION -------------------------------------------------------------------------------- (?<! look behind to see if there is not: -------------------------------------------------------------------------------- [@\\w] any character of: '@', word characters (a-z, A-Z, 0-9, _) -------------------------------------------------------------------------------- ) end of look-behind -------------------------------------------------------------------------------- ( group and capture to \\1: -------------------------------------------------------------------------------- @ '@' -------------------------------------------------------------------------------- ) end of \\1 -------------------------------------------------------------------------------- ( group and capture to \\2: -------------------------------------------------------------------------------- ( group and capture to \\3: -------------------------------------------------------------------------------- [a-z0-9_]{1,15} any character of: 'a' to 'z', '0' to '9', '_' (between 1 and 15 times (matching the most amount possible)) -------------------------------------------------------------------------------- ) end of \\3 -------------------------------------------------------------------------------- \\b the boundary between a word char (\\w) and something that is not a word char -------------------------------------------------------------------------------- ) end of \\2 -------------------------------------------------------------------------------- | OR -------------------------------------------------------------------------------- (?<! look behind to see if there is not: -------------------------------------------------------------------------------- / '/' -------------------------------------------------------------------------------- ) end of look-behind -------------------------------------------------------------------------------- ( group and capture to \\4: -------------------------------------------------------------------------------- ( group and capture to \\5: -------------------------------------------------------------------------------- ) end of \\5 -------------------------------------------------------------------------------- ) end of \\4