construct(...)
...: Sub-expressions (character strings) that may optionally be named, commented (see ?`%:)%`), and indented.
Returns a character vector of class regexr. The attributes of the returned object retain the original name and comment properties.
Constructs human-readable regular expressions from sub-expressions. The user may attach optional meta information, a name and a comment, to each sub-expression. This allows regular expressions to be written much like code: the expression reads top to bottom, the syntax is broken into manageable chunks, and sub-expressions can be indented to show structure such as nested groups. Finally, sub-expressions can be commented to provide linguistic grounding for more complex sub-expressions.
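The same idea can be sketched in base R without the package. This is only an illustration of composing a pattern from named pieces, not a description of how construct works internally; `parts` and `pattern` are names chosen here for the sketch.

```r
# Named sub-expressions, written top to bottom like code.
# `parts` and `pattern` are illustrative names, not part of regexr.
parts <- c(
    no_at_wrd = "(?<![@\\w])",        # not preceded by @ or a word character
    at        = "(@)",                # literal @
    handle    = "([a-z0-9_]{1,15})",  # 1 to 15 lowercase letters, digits, or underscores
    boundary  = "\\b"                 # word boundary
)

# Concatenate the pieces into a single regular expression.
pattern <- paste(parts, collapse = "")

# Look-behind assertions require perl = TRUE in base R.
x <- c("@hadley I like #rstats", "tyler.rinker@gamil.com is my email")
handles <- regmatches(x, regexpr(pattern, x, perl = TRUE))
handles
# [1] "@hadley"
```

The email address yields no match because its @ is preceded by a word character. What this sketch lacks, and what construct adds, is the ability to carry comments alongside the names and to inspect them later.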
## Minimal Example
minimal <- construct("a", "b", "c")
minimal
[1] "abc"

unglue(minimal)
[[1]]
[1] "a"

[[2]]
[1] "b"

[[3]]
[1] "c"

comments(minimal)
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

subs(minimal)
[[1]]
[1] "a"

[[2]]
[1] "b"

[[3]]
[1] "c"

test(minimal)
$regex
[1] TRUE

$subexpressions
[1] TRUE TRUE TRUE

summary(minimal)
abc
===
SUB-EXPR 1: a
NAME      :
COMMENT   :

SUB-EXPR 2: b
NAME      :
COMMENT   :

SUB-EXPR 3: c
NAME      :
COMMENT   :

## Example 1
m <- construct(
    space   = "\\s+"          %:)% "I see",
    simp    = "(?<=(foo))",
    or      = "(;|:)\\s*"     %:)% "comment on what this does",
    is_then = "[ia]s th[ae]n"
)
m
[1] "\\s+(?<=(foo))(;|:)\\s*[ia]s th[ae]n"

unglue(m)
$space
[1] "\\s+"

$simp
[1] "(?<=(foo))"

$or
[1] "(;|:)\\s*"

$is_then
[1] "[ia]s th[ae]n"

summary(m)
\s+(?<=(foo))(;|:)\s*[ia]s th[ae]n
==================================
SUB-EXPR 1: \s+
NAME      : space
COMMENT   : "I see"

SUB-EXPR 2: (?<=(foo))
NAME      : simp
COMMENT   :

SUB-EXPR 3: (;|:)\s*
NAME      : or
COMMENT   : "comment on what this does"

SUB-EXPR 4: [ia]s th[ae]n
NAME      : is_then
COMMENT   :

subs(m)
$space
[1] "\\s+"

$simp
[1] "(?<=(foo))"

$or
[1] "(;|:)\\s*"

$is_then
[1] "[ia]s th[ae]n"

comments(m)
$space
[1] "I see"

$simp
NULL

$or
[1] "comment on what this does"

$is_then
NULL

subs(m)[4] <- "(FO{2})|(BAR)"
summary(m)
\s+(?<=(foo))(;|:)\s*(FO{2})|(BAR)
==================================
SUB-EXPR 1: \s+
NAME      : space
COMMENT   : "I see"

SUB-EXPR 2: (?<=(foo))
NAME      : simp
COMMENT   :

SUB-EXPR 3: (;|:)\s*
NAME      : or
COMMENT   : "comment on what this does"

SUB-EXPR 4: (FO{2})|(BAR)
NAME      : is_then
COMMENT   :

test(m)
$regex
[1] TRUE

$subexpressions
  space    simp      or is_then
   TRUE    TRUE    TRUE    TRUE

subs(m)[5:7] <- c("(", "([A-Z]|(\\d{5})", ")")
test(m)
Warning message:
The concatenated regex is not valid

\s+(?<=(foo))(;|:)\s*(FO{2})|(BAR)(([A-Z]|(\d{5}))

Warning message:
The following regex sub-expressions are not valid in isolation:

(1) (
(2) ([A-Z]|(\d{5})
(3) )

$regex
[1] FALSE

$subexpressions
  space    simp      or is_then
   TRUE    TRUE    TRUE    TRUE   FALSE   FALSE   FALSE

library(qdapRegex)
explain(m)
NODE                     EXPLANATION
--------------------------------------------------------------------------------
\\s+                     whitespace (\n, \r, \t, \f, and " ") (1 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
(?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
(                        group and capture to \\1:
--------------------------------------------------------------------------------
foo                      'foo'
--------------------------------------------------------------------------------
)                        end of \\1
--------------------------------------------------------------------------------
)                        end of look-behind
--------------------------------------------------------------------------------
(                        group and capture to \\2:
--------------------------------------------------------------------------------
;                        ';'
--------------------------------------------------------------------------------
|                        OR
--------------------------------------------------------------------------------
:                        ':'
--------------------------------------------------------------------------------
)                        end of \\2
--------------------------------------------------------------------------------
\\s*                     whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
(                        group and capture to \\3:
--------------------------------------------------------------------------------
F                        'F'
--------------------------------------------------------------------------------
O{2}                     'O' (2 times)
--------------------------------------------------------------------------------
)                        end of \\3
--------------------------------------------------------------------------------
|                        OR
--------------------------------------------------------------------------------
(                        group and capture to \\4:
--------------------------------------------------------------------------------
BAR                      'BAR'
--------------------------------------------------------------------------------
)                        end of \\4
--------------------------------------------------------------------------------
(                        group and capture to \\5:
--------------------------------------------------------------------------------
(                        group and capture to \\6:
--------------------------------------------------------------------------------
[A-Z]                    any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
|                        OR
--------------------------------------------------------------------------------
(                        group and capture to \\7:
--------------------------------------------------------------------------------
\\d{5}                   digits (0-9) (5 times)
--------------------------------------------------------------------------------
)                        end of \\7
--------------------------------------------------------------------------------
)                        end of \\6
--------------------------------------------------------------------------------
)                        end of \\5

## Example 2 (Twitter Handle 2 ways)
## Bigger Sub-expressions
twitter <- construct(
    no_at_wrd = "(?<![@\\w])"             %:)% "Ensure doesn't start with @ or a word",
    at        = "(@)"                     %:)% "Capture starting with @ symbol",
    handle    = "(([a-z0-9_]{1,15})\\b)"  %:)% "Any 15 letters, numbers, or underscores"
)

## Smaller Sub-expressions
twitter <- construct(
    no_at_wrd = "(?<![@\\w])"        %:)% "Ensure doesn't start with @ or a word",
    at        = "(@)"                %:)% "Capture starting with @ symbol",
    s_gr1     = "("                  %:)% "GROUP 1 START",
    handle    = "([a-z0-9_]{1,15})"  %:)% "Any 15 letters, numbers, or underscores",
    boundary  = "\\b",
    e_gr1     = ")"                  %:)% "GROUP 1 END"
)
twitter
[1] "(?<![@\\w])(@)(([a-z0-9_]{1,15})\\b)"

unglue(twitter)
$no_at_wrd
[1] "(?<![@\\w])"

$at
[1] "(@)"

$s_gr1
[1] "("

$handle
[1] "([a-z0-9_]{1,15})"

$boundary
[1] "\\b"

$e_gr1
[1] ")"

comments(twitter)
$no_at_wrd
[1] "Ensure doesn't start with @ or a word"

$at
[1] "Capture starting with @ symbol"

$s_gr1
[1] "GROUP 1 START"

$handle
[1] "Any 15 letters, numbers, or underscores"

$boundary
NULL

$e_gr1
[1] "GROUP 1 END"

subs(twitter)
$no_at_wrd
[1] "(?<![@\\w])"

$at
[1] "(@)"

$s_gr1
[1] "("

$handle
[1] "([a-z0-9_]{1,15})"

$boundary
[1] "\\b"

$e_gr1
[1] ")"

summary(twitter)
(?<![@\w])(@)(([a-z0-9_]{1,15})\b)
==================================
SUB-EXPR 1: (?<![@\w])
NAME      : no_at_wrd
COMMENT   : "Ensure doesn't start with @ or a word"

SUB-EXPR 2: (@)
NAME      : at
COMMENT   : "Capture starting with @ symbol"

SUB-EXPR 3: (
NAME      : s_gr1
COMMENT   : "GROUP 1 START"

SUB-EXPR 4: ([a-z0-9_]{1,15})
NAME      : handle
COMMENT   : "Any 15 letters, numbers, or underscores"

SUB-EXPR 5: \b
NAME      : boundary
COMMENT   :

SUB-EXPR 6: )
NAME      : e_gr1
COMMENT   : "GROUP 1 END"

test(twitter)
Warning message:
The following regex sub-expressions are not valid in isolation:

(1) (
(2) )

$regex
[1] TRUE

$subexpressions
no_at_wrd        at     s_gr1    handle  boundary     e_gr1
     TRUE      TRUE     FALSE      TRUE      TRUE     FALSE

explain(twitter)
NODE                     EXPLANATION
--------------------------------------------------------------------------------
(?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
[@\\w]                   any character of: '@', word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
)                        end of look-behind
--------------------------------------------------------------------------------
(                        group and capture to \\1:
--------------------------------------------------------------------------------
@                        '@'
--------------------------------------------------------------------------------
)                        end of \\1
--------------------------------------------------------------------------------
(                        group and capture to \\2:
--------------------------------------------------------------------------------
(                        group and capture to \\3:
--------------------------------------------------------------------------------
[a-z0-9_]{1,15}          any character of: 'a' to 'z', '0' to '9', '_' (between 1 and 15 times (matching the most amount possible))
--------------------------------------------------------------------------------
)                        end of \\3
--------------------------------------------------------------------------------
\\b                      the boundary between a word char (\\w) and something that is not a word char
--------------------------------------------------------------------------------
)                        end of \\2

x <- c(
    "@hadley I like #rstats for #ggplot2 work.",
    "Difference between #magrittr and #pipeR, both implement pipeline operators for #rstats: http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio",
    "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1",
    "tyler.rinker@gamil.com is my email",
    "A non valid Twitter is @abcdefghijklmnopqrstuvwxyz"
)

library(qdapRegex)
rm_default(x, pattern = twitter, extract = TRUE)
[[1]]
[1] "@hadley"

[[2]]
[1] "@timelyportfolio"

[[3]]
[1] "@ramnath_vaidya"

[[4]]
[1] NA

[[5]]
[1] NA

## Example 3 (Modular Sub-expression Chunks)
combined <- construct(
    twitter = twitter           %:)% "Twitter regex created previously",
    or      = "|"               %:)% "Join handle regex & hash tag regex",
    hash    = grab("@rm_hash")  %:)% "Twitter hash tag regex"
)
combined
[1] "(?<![@\\w])(@)(([a-z0-9_]{1,15})\\b)|(?<!/)((#)(\\w+))"

unglue(combined)
$twitter
[1] "(?<![@\\w])(@)(([a-z0-9_]{1,15})\\b)"
attr(,"subs")
attr(,"subs")$no_at_wrd
[1] "(?<![@\\w])"

attr(,"subs")$at
[1] "(@)"

attr(,"subs")$s_gr1
[1] "("

attr(,"subs")$handle
[1] "([a-z0-9_]{1,15})"

attr(,"subs")$boundary
[1] "\\b"

attr(,"subs")$e_gr1
[1] ")"

attr(,"comments")
attr(,"comments")$no_at_wrd
[1] "Ensure doesn't start with @ or a word"

attr(,"comments")$at
[1] "Capture starting with @ symbol"

attr(,"comments")$s_gr1
[1] "GROUP 1 START"

attr(,"comments")$handle
[1] "Any 15 letters, numbers, or underscores"

attr(,"comments")$boundary
NULL

attr(,"comments")$e_gr1
[1] "GROUP 1 END"

$or
[1] "|"

$hash
[1] "(?<!/)((#)(\\w+))"

comments(combined)
$twitter
[1] "Twitter regex created previously"

$or
[1] "Join handle regex & hash tag regex"

$hash
[1] "Twitter hash tag regex"

subs(combined)
$twitter
[1] "(?<![@\\w])(@)(([a-z0-9_]{1,15})\\b)"
attr(,"subs")
attr(,"subs")$no_at_wrd
[1] "(?<![@\\w])"

attr(,"subs")$at
[1] "(@)"

attr(,"subs")$s_gr1
[1] "("

attr(,"subs")$handle
[1] "([a-z0-9_]{1,15})"

attr(,"subs")$boundary
[1] "\\b"

attr(,"subs")$e_gr1
[1] ")"

attr(,"comments")
attr(,"comments")$no_at_wrd
[1] "Ensure doesn't start with @ or a word"

attr(,"comments")$at
[1] "Capture starting with @ symbol"

attr(,"comments")$s_gr1
[1] "GROUP 1 START"

attr(,"comments")$handle
[1] "Any 15 letters, numbers, or underscores"

attr(,"comments")$boundary
NULL

attr(,"comments")$e_gr1
[1] "GROUP 1 END"

$or
[1] "|"

$hash
[1] "(?<!/)((#)(\\w+))"

summary(combined)
(?<![@\w])(@)(([a-z0-9_]{1,15})\b)|(?<!/)((#)(\w+))
===================================================
SUB-EXPR 1: (?<![@\w])(@)(([a-z0-9_]{1,15})\b)
NAME      : twitter
COMMENT   : "Twitter regex created previously"

SUB-EXPR 2: |
NAME      : or
COMMENT   : "Join handle regex & hash tag regex"

SUB-EXPR 3: (?<!/)((#)(\w+))
NAME      : hash
COMMENT   : "Twitter hash tag regex"

test(combined)
$regex
[1] TRUE

$subexpressions
twitter      or    hash
   TRUE    TRUE    TRUE

explain(combined)
NODE                     EXPLANATION
--------------------------------------------------------------------------------
(?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
[@\\w]                   any character of: '@', word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
)                        end of look-behind
--------------------------------------------------------------------------------
(                        group and capture to \\1:
--------------------------------------------------------------------------------
@                        '@'
--------------------------------------------------------------------------------
)                        end of \\1
--------------------------------------------------------------------------------
(                        group and capture to \\2:
--------------------------------------------------------------------------------
(                        group and capture to \\3:
--------------------------------------------------------------------------------
[a-z0-9_]{1,15}          any character of: 'a' to 'z', '0' to '9', '_' (between 1 and 15 times (matching the most amount possible))
--------------------------------------------------------------------------------
)                        end of \\3
--------------------------------------------------------------------------------
\\b                      the boundary between a word char (\\w) and something that is not a word char
--------------------------------------------------------------------------------
)                        end of \\2
--------------------------------------------------------------------------------
|                        OR
--------------------------------------------------------------------------------
(?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
/                        '/'
--------------------------------------------------------------------------------
)                        end of look-behind
--------------------------------------------------------------------------------
(                        group and capture to \\4:
--------------------------------------------------------------------------------
(                        group and capture to \\5:
--------------------------------------------------------------------------------
)                        end of \\5
--------------------------------------------------------------------------------
)                        end of \\4

## Different Structure (no names): Example from Martin Fowler:
## *Note: Fowler argues for improved choices in regex representation
## and names that make the regex functionality more evident, commenting
## only where needed. See:
## browseURL("http://martinfowler.com/bliki/ComposedRegex.html")
pattern <- construct(
    '@"^score',
    '\\s+',
    '(\\d+)'  %:)% 'points',
    '\\s+',
    'for',
    '\\s+',
    '(\\d+)'  %:)% 'number of nights',
    '\\s+',
    'night',
    's?'      %:)% 'optional plural',
    '\\s+',
    'at',
    '\\s+',
    '(.*)'    %:)% 'hotel name',
    '";'
)
summary(pattern)
@"^score\s+(\d+)\s+for\s+(\d+)\s+nights?\s+at\s+(.*)";
======================================================
SUB-EXPR 1: @"^score
NAME      :
COMMENT   :

SUB-EXPR 2: \s+
NAME      :
COMMENT   :

SUB-EXPR 3: (\d+)
NAME      :
COMMENT   : "points"

SUB-EXPR 4: \s+
NAME      :
COMMENT   :

SUB-EXPR 5: for
NAME      :
COMMENT   :

SUB-EXPR 6: \s+
NAME      :
COMMENT   :

SUB-EXPR 7: (\d+)
NAME      :
COMMENT   : "number of nights"

SUB-EXPR 8: \s+
NAME      :
COMMENT   :

SUB-EXPR 9: night
NAME      :
COMMENT   :

SUB-EXPR 10: s?
NAME      :
COMMENT   : "optional plural"

SUB-EXPR 11: \s+
NAME      :
COMMENT   :

SUB-EXPR 12: at
NAME      :
COMMENT   :

SUB-EXPR 13: \s+
NAME      :
COMMENT   :

SUB-EXPR 14: (.*)
NAME      :
COMMENT   :  "hotel name"

SUB-EXPR 15: ";
NAME      :
COMMENT   :
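
The Fowler pattern above keeps the C# string delimiters (@" and ";) as literal sub-expressions, so as written it is not meant to be matched in R. Assuming one wants to run the same composition against text, a base-R sketch might drop the delimiters and join the pieces by hand. The names `parts`, `fowler`, `input`, and `fields` are chosen here for illustration, and the input string is only sample data.

```r
# Sub-expressions from the Fowler example, minus the C# string delimiters.
# Comments play the role of the %:)% annotations.
parts <- c(
    "^score",
    "\\s+",
    "(\\d+)",   # points
    "\\s+", "for", "\\s+",
    "(\\d+)",   # number of nights
    "\\s+", "night",
    "s?",       # optional plural
    "\\s+", "at", "\\s+",
    "(.*)"      # hotel name
)
fowler <- paste(parts, collapse = "")

# Sample input (illustrative only).
input <- "score 400 for 2 nights at Minas Tirith Airport"

# regexec() returns the full match followed by each capture group.
fields <- regmatches(input, regexec(fowler, input, perl = TRUE))[[1]]
fields
# [1] "score 400 for 2 nights at Minas Tirith Airport"
# [2] "400"
# [3] "2"
# [4] "Minas Tirith Airport"
```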