qdap Package Vignette

Tyler Rinker

qdap (Rinker, 2013) is an R package designed to assist in quantitative discourse analysis. The package stands as a bridge between qualitative transcripts of dialogue and statistical analysis and visualization. qdap was born out of a frustration with current discourse analysis programs. Packaged programs are closed systems, meaning the researcher using the method has little, if any, influence on the program applied to her data.

R already has thousands of excellent packages for statistics and visualization. qdap is designed to stand as a bridge between the qualitative discourse of a transcript and the computational power and freedom that R offers. As qdap returns the power to the researcher, it also allows the researcher to be more efficient, and thus more effective and productive, in data analysis. The qdap package provides researchers with the tools to analyze data and, more importantly, is a dynamic system governed by the data, shaped by theory, and continuously refined by the field.

…if you can dream up an analysis then qdap and R can help get you there.

The following vignette is a loose chronological road map for utilizing the tools provided by qdap.


Starting a New Project [YT]

The following functions will be utilized in this section:
- Project Template

The function new_project is designed to generate a project template of multiple nested directories that organize and guide the researcher through a qualitative study, from data collection to analysis and report/presentation generation. This workflow framework will enable the researcher to be better organized and more efficient in all stages of the research process. new_project utilizes the reports package (Rinker, 2013b).
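
A minimal sketch of generating the template in the current working directory follows; the project name here is purely illustrative.

library(qdap)
## Generate the nested directory template in the working directory
new_project(project = "discourse_study", path = getwd())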

Please see the following links for PDF descriptions of the contents of the new_project and the reports directories:

- Project Workflow
- Report Workflow

extra_functions [YT]

The new_project template is designed to be utilized with RStudio. Upon clicking the xxx.Rproj file the template will be loaded into RStudio. The .Rprofile script will be sourced upon startup, allowing the user to automatically load packages, functions, etc. related to the project. The file extra_functions.R is sourced, loading custom functions. Already included are two functions, email and todo, used to generate project member emails and track project tasks. This automatic sourcing greatly enhances efficiency in workflow.
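
A hypothetical sketch of what such an .Rprofile might contain is shown below; the package and file names are illustrative, not the template's literal contents.

## .Rprofile: run automatically when the project is opened in RStudio
library(qdap)                 ## load packages used throughout the project
source("extra_functions.R")   ## load custom project functions (e.g., email, todo)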

Import/Export Discourse Data

The following functions will be utilized in this section:
- Condense Dataframe Columns
- Map Transcript Files from a Directory to a Script
- Read/Write Multiple csv Files at a Time
- Read Transcripts Into R

Reading In Transcript Data [YT]

This subsection covers how to read in transcript data. Generally the researcher will have data stored as a .docx (Microsoft Word or Open/Libre Office) or .xlsx/.csv (spreadsheet format). It is of great importance that the researcher manually writes/parses their transcripts to avoid potential analysis problems later. All sentences should contain appropriate qdap punctuation (declarative = ., interrogative = ?, exclamatory = !, interrupted = |, or the imperative versions of these end marks = *., *?, *!, *|). Additionally, if a sentence contains an end mark/punctuation it should have accompanying text/dialogue. Two functions are useful for reading in data, read.transcript and dir_map. read.transcript detects file type (.docx/.csv/.xlsx) and reads in a single transcript, whereas dir_map generates code that utilizes read.transcript for each of the multiple transcripts in a single directory. Note that read.transcript expects a two column formatted transcript (usually with person on the left and dialogue on the right).

Five arguments are of particular importance to read.transcript:

file

The name of the file which the data are to be read from. Each row of the table appears as one line of the file. If it does not contain an absolute path, the file name is relative to the current working directory, getwd().

col.names

A character vector specifying the column names of the transcript columns.

header

logical. If TRUE the file contains the names of the variables as its first line.

sep

The field separator character. Values on each line of the file are separated by this character. The default of NULL instructs read.transcript to use a separator suitable for the file type being read in.

skip

Integer; the number of lines of the data file to skip before beginning to read data.

Often transcripts contain extraneous material at the top and the argument skip = ? must be used to skip these extra lines. Some sort of unique separator must also be used to separate the person column from the text column. By default sep = ":" is assumed. If your transcripts do not contain a separator one must be inserted manually. Also note that the researcher may want to prepare the transcripts with brackets to denote non spoken annotations as well as dialogue that is read rather than spoken. For more on bracket parsing see Bracket/General Chunk Extraction.

Note: It is important that all sentences contain valid qdap punctuation (., ?, !, |) in your transcripts. Many qdap functions are dependent upon this assumption.
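
For instance, a hypothetical exchange marked with these conventions (the dialogue below is invented purely for illustration) might read:

Teacher 1: Please open your books*.
Student 2: But I thought we|
Teacher 1: We can discuss that later. Any questions?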

Reading In Data- read.transcript

## Location of sample transcripts from the qdap package
(doc1 <- system.file("extdata/transcripts/trans1.docx", package = "qdap"))
(doc2 <- system.file("extdata/transcripts/trans2.docx", package = "qdap"))
(doc3 <- system.file("extdata/transcripts/trans3.docx", package = "qdap"))
(doc4 <- system.file("extdata/transcripts/trans4.xlsx", package = "qdap"))
dat1 <- read.transcript(doc1)
truncdf(dat1, 40)
##                  X1                                       X2
## 1      Researcher 2                         October 7, 1892.
## 2         Teacher 4 Students it's time to learn. [Student di
## 3 Multiple Students        Yes teacher we're ready to learn.
## 4     [Cross Talk 3                                      00]
## 5         Teacher 4 Let's read this terrific book together. 
dat2 <- read.transcript(doc1, col.names = c("person", "dialogue"))
truncdf(dat2, 40)
##              person                                 dialogue
## 1      Researcher 2                         October 7, 1892.
## 2         Teacher 4 Students it's time to learn. [Student di
## 3 Multiple Students        Yes teacher we're ready to learn.
## 4     [Cross Talk 3                                      00]
## 5         Teacher 4 Let's read this terrific book together. 
dat2b <- rm_row(dat2, "person", "[C") #remove bracket row
truncdf(dat2b, 40)
##              person                                 dialogue
## 1      Researcher 2                         October 7, 1892.
## 2         Teacher 4 Students it's time to learn. [Student di
## 3 Multiple Students        Yes teacher we're ready to learn.
## 4         Teacher 4 Let's read this terrific book together. 
## Be aware of the need to `skip` non transcript lines
## Incorrect read; Needed to use `skip`
read.transcript(doc2)
Error in data.frame(X1 = speaker, X2 = pvalues, stringsAsFactors = FALSE) : 
  arguments imply differing number of rows: 7, 8
## Correct: Used `skip`
dat3 <- read.transcript(doc2, skip = 1)
truncdf(dat3, 40)
##                  X1                                       X2
## 1      Researcher 2                         October 7, 1892.
## 2         Teacher 4 Students it's time to learn. [Student di
## 3 Multiple Students        Yes teacher we're ready to learn.
## 4     [Cross Talk 3                                      00]
## 5         Teacher 4 Let's read this terrific book together. 
## Be Aware of the `sep` Used
## Incorrect Read; Wrong `sep` Provided (used default `:`)
read.transcript(doc3, skip = 1)
##Dialogue and Person Columns Mixed Inappropriately
## X1
## 1 [Cross Talk 3
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              X2
## 1 Teacher 4-Students it's time to learn. [Student discussion; unintelligible] Multiple Students-Yes teacher we're ready to learn. 00] Teacher 4-Let's read this terrific book together. It's called Moo Baa La La La and what was I going to ... Oh yes The story is by Sandra Boynton. A cow says Moo. A Sheep says Baa. Three singing pigs say LA LA LA! "No, no!" you say, that isn't right. The pigs say oink all day and night. Rhinoceroses snort and snuff. And little dogs go ruff ruff ruff! Some other dogs go bow wow wow! And cats and kittens say Meow! Quack! Says the duck. A horse says neigh. It's quiet now. What do you say?
## Correct `sep` Used
dat4 <- read.transcript(doc3, sep = "-", skip = 1)
truncdf(dat4, 40)
##                  X1                                       X2
## 1         Teacher 4 Students it's time to learn. [Student di
## 2 Multiple Students Yes teacher we're ready to learn. [Cross
## 3         Teacher 4 Let's read this terrific book together. 
## Read In .xlsx Data
dat5 <- read.transcript(doc4)
truncdf(dat5, 40)
##                   V1                                       V2
## 1      Researcher 2:                         October 7, 1892.
## 2         Teacher 4:             Students it's time to learn.
## 3                                                    
## 4 Multiple Students:        Yes teacher we're ready to learn.
## 5                                                    
## 6         Teacher 4: Let's read this terrific book together. 
## Reading In Text
trans <- "sam: Computer is fun. Not too fun.
greg: No it's not, it's dumb.
teacher: What should we do?
sam: You liar, it stinks!"

read.transcript(text=trans)
##        V1                            V2
## 1     sam Computer is fun. Not too fun.
## 2    greg         No its not, its dumb.
## 3 teacher            What should we do?
## 4     sam          You liar, it stinks!

The dir_map function enables the researcher to produce multiple lines of code, one line with read.transcript for each file in a directory, which is then optionally copied to the clipboard for easy insertion into a script. Note that setting the argument use.path = FALSE may allow the code to be more portable in that a static path is not supplied to the read.transcript scripts.

Reading In Data- dir_map

(DIR <- system.file("extdata/transcripts", package = "qdap"))
dir_map(DIR)

…will produce…

dat1 <- read.transcript('~/extdata/transcripts/trans1.docx', col.names = c('person', 'dialogue'), skip = 0)
dat2 <- read.transcript('~/extdata/transcripts/trans2.docx', col.names = c('person', 'dialogue'), skip = 0)
dat3 <- read.transcript('~/extdata/transcripts/trans3.docx', col.names = c('person', 'dialogue'), skip = 0)
dat4 <- read.transcript('~/extdata/transcripts/trans4.xlsx', col.names = c('person', 'dialogue'), skip = 0)

Reading/Writing Multiple .csv Files [YT]

The mcsv_x family of functions is utilized to read (mcsv_r) and write (mcsv_w) multiple csv files at once. mcsv_w takes an arbitrary number of dataframes and outputs them to the supplied directory (dir = ?). An attempt will be made to output the dataframes from qdap functions that output lists of dataframes. Note that dataframes containing columns that are lists must be condensed with the condense function prior to writing with other R dataframe writing functions (e.g., write.csv). By default mcsv_w attempts to utilize condense.
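
Below is a minimal sketch of condensing list columns before writing with write.csv; it assumes, as in the polarity examples later in this vignette, that the raw polarity dataframe (poldat$all) contains list columns such as the matched words.

## Flatten list columns so base R writing functions can handle the dataframe
poldat <- with(DATA, polarity(state, person))
flat <- condense(poldat$all)
write.csv(flat, file = "polarity_all.csv", row.names = FALSE)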

The mcsv_r function reads multiple files at once and then assigns the dataframes to identically named objects (minus the file extension) in the global environment. Additionally, all of the dataframes that are read in are also assigned to an inclusive list (named L1 by default).

Reading and Writing Multiple csvs

## Make new minimal data sets
mtcarsb <- mtcars[1:5, ]; CO2b <- CO2[1:5, ]

## Write multiple csvs and assign the directory path to `a`
a <- mcsv_w(mtcarsb, CO2b, dir="foo")

## New data sets gone from .GlobalEnv
rm("mtcarsb", "CO2b")  

## View the files in `a` and assign to `nms`
(nms <- dir(a))

## Read in and notice the dataframes have been assigned in .GlobalEnv
mcsv_r(file.path(a, nms))
mtcarsb; CO2b
L1

## The dataframe names and list of dataframe can be altered
mcsv_r(file.path(a, nms), a.name = paste0("bot", 1:2), l.name = "bots_stink")
bot1; bot2
bots_stink

## Clean up
delete("foo")

Writing Lists of Dataframes to csvs

## poldat and termco produce lists of dataframes
poldat <- with(DATA, polarity(state, person))
term <- c("the ", "she", " wh")
termdat <- with(raj.act.1,  termco(dialogue, person, term))

## View the lists of dataframes
str(poldat); str(termdat)

## Write the lists of dataframes to csv
mcsv_w(poldat, termdat, mtcars, CO2, dir="foo2")

## Clean up
delete("foo2")

View the Data

The following functions will be utilized in this section:
- Truncated Dataframe Viewing
- Unclass qdap Object to View List of Dataframes
- Text Justification
- Search Columns of a Dataframe

The nature of dialogue data makes it large and cumbersome to view in R. This section explores qdap tools designed for more comfortable viewing of R dialogue oriented text dataframes.

Truncated Dataframe Viewing

The _truncdf family of functions (trunc + dataframe = truncdf) is designed to truncate the width of columns and number of rows in dataframes and lists of dataframes. The l and h in front of trunc stand for list and head; these functions are extensions of truncdf. qview is a wrapper for htruncdf that also displays the number of rows, the number of columns, and the dataframe name.

Truncated Data Viewing

truncdf(raj[1:10, ])
##     person   dialogue act
## 1  Sampson Gregory, o   1
## 2  Gregory No, for th   1
## 3  Sampson I mean, an   1
## 4  Gregory Ay, while    1
## 5  Sampson I strike q   1
## 6  Gregory But thou a   1
## 7  Sampson A dog of t   1
## 8  Gregory To move is   1
## 9  Sampson A dog of t   1
## 10 Gregory That shows   1
truncdf(raj[1:10, ], 40)
##     person                                 dialogue act
## 1  Sampson Gregory, o my word, we'll not carry coal   1
## 2  Gregory      No, for then we should be colliers.   1
## 3  Sampson  I mean, an we be in choler, we'll draw.   1
## 4  Gregory Ay, while you live, draw your neck out o   1
## 5  Sampson           I strike quickly, being moved.   1
## 6  Gregory But thou art not quickly moved to strike   1
## 7  Sampson A dog of the house of Montague moves me.   1
## 8  Gregory To move is to stir; and to be valiant is   1
## 9  Sampson A dog of that house shall move me to sta   1
## 10 Gregory That shows thee a weak slave; for the we   1
htruncdf(raj)
##     person   dialogue act
## 1  Sampson Gregory, o   1
## 2  Gregory No, for th   1
## 3  Sampson I mean, an   1
## 4  Gregory Ay, while    1
## 5  Sampson I strike q   1
## 6  Gregory But thou a   1
## 7  Sampson A dog of t   1
## 8  Gregory To move is   1
## 9  Sampson A dog of t   1
## 10 Gregory That shows   1
htruncdf(raj, 20)
##     person   dialogue act
## 1  Sampson Gregory, o   1
## 2  Gregory No, for th   1
## 3  Sampson I mean, an   1
## 4  Gregory Ay, while    1
## 5  Sampson I strike q   1
## 6  Gregory But thou a   1
## 7  Sampson A dog of t   1
## 8  Gregory To move is   1
## 9  Sampson A dog of t   1
## 10 Gregory That shows   1
## 11 Sampson True; and    1
## 12 Gregory The quarre   1
## 13 Sampson 'Tis all o   1
## 14 Gregory The heads    1
## 15 Sampson Ay, the he   1
## 16 Gregory They must    1
## 17 Sampson Me they sh   1
## 18 Gregory 'Tis well    1
## 19 Sampson My naked w   1
## 20 Gregory How! turn    1
htruncdf(raj, ,20)
##     person             dialogue act
## 1  Sampson Gregory, o my word,    1
## 2  Gregory No, for then we shou   1
## 3  Sampson I mean, an we be in    1
## 4  Gregory Ay, while you live,    1
## 5  Sampson I strike quickly, be   1
## 6  Gregory But thou art not qui   1
## 7  Sampson A dog of the house o   1
## 8  Gregory To move is to stir;    1
## 9  Sampson A dog of that house    1
## 10 Gregory That shows thee a we   1
ltruncdf(rajPOS, width = 4)
## $text
##   data
## 1 Greg
## 2 No, 
## 3 I me
## 4 Ay, 
## 5 I st
## 6 But 
## 
## $POStagged
##   POSt POSt word
## 1 greg c("N    8
## 2 no/D c("D    7
## 3 i/PR c("P    9
## 4 ay/N c("N   11
## 5 i/VB c("V    5
## 6 but/ c("C    8
## 
## $POSprop
##   wrd. prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop
## 1    8    0    0    0    0    0    0    0 12.5    0    0    0    0   25    0    0 12.5    0    0    0 12.5   25    0    0    0    0    0 12.5    0    0    0    0    0    0    0    0    0
## 2    7    0    0    0    0 14.2    0    0 14.2    0    0    0 14.2    0    0    0 14.2    0    0 14.2    0 14.2    0    0    0    0    0 14.2    0    0    0    0    0    0    0    0    0
## 3    9    0    0    0    0 11.1    0    0 11.1    0    0    0    0 11.1    0    0    0    0    0 22.2    0 11.1    0    0    0    0    0 22.2    0    0    0 11.1    0    0    0    0    0
## 4   11    0    0    0    0 9.09    0    0 27.2    0    0    0    0 27.2    0    0    0    0    0 9.09 9.09    0    0    0    0    0    0 9.09    0    0    0 9.09    0    0    0    0    0
## 5    5    0    0    0    0    0    0    0    0    0    0    0    0   20    0    0    0    0    0    0    0   20    0    0    0    0    0    0    0   20   40    0    0    0    0    0    0
## 6    8    0    0 12.5    0 12.5    0    0    0    0    0    0    0 12.5    0    0    0    0    0    0    0   25    0    0    0 12.5    0 12.5 12.5    0    0    0    0    0    0    0    0
## 
## $POSfreq
##   wrd. , . CC CD DT EX FW IN JJ JJR JJS MD NN NNP NNPS NNS PDT POS PRP PRP$ RB RBR RBS RP TO UH VB VBD VBG VBN VBP VBZ WDT WP WP$ WRB
## 1    8 0 0  0  0  0  0  0  1  0   0   0  0  2   0    0   1   0   0   0    1  2   0   0  0  0  0  1   0   0   0   0   0   0  0   0   0
## 2    7 0 0  0  0  1  0  0  1  0   0   0  1  0   0    0   1   0   0   1    0  1   0   0  0  0  0  1   0   0   0   0   0   0  0   0   0
## 3    9 0 0  0  0  1  0  0  1  0   0   0  0  1   0    0   0   0   0   2    0  1   0   0  0  0  0  2   0   0   0   1   0   0  0   0   0
## 4   11 0 0  0  0  1  0  0  3  0   0   0  0  3   0    0   0   0   0   1    1  0   0   0  0  0  0  1   0   0   0   1   0   0  0   0   0
## 5    5 0 0  0  0  0  0  0  0  0   0   0  0  1   0    0   0   0   0   0    0  1   0   0  0  0  0  0   0   1   2   0   0   0  0   0   0
## 6    8 0 0  1  0  1  0  0  0  0   0   0  0  1   0    0   0   0   0   0    0  2   0   0  0  1  0  1   1   0   0   0   0   0  0   0   0
## 
## $POSrnp
##   wrd. , .   CC CD   DT EX FW   IN JJ JJR JJS   MD   NN NNP NNPS  NNS PDT POS  PRP PRP$   RB RBR RBS RP   TO UH   VB  VBD  VBG  VBN  VBP VBZ WDT WP WP$ WRB
## 1    8 0 0    0  0    0  0  0 1(12  0   0   0    0 2(25   0    0 1(12   0   0    0 1(12 2(25   0   0  0    0  0 1(12    0    0    0    0   0   0  0   0   0
## 2    7 0 0    0  0 1(14  0  0 1(14  0   0   0 1(14    0   0    0 1(14   0   0 1(14    0 1(14   0   0  0    0  0 1(14    0    0    0    0   0   0  0   0   0
## 3    9 0 0    0  0 1(11  0  0 1(11  0   0   0    0 1(11   0    0    0   0   0 2(22    0 1(11   0   0  0    0  0 2(22    0    0    0 1(11   0   0  0   0   0
## 4   11 0 0    0  0 1(9.  0  0 3(27  0   0   0    0 3(27   0    0    0   0   0 1(9. 1(9.    0   0   0  0    0  0 1(9.    0    0    0 1(9.   0   0  0   0   0
## 5    5 0 0    0  0    0  0  0    0  0   0   0    0 1(20   0    0    0   0   0    0    0 1(20   0   0  0    0  0    0    0 1(20 2(40    0   0   0  0   0   0
## 6    8 0 0 1(12  0 1(12  0  0    0  0   0   0    0 1(12   0    0    0   0   0    0    0 2(25   0   0  0 1(12  0 1(12 1(12    0    0    0   0   0  0   0   0
## 
## $percent
##   data
## 1 TRUE
## 
## $zero.replace
##   data
## 1    0
qview(raj)
## ========================================================================
## nrow =  840           ncol =  3             raj
## ========================================================================
##     person   dialogue act
## 1  Sampson Gregory, o   1
## 2  Gregory No, for th   1
## 3  Sampson I mean, an   1
## 4  Gregory Ay, while    1
## 5  Sampson I strike q   1
## 6  Gregory But thou a   1
## 7  Sampson A dog of t   1
## 8  Gregory To move is   1
## 9  Sampson A dog of t   1
## 10 Gregory That shows   1
qview(CO2)
## ========================================================================
## nrow =  84           ncol =  5             CO2
## ========================================================================
##    Plant   Type  Treatment conc uptake
## 1    Qn1 Quebec nonchilled   95     16
## 2    Qn1 Quebec nonchilled  175   30.4
## 3    Qn1 Quebec nonchilled  250   34.8
## 4    Qn1 Quebec nonchilled  350   37.2
## 5    Qn1 Quebec nonchilled  500   35.3
## 6    Qn1 Quebec nonchilled  675   39.2
## 7    Qn1 Quebec nonchilled 1000   39.7
## 8    Qn2 Quebec nonchilled   95   13.6
## 9    Qn2 Quebec nonchilled  175   27.3
## 10   Qn2 Quebec nonchilled  250   37.1

Unclass qdap Object to View List of Dataframes

Many qdap objects are lists that print as a single dataframe, though the rest of the objects in the list are available. The lview function unclasses the object and assigns the class "list", allowing all of the dataframes in the list to be viewed.

lview(question_type(DATA.SPLIT$state, DATA.SPLIT$person))
## $raw
##        person                    raw.text n.row endmark
## 4     teacher          What should we do?     4       ?
## 7       sally      How can we be certain?     7       ?
## 10      sally What are you talking about?    10       ?
## 11 researcher           Shall we move on?    11       ?
## 15       greg                You already?    15       ?
##                      strip.text              q.type
## 4            what should we do                 what
## 7        how can we be certain                  how
## 10  what are you talking about                 what
## 11            shall we move on                shall
## 15                 you already  implied_do/does/did
## 
## $count
##       person tot.quest what how shall implied_do/does/did
## 1       greg         1    0   0     0                   1
## 2 researcher         1    0   0     1                   0
## 3      sally         2    1   1     0                   0
## 4    teacher         1    1   0     0                   0
## 5        sam         0    0   0     0                   0
## 
## $prop
##       person tot.quest what how shall implied_do/does/did
## 1       greg         1    0   0     0                 100
## 2 researcher         1    0   0   100                   0
## 3      sally         2   50  50     0                   0
## 4    teacher         1  100   0     0                   0
## 5        sam         0    0   0     0                   0
## 
## $rnp
##       person tot.quest    what    how   shall implied_do/does/did
## 1       greg         1       0      0       0             1(100%)
## 2 researcher         1       0      0 1(100%)                   0
## 3      sally         2  1(50%) 1(50%)       0                   0
## 4    teacher         1 1(100%)      0       0                   0
## 5        sam         0       0      0       0                   0
## 
## $inds
## [1]  4  7 10 11 15
## 
## $missing
## integer(0)
## 
## $percent
## [1] TRUE
## 
## $zero.replace
## [1] 0
## 
## $digits
## [1] 2

Text Justification

By default text data (character vectors) are displayed right justified in R. This can be difficult and unnatural to read, particularly as the length of the sentences increases. The left_just function creates a more natural left justification of text. Note that left_just inserts spaces to achieve the justification. This could interfere with analysis; therefore, the output from left_just should only be used for visualization purposes, not analysis.

Justified Data Viewing

## The unnatural state of R text data
DATA
##        person sex adult                                 state code
## 1         sam   m     0         Computer is fun. Not too fun.   K1
## 2        greg   m     0               No it's not, it's dumb.   K2
## 3     teacher   m     1                    What should we do?   K3
## 4         sam   m     0                  You liar, it stinks!   K4
## 5        greg   m     0               I am telling the truth!   K5
## 6       sally   f     0                How can we be certain?   K6
## 7        greg   m     0                      There is no way.   K7
## 8         sam   m     0                       I distrust you.   K8
## 9       sally   f     0           What are you talking about?   K9
## 10 researcher   f     1         Shall we move on?  Good then.  K10
## 11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11
## left just to the rescue
left_just(DATA)
##    person     sex adult state                                 code
## 1  sam        m   0     Computer is fun. Not too fun.         K1  
## 2  greg       m   0     No it's not, it's dumb.               K2  
## 3  teacher    m   1     What should we do?                    K3  
## 4  sam        m   0     You liar, it stinks!                  K4  
## 5  greg       m   0     I am telling the truth!               K5  
## 6  sally      f   0     How can we be certain?                K6  
## 7  greg       m   0     There is no way.                      K7  
## 8  sam        m   0     I distrust you.                       K8  
## 9  sally      f   0     What are you talking about?           K9  
## 10 researcher f   1     Shall we move on?  Good then.         K10 
## 11 greg       m   0     I'm hungry.  Let's eat.  You already? K11
## Left just select column(s)
left_just(DATA, c("sex", "state"))
##        person sex adult state                                 code
## 1         sam m       0 Computer is fun. Not too fun.           K1
## 2        greg m       0 No it's not, it's dumb.                 K2
## 3     teacher m       1 What should we do?                      K3
## 4         sam m       0 You liar, it stinks!                    K4
## 5        greg m       0 I am telling the truth!                 K5
## 6       sally f       0 How can we be certain?                  K6
## 7        greg m       0 There is no way.                        K7
## 8         sam m       0 I distrust you.                         K8
## 9       sally f       0 What are you talking about?             K9
## 10 researcher f       1 Shall we move on?  Good then.          K10
## 11       greg m       0 I'm hungry.  Let's eat.  You already?  K11
left_just(CO2[1:15,])
##    Plant Type   Treatment  conc uptake
## 1  Qn1   Quebec nonchilled 95   16    
## 2  Qn1   Quebec nonchilled 175  30.4  
## 3  Qn1   Quebec nonchilled 250  34.8  
## 4  Qn1   Quebec nonchilled 350  37.2  
## 5  Qn1   Quebec nonchilled 500  35.3  
## 6  Qn1   Quebec nonchilled 675  39.2  
## 7  Qn1   Quebec nonchilled 1000 39.7  
## 8  Qn2   Quebec nonchilled 95   13.6  
## 9  Qn2   Quebec nonchilled 175  27.3  
## 10 Qn2   Quebec nonchilled 250  37.1  
## 11 Qn2   Quebec nonchilled 350  41.8  
## 12 Qn2   Quebec nonchilled 500  40.6  
## 13 Qn2   Quebec nonchilled 675  41.4  
## 14 Qn2   Quebec nonchilled 1000 44.3  
## 15 Qn3   Quebec nonchilled 95   16.2
right_just(left_just(CO2[1:15,]))
##    Plant   Type  Treatment conc uptake
## 1    Qn1 Quebec nonchilled   95     16
## 2    Qn1 Quebec nonchilled  175   30.4
## 3    Qn1 Quebec nonchilled  250   34.8
## 4    Qn1 Quebec nonchilled  350   37.2
## 5    Qn1 Quebec nonchilled  500   35.3
## 6    Qn1 Quebec nonchilled  675   39.2
## 7    Qn1 Quebec nonchilled 1000   39.7
## 8    Qn2 Quebec nonchilled   95   13.6
## 9    Qn2 Quebec nonchilled  175   27.3
## 10   Qn2 Quebec nonchilled  250   37.1
## 11   Qn2 Quebec nonchilled  350   41.8
## 12   Qn2 Quebec nonchilled  500   40.6
## 13   Qn2 Quebec nonchilled  675   41.4
## 14   Qn2 Quebec nonchilled 1000   44.3
## 15   Qn3 Quebec nonchilled   95   16.2

A task of many analyses is to search a dataframe for a particular phrase and return the rows/observations that contain it. The researcher may optionally choose to specify a particular column to search (column.name) or to search the entire dataframe.

Search Dataframes

(SampDF <- data.frame("islands"=names(islands)[1:32],mtcars, row.names=NULL))
##            islands  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1           Africa 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2       Antarctica 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3             Asia 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 4        Australia 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 5     Axel Heiberg 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 6           Baffin 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 7            Banks 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 8           Borneo 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 9          Britain 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 10         Celebes 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 11           Celon 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 12            Cuba 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 13           Devon 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 14       Ellesmere 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 15          Europe 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 16       Greenland 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 17          Hainan 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 18      Hispaniola 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 19        Hokkaido 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 20          Honshu 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 21         Iceland 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 22         Ireland 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## 23            Java 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 24          Kyushu 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 25           Luzon 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 26      Madagascar 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 27        Melville 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 28        Mindanao 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 29        Moluccas 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 30     New Britain 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 31      New Guinea 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## 32 New Zealand (N) 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
Search(SampDF, "Cuba", "islands")
##    islands  mpg cyl  disp  hp drat   wt qsec vs am gear carb
## 12    Cuba 16.4   8 275.8 180 3.07 4.07 17.4  0  0    3    3
Search(SampDF, "New", "islands")
##            islands  mpg cyl  disp  hp drat   wt qsec vs am gear carb
## 8           Borneo 24.4   4 146.7  62 3.69 3.19 20.0  1  0    4    2
## 30     New Britain 19.7   6 145.0 175 3.62 2.77 15.5  0  1    5    6
## 31      New Guinea 15.0   8 301.0 335 3.54 3.57 14.6  0  1    5    8
## 32 New Zealand (N) 21.4   4 121.0 109 4.11 2.78 18.6  1  1    4    2
Search(SampDF, "Ho")
##         islands  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 5  Axel Heiberg 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 8        Borneo 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 11        Celon 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 13        Devon 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 15       Europe 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 17       Hainan 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 18   Hispaniola 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 19     Hokkaido 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 20       Honshu 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 24       Kyushu 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 25        Luzon 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 28     Mindanao 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 29     Moluccas 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Search(SampDF, "Ho", max.distance = 0)
##     islands  mpg cyl disp hp drat    wt  qsec vs am gear carb
## 19 Hokkaido 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2
## 20   Honshu 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
Search(SampDF, "Axel Heiberg")
##        islands  mpg cyl disp  hp drat   wt  qsec vs am gear carb
## 5 Axel Heiberg 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2
Search(SampDF, 19) #too much tolerance in max.distance
##            islands  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1           Africa 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2       Antarctica 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3             Asia 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 4        Australia 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 5     Axel Heiberg 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 6           Baffin 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 7            Banks 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 8           Borneo 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 9          Britain 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 10         Celebes 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 11           Celon 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 12            Cuba 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 13           Devon 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 14       Ellesmere 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 15          Europe 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 16       Greenland 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 17          Hainan 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 18      Hispaniola 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 19        Hokkaido 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 20          Honshu 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 21         Iceland 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 22         Ireland 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## 23            Java 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 24          Kyushu 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 25           Luzon 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 26      Madagascar 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 27        Melville 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 28        Mindanao 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 29        Moluccas 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 30     New Britain 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 31      New Guinea 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## 32 New Zealand (N) 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
Search(SampDF, 19, max.distance = 0)
##        islands  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 4    Australia 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 8       Borneo 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 10     Celebes 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 18  Hispaniola 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 20      Honshu 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 25       Luzon 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 30 New Britain 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Search(SampDF, 19, "qsec", max.distance = 0)
##       islands  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 4   Australia 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 18 Hispaniola 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 20     Honshu 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

Generic qdap Tools

This manual arranges functions into categories in the order a researcher is likely to use them. The Generic qdap Tools section does not fit this convention; however, because these tools may be used throughout all stages of analysis, it is important that the reader is familiar with them. Note that after reading in transcript data the researcher will likely find that the next step is to parse the dataframe utilizing the techniques found in the Cleaning/Preparing the Data section.

The following functions will be utilized in this section:
- Time Conversion
- Hash Table/Dictionary Lookup
- Quick Character Vector
- Download Instructional Documents

Quick Character Vector

Often it can be tedious to supply quotes to character vectors when dealing with large vectors. qcv replaces the typical c("A", "B", "C", …) approach to creating character vectors. Instead the user supplies qcv(A, B, C, …). This format assumes single words separated by commas. If your data/string does not fit this approach, the terms and split arguments can be utilized in combination.

Quick Character Vector

qcv(I, like, dogs)
## [1] "I"    "like" "dogs"
qcv(terms = "I like, big dogs", split = ",")
## [1] "I like"   "big dogs"
qcv(I, like, dogs, space.wrap = TRUE)
## [1] " I "    " like " " dogs "
qcv(I, like, dogs, trailing = TRUE)
## [1] "I "    "like " "dogs "
qcv(I, like, dogs, leading = TRUE)
## [1] " I"    " like" " dogs"
qcv(terms = "mpg cyl  disp  hp drat    wt  qsec vs am gear carb")
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

Dictionary/Lookup

Often the researcher who deals with text data will need to look up values quickly and return an accompanying value. This is often called a dictionary, hash, or lookup. It can be used to find corresponding values, recode variables, etc. The lookup & %l% functions provide a fast environment lookup for single usage. The hash & hash_look/%hl% functions provide a fast environment lookup for multiple uses of the same hash table.

lookup- Dictionary/Look Up Examples

lookup(1:5, data.frame(1:4, 11:14))
## [1] 11 12 13 14 NA
lookup(LETTERS[1:5], data.frame(LETTERS[1:4], 11:14), missing = NULL)
## [1] "11" "12" "13" "14" "E"
lookup(LETTERS[1:5], data.frame(LETTERS[1:5], 100:104))
## [1] 100 101 102 103 104
## Fast with very large vectors
key <- data.frame(x=1:2, y=c("A", "B"))
set.seed(10)
big.vec <- sample(1:2, 3000000, T)
out <- lookup(big.vec, key)
out[1:20]
##  [1] "B" "A" "A" "B" "A" "A" "A" "A" "B" "A" "B" "B" "A"
## [14] "B" "A" "A" "A" "A" "A" "B"
## Supply a named list of vectors to key.match

codes <- list(A=c(1, 2, 4),
    B = c(3, 5),
    C = 7,
    D = c(6, 8:10)
)

lookup(1:10, codes) #or
##  [1] "A" "A" "B" "A" "B" "D" "C" "D" "D" "D"
1:10 %l% codes
##  [1] "A" "A" "B" "A" "B" "D" "C" "D" "D" "D"
## Supply a single vector to key.match and key.assign
lookup(mtcars$carb, sort(unique(mtcars$carb)),
    c('one', 'two', 'three', 'four', 'six', 'eight'))
##  [1] "four"  "four"  "one"   "one"   "two"   "one"   "four"  "two"  
##  [9] "two"   "four"  "four"  "three" "three" "three" "four"  "four" 
## [17] "four"  "one"   "two"   "one"   "one"   "two"   "two"   "four" 
## [25] "two"   "one"   "two"   "two"   "four"  "six"   "eight" "two"
lookup(mtcars$carb, sort(unique(mtcars$carb)),
    seq(10, 60, by=10))
##  [1] 40 40 10 10 20 10 40 20 20 40 40 30 30 30 40 40 40 10 20 10 10 20 20
## [24] 40 20 10 20 20 40 50 60 20

hash/hash_look- Dictionary/Look Up Examples

## Create a fake data set of hash values
(DF <- aggregate(mpg~as.character(carb), mtcars, mean))
##   as.character(carb)   mpg
## 1                  1 25.34
## 2                  2 22.40
## 3                  3 16.30
## 4                  4 15.79
## 5                  6 19.70
## 6                  8 15.00
## Use `hash` to create a lookup environment
hashTab <- hash(DF)  

## Create a vector to lookup
x <- sample(DF[, 1], 20, TRUE)

## Lookup x in the hash with `hash_look` or `%hl%`
hash_look(x, hashTab)
##  [1] 15.79 25.34 22.40 15.79 15.00 15.00 15.79 19.70 19.70 15.00 19.70
## [12] 15.00 15.79 19.70 16.30 25.34 19.70 25.34 16.30 22.40
x %hl% hashTab
##  [1] 15.79 25.34 22.40 15.79 15.00 15.00 15.79 19.70 19.70 15.00 19.70
## [12] 15.00 15.79 19.70 16.30 25.34 19.70 25.34 16.30 22.40

Time Conversion

Researchers dealing with transcripts may need to convert between the traditional Hours:Minutes:Seconds format and seconds. The hms2sec and sec2hms functions offer this type of time conversion.

Time Conversion Examples

hms2sec(c("02:00:03", "04:03:01"))
## [1]  7203 14581
hms2sec(sec2hms(c(222, 1234, 55)))
## [1]  222 1234   55
sec2hms(c(256, 3456, 56565))
## [1] 00:04:16 00:57:36 15:42:45

Download Documents

url_dl is a function used to provide qdap users with examples taken from the Internet. It is also useful for general document downloads.

url_dl Examples

## Example 1 (download from dropbox)
# download transcript of the debate to working directory
url_dl(pres.deb1.docx, pres.deb2.docx, pres.deb3.docx)

# load multiple files with read transcript and assign to working directory
dat1 <- read.transcript("pres.deb1.docx", c("person", "dialogue"))
dat2 <- read.transcript("pres.deb2.docx", c("person", "dialogue"))
dat3 <- read.transcript("pres.deb3.docx", c("person", "dialogue"))

docs <- qcv(pres.deb1.docx, pres.deb2.docx, pres.deb3.docx)
dir() %in% docs
delete(docs)    #remove the documents
dir() %in% docs

## Example 2 (quoted string urls)
url_dl("https://dl.dropboxusercontent.com/u/61803503/qdap.pdf",
   "http://www.cran.r-project.org/doc/manuals/R-intro.pdf")

## Clean up
delete(qcv(qdap.pdf, R-intro.pdf))

Cleaning/Preparing the Data

The following functions will be utilized in this section:
- Bracket/General Chunk Extraction
- Grab Begin/End of String to Character
- Capitalize Select Words
- Clean Imported Text: Remove Escaped Characters & Leading/Trailing White Space
- Denote Incomplete End Marks With “|”
- Multiple gsub
- Names to Gender Prediction
- Search for Potential Missing Values
- Quick Preparation of Text
- Replace Abbreviations
- Replace Contractions
- Replace Numbers With Text Representation
- Replace Symbols With Word Equivalents
- Remove Rows That Contain Markers
- Replace Spaces
- Stem Text

Bracket/General Chunk Extraction [YT]

After reading in the data the researcher may want to remove all non-dialogue text from the transcript dataframe such as transcriber annotations. This can be accomplished with the bracketX family of functions, which removes text found between two brackets (( ), { }, [ ], < >) or more generally using genX and genXtract to remove text between two character reference points.

If the bracketed text is useful to analysis it is recommended that the researcher assign the un-bracketed text to a new column.
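
A minimal sketch of this recommendation follows; the one-row dataframe and column names are illustrative.

## Keep the original text and add an un-bracketed copy for analysis
dat <- data.frame(text = "I love chicken [unintelligible]!")
dat$text.parsed <- bracketX(dat$text)
dat$text.parsed
## [1] "I love chicken !"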

Extracting Chunks 1- bracketX/bracketXtract

## A fake data set
examp <- structure(list(person = structure(c(1L, 2L, 1L, 3L),
    .Label = c("bob", "greg", "sue"), class = "factor"), text =
    c("I love chicken [unintelligible]!",
    "Me too! (laughter) It's so good.[interrupting]",
    "Yep it's awesome {reading}.", "Agreed. {is so much fun}")), .Names =
    c("person", "text"), row.names = c(NA, -4L), class = "data.frame")
examp
##   person                                           text
## 1    bob               I love chicken [unintelligible]!
## 2   greg Me too! (laughter) It's so good.[interrupting]
## 3    bob                    Yep it's awesome {reading}.
## 4    sue                       Agreed. {is so much fun}
bracketX(examp$text, "square")
## [1] "I love chicken !"                 "Me too! (laughter) It's so good."
## [3] "Yep it's awesome {reading} ."     "Agreed. {is so much fun}"
bracketX(examp$text, "curly")
## [1] "I love chicken [unintelligible] !"              
## [2] "Me too! (laughter) It's so good. [interrupting]"
## [3] "Yep it's awesome ."                             
## [4] "Agreed."
bracketX(examp$text, c("square", "round"))
## [1] "I love chicken !"             "Me too! It's so good."       
## [3] "Yep it's awesome {reading} ." "Agreed. {is so much fun}"
bracketX(examp$text)
## [1] "I love chicken !"      "Me too! It's so good." "Yep it's awesome ."   
## [4] "Agreed."
bracketXtract(examp$text, "square")
## $square1
## [1] "unintelligible"
## 
## $square2
## [1] "interrupting"
## 
## $square3
## character(0)
## 
## $square4
## character(0)
bracketXtract(examp$text, "curly")
## $curly1
## character(0)
## 
## $curly2
## character(0)
## 
## $curly3
## [1] "reading"
## 
## $curly4
## [1] "is so much fun"
bracketXtract(examp$text, c("square", "round"))
## [[1]]
## [1] "unintelligible"
## 
## [[2]]
## [1] "interrupting" "laughter"    
## 
## [[3]]
## character(0)
## 
## [[4]]
## character(0)
bracketXtract(examp$text, c("square", "round"), merge = FALSE)
## $square
## $square[[1]]
## [1] "unintelligible"
## 
## $square[[2]]
## [1] "interrupting"
## 
## $square[[3]]
## character(0)
## 
## $square[[4]]
## character(0)
## 
## 
## $round
## $round[[1]]
## character(0)
## 
## $round[[2]]
## [1] "laughter"
## 
## $round[[3]]
## character(0)
## 
## $round[[4]]
## character(0)
bracketXtract(examp$text)
## $all1
## [1] "unintelligible"
## 
## $all2
## [1] "laughter"     "interrupting"
## 
## $all3
## [1] "reading"
## 
## $all4
## [1] "is so much fun"
bracketXtract(examp$text, with = TRUE)
## $all1
## [1] "[unintelligible]"
## 
## $all2
## [1] "(laughter)"     "[interrupting]"
## 
## $all3
## [1] "{reading}"
## 
## $all4
## [1] "{is so much fun}"

Often a researcher will want to extract some text from the transcript and put it back together. One example is the reconstruction of material read from a book, poem, play, or other text. This information is generally dispersed throughout the dialogue (within classroom/teaching procedures). If this text is denoted with a particular identifying bracket, such as curly braces, it can be extracted and then pasted back together.

Extracting Chunks 2- Recombining Chunks

paste2(bracketXtract(examp$text, "curly"), " ")
## [1] "reading is so much fun"

The researcher may need a more general extraction method that allows for any left/right boundaries to be specified. This is useful in that many qualitative transcription/coding programs have specific syntax for various dialogue markup for events that must be parsed from the data set. The genX and genXtract functions have such capabilities.

Extracting Chunks 3- genX/genXtract

DATA$state  
##  [1] "Computer is fun. Not too fun."        
##  [2] "No it's not, it's dumb."              
##  [3] "What should we do?"                   
##  [4] "You liar, it stinks!"                 
##  [5] "I am telling the truth!"              
##  [6] "How can we be certain?"               
##  [7] "There is no way."                     
##  [8] "I distrust you."                      
##  [9] "What are you talking about?"          
## [10] "Shall we move on?  Good then."        
## [11] "I'm hungry.  Let's eat.  You already?"
## Look at the difference in number 1 and 10 from above
genX(DATA$state, c("is", "we"), c("too", "on"))
##  [1] "Computer fun."                      
##  [2] "No it's not, it's dumb."            
##  [3] "What should we do?"                 
##  [4] "You liar, it stinks!"               
##  [5] "I am telling the truth!"            
##  [6] "How can we be certain?"             
##  [7] "There is no way."                   
##  [8] "I distrust you."                    
##  [9] "What are you talking about?"        
## [10] "Shall ? Good then."                 
## [11] "I'm hungry. Let's eat. You already?"
## A fake data set
x <- c("Where is the /big dog#?",
    "I think he's @arunning@b with /little cat#.")
x
## [1] "Where is the /big dog#?"                    
## [2] "I think he's @arunning@b with /little cat#."
genXtract(x, c("/", "@a"), c("#", "@b"))
## [[1]]
## [1] "big dog"
## 
## [[2]]
## [1] "little cat" "running"
## A fake data set
x2 <- c("Where is the L1big dogL2?",
    "I think he's 98running99 with L1little catL2.")
x2
## [1] "Where is the L1big dogL2?"                    
## [2] "I think he's 98running99 with L1little catL2."
genXtract(x2, c("L1", 98), c("L2", 99))
## [[1]]
## [1] "big dog"
## 
## [[2]]
## [1] "little cat" "running"

Search for Potential Missing Values

After reading in data, removing non-dialogue (via bracketX), and viewing it, the researcher will want to find text rows that do not contain proper punctuation and/or that contain punctuation but no text. This is accomplished with the _truncdf family of functions and the potential_NA function as the researcher manually parses the original transcripts, makes alterations, and re-reads the data back into qdap. This important procedure is not an automatic process; it requires that the researcher give attention to detail in comparing the R dataframe with the original transcript.

Identifying and Coding Missing Values

## Create A Data Set With Punctuation and No Text
(DATA$state[c(3, 7, 10)] <- c(".", ".", NA))
## [1] "." "." NA
DATA
##        person sex adult                                 state code
## 1         sam   m     0         Computer is fun. Not too fun.   K1
## 2        greg   m     0               No it's not, it's dumb.   K2
## 3     teacher   m     1                                     .   K3
## 4         sam   m     0                  You liar, it stinks!   K4
## 5        greg   m     0               I am telling the truth!   K5
## 6       sally   f     0                How can we be certain?   K6
## 7        greg   m     0                                     .   K7
## 8         sam   m     0                       I distrust you.   K8
## 9       sally   f     0           What are you talking about?   K9
## 10 researcher   f     1                                  <NA>  K10
## 11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11
potential_NA(DATA$state, 20)
##   row            text
## 1   3               .
## 2   7               .
## 3   8 I distrust you.
potential_NA(DATA$state)
##   row text
## 1   3    .
## 2   7    .
## Use To Selectively Replace Cells With Missing Values
DATA$state[potential_NA(DATA$state, 20)$row[-c(3)]] <- NA
DATA
##        person sex adult                                 state code
## 1         sam   m     0         Computer is fun. Not too fun.   K1
## 2        greg   m     0               No it's not, it's dumb.   K2
## 3     teacher   m     1                                  <NA>   K3
## 4         sam   m     0                  You liar, it stinks!   K4
## 5        greg   m     0               I am telling the truth!   K5
## 6       sally   f     0                How can we be certain?   K6
## 7        greg   m     0                                  <NA>   K7
## 8         sam   m     0                       I distrust you.   K8
## 9       sally   f     0           What are you talking about?   K9
## 10 researcher   f     1                                  <NA>  K10
## 11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11
## Reset DATA
DATA <- qdap::DATA

Remove Rows That Contain Markers

The researcher may wish to remove empty rows (using rm_empty_row) and/or rows that contain certain markers (using rm_row). Sometimes empty rows are read into the dataframe from the transcript. These rows should be completely removed from the data set rather than denoted with NA. The rm_empty_row function removes completely empty rows (those rows containing only 1 or more blank spaces) from the dataframe.

Remove Empty Rows

(dat <- rbind.data.frame(DATA[, c(1, 4)], matrix(rep(" ", 4),
   ncol =2, dimnames=list(12:13, colnames(DATA)[c(1, 4)]))))
##        person                                 state
## 1         sam         Computer is fun. Not too fun.
## 2        greg               No it's not, it's dumb.
## 3     teacher                    What should we do?
## 4         sam                  You liar, it stinks!
## 5        greg               I am telling the truth!
## 6       sally                How can we be certain?
## 7        greg                      There is no way.
## 8         sam                       I distrust you.
## 9       sally           What are you talking about?
## 10 researcher         Shall we move on?  Good then.
## 11       greg I'm hungry.  Let's eat.  You already?
## 12                                                 
## 13
rm_empty_row(dat)
##        person                                 state
## 1         sam         Computer is fun. Not too fun.
## 2        greg               No it's not, it's dumb.
## 3     teacher                    What should we do?
## 4         sam                  You liar, it stinks!
## 5        greg               I am telling the truth!
## 6       sally                How can we be certain?
## 7        greg                      There is no way.
## 8         sam                       I distrust you.
## 9       sally           What are you talking about?
## 10 researcher         Shall we move on?  Good then.
## 11       greg I'm hungry.  Let's eat.  You already?

Other times the researcher may wish to use rm_row to remove rows from the dataframe/analysis based on transcription conventions or to remove demographic characteristics. For example, in the first example below the transcript is read in with [Cross Talk 3. This is a transcription convention, and we would want to parse these rows from the transcript. A second example shows the removal of people from the dataframe.

Remove Selected Rows

## Read in transcript
dat2 <- read.transcript(system.file("extdata/transcripts/trans1.docx", 
    package = "qdap"))
truncdf(dat2, 40)
##                  X1                                       X2
## 1      Researcher 2                         October 7, 1892.
## 2         Teacher 4 Students it's time to learn. [Student di
## 3 Multiple Students        Yes teacher we're ready to learn.
## 4     [Cross Talk 3                                      00]
## 5         Teacher 4 Let's read this terrific book together.
## Use column names to remove rows
truncdf(rm_row(dat2, "X1", "[C"), 40)
##                  X1                                       X2
## 1      Researcher 2                         October 7, 1892.
## 2         Teacher 4 Students it's time to learn. [Student di
## 3 Multiple Students        Yes teacher we're ready to learn.
## 4         Teacher 4 Let's read this terrific book together.
## Use column numbers to remove rows
truncdf(rm_row(dat2, 2, "[C"), 40)
##                  X1                                       X2
## 1      Researcher 2                         October 7, 1892.
## 2         Teacher 4 Students it's time to learn. [Student di
## 3 Multiple Students        Yes teacher we're ready to learn.
## 4     [Cross Talk 3                                      00]
## 5         Teacher 4 Let's read this terrific book together.
## Also remove people etc. from the analysis
rm_row(DATA, 1, c("sam", "greg"))
##       person sex adult                         state code
## 1    teacher   m     1            What should we do?   K3
## 2      sally   f     0        How can we be certain?   K6
## 3      sally   f     0   What are you talking about?   K9
## 4 researcher   f     1 Shall we move on?  Good then.  K10

Remove Extra Spaces and Escaped Characters

An important step in the cleaning process is the removal of extra white spaces (use Trim) and escaped characters (use clean). The scrubber function wraps both Trim and clean and adds in the functionality of some of the replace_ family of functions.

Remove Extra Spaces and Escaped Characters

x1 <- "I go \r
    to the \tnext line"
x1
## [1] "I go \r\n    to the \tnext line"
clean(x1)
## [1] "I go to the next line"
x2 <- c("  talkstats.com ", "   really? ", " yeah")
x2
## [1] "  talkstats.com " "   really? "      " yeah"
Trim(x2)
## [1] "talkstats.com" "really?"       "yeah"
x3 <- c("I like 456 dogs\t  , don't you?\"")
x3
## [1] "I like 456 dogs\t  , don't you?\""
scrubber(x3)
## [1] "I like 456 dogs, don't you?"
scrubber(x3, TRUE)
## [1] "I like 456 dogs, don't you?"

Replacement Functions

The replacement family of functions replaces various text elements within the transcripts with alphabetic versions that are more suited to analysis. These alterations may affect word counts and other alphabetic-dependent forms of analysis.

The replace_abbreviation function replaces standard abbreviations that utilize periods with forms that do not rely on periods. This is necessary because many sentence-specific functions (e.g., sentSplit and word_stats) rely on periods acting as sentence end marks, as the sketch after these examples shows. The researcher may augment the standard abbreviations dictionary from qdapDictionaries with field-specific abbreviations.

Replace Abbreviations

## Use the standard abbreviations dictionary
x <- c("Mr. Jones is here at 7:30 p.m.",
    "Check it out at www.github.com/trinker/qdap",
    "i.e. He's a sr. dr.; the best in 2012 A.D.",
    "the robot at t.s. is 10ft. 3in.")
x
## [1] "Mr. Jones is here at 7:30 p.m."             
## [2] "Check it out at www.github.com/trinker/qdap"
## [3] "i.e. He's a sr. dr.; the best in 2012 A.D." 
## [4] "the robot at t.s. is 10ft. 3in."
replace_abbreviation(x)
## [1] "Mister Jones is here at 7:30 PM."                    
## [2] "Check it out at www dot github dot com /trinker/qdap"
## [3] "ie He's a Senior Doctor ; the best in 2012 AD."      
## [4] "the robot at t.s. is 10ft. 3in."
## Augment the standard dictionary with replacement vectors
abv <- c("in.", "ft.", "t.s.")
repl <- c("inch", "feet", "talkstats")
replace_abbreviation(x, abv, repl)
## [1] "Mr. Jones is here at 7:30 p.m."             
## [2] "Check it out at www.github.com/trinker/qdap"
## [3] "i.e. He's a sr. dr.; the best in 2012 A.D." 
## [4] "the robot at talkstats is 10 feet 3 inch."
## Augment the standard dictionary with a replacement dataframe
(KEY <- rbind(abbreviations, data.frame(abv = abv, rep = repl)))
##       abv       rep
## 1     Mr.    Mister
## 2    Mrs.    Misses
## 3     Ms.      Miss
## 4    .com   dot com
## 5    www.   www dot
## 6    i.e.        ie
## 7    A.D.        AD
## 8    B.C.        BC
## 9    A.M.        AM
## 10   P.M.        PM
## 11 et al.     et al
## 12    Jr.    Junior
## 13    Dr.    Doctor
## 14    Sr.    Senior
## 15    in.      inch
## 16    ft.      feet
## 17   t.s. talkstats
replace_abbreviation(x, KEY)
## [1] "Mister Jones is here at 7:30 PM."                    
## [2] "Check it out at www dot github dot com /trinker/qdap"
## [3] "ie He's a Senior Doctor ; the best in 2012 AD."      
## [4] "the robot at talkstats is 10 feet 3 inch."
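
A minimal sketch (toy dataframe assumed) of why the replacement matters for sentence-specific functions: without it, abbreviation periods can be taken as sentence end marks.

## Hypothetical one-row dataframe
toy <- data.frame(person = "sam", state = "Mr. Jones is here at 7:30 p.m.",
    stringsAsFactors = FALSE)
## The periods in "Mr." and "p.m." can be taken as end marks,
## fragmenting the turn of talk
sentSplit(toy, "state")
## Replacing abbreviations first keeps the turn of talk as one sentence
toy$state <- replace_abbreviation(toy$state)
sentSplit(toy, "state")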

The replace_contraction function replaces contractions with equivalent multi-word forms. This is useful for some word/sentence statistics. The researcher may augment the contractions dictionary supplied by qdapDictionaries, though the supplied word list is fairly exhaustive.

Replace Contractions

x <- c("Mr. Jones isn't going.",
    "Check it out what's going on.",
    "He's here but didn't go.",
    "the robot at t.s. wasn't nice",
    "he'd like it if i'd go away")
x
## [1] "Mr. Jones isn't going."        "Check it out what's going on."
## [3] "He's here but didn't go."      "the robot at t.s. wasn't nice"
## [5] "he'd like it if i'd go away"
replace_contraction(x)
## [1] "Mr. Jones is not going."            
## [2] "Check it out what is going on."     
## [3] "He is here but did not go."         
## [4] "The robot at t.s. was not nice"     
## [5] "He would like it if I would go away"

The replace_number function utilizes the work of John Fox (2005) to turn numeric representations of numbers into their textual equivalents. This is useful for word statistics that require the text version of dialogue.

Replace Numbers-Numeral Representation

x <- c("I like 346457 ice cream cones.", "They are 99 percent good")
replace_number(x)
## [1] "I like three hundred forty six thousand four hundred fifty seven ice cream cones."
## [2] "They are ninety nine percent good"
## Replace numbers that contain commas as well
y <- c("I like 346,457 ice cream cones.", "They are 99 percent good")
replace_number(y)
## [1] "I like three hundred forty six thousand four hundred fifty seven ice cream cones."
## [2] "They are ninety nine percent good"
## Combine numbers as one word/string
replace_number(x, FALSE)
## [1] "I like threehundredfortysixthousandfourhundredfiftyseven ice cream cones."
## [2] "They are ninetynine percent good"

The replace_symbol function converts ($) to “dollar”, (%) to “percent”, (#) to “number”, (@) to “at”, (&) to “and”, and (w/) to “with”. Additional substitutions can be undertaken with the multigsub function, as sketched after the examples below.

Replace Symbols

x <- c("I am @ Jon's & Jim's w/ Marry",
    "I owe $41 for food",
    "two is 10% of a #")
x
## [1] "I am @ Jon's & Jim's w/ Marry" "I owe $41 for food"           
## [3] "two is 10% of a #"
replace_symbol(x)
## [1] "I am at Jon's and Jim's with Marry"
## [2] "I owe dollar 41 for food"          
## [3] "two is 10 percent of a number"
replace_number(replace_symbol(x))
## [1] "I am at Jon's and Jim's with Marry"
## [2] "I owe dollar forty one for food"   
## [3] "two is ten percent of a number"
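
Because replace_symbol covers only the fixed set above, other symbols can be handled with the multigsub function (covered fully in the Multiple gsub subsection). A minimal sketch; the "+" to "plus" mapping is our own illustrative choice, not a qdap default:

## Substitute a symbol that replace_symbol does not cover
mgsub("+", "plus", "2 + 2 is 4")
## [1] "2 plus 2 is 4"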

The qprep function is a wrapper for several other replacement family functions that allows for speedier cleaning of the text. This approach, while fast, reduces the flexibility and care afforded by applying the individual replacement functions one at a time. The function is intended for analyses that require less care; a rough sketch of the equivalent steps follows the example below.

General Replacement (Quick Preparation)

x <- "I like 60 (laughter) #d-bot and $6 @ the store w/o 8p.m."
x
## [1] "I like 60 (laughter) #d-bot and $6 @ the store w/o 8p.m."
qprep(x)
## [1] "I like sixty number d bot and dollar six at the store without eight PM."
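
A rough sketch of the kind of pipeline qprep automates. The composition below is assumed and does not reproduce every qprep substitution (e.g., "w/o"); bracketX strips parenthetical annotations such as "(laughter)" before the replacement functions run in turn.

## Approximate qprep by hand with the individual functions
replace_symbol(replace_number(replace_abbreviation(bracketX(x))))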

Replace Spaces

Many qdap functions break sentences up into words based on the spaces between words. Often the researcher will want to keep a group of words as a single unit. The space_fill function allows the researcher to replace spaces between selected phrases with ~~. By default ~~ is recognized by many qdap functions as a space separator. Swapping the separator back out is sketched at the end of this subsection.

Space Fill Examples

## Fake Data
x <- c("I want to hear the Dr. Martin Luther King Jr. speech.",
    "I also want to go to the white House to see President Obama speak.")
x
## [1] "I want to hear the Dr. Martin Luther King Jr. speech."             
## [2] "I also want to go to the white House to see President Obama speak."
## Words to keep as a single unit
keeps <- c("Dr. Martin Luther King Jr.", "The White House", "President Obama")
text <- space_fill(x, keeps)
text
## [1] "I want to hear the Dr.~~Martin~~Luther~~King~~Jr. speech."            
## [2] "I also want to go to The~~White~~House to see President~~Obama speak."
## strip Example
strip(text, lower=FALSE)
## [1] "I want to hear the Dr~~Martin~~Luther~~King~~Jr speech"              
## [2] "I also want to go to The~~White~~House to see President~~Obama speak"
## bag_o_words Example
bag_o_words(text, lower=FALSE)
##  [1] "i"                            "want"                        
##  [3] "to"                           "hear"                        
##  [5] "the"                          "dr~~martin~~luther~~king~~jr"
##  [7] "speech"                       "i"                           
##  [9] "also"                         "want"                        
## [11] "to"                           "go"                          
## [13] "to"                           "the~~white~~house"           
## [15] "to"                           "see"                         
## [17] "president~~obama"             "speak"
## wfm Example
wfm(text, c("greg", "bob"))
##                          bob greg
## also                       1    0
## dr martin luther king jr   0    1
## go                         1    0
## hear                       0    1
## i                          1    1
## president obama            1    0
## see                        1    0
## speak                      1    0
## speech                     0    1
## the                        0    1
## the white house            1    0
## to                         3    1
## want                       1    1
## trans_cloud Example
obs <- strip(space_fill(keeps, keeps), lower=FALSE)
trans_cloud(text, c("greg", "bob"), target.words=list(obs), caps.list=obs, 
    cloud.colors=qcv(red, gray65), expand.target = FALSE, title.padj = .7,
    legend = c("space_filled", "other"), title.cex = 2, title.color = "blue", 
    max.word.size = 3)

plot of chunk unnamed-chunk-26
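
When ordinary spacing is needed again after analysis, the ~~ separator can simply be swapped back with base R's gsub:

## Restore the original spacing in the filled text
gsub("~~", " ", text)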

Multiple gsub

The researcher may need to make multiple substitutions in a text. An example of when this is needed is when a transcript is marked up with transcription codes specific to a particular transcription method. These codes, while useful in some contexts, may lead to inaccurate word statistics. The base R function gsub makes a single replacement of these types of coding conventions; the multigsub function (alias mgsub) takes a vector of patterns to search for as well as a vector of replacements. Note that the replacements occur sequentially rather than all at once, meaning a previous (first in pattern string) substitution could alter or be altered by a later substitution. mgsub is useful throughout multiple stages of the research process.

Multiple Substitutions

left_just(DATA[, c(1, 4)])
##    person     state                                
## 1  sam        Computer is fun. Not too fun.        
## 2  greg       No it's not, it's dumb.              
## 3  teacher    What should we do?                   
## 4  sam        You liar, it stinks!                 
## 5  greg       I am telling the truth!              
## 6  sally      How can we be certain?               
## 7  greg       There is no way.                     
## 8  sam        I distrust you.                      
## 9  sally      What are you talking about?          
## 10 researcher Shall we move on?  Good then.        
## 11 greg       I'm hungry.  Let's eat.  You already?
multigsub(c("it's", "I'm"), c("it is", "I am"), DATA$state)
##  [1] "Computer is fun. Not too fun."       
##  [2] "No it is not, it is dumb."           
##  [3] "What should we do?"                  
##  [4] "You liar, it stinks!"                
##  [5] "I am telling the truth!"             
##  [6] "How can we be certain?"              
##  [7] "There is no way."                    
##  [8] "I distrust you."                     
##  [9] "What are you talking about?"         
## [10] "Shall we move on? Good then."        
## [11] "I am hungry. Let's eat. You already?"
mgsub(c("it's", "I'm"), c("it is", "I am"), DATA$state)
##  [1] "Computer is fun. Not too fun."       
##  [2] "No it is not, it is dumb."           
##  [3] "What should we do?"                  
##  [4] "You liar, it stinks!"                
##  [5] "I am telling the truth!"             
##  [6] "How can we be certain?"              
##  [7] "There is no way."                    
##  [8] "I distrust you."                     
##  [9] "What are you talking about?"         
## [10] "Shall we move on? Good then."        
## [11] "I am hungry. Let's eat. You already?"
mgsub(c("it's", "I'm"), "SINGLE REPLACEMENT", DATA$state)
##  [1] "Computer is fun. Not too fun."                      
##  [2] "No SINGLE REPLACEMENT not, SINGLE REPLACEMENT dumb."
##  [3] "What should we do?"                                 
##  [4] "You liar, it stinks!"                               
##  [5] "I am telling the truth!"                            
##  [6] "How can we be certain?"                             
##  [7] "There is no way."                                   
##  [8] "I distrust you."                                    
##  [9] "What are you talking about?"                        
## [10] "Shall we move on? Good then."                       
## [11] "SINGLE REPLACEMENT hungry. Let's eat. You already?"
mgsub("[[:punct:]]", "PUNC", DATA$state, fixed = FALSE)
##  [1] "Computer is funPUNC Not too funPUNC"               
##  [2] "No itPUNCs notPUNC itPUNCs dumbPUNC"               
##  [3] "What should we doPUNC"                             
##  [4] "You liarPUNC it stinksPUNC"                        
##  [5] "I am telling the truthPUNC"                        
##  [6] "How can we be certainPUNC"                         
##  [7] "There is no wayPUNC"                               
##  [8] "I distrust youPUNC"                                
##  [9] "What are you talking aboutPUNC"                    
## [10] "Shall we move onPUNC Good thenPUNC"                
## [11] "IPUNCm hungryPUNC LetPUNCs eatPUNC You alreadyPUNC"
## Sequential replacement: "I'm" converts to "I am", which then converts to "ITERATIVE"
mgsub(c("it's", "I'm", "I am"), c("it is", "I am", "ITERATIVE"), DATA$state)
##  [1] "Computer is fun. Not too fun."            
##  [2] "No it is not, it is dumb."                
##  [3] "What should we do?"                       
##  [4] "You liar, it stinks!"                     
##  [5] "ITERATIVE telling the truth!"             
##  [6] "How can we be certain?"                   
##  [7] "There is no way."                         
##  [8] "I distrust you."                          
##  [9] "What are you talking about?"              
## [10] "Shall we move on? Good then."             
## [11] "ITERATIVE hungry. Let's eat. You already?"

Names to Gender Prediction

A researcher may face a list of names and be uncertain about the gender of the participants. The name2sex function utilizes the gender package to predict gender from first names, based on Social Security Administration data and defaulting to the period 1932-2012. The predictions can then be bound back to the names, as sketched below.

Name to Gender Prediction

name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA, tyler, jamie, JAMES, 
    tyrone, cheryl, drew))
## [1] F F F M M F M F M M F M
## Levels: F M
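
The predictions can be bound to the names (a hypothetical nms vector is assumed) to build a demographics dataframe for later merging:

## Pair names with predicted sex for use as demographic data
nms <- qcv(mary, tyler, drew)
data.frame(person = nms, sex = name2sex(nms))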

Stem Text

During the initial cleaning stage of analysis the researcher may choose to create a stemmed version of the dialogue, that is, one in which words are reduced to their root form. The stemmer family of functions allows the researcher to create stemmed text. The stem2df function wraps stemmer to quickly create a dataframe with the stemmed column added.

Stemming

## stem2df EXAMPLE:
(stemdat <- stem2df(DATA, "state", "new"))
##        person sex adult                                 state code
## 1         sam   m     0         Computer is fun. Not too fun.   K1
## 2        greg   m     0               No it's not, it's dumb.   K2
## 3     teacher   m     1                    What should we do?   K3
## 4         sam   m     0                  You liar, it stinks!   K4
## 5        greg   m     0               I am telling the truth!   K5
## 6       sally   f     0                How can we be certain?   K6
## 7        greg   m     0                      There is no way.   K7
## 8         sam   m     0                       I distrust you.   K8
## 9       sally   f     0           What are you talking about?   K9
## 10 researcher   f     1         Shall we move on?  Good then.  K10
## 11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11
##                                new
## 1       Comput is fun not too fun.
## 2               No it not it dumb.
## 3               What should we do?
## 4               You liar it stink!
## 5             I am tell the truth!
## 6           How can we be certain?
## 7                 There is no way.
## 8                  I distrust you.
## 9         What are you talk about?
## 10     Shall we move on good then.
## 11 I'm hungri let eat you alreadi?
with(stemdat, trans_cloud(new, sex, title.cex = 2.5, 
    title.color = "blue", max.word.size = 5, title.padj = .7))

plot of chunk unnamed-chunk-28

## stemmer EXAMPLE:
stemmer(DATA$state)
##  [1] "Comput is fun not too fun."      "No it not it dumb."             
##  [3] "What should we do?"              "You liar it stink!"             
##  [5] "I am tell the truth!"            "How can we be certain?"         
##  [7] "There is no way."                "I distrust you."                
##  [9] "What are you talk about?"        "Shall we move on good then."    
## [11] "I'm hungri let eat you alreadi?"
## stem_words EXAMPLE:
stem_words(doggies, jumping, swims)
## [1] "doggi" "jump"  "swim"

Grab Begin/End of String to Character

At times it is handy to be able to grab from the beginning or end of a string to a specific character. The beg2char function allows you to grab from the beginning of a string to the nth occurrence of a character. The counterpart function, char2end, grabs from the nth occurrence of a character to the end of a string. This behavior is useful if the transcript contains annotations at the beginning or end of a line that should be eliminated, as sketched after the examples below.

Grab From Character to Beginning/End of String

x <- c("a_b_c_d", "1_2_3_4", "<_?_._:")
beg2char(x, "_")
## [1] "a" "1" "<"
beg2char(x, "_", 4)
## [1] "a_b_c_d" "1_2_3_4" "<_?_._:"
char2end(x, "_")
## [1] "b_c_d" "2_3_4" "?_._:"
char2end(x, "_", 2)
## [1] "c_d" "3_4" "._:"
char2end(x, "_", 3, include=TRUE)
## [1] "_d" "_4" "_:"
(x2 <- gsub("_", " ", x))
## [1] "a b c d" "1 2 3 4" "< ? . :"
beg2char(x2, " ", 2)
## [1] "a b" "1 2" "< ?"
(x3 <- gsub("_", "\\^", x))
## [1] "a^b^c^d" "1^2^3^4" "<^?^.^:"
char2end(x3, "^", 2)
## [1] "c^d" "3^4" ".^:"
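
As an applied sketch of the annotation use case noted above (bracketed timestamp prefixes are assumed), char2end paired with Trim strips a leading time code:

## Remove assumed leading timestamp annotations, then tidy spacing
x4 <- c("[00:01:32] Hello there.", "[00:01:45] Hi, how are you?")
Trim(char2end(x4, "]"))
## [1] "Hello there."      "Hi, how are you?"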

Denote Incomplete End Marks With “|”

Often incomplete sentences have a different function than complete sentences. The researcher may want to denote incomplete sentences for consideration in later analysis. Traditionally, incomplete sentences are denoted with the following end marks (.., …, .?, ..?, en & em). The incomplete_replace function can identify and replace these traditional end marks with the standard form “|”.

Incomplete Sentence Identification

x <- c("the...",  "I.?", "you.", "threw..", "we?")
incomplete_replace(x)
## [1] "the|"   "I|"     "you."   "threw|" "we?"
incomp(x)
## [1] "the|"   "I|"     "you."   "threw|" "we?"
incomp(x, scan.mode = TRUE)
##   row.num text   
## 1       1 the... 
## 2       2 I.?    
## 3       4 threw..

Capitalize Select Words

The capitalizer function allows the researcher to specify words within a vector to be capitalized. By default I, and contractions containing I, are capitalized. Additional words can be specified through the caps.list argument. To capitalize words within strings the mgsub function can be used.

Word Capitalization

capitalizer(bag_o_words("i like it but i'm not certain"), "like")
## [1] "I"       "Like"    "it"      "but"     "I'm"     "not"     "certain"
capitalizer(bag_o_words("i like it but i'm not certain"), "like", FALSE)
## [1] "i"       "Like"    "it"      "but"     "i'm"     "not"     "certain"

Reshaping the Data

The following functions will be utilized in this section (click to view more):
- Create Adjacency Matrix
- Generate Unit Spans
- Merge Demographic Information with Person/Text Transcript
- Paste and Separate Columns
- Sentence Splitting/Combining

Sentence Splitting/Combining

Many functions in the qdap package require that the dialogue is broken apart into individual sentences; failure to do so may invalidate many of the outputs from the analysis and will lead to warnings. After reading in and cleaning the data the next step should be to split the text variable into individual sentences. The sentSplit function outputs a dataframe with the text variable split into individual sentences and repeats the demographic variables as necessary. Additionally, a turn of talk (tot column) variable is added that keeps track of the original turn of talk (row number) and the sentence number per turn of talk. The researcher may also want to create a second text column that has been stemmed for future analysis by setting stem.col = TRUE, though this is more time intensive.

sentSplit Example

sentSplit(DATA, "state")
##        person  tot sex adult code                       state
## 1         sam  1.1   m     0   K1            Computer is fun.
## 2         sam  1.2   m     0   K1                Not too fun.
## 3        greg  2.1   m     0   K2     No it's not, it's dumb.
## 4     teacher  3.1   m     1   K3          What should we do?
## 5         sam  4.1   m     0   K4        You liar, it stinks!
## 6        greg  5.1   m     0   K5     I am telling the truth!
## 7       sally  6.1   f     0   K6      How can we be certain?
## 8        greg  7.1   m     0   K7            There is no way.
## 9         sam  8.1   m     0   K8             I distrust you.
## 10      sally  9.1   f     0   K9 What are you talking about?
## 11 researcher 10.1   f     1  K10           Shall we move on?
## 12 researcher 10.2   f     1  K10                  Good then.
## 13       greg 11.1   m     0  K11                 I'm hungry.
## 14       greg 11.2   m     0  K11                  Let's eat.
## 15       greg 11.3   m     0  K11                You already?
sentSplit(DATA, "state", stem.col = TRUE)
##        person  tot sex adult code                       state                stem.text
## 1         sam  1.1   m     0   K1            Computer is fun.           Comput is fun.
## 2         sam  1.2   m     0   K1                Not too fun.             Not too fun.
## 3        greg  2.1   m     0   K2     No it's not, it's dumb.       No it not it dumb.
## 4     teacher  3.1   m     1   K3          What should we do?       What should we do?
## 5         sam  4.1   m     0   K4        You liar, it stinks!       You liar it stink!
## 6        greg  5.1   m     0   K5     I am telling the truth!     I am tell the truth!
## 7       sally  6.1   f     0   K6      How can we be certain?   How can we be certain?
## 8        greg  7.1   m     0   K7            There is no way.         There is no way.
## 9         sam  8.1   m     0   K8             I distrust you.          I distrust you.
## 10      sally  9.1   f     0   K9 What are you talking about? What are you talk about?
## 11 researcher 10.1   f     1  K10           Shall we move on?        Shall we move on?
## 12 researcher 10.2   f     1  K10                  Good then.               Good then.
## 13       greg 11.1   m     0  K11                 I'm hungry.              I'm hungri.
## 14       greg 11.2   m     0  K11                  Let's eat.                 Let eat.
## 15       greg 11.3   m     0  K11                You already?             You alreadi?
sentSplit(raj, "dialogue")[1:11, ]
##     person tot act                                               dialogue
## 1  Sampson 1.1   1             Gregory, o my word, we'll not carry coals.
## 2  Gregory 2.1   1                    No, for then we should be colliers.
## 3  Sampson 3.1   1                I mean, an we be in choler, we'll draw.
## 4  Gregory 4.1   1   Ay, while you live, draw your neck out o the collar.
## 5  Sampson 5.1   1                         I strike quickly, being moved.
## 6  Gregory 6.1   1              But thou art not quickly moved to strike.
## 7  Sampson 7.1   1               A dog of the house of Montague moves me.
## 8  Gregory 8.1   1     To move is to stir; and to be valiant is to stand.
## 9  Gregory 8.2   1       therefore, if thou art moved, thou runn'st away.
## 10 Sampson 9.1   1            A dog of that house shall move me to stand.
## 11 Sampson 9.2   1 I will take the wall of any man or maid of Montague's.

sentSplit - plot Method

plot(sentSplit(DATA, "state"), grouping.var = "person")

plot of chunk unnamed-chunk-33

plot(sentSplit(DATA, "state"), grouping.var = "sex")

plot of chunk unnamed-chunk-33

TOT Example

## Convert tot column with sub sentences to turns of talk
dat <- sentSplit(DATA, "state")
TOT(dat$tot)
##  1.1  1.2  2.1  3.1  4.1  5.1  6.1  7.1  8.1  9.1 10.1 10.2 11.1 11.2 11.3 
##    1    1    2    3    4    5    6    7    8    9   10   10   11   11   11

Within dialogue (particularly classroom dialogue) several speakers may say the same speech at the same time. The transcripts may lump this speech together in the form of:

Person Dialogue
John, Josh & Imani          Yes Mrs. Smith.         

The speakerSplit function attributes this text to each of the people as separate entries. The default behavior is to search for the person separators sep = c(“and”, “&”, “,”), though other separators may be specified.

Break and Stretch if Multiple Persons per Cell

## Create data set with multiple speakers per turn of talk
DATA$person <- as.character(DATA$person)
DATA$person[c(1, 4, 6)] <- c("greg, sally, & sam",
    "greg, sally", "sam and sally")
speakerSplit(DATA)
##        person sex adult                                 state code
## 1        greg   m     0         Computer is fun. Not too fun.   K1
## 2       sally   m     0         Computer is fun. Not too fun.   K1
## 3         sam   m     0         Computer is fun. Not too fun.   K1
## 4        greg   m     0               No it's not, it's dumb.   K2
## 5     teacher   m     1                    What should we do?   K3
## 6        greg   m     0                  You liar, it stinks!   K4
## 7       sally   m     0                  You liar, it stinks!   K4
## 8        greg   m     0               I am telling the truth!   K5
## 9         sam   f     0                How can we be certain?   K6
## 10      sally   f     0                How can we be certain?   K6
## 11       greg   m     0                      There is no way.   K7
## 12        sam   m     0                       I distrust you.   K8
## 13      sally   f     0           What are you talking about?   K9
## 14 researcher   f     1         Shall we move on?  Good then.  K10
## 15       greg   m     0 I'm hungry.  Let's eat.  You already?  K11
## Change the separator
DATA$person[c(1, 4, 6)] <- c("greg_sally_sam",
    "greg.sally", "sam; sally")
speakerSplit(DATA, sep = c(".", "_", ";"))
##        person sex adult                                 state code
## 1        greg   m     0         Computer is fun. Not too fun.   K1
## 2       sally   m     0         Computer is fun. Not too fun.   K1
## 3         sam   m     0         Computer is fun. Not too fun.   K1
## 4        greg   m     0               No it's not, it's dumb.   K2
## 5     teacher   m     1                    What should we do?   K3
## 6        greg   m     0                  You liar, it stinks!   K4
## 7       sally   m     0                  You liar, it stinks!   K4
## 8        greg   m     0               I am telling the truth!   K5
## 9         sam   f     0                How can we be certain?   K6
## 10      sally   f     0                How can we be certain?   K6
## 11       greg   m     0                      There is no way.   K7
## 12        sam   m     0                       I distrust you.   K8
## 13      sally   f     0           What are you talking about?   K9
## 14 researcher   f     1         Shall we move on?  Good then.  K10
## 15       greg   m     0 I'm hungry.  Let's eat.  You already?  K11
## Reset DATA
DATA <- qdap::DATA  

The sentCombine function is the opposite of sentSplit, combining sentences into a single turn of talk per grouping variable.

Sentence Combining

dat <- sentSplit(DATA, "state")
## Combine by person
sentCombine(dat$state, dat$person)
##        person                            text.var
## 1         sam       Computer is fun. Not too fun.
## 2        greg             No it's not, it's dumb.
## 3     teacher                  What should we do?
## 4         sam                You liar, it stinks!
## 5        greg             I am telling the truth!
## 6       sally              How can we be certain?
## 7        greg                    There is no way.
## 8         sam                     I distrust you.
## 9       sally         What are you talking about?
## 10 researcher        Shall we move on? Good then.
## 11       greg I'm hungry. Let's eat. You already?
## Combine by sex
truncdf(sentCombine(dat$state, dat$sex), 65)
##   sex                                                          text.var
## 1   m Computer is fun. Not too fun. No it's not, it's dumb. What should
## 2   f                                            How can we be certain?
## 3   m                                  There is no way. I distrust you.
## 4   f          What are you talking about? Shall we move on? Good then.
## 5   m                               I'm hungry. Let's eat. You already?

Merge Demographic Information with Person/Text Transcript

It is more efficient to maintain a dialogue dataframe (consisting of a column for people and a column for dialogue) and a separate demographics dataframe (a person column and demographic column(s)), merging the two during analysis. The key_merge function is a wrapper for the merge function from R's base install that merges the dialogue and demographics dataframes. key_merge attempts to guess the person column and outputs a qdap friendly dataframe; a roughly equivalent base R call is sketched after the example below.

Merging Demographic Information

## A dialogue dataframe and a demographics dataframe
ltruncdf(list(dialogue=raj, demographics=raj.demographics), 10, 50)
## $dialogue
##     person                                           dialogue act
## 1  Sampson         Gregory, o my word, we'll not carry coals.   1
## 2  Gregory                No, for then we should be colliers.   1
## 3  Sampson            I mean, an we be in choler, we'll draw.   1
## 4  Gregory Ay, while you live, draw your neck out o the colla   1
## 5  Sampson                     I strike quickly, being moved.   1
## 6  Gregory          But thou art not quickly moved to strike.   1
## 7  Sampson           A dog of the house of Montague moves me.   1
## 8  Gregory To move is to stir; and to be valiant is to stand.   1
## 9  Sampson A dog of that house shall move me to stand. I will   1
## 10 Gregory That shows thee a weak slave; for the weakest goes   1
## 
## $demographics
##            person  sex fam.aff  died
## 1         Abraham    m    mont FALSE
## 2      Apothecary    m    none FALSE
## 3       Balthasar    m    mont FALSE
## 4        Benvolio    m    mont FALSE
## 5         Capulet    f     cap FALSE
## 6          Chorus none    none FALSE
## 7   First Citizen none    none FALSE
## 8  First Musician    m    none FALSE
## 9   First Servant    m    none FALSE
## 10 First Watchman    m    none FALSE
## Merge the two
merged.raj <- key_merge(raj, raj.demographics)
htruncdf(merged.raj, 10, 40)
##     person act sex fam.aff  died                                 dialogue
## 1  Sampson   1   m     cap FALSE Gregory, o my word, we'll not carry coal
## 2  Gregory   1   m     cap FALSE      No, for then we should be colliers.
## 3  Sampson   1   m     cap FALSE  I mean, an we be in choler, we'll draw.
## 4  Gregory   1   m     cap FALSE Ay, while you live, draw your neck out o
## 5  Sampson   1   m     cap FALSE           I strike quickly, being moved.
## 6  Gregory   1   m     cap FALSE But thou art not quickly moved to strike
## 7  Sampson   1   m     cap FALSE A dog of the house of Montague moves me.
## 8  Gregory   1   m     cap FALSE To move is to stir; and to be valiant is
## 9  Sampson   1   m     cap FALSE A dog of that house shall move me to sta
## 10 Gregory   1   m     cap FALSE That shows thee a weak slave; for the we
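
Because key_merge wraps base R's merge, a roughly equivalent call can be made directly, though the person-column guessing and qdap friendly ordering are lost:

## Roughly equivalent base R merge (column order may differ)
merged.base <- merge(raj, raj.demographics, by = "person", sort = FALSE)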

Paste and Split Columns

Many functions in qdap utilize the paste2 function, which pastes together multiple columns/lists of vectors. paste2 differs from base R's paste function in that paste2 can paste unspecified columns or a list of vectors together. The colpaste2df function, a wrapper for paste2, pastes multiple columns together and outputs an appropriately named dataframe. The colsplit2df and lcolsplit2df functions are useful because they can split the output from qdap functions that contain dataframes with pasted columns.

Using paste2 and colSplit: Pasting & Splitting Vectors and Dataframes

## Pasting a list of vectors
paste2(rep(list(state.abb[1:8],  month.abb[1:8]) , 2), sep = "|_|")
## [1] "AL|_|Jan|_|AL|_|Jan" "AK|_|Feb|_|AK|_|Feb" "AZ|_|Mar|_|AZ|_|Mar"
## [4] "AR|_|Apr|_|AR|_|Apr" "CA|_|May|_|CA|_|May" "CO|_|Jun|_|CO|_|Jun"
## [7] "CT|_|Jul|_|CT|_|Jul" "DE|_|Aug|_|DE|_|Aug"
## Pasting a dataframe
foo1 <- paste2(CO2[, 1:3])
head(foo1, 12)
##  [1] "Qn1.Quebec.nonchilled" "Qn1.Quebec.nonchilled"
##  [3] "Qn1.Quebec.nonchilled" "Qn1.Quebec.nonchilled"
##  [5] "Qn1.Quebec.nonchilled" "Qn1.Quebec.nonchilled"
##  [7] "Qn1.Quebec.nonchilled" "Qn2.Quebec.nonchilled"
##  [9] "Qn2.Quebec.nonchilled" "Qn2.Quebec.nonchilled"
## [11] "Qn2.Quebec.nonchilled" "Qn2.Quebec.nonchilled"
## Splitting a pasted column
bar1 <- colSplit(foo1)
head(bar1, 10)
##     X1     X2         X3
## 1  Qn1 Quebec nonchilled
## 2  Qn1 Quebec nonchilled
## 3  Qn1 Quebec nonchilled
## 4  Qn1 Quebec nonchilled
## 5  Qn1 Quebec nonchilled
## 6  Qn1 Quebec nonchilled
## 7  Qn1 Quebec nonchilled
## 8  Qn2 Quebec nonchilled
## 9  Qn2 Quebec nonchilled
## 10 Qn2 Quebec nonchilled

colpaste2df & colsplit2df: Splitting Columns in Dataframes

## Create a dataset with a pasted column
(dat <- colpaste2df(head(CO2), 1:3, keep.orig = FALSE)[, c(3, 1:2)])
##    Plant&Type&Treatment conc uptake
## 1 Qn1.Quebec.nonchilled   95   16.0
## 2 Qn1.Quebec.nonchilled  175   30.4
## 3 Qn1.Quebec.nonchilled  250   34.8
## 4 Qn1.Quebec.nonchilled  350   37.2
## 5 Qn1.Quebec.nonchilled  500   35.3
## 6 Qn1.Quebec.nonchilled  675   39.2
## Split column
colsplit2df(dat)
##   Plant   Type  Treatment conc uptake
## 1   Qn1 Quebec nonchilled   95   16.0
## 2   Qn1 Quebec nonchilled  175   30.4
## 3   Qn1 Quebec nonchilled  250   34.8
## 4   Qn1 Quebec nonchilled  350   37.2
## 5   Qn1 Quebec nonchilled  500   35.3
## 6   Qn1 Quebec nonchilled  675   39.2
## Specify names
colsplit2df(dat, new.names = qcv(A, B, C))
##     A      B          C conc uptake
## 1 Qn1 Quebec nonchilled   95   16.0
## 2 Qn1 Quebec nonchilled  175   30.4
## 3 Qn1 Quebec nonchilled  250   34.8
## 4 Qn1 Quebec nonchilled  350   37.2
## 5 Qn1 Quebec nonchilled  500   35.3
## 6 Qn1 Quebec nonchilled  675   39.2
## Keep the original pasted column
colsplit2df(dat, new.names = qcv(A, B, C), keep.orig = TRUE)
##    Plant&Type&Treatment   A      B          C conc uptake
## 1 Qn1.Quebec.nonchilled Qn1 Quebec nonchilled   95   16.0
## 2 Qn1.Quebec.nonchilled Qn1 Quebec nonchilled  175   30.4
## 3 Qn1.Quebec.nonchilled Qn1 Quebec nonchilled  250   34.8
## 4 Qn1.Quebec.nonchilled Qn1 Quebec nonchilled  350   37.2
## 5 Qn1.Quebec.nonchilled Qn1 Quebec nonchilled  500   35.3
## 6 Qn1.Quebec.nonchilled Qn1 Quebec nonchilled  675   39.2
## Pasting columns and output a dataframe
colpaste2df(head(mtcars)[, 1:5], qcv(mpg, cyl, disp), sep ="_", name.sep = "|")
##                    mpg cyl disp  hp drat mpg|cyl|disp
## Mazda RX4         21.0   6  160 110 3.90     21_6_160
## Mazda RX4 Wag     21.0   6  160 110 3.90     21_6_160
## Datsun 710        22.8   4  108  93 3.85   22.8_4_108
## Hornet 4 Drive    21.4   6  258 110 3.08   21.4_6_258
## Hornet Sportabout 18.7   8  360 175 3.15   18.7_8_360
## Valiant           18.1   6  225 105 2.76   18.1_6_225
colpaste2df(head(CO2)[, -3], list(1:2, qcv("conc", "uptake")))
##   Plant   Type conc uptake Plant&Type conc&uptake
## 1   Qn1 Quebec   95   16.0 Qn1.Quebec       95.16
## 2   Qn1 Quebec  175   30.4 Qn1.Quebec    175.30.4
## 3   Qn1 Quebec  250   34.8 Qn1.Quebec    250.34.8
## 4   Qn1 Quebec  350   37.2 Qn1.Quebec    350.37.2
## 5   Qn1 Quebec  500   35.3 Qn1.Quebec    500.35.3
## 6   Qn1 Quebec  675   39.2 Qn1.Quebec    675.39.2

lcolsplit2df: Splitting Columns in Lists of Dataframes

## A list with dataframes that contain pasted columns
x <- question_type(DATA.SPLIT$state, list(DATA.SPLIT$sex, DATA.SPLIT$adult))
ltruncdf(x[1:4])
## $raw
##   sex&adult   raw.text n.row endmark strip.text     q.type
## 1       m.1 What shoul     4       ?  what shou       what
## 2       f.0 How can we     7       ?  how can w        how
## 3       f.0 What are y    10       ?  what are        what
## 4       f.1 Shall we m    11       ?  shall we       shall
## 5       m.0 You alread    15       ?  you alrea implied_do
## 
## $count
##   sex&adult tot.quest what how shall implied_do
## 1       f.0         2    1   1     0          0
## 2       f.1         1    0   0     1          0
## 3       m.0         1    0   0     0          1
## 4       m.1         1    1   0     0          0
## 
## $prop
##   sex&adult tot.quest what how shall implied_do
## 1       f.0         2   50  50     0          0
## 2       f.1         1    0   0   100          0
## 3       m.0         1    0   0     0        100
## 4       m.1         1  100   0     0          0
## 
## $rnp
##   sex&adult tot.quest    what    how   shall implied_do
## 1       f.0         2  1(50%) 1(50%)       0          0
## 2       f.1         1       0      0 1(100%)          0
## 3       m.0         1       0      0       0    1(100%)
## 4       m.1         1 1(100%)      0       0          0
z <- lcolsplit2df(x)
ltruncdf(z[1:4])
## $raw
##   sex adult   raw.text n.row endmark strip.text     q.type
## 1   m     1 What shoul     4       ?  what shou       what
## 2   f     0 How can we     7       ?  how can w        how
## 3   f     0 What are y    10       ?  what are        what
## 4   f     1 Shall we m    11       ?  shall we       shall
## 5   m     0 You alread    15       ?  you alrea implied_do
## 
## $count
##   sex adult tot.quest what how shall implied_do
## 1   f     0         2    1   1     0          0
## 2   f     1         1    0   0     1          0
## 3   m     0         1    0   0     0          1
## 4   m     1         1    1   0     0          0
## 
## $prop
##   sex adult tot.quest what how shall implied_do
## 1   f     0         2   50  50     0          0
## 2   f     1         1    0   0   100          0
## 3   m     0         1    0   0     0        100
## 4   m     1         1  100   0     0          0
## 
## $rnp
##   sex adult tot.quest    what    how   shall implied_do
## 1   f     0         2  1(50%) 1(50%)       0          0
## 2   f     1         1       0      0 1(100%)          0
## 3   m     0         1       0      0       0    1(100%)
## 4   m     1         1 1(100%)      0       0          0

Generate Unit Spans

Often a researcher will want to view the patterns of the discourse by grouping variables over time. This requires the data to have start and end times based on units (sentence, turn of talk, or word). The gantt function provides the user with unit spans (start and end times), with gantt_rep extending this capability to repeated measures. The gantt function has a basic plotting method to allow visualization of the unit span data; however, the gantt_wrap function extends gantt and gantt_rep to plot precise depictions (Gantt plots) of the unit span data. Note that if the researcher is only interested in plotting the data as a Gantt plot, the gantt_plot function combines the gantt/gantt_rep functions with gantt_wrap (see the sketch at the end of this subsection).

Unit Spans

## Unit Span Dataframe
dat <- gantt(mraja1$dialogue, mraja1$person) 
head(dat, 12)
##     person  n start end
## 1  Sampson  8     0   8
## 2  Gregory  7     8  15
## 3  Sampson  9    15  24
## 4  Gregory 11    24  35
## 5  Sampson  5    35  40
## 6  Gregory  8    40  48
## 7  Sampson  9    48  57
## 8  Gregory 20    57  77
## 9  Sampson 22    77  99
## 10 Gregory 13    99 112
## 11 Sampson 30   112 142
## 12 Gregory 10   142 152
plot(dat)

plot of chunk unnamed-chunk-41

plot(dat, base = TRUE)

plot of chunk unnamed-chunk-41

Repeated Measures Unit Spans

## Repeated Measures Unit Span Dataframe
dat2 <- with(rajSPLIT, gantt_rep(act, dialogue, list(fam.aff, sex)))
head(dat2, 12)
##    act fam.aff_sex   n start end
## 1    1       cap_m 327     0 327
## 2    1      mont_m   8   327 335
## 3    1       cap_m   6   335 341
## 4    1      mont_m   8   341 349
## 5    1       cap_m  32   349 381
## 6    1      mont_m   4   381 385
## 7    1       cap_m  16   385 401
## 8    1      mont_m   2   401 403
## 9    1       cap_m  14   403 417
## 10   1      mont_m   2   417 419
## 11   1       cap_m  10   419 429
## 12   1      mont_m  12   429 441
## Plotting Repeated Measures Unit Span Dataframe
plot(dat2)

plot of chunk unnamed-chunk-42

gantt_wrap(dat2, "fam.aff_sex", facet.vars = "act",
    title = "Repeated Measures Gantt Plot")

plot of chunk unnamed-chunk-42
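
For the one-step route mentioned above, a sketch of gantt_plot on the same repeated measures data; the argument order is assumed to mirror the gantt_rep call:

## One-step Gantt plot (assumed order: text, grouping, repeated measure)
with(rajSPLIT, gantt_plot(dialogue, list(fam.aff, sex), act,
    title = "Repeated Measures Gantt Plot"))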

Create Adjacency Matrix

It is useful to convert data to an adjacency matrix for examining relationships between grouping variables in word usage. The adjacency_matrix function (alias adjmat) provides this capability, interacting with a termco or wfm object. In the first example below Sam and Greg share 4 words in common, whereas the Teacher and Greg share no words. The adjacency matrix can be passed to a network graphing package such as igraph for visualization of the data structure, as seen in the plotting examples below.

Adjacency Matrix: Example 1

adjacency_matrix(wfm(DATA$state, DATA$person))
## Adjacency Matrix:
## 
##            greg researcher sally sam
## researcher    0                     
## sally         1          1          
## sam           4          0     1    
## teacher       0          1     2   0
## 
## 
## Summed occurrences:
## 
##       greg researcher      sally        sam    teacher 
##         18          6         10         11          4 

Adjacency Matrix: Example 2

words <- c(" education", " war ", " econom", " job", "governor ")
(terms <- with(pres_debates2012, termco(dialogue, person, words)))
adjmat(terms)
## Adjacency Matrix:
## 
##           OBAMA ROMNEY CROWLEY LEHRER QUESTION
## ROMNEY        5                               
## CROWLEY       2      2                        
## LEHRER        4      4       2                
## QUESTION      4      4       2      4         
## SCHIEFFER     2      2       1      1        1
## 
## 
## Summed occurrences:
## 
##     OBAMA    ROMNEY   CROWLEY    LEHRER  QUESTION SCHIEFFER 
##         5         5         2         4         4         2 

It is often useful to plot the adjacency matrix as a network. The igraph package provides this functionality.

Plotting an Adjacency Matrix: Example 1

library(igraph)
dat <- adjacency_matrix(wfm(DATA$state, DATA$person, stopword = Top25Words))
g <- graph.adjacency(dat$adjacency, weighted=TRUE, mode ="undirected")
g <- simplify(g)
V(g)$label <- V(g)$name
V(g)$degree <- igraph::degree(g)
plot(g, layout=layout.auto(g))

plot of chunk unnamed-chunk-45

The following example will visualize the presidential debates data as a network plot.

Plotting an Adjacency Matrix: Example 2

library(igraph)

## Subset the presidential debates data set
subpres <- pres_debates2012[pres_debates2012$person %in% qcv(ROMNEY, OBAMA), ]

## Create a word frequency matrix
dat <- with(subpres, wfm(dialogue, list(person, time), stopword = Top200Words))

## Generate an adjacency matrix
adjdat <- adjacency_matrix(dat)
X <- adjdat$adjacency

g <- graph.adjacency(X, weighted=TRUE, mode ="undirected")
g <- simplify(g)
V(g)$label <- V(g)$name
V(g)$degree <- igraph::degree(g)
plot(g, layout=layout.auto(g))

plot of chunk unnamed-chunk-46

We can easily add information to the network plot utilizing the Dissimilarity function to obtain weights and distance measures for use with the plot.

Plotting an Adjacency Matrix: Example 2b

edge.weight <- 15  #a maximizing thickness constant
d <- as.matrix(Dissimilarity(dat))
d2 <- d[lower.tri(d)]
z1 <- edge.weight*d2^2/max(d2)
z2 <- c(round(d2, 3))
E(g)$width <- c(z1)[c(z1) != 0] 
E(g)$label <- c(z2)[c(z2) != 0]
plot(g, layout=layout.auto(g))

plot of chunk unnamed-chunk-47

plot(g, layout=layout.auto(g), edge.curved =TRUE)

plot of chunk unnamed-chunk-47

Plotting an Adjacency Matrix: Try the plot interactively!

tkplot(g)

Extract Words

The following functions will be utilized in this section (click to view more):
- Searches Text Column for Words
- Bag of Words
- Find Common Words Between Groups
- Exclude Elements From a Vector
- Generate ngrams
- Remove Stopwords
- Strip Text of Unwanted Characters/Capitalization
- Search For Synonyms
- Find Associated Words
- Differences In Word Use Between Groups
- Raw Word Lists/Frequency Counts

This section overviews functions that can extract words and word lists from dialogue text. The subsections describing function use are in alphabetical order as there is no set chronology for use.

Searches Text Column for Words

The all_words function breaks the dialogue into a bag of words and searches it based on the criteria arguments begins.with and contains. The resulting word list can be useful for analysis or to pass to qdap functions that deal with Word Counts and Descriptive Statistics.

all_words

## Words starting with `re`
x1 <- all_words(raj$dialogue, begins.with="re")
head(x1, 10)
##    WORD       FREQ
## 1  re            2
## 2  reach         1
## 3  read          6
## 4  ready         5
## 5  rearward      1
## 6  reason        5
## 7  reason's      1
## 8  rebeck        1
## 9  rebellious    1
## 10 receipt       1
## Words containing `conc`
all_words(raj$dialogue, contains = "conc")
##   WORD      FREQ
## 1 conceal'd    1
## 2 conceit      2
## 3 conceive     1
## 4 concludes    1
## 5 reconcile    1
## All words ordered by frequency
x2 <- all_words(raj$dialogue, alphabetical = FALSE)
head(x2, 10)
##    WORD FREQ
## 1  and   666
## 2  the   656
## 3  i     573
## 4  to    517
## 5  a     445
## 6  of    378
## 7  my    358
## 8  is    344
## 9  that  344
## 10 in    312

Word Splitting (Bag of Words)

The qdap package utilizes the following functions to turn text into a bag of words (word order is preserved):

- bag_o_words: Reduces a text column to a single vector bag of words.
- breaker: Reduces a text column to a single vector bag of words and qdap recognized end marks.
- word_split: Reduces a text column to a list of vectors of bag of words and qdap recognized end marks (i.e., “.”, “!”, “?”, “*”, “-”).

Bag of words can be useful for any number of reasons within the scope of analyzing discourse. Many other qdap functions employ or mention these three functions as seen in the following counts for the three word splitting functions.

   Function                      bag_o_words breaker word_split
1  all_words.R                             1       -          -
2  automated_readability_index.R           -       -          2
3  bag_o_words.R                          10       6          3
4  capitalizer.R                           3       1          -
5  imperative.R                            -       3          -
6  ngrams.R                                1       -          -
7  polarity.R                              2       -          -
8  rm_stopwords.R                          1       3          -
9  textLISTER.R                            -       -          2
10 trans_cloud.R                           1       1          -
11 wfm.R                                   1       -          -


Word Splitting Examples

bag_o_words("I'm going home!")
## [1] "i'm"   "going" "home"
bag_o_words("I'm going home!", apostrophe.remove = TRUE)
## [1] "im"    "going" "home"
bag_o_words(DATA$state)
##  [1] "computer" "is"       "fun"      "not"      "too"      "fun"     
##  [7] "no"       "it's"     "not"      "it's"     "dumb"     "what"    
## [13] "should"   "we"       "do"       "you"      "liar"     "it"      
## [19] "stinks"   "i"        "am"       "telling"  "the"      "truth"   
## [25] "how"      "can"      "we"       "be"       "certain"  "there"   
## [31] "is"       "no"       "way"      "i"        "distrust" "you"     
## [37] "what"     "are"      "you"      "talking"  "about"    "shall"   
## [43] "we"       "move"     "on"       "good"     "then"     "i'm"     
## [49] "hungry"   "let's"    "eat"      "you"      "already"
by(DATA$state, DATA$person, bag_o_words)
## DATA$person: greg
##  [1] "no"      "it's"    "not"     "it's"    "dumb"    "i"       "am"     
##  [8] "telling" "the"     "truth"   "there"   "is"      "no"      "way"    
## [15] "i'm"     "hungry"  "let's"   "eat"     "you"     "already"
## -------------------------------------------------------- 
## DATA$person: researcher
## [1] "shall" "we"    "move"  "on"    "good"  "then" 
## -------------------------------------------------------- 
## DATA$person: sally
##  [1] "how"     "can"     "we"      "be"      "certain" "what"    "are"    
##  [8] "you"     "talking" "about"  
## -------------------------------------------------------- 
## DATA$person: sam
##  [1] "computer" "is"       "fun"      "not"      "too"      "fun"     
##  [7] "you"      "liar"     "it"       "stinks"   "i"        "distrust"
## [13] "you"     
## -------------------------------------------------------- 
## DATA$person: teacher
## [1] "what"   "should" "we"     "do"
lapply(DATA$state,  bag_o_words)
## [[1]]
## [1] "computer" "is"       "fun"      "not"      "too"      "fun"     
## 
## [[2]]
## [1] "no"   "it's" "not"  "it's" "dumb"
## 
## [[3]]
## [1] "what"   "should" "we"     "do"    
## 
## [[4]]
## [1] "you"    "liar"   "it"     "stinks"
## 
## [[5]]
## [1] "i"       "am"      "telling" "the"     "truth"  
## 
## [[6]]
## [1] "how"     "can"     "we"      "be"      "certain"
## 
## [[7]]
## [1] "there" "is"    "no"    "way"  
## 
## [[8]]
## [1] "i"        "distrust" "you"     
## 
## [[9]]
## [1] "what"    "are"     "you"     "talking" "about"  
## 
## [[10]]
## [1] "shall" "we"    "move"  "on"    "good"  "then" 
## 
## [[11]]
## [1] "i'm"     "hungry"  "let's"   "eat"     "you"     "already"
breaker(DATA$state)
##  [1] "Computer" "is"       "fun"      "."        "Not"      "too"     
##  [7] "fun"      "."        "No"       "it's"     "not,"     "it's"    
## [13] "dumb"     "."        "What"     "should"   "we"       "do"      
## [19] "?"        "You"      "liar,"    "it"       "stinks"   "!"       
## [25] "I"        "am"       "telling"  "the"      "truth"    "!"       
## [31] "How"      "can"      "we"       "be"       "certain"  "?"       
## [37] "There"    "is"       "no"       "way"      "."        "I"       
## [43] "distrust" "you"      "."        "What"     "are"      "you"     
## [49] "talking"  "about"    "?"        "Shall"    "we"       "move"    
## [55] "on"       "?"        "Good"     "then"     "."        "I'm"     
## [61] "hungry"   "."        "Let's"    "eat"      "."        "You"     
## [67] "already"  "?"
by(DATA$state, DATA$person, breaker)
## DATA$person: greg
##  [1] "No"      "it's"    "not,"    "it's"    "dumb"    "."       "I"      
##  [8] "am"      "telling" "the"     "truth"   "!"       "There"   "is"     
## [15] "no"      "way"     "."       "I'm"     "hungry"  "."       "Let's"  
## [22] "eat"     "."       "You"     "already" "?"      
## -------------------------------------------------------- 
## DATA$person: researcher
## [1] "Shall" "we"    "move"  "on"    "?"     "Good"  "then"  "."    
## -------------------------------------------------------- 
## DATA$person: sally
##  [1] "How"     "can"     "we"      "be"      "certain" "?"       "What"   
##  [8] "are"     "you"     "talking" "about"   "?"      
## -------------------------------------------------------- 
## DATA$person: sam
##  [1] "Computer" "is"       "fun"      "."        "Not"      "too"     
##  [7] "fun"      "."        "You"      "liar,"    "it"       "stinks"  
## [13] "!"        "I"        "distrust" "you"      "."       
## -------------------------------------------------------- 
## DATA$person: teacher
## [1] "What"   "should" "we"     "do"     "?"
lapply(DATA$state,  breaker)
## [[1]]
## [1] "Computer" "is"       "fun"      "."        "Not"      "too"     
## [7] "fun"      "."       
## 
## [[2]]
## [1] "No"   "it's" "not," "it's" "dumb" "."   
## 
## [[3]]
## [1] "What"   "should" "we"     "do"     "?"     
## 
## [[4]]
## [1] "You"    "liar,"  "it"     "stinks" "!"     
## 
## [[5]]
## [1] "I"       "am"      "telling" "the"     "truth"   "!"      
## 
## [[6]]
## [1] "How"     "can"     "we"      "be"      "certain" "?"      
## 
## [[7]]
## [1] "There" "is"    "no"    "way"   "."    
## 
## [[8]]
## [1] "I"        "distrust" "you"      "."       
## 
## [[9]]
## [1] "What"    "are"     "you"     "talking" "about"   "?"      
## 
## [[10]]
## [1] "Shall" "we"    "move"  "on"    "?"     "Good"  "then"  "."    
## 
## [[11]]
## [1] "I'm"     "hungry"  "."       "Let's"   "eat"     "."       "You"    
## [8] "already" "?"
word_split(c(NA, DATA$state))
## $<NA>
## [1] NA
## 
## $`Computer is fun. Not too fun.`
## [1] "Computer" "is"       "fun"      "."        "Not"      "too"     
## [7] "fun"      "."       
## 
## $`No it's not, it's dumb.`
## [1] "No"   "it's" "not," "it's" "dumb" "."   
## 
## $`What should we do?`
## [1] "What"   "should" "we"     "do"     "?"     
## 
## $`You liar, it stinks!`
## [1] "You"    "liar,"  "it"     "stinks" "!"     
## 
## $`I am telling the truth!`
## [1] "I"       "am"      "telling" "the"     "truth"   "!"      
## 
## $`How can we be certain?`
## [1] "How"     "can"     "we"      "be"      "certain" "?"      
## 
## $`There is no way.`
## [1] "There" "is"    "no"    "way"   "."    
## 
## $`I distrust you.`
## [1] "I"        "distrust" "you"      "."       
## 
## $`What are you talking about?`
## [1] "What"    "are"     "you"     "talking" "about"   "?"      
## 
## $`Shall we move on? Good then.`
## [1] "Shall" "we"    "move"  "on"    "?"     "Good"  "then"  "."    
## 
## $`I'm hungry. Let's eat. You already?`
## [1] "I'm"     "hungry"  "."       "Let's"   "eat"     "."       "You"    
## [8] "already" "?"

Find Common Words Between Groups

The common function finds items that are common between n vectors (i.e., subjects or grouping variables). This is useful for determining common language choices shared across participants in a conversation.

Words in Common Examples

## Create vectors of words
a <- c("a", "cat", "dog", "the", "the")
b <- c("corn", "a", "chicken", "the")
d <- c("house", "feed", "a", "the", "chicken")

## Supply individual vectors
common(a, b, d, overlap=2)
##      word freq
## 1       a    3
## 2     the    3
## 3 chicken    2
common(a, b, d, overlap=3)
##   word freq
## 1    a    3
## 2  the    3
## Supply a list of vectors
common(list(a, b, d))
##   word freq
## 1    a    3
## 2  the    3
## Using to find common words between subjects
common(word_list(DATA$state, DATA$person)$cwl, overlap = 2)
##   word freq
## 1   we    3
## 2  you    3
## 3    I    2
## 4   is    2
## 5  not    2
## 6 what    2

Exclude Elements From a Vector

It is often useful and more efficient to start with a preset vector of words and eliminate or exclude the words you do not wish to include. Examples could range from excluding individuals from a column of participant names to excluding a few select words from a pre-defined qdap word list. This is particularly useful for passing terms or stopwords to word counting functions like termco or trans_cloud. A participant-exclusion sketch follows the examples below.

exclude Examples

exclude(1:10, 3, 4)
## [1]  1  2  5  6  7  8  9 10
exclude(Top25Words, qcv(the, of, and))
##  [1] "a"    "to"   "in"   "is"   "you"  "that" "it"   "he"   "was"  "for" 
## [11] "on"   "are"  "as"   "with" "his"  "they" "I"    "at"   "be"   "this"
## [21] "have" "from"
exclude(Top25Words, "the", "of", "an")
##  [1] "and"  "a"    "to"   "in"   "is"   "you"  "that" "it"   "he"   "was" 
## [11] "for"  "on"   "are"  "as"   "with" "his"  "they" "I"    "at"   "be"  
## [21] "this" "have" "from"
## Using with `term_match` and `termco`
MTCH.LST <- exclude(term_match(DATA$state, qcv(th, i)), qcv(truth, stinks))
termco(DATA$state, DATA$person, MTCH.LST)
##       person word.count        th          i
## 1       greg         20 3(15.00%) 13(65.00%)
## 2 researcher          6 2(33.33%)          0
## 3      sally         10         0  4(40.00%)
## 4        sam         13         0 11(84.62%)
## 5    teacher          4         0          0
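
A minimal sketch of the participant-exclusion use case mentioned above, dropping the researcher from the column of participant names:

## Exclude a participant from the vector of names
exclude(unique(as.character(DATA$person)), "researcher")
## [1] "sam"     "greg"    "teacher" "sally"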

Generate ngrams

Utilizing ngrams can be useful for gaining a sense of what terms are used in conjunction with other terms. This is particularly useful in the analysis of dialogue when the combination of a particular vocabulary is meaningful. The ngrams function provides a list of ngram-related output that can be utilized in various analyses.

ngrams Example (note that the output is only partial)

out <- ngrams(DATA$state, DATA$person, 2)
lapply(out[["all_n"]], function(x) sapply(x, paste, collapse = " "))
## $n_1
##  [1] "about"    "already"  "am"       "are"      "be"       "can"     
##  [7] "certain"  "computer" "distrust" "do"       "dumb"     "eat"     
## [13] "fun"      "fun"      "good"     "how"      "hungry"   "i"       
## [19] "i"        "i'm"      "is"       "is"       "it"       "it's"    
## [25] "it's"     "let's"    "liar"     "move"     "no"       "no"      
## [31] "not"      "not"      "on"       "shall"    "should"   "stinks"  
## [37] "talking"  "telling"  "the"      "then"     "there"    "too"     
## [43] "truth"    "way"      "we"       "we"       "we"       "what"    
## [49] "what"     "you"      "you"      "you"      "you"     
## 
## $n_2
##  [1] "am telling"    "are you"       "be certain"    "can we"       
##  [5] "computer is"   "distrust you"  "eat you"       "fun not"      
##  [9] "good then"     "how can"       "hungry let's"  "i'm hungry"   
## [13] "i am"          "i distrust"    "is fun"        "is no"        
## [17] "it's dumb"     "it's not"      "it stinks"     "let's eat"    
## [21] "liar it"       "move on"       "no it's"       "no way"       
## [25] "not it's"      "not too"       "on good"       "shall we"     
## [29] "should we"     "talking about" "telling the"   "the truth"    
## [33] "there is"      "too fun"       "we be"         "we do"        
## [37] "we move"       "what are"      "what should"   "you already"  
## [41] "you liar"      "you talking"

Remove Stopwords

In analyzing discourse it may be helpful to remove certain words from the analysis as the words may not be meaningful or may overshadow the impact of other words. The rm_stopwords function can be utilized to remove stopwords from the dialogue before passing it to further analysis. It should be noted that many functions have a stopwords argument that allows for the removal of the stopwords within the function environment rather than altering the text in the primary discourse dataframe (see the sketch after the examples below). Careful researcher consideration must be given to the functional impact of removing words from an analysis.

Stopword Removal Examples

## The data
DATA$state
##  [1] "Computer is fun. Not too fun."        
##  [2] "No it's not, it's dumb."              
##  [3] "What should we do?"                   
##  [4] "You liar, it stinks!"                 
##  [5] "I am telling the truth!"              
##  [6] "How can we be certain?"               
##  [7] "There is no way."                     
##  [8] "I distrust you."                      
##  [9] "What are you talking about?"          
## [10] "Shall we move on?  Good then."        
## [11] "I'm hungry.  Let's eat.  You already?"
rm_stopwords(DATA$state, Top200Words)
## [[1]]
## [1] "computer" "fun"      "."        "fun"      "."       
## 
## [[2]]
## [1] "it's" ","    "it's" "dumb" "."   
## 
## [[3]]
## [1] "?"
## 
## [[4]]
## [1] "liar"   ","      "stinks" "!"     
## 
## [[5]]
## [1] "am"      "telling" "truth"   "!"      
## 
## [[6]]
## [1] "certain" "?"      
## 
## [[7]]
## [1] "."
## 
## [[8]]
## [1] "distrust" "."       
## 
## [[9]]
## [1] "talking" "?"      
## 
## [[10]]
## [1] "shall" "?"     "."    
## 
## [[11]]
## [1] "i'm"     "hungry"  "."       "let's"   "eat"     "."       "already"
## [8] "?"
rm_stopwords(DATA$state, Top200Words, strip = TRUE)
## [[1]]
## [1] "computer" "fun"      "fun"     
## 
## [[2]]
## [1] "it's" "it's" "dumb"
## 
## [[3]]
## character(0)
## 
## [[4]]
## [1] "liar"   "stinks"
## 
## [[5]]
## [1] "am"      "telling" "truth"  
## 
## [[6]]
## [1] "certain"
## 
## [[7]]
## character(0)
## 
## [[8]]
## [1] "distrust"
## 
## [[9]]
## [1] "talking"
## 
## [[10]]
## [1] "shall"
## 
## [[11]]
## [1] "i'm"     "hungry"  "let's"   "eat"     "already"
rm_stopwords(DATA$state, Top200Words, separate = FALSE)
##  [1] "computer fun. fun."              "it's, it's dumb."               
##  [3] "?"                               "liar, stinks!"                  
##  [5] "am telling truth!"               "certain?"                       
##  [7] "."                               "distrust."                      
##  [9] "talking?"                        "shall?."                        
## [11] "i'm hungry. let's eat. already?"
rm_stopwords(DATA$state, Top200Words, unlist = TRUE, unique = TRUE)
##  [1] "computer" "fun"      "."        "it's"     ","        "dumb"    
##  [7] "?"        "liar"     "stinks"   "!"        "am"       "telling" 
## [13] "truth"    "certain"  "distrust" "talking"  "shall"    "i'm"     
## [19] "hungry"   "let's"    "eat"      "already"
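
As an alternative to altering the text itself, the same word list can be passed to a function's stopwords argument. A minimal sketch using word_list (demonstrated in full later in this vignette):

## Remove stopwords within the function call, leaving DATA$state untouched
with(DATA, word_list(state, person, stopwords = Top200Words))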

Strip Text of Unwanted Characters/Capitalization

It is often useful to remove capitalization and punctuation from the dialogue in order to standardize the text; R is case sensitive. By removing capital letters and extra punctuation with the strip function the text becomes more comparable. In the following output we can see, through the == comparison operator and the outer function, that the use of strip makes the different forms of Dan comparable.

x <- c("Dan", "dan", "dan.", "DAN")
y <- outer(x, x, "==")
dimnames(y) <- list(x, x); y
##        Dan   dan  dan.   DAN
## Dan   TRUE FALSE FALSE FALSE
## dan  FALSE  TRUE FALSE FALSE
## dan. FALSE FALSE  TRUE FALSE
## DAN  FALSE FALSE FALSE  TRUE
x <- strip(c("Dan", "dan", "dan.", "DAN"))
y <- outer(x, x, "==")
dimnames(y) <- list(x, x); y
##      dan  dan  dan  dan
## dan TRUE TRUE TRUE TRUE
## dan TRUE TRUE TRUE TRUE
## dan TRUE TRUE TRUE TRUE
## dan TRUE TRUE TRUE TRUE

As seen in the examples below, strip comes with multiple arguments to adjust the degree of text standardization.

strip Examples

## Demonstrating the standardization of the data
DATA$state
##  [1] "Computer is fun. Not too fun."        
##  [2] "No it's not, it's dumb."              
##  [3] "What should we do?"                   
##  [4] "You liar, it stinks!"                 
##  [5] "I am telling the truth!"              
##  [6] "How can we be certain?"               
##  [7] "There is no way."                     
##  [8] "I distrust you."                      
##  [9] "What are you talking about?"          
## [10] "Shall we move on?  Good then."        
## [11] "I'm hungry.  Let's eat.  You already?"
strip(DATA$state)
##  [1] "computer is fun not too fun"    "no its not its dumb"           
##  [3] "what should we do"              "you liar it stinks"            
##  [5] "i am telling the truth"         "how can we be certain"         
##  [7] "there is no way"                "i distrust you"                
##  [9] "what are you talking about"     "shall we move on good then"    
## [11] "im hungry lets eat you already"
strip(DATA$state, apostrophe.remove=FALSE)
##  [1] "computer is fun not too fun"      "no it's not it's dumb"           
##  [3] "what should we do"                "you liar it stinks"              
##  [5] "i am telling the truth"           "how can we be certain"           
##  [7] "there is no way"                  "i distrust you"                  
##  [9] "what are you talking about"       "shall we move on good then"      
## [11] "i'm hungry let's eat you already"
strip(DATA$state, char.keep = c("?", "."))
##  [1] "computer is fun. not too fun."    
##  [2] "no its not its dumb."             
##  [3] "what should we do?"               
##  [4] "you liar it stinks"               
##  [5] "i am telling the truth"           
##  [6] "how can we be certain?"           
##  [7] "there is no way."                 
##  [8] "i distrust you."                  
##  [9] "what are you talking about?"      
## [10] "shall we move on? good then."     
## [11] "im hungry. lets eat. you already?"
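
Other arguments documented in ?strip include lower.case and digit.remove. A minimal sketch retaining the original capitalization (output omitted):

## Remove punctuation but keep capital letters
strip(DATA$state, lower.case = FALSE)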

Search For Synonyms

It is useful in discourse analysis to analyze vocabulary use. This may mean searching for words similar to your initial word list. The synonyms (aka syn) function generates synonyms from the qdapDictionaries' SYNONYM dictionary. These synonyms can be returned as a list or a vector that can then be passed to other qdap functions.

Synonyms Examples

synonyms(c("the", "cat", "teach"))
## no match for the following:
## 
## the
## ========================
## $cat.def_1
## [1] "feline"    "gib"       "grimalkin" "kitty"     "malkin"   
## 
## $cat.def_2
## [1] "moggy"
## 
## $cat.def_3
## [1] "mouser" "puss"  
## 
## $cat.def_4
## [1] "pussy"
## 
## $cat.def_5
## [1] "tabby"
## 
## $teach.def_1
##  [1] "advise"          "coach"           "demonstrate"    
##  [4] "direct"          "discipline"      "drill"          
##  [7] "edify"           "educate"         "enlighten"      
## [10] "give lessons in" "guide"           "impart"         
## [13] "implant"         "inculcate"       "inform"         
## [16] "instil"          "instruct"        "school"         
## [19] "show"            "train"           "tutor"
syn(c("the", "cat", "teach"), return.list = FALSE)
## no match for the following:
## 
## the
## ========================
##  [1] "feline"          "gib"             "grimalkin"      
##  [4] "kitty"           "malkin"          "moggy"          
##  [7] "mouser"          "puss"            "pussy"          
## [10] "tabby"           "advise"          "coach"          
## [13] "demonstrate"     "direct"          "discipline"     
## [16] "drill"           "edify"           "educate"        
## [19] "enlighten"       "give lessons in" "guide"          
## [22] "impart"          "implant"         "inculcate"      
## [25] "inform"          "instil"          "instruct"       
## [28] "school"          "show"            "train"          
## [31] "tutor"
syn(c("the", "cat", "teach"), multiwords = FALSE)
## no match for the following:
## 
## the
## ========================
## $cat.def_1
## [1] "feline"    "gib"       "grimalkin" "kitty"     "malkin"   
## 
## $cat.def_2
## [1] "moggy"
## 
## $cat.def_3
## [1] "mouser" "puss"  
## 
## $cat.def_4
## [1] "pussy"
## 
## $cat.def_5
## [1] "tabby"
## 
## $teach.def_1
##  [1] "advise"      "coach"       "demonstrate" "direct"      "discipline" 
##  [6] "drill"       "edify"       "educate"     "enlighten"   "guide"      
## [11] "impart"      "implant"     "inculcate"   "inform"      "instil"     
## [16] "instruct"    "school"      "show"        "train"       "tutor"
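
Because return.list = FALSE yields a plain character vector, the synonyms can be passed on to other qdap functions. A minimal sketch using the termco interface shown earlier (the sample dialogue contains no synonyms of "teach", so the counts would be zero):

## Use the synonyms of "teach" as a term list for termco
teach_syns <- syn("teach", return.list = FALSE)
termco(DATA$state, DATA$person, list(teach_syns = teach_syns))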

Find Associated Words

It is also useful in discourse analysis to find words associated with particular terms. The word_associate function locates the units of dialogue containing the match terms and can optionally display the associated words as word clouds and word networks by grouping variable.

Word Association Examples

ms <- c(" I ", "you")
et <- c(" it", " tell", "tru")
word_associate(DATA2$state, DATA2$person, match.string = ms,
    wordcloud = TRUE,  proportional = TRUE,
    network.plot = TRUE,  nw.label.proportional = TRUE, extra.terms = et,
    cloud.legend = c("A", "B", "C"),
    title.color = "blue", cloud.colors = c("red", "purple", "gray70"))

plot of chunk unnamed-chunk-58

##    row group unit text                             
## 1    4   sam    4 You liar, it stinks!             
## 2    5  greg    5 I am telling the truth!          
## 3    8   sam    8 I distrust you.                  
## 4    9 sally    9 What are you talking about?      
## 5   11  greg   11 Im hungry. Lets eat. You already?
## 6   12   sam   12 I distrust you.                  
## 7   15  greg   15 I am telling the truth!          
## 8   18  greg   18 Im hungry. Lets eat. You already?
## 9   19 sally   19 What are you talking about?      
## 10  20   sam   20 You liar, it stinks!             
## 11  21  greg   21 I am telling the truth!          
## 12  22   sam   22 You liar, it stinks!             
## 13  24  greg   24 Im hungry. Lets eat. You already?
## 14  25  greg   25 I am telling the truth!          
## 15  30   sam   30 I distrust you.                  
## 16  31  greg   31 Im hungry. Lets eat. You already?
## 17  33   sam   33 I distrust you.                  
## 18  36   sam   36 You liar, it stinks!             
## 19  40  greg   40 I am telling the truth!          
## 20  41   sam   41 You liar, it stinks!             
## 21  42  greg   42 I am telling the truth!          
## 22  44   sam   44 You liar, it stinks!             
## 23  47   sam   47 I distrust you.                  
## 24  49   sam   49 You liar, it stinks!             
## 25  52 sally   52 What are you talking about?      
## 26  53 sally   53 What are you talking about?      
## 27  54  greg   54 I am telling the truth!          
## 28  55   sam   55 I distrust you.                  
## 29  56  greg   56 Im hungry. Lets eat. You already?
## 30  57  greg   57 I am telling the truth!          
## 31  58  greg   58 I am telling the truth!          
## 32  59  greg   59 Im hungry. Lets eat. You already?
## 33  62   sam   62 You liar, it stinks!             
## 34  63 sally   63 What are you talking about?      
## 35  65   sam   65 I distrust you.                  
## 36  67 sally   67 What are you talking about?      
## 37  68   sam   68 I distrust you.
## 
## Match Terms
## ===========
## 
## List 1:
## i, you

Differences In Word Use Between Groups

It can also be informative to compare the vocabularies of different groups. The word_diff_list function generates lists of the words that are unique to each group in every pairwise comparison of the grouping variables.

Word Difference Examples

out <- with(DATA, word_diff_list(text.var = state,
    grouping.var = list(sex, adult)))

ltruncdf(unlist(out, recursive = FALSE), n=4)
## $f.0_vs_f.1.unique_to_f.0
##    word freq       prop
## 1 about    1        0.1
## 2   are    1       0.25
## 3    be    1        0.1
## 4   can    1 0.16666666
## 
## $f.0_vs_f.1.unique_to_f.1
##    word freq       prop
## 1  good    1 0.03030303
## 2  move    1        0.1
## 3    on    1       0.25
## 4 shall    1        0.1
## 
## $f.0_vs_m.0.unique_to_f.0
##    word freq       prop
## 1 about    1        0.1
## 2   are    1       0.25
## 3    be    1        0.1
## 4   can    1 0.16666666
## 
## $f.0_vs_m.0.unique_to_m.0
##   word freq       prop
## 1  fun    2 0.06060606
## 2    i    2 0.06060606
## 3   is    2        0.2
## 4 it's    2 0.06060606
## 
## $f.1_vs_m.0.unique_to_f.1
##    word freq       prop
## 1  good    1 0.03030303
## 2  move    1        0.1
## 3    on    1       0.25
## 4 shall    1        0.1
## 
## $f.1_vs_m.0.unique_to_m.0
##   word freq       prop
## 1  you    3 0.09090909
## 2  fun    2 0.06060606
## 3    i    2 0.06060606
## 4   is    2        0.2
## 
## $f.0_vs_m.1.unique_to_f.0
##    word freq       prop
## 1 about    1        0.1
## 2   are    1       0.25
## 3    be    1        0.1
## 4   can    1 0.16666666
## 
## $f.0_vs_m.1.unique_to_m.1
##     word freq prop
## 1     do    1  0.1
## 2 should    1 0.25
## 
## $f.1_vs_m.1.unique_to_f.1
##    word freq       prop
## 1  good    1 0.03030303
## 2  move    1        0.1
## 3    on    1       0.25
## 4 shall    1        0.1
## 
## $f.1_vs_m.1.unique_to_m.1
##     word freq       prop
## 1     do    1        0.1
## 2 should    1       0.25
## 3   what    1 0.03030303
## 
## $m.0_vs_m.1.unique_to_m.0
##   word freq       prop
## 1  you    3 0.09090909
## 2  fun    2 0.06060606
## 3    i    2 0.06060606
## 4   is    2        0.2
## 
## $m.0_vs_m.1.unique_to_m.1
##     word freq       prop
## 1     do    1        0.1
## 2 should    1       0.25
## 3     we    1 0.16666666
## 4   what    1 0.03030303

Raw Word Lists/Frequency Counts

The word_list function generates raw word lists and frequency counts grouped by a grouping variable, with optional stopword removal and control of capitalization.

word_list Examples

with(DATA, word_list(state, person))
## $greg
##       WORD FREQ
## 1     it's    2
## 2       no    2
## 3  already    1
## 4       am    1
## 5     dumb    1
## 6      eat    1
## 7   hungry    1
## 8        I    1
## 9      I'm    1
## 10      is    1
## 11   let's    1
## 12     not    1
## 13 telling    1
## 14     the    1
## 15   there    1
## 16   truth    1
## 17     way    1
## 18     you    1
## 
## $researcher
##    WORD FREQ
## 1  good    1
## 2  move    1
## 3    on    1
## 4 shall    1
## 5  then    1
## 6    we    1
## 
## $sally
##       WORD FREQ
## 1    about    1
## 2      are    1
## 3       be    1
## 4      can    1
## 5  certain    1
## 6      how    1
## 7  talking    1
## 8       we    1
## 9     what    1
## 10     you    1
## 
## $sam
##        WORD FREQ
## 1       fun    2
## 2       you    2
## 3  computer    1
## 4  distrust    1
## 5         I    1
## 6        is    1
## 7        it    1
## 8      liar    1
## 9       not    1
## 10   stinks    1
## 11      too    1
## 
## $teacher
##     WORD FREQ
## 1     do    1
## 2 should    1
## 3     we    1
## 4   what    1
with(DATA, word_list(state, person, stopwords = Top25Words))
## $greg
##       WORD FREQ
## 1     it's    2
## 2       no    2
## 3  already    1
## 4       am    1
## 5     dumb    1
## 6      eat    1
## 7   hungry    1
## 8      I'm    1
## 9    let's    1
## 10     not    1
## 11 telling    1
## 12   there    1
## 13   truth    1
## 14     way    1
## 
## $researcher
##    WORD FREQ
## 1  good    1
## 2  move    1
## 3 shall    1
## 4  then    1
## 5    we    1
## 
## $sally
##      WORD FREQ
## 1   about    1
## 2     can    1
## 3 certain    1
## 4     how    1
## 5 talking    1
## 6      we    1
## 7    what    1
## 
## $sam
##       WORD FREQ
## 1      fun    2
## 2 computer    1
## 3 distrust    1
## 4     liar    1
## 5      not    1
## 6   stinks    1
## 7      too    1
## 
## $teacher
##     WORD FREQ
## 1     do    1
## 2 should    1
## 3     we    1
## 4   what    1
with(DATA, word_list(state, person, cap = FALSE, cap.list=c("do", "we")))
## $greg
##       WORD FREQ
## 1     it's    2
## 2       no    2
## 3  already    1
## 4       am    1
## 5     dumb    1
## 6      eat    1
## 7   hungry    1
## 8        I    1
## 9      I'm    1
## 10      is    1
## 11   let's    1
## 12     not    1
## 13 telling    1
## 14     the    1
## 15   there    1
## 16   truth    1
## 17     way    1
## 18     you    1
## 
## $researcher
##    WORD FREQ
## 1  good    1
## 2  move    1
## 3    on    1
## 4 shall    1
## 5  then    1
## 6    We    1
## 
## $sally
##       WORD FREQ
## 1    about    1
## 2      are    1
## 3       be    1
## 4      can    1
## 5  certain    1
## 6      how    1
## 7  talking    1
## 8       We    1
## 9     what    1
## 10     you    1
## 
## $sam
##        WORD FREQ
## 1       fun    2
## 2       you    2
## 3  computer    1
## 4  distrust    1
## 5         I    1
## 6        is    1
## 7        it    1
## 8      liar    1
## 9       not    1
## 10   stinks    1
## 11      too    1
## 
## $teacher
##     WORD FREQ
## 1     Do    1
## 2 should    1
## 3     We    1
## 4   what    1

Qualitative Coding System

The following functions will be utilized in this section (click to view more):
- Combine, Exclude, and Overlap Codes
- Coding Words: .csv Approach
- Coding Words: Transcript & List Approach
- Coding Words: Time Spans Approach
- Distance Matrix Between Codes

A major task in qualitative work is coding either time or words with selected coding structures. For example a researcher may code the teacher's dialogue as related to the resulting behavior of a student in a classroom as “high”, “medium” or “low” engagement. The researcher may choose to apply the coding to the words of the dialogue, to spans of time, or to both.

The coding process in qdap starts with the decision of whether to code the dialogue and/or the time spans. After that the researcher may follow the sequential subsections in the Qualitative Coding System section outlined in these steps:

  1. Making a template for coding dialogue/time spans
  2. The actual coding dialogue/time spans
  3. Reading in the dialogue/time spans
  4. Transforming codes (finding overlap and/or differences between word span/time span of codes)
  5. Initial analysis

If you choose the route of coding words, qdap gives two approaches. Each has distinct benefits and disadvantages dependent upon the situation. If you choose the coding of time spans, qdap provides one option.

If you choose to code words you may either code a csv file or code the transcript directly (perhaps with markers or other forms of markup), record the ranges in a text list, and then read in the data. Both approaches can result in the same data being read back into qdap. The csv approach may allow for extended capabilities (beyond the scope of this vignette), while the transcript/list approach is generally more efficient and is the approach many qualitative researchers typically utilize in qualitative coding (it also has the added benefit of producing a hard copy).

The next three subsections will walk the reader through how to make a template, code in the template, and read the data back into R/qdap. Subsections 4-5 will cover reshaping and initial analysis after the data has been read in (this approach is generally the same for all three coded data types).

  1. Coding Words - The .csv Approach - How to template, code, read in and reshape the data
  2. Coding Words - The Transcript/List Approach - How to template, code, read in and reshape the data
  3. Coding Time Spans - How to template, code, read in and reshape the data
  4. Transforming Codes
  5. Initial Coding Analysis

Before getting started with subsections 1-3 the reader will want to know the naming scheme of the code matrix (cm_) functions used. The initial cm_ is utilized for any function in the code matrix family. The functions containing cm_temp are template functions. The df, range, or time portion determines whether the csv (df), Transcript/List (range), or Time Span (time) approach is being utilized. cm_ functions that bear 2long transform a read-in list to a usable long format, as sketched below.
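
As a quick reference, the naming scheme maps to the functions used in the subsections below:

## cm_df.temp    -> template for the csv (df) approach
## cm_range.temp -> template for the Transcript/List (range) approach
## cm_time.temp  -> template for the Time Span (time) approach
## cm_2long      -> reshape any of the read-in lists to long format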

Coding Words - The .csv Approach [YT]

The csv approach utilizes the cm_df.temp and cm_2long functions. To utilize the csv template approach simply supply the dataframe, specify the text variable, and provide a list of anticipated codes.

Coding Words (csv approach): The Template

## Codes
codes <- qcv(dc, sf, wes, pol, rejk, lk, azx, mmm)

## The csv template
X <- cm_df.temp(DATA, text.var = "state", codes = codes, file = "DATA.csv")
qview(X)
========================================================================
nrow =  56           ncol =  14             X
========================================================================
   person sex adult code     text word.num dc sf wes pol rejk lk azx mmm
1     sam   m     0   K1 Computer        1  0  0   0   0    0  0   0   0
2     sam   m     0   K1       is        2  0  0   0   0    0  0   0   0
3     sam   m     0   K1     fun.        3  0  0   0   0    0  0   0   0
4     sam   m     0   K1      Not        4  0  0   0   0    0  0   0   0
5     sam   m     0   K1      too        5  0  0   0   0    0  0   0   0
6     sam   m     0   K1     fun.        6  0  0   0   0    0  0   0   0
7    greg   m     0   K2       No        7  0  0   0   0    0  0   0   0
8    greg   m     0   K2     it's        8  0  0   0   0    0  0   0   0
9    greg   m     0   K2     not,        9  0  0   0   0    0  0   0   0
10   greg   m     0   K2     it's       10  0  0   0   0    0  0   0   0

After coding (see the YouTube video) the data can be read back in with read.csv.

Coding Words (csv approach): Read In and Reshape

## Read in the data
dat <- read.csv("DATA.csv")

## Reshape to long format with word durations
cm_2long(dat)
    code     person sex adult code.1     text word.num start end variable
1     dc        sam   m     0     K1 Computer        1     0   1      dat
2    wes        sam   m     0     K1 Computer        1     0   1      dat
3   rejk        sam   m     0     K1 Computer        1     0   1      dat
4    mmm        sam   m     0     K1 Computer        1     0   1      dat
5     lk        sam   m     0     K1       is        2     1   2      dat
6    azx        sam   m     0     K1       is        2     1   2      dat
.
.
.
198  wes       greg   m     0    K11 already?       56    55  56      dat
199 rejk       greg   m     0    K11 already?       56    55  56      dat
200   lk       greg   m     0    K11 already?       56    55  56      dat
201  azx       greg   m     0    K11 already?       56    55  56      dat
202  mmm       greg   m     0    K11 already?       56    55  56      dat

Coding Words - The Transcript/List Approach [YT]

The Transcript/List approach utilizes the cm_df.transcript, cm_range.temp, and cm_2long functions. To use the transcript template simply supply the dataframe, specify the text variable, and provide a list of anticipated codes.

Coding Words (Transcript/List approach): Transcript Template

## Codes
codes <- qcv(AA, BB, CC)

## Transcript template
X <- cm_df.transcript(DATA$state, DATA$person, file="DATA.txt")
sam:

                                  
     1        2  3    4   5   6   
     Computer is fun. Not too fun.

greg:

                            
     7  8    9    10   11   
     No it's not, it's dumb.

teacher:

                       
     12   13     14 15 
     What should we do?

sam:

                         
     16  17    18 19     
     You liar, it stinks!

Coding Words (Transcript/List approach): List Template 1

### List template
cm_range.temp(codes, file = "foo1.txt")
list(
    AA = qcv(terms=''),
    BB = qcv(terms=''),
    CC = qcv(terms='')
)

The list below contains demographic variables. If the researcher has demographic variables it is recommended that they be supplied at this point. The demographic variables will be generated with durations automatically.

Coding Words (Transcript/List approach): List Template 2

### List template with demographic variables
with(DATA, cm_range.temp(codes = codes, text.var = state, 
    grouping.var = list(person, adult), file = "foo2.txt"))
list(
    person_greg = qcv(terms='7:11, 20:24, 30:33, 49:56'),
    person_researcher = qcv(terms='42:48'),
    person_sally = qcv(terms='25:29, 37:41'),
    person_sam = qcv(terms='1:6, 16:19, 34:36'),
    person_teacher = qcv(terms='12:15'),
    adult_0 = qcv(terms='1:11, 16:41, 49:56'),
    adult_1 = qcv(terms='12:15, 42:48'),
    AA = qcv(terms=''),
    BB = qcv(terms=''),
    CC = qcv(terms='')
)

After coding (see the YouTube video) the data can be read back in with source. Be sure to first assign the list to an object (e.g., Time1 <- list(...)), as in the sketch below.
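
For instance, the edited foo1.txt might look like the following sketch (hypothetical coding; the spans mirror the Time1 object viewed below, and the object name is the coder's choice):

Time1 <- list(
    AA = qcv(terms = '1'),
    BB = qcv(terms = '1:2, 3:10, 19'),
    CC = qcv(terms = '1:9, 100:150')
)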

Coding Words (Transcript/List approach): Read in the data

## Read it in
source("foo1.txt")

### View it
Time1
$AA
[1] "1"

$BB
[1] "1:2,"  "3:10," "19"   

$CC
[1] "1:9,"    "100:150"

This format is not particularly useful. The data can be reshaped to long format with durations via cm_2long:

Coding Words (Transcript/List approach): Long format

## Long format with durations
datL <- cm_2long(Time1)
datL
  code start end variable
1   AA     0   1    Time1
2   BB     0   2    Time1
3   BB     2  10    Time1
4   BB    18  19    Time1
5   CC     0   9    Time1
6   CC    99 150    Time1

Coding Time Spans [YT]

The Time Span approach utilizes the cm_time.temp and cm_2long functions. To generate the time span template simply supply the list of anticipated codes and a start/end time.

Coding Time Spans: Time Span Template

## Time span template
X <- cm_time.temp(start = ":14", end = "7:40", file="timespans.txt")
X <- cm_time.temp(start = ":14", end = "7:40", file="timespans.doc")
[0]                                14 15 16 ... 51 52 53 54 55 56 57 58 59
[1]0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 51 52 53 54 55 56 57 58 59
[2]0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 51 52 53 54 55 56 57 58 59
[3]0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 51 52 53 54 55 56 57 58 59
[4]0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 51 52 53 54 55 56 57 58 59
[5]0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 51 52 53 54 55 56 57 58 59
[6]0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 51 52 53 54 55 56 57 58 59
[7]0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 51 52 53                                                

Coding Time Spans: List Template 1

### List template
codes <- qcv(AA, BB, CC)
cm_time.temp(codes, file = "codelist.txt")
 list(                                                 
     transcript_time_span = qcv(terms="00:00 - 00:00"),
     AA = qcv(terms=""),                               
     BB = qcv(terms=""),                               
     CC = qcv(terms="")                                
 )  

The list below contains demographic variables. If the researcher has demographic variables it is recommended that they be supplied at this point.

Coding Time Spans: List Template 2

### List template with demographic variables
with(DATA, cm_time.temp(codes, list(person, adult), file = "codelist.txt"))
list(
    transcript_time_span = qcv(terms="00:00 - 00:00"),
    person_sam = qcv(terms=""),
    person_greg = qcv(terms=""),
    person_teacher = qcv(terms=""),
    person_sally = qcv(terms=""),
    person_researcher = qcv(terms=""),
    adult_0 = qcv(terms=""),
    adult_1 = qcv(terms=""),
    AA = qcv(terms=""),
    BB = qcv(terms=""),
    CC = qcv(terms="")
)

After coding (see the YouTube video) the data can be read back in with source. Be sure to first assign the list to an object (e.g., Time1 <- list(...)), as in the sketch below.
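
For instance, the edited codelist.txt might look like the following sketch (hypothetical coding; the spans mirror the Time1 object viewed below):

Time1 <- list(
    transcript_time_span = qcv(terms = "00:00 - 1:12:00"),
    A = qcv(terms = "2.40:3.00, 5.01, 6.52:7.00, 9.00"),
    B = qcv(terms = "2.40, 3.01:3.40, 5.01, 6.52:7.00, 9.00"),
    C = qcv(terms = "2.40:4.00, 5.01, 6.52:7.00, 9.00, 13.00:17.01")
)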

Coding Time Spans: Read in the data

## Read it in
source("codelist.txt")

### View it
Time1
$transcript_time_span
[1] "00:00"   "-"       "1:12:00"

$A
[1] "2.40:3.00," "5.01,"      "6.52:7.00," "9.00"      

$B
[1] "2.40,"      "3.01:3.40," "5.01,"      "6.52:7.00," "9.00"      

$C
[1] "2.40:4.00,"  "5.01,"       "6.52:7.00,"  "9.00,"       "13.00:17.01"

This format is not particularly useful. The data can be reshaped to long format with durations via cm_2long:

Coding Time Spans: Long format

## Long format with durations
datL <- cm_2long(Time1, v.name = "time")
datL
   code start  end    Start      End variable
1     A   159  180 00:02:39 00:03:00    Time1
2     A   300  301 00:05:00 00:05:01    Time1
3     A   411  420 00:06:51 00:07:00    Time1
4     A   539  540 00:08:59 00:09:00    Time1
5     B   159  160 00:02:39 00:02:40    Time1
6     B   180  220 00:03:00 00:03:40    Time1
7     B   300  301 00:05:00 00:05:01    Time1
8     B   411  420 00:06:51 00:07:00    Time1
9     B   539  540 00:08:59 00:09:00    Time1
10    C   159  240 00:02:39 00:04:00    Time1
11    C   300  301 00:05:00 00:05:01    Time1
12    C   411  420 00:06:51 00:07:00    Time1
13    C   539  540 00:08:59 00:09:00    Time1
14    C   779 1021 00:12:59 00:17:01    Time1

Transforming Codes

The researcher may want to determine where codes do and do not overlap with one another. The cm_ family of functions bearing the cm_code. prefix perform various transformative functions (Boolean search). cm_code.combine will merge the spans (time or word) for given codes. cm_code.exclude will provide spans that exclude given codes. cm_code.overlap will yield the spans where all of the given codes co-occur. cm_code.transform is a wrapper for the previous three functions that produces one dataframe in a single call. Lastly, cm_code.blank provides a more flexible framework that allows for the introduction of multiple logical operators between codes. Most tasks can be handled with the cm_code.transform function.

For Examples of each click the links below:
1. cm_code.combine Examples
2. cm_code.exclude Examples
3. cm_code.overlap Examples
4. cm_code.transform Examples
5. cm_code.blank Examples

For the sake of simplicity the uses of these functions will be demonstrated via a gantt plot for a visual comparison of the data sets.

The reader should note that all of the above functions utilize two helper functions (cm_long2dummy and cm_dummy2long) to stretch the spans into single units of measure (word or second), perform a calculation, and then condense back to spans. More advanced needs may require the explicit use of these functions, though they are beyond the scope of this vignette (a brief demonstration opens the cm_code.blank examples below).

The following data sets will be utilized throughout the demonstrations of the cm_code. family of functions:

Common Data Sets - Word Approach

foo <- list(
    AA = qcv(terms="1:10"),
    BB = qcv(terms="1:2, 3:10, 19"),
    CC = qcv(terms="1:3, 5:6")
)

foo2 <- list(
    AA = qcv(terms="4:8"),
    BB = qcv(terms="1:4, 10:12"),
    CC = qcv(terms="1, 11, 15:20"),
    DD = qcv(terms="")
)
## Single time, long word approach
(x <- cm_2long(foo))
##   code start end variable
## 1   AA     0  10      foo
## 2   BB     0   2      foo
## 3   BB     2  10      foo
## 4   BB    18  19      foo
## 5   CC     0   3      foo
## 6   CC     4   6      foo

plot of chunk unnamed-chunk-63

## Repeated measures, long word approach
(z <- cm_2long(foo, foo2, v.name="time"))
##    code start end time
## 1    AA     0  10  foo
## 2    BB     0   2  foo
## 3    BB     2  10  foo
## 4    BB    18  19  foo
## 5    CC     0   3  foo
## 6    CC     4   6  foo
## 7    AA     3   8 foo2
## 8    BB     0   4 foo2
## 9    BB     9  12 foo2
## 10   CC     0   1 foo2
## 11   CC    10  11 foo2
## 12   CC    14  20 foo2

plot of chunk unnamed-chunk-65

Common Data Sets - Time Span Approach

bar1 <- list(
    transcript_time_span = qcv(00:00 - 1:12:00),
    A = qcv(terms = "2.40:3.00, 5.01, 6.02:7.00, 9.00"),
    B = qcv(terms = "2.40, 3.01:3.02, 5.01, 6.02:7.00, 9.00,
        1.12.00:1.19.01"),
    C = qcv(terms = "2.40:3.00, 5.01, 6.02:7.00, 9.00, 16.25:17.01")
)

bar2 <- list(
    transcript_time_span = qcv(00:00 - 1:12:00),
    A = qcv(terms = "2.40:3.00, 5.01, 6.02:7.00, 9.00"),
    B = qcv(terms = "2.40, 3.01:3.02, 5.01, 6.02:7.00, 9.00,
        1.12.00:1.19.01"),
    C = qcv(terms = "2.40:3.00, 5.01, 6.02:7.00, 9.00, 17.01")
)
## Single time, long time approach
(dat <- cm_2long(bar1))
##    code start  end    Start      End variable
## 1     A   159  180 00:02:39 00:03:00     bar1
## 2     A   300  301 00:05:00 00:05:01     bar1
## 3     A   361  420 00:06:01 00:07:00     bar1
## 4     A   539  540 00:08:59 00:09:00     bar1
## 5     B   159  160 00:02:39 00:02:40     bar1
## 6     B   180  182 00:03:00 00:03:02     bar1
## 7     B   300  301 00:05:00 00:05:01     bar1
## 8     B   361  420 00:06:01 00:07:00     bar1
## 9     B   539  540 00:08:59 00:09:00     bar1
## 10    B  4319 4741 01:11:59 01:19:01     bar1
## 11    C   159  180 00:02:39 00:03:00     bar1
## 12    C   300  301 00:05:00 00:05:01     bar1
## 13    C   361  420 00:06:01 00:07:00     bar1
## 14    C   539  540 00:08:59 00:09:00     bar1
## 15    C   984 1021 00:16:24 00:17:01     bar1

plot of chunk unnamed-chunk-68

## Repeated measures, long time approach
(dats <- cm_2long(bar1, bar2, v.name = "time"))
##    code start  end    Start      End time
## 1     A   159  180 00:02:39 00:03:00 bar1
## 2     A   300  301 00:05:00 00:05:01 bar1
## 3     A   361  420 00:06:01 00:07:00 bar1
## 4     A   539  540 00:08:59 00:09:00 bar1
## 5     B   159  160 00:02:39 00:02:40 bar1
## 6     B   180  182 00:03:00 00:03:02 bar1
## 7     B   300  301 00:05:00 00:05:01 bar1
## 8     B   361  420 00:06:01 00:07:00 bar1
## 9     B   539  540 00:08:59 00:09:00 bar1
## 10    B  4319 4741 01:11:59 01:19:01 bar1
## 11    C   159  180 00:02:39 00:03:00 bar1
## 12    C   300  301 00:05:00 00:05:01 bar1
## 13    C   361  420 00:06:01 00:07:00 bar1
## 14    C   539  540 00:08:59 00:09:00 bar1
## 15    C   984 1021 00:16:24 00:17:01 bar1
## 16    A   159  180 00:02:39 00:03:00 bar2
## 17    A   300  301 00:05:00 00:05:01 bar2
## 18    A   361  420 00:06:01 00:07:00 bar2
## 19    A   539  540 00:08:59 00:09:00 bar2
## 20    B   159  160 00:02:39 00:02:40 bar2
## 21    B   180  182 00:03:00 00:03:02 bar2
## 22    B   300  301 00:05:00 00:05:01 bar2
## 23    B   361  420 00:06:01 00:07:00 bar2
## 24    B   539  540 00:08:59 00:09:00 bar2
## 25    B  4319 4741 01:11:59 01:19:01 bar2
## 26    C   159  180 00:02:39 00:03:00 bar2
## 27    C   300  301 00:05:00 00:05:01 bar2
## 28    C   361  420 00:06:01 00:07:00 bar2
## 29    C   539  540 00:08:59 00:09:00 bar2
## 30    C  1020 1021 00:17:00 00:17:01 bar2

plot of chunk unnamed-chunk-70

cm_code.combine Examples

cm_code.combine provides all the spans (time/words) that are occupied by one or more of the combined codes. For example, if we utilized cm_code.combine on code lists X and Y the result would be any span where X or Y is located. This is the OR of the Boolean search. Note that combine.code.list must be supplied as a list of named character vectors.

cm_code.combine Single Time Word Example

(cc1 <- cm_code.combine(x, list(ALL=qcv(AA, BB, CC))))
##   code start end
## 1   AA     0  10
## 2   BB     0  10
## 3   BB    18  19
## 4   CC     0   3
## 5   CC     4   6
## 6  ALL     0  10
## 7  ALL    18  19

plot of chunk unnamed-chunk-72

cm_code.combine Repeated Measures Word Example

combines <- list(AB=qcv(AA, BB), ABC=qcv(AA, BB, CC))
(cc2 <- cm_code.combine(z, combines, rm.var = "time"))
##    code start end time
## 1    AA     0  10  foo
## 2    BB     0  10  foo
## 3    BB    18  19  foo
## 4    CC     0   3  foo
## 5    CC     4   6  foo
## 6    AB     0  10  foo
## 7    AB    18  19  foo
## 8   ABC     0  10  foo
## 9   ABC    18  19  foo
## 10   AA     3   8 foo2
## 11   BB     0   4 foo2
## 12   BB     9  12 foo2
## 13   CC     0   1 foo2
## 14   CC    10  11 foo2
## 15   CC    14  20 foo2
## 16   AB     0   8 foo2
## 17   AB     9  12 foo2
## 18  ABC     0   8 foo2
## 19  ABC     9  12 foo2
## 20  ABC    14  20 foo2

plot of chunk unnamed-chunk-74

cm_code.combine Single Time Time Span Example

combines2 <- list(AB=qcv(A, B), BC=qcv(B, C), ABC=qcv(A, B, C))
(cc3 <- cm_code.combine(dat, combines2))
##    code start  end    Start      End
## 1     A   159  180 00:02:39 00:03:00
## 2     A   300  301 00:05:00 00:05:01
## 3     A   361  420 00:06:01 00:07:00
## 4     A   539  540 00:08:59 00:09:00
## 5     B   159  160 00:02:39 00:02:40
## 6     B   180  182 00:03:00 00:03:02
## 7     B   300  301 00:05:00 00:05:01
## 8     B   361  420 00:06:01 00:07:00
## 9     B   539  540 00:08:59 00:09:00
## 10    B  4319 4741 01:11:59 01:19:01
## 11    C   159  180 00:02:39 00:03:00
## 12    C   300  301 00:05:00 00:05:01
## 13    C   361  420 00:06:01 00:07:00
## 14    C   539  540 00:08:59 00:09:00
## 15    C   984 1021 00:16:24 00:17:01
## 16   AB   159  182 00:02:39 00:03:02
## 17   AB   300  301 00:05:00 00:05:01
## 18   AB   361  420 00:06:01 00:07:00
## 19   AB   539  540 00:08:59 00:09:00
## 20   AB  4319 4741 01:11:59 01:19:01
## 21   BC   159  182 00:02:39 00:03:02
## 22   BC   300  301 00:05:00 00:05:01
## 23   BC   361  420 00:06:01 00:07:00
## 24   BC   539  540 00:08:59 00:09:00
## 25   BC   984 1021 00:16:24 00:17:01
## 26   BC  4319 4741 01:11:59 01:19:01
## 27  ABC   159  182 00:02:39 00:03:02
## 28  ABC   300  301 00:05:00 00:05:01
## 29  ABC   361  420 00:06:01 00:07:00
## 30  ABC   539  540 00:08:59 00:09:00
## 31  ABC   984 1021 00:16:24 00:17:01
## 32  ABC  4319 4741 01:11:59 01:19:01

plot of chunk unnamed-chunk-76

cm_code.exclude Examples

cm_code.exclude provides all the spans (time/words) that are occupied by one or more of the combined codes with the exclusion of another code. For example, if we utilized cm_code.exclude on code lists X and Y the result would be any span where X is located but Y is not. This is the NOT of the Boolean search. The last term supplied to exclude.code.list is the excluded term; all other terms are combined and the final code term is partitioned out. Note that exclude.code.list must be supplied as a list of named character vectors.

cm_code.exclude Single Time Word Example

(ce1 <- cm_code.exclude(x, list(BnoC=qcv(BB, CC))))
##   code start end
## 1   AA     0  10
## 2   BB     0  10
## 3   BB    18  19
## 4   CC     0   3
## 5   CC     4   6
## 6 BnoC     3   4
## 7 BnoC     6  10
## 8 BnoC    18  19

plot of chunk unnamed-chunk-78

cm_code.exclude Repeated Measures Word Example

exlist <- list(AnoB=qcv(AA, BB), ABnoC=qcv(AA, BB, CC))
(ce2 <- cm_code.exclude(z, exlist, rm.var = "time"))
##     code start end time
## 1     AA     0  10  foo
## 2     BB     0  10  foo
## 3     BB    18  19  foo
## 4     CC     0   3  foo
## 5     CC     4   6  foo
## 6  ABnoC     3   4  foo
## 7  ABnoC     6  10  foo
## 8  ABnoC    18  19  foo
## 9     AA     3   8 foo2
## 10    BB     0   4 foo2
## 11    BB     9  12 foo2
## 12    CC     0   1 foo2
## 13    CC    10  11 foo2
## 14    CC    14  20 foo2
## 15  AnoB     4   8 foo2
## 16 ABnoC     1   8 foo2
## 17 ABnoC     9  10 foo2
## 18 ABnoC    11  12 foo2

plot of chunk unnamed-chunk-80

cm_code.exclude Repeated Measures Time Span Example

exlist2 <- list(AnoB=qcv(A, B), BnoC=qcv(B, C), ABnoC=qcv(A, B, C))
(ce3 <- cm_code.exclude(dats, exlist2, "time"))
##     code start  end    Start      End time
## 1      A   159  180 00:02:39 00:03:00 bar1
## 2      A   300  301 00:05:00 00:05:01 bar1
## 3      A   361  420 00:06:01 00:07:00 bar1
## 4      A   539  540 00:08:59 00:09:00 bar1
## 5      B   159  160 00:02:39 00:02:40 bar1
## 6      B   180  182 00:03:00 00:03:02 bar1
## 7      B   300  301 00:05:00 00:05:01 bar1
## 8      B   361  420 00:06:01 00:07:00 bar1
## 9      B   539  540 00:08:59 00:09:00 bar1
## 10     B  4319 4741 01:11:59 01:19:01 bar1
## 11     C   159  180 00:02:39 00:03:00 bar1
## 12     C   300  301 00:05:00 00:05:01 bar1
## 13     C   361  420 00:06:01 00:07:00 bar1
## 14     C   539  540 00:08:59 00:09:00 bar1
## 15     C   984 1021 00:16:24 00:17:01 bar1
## 16  AnoB   160  180 00:02:40 00:03:00 bar1
## 17  BnoC   180  182 00:03:00 00:03:02 bar1
## 18  BnoC  4319 4741 01:11:59 01:19:01 bar1
## 19 ABnoC   180  182 00:03:00 00:03:02 bar1
## 20 ABnoC  4319 4741 01:11:59 01:19:01 bar1
## 21     A   159  180 00:02:39 00:03:00 bar2
## 22     A   300  301 00:05:00 00:05:01 bar2
## 23     A   361  420 00:06:01 00:07:00 bar2
## 24     A   539  540 00:08:59 00:09:00 bar2
## 25     B   159  160 00:02:39 00:02:40 bar2
## 26     B   180  182 00:03:00 00:03:02 bar2
## 27     B   300  301 00:05:00 00:05:01 bar2
## 28     B   361  420 00:06:01 00:07:00 bar2
## 29     B   539  540 00:08:59 00:09:00 bar2
## 30     B  4319 4741 01:11:59 01:19:01 bar2
## 31     C   159  180 00:02:39 00:03:00 bar2
## 32     C   300  301 00:05:00 00:05:01 bar2
## 33     C   361  420 00:06:01 00:07:00 bar2
## 34     C   539  540 00:08:59 00:09:00 bar2
## 35     C  1020 1021 00:17:00 00:17:01 bar2
## 36  AnoB   160  180 00:02:40 00:03:00 bar2
## 37  BnoC   180  182 00:03:00 00:03:02 bar2
## 38  BnoC  4319 4741 01:11:59 01:19:01 bar2
## 39 ABnoC   180  182 00:03:00 00:03:02 bar2
## 40 ABnoC  4319 4741 01:11:59 01:19:01 bar2

plot of chunk unnamed-chunk-82

cm_code.exclude Single Time Time Span Combined Exclude Example

(ce4.1 <- cm_code.combine(dat, list(AB = qcv(A, B))))
##    code start  end    Start      End
## 1     A   159  180 00:02:39 00:03:00
## 2     A   300  301 00:05:00 00:05:01
## 3     A   361  420 00:06:01 00:07:00
## 4     A   539  540 00:08:59 00:09:00
## 5     B   159  160 00:02:39 00:02:40
## 6     B   180  182 00:03:00 00:03:02
## 7     B   300  301 00:05:00 00:05:01
## 8     B   361  420 00:06:01 00:07:00
## 9     B   539  540 00:08:59 00:09:00
## 10    B  4319 4741 01:11:59 01:19:01
## 11    C   159  180 00:02:39 00:03:00
## 12    C   300  301 00:05:00 00:05:01
## 13    C   361  420 00:06:01 00:07:00
## 14    C   539  540 00:08:59 00:09:00
## 15    C   984 1021 00:16:24 00:17:01
## 16   AB   159  182 00:02:39 00:03:02
## 17   AB   300  301 00:05:00 00:05:01
## 18   AB   361  420 00:06:01 00:07:00
## 19   AB   539  540 00:08:59 00:09:00
## 20   AB  4319 4741 01:11:59 01:19:01
(ce4.2 <- cm_code.exclude(ce4.1, list(CnoAB = qcv(C, AB))))
##     code start  end    Start      End
## 1      A   159  180 00:02:39 00:03:00
## 2      A   300  301 00:05:00 00:05:01
## 3      A   361  420 00:06:01 00:07:00
## 4      A   539  540 00:08:59 00:09:00
## 5      B   159  160 00:02:39 00:02:40
## 6      B   180  182 00:03:00 00:03:02
## 7      B   300  301 00:05:00 00:05:01
## 8      B   361  420 00:06:01 00:07:00
## 9      B   539  540 00:08:59 00:09:00
## 10     B  4319 4741 01:11:59 01:19:01
## 11     C   159  180 00:02:39 00:03:00
## 12     C   300  301 00:05:00 00:05:01
## 13     C   361  420 00:06:01 00:07:00
## 14     C   539  540 00:08:59 00:09:00
## 15     C   984 1021 00:16:24 00:17:01
## 16    AB   159  182 00:02:39 00:03:02
## 17    AB   300  301 00:05:00 00:05:01
## 18    AB   361  420 00:06:01 00:07:00
## 19    AB   539  540 00:08:59 00:09:00
## 20    AB  4319 4741 01:11:59 01:19:01
## 21 CnoAB   984 1021 00:16:24 00:17:01

plot of chunk unnamed-chunk-84

cm_code.overlap Examples

cm_code.overlap provides all the spans (time/words) that are occupied by all of the given codes. For example, if we utilized cm_code.overlap on code lists X and Y the result would be any span where X and Y are both located. This is the AND of the Boolean search. Note that overlap.code.list must be supplied as a list of named character vectors.

cm_code.overlap Single Time Word Example

(co1 <- cm_code.overlap(x, list(BC=qcv(BB, CC))))
##   code start end
## 1   AA     0  10
## 2   BB     0  10
## 3   BB    18  19
## 4   CC     0   3
## 5   CC     4   6
## 6   BC     0   3
## 7   BC     4   6

plot of chunk unnamed-chunk-86

cm_code.overlap Repeated Measures Word Example

overlist <- list(AB=qcv(AA, BB), ABC=qcv(AA, BB, CC))
(co2 <- cm_code.overlap(z, overlist, rm.var = "time"))
##    code start end time
## 1    AA     0  10  foo
## 2    BB     0  10  foo
## 3    BB    18  19  foo
## 4    CC     0   3  foo
## 5    CC     4   6  foo
## 6    AB     0  10  foo
## 7   ABC     0   3  foo
## 8   ABC     4   6  foo
## 9    AA     3   8 foo2
## 10   BB     0   4 foo2
## 11   BB     9  12 foo2
## 12   CC     0   1 foo2
## 13   CC    10  11 foo2
## 14   CC    14  20 foo2
## 15   AB     3   4 foo2

plot of chunk unnamed-chunk-88

cm_code.overlap Repeated Measures Time Span Example

overlist2 <- list(AB=qcv(A, B), BC=qcv(B, C), ABC=qcv(A, B, C))
(co3 <- cm_code.overlap(dats, overlist2, "time"))
##    code start  end    Start      End time
## 1     A   159  180 00:02:39 00:03:00 bar1
## 2     A   300  301 00:05:00 00:05:01 bar1
## 3     A   361  420 00:06:01 00:07:00 bar1
## 4     A   539  540 00:08:59 00:09:00 bar1
## 5     B   159  160 00:02:39 00:02:40 bar1
## 6     B   180  182 00:03:00 00:03:02 bar1
## 7     B   300  301 00:05:00 00:05:01 bar1
## 8     B   361  420 00:06:01 00:07:00 bar1
## 9     B   539  540 00:08:59 00:09:00 bar1
## 10    B  4319 4741 01:11:59 01:19:01 bar1
## 11    C   159  180 00:02:39 00:03:00 bar1
## 12    C   300  301 00:05:00 00:05:01 bar1
## 13    C   361  420 00:06:01 00:07:00 bar1
## 14    C   539  540 00:08:59 00:09:00 bar1
## 15    C   984 1021 00:16:24 00:17:01 bar1
## 16   AB   159  160 00:02:39 00:02:40 bar1
## 17   AB   300  301 00:05:00 00:05:01 bar1
## 18   AB   361  420 00:06:01 00:07:00 bar1
## 19   AB   539  540 00:08:59 00:09:00 bar1
## 20   BC   159  160 00:02:39 00:02:40 bar1
## 21   BC   300  301 00:05:00 00:05:01 bar1
## 22   BC   361  420 00:06:01 00:07:00 bar1
## 23   BC   539  540 00:08:59 00:09:00 bar1
## 24  ABC   159  160 00:02:39 00:02:40 bar1
## 25  ABC   300  301 00:05:00 00:05:01 bar1
## 26  ABC   361  420 00:06:01 00:07:00 bar1
## 27  ABC   539  540 00:08:59 00:09:00 bar1
## 28    A   159  180 00:02:39 00:03:00 bar2
## 29    A   300  301 00:05:00 00:05:01 bar2
## 30    A   361  420 00:06:01 00:07:00 bar2
## 31    A   539  540 00:08:59 00:09:00 bar2
## 32    B   159  160 00:02:39 00:02:40 bar2
## 33    B   180  182 00:03:00 00:03:02 bar2
## 34    B   300  301 00:05:00 00:05:01 bar2
## 35    B   361  420 00:06:01 00:07:00 bar2
## 36    B   539  540 00:08:59 00:09:00 bar2
## 37    B  4319 4741 01:11:59 01:19:01 bar2
## 38    C   159  180 00:02:39 00:03:00 bar2
## 39    C   300  301 00:05:00 00:05:01 bar2
## 40    C   361  420 00:06:01 00:07:00 bar2
## 41    C   539  540 00:08:59 00:09:00 bar2
## 42    C  1020 1021 00:17:00 00:17:01 bar2
## 43   AB   159  160 00:02:39 00:02:40 bar2
## 44   AB   300  301 00:05:00 00:05:01 bar2
## 45   AB   361  420 00:06:01 00:07:00 bar2
## 46   AB   539  540 00:08:59 00:09:00 bar2
## 47   BC   159  160 00:02:39 00:02:40 bar2
## 48   BC   300  301 00:05:00 00:05:01 bar2
## 49   BC   361  420 00:06:01 00:07:00 bar2
## 50   BC   539  540 00:08:59 00:09:00 bar2
## 51  ABC   159  160 00:02:39 00:02:40 bar2
## 52  ABC   300  301 00:05:00 00:05:01 bar2
## 53  ABC   361  420 00:06:01 00:07:00 bar2
## 54  ABC   539  540 00:08:59 00:09:00 bar2

plot of chunk unnamed-chunk-90

cm_code.transform Examples

cm_code.transform is merely a wrapper for cm_code.combine, cm_code.exclude, and cm_code.overlap.

cm_code.transform - Example 1

ct1 <- cm_code.transform(x, 
    overlap.code.list = list(oABC=qcv(AA, BB, CC)),
    combine.code.list = list(ABC=qcv(AA, BB, CC)), 
    exclude.code.list = list(ABnoC=qcv(AA, BB, CC))
)
ct1
##     code start end
## 1     AA     0  10
## 2     BB     0  10
## 3     BB    18  19
## 4     CC     0   3
## 5     CC     4   6
## 6   oABC     0   3
## 7   oABC     4   6
## 8    ABC     0  10
## 9    ABC    18  19
## 10 ABnoC     3   4
## 11 ABnoC     6  10
## 12 ABnoC    18  19

plot of chunk unnamed-chunk-92

cm_code.transform - Example 2

ct2 <- cm_code.transform(z, 
    overlap.code.list = list(oABC=qcv(AA, BB, CC)),
    combine.code.list = list(ABC=qcv(AA, BB, CC)), 
    exclude.code.list = list(ABnoC=qcv(AA, BB, CC)), "time"
)
ct2
##     code start end time
## 1     AA     0  10  foo
## 2     BB     0  10  foo
## 3     BB    18  19  foo
## 4     CC     0   3  foo
## 5     CC     4   6  foo
## 6   oABC     0   3  foo
## 7   oABC     4   6  foo
## 14   ABC     0  10  foo
## 15   ABC    18  19  foo
## 19 ABnoC     3   4  foo
## 20 ABnoC     6  10  foo
## 21 ABnoC    18  19  foo
## 8     AA     3   8 foo2
## 9     BB     0   4 foo2
## 10    BB     9  12 foo2
## 11    CC     0   1 foo2
## 12    CC    10  11 foo2
## 13    CC    14  20 foo2
## 16   ABC     0   8 foo2
## 17   ABC     9  12 foo2
## 18   ABC    14  20 foo2
## 22 ABnoC     1   8 foo2
## 23 ABnoC     9  10 foo2
## 24 ABnoC    11  12 foo2

plot of chunk unnamed-chunk-94

cm_code.transform - Example 3

ct3 <- cm_code.transform(dat, 
    overlap.code.list = list(oABC=qcv(A, B, C)),
    combine.code.list = list(ABC=qcv(A, B, C)), 
    exclude.code.list = list(ABnoC=qcv(A, B, C))
)
ct3
##     code start  end    Start      End
## 1      A   159  180 00:02:39 00:03:00
## 2      A   300  301 00:05:00 00:05:01
## 3      A   361  420 00:06:01 00:07:00
## 4      A   539  540 00:08:59 00:09:00
## 5      B   159  160 00:02:39 00:02:40
## 6      B   180  182 00:03:00 00:03:02
## 7      B   300  301 00:05:00 00:05:01
## 8      B   361  420 00:06:01 00:07:00
## 9      B   539  540 00:08:59 00:09:00
## 10     B  4319 4741 01:11:59 01:19:01
## 11     C   159  180 00:02:39 00:03:00
## 12     C   300  301 00:05:00 00:05:01
## 13     C   361  420 00:06:01 00:07:00
## 14     C   539  540 00:08:59 00:09:00
## 15     C   984 1021 00:16:24 00:17:01
## 16  oABC   159  160 00:02:39 00:02:40
## 17  oABC   300  301 00:05:00 00:05:01
## 18  oABC   361  420 00:06:01 00:07:00
## 19  oABC   539  540 00:08:59 00:09:00
## 20   ABC   159  182 00:02:39 00:03:02
## 21   ABC   300  301 00:05:00 00:05:01
## 22   ABC   361  420 00:06:01 00:07:00
## 23   ABC   539  540 00:08:59 00:09:00
## 24   ABC   984 1021 00:16:24 00:17:01
## 25   ABC  4319 4741 01:11:59 01:19:01
## 26 ABnoC   180  182 00:03:00 00:03:02
## 27 ABnoC  4319 4741 01:11:59 01:19:01

plot of chunk unnamed-chunk-96

cm_code.blank Examples

cm_code.blank provides flexible Boolean comparisons between word/time spans. The overlap argument takes a logical value, an integer, or a character string consisting of a binary operator coupled with an integer. It is important to understand how the function operates. The initial step calls cm_long2dummy (stretching the spans to dummy coded columns, as seen below), the comparison is conducted between columns, and then the columns are reverted back to spans via cm_dummy2long. This first example illustrates the stretching to dummy and reverting back to spans.

Long to dummy and dummy to long

long2dummy <- cm_long2dummy(x, "variable")
list(original = x,
    long_2_dummy_format = long2dummy[[1]],
    dummy_back_2_long = cm_dummy2long(long2dummy, "variable")
)
$original
  code start end variable
1   AA     0  10      foo
2   BB     0   2      foo
3   BB     2  10      foo
4   BB    18  19      foo
5   CC     0   3      foo
6   CC     4   6      foo

$long_2_dummy_format
   AA BB CC
0   1  1  1
1   1  1  1
2   1  1  1
3   1  1  0
4   1  1  1
5   1  1  1
6   1  1  0
7   1  1  0
8   1  1  0
9   1  1  0
10  0  0  0
11  0  0  0
12  0  0  0
13  0  0  0
14  0  0  0
15  0  0  0
16  0  0  0
17  0  0  0
18  0  1  0
19  0  0  0

$dummy_back_2_long
  code start end variable
1   AA     0  10      foo
2   BB     0  10      foo
3   BB    18  19      foo
4   CC     0   3      foo
5   CC     4   6      foo

Now let's examine a few uses of cm_code.blank. The first is to set overlap = TRUE (the default behavior). This default behavior is identical to cm_code.overlap as seen below.

cm_code.blank - overlap = TRUE

(cb1 <- cm_code.blank(x, list(ABC=qcv(AA, BB, CC))))
##   code start end
## 1   AA     0  10
## 2   BB     0  10
## 3   BB    18  19
## 4   CC     0   3
## 5   CC     4   6
## 6  ABC     0   3
## 7  ABC     4   6

plot of chunk unnamed-chunk-99

Next we'll set overlap = FALSE and see that it is identical to cm_code.combine.

cm_code.blank - overlap = FALSE

(cb2 <- cm_code.blank(x, list(ABC=qcv(AA, BB, CC)), overlap = FALSE))
##   code start end
## 1   AA     0  10
## 2   BB     0  10
## 3   BB    18  19
## 4   CC     0   3
## 5   CC     4   6
## 6  ABC     0  10
## 7  ABC    18  19

plot of chunk unnamed-chunk-101

By first combining all codes (see cb2 above) and then excluding the final code by setting overlap = 1, the behavior of cm_code.exclude can be mimicked.

cm_code.blank - mimicking cm_code.exclude

## Using the output from `cb2` above.
(cb3 <- cm_code.blank(cb2, list(ABnoC=qcv(ABC, CC)), overlap = 1))
##     code start end
## 1     AA     0  10
## 2     BB     0  10
## 3     BB    18  19
## 4     CC     0   3
## 5     CC     4   6
## 6    ABC     0  10
## 7    ABC    18  19
## 8  ABnoC     3   4
## 9  ABnoC     6  10
## 10 ABnoC    18  19

plot of chunk unnamed-chunk-103

Next we shall find when at least two codes overlap by setting overlap = ">1".

cm_code.blank - At least 2 codes overlap

blanklist <- list(AB=qcv(AA, BB), ABC=qcv(AA, BB, CC))
(cb4 <- cm_code.blank(z, blanklist, rm.var = "time", overlap = ">1"))
##    code start end time
## 1    AA     0  10  foo
## 2    BB     0  10  foo
## 3    BB    18  19  foo
## 4    CC     0   3  foo
## 5    CC     4   6  foo
## 6    AB     0  10  foo
## 7   ABC     0  10  foo
## 8    AA     3   8 foo2
## 9    BB     0   4 foo2
## 10   BB     9  12 foo2
## 11   CC     0   1 foo2
## 12   CC    10  11 foo2
## 13   CC    14  20 foo2
## 14   AB     3   4 foo2
## 15  ABC     0   1 foo2
## 16  ABC     3   4 foo2
## 17  ABC    10  11 foo2

plot of chunk unnamed-chunk-105

Last, we will find spans where none of the codes occur by setting overlap = "==0".

cm_code.blank - Spans where no code occurs

blanklist2 <- list(noAB=qcv(AA, BB), noABC=qcv(AA, BB, CC))
(cb5 <- cm_code.blank(z, blanklist2, rm.var = "time", overlap = "==0"))
##     code start end time
## 1     AA     0  10  foo
## 2     BB     0  10  foo
## 3     BB    18  19  foo
## 4     CC     0   3  foo
## 5     CC     4   6  foo
## 6   noAB    10  18  foo
## 7   noAB    19  20  foo
## 8  noABC    10  18  foo
## 9  noABC    19  20  foo
## 10    AA     3   8 foo2
## 11    BB     0   4 foo2
## 12    BB     9  12 foo2
## 13    CC     0   1 foo2
## 14    CC    10  11 foo2
## 15    CC    14  20 foo2
## 16  noAB     8   9 foo2
## 17  noAB    12  21 foo2
## 18 noABC     8   9 foo2
## 19 noABC    12  14 foo2
## 20 noABC    20  21 foo2

plot of chunk unnamed-chunk-107

Initial Coding Analysis

The cm_ family of functions has three approaches to initial analysis of codes. The researcher may want to summarize, visualize or determine the proximity of codes to one another. The following functions accomplish these tasks:

  1. Summary
  2. Plotting
  3. Distance Measures

Summary

Most of the cm_ family of functions have a summary method that allows for summaries of codes by group. Note that these summaries can be wrapped with plot to produce a heat map of the table of summaries.

Example 1: Summarizing Transcript/List Approach

## Two transcript lists
A <- list(
    person_greg = qcv(terms='7:11, 20:24, 30:33, 49:56'),
    person_researcher = qcv(terms='42:48'),
    person_sally = qcv(terms='25:29, 37:41'),
    person_sam = qcv(terms='1:6, 16:19, 34:36'),
    person_teacher = qcv(terms='12:15'),
    adult_0 = qcv(terms='1:11, 16:41, 49:56'),
    adult_1 = qcv(terms='12:15, 42:48'),
    AA = qcv(terms="1"),
    BB = qcv(terms="1:2, 3:10, 19"),
    CC = qcv(terms="1:9, 100:150")
)

B  <- list(
    person_greg = qcv(terms='7:11, 20:24, 30:33, 49:56'),
    person_researcher = qcv(terms='42:48'),
    person_sally = qcv(terms='25:29, 37:41'),
    person_sam = qcv(terms='1:6, 16:19, 34:36'),
    person_teacher = qcv(terms='12:15'),
    adult_0 = qcv(terms='1:11, 16:41, 49:56'),
    adult_1 = qcv(terms='12:15, 42:48'),
    AA = qcv(terms="40"),
    BB = qcv(terms="50:90"),
    CC = qcv(terms="60:90, 100:120, 150"),
    DD = qcv(terms="")
)

## Long format for transcript/list approach
v <- cm_2long(A, B, v.name = "time")
head(v)
##                code start end time
## 1       person_greg     6  11    A
## 2       person_greg    19  24    A
## 3       person_greg    29  33    A
## 4       person_greg    48  56    A
## 5 person_researcher    41  48    A
## 6      person_sally    24  29    A
## Summary of the data and plotting the summary
summary(v)
time              code total percent_total n percent_n  ave min max   mean(sd)
1  a       person_greg    22         12.0% 4     18.2%  5.5   4   8   5.5(1.7)
2  a person_researcher     7          3.8% 1      4.5%  7.0   7   7     7.0(0)
3  a      person_sally    10          5.4% 2      9.1%  5.0   5   5     5.0(0)
4  a        person_sam    13          7.1% 3     13.6%  4.3   3   6   4.3(1.5)
5  a    person_teacher     4          2.2% 1      4.5%  4.0   4   4     4.0(0)
6  a           adult_0    45         24.5% 3     13.6% 15.0   8  26  15.0(9.6)
7  a           adult_1    11          6.0% 2      9.1%  5.5   4   7   5.5(2.1)
8  a                AA     1           .5% 1      4.5%  1.0   1   1     1.0(0)
9  a                BB    11          6.0% 3     13.6%  3.7   1   8   3.7(3.8)
10 a                CC    60         32.6% 2      9.1% 30.0   9  51 30.0(29.7)
11 b       person_greg    22         10.6% 4     19.0%  5.5   4   8   5.5(1.7)
12 b person_researcher     7          3.4% 1      4.8%  7.0   7   7     7.0(0)
13 b      person_sally    10          4.8% 2      9.5%  5.0   5   5     5.0(0)
14 b        person_sam    13          6.3% 3     14.3%  4.3   3   6   4.3(1.5)
15 b    person_teacher     4          1.9% 1      4.8%  4.0   4   4     4.0(0)
16 b           adult_0    45         21.7% 3     14.3% 15.0   8  26  15.0(9.6)
17 b           adult_1    11          5.3% 2      9.5%  5.5   4   7   5.5(2.1)
18 b                AA     1           .5% 1      4.8%  1.0   1   1     1.0(0)
19 b                BB    41         19.8% 1      4.8% 41.0  41  41    41.0(0)
20 b                CC    53         25.6% 3     14.3% 17.7   1  31 17.7(15.3)
============================
Unit of measure: words
plot(summary(v))

[Plot: heat map of summary(v)]

plot(summary(v), facet.vars = "time")

[Plot: heat map of summary(v), faceted by time]

Example 2: Summarizing Time Spans Approach

## Single time list
x <- list(
    transcript_time_span = qcv(00:00 - 1:12:00),
    A = qcv(terms = "2.40:3.00, 5.01, 6.02:7.00, 9.00"),
    B = qcv(terms = "2.40, 3.01:3.02, 5.01, 6.02:7.00,
        9.00, 1.12.00:1.19.01"),
    C = qcv(terms = "2.40:3.00, 5.01, 6.02:7.00, 9.00, 17.01")
)

## Long format for time span approach
z <- cm_2long(x)
head(z)
##   code start end    Start      End variable
## 1    A   159 180 00:02:39 00:03:00        x
## 2    A   300 301 00:05:00 00:05:01        x
## 3    A   361 420 00:06:01 00:07:00        x
## 4    A   539 540 00:08:59 00:09:00        x
## 5    B   159 160 00:02:39 00:02:40        x
## 6    B   180 182 00:03:00 00:03:02        x
## Summary of the data and plotting the summary
summary(z)
  code total percent_total n percent_n  ave min max    mean(sd)
1    A 01:22         12.6% 4     26.7% 20.5   1  59  20.5(27.3)
2    B 08:06         74.7% 6     40.0% 81.0   1 422 81.0(168.6)
3    C 01:23         12.7% 5     33.3% 16.6   1  59  16.6(25.2)
============================
Unit of measure: time
Columns measured in seconds unless in the form hh:mm:ss
plot(summary(z))

[Plot: heat map of summary(z)]

Troubleshooting Summary: Suppress Measurement Units

## suppress printing measurement units
suppressMessages(print(summary(z)))
  code total percent_total n percent_n  ave min max    mean(sd)
1    A 01:22         12.6% 4     26.7% 20.5   1  59  20.5(27.3)
2    B 08:06         74.7% 6     40.0% 81.0   1 422 81.0(168.6)
3    C 01:23         12.7% 5     33.3% 16.6   1  59  16.6(25.2)

Troubleshooting Summary: Print as Dataframe

## remove print method
class(z) <- "data.frame"
z
##    code start  end    Start      End variable
## 1     A   159  180 00:02:39 00:03:00        x
## 2     A   300  301 00:05:00 00:05:01        x
## 3     A   361  420 00:06:01 00:07:00        x
## 4     A   539  540 00:08:59 00:09:00        x
## 5     B   159  160 00:02:39 00:02:40        x
## 6     B   180  182 00:03:00 00:03:02        x
## 7     B   300  301 00:05:00 00:05:01        x
## 8     B   361  420 00:06:01 00:07:00        x
## 9     B   539  540 00:08:59 00:09:00        x
## 10    B  4319 4741 01:11:59 01:19:01        x
## 11    C   159  180 00:02:39 00:03:00        x
## 12    C   300  301 00:05:00 00:05:01        x
## 13    C   361  420 00:06:01 00:07:00        x
## 14    C   539  540 00:08:59 00:09:00        x
## 15    C  1020 1021 00:17:00 00:17:01        x

Plotting

Like summary, most of the cm_ family of functions also have a plot method that produces a Gantt plot visualization of codes by group.

Gantt Plot of Transcript/List or Time Spans Data

## Two transcript lists
A <- list(
    person_greg = qcv(terms='7:11, 20:24, 30:33, 49:56'),
    person_researcher = qcv(terms='42:48'),
    person_sally = qcv(terms='25:29, 37:41'),
    person_sam = qcv(terms='1:6, 16:19, 34:36'),
    person_teacher = qcv(terms='12:15'),
    adult_0 = qcv(terms='1:11, 16:41, 49:56'),
    adult_1 = qcv(terms='12:15, 42:48'),
    AA = qcv(terms="1"),
    BB = qcv(terms="1:2, 3:10, 19"),
    CC = qcv(terms="1:9, 100:150")
)

B  <- list(
    person_greg = qcv(terms='7:11, 20:24, 30:33, 49:56'),
    person_researcher = qcv(terms='42:48'),
    person_sally = qcv(terms='25:29, 37:41'),
    person_sam = qcv(terms='1:6, 16:19, 34:36'),
    person_teacher = qcv(terms='12:15'),
    adult_0 = qcv(terms='1:11, 16:41, 49:56'),
    adult_1 = qcv(terms='12:15, 42:48'),
    AA = qcv(terms="40"),
    BB = qcv(terms="50:90"),
    CC = qcv(terms="60:90, 100:120, 150"),
    DD = qcv(terms="")
)

## Long format
x <- cm_2long(A, v.name = "time")
y <- cm_2long(A, B, v.name = "time")

## cm_code family
combs <- list(sam_n_sally = qcv(person_sam, person_sally))
z <- cm_code.combine(y, combs, "time")
plot(x, title = "Single")

[Plot: Gantt plot, "Single"]

plot(y, title = "Repeated Measure")

[Plot: Gantt plot, "Repeated Measure"]

plot(z, title = "Combined Codes")

[Plot: Gantt plot, "Combined Codes"]

Distance Measures

Often a researcher will want to know which codes cluster closer to other codes (regardless of whether the codes represent word or time spans). cm_distance allows the researcher to find the distances between codes and standardizes the mean of the differences to allow for comparisons similar to a correlation. The matrix output from cm_distance is arrived at by taking the means and standard deviations of the differences between codes, scaling them (without centering), and then multiplying the two together. This results in a standardized distance measure that is non-negative, with values closer to zero indicating codes that are found in closer proximity.

The researcher may also access the means, standard deviations, and number of codes by indexing the list output for each transcript. This distance measure complements the Gantt plot.

Note that the argument causal = FALSE (the default) does not assume Code A comes before Code B, whereas causal = TRUE assumes the first code precedes the second code. Generally, setting causal = FALSE will result in a larger mean of differences and accompanying standardized values. Also note that row names are the first code and column names are the second comparison code. The values for Code A compared to Code B will not be the same as Code B compared to Code A. This is because, unlike a true distance measure, cm_distance's matrix is asymmetrical. cm_distance computes the distance by taking each span (start and end) for Code A and comparing it to the nearest start or end for Code B. So, for example, there may be 6 Code A spans, and thus six differences between A and B, whereas Code B may have only 3 spans, and thus three differences between B and A. This fact alone will lead to differences in A compared to B versus B compared to A.

cm_distance - Initial Data Setup

x <- list(
    transcript_time_span = qcv(00:00 - 1:12:00),
    A = qcv(terms = "2.40:3.00, 6.32:7.00, 9.00,
        10.00:11.00, 33.23:40.00, 59.56"),
    B = qcv(terms = "3.01:3.02, 5.01,  19.00, 1.12.00:1.19.01"),
    C = qcv(terms = "2.40:3.00, 5.01, 6.32:7.00, 9.00, 17.01, 38.09:40.00")
)
y <- list(
    transcript_time_span = qcv(00:00 - 1:12:00),
    A = qcv(terms = "2.40:3.00, 6.32:7.00, 9.00,
        10.00:11.00, 23.44:25.00, 59.56"),
    B = qcv(terms = "3.01:3.02, 5.01, 7.05:8.00 19.30, 1.12.00:1.19.01"),
    C = qcv(terms = "2.40:3.00, 5.01, 6.32:7.30, 9.00, 17.01, 25.09:27.00")
)

## Long format
dat <- cm_2long(x, y)

cm_distance - Non-Causal Distance

## a cm_distance output
(out1 <- cm_distance(dat, time.var = "variable"))
x

standardized:
     A    B    C
A 0.00 1.04 0.82
B 0.88 0.00 3.89
C 0.09 0.95 0.00


y

standardized:
     A    B    C
A 0.00 0.38 1.97
B 0.47 0.00 4.94
C 0.08 0.09 0.00
## The elements available from the output
names(out1)
[1] "x" "y"
## A list containing means, standard deviations and other 
## descriptive statistics for the differences between codes
out1$x
$mean
       A      B      C
A   0.00 367.67 208.67
B 322.50   0.00 509.00
C  74.67 265.00   0.00

$sd
       A      B      C
A   0.00 347.51 483.27
B 337.47   0.00 940.94
C 143.77 440.92   0.00

$n
  A B C
A 6 6 6
B 4 4 4
C 6 6 6

$combined
  A                B                 C                
A n=6              367.67(347.51)n=6 208.67(483.27)n=6
B 322.5(337.47)n=4 n=4               509(940.94)n=4   
C 74.67(143.77)n=6 265(440.92)n=6    n=6              

$standardized
     A    B    C
A 0.00 1.04 0.82
B 0.88 0.00 3.89
C 0.09 0.95 0.00
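
As a check on the description above, the standardized matrix can be reproduced (to rounding) from the $mean and $sd elements. The sketch below is inferred from the printed output rather than taken from qdap's source; rms is a hypothetical helper that scales a whole matrix without centering:

rms <- function(m) sqrt(sum(m^2) / (length(m) - 1))  ## hypothetical helper
round((out1$x$mean / rms(out1$x$mean)) * (out1$x$sd / rms(out1$x$sd)), 2)  ## matches $standardized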

cm_distance - Causal Distance

## a cm_distance output `causal = TRUE`
cm_distance(dat, time.var = "variable", causal = TRUE)
x

standardized:
     A    B    C
A 0.66 0.84 0.08
B 0.29 3.96 0.49
C 0.40 0.86 0.37


y

standardized:
     A    B    C
A 1.11 1.63 0.08
B 0.03 2.95 0.04
C 0.70 1.27 0.11

Word Counts and Descriptive Statistics

The following functions will be utilized in this section (click to view more):
- SPSS Style Frequency Tables
- Parts of Speech Tagging & Counts
- Question Type Counts
- Syllabication and Counts
- Convert/Generate Term Document Matrix or Document Term Matrix
- Search For and Count Terms
- Word Frequency Matrix
- Word & Character Counts
- Descriptive Word Statistics

A researcher often needs to quickly gather frequency counts for various words/word types. qdap offers multiple functions designed to efficiently generate descriptive word statistics by any combination of grouping variables. Many of the functions also offer proportional usage to compare groups more fairly. Additionally, many functions have plotting methods to better visualize the transformed data.

Descriptive Word Statistics

Often a researcher may want to get a general sense of how words are functioning for different grouping variables. The word_stats function provides a quick picture of what is occurring within the data. The displayed (printed) output is a dataframe; however, the output from word_stats is actually a list. Use ?word_stats to learn more.

The displayed output is a wide dataframe, hence the abbreviated column names. The following column names and meanings will provide guidance in understanding the output:

word_stats Column Names

The abbreviated column names generally follow a pattern: an n. prefix is a count (e.g., n.sent = number of sentences), a p. prefix is a proportion, and the remaining columns are per-unit rates (e.g., wps = words per sentence, cpw = characters per word); see ?word_stats for the full glossary. It is assumed you have run sentSplit on the data; if this is not the case, the counts will not be accurate.

word_stats Example

Note that the initial output is broken into three dataframes below because the printed output from word_stats is so wide. In the R console these three dataframes display as one wide dataframe.

(desc_wrds <- with(mraja1spl, word_stats(dialogue, person, tot = tot)))
           person n.tot n.sent n.words n.char n.syl n.poly sptot wptot
1           Romeo    49    113    1163   4757  1441     48   2.3  23.7
2        Benvolio    34     51     621   2563   780     25   1.5  18.3
3           Nurse    20     59     599   2274   724     20   3.0  29.9
4         Sampson    20     28     259    912   294      7   1.4  12.9
5          Juliet    16     24     206    789   238      5   1.5  12.9
6         Gregory    15     20     149    553   166      1   1.3   9.9
7         Capulet    14     72     736   2900   902     35   5.1  52.6
8    Lady Capulet    12     27     288   1205   370     10   2.2  24.0
9        Mercutio    11     24     549   2355   704     29   2.2  49.9
10        Servant    10     19     184    725   226      5   1.9  18.4
11         Tybalt     8     17     160    660   207      9   2.1  20.0
12       Montague     6     13     217    919   284     13   2.2  36.2
13        Abraham     5      6      24     79    26      0   1.2   4.8
14  First Servant     3      7      69    294    87      2   2.3  23.0
15 Second Servant     3      4      41    160    49      0   1.3  13.7
16  Lady Montague     2      4      28     88    30      0   2.0  14.0
17          Paris     2      3      32    124    41      2   1.5  16.0
18 Second Capulet     2      2      17     64    21      0   1.0   8.5
19         Prince     1      9     167    780   228     17   9.0 167.0
20  First Citizen     1      5      16     79    22      3   5.0  16.0
           person  wps  cps  sps psps cpw spw pspw n.state n.quest n.exclm
1           Romeo 10.3 42.1 12.8  0.4 4.1 1.2  0.0      69      22      22
2        Benvolio 12.2 50.3 15.3  0.5 4.1 1.3  0.0      39       8       4
3           Nurse 10.2 38.5 12.3  0.3 3.8 1.2  0.0      37       9      13
4         Sampson  9.2 32.6 10.5  0.2 3.5 1.1  0.0      27       1       0
5          Juliet  8.6 32.9  9.9  0.2 3.8 1.2  0.0      16       5       3
6         Gregory  7.5 27.6  8.3  0.0 3.7 1.1  0.0      14       3       3
7         Capulet 10.2 40.3 12.5  0.5 3.9 1.2  0.0      40      10      22
8    Lady Capulet 10.7 44.6 13.7  0.4 4.2 1.3  0.0      20       6       1
9        Mercutio 22.9 98.1 29.3  1.2 4.3 1.3  0.1      20       2       2
10        Servant  9.7 38.2 11.9  0.3 3.9 1.2  0.0      14       2       3
11         Tybalt  9.4 38.8 12.2  0.5 4.1 1.3  0.1      13       2       2
12       Montague 16.7 70.7 21.8  1.0 4.2 1.3  0.1      11       2       0
13        Abraham  4.0 13.2  4.3  0.0 3.3 1.1  0.0       3       2       1
14  First Servant  9.9 42.0 12.4  0.3 4.3 1.3  0.0       3       2       2
15 Second Servant 10.2 40.0 12.2  0.0 3.9 1.2  0.0       4       0       0
16  Lady Montague  7.0 22.0  7.5  0.0 3.1 1.1  0.0       2       2       0
17          Paris 10.7 41.3 13.7  0.7 3.9 1.3  0.1       2       1       0
18 Second Capulet  8.5 32.0 10.5  0.0 3.8 1.2  0.0       2       0       0
19         Prince 18.6 86.7 25.3  1.9 4.7 1.4  0.1       7       1       1
20  First Citizen  3.2 15.8  4.4  0.6 4.9 1.4  0.2       0       0       5
           person p.state p.quest p.exclm n.hapax n.dis grow.rate prop.dis
1           Romeo     0.6     0.2     0.2     365    84       0.3      0.1
2        Benvolio     0.8     0.2     0.1     252    43       0.4      0.1
3           Nurse     0.6     0.2     0.2     147    48       0.2      0.1
4         Sampson     1.0     0.0     0.0      81    22       0.3      0.1
5          Juliet     0.7     0.2     0.1      94    22       0.5      0.1
6         Gregory     0.7     0.2     0.2      72    17       0.5      0.1
7         Capulet     0.6     0.1     0.3     232    46       0.3      0.1
8    Lady Capulet     0.7     0.2     0.0     135    28       0.5      0.1
9        Mercutio     0.8     0.1     0.1     253    28       0.5      0.1
10        Servant     0.7     0.1     0.2      71    19       0.4      0.1
11         Tybalt     0.8     0.1     0.1      79    17       0.5      0.1
12       Montague     0.8     0.2     0.0     117    21       0.5      0.1
13        Abraham     0.5     0.3     0.2       3     7       0.1      0.3
14  First Servant     0.4     0.3     0.3      33     8       0.5      0.1
15 Second Servant     1.0     0.0     0.0      32     3       0.8      0.1
16  Lady Montague     0.5     0.5     0.0      24     2       0.9      0.1
17          Paris     0.7     0.3     0.0      25     2       0.8      0.1
18 Second Capulet     1.0     0.0     0.0       7     5       0.4      0.3
19         Prince     0.8     0.1     0.1      83    15       0.5      0.1
20  First Citizen     0.0     0.0     1.0       9     2       0.6      0.1
## The following shows all the available elements in the `word_stats` output
names(desc_wrds)
## [1] "ts"        "gts"       "mpun"      "word.elem" "sent.elem" "omit"     
## [7] "digits"

word_stats has a plot method that plots the output as a heat map. This can be useful for finding high/low elements in the data set.

word_stats Plot

plot(desc_wrds)

[Plot: heat map of word_stats output]

plot(desc_wrds, label=TRUE, lab.digits = 1)

[Plot: heat map of word_stats output, labeled]

It takes considerable time to run word_stats because it calculates syllable counts. The user may re-use the object output from one run and pass this as the text variable (text.var) in a subsequent run with different grouping variables (grouping.vars), as long as the text variable has not changed. The example below demonstrates how to re-use the output from one word_stats run in another run.

word_stats Re-use

with(mraja1spl, word_stats(desc_wrds, list(sex, fam.aff, died), tot = tot))
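
Because the syllable counts are carried in the first run's output, the re-use can be confirmed with a rough timing comparison (a sketch using base R's system.time; actual timings will vary by machine):

## First run computes syllable counts (slow)
system.time(desc_wrds <- with(mraja1spl, word_stats(dialogue, person, tot = tot)))
## Second run re-uses those counts (fast)
system.time(with(mraja1spl, word_stats(desc_wrds, list(sex, fam.aff, died), tot = tot)))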

Word Frequency Matrix

Many analyses with words involve a matrix based on the words. qdap uses a word frequency matrix (wfm) or the less malleable dataframe version, the word frequency dataframe (wfdf). The wfm is a count of word usages per grouping variable(s). This is a similar concept to the tm package's Term Document Matrix, though instead of documents we are interested in the grouping variable's usage of terms. wfm is the general function that should be used; however, the wfdf function does provide options for margin sums (row and column). Also note that wfm_expanded and wfm_combine can expand or combine terms within a word frequency matrix.

wfm Examples

## By a single grouping variable
with(DATA, wfm(state, person))[1:15, ]
##          greg researcher sally sam teacher
## about       0          0     1   0       0
## already     1          0     0   0       0
## am          1          0     0   0       0
## are         0          0     1   0       0
## be          0          0     1   0       0
## can         0          0     1   0       0
## certain     0          0     1   0       0
## computer    0          0     0   1       0
## distrust    0          0     0   1       0
## do          0          0     0   0       1
## dumb        1          0     0   0       0
## eat         1          0     0   0       0
## fun         0          0     0   2       0
## good        0          1     0   0       0
## how         0          0     1   0       0
## By two grouping variables
with(DATA, wfm(state, list(sex, adult)))[1:15, ]
##          f.0 f.1 m.0 m.1
## about      1   0   0   0
## already    0   0   1   0
## am         0   0   1   0
## are        1   0   0   0
## be         1   0   0   0
## can        1   0   0   0
## certain    1   0   0   0
## computer   0   0   1   0
## distrust   0   0   1   0
## do         0   0   0   1
## dumb       0   0   1   0
## eat        0   0   1   0
## fun        0   0   2   0
## good       0   1   0   0
## how        1   0   0   0

wfm: Keep a Two-Word Phrase as a Single Term

## insert double tilde ("~~") to keep phrases (e.g., first/last name)
space_keeps <- c(" fun", "I ")
state2 <- space_fill(DATA$state, space_keeps, rm.extra = FALSE)
with(DATA, wfm(state2, list(sex, adult)))[1:18, ]
##            f.0 f.1 m.0 m.1
## about        1   0   0   0
## already      0   0   1   0
## are          1   0   0   0
## be           1   0   0   0
## can          1   0   0   0
## certain      1   0   0   0
## computer     0   0   1   0
## do           0   0   0   1
## dumb         0   0   1   0
## eat          0   0   1   0
## good         0   1   0   0
## how          1   0   0   0
## hungry       0   0   1   0
## i'm          0   0   1   0
## i am         0   0   1   0
## i distrust   0   0   1   0
## is           0   0   1   0
## is fun       0   0   1   0

At times it may be useful to view the correlation between word occurrences across turns of talk or other useful groupings. The user can utilize the output from wfm to accomplish this.

wfm: Word Correlations

library(reports)
x <- factor(with(rajSPLIT, paste(act, pad(TOT(tot)), sep = "|")))
dat <- wfm(rajSPLIT$dialogue, x)
cor(t(dat)[, c("romeo", "juliet")])
##        romeo juliet
## romeo  1.000  0.111
## juliet 0.111  1.000
cor(t(dat)[, c("romeo", "banished")])
##          romeo banished
## romeo    1.000    0.343
## banished 0.343    1.000
cor(t(dat)[, c("romeo", "juliet", "hate", "love")])
##           romeo    juliet     hate     love
## romeo   1.00000  0.110981 -0.04456 0.208612
## juliet  0.11098  1.000000 -0.03815 0.005002
## hate   -0.04456 -0.038149  1.00000 0.158720
## love    0.20861  0.005002  0.15872 1.000000
dat2 <- wfm(DATA$state, id(DATA))
qheat(cor(t(dat2)), low = "yellow", high = "red", 
    grid = "grey90", diag.na = TRUE, by.column = NULL) 

[Plot: heat map of word correlations (qheat)]

wfdf Examples: Add Margins

with(DATA, wfdf(state, person, margins = TRUE))[c(1:15, 41:42), ]
##             Words greg researcher sally sam teacher TOTAL.USES
## 1           about    0          0     1   0       0          1
## 2         already    1          0     0   0       0          1
## 3              am    1          0     0   0       0          1
## 4             are    0          0     1   0       0          1
## 5              be    0          0     1   0       0          1
## 6             can    0          0     1   0       0          1
## 7         certain    0          0     1   0       0          1
## 8        computer    0          0     0   1       0          1
## 9        distrust    0          0     0   1       0          1
## 10             do    0          0     0   0       1          1
## 11           dumb    1          0     0   0       0          1
## 12            eat    1          0     0   0       0          1
## 13            fun    0          0     0   2       0          2
## 14           good    0          1     0   0       0          1
## 15            how    0          0     1   0       0          1
## 41            you    1          0     1   2       0          4
## 42 TOTAL.WORDS ->   20          6    10  13       4         53
with(DATA, wfdf(state, list(sex, adult), margins = TRUE))[c(1:15, 41:42), ]
##             Words f.0 f.1 m.0 m.1 TOTAL.USES
## 1           about   1   0   0   0          1
## 2         already   0   0   1   0          1
## 3              am   0   0   1   0          1
## 4             are   1   0   0   0          1
## 5              be   1   0   0   0          1
## 6             can   1   0   0   0          1
## 7         certain   1   0   0   0          1
## 8        computer   0   0   1   0          1
## 9        distrust   0   0   1   0          1
## 10             do   0   0   0   1          1
## 11           dumb   0   0   1   0          1
## 12            eat   0   0   1   0          1
## 13            fun   0   0   2   0          2
## 14           good   0   1   0   0          1
## 15            how   1   0   0   0          1
## 41            you   1   0   3   0          4
## 42 TOTAL.WORDS ->  10   6  33   4         53

wfm_expanded: Expand the wfm

## Start with a word frequency matrix
z <- wfm(DATA$state, DATA$person)

## Note a single `you`
z[30:41, ]
##         greg researcher sally sam teacher
## stinks     0          0     0   1       0
## talking    0          0     1   0       0
## telling    1          0     0   0       0
## the        1          0     0   0       0
## then       0          1     0   0       0
## there      1          0     0   0       0
## too        0          0     0   1       0
## truth      1          0     0   0       0
## way        1          0     0   0       0
## we         0          1     1   0       1
## what       0          0     1   0       1
## you        1          0     1   2       0
## Note that there are two `you`s in the expanded version
wfm_expanded(z)[33:45, ] 
##         greg researcher sally sam teacher
## stinks     0          0     0   1       0
## talking    0          0     1   0       0
## telling    1          0     0   0       0
## the        1          0     0   0       0
## then       0          1     0   0       0
## there      1          0     0   0       0
## too        0          0     0   1       0
## truth      1          0     0   0       0
## way        1          0     0   0       0
## we         0          1     1   0       1
## what       0          0     1   0       1
## you        1          0     1   1       0
## you        0          0     0   1       0

wfm_combine: Combine Terms in the wfm

## Start with a word frequency matrix
x <- wfm(DATA$state, DATA$person)

## The terms to combine
WL <- list(
    random = c("the", "fun", "i"), 
    yous = c("you", "your", "you're")
)

## Combine the terms
(out <- wfm_combine(x, WL))
##            greg researcher sally sam teacher
## random        2          0     0   3       0
## yous          1          0     1   2       0
## else.words   17          6     9   8       4
## Pass the combined version to Chi Squared Test
chisq.test(out)
## 
##  Pearson's Chi-squared test
## 
## data:  out
## X-squared = 7.661, df = 8, p-value = 0.4673

wfm: Correspondence Analysis Example

library(ca)

## Grab Just the Candidates
dat <- pres_debates2012
dat <- dat[dat$person %in% qcv(ROMNEY, OBAMA), ]

## Stem the text
speech <- stemmer(dat$dialogue)

## With 25 words removed
mytable1 <- with(dat, wfm(speech, list(person, time), stopwords = Top25Words))

## CA
fit <- ca(mytable1)
summary(fit)
plot(fit)
plot3d.ca(fit, labels=1)

## With 200 words removed
mytable2 <- with(dat, wfm(speech, list(person, time), stopwords = Top200Words))

## CA
fit2 <- ca(mytable2)
summary(fit2)
plot(fit2)
plot3d.ca(fit2, labels=1)

Convert/Generate Term Document Matrix or Document Term Matrix

Some packages that could further the analysis begun in qdap expect a Document Term or Term Document Matrix. qdap's wfm is similar to the tm package's TermDocumentMatrix and DocumentTermMatrix. qdap does not try to replicate the extensive work of the tm package; however, the as.tdm and as.dtm functions do attempt to extend the work the researcher conducts in qdap so it can be utilized in other R packages. For a vignette describing qdap-tm compatibility, use browseVignettes(package = "qdap") or see http://cran.r-project.org/web/packages/qdap/vignettes/tm_package_compatibility.pdf.

as.tdm Use

x <- wfm(DATA$state, DATA$person)
## Term Document Matrix
as.tdm(x)
## <<TermDocumentMatrix (terms: 41, documents: 5)>>
## Non-/sparse entries: 49/156
## Sparsity           : 76%
## Maximal term length: 8
## Weighting          : term frequency (tf)
## Document Term Matrix
as.dtm(x)
## <<DocumentTermMatrix (documents: 5, terms: 41)>>
## Non-/sparse entries: 49/156
## Sparsity           : 76%
## Maximal term length: 8
## Weighting          : term frequency (tf)
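
Once converted, the object can be handed directly to tm. A minimal sketch (assumes the tm package is installed; inspect is tm's viewer for its matrix classes):

library(tm)
tdm <- as.tdm(x)
inspect(tdm[1:5, ])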

Search For and Count Terms

The termco family of functions are some of the most useful qdap functions for quantitative discourse analysis. termco searches for (and optionally groups) terms and outputs a raw count, percent, and combined (raw/percent) matrix of term counts by grouping variable. The term_match, all_words, syn, exclude, and spaste functions are complementary functions that are useful in developing word lists to supply to the match.list argument.

The match.list acts to search for similarly grouped themes. For example, c(" read ", " reads", " reading", " reader") may be a search for words associated with reading. It is good practice to name the vectors of words that are stored in the match.list. This is the general form for how to set up a match.list:

themes <- list(
    theme_1 = c(),
    theme_2 = c(),
    theme_n = c()
)

It is important to understand how the match.list is handled by termco. The match.list is (optionally) case and character sensitive. Spacing is an important way to grab specific words and requires careful thought. For example, using "read" will find the words "bread", "read", "reading", and "ready". If you want to search for just the word "read", supply a vector of c(" read ", " reads", " reading", " reader"). Notice the leading and trailing spaces. A space acts as a boundary, whereas starting/ending with a nonspace allows for greedy matching that will find words containing the term. A leading space, a trailing space, or both may be used to control how termco searches for the supplied terms. So the reader may ask: why not supply one string spaced as " read"? Keep in mind that termco would also find the word "ready".
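
To make the boundary behavior concrete, here is a minimal sketch (the sentences and the theme names greedy/bounded are invented for illustration); spaste, mentioned above, adds the boundary spaces to a vector of terms:

txt <- c("I read bread labels.", "She is ready to read.")
termco(txt, match.list = list(
    greedy  = c("read"),             ## also matches bread, ready
    bounded = c(" read ", " read.")  ## matches only the word read
))
spaste(qcv(read, reads))             ## adds boundary spaces: " read " " reads "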

This section's examples will first view the complementary functions that augment the themes supplied to match.list, and then the main termco function will be explored.

term_match looks through a text variable (usually the text found in the transcript) and finds/returns a vector of words containing a term(s).

term_match and exclude Examples

term_match(text.var = DATA$state, terms = qcv(the, trust), return.list = FALSE)
## [1] "distrust" "the"      "then"     "there"
term_match(DATA$state, "i", FALSE)
##  [1] "certain"  "distrust" "i"        "i'm"      "is"       "it"      
##  [7] "it's"     "liar"     "stinks"   "talking"  "telling"
exclude(term_match(DATA$state, "i", FALSE), talking, telling)
## [1] "certain"  "distrust" "i"        "i'm"      "is"       "it"      
## [7] "it's"     "liar"     "stinks"

all_words is similar to term_match; however, the function looks at all the words found in a text variable (usually the transcript text) and returns words that begin with or contain the term(s). The output can be arranged alphabetically or by frequency. The output is a dataframe, which helps the researcher make decisions with regard to frequency of word use.

all_words Examples

x1 <- all_words(raj$dialogue, begins.with="re")
head(x1, 10)
##    WORD       FREQ
## 1  re            2
## 2  reach         1
## 3  read          6
## 4  ready         5
## 5  rearward      1
## 6  reason        5
## 7  reason's      1
## 8  rebeck        1
## 9  rebellious    1
## 10 receipt       1
all_words(raj$dialogue, begins.with="q")
##    WORD        FREQ
## 1  qualities      1
## 2  quarrel       11
## 3  quarrelled     1
## 4  quarrelling    2
## 5  quarrels       1
## 6  quarter        1
## 7  queen          1
## 8  quench         2
## 9  question       2
## 10 quick          2
## 11 quickly        5
## 12 quiet          4
## 13 quinces        1
## 14 quit           1
## 15 quite          2
## 16 quivering      1
## 17 quivers        1
## 18 quote          1
## 19 quoth          5
all_words(raj$dialogue, contains="conc")
##   WORD      FREQ
## 1 conceal'd    1
## 2 conceit      2
## 3 conceive     1
## 4 concludes    1
## 5 reconcile    1
x2 <- all_words(raj$dialogue)
head(x2, 10)
##    WORD      FREQ
## 1  'tis         9
## 2  a          445
## 3  a'           1
## 4  abate        1
## 5  abbey        1
## 6  abed         1
## 7  abhorred     1
## 8  abhors       1
## 9  able         2
## 10 ableeding    1
x3 <- all_words(raj$dialogue, alphabetical = FALSE)
head(x3, 10)
##    WORD FREQ
## 1  and   666
## 2  the   656
## 3  i     573
## 4  to    517
## 5  a     445
## 6  of    378
## 7  my    358
## 8  is    344
## 9  that  344
## 10 in    312

The synonyms (shorthand: syn) function finds words that are synonyms of a given set of terms and returns either a list or a vector that can be passed to termco's match.list.

synonyms Examples

synonyms(c("the", "cat", "job", "environment", "read", "teach"))
## $cat.def_1
## [1] "feline"    "gib"       "grimalkin" "kitty"     "malkin"   
## 
## $cat.def_2
## [1] "moggy"
## 
## $cat.def_3
## [1] "mouser" "puss"  
## 
## $cat.def_4
## [1] "pussy"
## 
## $cat.def_5
## [1] "tabby"
## 
## $job.def_1
##  [1] "affair"         "assignment"     "charge"         "chore"         
##  [5] "concern"        "contribution"   "duty"           "enterprise"    
##  [9] "errand"         "function"       "pursuit"        "responsibility"
## [13] "role"           "stint"          "task"           "undertaking"   
## [17] "venture"        "work"          
## 
## $job.def_2
##  [1] "business"   "calling"    "capacity"   "career"     "craft"     
##  [6] "employment" "function"   "livelihood" "metier"     "occupation"
## [11] "office"     "position"   "post"       "profession" "situation" 
## [16] "trade"      "vocation"  
## 
## $job.def_3
##  [1] "allotment"   "assignment"  "batch"       "commission"  "consignment"
##  [6] "contract"    "lot"         "output"      "piece"       "portion"    
## [11] "product"     "share"      
## 
## $environment.def_1
##  [1] "atmosphere"   "background"   "conditions"   "context"     
##  [5] "domain"       "element"      "habitat"      "locale"      
##  [9] "medium"       "milieu"       "scene"        "setting"     
## [13] "situation"    "surroundings" "territory"   
## 
## $environment.def_2
## [1] "The environment is the natural world of land"
## [2] "sea"                                         
## [3] "air"                                         
## [4] "plants"                                      
## [5] "and animals."                                
## 
## $read.def_1
## [1] "glance at"          "look at"            "peruse"            
## [4] "pore over"          "refer to"           "run one's eye over"
## [7] "scan"               "study"             
## 
## $read.def_2
## [1] "announce" "declaim"  "deliver"  "recite"   "speak"    "utter"   
## 
## $read.def_3
## [1] "comprehend"              "construe"               
## [3] "decipher"                "discover"               
## [5] "interpret"               "perceive the meaning of"
## [7] "see"                     "understand"             
## 
## $read.def_4
## [1] "display"  "indicate" "record"   "register" "show"    
## 
## $teach.def_1
##  [1] "advise"          "coach"           "demonstrate"    
##  [4] "direct"          "discipline"      "drill"          
##  [7] "edify"           "educate"         "enlighten"      
## [10] "give lessons in" "guide"           "impart"         
## [13] "implant"         "inculcate"       "inform"         
## [16] "instil"          "instruct"        "school"         
## [19] "show"            "train"           "tutor"
head(syn(c("the", "cat", "job", "environment", "read", "teach"),
    return.list = FALSE), 30)
##  [1] "feline"         "gib"            "grimalkin"      "kitty"         
##  [5] "malkin"         "moggy"          "mouser"         "puss"          
##  [9] "pussy"          "tabby"          "affair"         "assignment"    
## [13] "charge"         "chore"          "concern"        "contribution"  
## [17] "duty"           "enterprise"     "errand"         "function"      
## [21] "pursuit"        "responsibility" "role"           "stint"         
## [25] "task"           "undertaking"    "venture"        "work"          
## [29] "business"       "calling"
syn(c("the", "cat", "job", "environment", "read", "teach"), multiwords = FALSE)
## $cat.def_1
## [1] "feline"    "gib"       "grimalkin" "kitty"     "malkin"   
## 
## $cat.def_2
## [1] "moggy"
## 
## $cat.def_3
## [1] "mouser" "puss"  
## 
## $cat.def_4
## [1] "pussy"
## 
## $cat.def_5
## [1] "tabby"
## 
## $job.def_1
##  [1] "affair"         "assignment"     "charge"         "chore"         
##  [5] "concern"        "contribution"   "duty"           "enterprise"    
##  [9] "errand"         "function"       "pursuit"        "responsibility"
## [13] "role"           "stint"          "task"           "undertaking"   
## [17] "venture"        "work"          
## 
## $job.def_2
##  [1] "business"   "calling"    "capacity"   "career"     "craft"     
##  [6] "employment" "function"   "livelihood" "metier"     "occupation"
## [11] "office"     "position"   "post"       "profession" "situation" 
## [16] "trade"      "vocation"  
## 
## $job.def_3
##  [1] "allotment"   "assignment"  "batch"       "commission"  "consignment"
##  [6] "contract"    "lot"         "output"      "piece"       "portion"    
## [11] "product"     "share"      
## 
## $environment.def_1
##  [1] "atmosphere"   "background"   "conditions"   "context"     
##  [5] "domain"       "element"      "habitat"      "locale"      
##  [9] "medium"       "milieu"       "scene"        "setting"     
## [13] "situation"    "surroundings" "territory"   
## 
## $environment.def_2
## [1] "sea"    "air"    "plants"
## 
## $read.def_1
## [1] "peruse" "scan"   "study" 
## 
## $read.def_2
## [1] "announce" "declaim"  "deliver"  "recite"   "speak"    "utter"   
## 
## $read.def_3
## [1] "comprehend" "construe"   "decipher"   "discover"   "interpret" 
## [6] "see"        "understand"
## 
## $read.def_4
## [1] "display"  "indicate" "record"   "register" "show"    
## 
## $teach.def_1
##  [1] "advise"      "coach"       "demonstrate" "direct"      "discipline" 
##  [6] "drill"       "edify"       "educate"     "enlighten"   "guide"      
## [11] "impart"      "implant"     "inculcate"   "inform"      "instil"     
## [16] "instruct"    "school"      "show"        "train"       "tutor"

termco - Simple Example

## Make a small dialogue data set
(dat2 <- data.frame(dialogue=c("@bryan is bryan good @br",
    "indeed", "@ brian"), person=qcv(A, B, A)))
##                   dialogue person
## 1 @bryan is bryan good @br      A
## 2                   indeed      B
## 3                  @ brian      A
## The word list to search for
ml <- list(
    wrds=c("bryan", "indeed"), 
    "@", 
    bryan=c("bryan", "@ br", "@br")
)

## Search by person
with(dat2, termco(dialogue, person, match.list=ml))
##   person word.count       wrds         @     bryan
## 1      A          6  2(33.33%) 3(50.00%) 5(83.33%)
## 2      B          1 1(100.00%)         0         0
## Search by person proportion output
with(dat2, termco(dialogue, person, match.list=ml, percent = FALSE))
##   person word.count    wrds      @  bryan
## 1      A          6  2(.33) 3(.50) 5(.83)
## 2      B          1 1(1.00)      0      0

termco - Romeo and Juliet Act 1 Example

## Word list to search for
## Note: In the last vector using "the" will actually 
## include the other 3 versions
ml2 <- list(
    theme_1 = c(" the ", " a ", " an "),
    theme_2 = c(" I'" ),
    "good",
    the_words = c("the", " the ", " the", "the ")
)

(out <- with(raj.act.1,  termco(dialogue, person, ml2)))
##            person word.count   theme_1 theme_2     good   the_words
## 1         Abraham         24         0       0        0           0
## 2        Benvolio        621 32(5.15%) 2(.32%)  2(.32%) 123(19.81%)
## 3         Capulet        736 39(5.30%) 3(.41%)  3(.41%)  93(12.64%)
## 4   First Citizen         16 2(12.50%)       0        0  10(62.50%)
## 5   First Servant         69 8(11.59%)       0 1(1.45%)  20(28.99%)
## 6         Gregory        149  9(6.04%)       0        0  48(32.21%)
## 7          Juliet        206  5(2.43%) 1(.49%)  1(.49%)   20(9.71%)
## 8    Lady Capulet        286 20(6.99%)       0        0  63(22.03%)
## 9   Lady Montague         28  2(7.14%)       0        0           0
## 10       Mercutio        552 49(8.88%)       0  2(.36%) 146(26.45%)
## 11       Montague        217 12(5.53%)       0  1(.46%)  41(18.89%)
## 12          Nurse        598 44(7.36%) 1(.17%)  2(.33%) 103(17.22%)
## 13          Paris         32         0       0        0    1(3.12%)
## 14         Prince        167  8(4.79%)       0        0  35(20.96%)
## 15          Romeo       1164 56(4.81%) 3(.26%)  3(.26%) 142(12.20%)
## 16        Sampson        259 19(7.34%)       0  1(.39%)  70(27.03%)
## 17 Second Capulet         17         0       0        0           0
## 18 Second Servant         41  2(4.88%)       0 1(2.44%)   8(19.51%)
## 19        Servant        183 12(6.56%) 1(.55%)  1(.55%)  46(25.14%)
## 20         Tybalt        160 11(6.88%) 1(.62%)        0  24(15.00%)
## Available elements in the termco output (use out$...)
names(out)
## [1] "raw"          "prop"         "rnp"          "zero.replace"
## [5] "percent"      "digits"
## Raw and proportion - useful for presenting in tables
out$rnp  
##            person word.count   theme_1 theme_2     good   the_words
## 1         Abraham         24         0       0        0           0
## 2        Benvolio        621 32(5.15%) 2(.32%)  2(.32%) 123(19.81%)
## 3         Capulet        736 39(5.30%) 3(.41%)  3(.41%)  93(12.64%)
## 4   First Citizen         16 2(12.50%)       0        0  10(62.50%)
## 5   First Servant         69 8(11.59%)       0 1(1.45%)  20(28.99%)
## 6         Gregory        149  9(6.04%)       0        0  48(32.21%)
## 7          Juliet        206  5(2.43%) 1(.49%)  1(.49%)   20(9.71%)
## 8    Lady Capulet        286 20(6.99%)       0        0  63(22.03%)
## 9   Lady Montague         28  2(7.14%)       0        0           0
## 10       Mercutio        552 49(8.88%)       0  2(.36%) 146(26.45%)
## 11       Montague        217 12(5.53%)       0  1(.46%)  41(18.89%)
## 12          Nurse        598 44(7.36%) 1(.17%)  2(.33%) 103(17.22%)
## 13          Paris         32         0       0        0    1(3.12%)
## 14         Prince        167  8(4.79%)       0        0  35(20.96%)
## 15          Romeo       1164 56(4.81%) 3(.26%)  3(.26%) 142(12.20%)
## 16        Sampson        259 19(7.34%)       0  1(.39%)  70(27.03%)
## 17 Second Capulet         17         0       0        0           0
## 18 Second Servant         41  2(4.88%)       0 1(2.44%)   8(19.51%)
## 19        Servant        183 12(6.56%) 1(.55%)  1(.55%)  46(25.14%)
## 20         Tybalt        160 11(6.88%) 1(.62%)        0  24(15.00%)
## Raw - useful for performing calculations
out$raw 
##            person word.count theme_1 theme_2 good the_words
## 1         Abraham         24       0       0    0         0
## 2        Benvolio        621      32       2    2       123
## 3         Capulet        736      39       3    3        93
## 4   First Citizen         16       2       0    0        10
## 5   First Servant         69       8       0    1        20
## 6         Gregory        149       9       0    0        48
## 7          Juliet        206       5       1    1        20
## 8    Lady Capulet        286      20       0    0        63
## 9   Lady Montague         28       2       0    0         0
## 10       Mercutio        552      49       0    2       146
## 11       Montague        217      12       0    1        41
## 12          Nurse        598      44       1    2       103
## 13          Paris         32       0       0    0         1
## 14         Prince        167       8       0    0        35
## 15          Romeo       1164      56       3    3       142
## 16        Sampson        259      19       0    1        70
## 17 Second Capulet         17       0       0    0         0
## 18 Second Servant         41       2       0    1         8
## 19        Servant        183      12       1    1        46
## 20         Tybalt        160      11       1    0        24
## Proportion - useful for performing calculations
out$prop
##            person word.count theme_1 theme_2   good the_words
## 1         Abraham         24   0.000  0.0000 0.0000     0.000
## 2        Benvolio        621   5.153  0.3221 0.3221    19.807
## 3         Capulet        736   5.299  0.4076 0.4076    12.636
## 4   First Citizen         16  12.500  0.0000 0.0000    62.500
## 5   First Servant         69  11.594  0.0000 1.4493    28.986
## 6         Gregory        149   6.040  0.0000 0.0000    32.215
## 7          Juliet        206   2.427  0.4854 0.4854     9.709
## 8    Lady Capulet        286   6.993  0.0000 0.0000    22.028
## 9   Lady Montague         28   7.143  0.0000 0.0000     0.000
## 10       Mercutio        552   8.877  0.0000 0.3623    26.449
## 11       Montague        217   5.530  0.0000 0.4608    18.894
## 12          Nurse        598   7.358  0.1672 0.3344    17.224
## 13          Paris         32   0.000  0.0000 0.0000     3.125
## 14         Prince        167   4.790  0.0000 0.0000    20.958
## 15          Romeo       1164   4.811  0.2577 0.2577    12.199
## 16        Sampson        259   7.336  0.0000 0.3861    27.027
## 17 Second Capulet         17   0.000  0.0000 0.0000     0.000
## 18 Second Servant         41   4.878  0.0000 2.4390    19.512
## 19        Servant        183   6.557  0.5464 0.5464    25.137
## 20         Tybalt        160   6.875  0.6250 0.0000    15.000

Using termco with term_match and exclude

## Example 1
termco(DATA$state, DATA$person, exclude(term_match(DATA$state, qcv(th),
    FALSE), "truth"))
##       person word.count       the      then    there
## 1       greg         20 2(10.00%)         0 1(5.00%)
## 2 researcher          6 1(16.67%) 1(16.67%)        0
## 3      sally         10         0         0        0
## 4        sam         13         0         0        0
## 5    teacher          4         0         0        0
## Example 2
MTCH.LST <- exclude(term_match(DATA$state, qcv(th, i)), qcv(truth, stinks))
termco(DATA$state, DATA$person, MTCH.LST)
##       person word.count        th          i
## 1       greg         20 3(15.00%) 13(65.00%)
## 2 researcher          6 2(33.33%)          0
## 3      sally         10         0  4(40.00%)
## 4        sam         13         0 11(84.62%)
## 5    teacher          4         0          0

Using termco with syn

syns <- synonyms("doubt")
syns[1]
## $doubt.def_1
## [1] "discredit"          "distrust"           "fear"              
## [4] "lack confidence in" "misgive"            "mistrust"          
## [7] "query"              "question"           "suspect"
termco(DATA$state, DATA$person, unlist(syns[1]))
##       person word.count discredit distrust fear query question
## 1       greg         20         0        0    0     0        0
## 2 researcher          6         0        0    0     0        0
## 3      sally         10         0        0    0     0        0
## 4        sam         13         0 1(7.69%)    0     0        0
## 5    teacher          4         0        0    0     0        0
synonyms("doubt", FALSE)
##  [1] "discredit"          "distrust"           "fear"              
##  [4] "lack confidence in" "misgive"            "mistrust"          
##  [7] "query"              "question"           "suspect"           
## [10] "apprehension"       "disquiet"           "incredulity"       
## [13] "lack of faith"      "misgiving"          "qualm"             
## [16] "scepticism"         "suspicion"          "be dubious"        
## [19] "be uncertain"       "demur"              "fluctuate"         
## [22] "hesitate"           "scruple"            "vacillate"         
## [25] "waver"              "dubiety"            "hesitancy"         
## [28] "hesitation"         "indecision"         "irresolution"      
## [31] "lack of conviction" "suspense"           "uncertainty"       
## [34] "vacillation"        "confusion"          "difficulty"        
## [37] "dilemma"            "perplexity"         "problem"           
## [40] "quandary"           "admittedly"         "assuredly"         
## [43] "certainly"          "doubtless"          "doubtlessly"       
## [46] "probably"           "surely"
termco(DATA$state, DATA$person, list(doubt = synonyms("doubt", FALSE)))
##       person word.count    doubt
## 1       greg         20        0
## 2 researcher          6        0
## 3      sally         10        0
## 4        sam         13 1(7.69%)
## 5    teacher          4        0
termco(DATA$state, DATA$person, syns)
##       person word.count doubt.def_1 doubt.def_2 doubt.def_5 doubt.def_6
## 1       greg         20           0           0           0           0
## 2 researcher          6           0           0           0           0
## 3      sally         10           0           0           0           0
## 4        sam         13    1(7.69%)    1(7.69%)           0           0
## 5    teacher          4           0           0           0           0

termco also has a plot method that plots a heat map of the termco output based on the percent usage by grouping variable. This allows for rapid visualization of patterns and enables fast spotting of extreme values. Here are some plots from the Romeo and Juliet Act 1 example above.

Using termco Plotting

plot(out)

[Plot: heat map of termco output]

plot(out, label = TRUE)