qdap (Rinker, 2013) is an R package designed to assist in quantitative discourse analysis. The package stands as a bridge between qualitative transcripts of dialogue and statistical analysis and visualization. qdap was born out of a frustration with current discourse analysis programs. Packaged programs are a closed system, meaning the researcher using the method has little, if any, influence on the program applied to her data.
R already has thousands of excellent packages for statistics and visualization. qdap is designed to stand as a bridge between the qualitative discourse of a transcript and the computational power and freedom that R offers. As qdap returns the power to the researcher it will also allow the researcher to be more efficient and thus effective and productive in data analysis. The qdap package provides researchers with the tools to analyze data and more importantly is a dynamic system governed by the data, shaped by theory, and continuously refined by the field.
…if you can dream up an analysis then qdap and R can help get you there.
The following vignette is a loose chronological road map for utilizing the tools provided by qdap.
The function new_project is designed to generate a project template of multiple nested directories that organize and guide the researcher through a qualitative study, from data collection to analysis and report/presentation generation. This workflow framework will enable the researcher to be better organized and more efficient in all stages of the research process. new_project utilizes the reports package (Rinker, 2013b).
Please see the following links for PDF descriptions of the contents of the new_project and the reports directory.
Project Workflow | Report Workflow
The new_project template is designed to be utilized with RStudio. Upon clicking the xxx.Rproj file the template will be loaded into RStudio. The .Rprofile script will be sourced upon start up, allowing the user to automatically load packages, functions, etc. related to the project. The file extra_functions.R is sourced, loading custom functions. Already included are two functions, email and todo, used to generate project member emails and track project tasks. This auto sourcing greatly enhances efficiency in workflow.
This subsection covers how to read in transcript data. Generally the researcher will have data stored as a .docx (Microsoft Word or Open/Libre Office) or .xlsx/.csv (spreadsheet format). It is of great importance that the researcher manually writes/parses their transcripts to avoid potential analysis problems later. All sentences should contain appropriate qdap punctuation (declarative = ., interrogative = ?, exclamatory = !, interrupted = |, or imperative = *., *?, *!, *|). Additionally, if a sentence contains an end mark/punctuation it should have accompanying text/dialogue. Two functions are useful for reading in data, read.transcript and dir_map. read.transcript detects file type (.docx/.csv/.xlsx) and reads in a single transcript, whereas dir_map generates code that utilizes read.transcript for each of the multiple transcripts in a single directory. Note that read.transcript expects a two column formatted transcript (usually with person on the left and dialogue on the right).
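The end-mark convention described above can be checked before analysis. The following is a minimal base R sketch, not a qdap function; the `dialogue` vector and the `has_endmark` name are made up for illustration, and the pattern only tests whether a line ends in one of the four end marks (optionally followed by whitespace).

```r
# Illustrative dialogue lines (hypothetical data, not from the qdap samples)
dialogue <- c("Computer is fun.",
              "What should we do?",
              "I am telling the truth",   # missing end mark
              "Stop it *!",               # imperative + exclamatory
              "Well I |")                 # interrupted

# TRUE when the line ends in ., ?, ! or | (trailing spaces allowed)
has_endmark <- grepl("[.?!|]\\s*$", dialogue)

# Row numbers that need attention before analysis
which(!has_endmark)
```

Rows flagged this way are candidates for manual correction in the transcript itself.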
Five arguments are of particular importance to read.transcript:
file | The name of the file which the data are to be read from. Each row of the table appears as one line of the file. If it does not contain an absolute path, the file name is relative to the current working directory,
col.names | A character vector specifying the column names of the transcript columns.
header | logical. If
sep | The field separator character. Values on each line of the file are separated by this character. The default of
skip | Integer; the number of lines of the data file to skip before beginning to read data.
Often transcripts contain extraneous material at the top and the argument skip = ? must be used to skip these extra lines. Some sort of unique separator must also be used to separate the person column from the text column. By default sep = ":" is assumed. If your transcripts do not contain a separator one must be inserted manually. Also note that the researcher may want to prepare the transcripts with brackets to denote non-spoken annotations as well as dialogue that is read rather than spoken. For more on bracket parsing see Bracket/General Chunk Extraction.
♦ Reading In Data- read.transcript ♦
## Location of sample transcripts from the qdap package
(doc1 <- system.file("extdata/transcripts/trans1.docx", package = "qdap"))
(doc2 <- system.file("extdata/transcripts/trans2.docx", package = "qdap"))
(doc3 <- system.file("extdata/transcripts/trans3.docx", package = "qdap"))
(doc4 <- system.file("extdata/transcripts/trans4.xlsx", package = "qdap"))
dat1 <- read.transcript(doc1)
truncdf(dat1, 40)
## X1 X2
## 1 Researcher 2 October 7, 1892.
## 2 Teacher 4 Students it's time to learn. [Student di
## 3 Multiple Students Yes teacher we're ready to learn.
## 4 [Cross Talk 3 00]
## 5 Teacher 4 Let's read this terrific book together.
dat2 <- read.transcript(doc1, col.names = c("person", "dialogue"))
truncdf(dat2, 40)
## person dialogue
## 1 Researcher 2 October 7, 1892.
## 2 Teacher 4 Students it's time to learn. [Student di
## 3 Multiple Students Yes teacher we're ready to learn.
## 4 [Cross Talk 3 00]
## 5 Teacher 4 Let's read this terrific book together.
dat2b <- rm_row(dat2, "person", "[C") #remove bracket row
truncdf(dat2b, 40)
## person dialogue
## 1 Researcher 2 October 7, 1892.
## 2 Teacher 4 Students it's time to learn. [Student di
## 3 Multiple Students Yes teacher we're ready to learn.
## 4 Teacher 4 Let's read this terrific book together.
## Be aware of the need to `skip` non transcript lines
## Incorrect read; Needed to use `skip`
read.transcript(doc2)
Error in data.frame(X1 = speaker, X2 = pvalues, stringsAsFactors = FALSE) :
arguments imply differing number of rows: 7, 8
## Correct: Used `skip`
dat3 <- read.transcript(doc2, skip = 1)
truncdf(dat3, 40)
## X1 X2
## 1 Researcher 2 October 7, 1892.
## 2 Teacher 4 Students it's time to learn. [Student di
## 3 Multiple Students Yes teacher we're ready to learn.
## 4 [Cross Talk 3 00]
## 5 Teacher 4 Let's read this terrific book together.
## Be Aware of the `sep` Used
## Incorrect Read; Wrong `sep` Provided (used default `:`)
read.transcript(doc3, skip = 1)
##Dialogue and Person Columns Mixed Inappropriately
## X1
## 1 [Cross Talk 3
## X2
## 1 Teacher 4-Students it's time to learn. [Student discussion; unintelligible] Multiple Students-Yes teacher we're ready to learn. 00] Teacher 4-Let's read this terrific book together. It's called Moo Baa La La La and what was I going to ... Oh yes The story is by Sandra Boynton. A cow says Moo. A Sheep says Baa. Three singing pigs say LA LA LA! "No, no!" you say, that isn't right. The pigs say oink all day and night. Rhinoceroses snort and snuff. And little dogs go ruff ruff ruff! Some other dogs go bow wow wow! And cats and kittens say Meow! Quack! Says the duck. A horse says neigh. It's quiet now. What do you say?
## Correct `sep` Used
dat4 <- read.transcript(doc3, sep = "-", skip = 1)
truncdf(dat4, 40)
## X1 X2
## 1 Teacher 4 Students it's time to learn. [Student di
## 2 Multiple Students Yes teacher we're ready to learn. [Cross
## 3 Teacher 4 Let's read this terrific book together.
## Read In .xlsx Data
dat5 <- read.transcript(doc4)
truncdf(dat5, 40)
## V1 V2
## 1 Researcher 2: October 7, 1892.
## 2 Teacher 4: Students it's time to learn.
## 3
## 4 Multiple Students: Yes teacher we're ready to learn.
## 5
## 6 Teacher 4: Let's read this terrific book together.
## Reading In Text
trans <- "sam: Computer is fun. Not too fun.
greg: No it's not, it's dumb.
teacher: What should we do?
sam: You liar, it stinks!"
read.transcript(text=trans)
## V1 V2
## 1 sam Computer is fun. Not too fun.
## 2 greg No its not, its dumb.
## 3 teacher What should we do?
## 4 sam You liar, it stinks!
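The roles of skip and sep in the examples above can be pictured with a small base R sketch. This is only an illustration of the idea, under the assumption that each transcript line has the form "person<sep>dialogue"; it is not how read.transcript is implemented, and the `raw_lines`, `body`, and `dat` names are hypothetical.

```r
# Hypothetical raw transcript: one extraneous header line, then person: dialogue
raw_lines <- c("Transcript recorded in class",
               "Teacher 4: Students it's time to learn.",
               "Multiple Students: Yes teacher we're ready to learn.")

skip <- 1    # number of leading lines to drop
sep  <- ":"  # separator between the person and the dialogue

# Drop the first `skip` lines (also safe when skip = 0)
body <- raw_lines[seq_along(raw_lines) > skip]

# Split each line at the FIRST occurrence of the separator only
pos <- regexpr(sep, body, fixed = TRUE)
dat <- data.frame(
  person   = trimws(substr(body, 1, pos - 1)),
  dialogue = trimws(substr(body, pos + 1, nchar(body))),
  stringsAsFactors = FALSE
)
dat
```

A line that lacks the separator would not split sensibly in this sketch, which mirrors the advice that a separator must be inserted manually when it is missing.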
The dir_map function enables the researcher to produce multiple lines of code, one line with read.transcript for each file in a directory, which is then optionally copied to the clipboard for easy insertion into a script. Note that setting the argument use.path = FALSE may allow the code to be more portable in that a static path is not supplied to the read.transcript scripts.
♦ Reading In Data- dir_map ♦
(DIR <- system.file("extdata/transcripts", package = "qdap"))
dir_map(DIR)
…will produce…
dat1 <- read.transcript('~/extdata/transcripts/trans1.docx', col.names = c('person', 'dialogue'), skip = 0)
dat2 <- read.transcript('~/extdata/transcripts/trans2.docx', col.names = c('person', 'dialogue'), skip = 0)
dat3 <- read.transcript('~/extdata/transcripts/trans3.docx', col.names = c('person', 'dialogue'), skip = 0)
dat4 <- read.transcript('~/extdata/transcripts/trans4.xlsx', col.names = c('person', 'dialogue'), skip = 0)
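The code-generation idea behind dir_map can be sketched in base R. The helper `make_read_lines` below is hypothetical (it is not part of qdap) and only builds character strings; it never calls read.transcript, and the temporary directory with empty placeholder files exists purely for the demonstration.

```r
# Hypothetical helper: build one read.transcript() line per file in `dir`
make_read_lines <- function(dir, col.names = c("person", "dialogue")) {
  files <- list.files(dir)
  sprintf("dat%d <- read.transcript('%s', col.names = c(%s), skip = 0)",
          seq_along(files),
          file.path(dir, files),
          paste0("'", col.names, "'", collapse = ", "))
}

# Demonstration on a throwaway directory holding two empty placeholder files
tmp <- file.path(tempdir(), "dir_map_sketch")
dir.create(tmp, showWarnings = FALSE)
invisible(file.create(file.path(tmp, c("trans1.docx", "trans2.docx"))))

code_lines <- make_read_lines(tmp)
writeLines(code_lines)   # paste the printed lines into a script

unlink(tmp, recursive = TRUE)  # clean up
```

Generating text rather than evaluating it leaves the researcher free to adjust skip or sep per file before running the lines.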
The mcsv_x family of functions is utilized to read (mcsv_r) and write (mcsv_w) multiple csv files at once. mcsv_w takes an arbitrary number of dataframes and outputs them to the supplied directory (dir = ?). An attempt will be made to output the dataframes from qdap functions that output lists of dataframes. Note that dataframes that contain columns that are lists must be condensed with the condense function prior to writing with other R dataframe writing functions (e.g., write.csv). By default mcsv_w attempts to utilize condense.
The mcsv_r function reads multiple files at once and then assigns the dataframes to identically named objects (minus the file extension) in the global environment. Additionally, all of the dataframes that are read in are also assigned to an inclusive list (named L1 by default).
♦ Reading and Writing Multiple csvs ♦
## Make new minimal data sets
mtcarsb <- mtcars[1:5, ]; CO2b <- CO2[1:5, ]
## Write multiple csvs and assign the directory path to `a`
a <- mcsv_w(mtcarsb, CO2b, dir="foo")
## New data sets gone from .GlobalEnv
rm("mtcarsb", "CO2b")
## View the files in `a` and assign to `nms`
(nms <- dir(a))
## Read in and notice the dataframes have been assigned in .GlobalEnv
mcsv_r(file.path(a, nms))
mtcarsb; CO2b
L1
## The dataframe names and list of dataframe can be altered
mcsv_r(file.path(a, nms), a.name = paste0("bot", 1:2), l.name = "bots_stink")
bot1; bot2
bots_stink
## Clean up
delete("foo")
♦ Writing Lists of Dataframes to csvs ♦
## polarity and termco produce lists of dataframes
poldat <- with(DATA, polarity(state, person))
term <- c("the ", "she", " wh")
termdat <- with(raj.act.1, termco(dialogue, person, term))
## View the lists of dataframes
str(poldat); str(termdat)
## Write the lists of dataframes to csv
mcsv_w(poldat, termdat, mtcars, CO2, dir="foo2")
## Clean up
delete("foo2")
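For readers who want to see the write-many/read-many pattern without qdap, here is a rough base R sketch. The helpers `write_many` and `read_many` are hypothetical names, the sketch returns a named list rather than assigning objects into the global environment, and it does not handle list columns the way condense does.

```r
# Hypothetical helper: write each dataframe in a named list to <dir>/<name>.csv
write_many <- function(dfs, dir) {
  dir.create(dir, showWarnings = FALSE)
  for (nm in names(dfs)) {
    write.csv(dfs[[nm]], file.path(dir, paste0(nm, ".csv")), row.names = FALSE)
  }
  invisible(dir)
}

# Hypothetical helper: read every csv in `dir` into a list named after the files
read_many <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  out <- lapply(files, read.csv, stringsAsFactors = FALSE)
  names(out) <- tools::file_path_sans_ext(basename(files))
  out
}

tmp <- file.path(tempdir(), "mcsv_sketch")
write_many(list(mtcarsb = mtcars[1:5, ], irisb = iris[1:5, ]), tmp)
dat_list <- read_many(tmp)
names(dat_list)

unlink(tmp, recursive = TRUE)  # clean up
```

Keeping the results in one list is a deliberate simplification; assigning each element to its own object is what mcsv_r adds on top.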
The nature of dialogue data makes it large and cumbersome to view in R. This section explores qdap tools designed for more comfortable viewing of R dialogue oriented text dataframes.
The _truncdf family of functions (trunc + dataframe = truncdf) is designed to truncate the width of columns and number of rows in dataframes and lists of dataframes. The l and h in front of trunc stand for list and head and are extensions of truncdf. qview is a wrapper for htruncdf that also displays number of rows, columns, and the dataframe name.
♦ Truncated Data Viewing ♦
truncdf(raj[1:10, ])
## person dialogue act
## 1 Sampson Gregory, o 1
## 2 Gregory No, for th 1
## 3 Sampson I mean, an 1
## 4 Gregory Ay, while 1
## 5 Sampson I strike q 1
## 6 Gregory But thou a 1
## 7 Sampson A dog of t 1
## 8 Gregory To move is 1
## 9 Sampson A dog of t 1
## 10 Gregory That shows 1
truncdf(raj[1:10, ], 40)
## person dialogue act
## 1 Sampson Gregory, o my word, we'll not carry coal 1
## 2 Gregory No, for then we should be colliers. 1
## 3 Sampson I mean, an we be in choler, we'll draw. 1
## 4 Gregory Ay, while you live, draw your neck out o 1
## 5 Sampson I strike quickly, being moved. 1
## 6 Gregory But thou art not quickly moved to strike 1
## 7 Sampson A dog of the house of Montague moves me. 1
## 8 Gregory To move is to stir; and to be valiant is 1
## 9 Sampson A dog of that house shall move me to sta 1
## 10 Gregory That shows thee a weak slave; for the we 1
htruncdf(raj)
## person dialogue act
## 1 Sampson Gregory, o 1
## 2 Gregory No, for th 1
## 3 Sampson I mean, an 1
## 4 Gregory Ay, while 1
## 5 Sampson I strike q 1
## 6 Gregory But thou a 1
## 7 Sampson A dog of t 1
## 8 Gregory To move is 1
## 9 Sampson A dog of t 1
## 10 Gregory That shows 1
htruncdf(raj, 20)
## person dialogue act
## 1 Sampson Gregory, o 1
## 2 Gregory No, for th 1
## 3 Sampson I mean, an 1
## 4 Gregory Ay, while 1
## 5 Sampson I strike q 1
## 6 Gregory But thou a 1
## 7 Sampson A dog of t 1
## 8 Gregory To move is 1
## 9 Sampson A dog of t 1
## 10 Gregory That shows 1
## 11 Sampson True; and 1
## 12 Gregory The quarre 1
## 13 Sampson 'Tis all o 1
## 14 Gregory The heads 1
## 15 Sampson Ay, the he 1
## 16 Gregory They must 1
## 17 Sampson Me they sh 1
## 18 Gregory 'Tis well 1
## 19 Sampson My naked w 1
## 20 Gregory How! turn 1
htruncdf(raj, ,20)
## person dialogue act
## 1 Sampson Gregory, o my word, 1
## 2 Gregory No, for then we shou 1
## 3 Sampson I mean, an we be in 1
## 4 Gregory Ay, while you live, 1
## 5 Sampson I strike quickly, be 1
## 6 Gregory But thou art not qui 1
## 7 Sampson A dog of the house o 1
## 8 Gregory To move is to stir; 1
## 9 Sampson A dog of that house 1
## 10 Gregory That shows thee a we 1
ltruncdf(rajPOS, width = 4)
## $text
## data
## 1 Greg
## 2 No,
## 3 I me
## 4 Ay,
## 5 I st
## 6 But
##
## $POStagged
## POSt POSt word
## 1 greg c("N 8
## 2 no/D c("D 7
## 3 i/PR c("P 9
## 4 ay/N c("N 11
## 5 i/VB c("V 5
## 6 but/ c("C 8
##
## $POSprop
## wrd. prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop prop
## 1 8 0 0 0 0 0 0 0 12.5 0 0 0 0 25 0 0 12.5 0 0 0 12.5 25 0 0 0 0 0 12.5 0 0 0 0 0 0 0 0 0
## 2 7 0 0 0 0 14.2 0 0 14.2 0 0 0 14.2 0 0 0 14.2 0 0 14.2 0 14.2 0 0 0 0 0 14.2 0 0 0 0 0 0 0 0 0
## 3 9 0 0 0 0 11.1 0 0 11.1 0 0 0 0 11.1 0 0 0 0 0 22.2 0 11.1 0 0 0 0 0 22.2 0 0 0 11.1 0 0 0 0 0
## 4 11 0 0 0 0 9.09 0 0 27.2 0 0 0 0 27.2 0 0 0 0 0 9.09 9.09 0 0 0 0 0 0 9.09 0 0 0 9.09 0 0 0 0 0
## 5 5 0 0 0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 20 40 0 0 0 0 0 0
## 6 8 0 0 12.5 0 12.5 0 0 0 0 0 0 0 12.5 0 0 0 0 0 0 0 25 0 0 0 12.5 0 12.5 12.5 0 0 0 0 0 0 0 0
##
## $POSfreq
## wrd. , . CC CD DT EX FW IN JJ JJR JJS MD NN NNP NNPS NNS PDT POS PRP PRP$ RB RBR RBS RP TO UH VB VBD VBG VBN VBP VBZ WDT WP WP$ WRB
## 1 8 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0 1 0 0 0 1 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## 2 7 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## 3 9 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 2 0 1 0 0 0 0 0 2 0 0 0 1 0 0 0 0 0
## 4 11 0 0 0 0 1 0 0 3 0 0 0 0 3 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0
## 5 5 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 2 0 0 0 0 0 0
## 6 8 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0
##
## $POSrnp
## wrd. , . CC CD DT EX FW IN JJ JJR JJS MD NN NNP NNPS NNS PDT POS PRP PRP$ RB RBR RBS RP TO UH VB VBD VBG VBN VBP VBZ WDT WP WP$ WRB
## 1 8 0 0 0 0 0 0 0 1(12 0 0 0 0 2(25 0 0 1(12 0 0 0 1(12 2(25 0 0 0 0 0 1(12 0 0 0 0 0 0 0 0 0
## 2 7 0 0 0 0 1(14 0 0 1(14 0 0 0 1(14 0 0 0 1(14 0 0 1(14 0 1(14 0 0 0 0 0 1(14 0 0 0 0 0 0 0 0 0
## 3 9 0 0 0 0 1(11 0 0 1(11 0 0 0 0 1(11 0 0 0 0 0 2(22 0 1(11 0 0 0 0 0 2(22 0 0 0 1(11 0 0 0 0 0
## 4 11 0 0 0 0 1(9. 0 0 3(27 0 0 0 0 3(27 0 0 0 0 0 1(9. 1(9. 0 0 0 0 0 0 1(9. 0 0 0 1(9. 0 0 0 0 0
## 5 5 0 0 0 0 0 0 0 0 0 0 0 0 1(20 0 0 0 0 0 0 0 1(20 0 0 0 0 0 0 0 1(20 2(40 0 0 0 0 0 0
## 6 8 0 0 1(12 0 1(12 0 0 0 0 0 0 0 1(12 0 0 0 0 0 0 0 2(25 0 0 0 1(12 0 1(12 1(12 0 0 0 0 0 0 0 0
##
## $percent
## data
## 1 TRUE
##
## $zero.replace
## data
## 1 0
qview(raj)
## ========================================================================
## nrow = 840 ncol = 3 raj
## ========================================================================
## person dialogue act
## 1 Sampson Gregory, o 1
## 2 Gregory No, for th 1
## 3 Sampson I mean, an 1
## 4 Gregory Ay, while 1
## 5 Sampson I strike q 1
## 6 Gregory But thou a 1
## 7 Sampson A dog of t 1
## 8 Gregory To move is 1
## 9 Sampson A dog of t 1
## 10 Gregory That shows 1
qview(CO2)
## ========================================================================
## nrow = 84 ncol = 5 CO2
## ========================================================================
## Plant Type Treatment conc uptake
## 1 Qn1 Quebec nonchilled 95 16
## 2 Qn1 Quebec nonchilled 175 30.4
## 3 Qn1 Quebec nonchilled 250 34.8
## 4 Qn1 Quebec nonchilled 350 37.2
## 5 Qn1 Quebec nonchilled 500 35.3
## 6 Qn1 Quebec nonchilled 675 39.2
## 7 Qn1 Quebec nonchilled 1000 39.7
## 8 Qn2 Quebec nonchilled 95 13.6
## 9 Qn2 Quebec nonchilled 175 27.3
## 10 Qn2 Quebec nonchilled 250 37.1
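The truncation idea shown above can be approximated in a few lines of base R. The function `trunc_df` is a hypothetical stand-in, not the qdap implementation; note that this sketch converts every column to character, which is acceptable for viewing but not for analysis.

```r
# Hypothetical helper: keep the first `rows` rows and the first `width`
# characters of every column (all columns become character)
trunc_df <- function(df, width = 10, rows = nrow(df)) {
  out <- head(df, rows)
  out[] <- lapply(out, function(col) substr(as.character(col), 1, width))
  out
}

demo <- data.frame(
  person   = c("Sampson", "Gregory"),
  dialogue = c("Gregory, o my word, we'll not carry coals.",
               "No, for then we should be colliers."),
  stringsAsFactors = FALSE
)

trunc_df(demo, width = 10)
```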
Many qdap objects are lists that print as a single dataframe, though the rest of the objects in the list are available. The lview function unclasses the object and assigns the class "list" so that every element is printed.
lview(question_type(DATA.SPLIT$state, DATA.SPLIT$person))
## $raw
## person raw.text n.row endmark
## 4 teacher What should we do? 4 ?
## 7 sally How can we be certain? 7 ?
## 10 sally What are you talking about? 10 ?
## 11 researcher Shall we move on? 11 ?
## 15 greg You already? 15 ?
## strip.text q.type
## 4 what should we do what
## 7 how can we be certain how
## 10 what are you talking about what
## 11 shall we move on shall
## 15 you already implied_do/does/did
##
## $count
## person tot.quest what how shall implied_do/does/did
## 1 greg 1 0 0 0 1
## 2 researcher 1 0 0 1 0
## 3 sally 2 1 1 0 0
## 4 teacher 1 1 0 0 0
## 5 sam 0 0 0 0 0
##
## $prop
## person tot.quest what how shall implied_do/does/did
## 1 greg 1 0 0 0 100
## 2 researcher 1 0 0 100 0
## 3 sally 2 50 50 0 0
## 4 teacher 1 100 0 0 0
## 5 sam 0 0 0 0 0
##
## $rnp
## person tot.quest what how shall implied_do/does/did
## 1 greg 1 0 0 0 1(100%)
## 2 researcher 1 0 0 1(100%) 0
## 3 sally 2 1(50%) 1(50%) 0 0
## 4 teacher 1 1(100%) 0 0 0
## 5 sam 0 0 0 0 0
##
## $inds
## [1] 4 7 10 11 15
##
## $missing
## integer(0)
##
## $percent
## [1] TRUE
##
## $zero.replace
## [1] 0
##
## $digits
## [1] 2
By default text data (character vectors) are displayed as right justified in R. This can be difficult and unnatural to read, particularly as the length of the sentences increases. The left_just function creates a more natural left justification of text. Note that left_just inserts spaces to achieve the justification. This could interfere with analysis and therefore the output from left_just should only be used for visualization purposes, not analysis.
♦ Justified Data Viewing ♦
## The unnatural state of R text data
DATA
## person sex adult state code
## 1 sam m 0 Computer is fun. Not too fun. K1
## 2 greg m 0 No it's not, it's dumb. K2
## 3 teacher m 1 What should we do? K3
## 4 sam m 0 You liar, it stinks! K4
## 5 greg m 0 I am telling the truth! K5
## 6 sally f 0 How can we be certain? K6
## 7 greg m 0 There is no way. K7
## 8 sam m 0 I distrust you. K8
## 9 sally f 0 What are you talking about? K9
## 10 researcher f 1 Shall we move on? Good then. K10
## 11 greg m 0 I'm hungry. Let's eat. You already? K11
## left just to the rescue
left_just(DATA)
## person sex adult state code
## 1 sam m 0 Computer is fun. Not too fun. K1
## 2 greg m 0 No it's not, it's dumb. K2
## 3 teacher m 1 What should we do? K3
## 4 sam m 0 You liar, it stinks! K4
## 5 greg m 0 I am telling the truth! K5
## 6 sally f 0 How can we be certain? K6
## 7 greg m 0 There is no way. K7
## 8 sam m 0 I distrust you. K8
## 9 sally f 0 What are you talking about? K9
## 10 researcher f 1 Shall we move on? Good then. K10
## 11 greg m 0 I'm hungry. Let's eat. You already? K11
## Left just select column(s)
left_just(DATA, c("sex", "state"))
## person sex adult state code
## 1 sam m 0 Computer is fun. Not too fun. K1
## 2 greg m 0 No it's not, it's dumb. K2
## 3 teacher m 1 What should we do? K3
## 4 sam m 0 You liar, it stinks! K4
## 5 greg m 0 I am telling the truth! K5
## 6 sally f 0 How can we be certain? K6
## 7 greg m 0 There is no way. K7
## 8 sam m 0 I distrust you. K8
## 9 sally f 0 What are you talking about? K9
## 10 researcher f 1 Shall we move on? Good then. K10
## 11 greg m 0 I'm hungry. Let's eat. You already? K11
left_just(CO2[1:15,])
## Plant Type Treatment conc uptake
## 1 Qn1 Quebec nonchilled 95 16
## 2 Qn1 Quebec nonchilled 175 30.4
## 3 Qn1 Quebec nonchilled 250 34.8
## 4 Qn1 Quebec nonchilled 350 37.2
## 5 Qn1 Quebec nonchilled 500 35.3
## 6 Qn1 Quebec nonchilled 675 39.2
## 7 Qn1 Quebec nonchilled 1000 39.7
## 8 Qn2 Quebec nonchilled 95 13.6
## 9 Qn2 Quebec nonchilled 175 27.3
## 10 Qn2 Quebec nonchilled 250 37.1
## 11 Qn2 Quebec nonchilled 350 41.8
## 12 Qn2 Quebec nonchilled 500 40.6
## 13 Qn2 Quebec nonchilled 675 41.4
## 14 Qn2 Quebec nonchilled 1000 44.3
## 15 Qn3 Quebec nonchilled 95 16.2
right_just(left_just(CO2[1:15,]))
## Plant Type Treatment conc uptake
## 1 Qn1 Quebec nonchilled 95 16
## 2 Qn1 Quebec nonchilled 175 30.4
## 3 Qn1 Quebec nonchilled 250 34.8
## 4 Qn1 Quebec nonchilled 350 37.2
## 5 Qn1 Quebec nonchilled 500 35.3
## 6 Qn1 Quebec nonchilled 675 39.2
## 7 Qn1 Quebec nonchilled 1000 39.7
## 8 Qn2 Quebec nonchilled 95 13.6
## 9 Qn2 Quebec nonchilled 175 27.3
## 10 Qn2 Quebec nonchilled 250 37.1
## 11 Qn2 Quebec nonchilled 350 41.8
## 12 Qn2 Quebec nonchilled 500 40.6
## 13 Qn2 Quebec nonchilled 675 41.4
## 14 Qn2 Quebec nonchilled 1000 44.3
## 15 Qn3 Quebec nonchilled 95 16.2
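The space-padding mechanism mentioned above can be seen with base R's format function. This is a sketch of the general idea only and is not claimed to be how left_just works internally; the `state` vector is illustrative.

```r
state <- c("Computer is fun.", "No it's not.", "Hi!")

# Pad every string on the right so all share the width of the longest one
padded <- format(state, justify = "left")

nchar(state)    # original lengths differ
nchar(padded)   # padded lengths are all equal

# The padding is real whitespace, which is why padded text should not be analyzed
identical(trimws(padded, which = "right"), state)
```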
A task of many analyses is to search a dataframe for a particular phrase and return those rows/observations that contain that term. The researcher may optionally choose to specify a particular column to search (column.name) or search the entire dataframe.
♦ Search Dataframes ♦
(SampDF <- data.frame("islands"=names(islands)[1:32],mtcars, row.names=NULL))
## islands mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Africa 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 Antarctica 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 Asia 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4 Australia 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 5 Axel Heiberg 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 6 Baffin 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 7 Banks 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 8 Borneo 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 9 Britain 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 10 Celebes 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## 11 Celon 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## 12 Cuba 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## 13 Devon 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## 14 Ellesmere 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## 15 Europe 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## 16 Greenland 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## 17 Hainan 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## 18 Hispaniola 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 19 Hokkaido 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 20 Honshu 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 21 Iceland 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## 22 Ireland 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## 23 Java 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## 24 Kyushu 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## 25 Luzon 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## 26 Madagascar 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 27 Melville 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## 28 Mindanao 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## 29 Moluccas 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## 30 New Britain 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## 31 New Guinea 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## 32 New Zealand (N) 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Search(SampDF, "Cuba", "islands")
## islands mpg cyl disp hp drat wt qsec vs am gear carb
## 12 Cuba 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
Search(SampDF, "New", "islands")
## islands mpg cyl disp hp drat wt qsec vs am gear carb
## 8 Borneo 24.4 4 146.7 62 3.69 3.19 20.0 1 0 4 2
## 30 New Britain 19.7 6 145.0 175 3.62 2.77 15.5 0 1 5 6
## 31 New Guinea 15.0 8 301.0 335 3.54 3.57 14.6 0 1 5 8
## 32 New Zealand (N) 21.4 4 121.0 109 4.11 2.78 18.6 1 1 4 2
Search(SampDF, "Ho")
## islands mpg cyl disp hp drat wt qsec vs am gear carb
## 5 Axel Heiberg 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 8 Borneo 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 11 Celon 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## 13 Devon 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## 15 Europe 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## 17 Hainan 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## 18 Hispaniola 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 19 Hokkaido 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 20 Honshu 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 24 Kyushu 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## 25 Luzon 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## 28 Mindanao 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## 29 Moluccas 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Search(SampDF, "Ho", max.distance = 0)
## islands mpg cyl disp hp drat wt qsec vs am gear carb
## 19 Hokkaido 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 20 Honshu 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Search(SampDF, "Axel Heiberg")
## islands mpg cyl disp hp drat wt qsec vs am gear carb
## 5 Axel Heiberg 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
Search(SampDF, 19) #too much tolerance in max.distance
## islands mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Africa 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 Antarctica 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 Asia 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4 Australia 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 5 Axel Heiberg 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 6 Baffin 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 7 Banks 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 8 Borneo 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 9 Britain 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 10 Celebes 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## 11 Celon 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## 12 Cuba 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## 13 Devon 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## 14 Ellesmere 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## 15 Europe 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## 16 Greenland 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## 17 Hainan 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## 18 Hispaniola 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 19 Hokkaido 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 20 Honshu 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 21 Iceland 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## 22 Ireland 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## 23 Java 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## 24 Kyushu 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## 25 Luzon 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## 26 Madagascar 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 27 Melville 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## 28 Mindanao 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## 29 Moluccas 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## 30 New Britain 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## 31 New Guinea 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## 32 New Zealand (N) 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Search(SampDF, 19, max.distance = 0)
## islands mpg cyl disp hp drat wt qsec vs am gear carb
## 4 Australia 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 8 Borneo 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 10 Celebes 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## 18 Hispaniola 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 20 Honshu 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 25 Luzon 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## 30 New Britain 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Search(SampDF, 19, "qsec", max.distance = 0)
## islands mpg cyl disp hp drat wt qsec vs am gear carb
## 4 Australia 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 18 Hispaniola 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 20 Honshu 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
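The difference between tolerant and exact matching in the searches above can be explored with base R's agrepl, which performs approximate matching and accepts a max.distance argument. The helper `fuzzy_rows` is hypothetical and only sketches the row-filtering idea; its default of 0.1 is agrepl's own documented default, not necessarily the default used by Search.

```r
# Hypothetical helper: return rows where `term` approximately matches
# any of the chosen columns (all columns when `column` is NULL)
fuzzy_rows <- function(df, term, column = NULL, max.distance = 0.1) {
  cols <- if (is.null(column)) names(df) else column
  hits <- lapply(df[cols], function(col) {
    agrepl(term, as.character(col), max.distance = max.distance)
  })
  df[Reduce(`|`, hits), , drop = FALSE]
}

demo <- data.frame(islands = c("Borneo", "Cuba", "New Guinea"),
                   size    = c(280, 43, 306),
                   stringsAsFactors = FALSE)

# max.distance = 0 requires the term to appear exactly as a substring
fuzzy_rows(demo, "New", "islands", max.distance = 0)

# Searching every column, numbers are compared as text
fuzzy_rows(demo, "43", max.distance = 0)
```

Raising max.distance loosens the match, which is how unrelated rows can be returned when the tolerance is too generous.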
This manual arranges functions into categories in the order a researcher is likely to use them. The Generic qdap Tools section does not fit this convention; however, because these tools may be used throughout all stages of analysis, it is important that the reader is familiar with them. Note that after reading in transcript data the researcher will likely find that the next step is to parse the dataframe utilizing the techniques found in the Cleaning/Preparing the Data section.
Often it can be tedious to supply quotes to character vectors when dealing with large vectors. qcv replaces the typical c("A", "B", "C", "…") approach to creating character vectors. Instead the user supplies qcv(A, B, C, …). This format assumes single words separated by commas. If your data/string does not fit this approach, the terms and split arguments can be utilized in combination.
♦ Quick Character Vector ♦
qcv(I, like, dogs)
## [1] "I" "like" "dogs"
qcv(terms = "I like, big dogs", split = ",")
## [1] "I like" "big dogs"
qcv(I, like, dogs, space.wrap = TRUE)
## [1] " I " " like " " dogs "
qcv(I, like, dogs, trailing = TRUE)
## [1] "I " "like " "dogs "
qcv(I, like, dogs, leading = TRUE)
## [1] " I" " like" " dogs"
qcv(terms = "mpg cyl disp hp drat wt qsec vs am gear carb")
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
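Unquoted input of this kind relies on R's non-standard evaluation. The sketch below shows one way the trick can be done in base R; `quick_cv` is a hypothetical name and the code is an illustration of the mechanism, not the source of qcv.

```r
# Hypothetical helper: capture the unevaluated arguments and turn each into text
quick_cv <- function(...) {
  args <- as.list(substitute(list(...)))[-1L]  # drop the leading `list`
  vapply(args, deparse, character(1))
}

quick_cv(I, like, dogs)

# A string that is not comma separated can be split explicitly instead
strsplit("mpg cyl disp hp", " ", fixed = TRUE)[[1]]
```

Because the arguments are never evaluated, the words do not need to exist as objects. The same mechanism means that anything R parses as an expression rather than a single name (for example a word containing a hyphen) is not captured as one plain word.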
Often the researcher who deals with text data will need to look up values quickly and return an accompanying value. This is often called a dictionary, hash, or lookup. This can be used to find corresponding values or recode variables, etc. The lookup & %l% functions provide a fast environment lookup for single usage. The hash & hash_look/%hl% functions provide a fast environment lookup for multiple uses of the same hash table.
♦ lookup - Dictionary/Look Up Examples ♦
lookup(1:5, data.frame(1:4, 11:14))
## [1] 11 12 13 14 NA
lookup(LETTERS[1:5], data.frame(LETTERS[1:4], 11:14), missing = NULL)
## [1] "11" "12" "13" "14" "E"
lookup(LETTERS[1:5], data.frame(LETTERS[1:5], 100:104))
## [1] 100 101 102 103 104
## Fast with very large vectors
key <- data.frame(x=1:2, y=c("A", "B"))
set.seed(10)
big.vec <- sample(1:2, 3000000, T)
out <- lookup(big.vec, key)
out[1:20]
## [1] "B" "A" "A" "B" "A" "A" "A" "A" "B" "A" "B" "B" "A"
## [14] "B" "A" "A" "A" "A" "A" "B"
## Supply a named list of vectors to key.match
codes <- list(A=c(1, 2, 4),
B = c(3, 5),
C = 7,
D = c(6, 8:10)
)
lookup(1:10, codes) #or
## [1] "A" "A" "B" "A" "B" "D" "C" "D" "D" "D"
1:10 %l% codes
## [1] "A" "A" "B" "A" "B" "D" "C" "D" "D" "D"
## Supply a single vector to key.match and key.assign
lookup(mtcars$carb, sort(unique(mtcars$carb)),
c('one', 'two', 'three', 'four', 'six', 'eight'))
## [1] "four" "four" "one" "one" "two" "one" "four" "two"
## [9] "two" "four" "four" "three" "three" "three" "four" "four"
## [17] "four" "one" "two" "one" "one" "two" "two" "four"
## [25] "two" "one" "two" "two" "four" "six" "eight" "two"
lookup(mtcars$carb, sort(unique(mtcars$carb)),
seq(10, 60, by=10))
## [1] 40 40 10 10 20 10 40 20 20 40 40 30 30 30 40 40 40 10 20 10 10 20 20
## [24] 40 20 10 20 20 40 50 60 20
♦ hash/hash_look - Dictionary/Look Up Examples ♦
## Create a fake data set of hash values
(DF <- aggregate(mpg~as.character(carb), mtcars, mean))
## as.character(carb) mpg
## 1 1 25.34
## 2 2 22.40
## 3 3 16.30
## 4 4 15.79
## 5 6 19.70
## 6 8 15.00
## Use `hash` to create a lookup environment
hashTab <- hash(DF)
## Create a vector to lookup
x <- sample(DF[, 1], 20, TRUE)
## Lookup x in the hash with `hash_look` or `%hl%`
hash_look(x, hashTab)
## [1] 15.79 25.34 22.40 15.79 15.00 15.00 15.79 19.70 19.70 15.00 19.70
## [12] 15.00 15.79 19.70 16.30 25.34 19.70 25.34 16.30 22.40
x %hl% hashTab
## [1] 15.79 25.34 22.40 15.79 15.00 15.00 15.79 19.70 19.70 15.00 19.70
## [12] 15.00 15.79 19.70 16.30 25.34 19.70 25.34 16.30 22.40
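The notion of an environment used as a reusable hash table can be illustrated in base R. The helpers `make_hash` and `hash_get` are hypothetical names written for this sketch; they assume character keys and numeric values, and they are not the qdap implementation.

```r
# Hypothetical helper: store each key/value pair in a hashed environment
make_hash <- function(keys, values) {
  env <- new.env(hash = TRUE)
  for (i in seq_along(keys)) {
    assign(as.character(keys[i]), values[i], envir = env)
  }
  env
}

# Hypothetical helper: look terms up in the environment, NA when absent
hash_get <- function(terms, env, missing = NA_real_) {
  vapply(as.character(terms), function(k) {
    if (exists(k, envir = env, inherits = FALSE)) get(k, envir = env) else missing
  }, numeric(1), USE.NAMES = FALSE)
}

carb_mpg <- make_hash(c("1", "2", "3"), c(25.34, 22.40, 16.30))

# Build once, then reuse the same table for any number of lookups
hash_get(c("2", "1", "9"), carb_mpg)
hash_get("3", carb_mpg)
```

The table is built once and then queried repeatedly, which is the situation the hash family is described as serving.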
Researchers working with transcripts may need to convert between the traditional Hours:Minutes:Seconds format and seconds. The hms2sec and sec2hms functions provide this conversion.
♦ Time Conversion Examples ♦
hms2sec(c("02:00:03", "04:03:01"))
## [1] 7203 14581
hms2sec(sec2hms(c(222, 1234, 55)))
## [1] 222 1234 55
sec2hms(c(256, 3456, 56565))
## [1] 00:04:16 00:57:36 15:42:45
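The arithmetic behind the conversion is simple enough to sketch in base R. The helpers below, `hms_to_sec` and `sec_to_hms`, are hypothetical names used for illustration and are not the qdap functions themselves; they assume well-formed "HH:MM:SS" input and non-negative whole seconds.

```r
# Hypothetical helper: "HH:MM:SS" strings to seconds
hms_to_sec <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# Hypothetical helper: seconds to zero-padded "HH:MM:SS" strings
sec_to_hms <- function(s) {
  sprintf("%02d:%02d:%02d",
          as.integer(s %/% 3600),
          as.integer((s %% 3600) %/% 60),
          as.integer(s %% 60))
}

hms_to_sec(c("02:00:03", "04:03:01"))   # 7203 14581
sec_to_hms(c(256, 3456, 56565))         # "00:04:16" "00:57:36" "15:42:45"
```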
url_dl is a function used to provide qdap users with examples taken from the Internet. It is useful for most document downloads from the Internet.
♦ url_dl Examples ♦
## Example 1 (download from dropbox)
# download transcript of the debate to working directory
url_dl(pres.deb1.docx, pres.deb2.docx, pres.deb3.docx)
# load multiple files with read.transcript and assign them to objects
dat1 <- read.transcript("pres.deb1.docx", c("person", "dialogue"))
dat2 <- read.transcript("pres.deb2.docx", c("person", "dialogue"))
dat3 <- read.transcript("pres.deb3.docx", c("person", "dialogue"))
docs <- qcv(pres.deb1.docx, pres.deb2.docx, pres.deb3.docx)
dir() %in% docs
delete(docs) #remove the documents
dir() %in% docs
## Example 2 (quoted string urls)
url_dl("https://dl.dropboxusercontent.com/u/61803503/qdap.pdf",
"http://www.cran.r-project.org/doc/manuals/R-intro.pdf")
## Clean up
delete(qcv(qdap.pdf, R-intro.pdf))
After reading in the data the researcher may want to remove all non-dialogue text, such as transcriber annotations, from the transcript dataframe. This can be accomplished with the bracketX family of functions, which remove text found between two brackets (( ), { }, [ ], < >), or more generally with genX and genXtract, which remove or extract text between two character reference points.
If the bracketed text is useful to the analysis, it is recommended that the researcher assign the un-bracketed text to a new column.
♦ Extracting Chunks 1- bracketX/bracketXtract ♦
## A fake data set
examp <- structure(list(person = structure(c(1L, 2L, 1L, 3L),
.Label = c("bob", "greg", "sue"), class = "factor"), text =
c("I love chicken [unintelligible]!",
"Me too! (laughter) It's so good.[interrupting]",
"Yep it's awesome {reading}.", "Agreed. {is so much fun}")), .Names =
c("person", "text"), row.names = c(NA, -4L), class = "data.frame")
examp
## person text
## 1 bob I love chicken [unintelligible]!
## 2 greg Me too! (laughter) It's so good.[interrupting]
## 3 bob Yep it's awesome {reading}.
## 4 sue Agreed. {is so much fun}
bracketX(examp$text, "square")
## [1] "I love chicken !" "Me too! (laughter) It's so good."
## [3] "Yep it's awesome {reading} ." "Agreed. {is so much fun}"
bracketX(examp$text, "curly")
## [1] "I love chicken [unintelligible] !"
## [2] "Me too! (laughter) It's so good. [interrupting]"
## [3] "Yep it's awesome ."
## [4] "Agreed."
bracketX(examp$text, c("square", "round"))
## [1] "I love chicken !" "Me too! It's so good."
## [3] "Yep it's awesome {reading} ." "Agreed. {is so much fun}"
bracketX(examp$text)
## [1] "I love chicken !" "Me too! It's so good." "Yep it's awesome ."
## [4] "Agreed."
bracketXtract(examp$text, "square")
## $square1
## [1] "unintelligible"
##
## $square2
## [1] "interrupting"
##
## $square3
## character(0)
##
## $square4
## character(0)
bracketXtract(examp$text, "curly")
## $curly1
## character(0)
##
## $curly2
## character(0)
##
## $curly3
## [1] "reading"
##
## $curly4
## [1] "is so much fun"
bracketXtract(examp$text, c("square", "round"))
## [[1]]
## [1] "unintelligible"
##
## [[2]]
## [1] "interrupting" "laughter"
##
## [[3]]
## character(0)
##
## [[4]]
## character(0)
bracketXtract(examp$text, c("square", "round"), merge = FALSE)
## $square
## $square[[1]]
## [1] "unintelligible"
##
## $square[[2]]
## [1] "interrupting"
##
## $square[[3]]
## character(0)
##
## $square[[4]]
## character(0)
##
##
## $round
## $round[[1]]
## character(0)
##
## $round[[2]]
## [1] "laughter"
##
## $round[[3]]
## character(0)
##
## $round[[4]]
## character(0)
bracketXtract(examp$text)
## $all1
## [1] "unintelligible"
##
## $all2
## [1] "laughter" "interrupting"
##
## $all3
## [1] "reading"
##
## $all4
## [1] "is so much fun"
bracketXtract(examp$text, with = TRUE)
## $all1
## [1] "[unintelligible]"
##
## $all2
## [1] "(laughter)" "[interrupting]"
##
## $all3
## [1] "{reading}"
##
## $all4
## [1] "{is so much fun}"
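For readers curious about what bracket removal and extraction involve, the following base R sketch handles square brackets only, using regular expressions. It is an approximation under that assumption, not the code bracketX or bracketXtract actually run, and it does not handle nested brackets.

```r
x <- c("I love chicken [unintelligible]!",
       "Me too! (laughter) It's so good.[interrupting]")

# Remove square-bracketed annotations, then tidy the leftover whitespace
no_square <- gsub("\\[[^]]*\\]", "", x)
no_square <- trimws(gsub("\\s+", " ", no_square))
no_square
# "I love chicken !"  "Me too! (laughter) It's so good."

# Extract the text inside square brackets, without the brackets themselves
inside <- regmatches(x, gregexpr("(?<=\\[)[^]]*(?=\\])", x, perl = TRUE))
inside
# a list holding "unintelligible" and "interrupting"
```

Keeping removal and extraction as separate steps makes it easy to store the cleaned text in a new column while retaining the annotations for later use.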
Often a researcher will want to extract some text from the transcript and put it back together. One example is the reconstructing of material read from a book, poem, play or other text. This information is generally dispersed throughout the dialogue (within classroom/teaching procedures). If this text is denoted with a particular identifying bracket such as curly braces this text can be extracted and then pasted back together.
♦ Extracting Chunks 2- Recombining Chunks ♦
paste2(bracketXtract(examp$text, "curly"), " ")
## [1] "reading is so much fun"
The researcher may need a more general extraction method that allows any left/right boundaries to be specified. This is useful because many qualitative transcription/coding programs have their own syntax for marking up dialogue events, and that markup must be parsed from the data set. The genX and genXtract functions have such capabilities.
♦ Extracting Chunks 3- genX/genXtract ♦
DATA$state
## [1] "Computer is fun. Not too fun."
## [2] "No it's not, it's dumb."
## [3] "What should we do?"
## [4] "You liar, it stinks!"
## [5] "I am telling the truth!"
## [6] "How can we be certain?"
## [7] "There is no way."
## [8] "I distrust you."
## [9] "What are you talking about?"
## [10] "Shall we move on? Good then."
## [11] "I'm hungry. Let's eat. You already?"
## Look at the difference in number 1 and 10 from above
genX(DATA$state, c("is", "we"), c("too", "on"))
## [1] "Computer fun."
## [2] "No it's not, it's dumb."
## [3] "What should we do?"
## [4] "You liar, it stinks!"
## [5] "I am telling the truth!"
## [6] "How can we be certain?"
## [7] "There is no way."
## [8] "I distrust you."
## [9] "What are you talking about?"
## [10] "Shall ? Good then."
## [11] "I'm hungry. Let's eat. You already?"
## A fake data set
x <- c("Where is the /big dog#?",
"I think he's @arunning@b with /little cat#.")
x
## [1] "Where is the /big dog#?"
## [2] "I think he's @arunning@b with /little cat#."
genXtract(x, c("/", "@a"), c("#", "@b"))
## [[1]]
## [1] "big dog"
##
## [[2]]
## [1] "little cat" "running"
## A fake data set
x2 <- c("Where is the L1big dogL2?",
"I think he's 98running99 with L1little catL2.")
x2
## [1] "Where is the L1big dogL2?"
## [2] "I think he's 98running99 with L1little catL2."
genXtract(x2, c("L1", 98), c("L2", 99))
## [[1]]
## [1] "big dog"
##
## [[2]]
## [1] "little cat" "running"
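Extraction between arbitrary literal markers can be sketched in base R with a Perl-style regular expression. The helper name `between` is hypothetical, and the sketch assumes one left/right pair per call and markers that do not themselves contain the sequence `\E`; it is not how genXtract is implemented.

```r
# Hypothetical helper: extract text between literal left/right markers.
# \\Q...\\E makes the markers literal, so characters such as "/" or "#"
# need no manual escaping.
between <- function(x, left, right) {
  pattern <- paste0("(?<=\\Q", left, "\\E).*?(?=\\Q", right, "\\E)")
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

x <- c("Where is the /big dog#?",
       "I think he's @arunning@b with /little cat#.")

between(x, "/", "#")     # "big dog" and "little cat"
between(x, "@a", "@b")   # character(0) and "running"
```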
After reading in the data, removing non-dialogue text (via bracketX), and viewing it, the researcher will want to find text rows that do not contain proper punctuation and/or that contain punctuation and no text. This is accomplished with the _truncdf family of functions and the potential_NA function as the researcher manually parses the original transcripts, makes alterations, and reads the data back into qdap. This important procedure is not automatic; it requires that the researcher give attention to detail in comparing the R dataframe with the original transcript.
♦ Identifying and Coding Missing Values ♦
## Create A Data Set With Punctuation and No Text
(DATA$state[c(3, 7, 10)] <- c(".", ".", NA))
## [1] "." "." NA
DATA
## person sex adult state code
## 1 sam m 0 Computer is fun. Not too fun. K1
## 2 greg m 0 No it's not, it's dumb. K2
## 3 teacher m 1 . K3
## 4 sam m 0 You liar, it stinks! K4
## 5 greg m 0 I am telling the truth! K5
## 6 sally f 0 How can we be certain? K6
## 7 greg m 0 . K7
## 8 sam m 0 I distrust you. K8
## 9 sally f 0 What are you talking about? K9
## 10 researcher f 1 <NA> K10
## 11 greg m 0 I'm hungry. Let's eat. You already? K11
potential_NA(DATA$state, 20)
## row text
## 1 3 .
## 2 7 .
## 3 8 I distrust you.
potential_NA(DATA$state)
## row text
## 1 3 .
## 2 7 .
## Use To Selectively Replace Cells With Missing Values
DATA$state[potential_NA(DATA$state, 20)$row[-c(3)]] <- NA
DATA
## person sex adult state code
## 1 sam m 0 Computer is fun. Not too fun. K1
## 2 greg m 0 No it's not, it's dumb. K2
## 3 teacher m 1 <NA> K3
## 4 sam m 0 You liar, it stinks! K4
## 5 greg m 0 I am telling the truth! K5
## 6 sally f 0 How can we be certain? K6
## 7 greg m 0 <NA> K7
## 8 sam m 0 I distrust you. K8
## 9 sally f 0 What are you talking about? K9
## 10 researcher f 1 <NA> K10
## 11 greg m 0 I'm hungry. Let's eat. You already? K11
## Reset DATA
DATA <- qdap::DATA
The researcher may wish to remove empty rows (using rm_empty_row) and/or rows that contain certain markers (using rm_row). Sometimes empty rows are read into the dataframe from the transcript. These rows should be removed from the data set entirely rather than denoted with NA. The rm_empty_row function removes completely empty rows (rows that contain only one or more blank spaces) from the dataframe.
♦ Remove Empty Rows♦
(dat <- rbind.data.frame(DATA[, c(1, 4)], matrix(rep(" ", 4),
ncol =2, dimnames=list(12:13, colnames(DATA)[c(1, 4)]))))
## person state
## 1 sam Computer is fun. Not too fun.
## 2 greg No it's not, it's dumb.
## 3 teacher What should we do?
## 4 sam You liar, it stinks!
## 5 greg I am telling the truth!
## 6 sally How can we be certain?
## 7 greg There is no way.
## 8 sam I distrust you.
## 9 sally What are you talking about?
## 10 researcher Shall we move on? Good then.
## 11 greg I'm hungry. Let's eat. You already?
## 12
## 13
rm_empty_row(dat)
## person state
## 1 sam Computer is fun. Not too fun.
## 2 greg No it's not, it's dumb.
## 3 teacher What should we do?
## 4 sam You liar, it stinks!
## 5 greg I am telling the truth!
## 6 sally How can we be certain?
## 7 greg There is no way.
## 8 sam I distrust you.
## 9 sally What are you talking about?
## 10 researcher Shall we move on? Good then.
## 11 greg I'm hungry. Let's eat. You already?
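The test for an "empty" row can be written out in base R as follows. This is a minimal sketch on a made-up data frame, assuming that a row counts as empty when every cell is blank after trimming whitespace; qdap's own rule may differ in detail.

```r
dat <- data.frame(person = c("sam", " ", "greg", ""),
                  state  = c("Computer is fun.", "  ", "No it's not.", " "),
                  stringsAsFactors = FALSE)

# A row is "empty" when every cell is blank after trimming whitespace
is_blank <- apply(dat, 1, function(row) all(trimws(row) == ""))

cleaned <- dat[!is_blank, , drop = FALSE]
rownames(cleaned) <- NULL   # renumber the remaining rows
cleaned                     # keeps only the "sam" and "greg" rows
```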
At other times the researcher may wish to use rm_row to remove rows from the dataframe/analysis based on transcription conventions or on demographic characteristics. In the example below the transcript is read in with a row beginning with [Cross Talk 3. This is a transcription convention, and we would want to remove such rows from the transcript. A second example shows the removal of people from the dataframe.
♦ Remove Selected Rows♦
## Read in transcript
dat2 <- read.transcript(system.file("extdata/transcripts/trans1.docx",
package = "qdap"))
truncdf(dat2, 40)
## X1 X2
## 1 Researcher 2 October 7, 1892.
## 2 Teacher 4 Students it's time to learn. [Student di
## 3 Multiple Students Yes teacher we're ready to learn.
## 4 [Cross Talk 3 00]
## 5 Teacher 4 Let's read this terrific book together.
## Use column names to remove rows
truncdf(rm_row(dat2, "X1", "[C"), 40)
## X1 X2
## 1 Researcher 2 October 7, 1892.
## 2 Teacher 4 Students it's time to learn. [Student di
## 3 Multiple Students Yes teacher we're ready to learn.
## 4 Teacher 4 Let's read this terrific book together.
## Use column numbers to remove rows (no cell in column 2 begins with "[C", so nothing is removed here)
truncdf(rm_row(dat2, 2, "[C"), 40)
## X1 X2
## 1 Researcher 2 October 7, 1892.
## 2 Teacher 4 Students it's time to learn. [Student di
## 3 Multiple Students Yes teacher we're ready to learn.
## 4 [Cross Talk 3 00]
## 5 Teacher 4 Let's read this terrific book together.
## Also remove people etc. from the analysis
rm_row(DATA, 1, c("sam", "greg"))
## person sex adult state code
## 1 teacher m 1 What should we do? K3
## 2 sally f 0 How can we be certain? K6
## 3 sally f 0 What are you talking about? K9
## 4 researcher f 1 Shall we move on? Good then. K10
An important step in the cleaning process is the removal of extra white spaces (use Trim) and escaped characters (use clean). The scrubber function wraps both Trim and clean and adds in the functionality of some of the replace_ family of functions.
♦ Remove Extra Spaces and Escaped Characters♦
x1 <- "I go \r
to the \tnext line"
x1
## [1] "I go \r\n to the \tnext line"
clean(x1)
## [1] "I go to the next line"
x2 <- c(" talkstats.com ", " really? ", " yeah")
x2
## [1] " talkstats.com " " really? " " yeah"
Trim(x2)
## [1] "talkstats.com" "really?" "yeah"
x3 <- c("I like 456 dogs\t , don't you?\"")
x3
## [1] "I like 456 dogs\t , don't you?\""
scrubber(x3)
## [1] "I like 456 dogs, don't you?"
scrubber(x3, TRUE)
## [1] "I like 456 dogs, don't you?"
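The two cleaning steps, replacing escaped characters and trimming extra white space, can be approximated in base R. The sketch below is an illustration only and does not reproduce everything clean, Trim, or scrubber do.

```r
x <- "  I go \r\n    to the \tnext   line "

# Replace escaped characters (tab, carriage return, newline) with a space,
# collapse runs of spaces, then trim both ends
cleaned <- gsub("[\t\r\n]", " ", x)
cleaned <- gsub(" +", " ", cleaned)
cleaned <- trimws(cleaned)
cleaned
# "I go to the next line"
```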
The replacement family of functions replace various text elements within the transcripts with alphabetic versions that are more suited to analysis. These alterations may affect word counts and other alphabetic dependent forms of analysis.
The replace_abbreviation function replaces standard abbreviations that utilize periods with forms that do not rely on periods. This is necessary because many sentence-specific functions (e.g., sentSplit and word_stats) rely on periods acting as sentence end marks. The researcher may augment the standard abbreviations dictionary from qdapDictionaries with field-specific abbreviations.
♦ Replace Abbreviations♦
## Use the standard abbreviations dictionary
x <- c("Mr. Jones is here at 7:30 p.m.",
"Check it out at www.github.com/trinker/qdap",
"i.e. He's a sr. dr.; the best in 2012 A.D.",
"the robot at t.s. is 10ft. 3in.")
x
## [1] "Mr. Jones is here at 7:30 p.m."
## [2] "Check it out at www.github.com/trinker/qdap"
## [3] "i.e. He's a sr. dr.; the best in 2012 A.D."
## [4] "the robot at t.s. is 10ft. 3in."
replace_abbreviation(x)
## [1] "Mister Jones is here at 7:30 PM."
## [2] "Check it out at www dot github dot com /trinker/qdap"
## [3] "ie He's a Senior Doctor ; the best in 2012 AD."
## [4] "the robot at t.s. is 10ft. 3in."
## Augment the standard dictionary with replacement vectors
abv <- c("in.", "ft.", "t.s.")
repl <- c("inch", "feet", "talkstats")
replace_abbreviation(x, abv, repl)
## [1] "Mr. Jones is here at 7:30 p.m."
## [2] "Check it out at www.github.com/trinker/qdap"
## [3] "i.e. He's a sr. dr.; the best in 2012 A.D."
## [4] "the robot at talkstats is 10 feet 3 inch."
## Augment the standard dictionary with a replacement dataframe
(KEY <- rbind(abbreviations, data.frame(abv = abv, rep = repl)))
## abv rep
## 1 Mr. Mister
## 2 Mrs. Misses
## 3 Ms. Miss
## 4 .com dot com
## 5 www. www dot
## 6 i.e. ie
## 7 A.D. AD
## 8 B.C. BC
## 9 A.M. AM
## 10 P.M. PM
## 11 et al. et al
## 12 Jr. Junior
## 13 Dr. Doctor
## 14 Sr. Senior
## 15 in. inch
## 16 ft. feet
## 17 t.s. talkstats
replace_abbreviation(x, KEY)
## [1] "Mister Jones is here at 7:30 PM."
## [2] "Check it out at www dot github dot com /trinker/qdap"
## [3] "ie He's a Senior Doctor ; the best in 2012 AD."
## [4] "the robot at talkstats is 10 feet 3 inch."
The replace_contraction function replaces contractions with equivalent multi-word forms. This is useful for some word/sentence statistics. The researcher may augment the contractions dictionary supplied by qdapDictionaries; however, the word list is exhaustive.
♦ Replace Contractions♦
x <- c("Mr. Jones isn't going.",
"Check it out what's going on.",
"He's here but didn't go.",
"the robot at t.s. wasn't nice",
"he'd like it if i'd go away")
x
## [1] "Mr. Jones isn't going." "Check it out what's going on."
## [3] "He's here but didn't go." "the robot at t.s. wasn't nice"
## [5] "he'd like it if i'd go away"
replace_contraction(x)
## [1] "Mr. Jones is not going."
## [2] "Check it out what is going on."
## [3] "He is here but did not go."
## [4] "The robot at t.s. was not nice"
## [5] "He would like it if I would go away"
The replace_number function utilizes the work of John Fox (2005) to turn numeric representations of numbers into their textual equivalents. This is useful for word statistics that require the text version of dialogue.
♦ Replace Numbers-Numeral Representation♦
x <- c("I like 346457 ice cream cones.", "They are 99 percent good")
replace_number(x)
## [1] "I like three hundred forty six thousand four hundred fifty seven ice cream cones."
## [2] "They are ninety nine percent good"
## Replace numbers that contain commas as well
y <- c("I like 346,457 ice cream cones.", "They are 99 percent good")
replace_number(y)
## [1] "I like three hundred forty six thousand four hundred fifty seven ice cream cones."
## [2] "They are ninety nine percent good"
## Combine numbers as one word/string
replace_number(x, FALSE)
## [1] "I like threehundredfortysixthousandfourhundredfiftyseven ice cream cones."
## [2] "They are ninetynine percent good"
The replace_symbol function converts ($) to “dollar”, (%) to “percent”, (#) to “number”, (@) to “at”, (&) to “and”, and (w/) to “with”. Additional substitutions can be undertaken with the multigsub function.
♦ Replace Symbols♦
x <- c("I am @ Jon's & Jim's w/ Marry",
"I owe $41 for food",
"two is 10% of a #")
x
## [1] "I am @ Jon's & Jim's w/ Marry" "I owe $41 for food"
## [3] "two is 10% of a #"
replace_symbol(x)
## [1] "I am at Jon's and Jim's with Marry"
## [2] "I owe dollar 41 for food"
## [3] "two is 10 percent of a number"
replace_number(replace_symbol(x))
## [1] "I am at Jon's and Jim's with Marry"
## [2] "I owe dollar forty one for food"
## [3] "two is ten percent of a number"
The qprep function is a wrapper for several other replacement family functions that allows for speedier cleaning of the text. This approach, while speedy, reduces the flexibility and care that the researcher exercises when the individual replacement functions are used. The function is intended for analysis that requires less care.
♦ General Replacement (Quick Preparation)♦
x <- "I like 60 (laughter) #d-bot and $6 @ the store w/o 8p.m."
x
## [1] "I like 60 (laughter) #d-bot and $6 @ the store w/o 8p.m."
qprep(x)
## [1] "I like sixty number d bot and dollar six at the store without eight PM."
Many qdap functions break sentences up into words based on the spaces between words. Often the researcher will want to keep a group of words as a single unit. The space_fill function allows the researcher to replace the spaces within selected phrases with ~~. By default ~~ is recognized by many qdap functions as a space separator.
♦ Space Fill Examples♦
## Fake Data
x <- c("I want to hear the Dr. Martin Luther King Jr. speech.",
"I also want to go to the white House to see President Obama speak.")
x
## [1] "I want to hear the Dr. Martin Luther King Jr. speech."
## [2] "I also want to go to the white House to see President Obama speak."
## Words to keep as a single unit
keeps <- c("Dr. Martin Luther King Jr.", "The White House", "President Obama")
text <- space_fill(x, keeps)
text
## [1] "I want to hear the Dr.~~Martin~~Luther~~King~~Jr. speech."
## [2] "I also want to go to The~~White~~House to see President~~Obama speak."
## strip Example
strip(text, lower=FALSE)
## [1] "I want to hear the Dr~~Martin~~Luther~~King~~Jr speech"
## [2] "I also want to go to The~~White~~House to see President~~Obama speak"
## bag_o_words Example
bag_o_words(text, lower=FALSE)
## [1] "i" "want"
## [3] "to" "hear"
## [5] "the" "dr~~martin~~luther~~king~~jr"
## [7] "speech" "i"
## [9] "also" "want"
## [11] "to" "go"
## [13] "to" "the~~white~~house"
## [15] "to" "see"
## [17] "president~~obama" "speak"
## wfm Example
wfm(text, c("greg", "bob"))
## bob greg
## also 1 0
## dr martin luther king jr 0 1
## go 1 0
## hear 0 1
## i 1 1
## president obama 1 0
## see 1 0
## speak 1 0
## speech 0 1
## the 0 1
## the white house 1 0
## to 3 1
## want 1 1
## trans_cloud Example
obs <- strip(space_fill(keeps, keeps), lower=FALSE)
trans_cloud(text, c("greg", "bob"), target.words=list(obs), caps.list=obs,
cloud.colors=qcv(red, gray65), expand.target = FALSE, title.padj = .7,
legend = c("space_filled", "other"), title.cex = 2, title.color = "blue",
max.word.size = 3)
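The core of the space-filling idea is a pair of nested substitutions, sketched here in base R. Unlike the output shown above, this sketch matches phrases exactly and is case-sensitive; it is an illustration of the technique rather than space_fill itself, and the second sentence is made up for the example.

```r
x <- c("I want to hear the Dr. Martin Luther King Jr. speech.",
       "I want to see President Obama speak.")
keeps <- c("Dr. Martin Luther King Jr.", "President Obama")

filled <- x
for (k in keeps) {
  joined <- gsub(" ", "~~", k, fixed = TRUE)       # phrase with spaces replaced
  filled <- gsub(k, joined, filled, fixed = TRUE)  # swap it into the text
}
filled
# "I want to hear the Dr.~~Martin~~Luther~~King~~Jr. speech."
# "I want to see President~~Obama speak."
```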
The researcher may need to make multiple substitutions in a text. One example is a transcript marked up with the coding conventions of a particular transcription method. These codes, while useful in some contexts, may lead to inaccurate word statistics. The base R function gsub makes a single replacement of these types of coding conventions. The multigsub function (alias mgsub) takes a vector of patterns to search for as well as a vector of replacements. Note that the replacements occur sequentially rather than all at once. This means an earlier substitution (first in the pattern vector) could alter or be altered by a later one. mgsub is useful throughout multiple stages of the research process.
♦ Multiple Substitutions♦
left_just(DATA[, c(1, 4)])
## person state
## 1 sam Computer is fun. Not too fun.
## 2 greg No it's not, it's dumb.
## 3 teacher What should we do?
## 4 sam You liar, it stinks!
## 5 greg I am telling the truth!
## 6 sally How can we be certain?
## 7 greg There is no way.
## 8 sam I distrust you.
## 9 sally What are you talking about?
## 10 researcher Shall we move on? Good then.
## 11 greg I'm hungry. Let's eat. You already?
multigsub(c("it's", "I'm"), c("it is", "I am"), DATA$state)
## [1] "Computer is fun. Not too fun."
## [2] "No it is not, it is dumb."
## [3] "What should we do?"
## [4] "You liar, it stinks!"
## [5] "I am telling the truth!"
## [6] "How can we be certain?"
## [7] "There is no way."
## [8] "I distrust you."
## [9] "What are you talking about?"
## [10] "Shall we move on? Good then."
## [11] "I am hungry. Let's eat. You already?"
mgsub(c("it's", "I'm"), c("it is", "I am"), DATA$state)
## [1] "Computer is fun. Not too fun."
## [2] "No it is not, it is dumb."
## [3] "What should we do?"
## [4] "You liar, it stinks!"
## [5] "I am telling the truth!"
## [6] "How can we be certain?"
## [7] "There is no way."
## [8] "I distrust you."
## [9] "What are you talking about?"
## [10] "Shall we move on? Good then."
## [11] "I am hungry. Let's eat. You already?"
mgsub(c("it's", "I'm"), "SINGLE REPLACEMENT", DATA$state)
## [1] "Computer is fun. Not too fun."
## [2] "No SINGLE REPLACEMENT not, SINGLE REPLACEMENT dumb."
## [3] "What should we do?"
## [4] "You liar, it stinks!"
## [5] "I am telling the truth!"
## [6] "How can we be certain?"
## [7] "There is no way."
## [8] "I distrust you."
## [9] "What are you talking about?"
## [10] "Shall we move on? Good then."
## [11] "SINGLE REPLACEMENT hungry. Let's eat. You already?"
mgsub("[[:punct:]]", "PUNC", DATA$state, fixed = FALSE)
## [1] "Computer is funPUNC Not too funPUNC"
## [2] "No itPUNCs notPUNC itPUNCs dumbPUNC"
## [3] "What should we doPUNC"
## [4] "You liarPUNC it stinksPUNC"
## [5] "I am telling the truthPUNC"
## [6] "How can we be certainPUNC"
## [7] "There is no wayPUNC"
## [8] "I distrust youPUNC"
## [9] "What are you talking aboutPUNC"
## [10] "Shall we move onPUNC Good thenPUNC"
## [11] "IPUNCm hungryPUNC LetPUNCs eatPUNC You alreadyPUNC"
## Iterative "I'm" converts to "I am" which converts to "ITERATIVE"
mgsub(c("it's", "I'm", "I am"), c("it is", "I am", "ITERATIVE"), DATA$state)
## [1] "Computer is fun. Not too fun."
## [2] "No it is not, it is dumb."
## [3] "What should we do?"
## [4] "You liar, it stinks!"
## [5] "ITERATIVE telling the truth!"
## [6] "How can we be certain?"
## [7] "There is no way."
## [8] "I distrust you."
## [9] "What are you talking about?"
## [10] "Shall we move on? Good then."
## [11] "ITERATIVE hungry. Let's eat. You already?"
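Sequential substitution, and the order dependence it creates, can be demonstrated with a short base R sketch built on Reduce and gsub. The helper name `seq_gsub` is hypothetical, and the sketch assumes fixed (non-regex) patterns; it is meant to illustrate the behavior described above, not to mirror mgsub's internals.

```r
# Hypothetical helper: apply fixed-string substitutions one after another
seq_gsub <- function(pattern, replacement, x) {
  replacement <- rep_len(replacement, length(pattern))  # recycle a single replacement
  Reduce(function(txt, i) gsub(pattern[i], replacement[i], txt, fixed = TRUE),
         seq_along(pattern), x)
}

x <- c("I'm hungry.", "I am telling the truth!")

# "I'm" becomes "I am" first, and the later pattern then rewrites it again
seq_gsub(c("I'm", "I am"), c("I am", "ITERATIVE"), x)
# "ITERATIVE hungry."  "ITERATIVE telling the truth!"

# Reversing the order changes the result for the first string
seq_gsub(c("I am", "I'm"), c("ITERATIVE", "I am"), x)
# "I am hungry."  "ITERATIVE telling the truth!"
```

When the order of substitutions matters, placing the most specific patterns first is usually the safer choice.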
A researcher may face a list of names and be uncertain about the gender of the participants. The name2sex function utilizes the gender package to predict gender from first names based on Social Security Administration data, defaulting to the period from 1932 to 2012.
♦ Name to Gender Prediction♦
name2sex(qcv(mary, jenn, linda, JAME, GABRIEL, OLIVA, tyler, jamie, JAMES,
tyrone, cheryl, drew))
## [1] F F F M M F M F M M F M
## Levels: F M
During the initial cleaning stage of analysis the researcher may choose to create a stemmed version of the dialogue, that is, a version in which words are reduced to their root forms. The stemmer family of functions allows the researcher to create stemmed text. The stem2df function wraps stemmer to quickly create a dataframe with the stemmed column added.
♦ Stemming♦
## stem2df EXAMPLE:
(stemdat <- stem2df(DATA, "state", "new"))
## person sex adult state code
## 1 sam m 0 Computer is fun. Not too fun. K1
## 2 greg m 0 No it's not, it's dumb. K2
## 3 teacher m 1 What should we do? K3
## 4 sam m 0 You liar, it stinks! K4
## 5 greg m 0 I am telling the truth! K5
## 6 sally f 0 How can we be certain? K6
## 7 greg m 0 There is no way. K7
## 8 sam m 0 I distrust you. K8
## 9 sally f 0 What are you talking about? K9
## 10 researcher f 1 Shall we move on? Good then. K10
## 11 greg m 0 I'm hungry. Let's eat. You already? K11
## new
## 1 Comput is fun not too fun.
## 2 No it not it dumb.
## 3 What should we do?
## 4 You liar it stink!
## 5 I am tell the truth!
## 6 How can we be certain?
## 7 There is no way.
## 8 I distrust you.
## 9 What are you talk about?
## 10 Shall we move on good then.
## 11 I'm hungri let eat you alreadi?
with(stemdat, trans_cloud(new, sex, title.cex = 2.5,
title.color = "blue", max.word.size = 5, title.padj = .7))
## stemmer EXAMPLE:
stemmer(DATA$state)
## [1] "Comput is fun not too fun." "No it not it dumb."
## [3] "What should we do?" "You liar it stink!"
## [5] "I am tell the truth!" "How can we be certain?"
## [7] "There is no way." "I distrust you."
## [9] "What are you talk about?" "Shall we move on good then."
## [11] "I'm hungri let eat you alreadi?"
## stem_words EXAMPLE:
stem_words(doggies, jumping, swims)
## [1] "doggi" "jump" "swim"
At times it is handy to be able to grab from the beginning or end of a string to a specific character. The beg2char function allows you to grab from the beginning of a string to the nth occurrence of a character. The counterpart function, char2end, grabs from the nth occurrence of a character to the end of a string. This behavior is useful if the transcript contains annotations at the beginning or end of a line that should be eliminated.
♦ Grab From Character to Beginning/End of String♦
x <- c("a_b_c_d", "1_2_3_4", "<_?_._:")
beg2char(x, "_")
## [1] "a" "1" "<"
beg2char(x, "_", 4)
## [1] "a_b_c_d" "1_2_3_4" "<_?_._:"
char2end(x, "_")
## [1] "b_c_d" "2_3_4" "?_._:"
char2end(x, "_", 2)
## [1] "c_d" "3_4" "._:"
char2end(x, "_", 3, include=TRUE)
## [1] "_d" "_4" "_:"
(x2 <- gsub("_", " ", x))
## [1] "a b c d" "1 2 3 4" "< ? . :"
beg2char(x2, " ", 2)
## [1] "a b" "1 2" "< ?"
(x3 <- gsub("_", "\\^", x))
## [1] "a^b^c^d" "1^2^3^4" "<^?^.^:"
char2end(x3, "^", 2)
## [1] "c^d" "3^4" ".^:"
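One way to think about these two operations is as splitting on the character and keeping the pieces before or after the nth split point. The base R sketch below follows that reading; `beg_to_char` and `char_to_end` are hypothetical names, the character is treated as a literal string, and there is no counterpart here to the include argument.

```r
# Hypothetical helper: everything before the nth occurrence of `char`
beg_to_char <- function(x, char, n = 1) {
  vapply(strsplit(x, char, fixed = TRUE),
         function(p) paste(head(p, n), collapse = char), character(1))
}

# Hypothetical helper: everything after the nth occurrence of `char`
char_to_end <- function(x, char, n = 1) {
  vapply(strsplit(x, char, fixed = TRUE),
         function(p) paste(tail(p, -n), collapse = char), character(1))
}

x <- c("a_b_c_d", "1_2_3_4")
beg_to_char(x, "_")      # "a" "1"
beg_to_char(x, "_", 2)   # "a_b" "1_2"
char_to_end(x, "_", 2)   # "c_d" "3_4"
```

Because the split uses fixed = TRUE, characters with special regex meaning, such as "^", need no escaping.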
Often incomplete sentences have a different function than complete sentences. The researcher may want to denote incomplete sentences for consideration in later analysis. Traditionally, incomplete sentences are denoted with the following end marks (.., …, .?, ..?, en dash & em dash). The incomplete_replace function can identify and replace the traditional end marks with a standard form “|”.
♦ Incomplete Sentence Identification♦
x <- c("the...", "I.?", "you.", "threw..", "we?")
incomplete_replace(x)
## [1] "the|" "I|" "you." "threw|" "we?"
incomp(x)
## [1] "the|" "I|" "you." "threw|" "we?"
incomp(x, scan.mode = TRUE)
## row.num text
## 1 1 the...
## 2 2 I.?
## 3 4 threw..
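The replacement of period-based incomplete end marks can be approximated with a single regular expression in base R. This sketch covers only the ASCII forms (two or more periods, optionally followed by a question mark, and ".?"); it makes no attempt to handle the single-character ellipsis or dashes, and it is not the pattern incomplete_replace uses internally.

```r
x <- c("the...", "I.?", "you.", "threw..", "we?")

# Two or more periods (optionally followed by "?"), or ".?", at the end of a string
gsub("(\\.{2,}\\??|\\.\\?)$", "|", x)
# "the|" "I|" "you." "threw|" "we?"
```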
The capitalizer function allows the researcher to specify words within a vector to be capitalized. By default I, and contractions containing I, are capitalized. Additional words can be specified through the caps.list argument. To capitalize words within strings, mgsub can be used.
♦ Word Capitalization♦
capitalizer(bag_o_words("i like it but i'm not certain"), "like")
## [1] "I" "Like" "it" "but" "I'm" "not" "certain"
capitalizer(bag_o_words("i like it but i'm not certain"), "like", FALSE)
## [1] "i" "Like" "it" "but" "i'm" "not" "certain"
Many functions in the qdap package require that the dialogue be broken apart into individual sentences; failure to do so may invalidate many of the outputs from the analysis and will lead to warnings. After reading in and cleaning the data the next step should be to split the text variable into individual sentences. The sentSplit function outputs a dataframe with the text variable split into individual sentences and repeats the demographic variables as necessary. Additionally, a turn of talk variable (the tot column) is added that keeps track of the original turn of talk (row number) and the sentence number within that turn. The researcher may also want to create a second text column that has been stemmed for future analysis by setting stem.col = TRUE, though this is more time intensive.
♦ sentSplit Example ♦
sentSplit(DATA, "state")
## person tot sex adult code state
## 1 sam 1.1 m 0 K1 Computer is fun.
## 2 sam 1.2 m 0 K1 Not too fun.
## 3 greg 2.1 m 0 K2 No it's not, it's dumb.
## 4 teacher 3.1 m 1 K3 What should we do?
## 5 sam 4.1 m 0 K4 You liar, it stinks!
## 6 greg 5.1 m 0 K5 I am telling the truth!
## 7 sally 6.1 f 0 K6 How can we be certain?
## 8 greg 7.1 m 0 K7 There is no way.
## 9 sam 8.1 m 0 K8 I distrust you.
## 10 sally 9.1 f 0 K9 What are you talking about?
## 11 researcher 10.1 f 1 K10 Shall we move on?
## 12 researcher 10.2 f 1 K10 Good then.
## 13 greg 11.1 m 0 K11 I'm hungry.
## 14 greg 11.2 m 0 K11 Let's eat.
## 15 greg 11.3 m 0 K11 You already?
sentSplit(DATA, "state", stem.col = TRUE)
## person tot sex adult code state stem.text
## 1 sam 1.1 m 0 K1 Computer is fun. Comput is fun.
## 2 sam 1.2 m 0 K1 Not too fun. Not too fun.
## 3 greg 2.1 m 0 K2 No it's not, it's dumb. No it not it dumb.
## 4 teacher 3.1 m 1 K3 What should we do? What should we do?
## 5 sam 4.1 m 0 K4 You liar, it stinks! You liar it stink!
## 6 greg 5.1 m 0 K5 I am telling the truth! I am tell the truth!
## 7 sally 6.1 f 0 K6 How can we be certain? How can we be certain?
## 8 greg 7.1 m 0 K7 There is no way. There is no way.
## 9 sam 8.1 m 0 K8 I distrust you. I distrust you.
## 10 sally 9.1 f 0 K9 What are you talking about? What are you talk about?
## 11 researcher 10.1 f 1 K10 Shall we move on? Shall we move on?
## 12 researcher 10.2 f 1 K10 Good then. Good then.
## 13 greg 11.1 m 0 K11 I'm hungry. I'm hungri.
## 14 greg 11.2 m 0 K11 Let's eat. Let eat.
## 15 greg 11.3 m 0 K11 You already? You alreadi?
sentSplit(raj, "dialogue")[1:11, ]
## person tot act dialogue
## 1 Sampson 1.1 1 Gregory, o my word, we'll not carry coals.
## 2 Gregory 2.1 1 No, for then we should be colliers.
## 3 Sampson 3.1 1 I mean, an we be in choler, we'll draw.
## 4 Gregory 4.1 1 Ay, while you live, draw your neck out o the collar.
## 5 Sampson 5.1 1 I strike quickly, being moved.
## 6 Gregory 6.1 1 But thou art not quickly moved to strike.
## 7 Sampson 7.1 1 A dog of the house of Montague moves me.
## 8 Gregory 8.1 1 To move is to stir; and to be valiant is to stand.
## 9 Gregory 8.2 1 therefore, if thou art moved, thou runn'st away.
## 10 Sampson 9.1 1 A dog of that house shall move me to stand.
## 11 Sampson 9.2 1 I will take the wall of any man or maid of Montague's.
♦ sentSplit - plot Method ♦
plot(sentSplit(DATA, "state"), grouping.var = "person")
plot(sentSplit(DATA, "state"), grouping.var = "sex")
♦ TOT Example ♦
## Convert tot column with sub sentences to turns of talk
dat <- sentSplit(DATA, "state")
TOT(dat$tot)
## 1.1 1.2 2.1 3.1 4.1 5.1 6.1 7.1 8.1 9.1 10.1 10.2 11.1 11.2 11.3
## 1 1 2 3 4 5 6 7 8 9 10 10 11 11 11
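To make the tot bookkeeping concrete, here is a base R sketch that splits turns into sentences, builds "turn.sentence" labels, and then recovers the turn number from them. It assumes sentences end in ., ? or ! followed by whitespace, which is a simplification of what sentSplit handles, and the two-row `turns` data frame is made up for the example.

```r
turns <- data.frame(person = c("sam", "greg"),
                    state  = c("Computer is fun. Not too fun.",
                               "No it's not, it's dumb."),
                    stringsAsFactors = FALSE)

# Split after ., ? or ! when followed by whitespace
sents <- strsplit(turns$state, "(?<=[.?!])\\s+", perl = TRUE)
n <- lengths(sents)                      # sentences per turn

out <- data.frame(
  person = rep(turns$person, n),         # repeat the speaker for each sentence
  tot    = paste(rep(seq_along(n), n), sequence(n), sep = "."),
  state  = unlist(sents),
  stringsAsFactors = FALSE
)
out$tot                                  # "1.1" "1.2" "2.1"

# Recover the turn of talk by dropping the sentence number
turn <- as.integer(sub("\\..*", "", out$tot))
turn                                     # 1 1 2
```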
Within dialogue (particularly classroom dialogue) several speakers may say the same thing at the same time. The transcripts may lump this speech together in the form of:
Person | Dialogue |
John, Josh & Imani | Yes Mrs. Smith. |
The speakerSplit function attributes this text to each of the people as separate entries. The default behavior is to search for the person separators sep = c("and", "&", ","), though other separators may be specified.
♦ Break and Stretch if Multiple Persons per Cell♦
## Create data set with multiple speakers per turn of talk
DATA$person <- as.character(DATA$person)
DATA$person[c(1, 4, 6)] <- c("greg, sally, & sam",
"greg, sally", "sam and sally")
speakerSplit(DATA)
## person sex adult state code
## 1 greg m 0 Computer is fun. Not too fun. K1
## 2 sally m 0 Computer is fun. Not too fun. K1
## 3 sam m 0 Computer is fun. Not too fun. K1
## 4 greg m 0 No it's not, it's dumb. K2
## 5 teacher m 1 What should we do? K3
## 6 greg m 0 You liar, it stinks! K4
## 7 sally m 0 You liar, it stinks! K4
## 8 greg m 0 I am telling the truth! K5
## 9 sam f 0 How can we be certain? K6
## 10 sally f 0 How can we be certain? K6
## 11 greg m 0 There is no way. K7
## 12 sam m 0 I distrust you. K8
## 13 sally f 0 What are you talking about? K9
## 14 researcher f 1 Shall we move on? Good then. K10
## 15 greg m 0 I'm hungry. Let's eat. You already? K11
## Change the separator
DATA$person[c(1, 4, 6)] <- c("greg_sally_sam",
"greg.sally", "sam; sally")
speakerSplit(DATA, sep = c(".", "_", ";"))
## person sex adult state code
## 1 greg m 0 Computer is fun. Not too fun. K1
## 2 sally m 0 Computer is fun. Not too fun. K1
## 3 sam m 0 Computer is fun. Not too fun. K1
## 4 greg m 0 No it's not, it's dumb. K2
## 5 teacher m 1 What should we do? K3
## 6 greg m 0 You liar, it stinks! K4
## 7 sally m 0 You liar, it stinks! K4
## 8 greg m 0 I am telling the truth! K5
## 9 sam f 0 How can we be certain? K6
## 10 sally f 0 How can we be certain? K6
## 11 greg m 0 There is no way. K7
## 12 sam m 0 I distrust you. K8
## 13 sally f 0 What are you talking about? K9
## 14 researcher f 1 Shall we move on? Good then. K10
## 15 greg m 0 I'm hungry. Let's eat. You already? K11
## Reset DATA
DATA <- qdap::DATA
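The break-and-stretch idea can also be sketched in base R: split the person cell on the separators, then repeat each row once per speaker. This is only an illustration under assumptions (the talk dataframe and the regular expression are hypothetical and are not taken from speakerSplit itself).

```r
# Hypothetical toy transcript with several speakers per cell
talk <- data.frame(
  person   = c("John, Josh & Imani", "sam and sally"),
  dialogue = c("Yes Mrs. Smith.", "How can we be certain?"),
  stringsAsFactors = FALSE
)

# Split on ",", "&" or the whole word "and" (runs of separators count once)
people <- strsplit(talk$person, "(\\s*(,|&|\\band\\b)\\s*)+", perl = TRUE)

# Repeat each line of dialogue once per speaker found in its cell
stretched <- data.frame(
  person   = unlist(people),
  dialogue = rep(talk$dialogue, lengths(people)),
  stringsAsFactors = FALSE
)
stretched
```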
The sentCombine function is the opposite of sentSplit, combining sentences into a single turn of talk per grouping variable.
♦ Sentence Combining♦
dat <- sentSplit(DATA, "state")
## Combine by person
sentCombine(dat$state, dat$person)
## person text.var
## 1 sam Computer is fun. Not too fun.
## 2 greg No it's not, it's dumb.
## 3 teacher What should we do?
## 4 sam You liar, it stinks!
## 5 greg I am telling the truth!
## 6 sally How can we be certain?
## 7 greg There is no way.
## 8 sam I distrust you.
## 9 sally What are you talking about?
## 10 researcher Shall we move on? Good then.
## 11 greg I'm hungry. Let's eat. You already?
## Combine by sex
truncdf(sentCombine(dat$state, dat$sex), 65)
## sex text.var
## 1 m Computer is fun. Not too fun. No it's not, it's dumb. What should
## 2 f How can we be certain?
## 3 m There is no way. I distrust you.
## 4 f What are you talking about? Shall we move on? Good then.
## 5 m I'm hungry. Let's eat. You already?
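As the output above suggests, sentences are combined within consecutive runs of the same group rather than across the whole transcript. A rough base R sketch of that behavior, using rle on a hypothetical toy vector (not qdap's own code), might look like this:

```r
# Hypothetical sentence-level data: one sentence per element
person <- c("sam", "sam", "greg", "sam", "sam", "greg")
state  <- c("Computer is fun.", "Not too fun.", "No it's not, it's dumb.",
            "You liar, it stinks!", "I distrust you.", "There is no way.")

# Label each consecutive run of the same speaker
runs   <- rle(person)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Paste the sentences within each run back into one turn of talk
combined <- data.frame(
  person   = runs$values,
  text.var = as.vector(tapply(state, run_id, paste, collapse = " ")),
  stringsAsFactors = FALSE
)
combined
```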
It is more efficient to maintain a dialogue dataframe (consisting of a column for people and a column for dialogue) and a separate demographics dataframe (a person column and demographic column(s)) and then merge the two during analysis. The key_merge function is a wrapper for the merge function from R's base install that merges the dialogue and demographics dataframes. key_merge attempts to guess the person column and outputs a qdap friendly dataframe.
♦ Merging Demographic Information♦
## A dialogue dataframe and a demographics dataframe
ltruncdf(list(dialogue=raj, demographics=raj.demographics), 10, 50)
## $dialogue
## person dialogue act
## 1 Sampson Gregory, o my word, we'll not carry coals. 1
## 2 Gregory No, for then we should be colliers. 1
## 3 Sampson I mean, an we be in choler, we'll draw. 1
## 4 Gregory Ay, while you live, draw your neck out o the colla 1
## 5 Sampson I strike quickly, being moved. 1
## 6 Gregory But thou art not quickly moved to strike. 1
## 7 Sampson A dog of the house of Montague moves me. 1
## 8 Gregory To move is to stir; and to be valiant is to stand. 1
## 9 Sampson A dog of that house shall move me to stand. I will 1
## 10 Gregory That shows thee a weak slave; for the weakest goes 1
##
## $demographics
## person sex fam.aff died
## 1 Abraham m mont FALSE
## 2 Apothecary m none FALSE
## 3 Balthasar m mont FALSE
## 4 Benvolio m mont FALSE
## 5 Capulet f cap FALSE
## 6 Chorus none none FALSE
## 7 First Citizen none none FALSE
## 8 First Musician m none FALSE
## 9 First Servant m none FALSE
## 10 First Watchman m none FALSE
## Merge the two
merged.raj <- key_merge(raj, raj.demographics)
htruncdf(merged.raj, 10, 40)
## person act sex fam.aff died dialogue
## 1 Sampson 1 m cap FALSE Gregory, o my word, we'll not carry coal
## 2 Gregory 1 m cap FALSE No, for then we should be colliers.
## 3 Sampson 1 m cap FALSE I mean, an we be in choler, we'll draw.
## 4 Gregory 1 m cap FALSE Ay, while you live, draw your neck out o
## 5 Sampson 1 m cap FALSE I strike quickly, being moved.
## 6 Gregory 1 m cap FALSE But thou art not quickly moved to strike
## 7 Sampson 1 m cap FALSE A dog of the house of Montague moves me.
## 8 Gregory 1 m cap FALSE To move is to stir; and to be valiant is
## 9 Sampson 1 m cap FALSE A dog of that house shall move me to sta
## 10 Gregory 1 m cap FALSE That shows thee a weak slave; for the we
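Because key_merge wraps base R's merge, the same join can be approximated directly. The sketch below is an assumption-laden illustration on hypothetical toy dataframes; the row_id helper column is introduced here only to restore the original order of the dialogue, since merge sorts by the key.

```r
# Hypothetical dialogue and demographics dataframes
dialogue <- data.frame(
  person   = c("Sampson", "Gregory", "Sampson"),
  dialogue = c("I strike quickly, being moved.",
               "But thou art not quickly moved to strike.",
               "A dog of the house of Montague moves me."),
  stringsAsFactors = FALSE
)
demographics <- data.frame(
  person  = c("Gregory", "Sampson"),
  sex     = c("m", "m"),
  fam.aff = c("cap", "cap"),
  stringsAsFactors = FALSE
)

# merge() reorders rows by the key, so remember the original order first
dialogue$row_id <- seq_len(nrow(dialogue))
merged <- merge(dialogue, demographics, by = "person", all.x = TRUE)
merged <- merged[order(merged$row_id), ]
merged
```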
Many functions in qdap utilize the paste2 function, which pastes multiple columns/lists of vectors. paste2 differs from base R's paste function in that paste2 can paste unspecified columns or a list of vectors together. The colpaste2df function, a wrapper for paste2, pastes multiple columns together and outputs an appropriately named dataframe. The colsplit2df and lcolsplit2df functions are useful because they can split the pasted columns in the dataframes (or lists of dataframes) returned by qdap functions.
♦ Using paste2 and colSplit: Pasting & Splitting Vectors and Dataframes♦
## Pasting a list of vectors
paste2(rep(list(state.abb[1:8], month.abb[1:8]) , 2), sep = "|_|")
## [1] "AL|_|Jan|_|AL|_|Jan" "AK|_|Feb|_|AK|_|Feb" "AZ|_|Mar|_|AZ|_|Mar"
## [4] "AR|_|Apr|_|AR|_|Apr" "CA|_|May|_|CA|_|May" "CO|_|Jun|_|CO|_|Jun"
## [7] "CT|_|Jul|_|CT|_|Jul" "DE|_|Aug|_|DE|_|Aug"
## Pasting a dataframe
foo1 <- paste2(CO2[, 1:3])
head(foo1, 12)
## [1] "Qn1.Quebec.nonchilled" "Qn1.Quebec.nonchilled"
## [3] "Qn1.Quebec.nonchilled" "Qn1.Quebec.nonchilled"
## [5] "Qn1.Quebec.nonchilled" "Qn1.Quebec.nonchilled"
## [7] "Qn1.Quebec.nonchilled" "Qn2.Quebec.nonchilled"
## [9] "Qn2.Quebec.nonchilled" "Qn2.Quebec.nonchilled"
## [11] "Qn2.Quebec.nonchilled" "Qn2.Quebec.nonchilled"
## Splitting a pasted column
bar1 <- colSplit(foo1)
head(bar1, 10)
## X1 X2 X3
## 1 Qn1 Quebec nonchilled
## 2 Qn1 Quebec nonchilled
## 3 Qn1 Quebec nonchilled
## 4 Qn1 Quebec nonchilled
## 5 Qn1 Quebec nonchilled
## 6 Qn1 Quebec nonchilled
## 7 Qn1 Quebec nonchilled
## 8 Qn2 Quebec nonchilled
## 9 Qn2 Quebec nonchilled
## 10 Qn2 Quebec nonchilled
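For comparison, the paste-then-split round trip can be written with base R alone. This is a sketch of the general technique (do.call with paste, then strsplit), not a description of how paste2 or colSplit are implemented; the object names are hypothetical.

```r
# First three rows and columns of the built-in CO2 data
df <- as.data.frame(CO2)[1:3, 1:3]

# Paste an arbitrary number of columns together, row by row
pasted <- do.call(paste, c(lapply(df, as.character), sep = "."))

# Split the pasted strings back into a character matrix of columns
parts <- do.call(rbind, strsplit(pasted, ".", fixed = TRUE))
pasted
parts
```

Note that fixed = TRUE is needed when splitting on a period, because a bare "." is a regular expression wildcard.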
♦ colpaste2df & colsplit2df: Splitting Columns in Dataframes♦
## Create a dataset with a pasted column
(dat <- colpaste2df(head(CO2), 1:3, keep.orig = FALSE)[, c(3, 1:2)])
## Plant&Type&Treatment conc uptake
## 1 Qn1.Quebec.nonchilled 95 16.0
## 2 Qn1.Quebec.nonchilled 175 30.4
## 3 Qn1.Quebec.nonchilled 250 34.8
## 4 Qn1.Quebec.nonchilled 350 37.2
## 5 Qn1.Quebec.nonchilled 500 35.3
## 6 Qn1.Quebec.nonchilled 675 39.2
## Split column
colsplit2df(dat)
## Plant Type Treatment conc uptake
## 1 Qn1 Quebec nonchilled 95 16.0
## 2 Qn1 Quebec nonchilled 175 30.4
## 3 Qn1 Quebec nonchilled 250 34.8
## 4 Qn1 Quebec nonchilled 350 37.2
## 5 Qn1 Quebec nonchilled 500 35.3
## 6 Qn1 Quebec nonchilled 675 39.2
## Specify names
colsplit2df(dat, new.names = qcv(A, B, C))
## A B C conc uptake
## 1 Qn1 Quebec nonchilled 95 16.0
## 2 Qn1 Quebec nonchilled 175 30.4
## 3 Qn1 Quebec nonchilled 250 34.8
## 4 Qn1 Quebec nonchilled 350 37.2
## 5 Qn1 Quebec nonchilled 500 35.3
## 6 Qn1 Quebec nonchilled 675 39.2
## Keep the original pasted column
colsplit2df(dat, new.names = qcv(A, B, C), keep.orig = TRUE)
## Plant&Type&Treatment A B C conc uptake
## 1 Qn1.Quebec.nonchilled Qn1 Quebec nonchilled 95 16.0
## 2 Qn1.Quebec.nonchilled Qn1 Quebec nonchilled 175 30.4
## 3 Qn1.Quebec.nonchilled Qn1 Quebec nonchilled 250 34.8
## 4 Qn1.Quebec.nonchilled Qn1 Quebec nonchilled 350 37.2
## 5 Qn1.Quebec.nonchilled Qn1 Quebec nonchilled 500 35.3
## 6 Qn1.Quebec.nonchilled Qn1 Quebec nonchilled 675 39.2
## Pasting columns and output a dataframe
colpaste2df(head(mtcars)[, 1:5], qcv(mpg, cyl, disp), sep ="_", name.sep = "|")
## mpg cyl disp hp drat mpg|cyl|disp
## Mazda RX4 21.0 6 160 110 3.90 21_6_160
## Mazda RX4 Wag 21.0 6 160 110 3.90 21_6_160
## Datsun 710 22.8 4 108 93 3.85 22.8_4_108
## Hornet 4 Drive 21.4 6 258 110 3.08 21.4_6_258
## Hornet Sportabout 18.7 8 360 175 3.15 18.7_8_360
## Valiant 18.1 6 225 105 2.76 18.1_6_225
colpaste2df(head(CO2)[, -3], list(1:2, qcv("conc", "uptake")))
## Plant Type conc uptake Plant&Type conc&uptake
## 1 Qn1 Quebec 95 16.0 Qn1.Quebec 95.16
## 2 Qn1 Quebec 175 30.4 Qn1.Quebec 175.30.4
## 3 Qn1 Quebec 250 34.8 Qn1.Quebec 250.34.8
## 4 Qn1 Quebec 350 37.2 Qn1.Quebec 350.37.2
## 5 Qn1 Quebec 500 35.3 Qn1.Quebec 500.35.3
## 6 Qn1 Quebec 675 39.2 Qn1.Quebec 675.39.2
♦ lcolsplit2df: Splitting Columns in Lists of Dataframes♦
## A list with dataframes that contain pasted columns
x <- question_type(DATA.SPLIT$state, list(DATA.SPLIT$sex, DATA.SPLIT$adult))
ltruncdf(x[1:4])
## $raw
## sex&adult raw.text n.row endmark strip.text q.type
## 1 m.1 What shoul 4 ? what shou what
## 2 f.0 How can we 7 ? how can w how
## 3 f.0 What are y 10 ? what are what
## 4 f.1 Shall we m 11 ? shall we shall
## 5 m.0 You alread 15 ? you alrea implied_do
##
## $count
## sex&adult tot.quest what how shall implied_do
## 1 f.0 2 1 1 0 0
## 2 f.1 1 0 0 1 0
## 3 m.0 1 0 0 0 1
## 4 m.1 1 1 0 0 0
##
## $prop
## sex&adult tot.quest what how shall implied_do
## 1 f.0 2 50 50 0 0
## 2 f.1 1 0 0 100 0
## 3 m.0 1 0 0 0 100
## 4 m.1 1 100 0 0 0
##
## $rnp
## sex&adult tot.quest what how shall implied_do
## 1 f.0 2 1(50%) 1(50%) 0 0
## 2 f.1 1 0 0 1(100%) 0
## 3 m.0 1 0 0 0 1(100%)
## 4 m.1 1 1(100%) 0 0 0
z <- lcolsplit2df(x)
ltruncdf(z[1:4])
## $raw
## sex adult raw.text n.row endmark strip.text q.type
## 1 m 1 What shoul 4 ? what shou what
## 2 f 0 How can we 7 ? how can w how
## 3 f 0 What are y 10 ? what are what
## 4 f 1 Shall we m 11 ? shall we shall
## 5 m 0 You alread 15 ? you alrea implied_do
##
## $count
## sex adult tot.quest what how shall implied_do
## 1 f 0 2 1 1 0 0
## 2 f 1 1 0 0 1 0
## 3 m 0 1 0 0 0 1
## 4 m 1 1 1 0 0 0
##
## $prop
## sex adult tot.quest what how shall implied_do
## 1 f 0 2 50 50 0 0
## 2 f 1 1 0 0 100 0
## 3 m 0 1 0 0 0 100
## 4 m 1 1 100 0 0 0
##
## $rnp
## sex adult tot.quest what how shall implied_do
## 1 f 0 2 1(50%) 1(50%) 0 0
## 2 f 1 1 0 0 1(100%) 0
## 3 m 0 1 0 0 0 1(100%)
## 4 m 1 1 1(100%) 0 0 0
Often a researcher will want to view the patterns of the discourse by grouping variables over time. This requires the data to have start and end times based on units (sentence, turn of talk, or word). The gantt function provides the user with unit spans (start and end times), with gantt_rep extending this capability to repeated measures. The gantt function has a basic plotting method to allow visualization of the unit span data; however, the gantt_wrap function extends the gantt and gantt_rep functions to plot precise depictions (Gantt plots) of the unit span data. Note that if the researcher is only interested in plotting the data as a Gantt plot, the gantt_plot function combines the gantt/gantt_rep functions with the gantt_wrap function.
♦ Unit Spans♦
## Unit Span Dataframe
dat <- gantt(mraja1$dialogue, mraja1$person)
head(dat, 12)
## person n start end
## 1 Sampson 8 0 8
## 2 Gregory 7 8 15
## 3 Sampson 9 15 24
## 4 Gregory 11 24 35
## 5 Sampson 5 35 40
## 6 Gregory 8 40 48
## 7 Sampson 9 48 57
## 8 Gregory 20 57 77
## 9 Sampson 22 77 99
## 10 Gregory 13 99 112
## 11 Sampson 30 112 142
## 12 Gregory 10 142 152
plot(dat)
plot(dat, base = TRUE)
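The start and end columns above are running totals of the word counts per turn of talk. A minimal base R sketch of that arithmetic, assuming words are simply whitespace-separated tokens (qdap's own word counting may differ), is:

```r
# Two turns of talk taken from the transcript shown earlier
dialogue <- c("Gregory, o my word, we'll not carry coals.",
              "No, for then we should be colliers.")

# Words per turn, counted as whitespace-separated tokens
n <- lengths(strsplit(trimws(dialogue), "\\s+"))

# Each span ends at the cumulative count and starts where the last one ended
end   <- cumsum(n)
start <- end - n
data.frame(n, start, end)
```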
♦ Repeated Measures Unit Spans♦
## Repeated Measures Unit Span Dataframe
dat2 <- with(rajSPLIT, gantt_rep(act, dialogue, list(fam.aff, sex)))
head(dat2, 12)
## act fam.aff_sex n start end
## 1 1 cap_m 327 0 327
## 2 1 mont_m 8 327 335
## 3 1 cap_m 6 335 341
## 4 1 mont_m 8 341 349
## 5 1 cap_m 32 349 381
## 6 1 mont_m 4 381 385
## 7 1 cap_m 16 385 401
## 8 1 mont_m 2 401 403
## 9 1 cap_m 14 403 417
## 10 1 mont_m 2 417 419
## 11 1 cap_m 10 419 429
## 12 1 mont_m 12 429 441
## Plotting Repeated Measures Unit Span Dataframe
plot(dat2)
gantt_wrap(dat2, "fam.aff_sex", facet.vars = "act",
title = "Repeated Measures Gantt Plot")
It is useful to convert data to an adjacency matrix for examining relationships between grouping variables in word usage. The adjacency_matrix function (aka: adjmat) provides this capability, interacting with a termco or wfm object. In the first example below Sam and Greg share 4 words in common, whereas the Teacher and Greg share no words. The adjacency matrix can be passed to a network graphing package such as the igraph package for visualization of the data structure, as seen in the plotting examples.
♦ Adjacency Matrix: Example 1♦
adjacency_matrix(wfm(DATA$state, DATA$person))
## Adjacency Matrix:
##
## greg researcher sally sam
## researcher 0
## sally 1 1
## sam 4 0 1
## teacher 0 1 2 0
##
##
## Summed occurrences:
##
## greg researcher sally sam teacher
## 18 6 10 11 4
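One way to think about such a matrix is as a count of the words that two speakers both use at least once. The sketch below illustrates that reading with a cross product of a presence/absence matrix; it is an assumption about the idea, not a copy of adjacency_matrix, and wfm_toy is a hypothetical word frequency matrix.

```r
# Hypothetical word frequency matrix: rows are words, columns are speakers
wfm_toy <- matrix(
  c(1, 0, 2,
    0, 1, 1,
    3, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("fun", "we", "you"), c("greg", "sally", "sam"))
)

# 1 if the speaker used the word at all, 0 otherwise
presence <- (wfm_toy > 0) * 1

# Off-diagonal cells count the words shared by each pair of speakers
adj <- crossprod(presence)
adj
```

The diagonal of the result holds the number of distinct words each speaker used.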
♦ Adjacency Matrix: Example 2♦
words <- c(" education", " war ", " econom", " job", "governor ")
(terms <- with(pres_debates2012, termco(dialogue, person, words)))
adjmat(terms)
## Adjacency Matrix:
##
## OBAMA ROMNEY CROWLEY LEHRER QUESTION
## ROMNEY 5
## CROWLEY 2 2
## LEHRER 4 4 2
## QUESTION 4 4 2 4
## SCHIEFFER 2 2 1 1 1
##
##
## Summed occurrences:
##
## OBAMA ROMNEY CROWLEY LEHRER QUESTION SCHIEFFER
## 5 5 2 4 4 2
It is often useful to plot the adjacency matrix as a network. The igraph package provides this functionality.
♦ Plotting an Adjacency Matrix: Example 1♦
library(igraph)
dat <- adjacency_matrix(wfm(DATA$state, DATA$person, stopword = Top25Words))
g <- graph.adjacency(dat$adjacency, weighted=TRUE, mode ="undirected")
g <- simplify(g)
V(g)$label <- V(g)$name
V(g)$degree <- igraph::degree(g)
plot(g, layout=layout.auto(g))
The following example will visualize the presidential debates data as a network plot.
♦ Plotting an Adjacency Matrix: Example 2♦
library(igraph)
## Subset the presidential debates data set
subpres <- pres_debates2012[pres_debates2012$person %in% qcv(ROMNEY, OBAMA), ]
## Create a word frequency matrix
dat <- with(subpres, wfm(dialogue, list(person, time), stopword = Top200Words))
## Generate an adjacency matrix
adjdat <- adjacency_matrix(dat)
X <- adjdat$adjacency
g <- graph.adjacency(X, weighted=TRUE, mode ="undirected")
g <- simplify(g)
V(g)$label <- V(g)$name
V(g)$degree <- igraph::degree(g)
plot(g, layout=layout.auto(g))
We can easily add information to the network plot utilizing the Dissimilarity
function to obtain weights and distance measures for use with the plot.
♦ Plotting an Adjacency Matrix: Example 2b♦
edge.weight <- 15 #a maximizing thickness constant
d <- as.matrix(Dissimilarity(dat))
d2 <- d[lower.tri(d)]
z1 <- edge.weight*d2^2/max(d2)
z2 <- c(round(d2, 3))
E(g)$width <- c(z1)[c(z1) != 0]
E(g)$label <- c(z2)[c(z2) != 0]
plot(g, layout=layout.auto(g))
plot(g, layout=layout.auto(g), edge.curved =TRUE)
♦ Plotting an Adjacency Matrix: Try the plot interactively!♦
tkplot(g)
This section overviews functions that can extract words and word lists from dialogue text. The subsections describing function use are in alphabetical order as there is no set chronology for use.
The all_words function breaks the dialogue into a bag of words and searches based on the criteria arguments begins.with and contains. The resulting word list can be useful for analysis or to pass to qdap functions that deal with Word Counts and Descriptive Statistics.
♦ all_words♦
## Words starting with `re`
x1 <- all_words(raj$dialogue, begins.with="re")
head(x1, 10)
## WORD FREQ
## 1 re 2
## 2 reach 1
## 3 read 6
## 4 ready 5
## 5 rearward 1
## 6 reason 5
## 7 reason's 1
## 8 rebeck 1
## 9 rebellious 1
## 10 receipt 1
## Words containing `conc`
all_words(raj$dialogue, contains = "conc")
## WORD FREQ
## 1 conceal'd 1
## 2 conceit 2
## 3 conceive 1
## 4 concludes 1
## 5 reconcile 1
## All words ordered by frequency
x2 <- all_words(raj$dialogue, alphabetical = FALSE)
head(x2, 10)
## WORD FREQ
## 1 and 666
## 2 the 656
## 3 i 573
## 4 to 517
## 5 a 445
## 6 of 378
## 7 my 358
## 8 is 344
## 9 that 344
## 10 in 312
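The begins.with search can be mimicked in base R by tabulating a bag of words and filtering on a prefix. The following is a sketch under assumptions: the text vector is hypothetical and the cleaning step is deliberately simpler than qdap's.

```r
# Hypothetical dialogue
text <- c("Read the receipt.", "Ready to reach a reason.")

# Lower-case bag of words with punctuation (except apostrophes) removed
words <- unlist(strsplit(tolower(gsub("[^[:alpha:]' ]", "", text)), "\\s+"))

# Frequency table as a dataframe with WORD and FREQ columns
freq <- as.data.frame(table(WORD = words), responseName = "FREQ",
                      stringsAsFactors = FALSE)

# Keep only the words that begin with "re"
re_words <- freq[startsWith(freq$WORD, "re"), ]
re_words
```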
The qdap package utilizes the following functions to turn text into a bag of words (word order is preserved):
bag_o_words | Reduces a text column to a single vector bag of words. |
breaker | Reduces a text column to a single vector bag of words and qdap recognized end marks. |
word_split | Reduces a text column to a list of vectors of bag of words and qdap recognized end marks (i.e., “.”, “!”, “?”, “*”, “-”). |
Bag of words can be useful for any number of reasons within the scope of analyzing discourse. Many other qdap functions employ or mention these three functions as seen in the following counts for the three word splitting functions.
 | Function | bag_o_words | breaker | word_split |
1 | all_words.R | 1 | - | - |
2 | automated_readability_index.R | - | - | 2 |
3 | bag_o_words.R | 10 | 6 | 3 |
4 | capitalizer.R | 3 | 1 | - |
5 | imperative.R | - | 3 | - |
6 | ngrams.R | 1 | - | - |
7 | polarity.R | 2 | - | - |
8 | rm_stopwords.R | 1 | 3 | - |
9 | textLISTER.R | - | - | 2 |
10 | trans_cloud.R | 1 | 1 | - |
11 | wfm.R | 1 | - | - |
♦ Word Splitting Examples♦
bag_o_words("I'm going home!")
## [1] "i'm" "going" "home"
bag_o_words("I'm going home!", apostrophe.remove = TRUE)
## [1] "im" "going" "home"
bag_o_words(DATA$state)
## [1] "computer" "is" "fun" "not" "too" "fun"
## [7] "no" "it's" "not" "it's" "dumb" "what"
## [13] "should" "we" "do" "you" "liar" "it"
## [19] "stinks" "i" "am" "telling" "the" "truth"
## [25] "how" "can" "we" "be" "certain" "there"
## [31] "is" "no" "way" "i" "distrust" "you"
## [37] "what" "are" "you" "talking" "about" "shall"
## [43] "we" "move" "on" "good" "then" "i'm"
## [49] "hungry" "let's" "eat" "you" "already"
by(DATA$state, DATA$person, bag_o_words)
## DATA$person: greg
## [1] "no" "it's" "not" "it's" "dumb" "i" "am"
## [8] "telling" "the" "truth" "there" "is" "no" "way"
## [15] "i'm" "hungry" "let's" "eat" "you" "already"
## --------------------------------------------------------
## DATA$person: researcher
## [1] "shall" "we" "move" "on" "good" "then"
## --------------------------------------------------------
## DATA$person: sally
## [1] "how" "can" "we" "be" "certain" "what" "are"
## [8] "you" "talking" "about"
## --------------------------------------------------------
## DATA$person: sam
## [1] "computer" "is" "fun" "not" "too" "fun"
## [7] "you" "liar" "it" "stinks" "i" "distrust"
## [13] "you"
## --------------------------------------------------------
## DATA$person: teacher
## [1] "what" "should" "we" "do"
lapply(DATA$state, bag_o_words)
## [[1]]
## [1] "computer" "is" "fun" "not" "too" "fun"
##
## [[2]]
## [1] "no" "it's" "not" "it's" "dumb"
##
## [[3]]
## [1] "what" "should" "we" "do"
##
## [[4]]
## [1] "you" "liar" "it" "stinks"
##
## [[5]]
## [1] "i" "am" "telling" "the" "truth"
##
## [[6]]
## [1] "how" "can" "we" "be" "certain"
##
## [[7]]
## [1] "there" "is" "no" "way"
##
## [[8]]
## [1] "i" "distrust" "you"
##
## [[9]]
## [1] "what" "are" "you" "talking" "about"
##
## [[10]]
## [1] "shall" "we" "move" "on" "good" "then"
##
## [[11]]
## [1] "i'm" "hungry" "let's" "eat" "you" "already"
breaker(DATA$state)
## [1] "Computer" "is" "fun" "." "Not" "too"
## [7] "fun" "." "No" "it's" "not," "it's"
## [13] "dumb" "." "What" "should" "we" "do"
## [19] "?" "You" "liar," "it" "stinks" "!"
## [25] "I" "am" "telling" "the" "truth" "!"
## [31] "How" "can" "we" "be" "certain" "?"
## [37] "There" "is" "no" "way" "." "I"
## [43] "distrust" "you" "." "What" "are" "you"
## [49] "talking" "about" "?" "Shall" "we" "move"
## [55] "on" "?" "Good" "then" "." "I'm"
## [61] "hungry" "." "Let's" "eat" "." "You"
## [67] "already" "?"
by(DATA$state, DATA$person, breaker)
## DATA$person: greg
## [1] "No" "it's" "not," "it's" "dumb" "." "I"
## [8] "am" "telling" "the" "truth" "!" "There" "is"
## [15] "no" "way" "." "I'm" "hungry" "." "Let's"
## [22] "eat" "." "You" "already" "?"
## --------------------------------------------------------
## DATA$person: researcher
## [1] "Shall" "we" "move" "on" "?" "Good" "then" "."
## --------------------------------------------------------
## DATA$person: sally
## [1] "How" "can" "we" "be" "certain" "?" "What"
## [8] "are" "you" "talking" "about" "?"
## --------------------------------------------------------
## DATA$person: sam
## [1] "Computer" "is" "fun" "." "Not" "too"
## [7] "fun" "." "You" "liar," "it" "stinks"
## [13] "!" "I" "distrust" "you" "."
## --------------------------------------------------------
## DATA$person: teacher
## [1] "What" "should" "we" "do" "?"
lapply(DATA$state, breaker)
## [[1]]
## [1] "Computer" "is" "fun" "." "Not" "too"
## [7] "fun" "."
##
## [[2]]
## [1] "No" "it's" "not," "it's" "dumb" "."
##
## [[3]]
## [1] "What" "should" "we" "do" "?"
##
## [[4]]
## [1] "You" "liar," "it" "stinks" "!"
##
## [[5]]
## [1] "I" "am" "telling" "the" "truth" "!"
##
## [[6]]
## [1] "How" "can" "we" "be" "certain" "?"
##
## [[7]]
## [1] "There" "is" "no" "way" "."
##
## [[8]]
## [1] "I" "distrust" "you" "."
##
## [[9]]
## [1] "What" "are" "you" "talking" "about" "?"
##
## [[10]]
## [1] "Shall" "we" "move" "on" "?" "Good" "then" "."
##
## [[11]]
## [1] "I'm" "hungry" "." "Let's" "eat" "." "You"
## [8] "already" "?"
word_split(c(NA, DATA$state))
## $<NA>
## [1] NA
##
## $`Computer is fun. Not too fun.`
## [1] "Computer" "is" "fun" "." "Not" "too"
## [7] "fun" "."
##
## $`No it's not, it's dumb.`
## [1] "No" "it's" "not," "it's" "dumb" "."
##
## $`What should we do?`
## [1] "What" "should" "we" "do" "?"
##
## $`You liar, it stinks!`
## [1] "You" "liar," "it" "stinks" "!"
##
## $`I am telling the truth!`
## [1] "I" "am" "telling" "the" "truth" "!"
##
## $`How can we be certain?`
## [1] "How" "can" "we" "be" "certain" "?"
##
## $`There is no way.`
## [1] "There" "is" "no" "way" "."
##
## $`I distrust you.`
## [1] "I" "distrust" "you" "."
##
## $`What are you talking about?`
## [1] "What" "are" "you" "talking" "about" "?"
##
## $`Shall we move on? Good then.`
## [1] "Shall" "we" "move" "on" "?" "Good" "then" "."
##
## $`I'm hungry. Let's eat. You already?`
## [1] "I'm" "hungry" "." "Let's" "eat" "." "You"
## [8] "already" "?"
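The difference between a plain bag of words and a split that keeps end marks can be sketched in base R. This is an illustration only; the regular expressions are assumptions and handle just the ".", "?" and "!" end marks.

```r
x <- "I'm hungry. Let's eat. You already?"

# Plain bag of words: lower case, punctuation (except apostrophes) dropped
bag <- strsplit(trimws(tolower(gsub("[^[:alpha:]' ]", " ", x))), "\\s+")[[1]]

# Keep end marks: pad each one with spaces so it becomes its own token
with_marks <- strsplit(trimws(gsub("([.?!])", " \\1 ", x)), "\\s+")[[1]]

bag
with_marks
```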
The common function finds items that are common between n vectors (i.e., subjects or grouping variables). This is useful for determining common language choices shared across participants in a conversation.
♦ Words in Common Examples♦
## Create vectors of words
a <- c("a", "cat", "dog", "the", "the")
b <- c("corn", "a", "chicken", "the")
d <- c("house", "feed", "a", "the", "chicken")
## Supply individual vectors
common(a, b, d, overlap=2)
## word freq
## 1 a 3
## 2 the 3
## 3 chicken 2
common(a, b, d, overlap=3)
## word freq
## 1 a 3
## 2 the 3
## Supply a list of vectors
common(list(a, b, d))
## word freq
## 1 a 3
## 2 the 3
## Use common to find words shared between subjects
common(word_list(DATA$state, DATA$person)$cwl, overlap = 2)
## word freq
## 1 we 3
## 2 you 3
## 3 I 2
## 4 is 2
## 5 not 2
## 6 what 2
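The overlap argument can be read as the minimum number of vectors a word must appear in. A small base R sketch of that counting, reusing the a, b and d vectors from above (the helper names are hypothetical and this is not qdap's code), follows.

```r
a <- c("a", "cat", "dog", "the", "the")
b <- c("corn", "a", "chicken", "the")
d <- c("house", "feed", "a", "the", "chicken")
vecs <- list(a, b, d)

# For every distinct word, count how many of the vectors contain it
vocab <- unique(unlist(vecs))
n_vectors <- vapply(
  vocab,
  function(w) sum(vapply(vecs, function(v) w %in% v, logical(1))),
  integer(1)
)

# Words found in at least two of the vectors
shared <- n_vectors[n_vectors >= 2]
shared
```

Repeated words within one vector (such as "the" in a) count only once, because membership rather than frequency is tested.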
It is often useful and more efficient to start with a preset vector of words and eliminate or exclude the words you do not wish to include. Examples could range from excluding an individual(s) from a column of participant names or excluding a few select word(s) from a pre-defined qdap word list. This is particularly useful for passing terms or stopwords to word counting functions like termco or trans_cloud.
♦ exclude Examples♦
exclude(1:10, 3, 4)
## [1] 1 2 5 6 7 8 9 10
exclude(Top25Words, qcv(the, of, and))
## [1] "a" "to" "in" "is" "you" "that" "it" "he" "was" "for"
## [11] "on" "are" "as" "with" "his" "they" "I" "at" "be" "this"
## [21] "have" "from"
exclude(Top25Words, "the", "of", "an")
## [1] "and" "a" "to" "in" "is" "you" "that" "it" "he" "was"
## [11] "for" "on" "are" "as" "with" "his" "they" "I" "at" "be"
## [21] "this" "have" "from"
## Using exclude with `term_match` and `termco`
MTCH.LST <- exclude(term_match(DATA$state, qcv(th, i)), qcv(truth, stinks))
termco(DATA$state, DATA$person, MTCH.LST)
## person word.count th i
## 1 greg 20 3(15.00%) 13(65.00%)
## 2 researcher 6 2(33.33%) 0
## 3 sally 10 0 4(40.00%)
## 4 sam 13 0 11(84.62%)
## 5 teacher 4 0 0
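For plain vectors, the effect of exclude resembles negated %in% matching in base R. The helper below is a hypothetical sketch of that idea (exclude_sketch is not a qdap function and does not cover exclude's other methods).

```r
# Hypothetical helper: drop any elements of x listed in ...
exclude_sketch <- function(x, ...) {
  drop <- unlist(list(...))
  x[!x %in% drop]
}

exclude_sketch(1:10, 3, 4)
exclude_sketch(c("the", "of", "and", "a"), c("the", "of"))
```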
Utilizing ngrams can be useful for gaining a sense of what terms are used in conjunction with other terms. This is particularly useful in the analysis of dialogue when the combination of a particular vocabulary is meaningful. The ngrams function provides a list of ngram related output that can be utilized in various analyses.
♦ ngrams Example (note that the output is only partial)♦
out <- ngrams(DATA$state, DATA$person, 2)
lapply(out[["all_n"]], function(x) sapply(x, paste, collapse = " "))
## $n_1
## [1] "about" "already" "am" "are" "be" "can"
## [7] "certain" "computer" "distrust" "do" "dumb" "eat"
## [13] "fun" "fun" "good" "how" "hungry" "i"
## [19] "i" "i'm" "is" "is" "it" "it's"
## [25] "it's" "let's" "liar" "move" "no" "no"
## [31] "not" "not" "on" "shall" "should" "stinks"
## [37] "talking" "telling" "the" "then" "there" "too"
## [43] "truth" "way" "we" "we" "we" "what"
## [49] "what" "you" "you" "you" "you"
##
## $n_2
## [1] "am telling" "are you" "be certain" "can we"
## [5] "computer is" "distrust you" "eat you" "fun not"
## [9] "good then" "how can" "hungry let's" "i'm hungry"
## [13] "i am" "i distrust" "is fun" "is no"
## [17] "it's dumb" "it's not" "it stinks" "let's eat"
## [21] "liar it" "move on" "no it's" "no way"
## [25] "not it's" "not too" "on good" "shall we"
## [29] "should we" "talking about" "telling the" "the truth"
## [33] "there is" "too fun" "we be" "we do"
## [37] "we move" "what are" "what should" "you already"
## [41] "you liar" "you talking"
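An ngram is simply a sliding window of n adjacent words. The base R sketch below builds the bigrams for one short bag of words; it illustrates the windowing only and is not the ngrams function's implementation.

```r
# Bag of words for "Computer is fun. Not too fun."
words <- c("computer", "is", "fun", "not", "too", "fun")
n <- 2

# Slide a window of n words across the vector and paste each window
bigrams <- vapply(
  seq_len(length(words) - n + 1),
  function(i) paste(words[i:(i + n - 1)], collapse = " "),
  character(1)
)
bigrams
```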
In analyzing discourse it may be helpful to remove certain words from the analysis as the words may not be meaningful or may overshadow the impact of other words. The rm_stopwords function can be utilized to remove stopwords from the dialogue before passing it to further analysis. It should be noted that many functions have a stopwords argument that allows for the removal of the stopwords within the function environment rather than altering the text in the primary discourse dataframe. Careful researcher consideration must be given as to the functional impact of removing words from an analysis.
♦ Stopword Removal Examples♦
## The data
DATA$state
## [1] "Computer is fun. Not too fun."
## [2] "No it's not, it's dumb."
## [3] "What should we do?"
## [4] "You liar, it stinks!"
## [5] "I am telling the truth!"
## [6] "How can we be certain?"
## [7] "There is no way."
## [8] "I distrust you."
## [9] "What are you talking about?"
## [10] "Shall we move on? Good then."
## [11] "I'm hungry. Let's eat. You already?"
rm_stopwords(DATA$state, Top200Words)
## [[1]]
## [1] "computer" "fun" "." "fun" "."
##
## [[2]]
## [1] "it's" "," "it's" "dumb" "."
##
## [[3]]
## [1] "?"
##
## [[4]]
## [1] "liar" "," "stinks" "!"
##
## [[5]]
## [1] "am" "telling" "truth" "!"
##
## [[6]]
## [1] "certain" "?"
##
## [[7]]
## [1] "."
##
## [[8]]
## [1] "distrust" "."
##
## [[9]]
## [1] "talking" "?"
##
## [[10]]
## [1] "shall" "?" "."
##
## [[11]]
## [1] "i'm" "hungry" "." "let's" "eat" "." "already"
## [8] "?"
rm_stopwords(DATA$state, Top200Words, strip = TRUE)
## [[1]]
## [1] "computer" "fun" "fun"
##
## [[2]]
## [1] "it's" "it's" "dumb"
##
## [[3]]
## character(0)
##
## [[4]]
## [1] "liar" "stinks"
##
## [[5]]
## [1] "am" "telling" "truth"
##
## [[6]]
## [1] "certain"
##
## [[7]]
## character(0)
##
## [[8]]
## [1] "distrust"
##
## [[9]]
## [1] "talking"
##
## [[10]]
## [1] "shall"
##
## [[11]]
## [1] "i'm" "hungry" "let's" "eat" "already"
rm_stopwords(DATA$state, Top200Words, separate = FALSE)
## [1] "computer fun. fun." "it's, it's dumb."
## [3] "?" "liar, stinks!"
## [5] "am telling truth!" "certain?"
## [7] "." "distrust."
## [9] "talking?" "shall?."
## [11] "i'm hungry. let's eat. already?"
rm_stopwords(DATA$state, Top200Words, unlist = TRUE, unique = TRUE)
## [1] "computer" "fun" "." "it's" "," "dumb"
## [7] "?" "liar" "stinks" "!" "am" "telling"
## [13] "truth" "certain" "distrust" "talking" "shall" "i'm"
## [19] "hungry" "let's" "eat" "already"
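At its core, stopword removal is a filter of a bag of words against a word list. A minimal base R sketch, assuming a short hypothetical stopword vector rather than Top200Words, is:

```r
# Hypothetical stopword list (much shorter than qdap's Top200Words)
stopwords <- c("is", "not", "too", "the", "we", "you", "it")

# Lower-case bag of words with punctuation removed
sentence <- "Computer is fun. Not too fun."
words <- strsplit(tolower(gsub("[[:punct:]]", "", sentence)), "\\s+")[[1]]

# Keep only the words that are not in the stopword list
kept <- words[!words %in% stopwords]
kept
```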
It is often useful to remove capitalization and punctuation from the dialogue in order to standardize the text. R is case sensitive. By removing capital letters and extra punctuation with the strip function the text is more comparable. In the following output we can see, through the == comparison operator and outer function, that the use of strip makes the different forms of Dan comparable.
x <- c("Dan", "dan", "dan.", "DAN")
y <- outer(x, x, "==")
dimnames(y) <- list(x, x); y
## Dan dan dan. DAN
## Dan TRUE FALSE FALSE FALSE
## dan FALSE TRUE FALSE FALSE
## dan. FALSE FALSE TRUE FALSE
## DAN FALSE FALSE FALSE TRUE
x <- strip(c("Dan", "dan", "dan.", "DAN"))
y <- outer(x, x, "==")
dimnames(y) <- list(x, x); y
## dan dan dan dan
## dan TRUE TRUE TRUE TRUE
## dan TRUE TRUE TRUE TRUE
## dan TRUE TRUE TRUE TRUE
## dan TRUE TRUE TRUE TRUE
As seen in the examples below, strip comes with multiple arguments to adjust the degree of text standardization.
♦ strip Examples♦
## Demonstrating the standardization of
## The data
DATA$state
## [1] "Computer is fun. Not too fun."
## [2] "No it's not, it's dumb."
## [3] "What should we do?"
## [4] "You liar, it stinks!"
## [5] "I am telling the truth!"
## [6] "How can we be certain?"
## [7] "There is no way."
## [8] "I distrust you."
## [9] "What are you talking about?"
## [10] "Shall we move on? Good then."
## [11] "I'm hungry. Let's eat. You already?"
strip(DATA$state)
## [1] "computer is fun not too fun" "no its not its dumb"
## [3] "what should we do" "you liar it stinks"
## [5] "i am telling the truth" "how can we be certain"
## [7] "there is no way" "i distrust you"
## [9] "what are you talking about" "shall we move on good then"
## [11] "im hungry lets eat you already"
strip(DATA$state, apostrophe.remove=FALSE)
## [1] "computer is fun not too fun" "no it's not it's dumb"
## [3] "what should we do" "you liar it stinks"
## [5] "i am telling the truth" "how can we be certain"
## [7] "there is no way" "i distrust you"
## [9] "what are you talking about" "shall we move on good then"
## [11] "i'm hungry let's eat you already"
strip(DATA$state, char.keep = c("?", "."))
## [1] "computer is fun. not too fun."
## [2] "no its not its dumb."
## [3] "what should we do?"
## [4] "you liar it stinks"
## [5] "i am telling the truth"
## [6] "how can we be certain?"
## [7] "there is no way."
## [8] "i distrust you."
## [9] "what are you talking about?"
## [10] "shall we move on? good then."
## [11] "im hungry. lets eat. you already?"
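The core of this standardization can be approximated in base R with tolower and gsub. The sketch below is an assumption about the general technique, not strip's actual code, and it ignores options such as char.keep and apostrophe.remove.

```r
x <- c("Dan", "dan", "dan.", "DAN")

# Drop everything that is not a letter, digit or space, then lower-case
clean <- tolower(gsub("[^[:alnum:] ]", "", x))
clean

# All four forms now compare as equal
length(unique(clean)) == 1
```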
It is useful in discourse analysis to analyze vocabulary use. This may mean searching for words similar to your initial word list. The synonyms (aka syn) function generates synonyms from the qdapDictionaries' SYNONYM dictionary. These synonyms can be returned as a list or a vector that can then be passed to other qdap functions.
♦ Synonyms Examples♦
synonyms(c("the", "cat", "teach"))
## no match for the following:
##
## the
## ========================
## $cat.def_1
## [1] "feline" "gib" "grimalkin" "kitty" "malkin"
##
## $cat.def_2
## [1] "moggy"
##
## $cat.def_3
## [1] "mouser" "puss"
##
## $cat.def_4
## [1] "pussy"
##
## $cat.def_5
## [1] "tabby"
##
## $teach.def_1
## [1] "advise" "coach" "demonstrate"
## [4] "direct" "discipline" "drill"
## [7] "edify" "educate" "enlighten"
## [10] "give lessons in" "guide" "impart"
## [13] "implant" "inculcate" "inform"
## [16] "instil" "instruct" "school"
## [19] "show" "train" "tutor"
syn(c("the", "cat", "teach"), return.list = FALSE)
## no match for the following:
##
## the
## ========================
## [1] "feline" "gib" "grimalkin"
## [4] "kitty" "malkin" "moggy"
## [7] "mouser" "puss" "pussy"
## [10] "tabby" "advise" "coach"
## [13] "demonstrate" "direct" "discipline"
## [16] "drill" "edify" "educate"
## [19] "enlighten" "give lessons in" "guide"
## [22] "impart" "implant" "inculcate"
## [25] "inform" "instil" "instruct"
## [28] "school" "show" "train"
## [31] "tutor"
syn(c("the", "cat", "teach"), multiwords = FALSE)
## no match for the following:
##
## the
## ========================
## $cat.def_1
## [1] "feline" "gib" "grimalkin" "kitty" "malkin"
##
## $cat.def_2
## [1] "moggy"
##
## $cat.def_3
## [1] "mouser" "puss"
##
## $cat.def_4
## [1] "pussy"
##
## $cat.def_5
## [1] "tabby"
##
## $teach.def_1
## [1] "advise" "coach" "demonstrate" "direct" "discipline"
## [6] "drill" "edify" "educate" "enlighten" "guide"
## [11] "impart" "implant" "inculcate" "inform" "instil"
## [16] "instruct" "school" "show" "train" "tutor"
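The difference between the list and vector return forms can be sketched in base R. This is a minimal illustration on a hypothetical, hand-built synonym list (not the qdapDictionaries data); `syn_list`, `syn_vec`, `single_words` and `txt` are names invented for the example, and the flattening and filtering steps only approximate what return.list = FALSE and multiwords = FALSE appear to do.

```r
# Hypothetical, hand-built synonym list (NOT the qdapDictionaries data)
syn_list <- list(
  cat.def_1   = c("feline", "kitty"),
  cat.def_2   = c("moggy"),
  teach.def_1 = c("coach", "educate", "give lessons in", "tutor")
)

# Flatten to a plain character vector (similar in spirit to return.list = FALSE)
syn_vec <- unlist(syn_list, use.names = FALSE)

# Keep single-word entries only (similar in spirit to multiwords = FALSE)
single_words <- syn_vec[!grepl(" ", syn_vec, fixed = TRUE)]

# Use the vector as a search list against some text
txt <- c("the kitty sat", "please tutor me", "nothing here")
pattern <- paste0("\\b(", paste(single_words, collapse = "|"), ")\\b")
hits <- grepl(pattern, txt, perl = TRUE)
hits
```

A flat vector of this kind is the form most easily handed on as a term list to a search or counting function.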
♦ Word Association Examples♦
ms <- c(" I ", "you")
et <- c(" it", " tell", "tru")
word_associate(DATA2$state, DATA2$person, match.string = ms,
wordcloud = TRUE, proportional = TRUE,
network.plot = TRUE, nw.label.proportional = TRUE, extra.terms = et,
cloud.legend =c("A", "B", "C"),
title.color = "blue", cloud.colors = c("red", "purple", "gray70"))
## row group unit text
## 1 4 sam 4 You liar, it stinks!
## 2 5 greg 5 I am telling the truth!
## 3 8 sam 8 I distrust you.
## 4 9 sally 9 What are you talking about?
## 5 11 greg 11 Im hungry. Lets eat. You already?
## 6 12 sam 12 I distrust you.
## 7 15 greg 15 I am telling the truth!
## 8 18 greg 18 Im hungry. Lets eat. You already?
## 9 19 sally 19 What are you talking about?
## 10 20 sam 20 You liar, it stinks!
## 11 21 greg 21 I am telling the truth!
## 12 22 sam 22 You liar, it stinks!
## 13 24 greg 24 Im hungry. Lets eat. You already?
## 14 25 greg 25 I am telling the truth!
## 15 30 sam 30 I distrust you.
## 16 31 greg 31 Im hungry. Lets eat. You already?
## 17 33 sam 33 I distrust you.
## 18 36 sam 36 You liar, it stinks!
## 19 40 greg 40 I am telling the truth!
## 20 41 sam 41 You liar, it stinks!
## 21 42 greg 42 I am telling the truth!
## 22 44 sam 44 You liar, it stinks!
## 23 47 sam 47 I distrust you.
## 24 49 sam 49 You liar, it stinks!
## 25 52 sally 52 What are you talking about?
## 26 53 sally 53 What are you talking about?
## 27 54 greg 54 I am telling the truth!
## 28 55 sam 55 I distrust you.
## 29 56 greg 56 Im hungry. Lets eat. You already?
## 30 57 greg 57 I am telling the truth!
## 31 58 greg 58 I am telling the truth!
## 32 59 greg 59 Im hungry. Lets eat. You already?
## 33 62 sam 62 You liar, it stinks!
## 34 63 sally 63 What are you talking about?
## 35 65 sam 65 I distrust you.
## 36 67 sally 67 What are you talking about?
## 37 68 sam 68 I distrust you.
##
## Match Terms
## ===========
##
## List 1:
## i, you
♦ Word Difference Examples♦
out <- with(DATA, word_diff_list(text.var = state,
grouping.var = list(sex, adult)))
ltruncdf(unlist(out, recursive = FALSE), n=4)
## $f.0_vs_f.1.unique_to_f.0
## word freq prop
## 1 about 1 0.1
## 2 are 1 0.25
## 3 be 1 0.1
## 4 can 1 0.16666666
##
## $f.0_vs_f.1.unique_to_f.1
## word freq prop
## 1 good 1 0.03030303
## 2 move 1 0.1
## 3 on 1 0.25
## 4 shall 1 0.1
##
## $f.0_vs_m.0.unique_to_f.0
## word freq prop
## 1 about 1 0.1
## 2 are 1 0.25
## 3 be 1 0.1
## 4 can 1 0.16666666
##
## $f.0_vs_m.0.unique_to_m.0
## word freq prop
## 1 fun 2 0.06060606
## 2 i 2 0.06060606
## 3 is 2 0.2
## 4 it's 2 0.06060606
##
## $f.1_vs_m.0.unique_to_f.1
## word freq prop
## 1 good 1 0.03030303
## 2 move 1 0.1
## 3 on 1 0.25
## 4 shall 1 0.1
##
## $f.1_vs_m.0.unique_to_m.0
## word freq prop
## 1 you 3 0.09090909
## 2 fun 2 0.06060606
## 3 i 2 0.06060606
## 4 is 2 0.2
##
## $f.0_vs_m.1.unique_to_f.0
## word freq prop
## 1 about 1 0.1
## 2 are 1 0.25
## 3 be 1 0.1
## 4 can 1 0.16666666
##
## $f.0_vs_m.1.unique_to_m.1
## word freq prop
## 1 do 1 0.1
## 2 should 1 0.25
##
## $f.1_vs_m.1.unique_to_f.1
## word freq prop
## 1 good 1 0.03030303
## 2 move 1 0.1
## 3 on 1 0.25
## 4 shall 1 0.1
##
## $f.1_vs_m.1.unique_to_m.1
## word freq prop
## 1 do 1 0.1
## 2 should 1 0.25
## 3 what 1 0.03030303
##
## $m.0_vs_m.1.unique_to_m.0
## word freq prop
## 1 you 3 0.09090909
## 2 fun 2 0.06060606
## 3 i 2 0.06060606
## 4 is 2 0.2
##
## $m.0_vs_m.1.unique_to_m.1
## word freq prop
## 1 do 1 0.1
## 2 should 1 0.25
## 3 we 1 0.16666666
## 4 what 1 0.03030303
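The idea behind the "unique to" tables can be sketched in base R with a set comparison between the words used by two groups. This is a rough illustration on hypothetical toy data, not the qdap implementation; `dat`, `words_by_group` and `unique_to` are invented names, and the prop column shown above is not reproduced.

```r
# Toy data (hypothetical; not the qdap DATA set)
dat <- data.frame(
  group = c("a", "a", "b"),
  text  = c("what are you talking about", "how can we be certain",
            "what should we do"),
  stringsAsFactors = FALSE
)

# Split lower-cased text into words for each group
words_by_group <- lapply(split(dat$text, dat$group), function(x) {
  unlist(strsplit(tolower(x), "[^a-z']+"))
})

# Frequency table of words used by `this` group and never by `other`
unique_to <- function(this, other) {
  w <- words_by_group[[this]]
  tab <- table(w[!w %in% words_by_group[[other]]])
  data.frame(word = names(tab), freq = as.integer(tab), stringsAsFactors = FALSE)
}

a_only <- unique_to("a", "b")
b_only <- unique_to("b", "a")
b_only
```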
♦ word_list Examples♦
with(DATA, word_list(state, person))
## $greg
## WORD FREQ
## 1 it's 2
## 2 no 2
## 3 already 1
## 4 am 1
## 5 dumb 1
## 6 eat 1
## 7 hungry 1
## 8 I 1
## 9 I'm 1
## 10 is 1
## 11 let's 1
## 12 not 1
## 13 telling 1
## 14 the 1
## 15 there 1
## 16 truth 1
## 17 way 1
## 18 you 1
##
## $researcher
## WORD FREQ
## 1 good 1
## 2 move 1
## 3 on 1
## 4 shall 1
## 5 then 1
## 6 we 1
##
## $sally
## WORD FREQ
## 1 about 1
## 2 are 1
## 3 be 1
## 4 can 1
## 5 certain 1
## 6 how 1
## 7 talking 1
## 8 we 1
## 9 what 1
## 10 you 1
##
## $sam
## WORD FREQ
## 1 fun 2
## 2 you 2
## 3 computer 1
## 4 distrust 1
## 5 I 1
## 6 is 1
## 7 it 1
## 8 liar 1
## 9 not 1
## 10 stinks 1
## 11 too 1
##
## $teacher
## WORD FREQ
## 1 do 1
## 2 should 1
## 3 we 1
## 4 what 1
with(DATA, word_list(state, person, stopwords = Top25Words))
## $greg
## WORD FREQ
## 1 it's 2
## 2 no 2
## 3 already 1
## 4 am 1
## 5 dumb 1
## 6 eat 1
## 7 hungry 1
## 8 I'm 1
## 9 let's 1
## 10 not 1
## 11 telling 1
## 12 there 1
## 13 truth 1
## 14 way 1
##
## $researcher
## WORD FREQ
## 1 good 1
## 2 move 1
## 3 shall 1
## 4 then 1
## 5 we 1
##
## $sally
## WORD FREQ
## 1 about 1
## 2 can 1
## 3 certain 1
## 4 how 1
## 5 talking 1
## 6 we 1
## 7 what 1
##
## $sam
## WORD FREQ
## 1 fun 2
## 2 computer 1
## 3 distrust 1
## 4 liar 1
## 5 not 1
## 6 stinks 1
## 7 too 1
##
## $teacher
## WORD FREQ
## 1 do 1
## 2 should 1
## 3 we 1
## 4 what 1
with(DATA, word_list(state, person, cap = FALSE, cap.list=c("do", "we")))
## $greg
## WORD FREQ
## 1 it's 2
## 2 no 2
## 3 already 1
## 4 am 1
## 5 dumb 1
## 6 eat 1
## 7 hungry 1
## 8 I 1
## 9 I'm 1
## 10 is 1
## 11 let's 1
## 12 not 1
## 13 telling 1
## 14 the 1
## 15 there 1
## 16 truth 1
## 17 way 1
## 18 you 1
##
## $researcher
## WORD FREQ
## 1 good 1
## 2 move 1
## 3 on 1
## 4 shall 1
## 5 then 1
## 6 We 1
##
## $sally
## WORD FREQ
## 1 about 1
## 2 are 1
## 3 be 1
## 4 can 1
## 5 certain 1
## 6 how 1
## 7 talking 1
## 8 We 1
## 9 what 1
## 10 you 1
##
## $sam
## WORD FREQ
## 1 fun 2
## 2 you 2
## 3 computer 1
## 4 distrust 1
## 5 I 1
## 6 is 1
## 7 it 1
## 8 liar 1
## 9 not 1
## 10 stinks 1
## 11 too 1
##
## $teacher
## WORD FREQ
## 1 Do 1
## 2 should 1
## 3 We 1
## 4 what 1
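A grouped word frequency list with stopword removal can be approximated in base R as follows. This is only a sketch under stated assumptions: the data are a hypothetical three-row excerpt, `stops` is an invented stand-in for a stopword vector such as Top25Words, and `word_freq` is not a qdap function. Capitalization handling (cap, cap.list) is not covered.

```r
# Toy data (hypothetical excerpt)
person <- c("sam", "greg", "sam")
state  <- c("Computer is fun. Not too fun.",
            "No it's not, it's dumb.",
            "You liar, it stinks!")

# Hypothetical stopword vector standing in for a list such as Top25Words
stops <- c("is", "it", "you", "not")

# Tokenize, drop stopwords, and tabulate in decreasing order of frequency
word_freq <- function(text, stopwords = character(0)) {
  w <- unlist(strsplit(tolower(text), "[^a-z']+"))
  w <- w[nzchar(w) & !w %in% stopwords]
  tab <- sort(table(w), decreasing = TRUE)
  data.frame(WORD = names(tab), FREQ = as.integer(tab), stringsAsFactors = FALSE)
}

# One frequency table per speaker
freq_by_person <- lapply(split(state, person), word_freq, stopwords = stops)
freq_by_person$sam
```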
A major task in qualitative work is coding either time or words with selected coding structures. For example a researcher may code the teacher's dialogue as related to the resulting behavior of a student in a classroom as “high”, “medium” or “low” engagement. The researcher may choose to apply the coding to:
The coding process in qdap starts with the decision of whether to code the dialogue and/or the time spans. After that the researcher may follow the sequential subsections in the Qualitative Coding System section outlined in these steps:
If you choose the route of coding words, qdap gives two approaches. Each has distinct benefits and disadvantages depending on the situation. If you choose to code time spans, qdap provides one option.
If you choose to code words, you may either code a csv file or code the transcript directly (perhaps with markers or other forms of markup), record the ranges in a text list, and then read in the data. Both approaches can result in the same data being read back into qdap. The csv approach may allow for extended capabilities (beyond the scope of this vignette), while the transcript/list approach is generally more efficient and mirrors the approach many qualitative researchers typically use in qualitative coding (it also has the added benefit of producing a hard copy).
The next three subsections will walk the reader through how to make a template, code in the template, and read the data back into R/qdap. Subsections 4-5 will cover reshaping and initial analysis after the data has been read in (this approach is generally the same for all three coded data types).
Before getting started with subsections 1-3 the reader will want to know the naming scheme of the code matrix (cm_) functions used. The initial cm_ is utilized for any code matrix family of functions. The functions containing cm_temp are template functions. The df, range, or time component determines whether the csv (df), Transcript/List (range), or Time Span (time) approach is being utilized. cm_ functions that bear 2long transform a read-in list to a usable long format.
The csv approach utilizes the cm_df.temp and cm_2long functions. To utilize the csv template approach simply supply the dataframe, specify the text variable and provide a list of anticipated codes.
♦ Coding Words (csv approach): The Template ♦
## Codes
codes <- qcv(dc, sf, wes, pol, rejk, lk, azx, mmm)
## The csv template
X <- cm_df.temp(DATA, text.var = "state", codes = codes, file = "DATA.csv")
qview(X)
========================================================================
nrow = 56 ncol = 14 X
========================================================================
person sex adult code text word.num dc sf wes pol rejk lk azx mmm
1 sam m 0 K1 Computer 1 0 0 0 0 0 0 0 0
2 sam m 0 K1 is 2 0 0 0 0 0 0 0 0
3 sam m 0 K1 fun. 3 0 0 0 0 0 0 0 0
4 sam m 0 K1 Not 4 0 0 0 0 0 0 0 0
5 sam m 0 K1 too 5 0 0 0 0 0 0 0 0
6 sam m 0 K1 fun. 6 0 0 0 0 0 0 0 0
7 greg m 0 K2 No 7 0 0 0 0 0 0 0 0
8 greg m 0 K2 it's 8 0 0 0 0 0 0 0 0
9 greg m 0 K2 not, 9 0 0 0 0 0 0 0 0
10 greg m 0 K2 it's 10 0 0 0 0 0 0 0 0
After coding the data (see the YouTube video) the data can be read back in with read.csv.
♦ Coding Words (csv approach): Read In and Reshape ♦
## Read in the data
dat <- read.csv("DATA.csv")
## Reshape to long format with word durations
cm_2long(dat)
code person sex adult code.1 text word.num start end variable
1 dc sam m 0 K1 Computer 1 0 1 dat
2 wes sam m 0 K1 Computer 1 0 1 dat
3 rejk sam m 0 K1 Computer 1 0 1 dat
4 mmm sam m 0 K1 Computer 1 0 1 dat
5 lk sam m 0 K1 is 2 1 2 dat
6 azx sam m 0 K1 is 2 1 2 dat
.
.
.
198 wes greg m 0 K11 already? 56 55 56 dat
199 rejk greg m 0 K11 already? 56 55 56 dat
200 lk greg m 0 K11 already? 56 55 56 dat
201 azx greg m 0 K11 already? 56 55 56 dat
202 mmm greg m 0 K11 already? 56 55 56 dat
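The reshaping step can be pictured in base R as turning a wide table of 0/1 code columns into one row per (code, word) pair, where word n occupies the span from n - 1 to n. The sketch below uses a hypothetical three-word coded template; `coded` and `long` are invented names and this is not the cm_2long source. Rows come out ordered by code rather than by word.

```r
# Hypothetical coded template: one row per word, one 0/1 column per code
coded <- data.frame(
  text     = c("Computer", "is", "fun."),
  word.num = 1:3,
  dc       = c(1, 0, 0),
  sf       = c(0, 0, 1),
  wes      = c(1, 1, 0),
  stringsAsFactors = FALSE
)
codes <- c("dc", "sf", "wes")

# For each code keep the words marked 1; word n spans from n - 1 to n
long <- do.call(rbind, lapply(codes, function(cd) {
  hit <- coded[coded[[cd]] == 1, c("text", "word.num")]
  if (nrow(hit) == 0) return(NULL)
  data.frame(code = cd, hit, start = hit$word.num - 1, end = hit$word.num,
             stringsAsFactors = FALSE)
}))
rownames(long) <- NULL
long
```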
The Transcript/List approach utilizes the cm_df.transcript, cm_range.temp and cm_2long functions. To use the transcript template simply supply the dataframe, specify the text variable and provide a list of anticipated codes.
♦ Coding Words (Transcript/List approach): Transcript Template ♦
## Codes
codes <- qcv(AA, BB, CC)
## Transcript template
X <- cm_df.transcript(DATA$state, DATA$person, file="DATA.txt")
sam:
1 2 3 4 5 6
Computer is fun. Not too fun.
greg:
7 8 9 10 11
No it's not, it's dumb.
teacher:
12 13 14 15
What should we do?
sam:
16 17 18 19
You liar, it stinks!
♦ Coding Words (Transcript/List approach): List Template 1♦
### List template
cm_range.temp(codes, file = "foo1.txt")
list(
AA = qcv(terms=''),
BB = qcv(terms=''),
CC = qcv(terms='')
)
The list below contains demographic variables. If the researcher has demographic variables it is recommended to supply them at this point. The demographic variables will be generated with durations automatically.
♦ Coding Words (Transcript/List approach): List Template 2♦
### List template with demographic variables
with(DATA, cm_range.temp(codes = codes, text.var = state,
grouping.var = list(person, adult), file = "foo2.txt"))
list(
person_greg = qcv(terms='7:11, 20:24, 30:33, 49:56'),
person_researcher = qcv(terms='42:48'),
person_sally = qcv(terms='25:29, 37:41'),
person_sam = qcv(terms='1:6, 16:19, 34:36'),
person_teacher = qcv(terms='12:15'),
adult_0 = qcv(terms='1:11, 16:41, 49:56'),
adult_1 = qcv(terms='12:15, 42:48'),
AA = qcv(terms=''),
BB = qcv(terms=''),
CC = qcv(terms='')
)
After coding the data (see the YouTube video) the data can be read back in with source. Be sure to assign the list to an object (e.g., dat <- list()).
♦ Coding Words (Transcript/List approach): Read in the data♦
## Read it in
source("foo1.txt")
### View it
Time1
$AA
[1] "1"
$BB
[1] "1:2," "3:10," "19"
$CC
[1] "1:9," "100:150"
This format is not particularly useful. The data can be reshaped to long format with durations via cm_2long:
♦ Coding Words (Transcript/List approach): Long format♦
## Long format with durations
datL <- cm_2long(Time1)
datL
code start end variable
1 AA 0 1 Time1
2 BB 0 2 Time1
3 BB 2 10 Time1
4 BB 18 19 Time1
5 CC 0 9 Time1
6 CC 99 150 Time1
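The conversion from range strings to start/end columns follows a simple rule visible in the output above: a range a:b becomes start a - 1 and end b, and a single word n becomes start n - 1 and end n. A minimal base R sketch of that rule follows; `ranges_to_long` is a hypothetical helper written for illustration, not part of qdap, and it omits the variable column.

```r
# Convert range strings such as "1:2, 3:10, 19" into start/end columns
ranges_to_long <- function(code_list) {
  rows <- lapply(names(code_list), function(nm) {
    # Split on commas/whitespace and drop empty pieces
    pieces <- unlist(strsplit(paste(code_list[[nm]], collapse = " "),
                              "[,[:space:]]+"))
    pieces <- pieces[nzchar(pieces)]
    if (length(pieces) == 0) return(NULL)
    # "3:10" -> first 3, last 10; "19" -> first 19, last 19
    bounds <- strsplit(pieces, ":", fixed = TRUE)
    first <- as.numeric(vapply(bounds, function(b) b[1], character(1)))
    last  <- as.numeric(vapply(bounds, function(b) b[length(b)], character(1)))
    data.frame(code = nm, start = first - 1, end = last,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Same values as the list read in above
Time1 <- list(AA = "1", BB = c("1:2,", "3:10,", "19"), CC = c("1:9,", "100:150"))
res <- ranges_to_long(Time1)
res
```

Storing the start as one less than the first word makes each span's length equal to end minus start, which is convenient for duration summaries.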
The Time Span approach utilizes the cm_time.temp and cm_2long functions. To generate the time span template simply supply the list of anticipated codes and a start/end time.
♦ Coding Time Spans: Time Span Template ♦
## Codes
## Time span template
X <- cm_time.temp(start = ":14", end = "7:40", file="timespans.txt")
X <- cm_time.temp(start = ":14", end = "7:40", file="timespans.doc")
[0] 14 15 16 ... 51 52 53 54 55 56 57 58 59
[1]0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 51 52 53 54 55 56 57 58 59
[2]0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 51 52 53 54 55 56 57 58 59
[3]0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 51 52 53 54 55 56 57 58 59
[4]0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 51 52 53 54 55 56 57 58 59
[5]0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 51 52 53 54 55 56 57 58 59
[6]0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 51 52 53 54 55 56 57 58 59
[7]0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 51 52 53
♦ Coding Time Spans: List Template 1♦
### List template
codes <- qcv(AA, BB, CC)
cm_time.temp(codes, file = "codelist.txt")
list(
transcript_time_span = qcv(terms="00:00 - 00:00"),
AA = qcv(terms=""),
BB = qcv(terms=""),
CC = qcv(terms="")
)
The list below contains demographic variables. If the researcher has demographic variables it is recommended to supply them at this point.
♦ Coding Time Spans: List Template 2♦
### List template with demographic variables
with(DATA, cm_time.temp(codes, list(person, adult), file = "codelist.txt"))
list(
transcript_time_span = qcv(terms="00:00 - 00:00"),
person_sam = qcv(terms=""),
person_greg = qcv(terms=""),
person_teacher = qcv(terms=""),
person_sally = qcv(terms=""),
person_researcher = qcv(terms=""),
adult_0 = qcv(terms=""),
adult_1 = qcv(terms=""),
AA = qcv(terms=""),
BB = qcv(terms=""),
CC = qcv(terms="")
)
After coding the data (see the YouTube video) the data can be read back in with source. Be sure to assign the list to an object (e.g., dat <- list()).
♦ Coding Time Spans: Read in the data♦
## Read it in
source("codelist.txt")
### View it
Time1
$transcript_time_span
[1] "00:00" "-" "1:12:00"
$A
[1] "2.40:3.00," "5.01," "6.52:7.00," "9.00"
$B
[1] "2.40," "3.01:3.40," "5.01," "6.52:7.00," "9.00"
$C
[1] "2.40:4.00," "5.01," "6.52:7.00," "9.00," "13.00:17.01"
This format is not particularly useful. The data can be reshaped to long format with durations via cm_2long:
♦ Coding Time Spans: Long format♦
## Long format with durations
datL <- cm_2long(Time1, v.name = "time")
datL
code start end Start End variable
1 A 159 180 00:02:39 00:03:00 Time1
2 A 300 301 00:05:00 00:05:01 Time1
3 A 411 420 00:06:51 00:07:00 Time1
4 A 539 540 00:08:59 00:09:00 Time1
5 B 159 160 00:02:39 00:02:40 Time1
6 B 180 220 00:03:00 00:03:40 Time1
7 B 300 301 00:05:00 00:05:01 Time1
8 B 411 420 00:06:51 00:07:00 Time1
9 B 539 540 00:08:59 00:09:00 Time1
10 C 159 240 00:02:39 00:04:00 Time1
11 C 300 301 00:05:00 00:05:01 Time1
12 C 411 420 00:06:51 00:07:00 Time1
13 C 539 540 00:08:59 00:09:00 Time1
14 C 779 1021 00:12:59 00:17:01 Time1
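The arithmetic behind the start and end columns can be checked by hand: a stamp written m.ss or h.mm.ss is converted to seconds, and the start of a span is placed one second before its first stamp. The base R sketch below encodes that reading of the output above; `to_seconds` and `span_to_seconds` are hypothetical helpers, not qdap functions.

```r
# Convert an "m.ss" or "h.mm.ss" stamp to seconds
to_seconds <- function(stamp) {
  parts <- as.numeric(strsplit(stamp, ".", fixed = TRUE)[[1]])
  sum(parts * 60^((length(parts) - 1):0))
}

# Convert one span ("2.40:3.00") or single stamp ("5.01") to start/end seconds
span_to_seconds <- function(span) {
  span <- sub(",$", "", span)                      # drop a trailing comma
  ends <- strsplit(span, ":", fixed = TRUE)[[1]]
  c(start = to_seconds(ends[1]) - 1,               # one second before the first stamp
    end   = to_seconds(ends[length(ends)]))
}

span_to_seconds("2.40:3.00,")       # 2 min 40 s through 3 min 0 s
span_to_seconds("5.01")             # a single second
span_to_seconds("1.12.00:1.19.01")  # stamps that include hours
```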
The researcher may want to determine where codes do and do not overlap with one another. The cm_ functions bearing the cm_code. prefix perform various transformative functions (Boolean search). cm_code.combine will merge the spans (time or word) for given codes. cm_code.exclude will provide spans that exclude given codes. cm_code.overlap will yield the spans where all of the given codes co-occur. cm_code.transform is a wrapper for the previous three functions that produces one dataframe in a single call. Lastly, cm_code.blank provides a more flexible framework that allows for the introduction of multiple logical operators between codes. Most tasks can be handled with the cm_code.transform function.
For Examples of each click the links below:
1. cm_code.combine Examples
2. cm_code.exclude Examples
3. cm_code.overlap Examples
4. cm_code.transform Examples
5. cm_code.blank Examples
For the sake of simplicity the uses of these functions will be demonstrated via a gantt plot for a visual comparison of the data sets.
The reader should note that all of the above functions utilize two helper functions (cm_long2dummy and cm_dummy2long) to stretch the spans into single units of measure (word or second), perform a calculation, and then condense back to spans. More advanced needs may require the explicit use of these functions, though they are beyond the scope of this vignette.
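The stretch, calculate, condense cycle can be sketched in base R. This is an illustration of the idea only, not the cm_long2dummy or cm_dummy2long source: `stretch` and `condense` are hypothetical helpers, and the three codes use the word spans 0-10 (AA), 0-2, 2-10, 18-19 (BB) and 0-3, 4-6 (CC).

```r
# Stretch spans into a 0/1 vector with one slot per unit;
# slot k stands for the unit running from k - 1 to k
stretch <- function(start, end, n) {
  out <- integer(n)
  for (i in seq_along(start)) out[(start[i] + 1):end[i]] <- 1L
  out
}

# Condense a 0/1 vector back into start/end spans
condense <- function(dummy) {
  r <- rle(dummy)
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths
  data.frame(start = starts[r$values == 1], end = ends[r$values == 1])
}

n  <- 20
AA <- stretch(0, 10, n)
BB <- stretch(c(0, 2, 18), c(2, 10, 19), n)
CC <- stretch(c(0, 4), c(3, 6), n)

any_code  <- condense(as.integer(AA | BB | CC))  # OR: at least one code present
all_codes <- condense(as.integer(AA & BB & CC))  # AND: every code present
any_code
all_codes
```

Working on unit-level dummy vectors keeps each Boolean operation to a single vectorized expression, at the cost of memory proportional to the length of the transcript.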
The following data sets will be utilized throughout the demonstrations of the cm_code. family of functions:
♦ Common Data Sets - Word Approach♦
foo <- list(
AA = qcv(terms="1:10"),
BB = qcv(terms="1:2, 3:10, 19"),
CC = qcv(terms="1:3, 5:6")
)
foo2 <- list(
AA = qcv(terms="4:8"),
BB = qcv(terms="1:4, 10:12"),
CC = qcv(terms="1, 11, 15:20"),
DD = qcv(terms="")
)
## Single time, long word approach
(x <- cm_2long(foo))
## code start end variable
## 1 AA 0 10 foo
## 2 BB 0 2 foo
## 3 BB 2 10 foo
## 4 BB 18 19 foo
## 5 CC 0 3 foo
## 6 CC 4 6 foo
## Repeated measures, long word approach
(z <- cm_2long(foo, foo2, v.name="time"))
## code start end time
## 1 AA 0 10 foo
## 2 BB 0 2 foo
## 3 BB 2 10 foo
## 4 BB 18 19 foo
## 5 CC 0 3 foo
## 6 CC 4 6 foo
## 7 AA 3 8 foo2
## 8 BB 0 4 foo2
## 9 BB 9 12 foo2
## 10 CC 0 1 foo2
## 11 CC 10 11 foo2
## 12 CC 14 20 foo2
♦ Common Data Sets - Time Span Approach♦
bar1 <- list(
transcript_time_span = qcv(00:00 - 1:12:00),
A = qcv(terms = "2.40:3.00, 5.01, 6.02:7.00, 9.00"),
B = qcv(terms = "2.40, 3.01:3.02, 5.01, 6.02:7.00, 9.00,
1.12.00:1.19.01"),
C = qcv(terms = "2.40:3.00, 5.01, 6.02:7.00, 9.00, 16.25:17.01")
)
bar2 <- list(
transcript_time_span = qcv(00:00 - 1:12:00),
A = qcv(terms = "2.40:3.00, 5.01, 6.02:7.00, 9.00"),
B = qcv(terms = "2.40, 3.01:3.02, 5.01, 6.02:7.00, 9.00,
1.12.00:1.19.01"),
C = qcv(terms = "2.40:3.00, 5.01, 6.02:7.00, 9.00, 17.01")
)
## Single time, long time approach
(dat <- cm_2long(bar1))
## code start end Start End variable
## 1 A 159 180 00:02:39 00:03:00 bar1
## 2 A 300 301 00:05:00 00:05:01 bar1
## 3 A 361 420 00:06:01 00:07:00 bar1
## 4 A 539 540 00:08:59 00:09:00 bar1
## 5 B 159 160 00:02:39 00:02:40 bar1
## 6 B 180 182 00:03:00 00:03:02 bar1
## 7 B 300 301 00:05:00 00:05:01 bar1
## 8 B 361 420 00:06:01 00:07:00 bar1
## 9 B 539 540 00:08:59 00:09:00 bar1
## 10 B 4319 4741 01:11:59 01:19:01 bar1
## 11 C 159 180 00:02:39 00:03:00 bar1
## 12 C 300 301 00:05:00 00:05:01 bar1
## 13 C 361 420 00:06:01 00:07:00 bar1
## 14 C 539 540 00:08:59 00:09:00 bar1
## 15 C 984 1021 00:16:24 00:17:01 bar1
## Repeated measures, long time approach
(dats <- cm_2long(bar1, bar2, v.name = "time"))
## code start end Start End time
## 1 A 159 180 00:02:39 00:03:00 bar1
## 2 A 300 301 00:05:00 00:05:01 bar1
## 3 A 361 420 00:06:01 00:07:00 bar1
## 4 A 539 540 00:08:59 00:09:00 bar1
## 5 B 159 160 00:02:39 00:02:40 bar1
## 6 B 180 182 00:03:00 00:03:02 bar1
## 7 B 300 301 00:05:00 00:05:01 bar1
## 8 B 361 420 00:06:01 00:07:00 bar1
## 9 B 539 540 00:08:59 00:09:00 bar1
## 10 B 4319 4741 01:11:59 01:19:01 bar1
## 11 C 159 180 00:02:39 00:03:00 bar1
## 12 C 300 301 00:05:00 00:05:01 bar1
## 13 C 361 420 00:06:01 00:07:00 bar1
## 14 C 539 540 00:08:59 00:09:00 bar1
## 15 C 984 1021 00:16:24 00:17:01 bar1
## 16 A 159 180 00:02:39 00:03:00 bar2
## 17 A 300 301 00:05:00 00:05:01 bar2
## 18 A 361 420 00:06:01 00:07:00 bar2
## 19 A 539 540 00:08:59 00:09:00 bar2
## 20 B 159 160 00:02:39 00:02:40 bar2
## 21 B 180 182 00:03:00 00:03:02 bar2
## 22 B 300 301 00:05:00 00:05:01 bar2
## 23 B 361 420 00:06:01 00:07:00 bar2
## 24 B 539 540 00:08:59 00:09:00 bar2
## 25 B 4319 4741 01:11:59 01:19:01 bar2
## 26 C 159 180 00:02:39 00:03:00 bar2
## 27 C 300 301 00:05:00 00:05:01 bar2
## 28 C 361 420 00:06:01 00:07:00 bar2
## 29 C 539 540 00:08:59 00:09:00 bar2
## 30 C 1020 1021 00:17:00 00:17:01 bar2
cm_code.combine provides all the spans (time/words) that are occupied by one or more of the combined codes. For example, if we utilized cm_code.combine on code lists X and Y the result would be any span where X or Y is located. This is the OR of the Boolean search. Note that combine.code.list must be supplied as a list of named character vectors.
♦ cm_code.combine Single Time Word Example♦
(cc1 <- cm_code.combine(x, list(ALL=qcv(AA, BB, CC))))
## code start end
## 1 AA 0 10
## 2 BB 0 10
## 3 BB 18 19
## 4 CC 0 3
## 5 CC 4 6
## 6 ALL 0 10
## 7 ALL 18 19
♦ cm_code.combine Repeated Measures Word Example♦
combines <- list(AB=qcv(AA, BB), ABC=qcv(AA, BB, CC))
(cc2 <- cm_code.combine(z, combines, rm.var = "time"))
## code start end time
## 1 AA 0 10 foo
## 2 BB 0 10 foo
## 3 BB 18 19 foo
## 4 CC 0 3 foo
## 5 CC 4 6 foo
## 6 AB 0 10 foo
## 7 AB 18 19 foo
## 8 ABC 0 10 foo
## 9 ABC 18 19 foo
## 10 AA 3 8 foo2
## 11 BB 0 4 foo2
## 12 BB 9 12 foo2
## 13 CC 0 1 foo2
## 14 CC 10 11 foo2
## 15 CC 14 20 foo2
## 16 AB 0 8 foo2
## 17 AB 9 12 foo2
## 18 ABC 0 8 foo2
## 19 ABC 9 12 foo2
## 20 ABC 14 20 foo2
♦ cm_code.combine Single Time Time Span Example♦
combines2 <- list(AB=qcv(A, B), BC=qcv(B, C), ABC=qcv(A, B, C))
(cc3 <- cm_code.combine(dat, combines2))
## code start end Start End
## 1 A 159 180 00:02:39 00:03:00
## 2 A 300 301 00:05:00 00:05:01
## 3 A 361 420 00:06:01 00:07:00
## 4 A 539 540 00:08:59 00:09:00
## 5 B 159 160 00:02:39 00:02:40
## 6 B 180 182 00:03:00 00:03:02
## 7 B 300 301 00:05:00 00:05:01
## 8 B 361 420 00:06:01 00:07:00
## 9 B 539 540 00:08:59 00:09:00
## 10 B 4319 4741 01:11:59 01:19:01
## 11 C 159 180 00:02:39 00:03:00
## 12 C 300 301 00:05:00 00:05:01
## 13 C 361 420 00:06:01 00:07:00
## 14 C 539 540 00:08:59 00:09:00
## 15 C 984 1021 00:16:24 00:17:01
## 16 AB 159 182 00:02:39 00:03:02
## 17 AB 300 301 00:05:00 00:05:01
## 18 AB 361 420 00:06:01 00:07:00
## 19 AB 539 540 00:08:59 00:09:00
## 20 AB 4319 4741 01:11:59 01:19:01
## 21 BC 159 182 00:02:39 00:03:02
## 22 BC 300 301 00:05:00 00:05:01
## 23 BC 361 420 00:06:01 00:07:00
## 24 BC 539 540 00:08:59 00:09:00
## 25 BC 984 1021 00:16:24 00:17:01
## 26 BC 4319 4741 01:11:59 01:19:01
## 27 ABC 159 182 00:02:39 00:03:02
## 28 ABC 300 301 00:05:00 00:05:01
## 29 ABC 361 420 00:06:01 00:07:00
## 30 ABC 539 540 00:08:59 00:09:00
## 31 ABC 984 1021 00:16:24 00:17:01
## 32 ABC 4319 4741 01:11:59 01:19:01
cm_code.exclude provides all the spans (time/words) that are occupied by one or more of the combined codes with the exclusion of another code. For example, if we utilized cm_code.exclude on code lists X and Y the result would be any span where X is located but Y is not. This is the NOT of the Boolean search. The last term supplied to exclude.code.list is the excluded term: all other terms are combined and the final code term is partitioned out. Note that exclude.code.list must be supplied as a list of named character vectors.
♦ cm_code.exclude Single Time Word Example♦
(ce1 <- cm_code.exclude(x, list(BnoC=qcv(BB, CC))))
## code start end
## 1 AA 0 10
## 2 BB 0 10
## 3 BB 18 19
## 4 CC 0 3
## 5 CC 4 6
## 6 BnoC 3 4
## 7 BnoC 6 10
## 8 BnoC 18 19
♦ cm_code.exclude Repeated Measures Word Example♦
exlist <- list(AnoB=qcv(AA, BB), ABnoC=qcv(AA, BB, CC))
(ce2 <- cm_code.exclude(z, exlist, rm.var = "time"))
## code start end time
## 1 AA 0 10 foo
## 2 BB 0 10 foo
## 3 BB 18 19 foo
## 4 CC 0 3 foo
## 5 CC 4 6 foo
## 6 ABnoC 3 4 foo
## 7 ABnoC 6 10 foo
## 8 ABnoC 18 19 foo
## 9 AA 3 8 foo2
## 10 BB 0 4 foo2
## 11 BB 9 12 foo2
## 12 CC 0 1 foo2
## 13 CC 10 11 foo2
## 14 CC 14 20 foo2
## 15 AnoB 4 8 foo2
## 16 ABnoC 1 8 foo2
## 17 ABnoC 9 10 foo2
## 18 ABnoC 11 12 foo2
♦ cm_code.exclude Repeated Measures Time Span Example♦
exlist2 <- list(AnoB=qcv(A, B), BnoC=qcv(B, C), ABnoC=qcv(A, B, C))
(ce3 <- cm_code.exclude(dats, exlist2, "time"))
## code start end Start End time
## 1 A 159 180 00:02:39 00:03:00 bar1
## 2 A 300 301 00:05:00 00:05:01 bar1
## 3 A 361 420 00:06:01 00:07:00 bar1
## 4 A 539 540 00:08:59 00:09:00 bar1
## 5 B 159 160 00:02:39 00:02:40 bar1
## 6 B 180 182 00:03:00 00:03:02 bar1
## 7 B 300 301 00:05:00 00:05:01 bar1
## 8 B 361 420 00:06:01 00:07:00 bar1
## 9 B 539 540 00:08:59 00:09:00 bar1
## 10 B 4319 4741 01:11:59 01:19:01 bar1
## 11 C 159 180 00:02:39 00:03:00 bar1
## 12 C 300 301 00:05:00 00:05:01 bar1
## 13 C 361 420 00:06:01 00:07:00 bar1
## 14 C 539 540 00:08:59 00:09:00 bar1
## 15 C 984 1021 00:16:24 00:17:01 bar1
## 16 AnoB 160 180 00:02:40 00:03:00 bar1
## 17 BnoC 180 182 00:03:00 00:03:02 bar1
## 18 BnoC 4319 4741 01:11:59 01:19:01 bar1
## 19 ABnoC 180 182 00:03:00 00:03:02 bar1
## 20 ABnoC 4319 4741 01:11:59 01:19:01 bar1
## 21 A 159 180 00:02:39 00:03:00 bar2
## 22 A 300 301 00:05:00 00:05:01 bar2
## 23 A 361 420 00:06:01 00:07:00 bar2
## 24 A 539 540 00:08:59 00:09:00 bar2
## 25 B 159 160 00:02:39 00:02:40 bar2
## 26 B 180 182 00:03:00 00:03:02 bar2
## 27 B 300 301 00:05:00 00:05:01 bar2
## 28 B 361 420 00:06:01 00:07:00 bar2
## 29 B 539 540 00:08:59 00:09:00 bar2
## 30 B 4319 4741 01:11:59 01:19:01 bar2
## 31 C 159 180 00:02:39 00:03:00 bar2
## 32 C 300 301 00:05:00 00:05:01 bar2
## 33 C 361 420 00:06:01 00:07:00 bar2
## 34 C 539 540 00:08:59 00:09:00 bar2
## 35 C 1020 1021 00:17:00 00:17:01 bar2
## 36 AnoB 160 180 00:02:40 00:03:00 bar2
## 37 BnoC 180 182 00:03:00 00:03:02 bar2
## 38 BnoC 4319 4741 01:11:59 01:19:01 bar2
## 39 ABnoC 180 182 00:03:00 00:03:02 bar2
## 40 ABnoC 4319 4741 01:11:59 01:19:01 bar2
♦ cm_code.exclude Single Time Time Span Combined Exclude Example♦
(ce4.1 <- cm_code.combine(dat, list(AB = qcv(A, B))))
## code start end Start End
## 1 A 159 180 00:02:39 00:03:00
## 2 A 300 301 00:05:00 00:05:01
## 3 A 361 420 00:06:01 00:07:00
## 4 A 539 540 00:08:59 00:09:00
## 5 B 159 160 00:02:39 00:02:40
## 6 B 180 182 00:03:00 00:03:02
## 7 B 300 301 00:05:00 00:05:01
## 8 B 361 420 00:06:01 00:07:00
## 9 B 539 540 00:08:59 00:09:00
## 10 B 4319 4741 01:11:59 01:19:01
## 11 C 159 180 00:02:39 00:03:00
## 12 C 300 301 00:05:00 00:05:01
## 13 C 361 420 00:06:01 00:07:00
## 14 C 539 540 00:08:59 00:09:00
## 15 C 984 1021 00:16:24 00:17:01
## 16 AB 159 182 00:02:39 00:03:02
## 17 AB 300 301 00:05:00 00:05:01
## 18 AB 361 420 00:06:01 00:07:00
## 19 AB 539 540 00:08:59 00:09:00
## 20 AB 4319 4741 01:11:59 01:19:01
(ce4.2 <- cm_code.exclude(ce4.1, list(CnoAB = qcv(C, AB))))
## code start end Start End
## 1 A 159 180 00:02:39 00:03:00
## 2 A 300 301 00:05:00 00:05:01
## 3 A 361 420 00:06:01 00:07:00
## 4 A 539 540 00:08:59 00:09:00
## 5 B 159 160 00:02:39 00:02:40
## 6 B 180 182 00:03:00 00:03:02
## 7 B 300 301 00:05:00 00:05:01
## 8 B 361 420 00:06:01 00:07:00
## 9 B 539 540 00:08:59 00:09:00
## 10 B 4319 4741 01:11:59 01:19:01
## 11 C 159 180 00:02:39 00:03:00
## 12 C 300 301 00:05:00 00:05:01
## 13 C 361 420 00:06:01 00:07:00
## 14 C 539 540 00:08:59 00:09:00
## 15 C 984 1021 00:16:24 00:17:01
## 16 AB 159 182 00:02:39 00:03:02
## 17 AB 300 301 00:05:00 00:05:01
## 18 AB 361 420 00:06:01 00:07:00
## 19 AB 539 540 00:08:59 00:09:00
## 20 AB 4319 4741 01:11:59 01:19:01
## 21 CnoAB 984 1021 00:16:24 00:17:01
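The "last term is excluded" rule can also be expressed with plain set operations on unit indices. The base R sketch below is an assumption-laden illustration, not the cm_code.exclude source: `units`, `to_spans` and `exclude_last` are hypothetical helpers, and the spans are the word-approach codes AA (0-10), BB (0-2, 2-10, 18-19) and CC (0-3, 4-6).

```r
# Units covered by spans; a span from 0 to 2 covers units 1 and 2
units <- function(start, end) {
  unlist(Map(function(s, e) (s + 1):e, start, end))
}

# Collapse a set of units back into contiguous start/end spans
to_spans <- function(u) {
  u <- sort(unique(u))
  if (length(u) == 0) return(data.frame(start = numeric(0), end = numeric(0)))
  run <- cumsum(c(1, diff(u) != 1))   # new run wherever units are not consecutive
  data.frame(start = as.numeric(tapply(u, run, min)) - 1,
             end   = as.numeric(tapply(u, run, max)))
}

# Combine every code but the last, then remove the units of the last code
exclude_last <- function(code_units) {
  keep <- Reduce(union, head(code_units, -1))
  setdiff(keep, code_units[[length(code_units)]])
}

AA <- units(0, 10)
BB <- units(c(0, 2, 18), c(2, 10, 19))
CC <- units(c(0, 4), c(3, 6))

ab_no_c <- to_spans(exclude_last(list(AA, BB, CC)))
ab_no_c
```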
cm_code.overlap provides all the spans (time/words) that are occupied by all of the given codes. For example, if we utilized cm_code.overlap on code lists X and Y the result would be any span where X and Y are both located. This is the AND of the Boolean search. Note that overlap.code.list must be supplied as a list of named character vectors.
♦ cm_code.overlap Single Time Word Example♦
(co1 <- cm_code.overlap(x, list(BC=qcv(BB, CC))))
## code start end
## 1 AA 0 10
## 2 BB 0 10
## 3 BB 18 19
## 4 CC 0 3
## 5 CC 4 6
## 6 BC 0 3
## 7 BC 4 6
♦ cm_code.overlap Repeated Measures Word Example♦
overlist <- list(AB=qcv(AA, BB), ABC=qcv(AA, BB, CC))
(co2 <- cm_code.overlap(z, overlist, rm.var = "time"))
## code start end time
## 1 AA 0 10 foo
## 2 BB 0 10 foo
## 3 BB 18 19 foo
## 4 CC 0 3 foo
## 5 CC 4 6 foo
## 6 AB 0 10 foo
## 7 ABC 0 3 foo
## 8 ABC 4 6 foo
## 9 AA 3 8 foo2
## 10 BB 0 4 foo2
## 11 BB 9 12 foo2
## 12 CC 0 1 foo2
## 13 CC 10 11 foo2
## 14 CC 14 20 foo2
## 15 AB 3 4 foo2
♦ cm_code.overlap Repeated Measures Time Span Example♦
overlist2 <- list(AB=qcv(A, B), BC=qcv(B, C), ABC=qcv(A, B, C))
(co3 <- cm_code.overlap(dats, overlist2, "time"))
## code start end Start End time
## 1 A 159 180 00:02:39 00:03:00 bar1
## 2 A 300 301 00:05:00 00:05:01 bar1
## 3 A 361 420 00:06:01 00:07:00 bar1
## 4 A 539 540 00:08:59 00:09:00 bar1
## 5 B 159 160 00:02:39 00:02:40 bar1
## 6 B 180 182 00:03:00 00:03:02 bar1
## 7 B 300 301 00:05:00 00:05:01 bar1
## 8 B 361 420 00:06:01 00:07:00 bar1
## 9 B 539 540 00:08:59 00:09:00 bar1
## 10 B 4319 4741 01:11:59 01:19:01 bar1
## 11 C 159 180 00:02:39 00:03:00 bar1
## 12 C 300 301 00:05:00 00:05:01 bar1
## 13 C 361 420 00:06:01 00:07:00 bar1
## 14 C 539 540 00:08:59 00:09:00 bar1
## 15 C 984 1021 00:16:24 00:17:01 bar1
## 16 AB 159 160 00:02:39 00:02:40 bar1
## 17 AB 300 301 00:05:00 00:05:01 bar1
## 18 AB 361 420 00:06:01 00:07:00 bar1
## 19 AB 539 540 00:08:59 00:09:00 bar1
## 20 BC 159 160 00:02:39 00:02:40 bar1
## 21 BC 300 301 00:05:00 00:05:01 bar1
## 22 BC 361 420 00:06:01 00:07:00 bar1
## 23 BC 539 540 00:08:59 00:09:00 bar1
## 24 ABC 159 160 00:02:39 00:02:40 bar1
## 25 ABC 300 301 00:05:00 00:05:01 bar1
## 26 ABC 361 420 00:06:01 00:07:00 bar1
## 27 ABC 539 540 00:08:59 00:09:00 bar1
## 28 A 159 180 00:02:39 00:03:00 bar2
## 29 A 300 301 00:05:00 00:05:01 bar2
## 30 A 361 420 00:06:01 00:07:00 bar2
## 31 A 539 540 00:08:59 00:09:00 bar2
## 32 B 159 160 00:02:39 00:02:40 bar2
## 33 B 180 182 00:03:00 00:03:02 bar2
## 34 B 300 301 00:05:00 00:05:01 bar2
## 35 B 361 420 00:06:01 00:07:00 bar2
## 36 B 539 540 00:08:59 00:09:00 bar2
## 37 B 4319 4741 01:11:59 01:19:01 bar2
## 38 C 159 180 00:02:39 00:03:00 bar2
## 39 C 300 301 00:05:00 00:05:01 bar2
## 40 C 361 420 00:06:01 00:07:00 bar2
## 41 C 539 540 00:08:59 00:09:00 bar2
## 42 C 1020 1021 00:17:00 00:17:01 bar2
## 43 AB 159 160 00:02:39 00:02:40 bar2
## 44 AB 300 301 00:05:00 00:05:01 bar2
## 45 AB 361 420 00:06:01 00:07:00 bar2
## 46 AB 539 540 00:08:59 00:09:00 bar2
## 47 BC 159 160 00:02:39 00:02:40 bar2
## 48 BC 300 301 00:05:00 00:05:01 bar2
## 49 BC 361 420 00:06:01 00:07:00 bar2
## 50 BC 539 540 00:08:59 00:09:00 bar2
## 51 ABC 159 160 00:02:39 00:02:40 bar2
## 52 ABC 300 301 00:05:00 00:05:01 bar2
## 53 ABC 361 420 00:06:01 00:07:00 bar2
## 54 ABC 539 540 00:08:59 00:09:00 bar2
cm_code.transform Examples
cm_code.transform is merely a wrapper for cm_code.combine, cm_code.exclude, and cm_code.overlap.
♦ cm_code.transform - Example 1♦
ct1 <- cm_code.transform(x,
overlap.code.list = list(oABC=qcv(AA, BB, CC)),
combine.code.list = list(ABC=qcv(AA, BB, CC)),
exclude.code.list = list(ABnoC=qcv(AA, BB, CC))
)
ct1
## code start end
## 1 AA 0 10
## 2 BB 0 10
## 3 BB 18 19
## 4 CC 0 3
## 5 CC 4 6
## 6 oABC 0 3
## 7 oABC 4 6
## 8 ABC 0 10
## 9 ABC 18 19
## 10 ABnoC 3 4
## 11 ABnoC 6 10
## 12 ABnoC 18 19
♦ cm_code.transform - Example 2♦
ct2 <-cm_code.transform(z,
overlap.code.list = list(oABC=qcv(AA, BB, CC)),
combine.code.list = list(ABC=qcv(AA, BB, CC)),
exclude.code.list = list(ABnoC=qcv(AA, BB, CC)), "time"
)
ct2
## code start end time
## 1 AA 0 10 foo
## 2 BB 0 10 foo
## 3 BB 18 19 foo
## 4 CC 0 3 foo
## 5 CC 4 6 foo
## 6 oABC 0 3 foo
## 7 oABC 4 6 foo
## 14 ABC 0 10 foo
## 15 ABC 18 19 foo
## 19 ABnoC 3 4 foo
## 20 ABnoC 6 10 foo
## 21 ABnoC 18 19 foo
## 8 AA 3 8 foo2
## 9 BB 0 4 foo2
## 10 BB 9 12 foo2
## 11 CC 0 1 foo2
## 12 CC 10 11 foo2
## 13 CC 14 20 foo2
## 16 ABC 0 8 foo2
## 17 ABC 9 12 foo2
## 18 ABC 14 20 foo2
## 22 ABnoC 1 8 foo2
## 23 ABnoC 9 10 foo2
## 24 ABnoC 11 12 foo2
♦ cm_code.transform - Example 3♦
ct3 <-cm_code.transform(dat,
overlap.code.list = list(oABC=qcv(A, B, C)),
combine.code.list = list(ABC=qcv(A, B, C)),
exclude.code.list = list(ABnoC=qcv(A, B, C))
)
ct3
## code start end Start End
## 1 A 159 180 00:02:39 00:03:00
## 2 A 300 301 00:05:00 00:05:01
## 3 A 361 420 00:06:01 00:07:00
## 4 A 539 540 00:08:59 00:09:00
## 5 B 159 160 00:02:39 00:02:40
## 6 B 180 182 00:03:00 00:03:02
## 7 B 300 301 00:05:00 00:05:01
## 8 B 361 420 00:06:01 00:07:00
## 9 B 539 540 00:08:59 00:09:00
## 10 B 4319 4741 01:11:59 01:19:01
## 11 C 159 180 00:02:39 00:03:00
## 12 C 300 301 00:05:00 00:05:01
## 13 C 361 420 00:06:01 00:07:00
## 14 C 539 540 00:08:59 00:09:00
## 15 C 984 1021 00:16:24 00:17:01
## 16 oABC 159 160 00:02:39 00:02:40
## 17 oABC 300 301 00:05:00 00:05:01
## 18 oABC 361 420 00:06:01 00:07:00
## 19 oABC 539 540 00:08:59 00:09:00
## 20 ABC 159 182 00:02:39 00:03:02
## 21 ABC 300 301 00:05:00 00:05:01
## 22 ABC 361 420 00:06:01 00:07:00
## 23 ABC 539 540 00:08:59 00:09:00
## 24 ABC 984 1021 00:16:24 00:17:01
## 25 ABC 4319 4741 01:11:59 01:19:01
## 26 ABnoC 180 182 00:03:00 00:03:02
## 27 ABnoC 4319 4741 01:11:59 01:19:01
cm_code.blank provides flexible Boolean comparisons between word or time spans. The overlap argument takes a logical value, an integer, or a character string consisting of a binary operator coupled with an integer. It is important to understand how the function operates. The initial step calls cm_long2dummy as seen below (stretching the spans to dummy coded columns), the comparison is then conducted between columns, and finally the columns are reverted back to spans via cm_dummy2long. This first example illustrates the stretching to dummy and reverting back to spans.
♦ Long to dummy and dummy to long ♦
long2dummy <- cm_long2dummy(x, "variable")
list(original =x,
long_2_dummy_format = long2dummy[[1]],
dummy_back_2_long = cm_dummy2long(long2dummy, "variable")
)
$original
code start end variable
1 AA 0 10 foo
2 BB 0 2 foo
3 BB 2 10 foo
4 BB 18 19 foo
5 CC 0 3 foo
6 CC 4 6 foo
$long_2_dummy_format
AA BB CC
0 1 1 1
1 1 1 1
2 1 1 1
3 1 1 0
4 1 1 1
5 1 1 1
6 1 1 0
7 1 1 0
8 1 1 0
9 1 1 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
18 0 1 0
19 0 0 0
$dummy_back_2_long
code start end variable
1 AA 0 10 foo
2 BB 0 10 foo
3 BB 18 19 foo
4 CC 0 3 foo
5 CC 4 6 foo
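The round trip above can be imitated in a few lines of base R. This is only a sketch of the idea, not qdap's implementation: span_to_dummy and dummy_to_span are hypothetical helper names, and a span from start to end is assumed to cover the units start, ..., end - 1 (which is how the dummy output above reads).

```r
## Spans taken from the `$original` output above
spans <- data.frame(
  code  = c("AA", "BB", "BB", "BB", "CC", "CC"),
  start = c(0, 0, 2, 18, 0, 4),
  end   = c(10, 2, 10, 19, 3, 6),
  stringsAsFactors = FALSE
)

## Hypothetical helper: one row per unit, one 0/1 column per code
span_to_dummy <- function(spans) {
  n_units <- max(spans$end)
  codes <- unique(spans$code)
  dummy <- matrix(0L, nrow = n_units, ncol = length(codes),
    dimnames = list(as.character(seq_len(n_units) - 1), codes))
  for (i in seq_len(nrow(spans))) {
    ## units start .. end - 1 live in rows start + 1 .. end
    rows <- (spans$start[i] + 1):spans$end[i]
    dummy[rows, spans$code[i]] <- 1L
  }
  dummy
}

## Hypothetical helper: collapse runs of 1s back into spans
dummy_to_span <- function(dummy) {
  out <- lapply(colnames(dummy), function(code) {
    r <- rle(unname(dummy[, code]))
    ends <- cumsum(r$lengths)
    starts <- ends - r$lengths
    keep <- r$values == 1
    data.frame(code = rep(code, sum(keep)), start = starts[keep],
      end = ends[keep], stringsAsFactors = FALSE)
  })
  do.call(rbind, out)
}

dummy <- span_to_dummy(spans)
back <- dummy_to_span(dummy)
back
```

Note that the adjacent BB spans (0 to 2 and 2 to 10) come back as a single span, just as in the dummy_back_2_long element above.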
Now let's examine a few uses of cm_code.blank
. The first is to set overlap = TRUE
(the default behavior). This default behavior is identical to cm_code.overlap
as seen below.
♦ cm_code.blank
- overlap = TRUE
♦
(cb1 <- cm_code.blank(x, list(ABC=qcv(AA, BB, CC))))
## code start end
## 1 AA 0 10
## 2 BB 0 10
## 3 BB 18 19
## 4 CC 0 3
## 5 CC 4 6
## 6 ABC 0 3
## 7 ABC 4 6
Next we'll set overlap = FALSE
and see that it is identical to cm_code.combine
.
♦ cm_code.blank
- overlap = FALSE
♦
(cb2 <- cm_code.blank(x, list(ABC=qcv(AA, BB, CC)), overlap = FALSE))
## code start end
## 1 AA 0 10
## 2 BB 0 10
## 3 BB 18 19
## 4 CC 0 3
## 5 CC 4 6
## 6 ABC 0 10
## 7 ABC 18 19
By first combining all codes (see cb2
above) and then excluding the final code by setting
overlap = 1
the behavior of cm_code.exclude
can be mimicked.
♦ cm_code.blank
- mimicking cm_code.exclude
♦
## Using the output from `cb2` above.
(cb3 <- cm_code.blank(cb2, list(ABnoC=qcv(ABC, CC)), overlap = 1))
## code start end
## 1 AA 0 10
## 2 BB 0 10
## 3 BB 18 19
## 4 CC 0 3
## 5 CC 4 6
## 6 ABC 0 10
## 7 ABC 18 19
## 8 ABnoC 3 4
## 9 ABnoC 6 10
## 10 ABnoC 18 19
Next we shall find when at least two codes overlap by setting overlap = ">1"
.
♦ cm_code.blank
- At least 2 codes overlap ♦
blanklist <- list(AB=qcv(AA, BB), ABC=qcv(AA, BB, CC))
(cb4 <- cm_code.blank(z, blanklist, rm.var = "time", overlap = ">1"))
## code start end time
## 1 AA 0 10 foo
## 2 BB 0 10 foo
## 3 BB 18 19 foo
## 4 CC 0 3 foo
## 5 CC 4 6 foo
## 6 AB 0 10 foo
## 7 ABC 0 10 foo
## 8 AA 3 8 foo2
## 9 BB 0 4 foo2
## 10 BB 9 12 foo2
## 11 CC 0 1 foo2
## 12 CC 10 11 foo2
## 13 CC 14 20 foo2
## 14 AB 3 4 foo2
## 15 ABC 0 1 foo2
## 16 ABC 3 4 foo2
## 17 ABC 10 11 foo2
Last, we will find spans where not one of the codes occurred by setting overlap = "==0"
.
♦ cm_code.blank
- Spans where no code occurs ♦
blanklist2 <- list(noAB=qcv(AA, BB), noABC=qcv(AA, BB, CC))
(cb5 <- cm_code.blank(z, blanklist2, rm.var = "time", overlap = "==0"))
## code start end time
## 1 AA 0 10 foo
## 2 BB 0 10 foo
## 3 BB 18 19 foo
## 4 CC 0 3 foo
## 5 CC 4 6 foo
## 6 noAB 10 18 foo
## 7 noAB 19 20 foo
## 8 noABC 10 18 foo
## 9 noABC 19 20 foo
## 10 AA 3 8 foo2
## 11 BB 0 4 foo2
## 12 BB 9 12 foo2
## 13 CC 0 1 foo2
## 14 CC 10 11 foo2
## 15 CC 14 20 foo2
## 16 noAB 8 9 foo2
## 17 noAB 12 21 foo2
## 18 noABC 8 9 foo2
## 19 noABC 12 14 foo2
## 20 noABC 20 21 foo2
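Once spans are in dummy form, each overlap setting amounts to a comparison on the number of codes active in a row. The base R sketch below illustrates that reading for the AA, BB and CC spans of x; it is an assumption about the logic, not qdap's code, and to_spans is a hypothetical helper. The dummy matrix here stops at the last coded unit, so the "no code" result will not include the trailing span that cm_code.blank reports.

```r
## Dummy columns for AA (0-10), BB (0-10, 18-19), CC (0-3, 4-6)
unit <- 0:18
AA <- as.integer(unit < 10)
BB <- as.integer(unit < 10 | unit == 18)
CC <- as.integer(unit %in% c(0:2, 4:5))
dummy <- cbind(AA, BB, CC)

## Hypothetical helper: runs of TRUE become (start, end) spans
to_spans <- function(flag) {
  r <- rle(flag)
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths
  data.frame(start = starts[r$values], end = ends[r$values])
}

n_active <- rowSums(dummy)

overlap_all  <- to_spans(n_active == 3)  # compare with overlap = TRUE
combine_any  <- to_spans(n_active > 0)   # compare with overlap = FALSE
at_least_two <- to_spans(n_active > 1)   # compare with overlap = ">1"
no_code      <- to_spans(n_active == 0)  # compare with overlap = "==0"

## AA or BB present while CC is absent (compare with ABnoC)
ab_no_c <- to_spans(rowSums(dummy[, c("AA", "BB")]) > 0 & dummy[, "CC"] == 0)
ab_no_c
```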
The cm_ family of functions has three approaches to initial analysis of codes. The researcher may want to summarize, visualize or determine the proximity of codes to one another. The following functions accomplish these tasks:
Most of the cm_ family of functions have a summary method that allows for summaries of codes by group. Note that these summaries can be wrapped with plot to print a heat map of the table of summaries.
♦ Example 1: Summarizing Transcript/List Approach ♦
## Two transcript lists
A <- list(
person_greg = qcv(terms='7:11, 20:24, 30:33, 49:56'),
person_researcher = qcv(terms='42:48'),
person_sally = qcv(terms='25:29, 37:41'),
person_sam = qcv(terms='1:6, 16:19, 34:36'),
person_teacher = qcv(terms='12:15'),
adult_0 = qcv(terms='1:11, 16:41, 49:56'),
adult_1 = qcv(terms='12:15, 42:48'),
AA = qcv(terms="1"),
BB = qcv(terms="1:2, 3:10, 19"),
CC = qcv(terms="1:9, 100:150")
)
B <- list(
person_greg = qcv(terms='7:11, 20:24, 30:33, 49:56'),
person_researcher = qcv(terms='42:48'),
person_sally = qcv(terms='25:29, 37:41'),
person_sam = qcv(terms='1:6, 16:19, 34:36'),
person_teacher = qcv(terms='12:15'),
adult_0 = qcv(terms='1:11, 16:41, 49:56'),
adult_1 = qcv(terms='12:15, 42:48'),
AA = qcv(terms="40"),
BB = qcv(terms="50:90"),
CC = qcv(terms="60:90, 100:120, 150"),
DD = qcv(terms="")
)
## Long format for transcript/list approach
v <- cm_2long(A, B, v.name = "time")
head(v)
## code start end time
## 1 person_greg 6 11 A
## 2 person_greg 19 24 A
## 3 person_greg 29 33 A
## 4 person_greg 48 56 A
## 5 person_researcher 41 48 A
## 6 person_sally 24 29 A
## Summary of the data and plotting the summary
summary(v)
time code total percent_total n percent_n ave min max mean(sd)
1 a person_greg 22 12.0% 4 18.2% 5.5 4 8 5.5(1.7)
2 a person_researcher 7 3.8% 1 4.5% 7.0 7 7 7.0(0)
3 a person_sally 10 5.4% 2 9.1% 5.0 5 5 5.0(0)
4 a person_sam 13 7.1% 3 13.6% 4.3 3 6 4.3(1.5)
5 a person_teacher 4 2.2% 1 4.5% 4.0 4 4 4.0(0)
6 a adult_0 45 24.5% 3 13.6% 15.0 8 26 15.0(9.6)
7 a adult_1 11 6.0% 2 9.1% 5.5 4 7 5.5(2.1)
8 a AA 1 .5% 1 4.5% 1.0 1 1 1.0(0)
9 a BB 11 6.0% 3 13.6% 3.7 1 8 3.7(3.8)
10 a CC 60 32.6% 2 9.1% 30.0 9 51 30.0(29.7)
11 b person_greg 22 10.6% 4 19.0% 5.5 4 8 5.5(1.7)
12 b person_researcher 7 3.4% 1 4.8% 7.0 7 7 7.0(0)
13 b person_sally 10 4.8% 2 9.5% 5.0 5 5 5.0(0)
14 b person_sam 13 6.3% 3 14.3% 4.3 3 6 4.3(1.5)
15 b person_teacher 4 1.9% 1 4.8% 4.0 4 4 4.0(0)
16 b adult_0 45 21.7% 3 14.3% 15.0 8 26 15.0(9.6)
17 b adult_1 11 5.3% 2 9.5% 5.5 4 7 5.5(2.1)
18 b AA 1 .5% 1 4.8% 1.0 1 1 1.0(0)
19 b BB 41 19.8% 1 4.8% 41.0 41 41 41.0(0)
20 b CC 53 25.6% 3 14.3% 17.7 1 31 17.7(15.3)
============================
Unit of measure: words
plot(summary(v))
plot(summary(v), facet.vars = "time")
♦ Example 2: Summarizing Time Spans Approach ♦
## Single time list
x <- list(
transcript_time_span = qcv(00:00 - 1:12:00),
A = qcv(terms = "2.40:3.00, 5.01, 6.02:7.00, 9.00"),
B = qcv(terms = "2.40, 3.01:3.02, 5.01, 6.02:7.00,
9.00, 1.12.00:1.19.01"),
C = qcv(terms = "2.40:3.00, 5.01, 6.02:7.00, 9.00, 17.01")
)
## Long format for time span approach
z <-cm_2long(x)
head(z)
## code start end Start End variable
## 1 A 159 180 00:02:39 00:03:00 x
## 2 A 300 301 00:05:00 00:05:01 x
## 3 A 361 420 00:06:01 00:07:00 x
## 4 A 539 540 00:08:59 00:09:00 x
## 5 B 159 160 00:02:39 00:02:40 x
## 6 B 180 182 00:03:00 00:03:02 x
## Summary of the data and plotting the summary
summary(z)
code total percent_total n percent_n ave min max mean(sd)
1 A 01:22 12.6% 4 26.7% 20.5 1 59 20.5(27.3)
2 B 08:06 74.7% 6 40.0% 81.0 1 422 81.0(168.6)
3 C 01:23 12.7% 5 33.3% 16.6 1 59 16.6(25.2)
============================
Unit of measure: time
Columns measured in seconds unless in the form hh:mm:ss
plot(summary(z))
♦ Trouble Shooting Summary: Suppress Measurement Units ♦
## suppress printing measurement units
suppressMessages(print(summary(z)))
code total percent_total n percent_n ave min max mean(sd)
1 A 01:22 12.6% 4 26.7% 20.5 1 59 20.5(27.3)
2 B 08:06 74.7% 6 40.0% 81.0 1 422 81.0(168.6)
3 C 01:23 12.7% 5 33.3% 16.6 1 59 16.6(25.2)
♦ Trouble Shooting Summary: Print as Dataframe ♦
## remove print method
class(z) <- "data.frame"
z
## code start end Start End variable
## 1 A 159 180 00:02:39 00:03:00 x
## 2 A 300 301 00:05:00 00:05:01 x
## 3 A 361 420 00:06:01 00:07:00 x
## 4 A 539 540 00:08:59 00:09:00 x
## 5 B 159 160 00:02:39 00:02:40 x
## 6 B 180 182 00:03:00 00:03:02 x
## 7 B 300 301 00:05:00 00:05:01 x
## 8 B 361 420 00:06:01 00:07:00 x
## 9 B 539 540 00:08:59 00:09:00 x
## 10 B 4319 4741 01:11:59 01:19:01 x
## 11 C 159 180 00:02:39 00:03:00 x
## 12 C 300 301 00:05:00 00:05:01 x
## 13 C 361 420 00:06:01 00:07:00 x
## 14 C 539 540 00:08:59 00:09:00 x
## 15 C 1020 1021 00:17:00 00:17:01 x
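Several of the summary columns (total, n, ave, min, max and the standard deviation) can be reproduced by hand from the long format, taking a span's length to be end minus start. The base R sketch below does this for the person_greg and person_researcher rows shown in head(v); it is an illustration under that assumption rather than the summary method itself, and stat is a hypothetical helper.

```r
## Rows copied from the head(v) output (time A)
spans <- data.frame(
  code  = c(rep("person_greg", 4), "person_researcher"),
  start = c(6, 19, 29, 48, 41),
  end   = c(11, 24, 33, 56, 48),
  stringsAsFactors = FALSE
)
spans$len <- spans$end - spans$start

## Hypothetical helper: apply a statistic to span lengths within each code
stat <- function(f) as.vector(tapply(spans$len, spans$code, f))

span_summary <- data.frame(
  code  = sort(unique(spans$code)),
  total = stat(sum),
  n     = stat(length),
  ave   = stat(mean),
  min   = stat(min),
  max   = stat(max),
  sd    = stat(sd)
)
span_summary
```

The person_greg row agrees with the printed summary (total 22, n 4, ave 5.5, sd about 1.7). For a code with a single span, base R's sd returns NA, whereas the printed summary shows 0.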
Like summary
, most of the cm_ family of functions have a plot
method as well that allows a Gantt plot visualization of codes by group.
♦ Gantt Plot of Transcript/List or Time Spans Data ♦
## Two transcript lists
A <- list(
person_greg = qcv(terms='7:11, 20:24, 30:33, 49:56'),
person_researcher = qcv(terms='42:48'),
person_sally = qcv(terms='25:29, 37:41'),
person_sam = qcv(terms='1:6, 16:19, 34:36'),
person_teacher = qcv(terms='12:15'),
adult_0 = qcv(terms='1:11, 16:41, 49:56'),
adult_1 = qcv(terms='12:15, 42:48'),
AA = qcv(terms="1"),
BB = qcv(terms="1:2, 3:10, 19"),
CC = qcv(terms="1:9, 100:150")
)
B <- list(
person_greg = qcv(terms='7:11, 20:24, 30:33, 49:56'),
person_researcher = qcv(terms='42:48'),
person_sally = qcv(terms='25:29, 37:41'),
person_sam = qcv(terms='1:6, 16:19, 34:36'),
person_teacher = qcv(terms='12:15'),
adult_0 = qcv(terms='1:11, 16:41, 49:56'),
adult_1 = qcv(terms='12:15, 42:48'),
AA = qcv(terms="40"),
BB = qcv(terms="50:90"),
CC = qcv(terms="60:90, 100:120, 150"),
DD = qcv(terms="")
)
## Long format
x <- cm_2long(A, v.name = "time")
y <- cm_2long(A, B, v.name = "time")
## cm_code family
combs <- list(sam_n_sally = qcv(person_sam, person_sally))
z <- cm_code.combine(y, combs, "time")
plot(x, title = "Single")
plot(y, title = "Repeated Measure")
plot(z, title = "Combined Codes")
Often a researcher will want to know which codes are clustering closer to other codes (regardless of whether the codes represent word or time spans). cm_distance allows the researcher to find the distances between codes and standardize the mean of the differences to allow for comparisons similar to a correlation. The matrix output from cm_distance is arrived at by taking the means and standard deviations of the differences between codes, scaling them (without centering), and then multiplying the two together. This results in a standardized distance measure that is non-negative, with values closer to zero indicating codes that are found in closer proximity.
The researcher may also access the means, standard deviations and number of codes by indexing the list output for each transcript. This distance measure complements the Gantt plot.
Note that the argument causal = FALSE (the default) does not assume Code A comes before Code B, whereas causal = TRUE assumes the first code precedes the second code. Generally, setting causal = FALSE will result in a larger mean of differences and accompanying standardized values. Also note that row names are the first code and column names are the second (comparison) code. The values for Code A compared to Code B will not be the same as Code B compared to Code A. This is because, unlike a true distance measure, the cm_distance matrix is asymmetrical. cm_distance computes the distance by taking each span (start and end) for Code A and comparing it to the nearest start or end for Code B. So, for example, there may be six Code A spans and thus six differences between A and B, whereas Code B may only have three spans and thus three differences between B and A. This fact alone will lead to differences in A compared to B versus B compared to A.
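The asymmetry is easy to see with a toy calculation. The base R sketch below is not the cm_distance algorithm; it simply measures, for each span of one code, the gap to the nearest span of the other code (zero if they touch or overlap). The helpers gap and nearest_gap and the spans themselves are made up for illustration.

```r
## Gap between span (s1, e1) and span(s) (s2, e2); 0 if they touch or overlap
gap <- function(s1, e1, s2, e2) pmax(0, pmax(s1, s2) - pmin(e1, e2))

## For every span in `from`, the gap to the nearest span in `to`
nearest_gap <- function(from, to) {
  vapply(seq_len(nrow(from)), function(i) {
    min(gap(from$start[i], from$end[i], to$start, to$end))
  }, numeric(1))
}

A <- data.frame(start = c(0, 20, 50), end = c(5, 25, 55))  # three spans
B <- data.frame(start = 6, end = 10)                       # one span

a_to_b <- nearest_gap(A, B)  # one difference per A span
b_to_a <- nearest_gap(B, A)  # one difference per B span

c(mean_A_to_B = mean(a_to_b), mean_B_to_A = mean(b_to_a))
```

Here A contributes three differences (1, 10 and 40) while B contributes only one (1), so the two means differ even though the same spans are being compared.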
♦ cm_distance
- Initial Data Setup ♦
x <- list(
transcript_time_span = qcv(00:00 - 1:12:00),
A = qcv(terms = "2.40:3.00, 6.32:7.00, 9.00,
10.00:11.00, 33.23:40.00, 59.56"),
B = qcv(terms = "3.01:3.02, 5.01, 19.00, 1.12.00:1.19.01"),
C = qcv(terms = "2.40:3.00, 5.01, 6.32:7.00, 9.00, 17.01, 38.09:40.00")
)
y <- list(
transcript_time_span = qcv(00:00 - 1:12:00),
A = qcv(terms = "2.40:3.00, 6.32:7.00, 9.00,
10.00:11.00, 23.44:25.00, 59.56"),
B = qcv(terms = "3.01:3.02, 5.01, 7.05:8.00 19.30, 1.12.00:1.19.01"),
C = qcv(terms = "2.40:3.00, 5.01, 6.32:7.30, 9.00, 17.01, 25.09:27.00")
)
## Long format
dat <- cm_2long(x, y)
♦ cm_distance
- Non-Causal Distance ♦
## a cm_distance output
(out1 <- cm_distance(dat, time.var = "variable"))
x
standardized:
A B C
A 0.00 1.04 0.82
B 0.88 0.00 3.89
C 0.09 0.95 0.00
y
standardized:
A B C
A 0.00 0.38 1.97
B 0.47 0.00 4.94
C 0.08 0.09 0.00
## The elements available from the output
names(out1)
[1] "x" "y"
## A list containing means, standard deviations and other
## descriptive statistics for the differences between codes
out1$x
$mean
A B C
A 0.00 367.67 208.67
B 322.50 0.00 509.00
C 74.67 265.00 0.00
$sd
A B C
A 0.00 347.51 483.27
B 337.47 0.00 940.94
C 143.77 440.92 0.00
$n
A B C
A 6 6 6
B 4 4 4
C 6 6 6
$combined
A B C
A n=6 367.67(347.51)n=6 208.67(483.27)n=6
B 322.5(337.47)n=4 n=4 509(940.94)n=4
C 74.67(143.77)n=6 265(440.92)n=6 n=6
$standardized
A B C
A 0.00 1.04 0.82
B 0.88 0.00 3.89
C 0.09 0.95 0.00
♦ cm_distance
- Causal Distance ♦
## a cm_distance output `causal = TRUE`
cm_distance(dat, time.var = "variable", causal = TRUE)
x
standardized:
A B C
A 0.66 0.84 0.08
B 0.29 3.96 0.49
C 0.40 0.86 0.37
y
standardized:
A B C
A 1.11 1.63 0.08
B 0.03 2.95 0.04
C 0.70 1.27 0.11
A researcher often needs to quickly gather frequency counts for various words/word types. qdap offers multiple functions designed to efficiently generate descriptive word statistics by any combination of grouping variables. Many of the functions also offer proportional usage to allow fairer comparison between groups. Many functions also have plotting methods to better visualize the transformed data.
Often a researcher may want to get a general sense of how words are functioning for different grouping variables. The word_stats function enables a quick picture of what is occurring within the data. The displayed (printed) output is a dataframe; however, the output from word_stats is actually a list. Use ?word_stats to learn more.
The displayed output is a wide dataframe, hence the abbreviated column names. The following column names and meanings will provide guidance in understanding the output:
word_stats
Column Names
♦ word_stats
Example ♦
Note that the initial output is broken into three dataframe outputs because of the width of printed output from word_stats
being so large. The user will see that these three dataframes are actually one wide dataframe in the R output.
(desc_wrds <- with(mraja1spl, word_stats(dialogue, person, tot = tot)))
person n.tot n.sent n.words n.char n.syl n.poly sptot wptot
1 Romeo 49 113 1163 4757 1441 48 2.3 23.7
2 Benvolio 34 51 621 2563 780 25 1.5 18.3
3 Nurse 20 59 599 2274 724 20 3.0 29.9
4 Sampson 20 28 259 912 294 7 1.4 12.9
5 Juliet 16 24 206 789 238 5 1.5 12.9
6 Gregory 15 20 149 553 166 1 1.3 9.9
7 Capulet 14 72 736 2900 902 35 5.1 52.6
8 Lady Capulet 12 27 288 1205 370 10 2.2 24.0
9 Mercutio 11 24 549 2355 704 29 2.2 49.9
10 Servant 10 19 184 725 226 5 1.9 18.4
11 Tybalt 8 17 160 660 207 9 2.1 20.0
12 Montague 6 13 217 919 284 13 2.2 36.2
13 Abraham 5 6 24 79 26 0 1.2 4.8
14 First Servant 3 7 69 294 87 2 2.3 23.0
15 Second Servant 3 4 41 160 49 0 1.3 13.7
16 Lady Montague 2 4 28 88 30 0 2.0 14.0
17 Paris 2 3 32 124 41 2 1.5 16.0
18 Second Capulet 2 2 17 64 21 0 1.0 8.5
19 Prince 1 9 167 780 228 17 9.0 167.0
20 First Citizen 1 5 16 79 22 3 5.0 16.0
person wps cps sps psps cpw spw pspw n.state n.quest n.exclm
1 Romeo 10.3 42.1 12.8 0.4 4.1 1.2 0.0 69 22 22
2 Benvolio 12.2 50.3 15.3 0.5 4.1 1.3 0.0 39 8 4
3 Nurse 10.2 38.5 12.3 0.3 3.8 1.2 0.0 37 9 13
4 Sampson 9.2 32.6 10.5 0.2 3.5 1.1 0.0 27 1 0
5 Juliet 8.6 32.9 9.9 0.2 3.8 1.2 0.0 16 5 3
6 Gregory 7.5 27.6 8.3 0.0 3.7 1.1 0.0 14 3 3
7 Capulet 10.2 40.3 12.5 0.5 3.9 1.2 0.0 40 10 22
8 Lady Capulet 10.7 44.6 13.7 0.4 4.2 1.3 0.0 20 6 1
9 Mercutio 22.9 98.1 29.3 1.2 4.3 1.3 0.1 20 2 2
10 Servant 9.7 38.2 11.9 0.3 3.9 1.2 0.0 14 2 3
11 Tybalt 9.4 38.8 12.2 0.5 4.1 1.3 0.1 13 2 2
12 Montague 16.7 70.7 21.8 1.0 4.2 1.3 0.1 11 2 0
13 Abraham 4.0 13.2 4.3 0.0 3.3 1.1 0.0 3 2 1
14 First Servant 9.9 42.0 12.4 0.3 4.3 1.3 0.0 3 2 2
15 Second Servant 10.2 40.0 12.2 0.0 3.9 1.2 0.0 4 0 0
16 Lady Montague 7.0 22.0 7.5 0.0 3.1 1.1 0.0 2 2 0
17 Paris 10.7 41.3 13.7 0.7 3.9 1.3 0.1 2 1 0
18 Second Capulet 8.5 32.0 10.5 0.0 3.8 1.2 0.0 2 0 0
19 Prince 18.6 86.7 25.3 1.9 4.7 1.4 0.1 7 1 1
20 First Citizen 3.2 15.8 4.4 0.6 4.9 1.4 0.2 0 0 5
person p.state p.quest p.exclm n.hapax n.dis grow.rate prop.dis
1 Romeo 0.6 0.2 0.2 365 84 0.3 0.1
2 Benvolio 0.8 0.2 0.1 252 43 0.4 0.1
3 Nurse 0.6 0.2 0.2 147 48 0.2 0.1
4 Sampson 1.0 0.0 0.0 81 22 0.3 0.1
5 Juliet 0.7 0.2 0.1 94 22 0.5 0.1
6 Gregory 0.7 0.2 0.2 72 17 0.5 0.1
7 Capulet 0.6 0.1 0.3 232 46 0.3 0.1
8 Lady Capulet 0.7 0.2 0.0 135 28 0.5 0.1
9 Mercutio 0.8 0.1 0.1 253 28 0.5 0.1
10 Servant 0.7 0.1 0.2 71 19 0.4 0.1
11 Tybalt 0.8 0.1 0.1 79 17 0.5 0.1
12 Montague 0.8 0.2 0.0 117 21 0.5 0.1
13 Abraham 0.5 0.3 0.2 3 7 0.1 0.3
14 First Servant 0.4 0.3 0.3 33 8 0.5 0.1
15 Second Servant 1.0 0.0 0.0 32 3 0.8 0.1
16 Lady Montague 0.5 0.5 0.0 24 2 0.9 0.1
17 Paris 0.7 0.3 0.0 25 2 0.8 0.1
18 Second Capulet 1.0 0.0 0.0 7 5 0.4 0.3
19 Prince 0.8 0.1 0.1 83 15 0.5 0.1
20 First Citizen 0.0 0.0 1.0 9 2 0.6 0.1
## The following shows all the available elements in the `word_stats` output
names(desc_wrds)
## [1] "ts" "gts" "mpun" "word.elem" "sent.elem" "omit"
## [7] "digits"
word_stats
has a plot method that plots the output as a heat map. This can be useful for finding high/low elements in the data set.
♦ word_stats
Plot ♦
plot(desc_wrds)
plot(desc_wrds, label=TRUE, lab.digits = 1)
It takes considerable time to run word_stats because it is calculating syllable counts. The user may re-use the object output from one run and pass this as the text variable (text.var) in a subsequent run with different grouping variables (grouping.vars) as long as the text variable has not changed. The example below demonstrates how to re-use the output from one word_stats run in another run.
♦ word_stats
Re-use ♦
with(mraja1spl, word_stats(desc_wrds, list(sex, fam.aff, died), tot = tot))
Many analyses with words involve a matrix based on the words. qdap uses a word frequency matrix (wfm) or the less malleable dataframe version, the word frequency dataframe (wfdf). The wfm is a count of word usages per grouping variable(s). This is a similar concept to the tm package's Term Document Matrix, though instead of documents we are interested in the grouping variable's usage of terms. wfm is the general function that should be used; however, the wfdf function does provide options for margin sums (row and column). Also note that wfm_expanded and wfm_combine can expand or combine terms within a word frequency matrix.
♦ wfm
Examples ♦
## By a single grouping variable
with(DATA, wfm(state, person))[1:15, ]
## greg researcher sally sam teacher
## about 0 0 1 0 0
## already 1 0 0 0 0
## am 1 0 0 0 0
## are 0 0 1 0 0
## be 0 0 1 0 0
## can 0 0 1 0 0
## certain 0 0 1 0 0
## computer 0 0 0 1 0
## distrust 0 0 0 1 0
## do 0 0 0 0 1
## dumb 1 0 0 0 0
## eat 1 0 0 0 0
## fun 0 0 0 2 0
## good 0 1 0 0 0
## how 0 0 1 0 0
## By two grouping variables
with(DATA, wfm(state, list(sex, adult)))[1:15, ]
## f.0 f.1 m.0 m.1
## about 1 0 0 0
## already 0 0 1 0
## am 0 0 1 0
## are 1 0 0 0
## be 1 0 0 0
## can 1 0 0 0
## certain 1 0 0 0
## computer 0 0 1 0
## distrust 0 0 1 0
## do 0 0 0 1
## dumb 0 0 1 0
## eat 0 0 1 0
## fun 0 0 2 0
## good 0 1 0 0
## how 1 0 0 0
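The counting behind a word frequency matrix can be sketched with base R alone. The snippet below is a simplified stand-in for wfm, not its implementation: the sentences and speaker names are made up, and the tokenizer (lower-casing and splitting on anything that is not a letter or apostrophe) is an assumption.

```r
## Toy text and grouping variable (hypothetical data)
txt <- c("Fun is fun", "I am here", "you are here")
grp <- c("sam", "greg", "sam")

## Split each sentence into lower-case words
words <- strsplit(tolower(txt), "[^a-z']+")

## One row per word, carrying the group of the sentence it came from
long <- data.frame(
  word  = unlist(words),
  group = rep(grp, lengths(words)),
  stringsAsFactors = FALSE
)

## Cross-tabulate: rows are words, columns are groups
wfm_sketch <- unclass(table(long$word, long$group))
wfm_sketch
```

Each cell is the number of times a group used a word, which is the same layout as the wfm output above.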
♦ wfm
: Keep Two Word Phrase as a Single Term ♦
## insert double tilde ("~~") to keep phrases (e.g., first last name)
space_keeps <- c(" fun", "I ")
state2 <- space_fill(DATA$state, space_keeps, rm.extra = FALSE)
with(DATA, wfm(state2, list(sex, adult)))[1:18, ]
## f.0 f.1 m.0 m.1
## about 1 0 0 0
## already 0 0 1 0
## are 1 0 0 0
## be 1 0 0 0
## can 1 0 0 0
## certain 1 0 0 0
## computer 0 0 1 0
## do 0 0 0 1
## dumb 0 0 1 0
## eat 0 0 1 0
## good 0 1 0 0
## how 1 0 0 0
## hungry 0 0 1 0
## i'm 0 0 1 0
## i am 0 0 1 0
## i distrust 0 0 1 0
## is 0 0 1 0
## is fun 0 0 1 0
At times it may be useful to view the correlation of word occurrences across turns of talk or other useful groupings. The user can utilize the output from wfm to accomplish this.
♦ wfm
: Word Correlations ♦
library(reports)
x <- factor(with(rajSPLIT, paste(act, pad(TOT(tot)), sep = "|")))
dat <- wfm(rajSPLIT$dialogue, x)
cor(t(dat)[, c("romeo", "juliet")])
## romeo juliet
## romeo 1.000 0.111
## juliet 0.111 1.000
cor(t(dat)[, c("romeo", "banished")])
## romeo banished
## romeo 1.000 0.343
## banished 0.343 1.000
cor(t(dat)[, c("romeo", "juliet", "hate", "love")])
## romeo juliet hate love
## romeo 1.00000 0.110981 -0.04456 0.208612
## juliet 0.11098 1.000000 -0.03815 0.005002
## hate -0.04456 -0.038149 1.00000 0.158720
## love 0.20861 0.005002 0.15872 1.000000
dat2 <- wfm(DATA$state, id(DATA))
qheat(cor(t(dat2)), low = "yellow", high = "red",
grid = "grey90", diag.na = TRUE, by.column = NULL)
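The transpose in the calls above matters because cor correlates columns, while a word frequency matrix stores words in rows. A small base R sketch with a made-up matrix (the counts and turn labels are hypothetical) shows the effect:

```r
## Rows are words, columns are turns of talk (made-up counts)
m <- matrix(
  c(1, 0, 2, 0,
    2, 0, 4, 0,
    0, 1, 0, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("romeo", "juliet", "hate"), paste0("turn", 1:4))
)

## Transposing puts words in columns, so cor() returns word-by-word correlations
cors <- cor(t(m))
round(cors, 2)
```

In this toy matrix "romeo" and "juliet" rise and fall together across turns, so their correlation is 1, while "hate" appears in the other turns and correlates negatively with both.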
♦ wfdf
Examples: Add Margins ♦
with(DATA, wfdf(state, person, margins = TRUE))[c(1:15, 41:42), ]
## Words greg researcher sally sam teacher TOTAL.USES
## 1 about 0 0 1 0 0 1
## 2 already 1 0 0 0 0 1
## 3 am 1 0 0 0 0 1
## 4 are 0 0 1 0 0 1
## 5 be 0 0 1 0 0 1
## 6 can 0 0 1 0 0 1
## 7 certain 0 0 1 0 0 1
## 8 computer 0 0 0 1 0 1
## 9 distrust 0 0 0 1 0 1
## 10 do 0 0 0 0 1 1
## 11 dumb 1 0 0 0 0 1
## 12 eat 1 0 0 0 0 1
## 13 fun 0 0 0 2 0 2
## 14 good 0 1 0 0 0 1
## 15 how 0 0 1 0 0 1
## 41 you 1 0 1 2 0 4
## 42 TOTAL.WORDS -> 20 6 10 13 4 53
with(DATA, wfdf(state, list(sex, adult), margins = TRUE))[c(1:15, 41:42), ]
## Words f.0 f.1 m.0 m.1 TOTAL.USES
## 1 about 1 0 0 0 1
## 2 already 0 0 1 0 1
## 3 am 0 0 1 0 1
## 4 are 1 0 0 0 1
## 5 be 1 0 0 0 1
## 6 can 1 0 0 0 1
## 7 certain 1 0 0 0 1
## 8 computer 0 0 1 0 1
## 9 distrust 0 0 1 0 1
## 10 do 0 0 0 1 1
## 11 dumb 0 0 1 0 1
## 12 eat 0 0 1 0 1
## 13 fun 0 0 2 0 2
## 14 good 0 1 0 0 1
## 15 how 1 0 0 0 1
## 41 you 1 0 3 0 4
## 42 TOTAL.WORDS -> 10 6 33 4 53
♦ wfm_expanded
: Expand the wfm ♦
## Start with a word frequency matrix
z <- wfm(DATA$state, DATA$person)
## Note a single `you`
z[30:41, ]
## greg researcher sally sam teacher
## stinks 0 0 0 1 0
## talking 0 0 1 0 0
## telling 1 0 0 0 0
## the 1 0 0 0 0
## then 0 1 0 0 0
## there 1 0 0 0 0
## too 0 0 0 1 0
## truth 1 0 0 0 0
## way 1 0 0 0 0
## we 0 1 1 0 1
## what 0 0 1 0 1
## you 1 0 1 2 0
## Note that there are two `you`s in the expanded version
wfm_expanded(z)[33:45, ]
## greg researcher sally sam teacher
## stinks 0 0 0 1 0
## talking 0 0 1 0 0
## telling 1 0 0 0 0
## the 1 0 0 0 0
## then 0 1 0 0 0
## there 1 0 0 0 0
## too 0 0 0 1 0
## truth 1 0 0 0 0
## way 1 0 0 0 0
## we 0 1 1 0 1
## what 0 0 1 0 1
## you 1 0 1 1 0
## you 0 0 0 1 0
♦ wfm_combine
: Combine Terms in the wfm ♦
## Start with a word frequency matrix
x <- wfm(DATA$state, DATA$person)
## The terms to combine
WL <- list(
random = c("the", "fun", "i"),
yous = c("you", "your", "you're")
)
## Combine the terms
(out <- wfm_combine(x, WL))
## greg researcher sally sam teacher
## random 2 0 0 3 0
## yous 1 0 1 2 0
## else.words 17 6 9 8 4
## Pass the combined version to Chi Squared Test
chisq.test(out)
##
## Pearson's Chi-squared test
##
## data: out
## X-squared = 7.661, df = 8, p-value = 0.4673
♦ wfm
: Correspondence Analysis Example ♦
library(ca)
## Grab Just the Candidates
dat <- pres_debates2012
dat <- dat[dat$person %in% qcv(ROMNEY, OBAMA), ]
## Stem the text
speech <- stemmer(dat$dialogue)
## With 25 words removed
mytable1 <- with(dat, wfm(speech, list(person, time), stopwords = Top25Words))
## CA
fit <- ca(mytable1)
summary(fit)
plot(fit)
plot3d.ca(fit, labels=1)
## With 200 words removed
mytable2 <- with(dat, wfm(speech, list(person, time), stopwords = Top200Words))
## CA
fit2 <- ca(mytable2)
summary(fit2)
plot(fit2)
plot3d.ca(fit2, labels=1)
Some packages that could further the analysis of qdap expect a Document Term or Term Document Matrix. qdap's wfm is similar to the tm package's TermDocumentMatrix and DocumentTermMatrix. qdap does not try to replicate the extensive work of the tm package; however, as.tdm and as.dtm do attempt to extend the work the researcher conducts in qdap so that it can be utilized in other R packages. For a vignette describing qdap-tm compatibility use browseVignettes(package = "qdap") or see http://cran.r-project.org/web/packages/qdap/vignettes/tm_package_compatibility.pdf.
♦ as.tdm
Use ♦
x <- wfm(DATA$state, DATA$person)
## Term Document Matrix
as.tdm(x)
## <<TermDocumentMatrix (terms: 41, documents: 5)>>
## Non-/sparse entries: 49/156
## Sparsity : 76%
## Maximal term length: 8
## Weighting : term frequency (tf)
## Document Term Matrix
as.dtm(x)
## <<DocumentTermMatrix (documents: 5, terms: 41)>>
## Non-/sparse entries: 49/156
## Sparsity : 76%
## Maximal term length: 8
## Weighting : term frequency (tf)
The termco family of functions are some of the most useful qdap functions for quantitative discourse analysis. termco searches for (and optionally groups) terms and outputs a raw count, percent, and combined (raw/percent) matrix of term counts by grouping variable. term_match, all_words, syn, exclude, and spaste are complementary functions that are useful in developing word lists to provide to the match.list.
The match.list acts to search for similarly grouped themes. For example, c(" read ", " reads", " reading", " reader") may be a search for words associated with reading. It is good practice to name the vectors of words that are stored in the match.list. This is the general form for how to set up a match.list:
themes <- list(
theme_1 = c(),
theme_2 = c(),
theme_n = c()
)
It is important to understand how the match.list is handled by termco. The match.list is (optionally) case and character sensitive. Spacing is an important way to grab specific words and requires careful thought. For example, using "read" will find the words "bread", "read", "reading", and "ready". If you want to search for just the word "read" and its related forms, supply a vector of c(" read ", " reads", " reading", " reader"). Notice the leading and trailing spaces. A space acts as a boundary, whereas starting/ending with a nonspace allows for greedy matching that will find words that contain this term. A leading space, a trailing space, or both may be used to control how termco searches for the supplied terms. So the reader may ask why not supply one string spaced as " read"? Keep in mind that termco would then also find the word "ready".
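The effect of leading and trailing spaces can be tried out with fixed-string matching in base R. The sketch below is not termco; count_fixed is a hypothetical helper and the two sentences are made up. The text is padded with a space on each side so that words at the start or end of a line still have a boundary.

```r
## Made-up text, lower-cased and padded with boundary spaces
text <- c("I read bread labels", "Get ready for reading")
padded <- paste0(" ", tolower(text), " ")

## Hypothetical helper: count literal (non-regex) occurrences per element
count_fixed <- function(term, x) {
  hits <- gregexpr(term, x, fixed = TRUE)
  vapply(hits, function(h) sum(h > 0), integer(1))
}

greedy   <- count_fixed("read", padded)    # also hits bread, ready, reading
bounded  <- count_fixed(" read ", padded)  # only the word "read"
leading  <- count_fixed(" read", padded)   # read, ready, reading, but not bread

rbind(greedy, bounded, leading)
```

The leading-space-only search drops "bread" but still counts "ready", which is why each wanted form is listed separately in the match.list.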
This section's examples will first view the complementary functions that augment the themes supplied to match.list, and then the main termco function will be explored.
term_match
looks through a text variable (usually the text found in the transcript) and finds/returns a vector of words containing a term(s).
♦ term_match
and exclude
Examples♦
term_match(text.var = DATA$state, terms = qcv(the, trust), return.list = FALSE)
## [1] "distrust" "the" "then" "there"
term_match(DATA$state, "i", FALSE)
## [1] "certain" "distrust" "i" "i'm" "is" "it"
## [7] "it's" "liar" "stinks" "talking" "telling"
exclude(term_match(DATA$state, "i", FALSE), talking, telling)
## [1] "certain" "distrust" "i" "i'm" "is" "it"
## [7] "it's" "liar" "stinks"
all_words is similar to term_match; however, the function looks at all the words found in a text variable (usually the transcript text) and returns words that begin with or contain the term(s). The output can be arranged alphabetically or by frequency. The output is a dataframe, which helps the researcher to make decisions with regard to frequency of word use.
♦ all_words
Examples♦
x1 <- all_words(raj$dialogue, begins.with="re")
head(x1, 10)
## WORD FREQ
## 1 re 2
## 2 reach 1
## 3 read 6
## 4 ready 5
## 5 rearward 1
## 6 reason 5
## 7 reason's 1
## 8 rebeck 1
## 9 rebellious 1
## 10 receipt 1
all_words(raj$dialogue, begins.with="q")
## WORD FREQ
## 1 qualities 1
## 2 quarrel 11
## 3 quarrelled 1
## 4 quarrelling 2
## 5 quarrels 1
## 6 quarter 1
## 7 queen 1
## 8 quench 2
## 9 question 2
## 10 quick 2
## 11 quickly 5
## 12 quiet 4
## 13 quinces 1
## 14 quit 1
## 15 quite 2
## 16 quivering 1
## 17 quivers 1
## 18 quote 1
## 19 quoth 5
all_words(raj$dialogue, contains="conc")
## WORD FREQ
## 1 conceal'd 1
## 2 conceit 2
## 3 conceive 1
## 4 concludes 1
## 5 reconcile 1
x2 <- all_words(raj$dialogue)
head(x2, 10)
## WORD FREQ
## 1 'tis 9
## 2 a 445
## 3 a' 1
## 4 abate 1
## 5 abbey 1
## 6 abed 1
## 7 abhorred 1
## 8 abhors 1
## 9 able 2
## 10 ableeding 1
x3 <- all_words(raj$dialogue, alphabetical = FALSE)
head(x3, 10)
## WORD FREQ
## 1 and 666
## 2 the 656
## 3 i 573
## 4 to 517
## 5 a 445
## 6 of 378
## 7 my 358
## 8 is 344
## 9 that 344
## 10 in 312
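The three behaviors shown above (begins with, contains, and ordering by frequency) amount to filtering and sorting a word frequency table. A base R sketch under that reading, using a made-up word vector rather than all_words itself:

```r
## Made-up words standing in for a tokenized transcript
words <- c("read", "ready", "bread", "reason", "read", "quiet")

## Alphabetical frequency table with the same column names as above
freq <- as.data.frame(table(WORD = words), responseName = "FREQ",
  stringsAsFactors = FALSE)

begins   <- freq[startsWith(freq$WORD, "re"), ]          # begins with "re"
contains <- freq[grepl("ea", freq$WORD, fixed = TRUE), ] # contains "ea"
by_freq  <- freq[order(-freq$FREQ), ]                    # most frequent first

begins
```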
The synonyms (short hand: syn) function finds words that are synonyms of a given set of terms and returns either a list or a vector that can be passed to termco's match.list.
♦ synonyms
Examples♦
synonyms(c("the", "cat", "job", "environment", "read", "teach"))
## $cat.def_1
## [1] "feline" "gib" "grimalkin" "kitty" "malkin"
##
## $cat.def_2
## [1] "moggy"
##
## $cat.def_3
## [1] "mouser" "puss"
##
## $cat.def_4
## [1] "pussy"
##
## $cat.def_5
## [1] "tabby"
##
## $job.def_1
## [1] "affair" "assignment" "charge" "chore"
## [5] "concern" "contribution" "duty" "enterprise"
## [9] "errand" "function" "pursuit" "responsibility"
## [13] "role" "stint" "task" "undertaking"
## [17] "venture" "work"
##
## $job.def_2
## [1] "business" "calling" "capacity" "career" "craft"
## [6] "employment" "function" "livelihood" "metier" "occupation"
## [11] "office" "position" "post" "profession" "situation"
## [16] "trade" "vocation"
##
## $job.def_3
## [1] "allotment" "assignment" "batch" "commission" "consignment"
## [6] "contract" "lot" "output" "piece" "portion"
## [11] "product" "share"
##
## $environment.def_1
## [1] "atmosphere" "background" "conditions" "context"
## [5] "domain" "element" "habitat" "locale"
## [9] "medium" "milieu" "scene" "setting"
## [13] "situation" "surroundings" "territory"
##
## $environment.def_2
## [1] "The environment is the natural world of land"
## [2] "sea"
## [3] "air"
## [4] "plants"
## [5] "and animals."
##
## $read.def_1
## [1] "glance at" "look at" "peruse"
## [4] "pore over" "refer to" "run one's eye over"
## [7] "scan" "study"
##
## $read.def_2
## [1] "announce" "declaim" "deliver" "recite" "speak" "utter"
##
## $read.def_3
## [1] "comprehend" "construe"
## [3] "decipher" "discover"
## [5] "interpret" "perceive the meaning of"
## [7] "see" "understand"
##
## $read.def_4
## [1] "display" "indicate" "record" "register" "show"
##
## $teach.def_1
## [1] "advise" "coach" "demonstrate"
## [4] "direct" "discipline" "drill"
## [7] "edify" "educate" "enlighten"
## [10] "give lessons in" "guide" "impart"
## [13] "implant" "inculcate" "inform"
## [16] "instil" "instruct" "school"
## [19] "show" "train" "tutor"
head(syn(c("the", "cat", "job", "environment", "read", "teach"),
return.list = FALSE), 30)
## [1] "feline" "gib" "grimalkin" "kitty"
## [5] "malkin" "moggy" "mouser" "puss"
## [9] "pussy" "tabby" "affair" "assignment"
## [13] "charge" "chore" "concern" "contribution"
## [17] "duty" "enterprise" "errand" "function"
## [21] "pursuit" "responsibility" "role" "stint"
## [25] "task" "undertaking" "venture" "work"
## [29] "business" "calling"
syn(c("the", "cat", "job", "environment", "read", "teach"), multiwords = FALSE)
## $cat.def_1
## [1] "feline" "gib" "grimalkin" "kitty" "malkin"
##
## $cat.def_2
## [1] "moggy"
##
## $cat.def_3
## [1] "mouser" "puss"
##
## $cat.def_4
## [1] "pussy"
##
## $cat.def_5
## [1] "tabby"
##
## $job.def_1
## [1] "affair" "assignment" "charge" "chore"
## [5] "concern" "contribution" "duty" "enterprise"
## [9] "errand" "function" "pursuit" "responsibility"
## [13] "role" "stint" "task" "undertaking"
## [17] "venture" "work"
##
## $job.def_2
## [1] "business" "calling" "capacity" "career" "craft"
## [6] "employment" "function" "livelihood" "metier" "occupation"
## [11] "office" "position" "post" "profession" "situation"
## [16] "trade" "vocation"
##
## $job.def_3
## [1] "allotment" "assignment" "batch" "commission" "consignment"
## [6] "contract" "lot" "output" "piece" "portion"
## [11] "product" "share"
##
## $environment.def_1
## [1] "atmosphere" "background" "conditions" "context"
## [5] "domain" "element" "habitat" "locale"
## [9] "medium" "milieu" "scene" "setting"
## [13] "situation" "surroundings" "territory"
##
## $environment.def_2
## [1] "sea" "air" "plants"
##
## $read.def_1
## [1] "peruse" "scan" "study"
##
## $read.def_2
## [1] "announce" "declaim" "deliver" "recite" "speak" "utter"
##
## $read.def_3
## [1] "comprehend" "construe" "decipher" "discover" "interpret"
## [6] "see" "understand"
##
## $read.def_4
## [1] "display" "indicate" "record" "register" "show"
##
## $teach.def_1
## [1] "advise" "coach" "demonstrate" "direct" "discipline"
## [6] "drill" "edify" "educate" "enlighten" "guide"
## [11] "impart" "implant" "inculcate" "inform" "instil"
## [16] "instruct" "school" "show" "train" "tutor"
♦ termco - Simple Example♦
## Make a small dialogue data set
(dat2 <- data.frame(dialogue=c("@bryan is bryan good @br",
"indeed", "@ brian"), person=qcv(A, B, A)))
## dialogue person
## 1 @bryan is bryan good @br A
## 2 indeed B
## 3 @ brian A
## The word list to search for
ml <- list(
wrds=c("bryan", "indeed"),
"@",
bryan=c("bryan", "@ br", "@br")
)
## Search by person
with(dat2, termco(dialogue, person, match.list=ml))
## person word.count wrds @ bryan
## 1 A 6 2(33.33%) 3(50.00%) 5(83.33%)
## 2 B 1 1(100.00%) 0 0
## Search by person proportion output
with(dat2, termco(dialogue, person, match.list=ml, percent = FALSE))
## person word.count wrds @ bryan
## 1 A 6 2(.33) 3(.50) 5(.83)
## 2 B 1 1(1.00) 0 0
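The raw counts in the simple example can be approximated in base R. The sketch below is not termco's implementation: it assumes each term is treated as a fixed, case-insensitive substring and that hits are summed within each element of the match list. The helper name `count_terms` is hypothetical, and the unnamed "@" element is given an explicit name here.

```r
# Same toy data as the simple example above
dialogue <- c("@bryan is bryan good @br", "indeed", "@ brian")
person <- c("A", "B", "A")
ml <- list(wrds = c("bryan", "indeed"), `@` = "@",
           bryan = c("bryan", "@ br", "@br"))

# Hypothetical helper: total number of fixed-substring hits for a set of terms
count_terms <- function(text, terms) {
  sum(vapply(terms, function(term) {
    hits <- gregexpr(term, tolower(text), fixed = TRUE)
    # gregexpr() returns -1 for "no match", so count only positive positions
    sum(vapply(hits, function(h) sum(h > 0), integer(1)))
  }, integer(1)))
}

# One row per person, one column per match-list element
groups <- split(dialogue, person)
raw_counts <- sapply(ml, function(terms) {
  vapply(groups, count_terms, integer(1), terms = terms)
})
raw_counts
```

Under these assumptions the matrix agrees with the raw counts printed above (2, 3 and 5 for person A; 1, 0 and 0 for person B). Overlapping terms such as "bryan" and "@br" are each counted, which is why the bryan column exceeds the number of words containing "bryan".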
♦ termco - Romeo and Juliet Act 1 Example♦
## Word list to search for
## Note: In the last vector using "the" will actually
## include the other 3 versions
ml2 <- list(
theme_1 = c(" the ", " a ", " an "),
theme_2 = c(" I'" ),
"good",
the_words = c("the", " the ", " the", "the ")
)
(out <- with(raj.act.1, termco(dialogue, person, ml2)))
## person word.count theme_1 theme_2 good the_words
## 1 Abraham 24 0 0 0 0
## 2 Benvolio 621 32(5.15%) 2(.32%) 2(.32%) 123(19.81%)
## 3 Capulet 736 39(5.30%) 3(.41%) 3(.41%) 93(12.64%)
## 4 First Citizen 16 2(12.50%) 0 0 10(62.50%)
## 5 First Servant 69 8(11.59%) 0 1(1.45%) 20(28.99%)
## 6 Gregory 149 9(6.04%) 0 0 48(32.21%)
## 7 Juliet 206 5(2.43%) 1(.49%) 1(.49%) 20(9.71%)
## 8 Lady Capulet 286 20(6.99%) 0 0 63(22.03%)
## 9 Lady Montague 28 2(7.14%) 0 0 0
## 10 Mercutio 552 49(8.88%) 0 2(.36%) 146(26.45%)
## 11 Montague 217 12(5.53%) 0 1(.46%) 41(18.89%)
## 12 Nurse 598 44(7.36%) 1(.17%) 2(.33%) 103(17.22%)
## 13 Paris 32 0 0 0 1(3.12%)
## 14 Prince 167 8(4.79%) 0 0 35(20.96%)
## 15 Romeo 1164 56(4.81%) 3(.26%) 3(.26%) 142(12.20%)
## 16 Sampson 259 19(7.34%) 0 1(.39%) 70(27.03%)
## 17 Second Capulet 17 0 0 0 0
## 18 Second Servant 41 2(4.88%) 0 1(2.44%) 8(19.51%)
## 19 Servant 183 12(6.56%) 1(.55%) 1(.55%) 46(25.14%)
## 20 Tybalt 160 11(6.88%) 1(.62%) 0 24(15.00%)
## Available elements in the termco output (use out$...)
names(out)
## [1] "raw" "prop" "rnp" "zero.replace"
## [5] "percent" "digits"
## Raw and proportion - useful for presenting in tables
out$rnp
## person word.count theme_1 theme_2 good the_words
## 1 Abraham 24 0 0 0 0
## 2 Benvolio 621 32(5.15%) 2(.32%) 2(.32%) 123(19.81%)
## 3 Capulet 736 39(5.30%) 3(.41%) 3(.41%) 93(12.64%)
## 4 First Citizen 16 2(12.50%) 0 0 10(62.50%)
## 5 First Servant 69 8(11.59%) 0 1(1.45%) 20(28.99%)
## 6 Gregory 149 9(6.04%) 0 0 48(32.21%)
## 7 Juliet 206 5(2.43%) 1(.49%) 1(.49%) 20(9.71%)
## 8 Lady Capulet 286 20(6.99%) 0 0 63(22.03%)
## 9 Lady Montague 28 2(7.14%) 0 0 0
## 10 Mercutio 552 49(8.88%) 0 2(.36%) 146(26.45%)
## 11 Montague 217 12(5.53%) 0 1(.46%) 41(18.89%)
## 12 Nurse 598 44(7.36%) 1(.17%) 2(.33%) 103(17.22%)
## 13 Paris 32 0 0 0 1(3.12%)
## 14 Prince 167 8(4.79%) 0 0 35(20.96%)
## 15 Romeo 1164 56(4.81%) 3(.26%) 3(.26%) 142(12.20%)
## 16 Sampson 259 19(7.34%) 0 1(.39%) 70(27.03%)
## 17 Second Capulet 17 0 0 0 0
## 18 Second Servant 41 2(4.88%) 0 1(2.44%) 8(19.51%)
## 19 Servant 183 12(6.56%) 1(.55%) 1(.55%) 46(25.14%)
## 20 Tybalt 160 11(6.88%) 1(.62%) 0 24(15.00%)
## Raw - useful for performing calculations
out$raw
## person word.count theme_1 theme_2 good the_words
## 1 Abraham 24 0 0 0 0
## 2 Benvolio 621 32 2 2 123
## 3 Capulet 736 39 3 3 93
## 4 First Citizen 16 2 0 0 10
## 5 First Servant 69 8 0 1 20
## 6 Gregory 149 9 0 0 48
## 7 Juliet 206 5 1 1 20
## 8 Lady Capulet 286 20 0 0 63
## 9 Lady Montague 28 2 0 0 0
## 10 Mercutio 552 49 0 2 146
## 11 Montague 217 12 0 1 41
## 12 Nurse 598 44 1 2 103
## 13 Paris 32 0 0 0 1
## 14 Prince 167 8 0 0 35
## 15 Romeo 1164 56 3 3 142
## 16 Sampson 259 19 0 1 70
## 17 Second Capulet 17 0 0 0 0
## 18 Second Servant 41 2 0 1 8
## 19 Servant 183 12 1 1 46
## 20 Tybalt 160 11 1 0 24
## Proportion - useful for performing calculations
out$prop
## person word.count theme_1 theme_2 good the_words
## 1 Abraham 24 0.000 0.0000 0.0000 0.000
## 2 Benvolio 621 5.153 0.3221 0.3221 19.807
## 3 Capulet 736 5.299 0.4076 0.4076 12.636
## 4 First Citizen 16 12.500 0.0000 0.0000 62.500
## 5 First Servant 69 11.594 0.0000 1.4493 28.986
## 6 Gregory 149 6.040 0.0000 0.0000 32.215
## 7 Juliet 206 2.427 0.4854 0.4854 9.709
## 8 Lady Capulet 286 6.993 0.0000 0.0000 22.028
## 9 Lady Montague 28 7.143 0.0000 0.0000 0.000
## 10 Mercutio 552 8.877 0.0000 0.3623 26.449
## 11 Montague 217 5.530 0.0000 0.4608 18.894
## 12 Nurse 598 7.358 0.1672 0.3344 17.224
## 13 Paris 32 0.000 0.0000 0.0000 3.125
## 14 Prince 167 4.790 0.0000 0.0000 20.958
## 15 Romeo 1164 4.811 0.2577 0.2577 12.199
## 16 Sampson 259 7.336 0.0000 0.3861 27.027
## 17 Second Capulet 17 0.000 0.0000 0.0000 0.000
## 18 Second Servant 41 4.878 0.0000 2.4390 19.512
## 19 Servant 183 6.557 0.5464 0.5464 25.137
## 20 Tybalt 160 6.875 0.6250 0.0000 15.000
♦ Using termco with term_match and exclude♦
## Example 1
termco(DATA$state, DATA$person, exclude(term_match(DATA$state, qcv(th),
FALSE), "truth"))
## person word.count the then there
## 1 greg 20 2(10.00%) 0 1(5.00%)
## 2 researcher 6 1(16.67%) 1(16.67%) 0
## 3 sally 10 0 0 0
## 4 sam 13 0 0 0
## 5 teacher 4 0 0 0
## Example 2
MTCH.LST <- exclude(term_match(DATA$state, qcv(th, i)), qcv(truth, stinks))
termco(DATA$state, DATA$person, MTCH.LST)
## person word.count th i
## 1 greg 20 3(15.00%) 13(65.00%)
## 2 researcher 6 2(33.33%) 0
## 3 sally 10 0 4(40.00%)
## 4 sam 13 0 11(84.62%)
## 5 teacher 4 0 0
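The idea behind Example 1 (expand a partial term into every word that contains it, then drop unwanted matches) can be sketched in base R. This is an illustration only and does not use term_match or exclude; the text vector is copied from DATA$state, and the tokenisation rule (lower-case, strip everything except letters, apostrophes and spaces) is an assumption.

```r
# DATA$state, reproduced so the sketch is self-contained
state <- c("Computer is fun. Not too fun.", "No it's not, it's dumb.",
           "What should we do?", "You liar, it stinks!",
           "I am telling the truth!", "How can we be certain?",
           "There is no way.", "I distrust you.",
           "What are you talking about?", "Shall we move on? Good then.",
           "I'm hungry. Let's eat. You already?")

# Unique lower-case words with punctuation (other than apostrophes) removed
words <- unique(unlist(strsplit(gsub("[^a-z' ]", "", tolower(state)), " +")))

# Words containing the partial term "th"
th_words <- sort(grep("th", words, value = TRUE, fixed = TRUE))

# Drop the word we do not want to search for
th_terms <- setdiff(th_words, "truth")
th_terms
```

The remaining terms ("the", "then", "there") are the column names that appear in the Example 1 output.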
syns <- synonyms("doubt")
syns[1]
## $doubt.def_1
## [1] "discredit" "distrust" "fear"
## [4] "lack confidence in" "misgive" "mistrust"
## [7] "query" "question" "suspect"
termco(DATA$state, DATA$person, unlist(syns[1]))
## person word.count discredit distrust fear query question
## 1 greg 20 0 0 0 0 0
## 2 researcher 6 0 0 0 0 0
## 3 sally 10 0 0 0 0 0
## 4 sam 13 0 1(7.69%) 0 0 0
## 5 teacher 4 0 0 0 0 0
synonyms("doubt", FALSE)
## [1] "discredit" "distrust" "fear"
## [4] "lack confidence in" "misgive" "mistrust"
## [7] "query" "question" "suspect"
## [10] "apprehension" "disquiet" "incredulity"
## [13] "lack of faith" "misgiving" "qualm"
## [16] "scepticism" "suspicion" "be dubious"
## [19] "be uncertain" "demur" "fluctuate"
## [22] "hesitate" "scruple" "vacillate"
## [25] "waver" "dubiety" "hesitancy"
## [28] "hesitation" "indecision" "irresolution"
## [31] "lack of conviction" "suspense" "uncertainty"
## [34] "vacillation" "confusion" "difficulty"
## [37] "dilemma" "perplexity" "problem"
## [40] "quandary" "admittedly" "assuredly"
## [43] "certainly" "doubtless" "doubtlessly"
## [46] "probably" "surely"
termco(DATA$state, DATA$person, list(doubt = synonyms("doubt", FALSE)))
## person word.count doubt
## 1 greg 20 0
## 2 researcher 6 0
## 3 sally 10 0
## 4 sam 13 1(7.69%)
## 5 teacher 4 0
termco(DATA$state, DATA$person, syns)
## person word.count doubt.def_1 doubt.def_2 doubt.def_5 doubt.def_6
## 1 greg 20 0 0 0 0
## 2 researcher 6 0 0 0 0
## 3 sally 10 0 0 0 0
## 4 sam 13 1(7.69%) 1(7.69%) 0 0
## 5 teacher 4 0 0 0 0
termco also has a plot method that plots a heat map of the termco output based on the percent usage by grouping variable. This allows rapid visualization of patterns and fast spotting of extreme values. Here are some plots from the Romeo and Juliet Act 1 example above.
♦ Using termco Plotting♦
plot(out)
plot(out, label = TRUE)
A researcher may be interested in classifying and investigating the types of questions used within dialogue. question_type provides question classification. The algorithm searches for the following interrogative words (and, optionally, their negative contraction forms as well):
are* had* must* what why
can* has ok when will*
correct have* right where would*
could how shall which implied do/does/did
did* is should who
do* may was* whom
does* might* were* whose
The interrogative word that is found first in the question (with the exception of “ok”, “right” and “correct”) determines the sentence type. A sentence is typed as “ok”, “right” or “correct” only if it is a question containing no other interrogative words and “ok”, “right” or “correct” is its last word. Interrogative sentences beginning with the word “you”, “wanna”, or “want” are categorized as the implied do/does/did question type, though the use of do/does/did is not explicit. Sentences beginning with “you” followed by a select interrogative word above (marked with *) and/or its negative counterpart, or by 1-2 amplifier(s) and then the select interrogative word, are categorized by the select word rather than as an implied do/does/did question type. A sentence that is marked “ok” overrides an implied do/does/did label. Sentences whose type cannot be determined are labeled unknown.
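The classification rules described above can be sketched as a small base R function. This is a simplified illustration, not question_type's implementation: it ignores negative contractions, amplifiers and the “ok” override, and the names `classify_question`, `interrogatives`, `starred` and `tag_words` are hypothetical.

```r
# Interrogative words from the table above; `starred` holds those marked with *
interrogatives <- c("are", "can", "could", "did", "do", "does", "had", "has",
                    "have", "how", "is", "may", "might", "must", "shall",
                    "should", "was", "were", "what", "when", "where", "which",
                    "who", "whom", "whose", "why", "will", "would")
starred <- c("are", "can", "did", "do", "does", "had", "have", "might",
             "must", "was", "were", "will", "would")
tag_words <- c("ok", "right", "correct")

classify_question <- function(sentence) {
  words <- strsplit(gsub("[^a-z' ]", "", tolower(sentence)), " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return("unknown")
  # Sentences opening with you/wanna/want imply do/does/did, unless "you"
  # style openers are directly followed by a starred interrogative word
  if (words[1] %in% c("you", "wanna", "want")) {
    if (length(words) > 1 && words[2] %in% starred) return(words[2])
    return("implied_do/does/did")
  }
  # Otherwise the first interrogative word found decides the type
  found <- words[words %in% interrogatives]
  if (length(found) > 0) return(found[1])
  # ok/right/correct only count when they end the sentence and nothing else hit
  last <- words[length(words)]
  if (last %in% tag_words) return(last)
  "unknown"
}

questions <- c("What should we do?", "How can we be certain?",
               "What are you talking about?", "Shall we move on?",
               "You already?")
types <- vapply(questions, classify_question, character(1), USE.NAMES = FALSE)
types
```

For the five questions in DATA.SPLIT this simplified version assigns what, how, what, shall and implied_do/does/did, in that order.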
♦ question_type - Basic Example♦
## Basic Example
(x <- question_type(DATA.SPLIT$state, DATA.SPLIT$person))
## person tot.quest what how shall implied_do/does/did
## 1 greg 1 0 0 0 1(100%)
## 2 researcher 1 0 0 1(100%) 0
## 3 sally 2 1(50%) 1(50%) 0 0
## 4 teacher 1 1(100%) 0 0 0
## 5 sam 0 0 0 0 0
## Available elements from output
names(x)
## [1] "raw" "count" "prop" "rnp"
## [5] "inds" "missing" "percent" "zero.replace"
## [9] "digits"
## Table of counts useful for additional analysis
x$count
## person tot.quest what how shall implied_do/does/did
## 1 greg 1 0 0 0 1
## 2 researcher 1 0 0 1 0
## 3 sally 2 1 1 0 0
## 4 teacher 1 1 0 0 0
## 5 sam 0 0 0 0 0
## The raw output with question types
truncdf(x$raw, 15)
## person raw.text n.row endmark strip.text q.type
## 1 teacher What should we 4 ? what should we what
## 2 sally How can we be c 7 ? how can we be how
## 3 sally What are you ta 10 ? what are you t what
## 4 researcher Shall we move o 11 ? shall we move shall
## 5 greg You already? 15 ? you already implied_do/does
question_type also has a plot method that plots a heat map of the output. This allows rapid visualization of patterns and fast spotting of extreme values.
♦ question_type - Plotting Method♦
plot(x)
plot(x, label = TRUE, high = "red", low = "yellow", grid = NULL)
Negative forms of questions such as "Don't you want the robots to leave?" are, by default, grouped with their equivalent positive Do forms, such as "Do you want the robots to leave?". The researcher may choose to keep the two forms separate using the argument neg.cont = TRUE.
♦ question_type - Include Negative Questions♦
## Create a Dataframe with Do and Don't
(DATA.SPLIT2 <- rbind(DATA.SPLIT,
c("sam", "1.1", "1", "m", "0", "K1", "Don't you think so?", "x"),
c("sam", "1.1", "1", "m", "0", "K1", "Do you think so?", "x")
))[, c(1, 7)]
## person state
## 1 sam Computer is fun.
## 2 sam Not too fun.
## 3 greg No it's not, it's dumb.
## 4 teacher What should we do?
## 5 sam You liar, it stinks!
## 6 greg I am telling the truth!
## 7 sally How can we be certain?
## 8 greg There is no way.
## 9 sam I distrust you.
## 10 sally What are you talking about?
## 11 researcher Shall we move on?
## 12 researcher Good then.
## 13 greg I'm hungry.
## 14 greg Let's eat.
## 15 greg You already?
## 16 sam Don't you think so?
## 17 sam Do you think so?
## Do and Don't Grouped Together
question_type(DATA.SPLIT2$state, DATA.SPLIT2$person)
## person tot.quest what do how shall implied_do/does/did
## 1 greg 1 0 0 0 0 1(100%)
## 2 researcher 1 0 0 0 1(100%) 0
## 3 sally 2 1(50%) 0 1(50%) 0 0
## 4 sam 2 0 2(100%) 0 0 0
## 5 teacher 1 1(100%) 0 0 0 0
## Do and Don't Grouped Separately
question_type(DATA.SPLIT2$state, DATA.SPLIT2$person, neg.cont = TRUE)
person tot.quest what don't do how shall implied_do/does/did
1 greg 1 0 0 0 0 0 1(100%)
2 researcher 1 0 0 0 0 1(100%) 0
3 sally 2 1(50%) 0 0 1(50%) 0 0
4 sam 2 0 1(50%) 1(50%) 0 0 0
5 teacher 1 1(100%) 0 0 0 0 0
It may be helpful to access the indices of the question types in the x[["inds"]] output, or to access x[["raw"]][, "n.row"], for use with the trans_context function as seen below.
♦ question_type - Passing to trans_context♦
## The indices of all questions
x <- question_type(DATA.SPLIT$state, DATA.SPLIT$person)
(inds1 <- x[["inds"]])
## [1] 4 7 10 11 15
with(DATA.SPLIT, trans_context(state, person, inds = inds1, n.before = 2))
===================================
Event 1: [lines 2-6]
sam: Computer is fun. Not too fun.
greg: No it's not, it's dumb.
** teacher: What should we do?
sam: You liar, it stinks!
greg: I am telling the truth!
===================================
Event 2: [lines 5-9]
sam: You liar, it stinks!
greg: I am telling the truth!
** sally: How can we be certain?
greg: There is no way.
sam: I distrust you.
===================================
Event 3: [lines 8-12]
greg: There is no way.
sam: I distrust you.
** sally: What are you talking about?
researcher: Shall we move on? Good then.
greg: I'm hungry. Let's eat. You already?
===================================
Event 4: [lines 9-13]
sam: I distrust you.
sally: What are you talking about?
** researcher: Shall we move on? Good then.
greg: I'm hungry. Let's eat. You already?
===================================
Event 5: [lines 13-15]
sally: What are you talking about?
researcher: Shall we move on? Good then.
** greg: I'm hungry. Let's eat. You already?
## Find what and how questions
inds2 <- x[["raw"]][x[["raw"]]$q.type %in% c("what", "how"), "n.row"]
with(DATA.SPLIT, trans_context(state, person, inds = inds2, n.before = 2))
===================================
Event 1: [lines 2-6]
sam: Computer is fun. Not too fun.
greg: No it's not, it's dumb.
** teacher: What should we do?
sam: You liar, it stinks!
greg: I am telling the truth!
===================================
Event 2: [lines 5-9]
sam: You liar, it stinks!
greg: I am telling the truth!
** sally: How can we be certain?
greg: There is no way.
sam: I distrust you.
===================================
Event 3: [lines 8-12]
greg: There is no way.
sam: I distrust you.
** sally: What are you talking about?
researcher: Shall we move on? Good then.
greg: I'm hungry. Let's eat. You already?
A researcher may need to view simple word or character counts for the sake of comparisons between grouping variables. word_count (wc), word_list, character_count, and character_table (char_table) serve the purposes of counting words and characters, with word_list producing lists of word usage by grouping variable and character_table producing a count table of characters. The following examples demonstrate the uses of these functions.
♦ word_count Examples♦
word_count(DATA$state)
## [1] 6 5 4 4 5 5 4 3 5 6 6
## `wc` is a shortened version of `word_count`
wc(DATA$state)
## [1] 6 5 4 4 5 5 4 3 5 6 6
## Retain the text
wc(DATA$state, names = TRUE)
## Computer is fun. Not too fun.
## 6
## No it's not, it's dumb.
## 5
## What should we do?
## 4
## You liar, it stinks!
## 4
## I am telling the truth!
## 5
## How can we be certain?
## 5
## There is no way.
## 4
## I distrust you.
## 3
## What are you talking about?
## 5
## Shall we move on? Good then.
## 6
## I'm hungry. Let's eat. You already?
## 6
## Setting `byrow=FALSE` gives a total for the text variable
word_count(DATA$state, byrow=FALSE, names = TRUE)
## [1] 53
## identical to `byrow=FALSE` above
sum(word_count(DATA$state))
## [1] 53
## By grouping variable
tapply(DATA$state, DATA$person, wc, byrow=FALSE)
## greg researcher sally sam teacher
## 20 6 10 13 4
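For readers who want to see what these counts amount to, the sketch below reproduces them in base R. It assumes that a word is any run of non-whitespace characters, which happens to agree with word_count for this data set but is not necessarily how qdap tokenises in general. The text and speakers are copied from DATA, and `count_words` is a hypothetical helper.

```r
# DATA$state and DATA$person, reproduced so the sketch is self-contained
state <- c("Computer is fun. Not too fun.", "No it's not, it's dumb.",
           "What should we do?", "You liar, it stinks!",
           "I am telling the truth!", "How can we be certain?",
           "There is no way.", "I distrust you.",
           "What are you talking about?", "Shall we move on? Good then.",
           "I'm hungry. Let's eat. You already?")
person <- c("sam", "greg", "teacher", "sam", "greg", "sally", "greg", "sam",
            "sally", "researcher", "greg")

# Hypothetical helper: words per row, splitting on whitespace
count_words <- function(text) lengths(strsplit(trimws(text), "\\s+"))

by_row <- count_words(state)            # one count per turn of talk
total <- sum(by_row)                    # total for the whole text variable
by_person <- tapply(by_row, person, sum) # totals per grouping variable
by_person
```

The per-row counts, the total of 53 and the per-person totals match the word_count output shown above.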
♦ word_count Plotting Centered Word Counts♦
## Scale variable
raj2 <- raj
raj2$scaled <- unlist(tapply(wc(raj$dialogue), raj2$act, scale))
raj2$scaled2 <- unlist(tapply(wc(raj$dialogue), raj2$act, scale, scale = FALSE))
raj2$ID <- factor(unlist(tapply(raj2$act, raj2$act, seq_along)))
## Plot with ggplot2
library(ggplot2); library(grid)
ggplot(raj2, aes(x = ID, y = scaled, fill =person)) +
geom_bar(stat="identity", position="identity") +
facet_grid(act~.) +
ylab("Standard Deviations") + xlab("Turns of Talk") +
guides(fill = guide_legend(nrow = 5, byrow = TRUE)) +
theme(legend.position="bottom", legend.key.size = unit(.35, "cm"),
axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
ggtitle("Standardized Word Counts\nPer Turn of Talk")
♦ character_count Examples♦
character_count(DATA$state)
## [1] 22 17 14 15 18 17 12 12 22 21 27
## Setting `byrow=FALSE` gives a total for the text variable
character_count(DATA$state, byrow=FALSE)
## [1] 197
## identical to `byrow=FALSE` above
sum(character_count(DATA$state))
## [1] 197
## By grouping variable
tapply(DATA$state, DATA$person, character_count, byrow=FALSE)
## greg researcher sally sam teacher
## 74 21 39 49 14
♦ character_table Example♦
x <- character_table(DATA$state, DATA$person)
names(x)
## [1] "raw" "prop" "rnp"
counts(x)
## person ' ! , . ? a b c d e f g h i k l m n o p r s t u v w y
## 1 greg 4 16 1 1 7 1 5 1 0 2 7 0 2 4 6 0 4 3 5 4 0 4 4 10 4 0 1 4
## 2 researcher 0 5 0 0 1 1 1 0 0 1 3 0 1 2 0 0 2 1 2 4 0 0 1 1 0 1 1 0
## 3 sally 0 8 0 0 1 2 6 2 2 0 4 0 1 2 2 1 1 0 3 3 0 2 0 4 2 0 3 1
## 4 sam 0 10 1 1 5 0 1 0 1 1 1 2 0 0 6 1 1 1 4 6 1 3 5 7 6 0 0 2
## 5 teacher 0 3 0 0 0 1 1 0 0 2 1 0 0 2 0 0 1 0 0 2 0 0 1 1 1 0 2 0
proportions(x)[, 1:10]
## person ' ! , . ? a b c
## 1 greg 4 16.00 1.000 1.000 7.000 1.000 5.000 1 0.000
## 2 researcher 0 17.86 0.000 0.000 3.571 3.571 3.571 0 0.000
## 3 sally 0 16.00 0.000 0.000 2.000 4.000 12.000 4 4.000
## 4 sam 0 15.15 1.515 1.515 7.576 0.000 1.515 0 1.515
## 5 teacher 0 16.67 0.000 0.000 0.000 5.556 5.556 0 0.000
scores(x)[, 1:7]
## person ' ! , . ?
## 1 greg 4(4%) 16(16.00%) 1(1.00%) 1(1.00%) 7(7.00%) 1(1.00%)
## 2 researcher 0 5(17.86%) 0 0 1(3.57%) 1(3.57%)
## 3 sally 0 8(16.00%) 0 0 1(2.00%) 2(4.00%)
## 4 sam 0 10(15.15%) 1(1.52%) 1(1.52%) 5(7.58%) 0
## 5 teacher 0 3(16.67%) 0 0 0 1(5.56%)
## Combine Columns
vowels <- c("a", "e", "i", "o", "u")
cons <- letters[!letters %in% c(vowels, qcv(j, q, x, z))]
colcomb2class(x, list(vowels = vowels, consonants = cons, other = 2:7))
## person vowels consonants other
## 1 greg 26(.26%) 44(.44%) 30(.30%)
## 2 researcher 8(.29%) 13(.46%) 7(.25%)
## 3 sally 17(.34%) 22(.44%) 11(.22%)
## 4 sam 20(.30%) 29(.44%) 17(.26%)
## 5 teacher 5(.28%) 9(.50%) 4(.22%)
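A character count table of this kind can be approximated in base R by splitting each turn of talk into single characters and cross-tabulating them against the speaker. This is a sketch under the assumption that text is lower-cased and that spaces and punctuation are kept as characters; it is not character_table's implementation.

```r
# DATA$state and DATA$person, reproduced so the sketch is self-contained
state <- c("Computer is fun. Not too fun.", "No it's not, it's dumb.",
           "What should we do?", "You liar, it stinks!",
           "I am telling the truth!", "How can we be certain?",
           "There is no way.", "I distrust you.",
           "What are you talking about?", "Shall we move on? Good then.",
           "I'm hungry. Let's eat. You already?")
person <- c("sam", "greg", "teacher", "sam", "greg", "sally", "greg", "sam",
            "sally", "researcher", "greg")

# One element per character, with the speaker repeated alongside it
chars <- strsplit(tolower(state), "")
long <- data.frame(person = rep(person, lengths(chars)),
                   char = unlist(chars),
                   stringsAsFactors = FALSE)

# Rows are speakers, columns are characters
char_counts <- table(long$person, long$char)
char_counts["teacher", c(" ", "?", "d", "w")]
```

For example, the teacher's single turn ("What should we do?") yields 3 spaces, 1 question mark, 2 d's and 2 w's, in line with the teacher row of counts(x) above.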
♦ character_table Plot Method♦
plot(x)
plot(x, label = TRUE, high = "red", lab.digits = 1, zero.replace = "")
♦ character_table Additional Plotting♦
library(ggplot2);library(reshape2)
dat <- char_table(DATA$state, list(DATA$sex, DATA$adult))
dat2 <- colsplit2df(melt(dat$raw), keep.orig = TRUE)
dat2$adult2 <- lookup(as.numeric(as.character(dat2$adult)),
c(0, 1), c("child", "adult"))
head(dat2, 15)
## sex&adult sex adult variable value adult2
## 1 f.0 f 0 ' 0 child
## 2 f.1 f 1 ' 0 adult
## 3 m.0 m 0 ' 4 child
## 4 m.1 m 1 ' 0 adult
## 5 f.0 f 0 8 child
## 6 f.1 f 1 5 adult
## 7 m.0 m 0 26 child
## 8 m.1 m 1 3 adult
## 9 f.0 f 0 ! 0 child
## 10 f.1 f 1 ! 0 adult
## 11 m.0 m 0 ! 2 child
## 12 m.1 m 1 ! 0 adult
## 13 f.0 f 0 , 0 child
## 14 f.1 f 1 , 0 adult
## 15 m.0 m 0 , 2 child
ggplot(data = dat2, aes(y = variable, x = value, colour=sex)) +
facet_grid(adult2~.) +
geom_line(size=1, aes(group =variable), colour = "black") +
geom_point()
ggplot(data = dat2, aes(x = variable, y = value)) +
geom_bar(aes(fill = variable), stat = "identity") +
facet_grid(sex ~ adult2, margins = TRUE) +
theme(legend.position="none")
It is helpful to view the frequency distributions for a vector, matrix or dataframe. The dist_tab function allows the researcher to quickly generate frequency distributions.
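The structure of such a frequency distribution can be sketched with base R's cut, table and cumsum. The function `freq_table` below is a hypothetical stand-in written for illustration; it handles a single vector only and is not dist_tab's implementation.

```r
# Hypothetical helper: frequency distribution for one vector.
# Numeric input is binned into `breaks` intervals; anything else is tabulated
# by its distinct values.
freq_table <- function(x, breaks = NULL) {
  bins <- if (is.numeric(x) && !is.null(breaks)) cut(x, breaks) else factor(x)
  freq <- as.vector(table(bins))
  data.frame(interval = levels(bins),
             freq = freq,
             cum.freq = cumsum(freq),
             percent = round(100 * freq / sum(freq), 2),
             cum.percent = round(100 * cumsum(freq) / sum(freq), 2),
             stringsAsFactors = FALSE)
}

# The built-in CO2 data set: a numeric column and a factor column
conc_tab <- freq_table(CO2$conc, 4)
type_tab <- freq_table(CO2$Type)
conc_tab
```

Each row reports the interval (or level), its frequency, the running total, and both as percentages of all observations.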
♦ dist_tab Examples♦
dist_tab(rnorm(10000), 10)
## interval freq cum.freq percent cum.percent
## 1 (-3.98,-3.17] 7 7 0.07 0.07
## 2 (-3.17,-2.37] 71 78 0.71 0.78
## 3 (-2.37,-1.58] 526 604 5.26 6.04
## 4 (-1.58,-0.778] 1612 2216 16.12 22.16
## 5 (-0.778,0.0204] 2864 5080 28.64 50.80
## 6 (0.0204,0.818] 2923 8003 29.23 80.03
## 7 (0.818,1.62] 1499 9502 14.99 95.02
## 8 (1.62,2.41] 420 9922 4.20 99.22
## 9 (2.41,3.21] 69 9991 0.69 99.91
## 10 (3.21,4.02] 9 10000 0.09 100.00
dist_tab(sample(c("red", "blue", "gray"), 100, T), right = FALSE)
## interval freq cum.freq percent cum.percent
## 1 blue 34 34 34 34
## 2 gray 31 65 31 65
## 3 red 35 100 35 100
dist_tab(CO2, 4)
## $Plant
## interval freq cum.freq percent cum.percent
## 1 Qn1 7 7 8.33 8.33
## 2 Qn2 7 14 8.33 16.67
## 3 Qn3 7 21 8.33 25.00
## 4 Qc1 7 28 8.33 33.33
## 5 Qc3 7 35 8.33 41.67
## 6 Qc2 7 42 8.33 50.00
## 7 Mn3 7 49 8.33 58.33
## 8 Mn2 7 56 8.33 66.67
## 9 Mn1 7 63 8.33 75.00
## 10 Mc2 7 70 8.33 83.33
## 11 Mc3 7 77 8.33 91.67
## 12 Mc1 7 84 8.33 100.00
##
## $Type
## interval freq cum.freq percent cum.percent
## 1 Quebec 42 42 50 50
## 2 Mississippi 42 84 50 100
##
## $Treatment
## interval freq cum.freq percent cum.percent
## 1 nonchilled 42 42 50 50
## 2 chilled 42 84 50 100
##
## $conc
## interval freq cum.freq percent cum.percent
## 1 (94.1,321] 36 36 42.86 42.86
## 2 (321,548] 24 60 28.57 71.43
## 3 (548,774] 12 72 14.29 85.71
## 4 (774,1e+03] 12 84 14.29 100.00
##
## $uptake
## interval freq cum.freq percent cum.percent
## 1 (7.66,17.1] 19 19 22.62 22.62
## 2 (17.1,26.6] 18 37 21.43 44.05
## 3 (26.6,36] 25 62 29.76 73.81
## 4 (36,45.5] 22 84 26.19 100.00
wdst <- with(mraja1spl, word_stats(dialogue, list(sex, fam.aff, died)))
dist_tab(wdst$gts[1:4], 5)
$`sex&fam.aff&died`
interval Freq cum.Freq percent cum.percent
1 f.cap.FALSE 1 1 9.09 9.09
2 f.cap.TRUE 1 2 9.09 18.18
3 f.mont.TRUE 1 3 9.09 27.27
4 m.cap.FALSE 1 4 9.09 36.36
5 m.cap.TRUE 1 5 9.09 45.45
6 m.escal.FALSE 1 6 9.09 54.55
7 m.escal.TRUE 1 7 9.09 63.64
8 m.mont.FALSE 1 8 9.09 72.73
9 m.mont.TRUE 1 9 9.09 81.82
10 m.none.FALSE 1 10 9.09 90.91
11 none.none.FALSE 1 11 9.09 100.00
$n.sent
interval Freq cum.Freq percent cum.percent
1 (3.85,34.7] 7 7 63.64 63.64
2 (34.7,65.6] 0 7 0.00 63.64
3 (65.6,96.4] 2 9 18.18 81.82
4 (96.4,127] 1 10 9.09 90.91
5 (127,158] 1 11 9.09 100.00
$n.words
interval Freq cum.Freq percent cum.percent
1 (14.4,336] 6 6 54.55 54.55
2 (336,658] 2 8 18.18 72.73
3 (658,981] 1 9 9.09 81.82
4 (981,1.3e+03] 1 10 9.09 90.91
5 (1.3e+03,1.62e+03] 1 11 9.09 100.00
$n.char
interval Freq cum.Freq percent cum.percent
1 (72.7,1.34e+03] 6 6 54.55 54.55
2 (1.34e+03,2.6e+03] 2 8 18.18 72.73
3 (2.6e+03,3.86e+03] 1 9 9.09 81.82
4 (3.86e+03,5.12e+03] 1 10 9.09 90.91
5 (5.12e+03,6.39e+03] 1 11 9.09 100.00
In some analyses of text the researcher may wish to gather information about parts of speech (POS). The function pos and its grouping variable counterpart, pos_by, provide this functionality. The pos functions are wrappers for POS related functions from the openNLP package. The pos_tags function provides a quick reference to what the POS tags utilized by openNLP mean. For more information on the POS tags see the Penn Treebank Project.
The following examples utilize the pos_by function, as the pos function is used identically, except without specifying a grouping.var. It is important to realize that POS tagging is a very slow process. The speed can be increased by using the parallel = TRUE argument. Additionally, the user can recycle the output from one run of pos, pos_by or formality and use it interchangeably between the pos_by and formality functions. This reuses the POS tagging, which is the time intensive part (and can be extracted via YOUR_OUTPUT_HERE[["POStagged"]] from any of the above objects).
♦ pos_tags - Interpreting POS Tags♦
pos_tags()
## Tag Description
## 1 CC Coordinating conjunction
## 2 CD Cardinal number
## 3 DT Determiner
## 4 EX Existential there
## 5 FW Foreign word
## 6 IN Preposition or subordinating conjunction
## 7 JJ Adjective
## 8 JJR Adjective, comparative
## 9 JJS Adjective, superlative
## 10 LS List item marker
## 11 MD Modal
## 12 NN Noun, singular or mass
## 13 NNS Noun, plural
## 14 NNP Proper noun, singular
## 15 NNPS Proper noun, plural
## 16 PDT Predeterminer
## 17 POS Possessive ending
## 18 PRP Personal pronoun
## 19 PRP$ Possessive pronoun
## 20 RB Adverb
## 21 RBR Adverb, comparative
## 22 RBS Adverb, superlative
## 23 RP Particle
## 24 SYM Symbol
## 25 TO to
## 26 UH Interjection
## 27 VB Verb, base form
## 28 VBD Verb, past tense
## 29 VBG Verb, gerund or present participle
## 30 VBN Verb, past participle
## 31 VBP Verb, non-3rd person singular present
## 32 VBZ Verb, 3rd person singular present
## 33 WDT Wh-determiner
## 34 WP Wh-pronoun
## 35 WP$ Possessive wh-pronoun
## 36 WRB Wh-adverb
♦ pos_by - POS by Group(s)♦
posbydat <- with(DATA, pos_by(state, list(adult, sex)))
## Available elements
names(posbydat)
[1] "text" "POStagged" "POSprop" "POSfreq"
[5] "POSrnp" "percent" "zero.replace" "pos.by.freq"
[9] "pos.by.prop" "pos.by.rnp"
## Inspecting the truncated output
lview(posbydat)
$text
[1] "computer is fun not too fun" "no its not its dumb"
[3] "what should we do" "you liar it stinks"
[5] "i am telling the truth" "how can we be certain"
[7] "there is no way" "i distrust you"
[9] "what are you talking about" "shall we move on good then"
[11] "im hungry lets eat you already"
$POStagged
POStagged POStags word.count
1 computer/NN is/VBZ fun/NN ...RB fun/NN NN, VBZ, NN, RB, RB, NN 6
2 no/DT its/PRP$ not/RB its/PRP$ dumb/JJ DT, PRP$, RB, PRP$, JJ 5
3 what/WP should/MD we/PRP do/VB WP, MD, PRP, VB 4
4 you/PRP liar/VBP it/PRP stinks/VB PRP, VBP, PRP, VB 4
5 i/PRP am/VBP telling/VBG...DT truth/NN PRP, VBP, VBG, DT, NN 5
6 how/WRB can/MD we/PRP ...VB certain/JJ WRB, MD, PRP, VB, JJ 5
7 there/EX is/VBZ no/DT way/NN EX, VBZ, DT, NN 4
8 i/FW distrust/NN you/PRP FW, NN, PRP 3
9 what/WP are/VBP you/PR...VBG about/IN WP, VBP, PRP, VBG, IN 5
10 shall/MD we/PRP move/VB ...JJ then/RB MD, PRP, VB, IN, JJ, RB 6
11 im/PRP hungry/JJ let...PRP already/RB PRP, JJ, VBZ, VB, PRP, RB 6
$POSprop
wrd.cnt propDT propEX propFW propIN propJJ propMD ... propWRB
1 6 0 0 0.00000 0.00000 0.00000 0.00000 0
2 5 20 0 0.00000 0.00000 20.00000 0.00000 0
3 4 0 0 0.00000 0.00000 0.00000 25.00000 0
4 4 0 0 0.00000 0.00000 0.00000 0.00000 0
5 5 20 0 0.00000 0.00000 0.00000 0.00000 0
6 5 0 0 0.00000 0.00000 20.00000 20.00000 20
7 4 25 25 0.00000 0.00000 0.00000 0.00000 0
8 3 0 0 33.33333 0.00000 0.00000 0.00000 0
9 5 0 0 0.00000 20.00000 0.00000 0.00000 0
10 6 0 0 0.00000 16.66667 16.66667 16.66667 0
11 6 0 0 0.00000 0.00000 16.66667 0.00000 0
$POSfreq
wrd.cnt DT EX FW IN JJ MD NN PRP PRP$ RB VB VBG VBP VBZ WP WRB
1 6 0 0 0 0 0 0 3 0 0 2 0 0 0 1 0 0
2 5 1 0 0 0 1 0 0 0 2 1 0 0 0 0 0 0
3 4 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0
4 4 0 0 0 0 0 0 0 2 0 0 1 0 1 0 0 0
5 5 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0
6 5 0 0 0 0 1 1 0 1 0 0 1 0 0 0 0 1
7 4 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0
8 3 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0
9 5 0 0 0 1 0 0 0 1 0 0 0 1 1 0 1 0
10 6 0 0 0 1 1 1 0 1 0 1 1 0 0 0 0 0
11 6 0 0 0 0 1 0 0 2 0 1 1 0 0 1 0 0
$POSrnp
wrd.cnt DT EX FW IN JJ ... WRB
1 6 0 0 0 0 0 0
2 5 1(20.0%) 0 0 0 1(20.0%) 0
3 4 0 0 0 0 0 0
4 4 0 0 0 0 0 0
5 5 1(20.0%) 0 0 0 0 0
6 5 0 0 0 0 1(20.0%) 1(20.0%)
7 4 1(25.0%) 1(25.0%) 0 0 0 0
8 3 0 0 1(33.3%) 0 0 0
9 5 0 0 0 1(20.0%) 0 0
10 6 0 0 0 1(16.7%) 1(16.7%) 0
11 6 0 0 0 0 1(16.7%) 0
$percent
[1] TRUE
$zero.replace
[1] 0
$pos.by.freq
adult&sex wrd.cnt DT EX FW IN JJ MD NN PRP PRP$ RB VB VBG VBP VBZ WP WRB
1 0.f 10 0 0 0 1 1 1 0 2 0 0 1 1 1 0 1 1
2 0.m 33 3 1 1 0 2 0 6 6 2 4 2 1 2 3 0 0
3 1.f 6 0 0 0 1 1 1 0 1 0 1 1 0 0 0 0 0
4 1.m 4 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0
$pos.by.prop
adult&sex wrd.cnt DT EX FW IN JJ ... WP
1 0.f 10 0.000000 0.000000 0.000000 10.00000 10.000000 10
2 0.m 33 9.090909 3.030303 3.030303 0.00000 6.060606 0
3 1.f 6 0.000000 0.000000 0.000000 16.66667 16.666667 0
4 1.m 4 0.000000 0.000000 0.000000 0.00000 0.000000 25
$pos.by.rnp
adult&sex wrd.cnt DT EX FW IN JJ ... WP
1 0.f 10 0 0 0 1(10.0%) 1(10.0%) 1(10.0%)
2 0.m 33 3(9.1%) 1(3.0%) 1(3.0%) 0 2(6.1%) 0
3 1.f 6 0 0 0 1(16.7%) 1(16.7%) 0
4 1.m 4 0 0 0 0 0 1(25.0%)
♦ Plot Method♦
plot(posbydat, values = TRUE, digits = 2)
♦ pos_by - Recycling Saves Time♦
posbydat2 <- with(DATA, pos_by(posbydat, list(person, sex)))
system.time(with(DATA, pos_by(posbydat, list(person, sex))))
user system elapsed
0.07 0.00 0.07
## `pos_by` output Recycled for `formality`
with(DATA, formality(posbydat, list(person, sex)))
## person&sex word.count formality
## 1 greg.m 20 35.00
## 2 sam.m 13 34.62
## 3 researcher.f 6 33.33
## 4 sally.f 10 20.00
## 5 teacher.m 4 0.00
Syllable counts can be a useful source of information when examining associations with education level, age, SES, gender, etc. Several readability scores rely on syllable and polysyllable word counts. qdap defines a polysyllable word as a word with 3 or more syllables, though some in the linguistics/literacy fields may include two syllable words. syllable_count is the base function for syllable_sum, polysyllable_sum, and combo_syllable_sum, though it is generally not of direct use to the researcher conducting discourse analysis. syllable_count uses a dictionary lookup method augmented with a syllable algorithm for words not found in the dictionary. Words not found in the dictionary are denoted with an NF in the in.dictionary column of the output.
Here is a list of qdap syllabication functions and their descriptions:
syllable_count | Count the number of syllables in a single text string. |
syllable_sum | Count the number of syllables per row of text. |
polysyllable_sum | Count the number of polysyllables per row of text. |
combo_syllable_sum | Count the number of both syllables and polysyllables per row of text. |
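The dictionary-plus-fallback approach described above can be sketched in base R. Everything here is illustrative: the three-word dictionary `syll_dict` is hypothetical (qdap ships its own, far larger dictionary), and the fallback rule of counting runs of consecutive vowels is a common heuristic chosen for this sketch, not necessarily the algorithm qdap uses.

```r
# Tiny hypothetical dictionary of known syllable counts
syll_dict <- c(robots = 2L, like = 1L, lie = 1L)

# Fallback heuristic for unknown words: count runs of consecutive vowels,
# with a minimum of one syllable per word
estimate_syllables <- function(word) {
  runs <- gregexpr("[aeiouy]+", word)[[1]]
  max(1L, sum(runs > 0))
}

count_syllables <- function(text) {
  words <- strsplit(gsub("[^a-z' ]", "", tolower(text)), " +")[[1]]
  words <- words[nzchar(words)]
  in_dict <- words %in% names(syll_dict)
  # Look up known words; estimate the rest
  syllables <- ifelse(in_dict, syll_dict[words],
                      vapply(words, estimate_syllables, integer(1)))
  data.frame(words = words,
             syllables = unname(syllables),
             in.dictionary = ifelse(in_dict, "-", "NF"),
             stringsAsFactors = FALSE)
}

syl <- count_syllables("Robots like Dason lie.")
syl
```

The lookup matters because the vowel-run heuristic alone would give "like" two syllables (the silent e forms its own run); only "dason", which is missing from the dictionary, falls through to the estimate and is flagged NF.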
♦ syllabication Examples♦
syllable_count("Robots like Dason lie.")
words syllables in.dictionary
1 robots 2 -
2 like 1 -
3 dason 2 NF
4 lie 1 -
## The text variable for reference
DATA$state
[1] "Computer is fun. Not too fun."
[2] "No it's not, it's dumb."
[3] "What should we do?"
[4] "You liar, it stinks!"
[5] "I am telling the truth!"
[6] "How can we be certain?"
[7] "There is no way."
[8] "I distrust you."
[9] "What are you talking about?"
[10] "Shall we move on? Good then."
[11] "I'm hungry. Let's eat. You already?"
syllable_sum(DATA$state)
[1] 8 5 4 5 6 6 4 4 7 6 9
polysyllable_sum(DATA$state)
[1] 1 0 0 0 0 0 0 0 0 0 1
combo_syllable_sum(DATA$state)
syllable.count polysyllable.count
1 8 1
2 5 0
3 4 0
4 5 0
5 6 0
6 6 0
7 4 0
8 4 0
9 7 0
10 6 0
11 9 1
qdap offers a number of word statistics and scores applied by grouping variable. Some functions are original to qdap, while others are taken from academic papers. Complete references for statistics/scores based on others' work are provided in the help manual where appropriate. It is assumed that the reader is familiar, or can become acquainted, with the theory and methods for qdap functions based on the work of others. For qdap functions that are original to qdap a more robust description of the use and theory is provided.
Readability scores were originally designed to measure the difficulty of text. Scores are generally based on the number of words, syllables, and polysyllables, and on word length. While these scores are not specifically designed for, or tested on, speech, they can be useful indicators of speech complexity. The following examples demonstrate the use of these readability scores:
♦ Automated Readability Index♦
with(rajSPLIT, automated_readability_index(dialogue, list(sex, fam.aff)))
sex&fam.aff word.count sentence.count character.count Aut._Read._Index
1 f.cap 9458 929 37474 2.3
2 f.mont 28 4 88 -3.1
3 m.cap 1204 133 4615 1.2
4 m.escal 3292 262 13406 4.0
5 m.mont 6356 555 26025 3.6
6 m.none 3233 250 13527 4.7
7 none.none 156 12 665 5.1
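As a rough check on how the score relates to the count columns, the sketch below applies the standard Automated Readability Index formula to two rows of the table above. That qdap uses exactly these coefficients is an assumption, and the helper name ari is hypothetical.

```r
# Standard ARI formula (assumed to match qdap's implementation)
ari <- function(words, sentences, characters) {
  4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
}

# Counts copied from the f.cap and f.mont rows above
ari_scores <- round(ari(words = c(9458, 28),
                        sentences = c(929, 4),
                        characters = c(37474, 88)), 1)
ari_scores
```

For these two rows the formula gives 2.3 and -3.1, in agreement with the Aut._Read._Index column.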
♦ Coleman Liau♦
with(rajSPLIT, coleman_liau(dialogue, list(fam.aff, act)))
fam.aff&act word.count sentence.count character.count Coleman_Liau
1 cap.1 2636 272 10228 4.0
2 cap.2 2113 193 8223 4.4
3 cap.3 3540 339 14183 4.9
4 cap.4 2159 232 8620 4.5
5 cap.5 214 26 835 3.5
6 escal.1 748 36 3259 8.4
♦ SMOG♦
with(rajSPLIT, SMOG(dialogue, list(person, act)))
person&act word.count sentence.count polysyllable.count SMOG
1 Benvolio.1 621 51 25 7.1
2 Capulet.1 736 72 35 7.1
3 Capulet.3 749 69 28 6.8
4 Capulet.4 569 73 25 6.5
5 Friar Laurence.2 699 42 36 8.4
6 Friar Laurence.3 675 61 32 7.3
7 Friar Laurence.4 656 42 25 7.5
8 Friar Laurence.5 696 54 32 7.5
9 Juliet.2 1289 113 48 6.9
10 Juliet.3 1722 152 64 6.8
11 Juliet.4 932 61 37 7.6
12 Lady Capulet.3 393 39 15 6.7
13 Mercutio.2 964 82 43 7.3
14 Mercutio.3 578 54 19 6.5
15 Nurse.1 599 59 20 6.5
16 Nurse.2 779 76 24 6.3
17 Nurse.3 579 68 14 5.7
18 Nurse.4 250 50 9 5.6
19 Romeo.1 1158 113 48 6.9
20 Romeo.2 1289 109 46 6.8
21 Romeo.3 969 87 48 7.4
22 Romeo.5 1216 103 52 7.2
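The SMOG column appears to follow the standard SMOG formula, which depends only on the polysyllable and sentence counts. Treating that as an assumption about the implementation, the Benvolio.1 row can be worked through as:

\[ SMOG = 1.043\sqrt{n_{poly} \cdot \frac{30}{n_{sent}}} + 3.1291 \]

\[ 1.043\sqrt{25 \cdot \frac{30}{51}} + 3.1291 \approx 1.043(3.835) + 3.1291 \approx 7.13 \]

This rounds to the 7.1 reported in the table.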
♦ Flesch Kincaid♦
with(rajSPLIT, flesch_kincaid(dialogue, list(sex, fam.aff)))
sex&fam.aff word.count sentence.count syllable.count FK_grd.lvl FK_read.ease
1 f.cap 9458 929 11641 2.9 92.375
2 f.mont 28 4 30 -0.2 109.087
3 m.cap 1204 133 1452 2.2 95.621
4 m.escal 3292 262 4139 4.1 87.715
5 m.mont 6356 555 7965 3.7 89.195
6 m.none 3233 250 4097 4.4 86.500
7 none.none 156 12 195 4.2 87.890
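The two Flesch-Kincaid columns can be recomputed from the counts with the standard grade level and reading ease formulas. The sketch below assumes qdap uses these standard coefficients; the helper names fk_grade and fk_ease are hypothetical.

```r
# Standard Flesch-Kincaid grade level (assumed to match qdap)
fk_grade <- function(words, sentences, syllables) {
  0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
}

# Standard Flesch reading ease (assumed to match qdap)
fk_ease <- function(words, sentences, syllables) {
  206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
}

# Counts copied from the f.cap and f.mont rows above
grade <- round(fk_grade(c(9458, 28), c(929, 4), c(11641, 30)), 1)
ease  <- round(fk_ease(c(9458, 28), c(929, 4), c(11641, 30)), 3)
data.frame(group = c("f.cap", "f.mont"), grade, ease)
```

Both rows reproduce the table values (2.9 and -0.2 for grade level, 92.375 and 109.087 for reading ease).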
Note that the Fry score is a graphical display, rather than text as the other readability scores are. This is in keeping with the original procedures outlined by Fry.
♦ Fry♦
with(rajSPLIT, fry(dialogue, list(sex, fam.aff)))
## sex&fam.aff ave.syll.per.100 ave.sent.per.100
## 1 f.cap 123.0 8.728
## 2 m.cap 115.3 13.190
## 3 m.escal 126.7 9.824
## 4 m.mont 122.0 11.642
## 5 m.none 123.0 8.107
♦ Linsear Write♦
with(rajSPLIT, linsear_write(dialogue, person))
person sent.per.100 hard_easy_sum Linsear_Write
1 Balthasar 9.556 110 4.76
2 Benvolio 4.143 108 12.03
3 Capulet 11.469 115 4.01
4 Chorus 3.071 104 15.93
5 First Watchman 14.222 114 3.01
6 Friar Laurence 4.263 108 11.67
7 Gregory 11.000 100 3.55
8 Juliet 3.446 110 14.96
9 Lady Capulet 7.267 110 6.57
10 Mercutio 5.625 102 8.07
11 Montague 6.000 114 8.50
12 Nurse 12.098 102 3.22
13 Paris 9.091 110 5.05
14 Peter 10.357 110 4.31
15 Prince 10.842 110 4.07
16 Romeo 9.250 114 5.16
17 Sampson 9.421 107 4.68
18 Servant 9.667 104 4.38
19 Tybalt 9.591 112 4.84
Dissimilarity is another term for distance that is often used in text analysis to measure the pairwise proximity of grouping variables. The qdap Dissimilarity
function is a wrapper for the R stats package's dist function designed to handle text. Dissimilarity
takes all the same method types as dist but also includes the default method = “prop” (1 - “binary”) that is focused on the similarity between grouping variables.
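The relationship between the "prop" method and dist's "binary" method can be sketched in base R. The small count matrix below is made up for illustration, and the sketch only mirrors the 1 - "binary" description given above; it is not qdap's internal code.

```r
# Hypothetical word counts: one row per group, one column per word
counts <- rbind(
  a = c(fun = 1, dumb = 0, liar = 1, truth = 0),
  b = c(fun = 1, dumb = 1, liar = 0, truth = 0),
  c = c(fun = 0, dumb = 0, liar = 0, truth = 1)
)

# "binary" distance: among words used by at least one of the two groups,
# the proportion used by only one of them
binary_dist <- dist(counts, method = "binary")

# 1 - "binary": the proportion of those words shared by both groups
prop <- 1 - binary_dist
prop_mat <- as.matrix(prop)
prop
```

Groups a and b share one of the three words that either of them uses, so their value is 1/3; group c shares nothing with the others and gets 0.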
♦ Dissimilarity
Examples♦
with(DATA, Dissimilarity(state, list(sex, adult)))
f.0 f.1 m.0
f.1 0.067
m.0 0.029 0.000
m.1 0.167 0.111 0.000
with(DATA, Dissimilarity(state, person))
greg researcher sally sam
researcher 0.000
sally 0.037 0.067
sam 0.160 0.000 0.050
teacher 0.000 0.111 0.167 0.000
with(DATA, Dissimilarity(state, person, method = "minkowski"))
greg researcher sally sam
researcher 5.477
sally 5.657 3.742
sam 5.568 4.796 4.796
teacher 5.292 2.828 3.162 4.583
dat <- pres_debates2012[pres_debates2012$person %in% qcv(OBAMA, ROMNEY),]
with(dat, Dissimilarity(dialogue, list(person, time)))
OBAMA.1 OBAMA.2 OBAMA.3 ROMNEY.1 ROMNEY.2
OBAMA.2 0.340
OBAMA.3 0.300 0.341
ROMNEY.1 0.340 0.287 0.258
ROMNEY.2 0.291 0.349 0.296 0.321
ROMNEY.3 0.264 0.297 0.329 0.290 0.338
♦ Dissimilarity
Clustering (Dendrogram)♦
x <- with(pres_debates2012, Dissimilarity(dialogue, list(person, time)))
fit <- hclust(x)
plot(fit)
## draw dendrogram with colored borders around the 3 clusters
rect.hclust(fit, k=3, border=c("red", "purple", "seagreen"))
The Kullback-Leibler divergence is often used as a measure of distance, though the resulting matrix is asymmetrical. qdap's kullback_leibler
compares the differences between two probability distributions and often leads to results similar to those from Dissimilarity
. Note that unlike many other qdap functions the user must either supply a word frequency matrix (wfm
) to x or some other matrix format. This allows the function to be flexibly used with termco
and other functions that produce count matrices.
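The asymmetry is easy to see in a small base R sketch of the textbook Kullback-Leibler divergence applied to the columns of a count matrix. The toy counts are made up, and details such as the logarithm base or how zero counts are handled in kullback_leibler are not covered here; this is an illustration of the general idea only.

```r
# Textbook KL divergence between two count vectors (natural log);
# assumes no zero counts
kl_divergence <- function(p, q) {
  p <- p / sum(p)
  q <- q / sum(q)
  sum(p * log(p / q))
}

# Hypothetical word-by-group count matrix
wfm_toy <- cbind(A = c(tax = 5, people = 3, health = 2),
                 B = c(tax = 4, people = 4, health = 2))

# Pairwise divergence: row group compared against column group
groups <- colnames(wfm_toy)
kl <- sapply(groups, function(j) {
  sapply(groups, function(i) kl_divergence(wfm_toy[, i], wfm_toy[, j]))
})
kl
```

The diagonal is zero, and the A-versus-B entry differs from the B-versus-A entry, which is why the matrix cannot be treated as a symmetric distance.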
♦ kullback_leibler
Example - Compare to Dissimilarity
♦
dat <- pres_debates2012[pres_debates2012$person %in% qcv(OBAMA, ROMNEY),]
(KL <- (kullback_leibler(with(dat, wfm(dialogue, list(person, time))))))
OBAMA.1 OBAMA.2 OBAMA.3 ROMNEY.1 ROMNEY.2 ROMNEY.3
OBAMA.1 0.000 0.237 0.221 0.195 0.250 0.264
OBAMA.2 0.104 0.000 0.161 0.148 0.142 0.223
OBAMA.3 0.119 0.152 0.000 0.142 0.180 0.168
ROMNEY.1 0.207 0.297 0.279 0.000 0.216 0.224
ROMNEY.2 0.194 0.195 0.262 0.116 0.000 0.234
ROMNEY.3 0.160 0.182 0.141 0.101 0.140 0.000
plot(KL, high = "red", values = TRUE)
Diversity, as applied to dialogue, is a measure of the richness of language being used. Specifically, it measures how expansive the vocabulary is while taking into account the number of total words used and the different words being used. qdap's diversity
function provides output for the Simpson, Shannon, Collision, Berger Parker, and Brillouin measures.
♦ diversity
Example♦
(div.mod <- with(mraja1spl, diversity(dialogue, person)))
person wc simpson shannon collision berger_parker brillouin
1 Abraham 24 0.942 2.405 2.331 0.167 1.873
2 Benvolio 621 0.994 5.432 4.874 0.037 4.809
3 Capulet 736 0.993 5.358 4.805 0.027 4.813
4 First Citizen 16 0.958 2.393 2.287 0.188 1.718
5 First Servant 69 0.983 3.664 3.464 0.072 2.961
6 Gregory 149 0.991 4.405 4.141 0.054 3.686
7 Juliet 206 0.993 4.676 4.398 0.039 3.971
8 Lady Capulet 288 0.993 4.921 4.519 0.042 4.231
9 Lady Montague 28 0.995 3.233 3.199 0.071 2.375
10 Mercutio 549 0.991 5.302 4.564 0.051 4.663
11 Montague 217 0.993 4.805 4.496 0.041 4.063
12 Nurse 599 0.991 5.111 4.561 0.040 4.588
13 Paris 32 0.990 3.276 3.194 0.094 2.449
14 Prince 167 0.990 4.463 4.161 0.048 3.757
15 Romeo 1163 0.994 5.650 4.917 0.026 5.124
16 Sampson 259 0.987 4.531 4.106 0.066 3.947
17 Second Capulet 17 0.963 2.425 2.371 0.118 1.767
18 Second Servant 41 0.993 3.532 3.457 0.073 2.687
19 Servant 184 0.987 4.389 4.023 0.060 3.744
20 Tybalt 160 0.993 4.539 4.345 0.044 3.801
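The five measures can be reproduced by hand for a small speaker. The sketch below uses Abraham's word frequencies as reported in the rank/frequency output elsewhere in this vignette (one word used four times, one used three times, seven used twice, three used once). The formulas are the common textbook definitions and are assumed, not confirmed, to be the ones diversity uses.

```r
# Abraham's word frequencies: 24 words in total
freq <- c(4, 3, rep(2, 7), rep(1, 3))
N <- sum(freq)
p <- freq / N

div <- c(
  wc            = N,
  simpson       = 1 - sum(freq * (freq - 1)) / (N * (N - 1)),
  shannon       = -sum(p * log(p)),
  collision     = -log(sum(p^2)),
  berger_parker = max(freq) / N,
  brillouin     = (lfactorial(N) - sum(lfactorial(freq))) / N
)
round(div, 3)
```

Rounded to three places these agree with the Abraham row of the table above (0.942, 2.405, 2.331, 0.167, 1.873).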
♦ diversity
Plot Method♦
plot(div.mod, low = "yellow", grid = FALSE, values = TRUE)
Formality is how contextualized a person's language use is. In situations involving what may be new content/context for an audience, a speaker may be more formal in their speech (Heylighen & Dewaele, 1999a, 1999b, 2002). Heylighen & Dewaele (2002) have developed a measure of formality based on categorizing parts of speech into contextual/formal categories. While qdap is not the creator of the algorithm for calculating formality
, Heylighen & Dewaele's (2002) F-measure (formality) is less known than other qdap word measures and thus more explanation is provided to the reader than, say, for the Dissimilarity
measures above. Heylighen & Dewaele's (2002) F-measure is calculated by taking the difference between the number of formal parts of speech (\(f\): noun, adjective, preposition, article) and the number of contextual parts of speech (\(c\): pronoun, verb, adverb, interjection), divided by the sum of all formal and contextual parts of speech plus conjunctions (\(N\)). This quotient is added to one and multiplied by 50 to ensure a measure between 0 and 100, with scores closer to 100 being more formal and those approaching 0 being more contextual.
\[ F = 50(\frac{n_{f}-n_{c}}{N} + 1) \]
Where:
\[ f = \left \{noun, \;adjective, \;preposition, \;article\right \} \]
\[ c = \left \{pronoun, \;verb, \;adverb, \;interjection\right \} \]
\[ N = \sum{(f \;+ \;c \;+ \;conjunctions)} \]
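The formula can be applied directly to a vector of part-of-speech counts. The sketch below uses made-up counts and a hypothetical helper, f_measure; it skips the tagging step entirely and only illustrates the arithmetic.

```r
# F-measure from part-of-speech counts (the counts are assumed to be
# already tallied into the nine categories named below)
f_measure <- function(pos_counts) {
  formal     <- c("noun", "adjective", "preposition", "article")
  contextual <- c("pronoun", "verb", "adverb", "interjection")
  n_f <- sum(pos_counts[formal])
  n_c <- sum(pos_counts[contextual])
  N   <- n_f + n_c + pos_counts[["conjunction"]]
  50 * ((n_f - n_c) / N + 1)
}

# Hypothetical counts: 60 formal, 30 contextual, 10 conjunctions
toy_counts <- c(noun = 30, adjective = 10, preposition = 12, article = 8,
                pronoun = 10, verb = 12, adverb = 5, interjection = 3,
                conjunction = 10)
f_toy <- f_measure(toy_counts)
f_toy
```

With 60 formal and 30 contextual words out of 100, the score is 50(0.3 + 1) = 65. A text with only contextual words would score 0 and one with only formal words would score 100.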
Note that formality utilizes parts of speech tagging. This is computationally expensive. The user may gain speed by setting parallel = TRUE if multiple cores are available. The user can also “recycle” the output from pos
, pos_by
, or formality
for the same text. This saves considerable time because the parts of speech are saved in the output from these functions, as demonstrated in the Recycled Formality Example below.
formality
also has a plotting method that allows for easy visualization and comparison of formality scores, word counts, formal/contextual parts of speech all by grouping variable(s). Please note that Heylighen & Dewaele (2002) state, “At present, a sample would probably need to contain a few hundred words for the measure to be minimally reliable. For single sentences, the F-value should only be computed for purposes of illustration” (p. 24).
♦ formality
Example♦
form <- with(raj, formality(dialogue, person))
♦ formality
Plot Method♦
plot(form)
♦ Recycling formality
♦
(form2 <- with(raj, formality(form, act)))
act word.count formality
1 5 3379 58.38
2 2 5358 58.10
3 1 5525 57.59
4 3 6354 57.22
5 4 3167 55.89
plot(form2, bar.colors=c("Set2", "RdBu"))
Polarity assignment, a form of sentiment analysis, uses an algorithm to determine the polarity of a sentence/statement. While polarity scores are applied to many forms of written social dialogue (e.g., Twitter, Facebook, etc.), they have not typically been applied to spoken dialogue. qdap offers a flexible function, polarity
, to determine polarity at the sentence level as well as to assign an average polarity score to individual groups within the grouping variable(s). The framework for polarity
is flexible in that the user may supply a polarized dictionary and optional weights. Many dictionaries used in sentiment analysis are designed for written, adult, online interaction.
The polarity score generated by polarity
is dependent upon the polarity dictionary used. This function defaults to the word polarity dictionary used by Hu & Liu (2004); however, this may not be appropriate for the context of children in a classroom. For instance, the word “sick” in a high school setting may mean that something is good, whereas “sick” used by a typical adult indicates that something is not right or carries a negative connotation. The user may (and is encouraged to) provide/augment/alter the dictionary (see the sentiment_frame
function). Development of context specific dictionaries, that are better suited for spoken dialogue in a school setting, is an exciting prospect that could lead to greater understanding of the emotional aspects of the spoken word on students. The user may add a dictionary with optional weights as a dataframe within an environment. The sentiment_frame
function aids the user in creating the polarity environment.
The equation used by the algorithm to assign value to the polarity of each sentence first utilizes the sentiment dictionary (Hu & Liu, 2004) to tag polarized words. A context cluster of words is pulled from around each polarized word (default four words before and two words after) to be considered as valence shifters. The words in this context cluster are tagged as neutral (\(x_i^{0}\)), negator (\(x_i^{N}\)), amplifier (\(x_i^{a}\)), or de-amplifier (\(x_i^{d}\)). Neutral words hold no value in the equation but do affect word count (\(n\)). Each polarized word is then weighted (\(w\)) based on the weights from the polarity.frame argument and then further weighted by the number and position of the valence shifters directly surrounding the positive or negative word. The researcher may provide a weight \(c\) to be utilized with amplifiers/de-amplifiers (default is .8; the de-amplifier weight is constrained to a lower bound of -1). Last, these context clusters (\(x_i^{T}\)) are summed and divided by the square root of the word count (\(\sqrt{n}\)), yielding an unbounded polarity score (\(\delta\)). Note that context clusters containing a comma before the polarized word will only consider words found after the comma.
\[ \delta=\frac{x_i^T}{\sqrt{n}} \]
Where:
\[
x_i^T=\sum{((1 + c(x_i^{A} - x_i^{D}))\cdot w(-1)^{\sum{x_i^{N}}})}
\]
\[
x_i^{A}=\sum{(w_{neg}\cdot x_i^{a})}
\]
\[
x_i^D = \max(x_i^{D'}, -1)
\]
\[
x_i^{D'}=\sum{(- w_{neg}\cdot x_i^{a} + x_i^{d})}
\]
\[
w_{neg}= \left(\sum{x_i^{N}}\right) \bmod {2}
\]
The following examples demonstrate how the polarity
and sentiment_frame
functions operate. Here polarity is calculated for the mraja1spl
data set (Act 1 of Romeo and Juliet). The gender, family affiliation and binary died/didn't die are used as the grouping variables.
♦ polarity
Example♦
(poldat <- with(mraja1spl, polarity(dialogue, list(sex, fam.aff, died))))
POLARITY BY GROUP
=================
sex&fam.aff&died tot.sent tot.word ave.polarity sd.polarity sd.mean.polarity
1 f.cap.FALSE 158 1810 0.076 0.262 0.292
2 f.cap.TRUE 24 221 0.042 0.209 0.204
3 f.mont.TRUE 4 29 0.079 0.398 0.199
4 m.cap.FALSE 73 717 0.026 0.256 0.104
5 m.cap.TRUE 17 185 -0.160 0.313 -0.510
6 m.escal.FALSE 9 195 -0.153 0.313 -0.488
7 m.escal.TRUE 27 646 -0.069 0.256 -0.272
8 m.mont.FALSE 70 952 -0.044 0.384 -0.114
9 m.mont.TRUE 114 1273 -0.004 0.409 -0.009
10 m.none.FALSE 7 78 0.062 0.107 0.583
11 none.none.FALSE 5 18 -0.282 0.439 -0.642
names(poldat)
## [1] "all" "group" "digits"
♦ polarity
- Sentence Level Polarity Scores♦
htruncdf(counts(poldat), 20, 10)
## sex&fam.af wc polarity pos.words neg.words text.var
## 1 m.cap.FALS 10 0 - - Gregory, o
## 2 m.cap.FALS 8 0 - - No, for th
## 3 m.cap.FALS 11 0 - - I mean, an
## 4 m.cap.FALS 13 0 - - Ay, while
## 5 m.cap.FALS 6 -0.4082482 - strike I strike q
## 6 m.cap.FALS 8 0.35355339 - strike But thou a
## 7 m.cap.FALS 9 0 - - A dog of t
## 8 m.cap.FALS 12 0.28867513 valiant - To move is
## 9 m.cap.FALS 10 0 - - therefore,
## 10 m.cap.FALS 10 0 - - A dog of t
## 11 m.cap.FALS 12 0 - - I will tak
## 12 m.cap.FALS 13 -0.5547001 - c("weak", That shows
## 13 m.cap.FALS 16 -0.25 - weaker True; and
## 14 m.cap.FALS 17 0 - - therefore
## 15 m.cap.FALS 10 0 masters quarrel The quarre
## 16 m.cap.FALS 10 -0.3162277 - tyrant 'Tis all o
## 17 m.cap.FALS 21 -0.2182178 - cruel when I hav
## 18 m.cap.FALS 5 0 - - The heads
## 19 m.cap.FALS 18 -0.2357022 - wilt Ay, the he
## 20 m.cap.FALS 9 0 - - They must
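Several of the sentence-level scores above can be reproduced with a reduced form of the polarity equation. The sketch below assumes unit dictionary weights and no amplifiers or de-amplifiers in the context cluster, so each polarized word contributes \(w(-1)^{\sum x_i^{N}}\). The helper polarity_sketch is hypothetical, and the single negator assumed for row 6 is inferred from the sign of its score rather than read from the truncated text.

```r
# Reduced polarity score: weights of the polarized words, the number of
# negators in each word's context cluster, and the sentence word count
polarity_sketch <- function(weights, n_negators, n_words) {
  sum(weights * (-1)^n_negators) / sqrt(n_words)
}

row5  <- polarity_sketch(-1, 0, 6)   # one negative word, 6 words
row6  <- polarity_sketch(-1, 1, 8)   # one negated negative word, 8 words
row8  <- polarity_sketch( 1, 0, 12)  # one positive word, 12 words
row13 <- polarity_sketch(-1, 0, 16)  # one negative word, 16 words
c(row5, row6, row8, row13)
```

These come out at roughly -0.408, 0.354, 0.289 and -0.25, matching rows 5, 6, 8 and 13. The division by \(\sqrt{n}\) means the same polarized word counts for less in a longer sentence.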
♦ polarity
Plot Method♦
plot(poldat)
♦ polarity
Plot Group Polarity as Heat Map♦
qheat(scores(poldat), high="blue", low="yellow", grid=NULL, order.b="ave.polarity")
♦ sentiment_frame
- Specify Your Own Polarity Environment♦
(POLENV <- sentiment_frame(positive.words, negative.words))
x y
1: a plus 1
2: abnormal -1
3: abolish -1
4: abominable -1
5: abominably -1
---
6775: zealously -1
6776: zenith 1
6777: zest 1
6778: zippy 1
6779: zombie -1
♦ Polarity Over Time ♦
poldat4 <- with(rajSPLIT, polarity(dialogue, act, constrain = TRUE))
polcount <- na.omit(counts(poldat4)$polarity)
len <- length(polcount)
cummean <- function(x){cumsum(x)/seq_along(x)}
cumpolarity <- data.frame(cum_mean = cummean(polcount), Time=1:len)
## Calculate background rectangles
ends <- cumsum(rle(counts(poldat4)$act)$lengths)
starts <- c(1, head(ends + 1, -1))
rects <- data.frame(xstart = starts, xend = ends + 1,
Act = c("I", "II", "III", "IV", "V"))
library(ggplot2)
ggplot() + theme_bw() +
geom_rect(data = rects, aes(xmin = xstart, xmax = xend,
ymin = -Inf, ymax = Inf, fill = Act), alpha = 0.17) +
geom_smooth(data = cumpolarity, aes(y=cum_mean, x = Time)) +
geom_hline(yintercept=mean(polcount), color="grey30", size=1, alpha=.3, linetype=2) +
annotate("text", x = mean(ends[1:2]), y = mean(polcount), color="grey30",
label = "Average Polarity", vjust = .3, size=3) +
geom_line(data = cumpolarity, aes(y=cum_mean, x = Time), size=1) +
ylab("Cumulative Average Polarity") + xlab("Duration") +
scale_x_continuous(expand = c(0,0)) +
geom_text(data=rects, aes(x=(xstart + xend)/2, y=-.04,
label=paste("Act", Act)), size=3) +
guides(fill=FALSE) +
scale_fill_brewer(palette="Set1")
It is helpful to find words associated (or negatively associated) with, or correlations between, selected words when trying to understand language selection. The word_cor
function calculates correlations (based on the wfm
function) for words nested within grouping variables (turn of talk is an obvious choice for a grouping variable). Running bootstrapping with a random sample can help the researcher determine if a co-occurrence of words is by chance. word_cor
is even more flexible in that it can actually take a frequency matrix (e.g., the wfm_combine
function or the cm_ family of functions).
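The underlying idea can be sketched in base R: count each word within each grouping unit, then correlate the count columns. The four toy turns of talk are made up, and the use of Pearson correlation on raw counts is an assumption about the general approach rather than a description of word_cor's internals.

```r
# Hypothetical turns of talk
turns <- c("i love you", "i hate you", "love love", "hate")

# Word counts per turn, aligned to a common vocabulary
tokens <- strsplit(tolower(turns), "\\s+")
vocab  <- sort(unique(unlist(tokens)))
counts <- t(sapply(tokens, function(tk) {
  as.numeric(table(factor(tk, levels = vocab)))
}))
colnames(counts) <- vocab

# Word-by-word correlations across turns
word_r <- cor(counts)
round(word_r, 3)
```

Here "i" and "you" always occur together, so their correlation is 1, while "love" and "hate" never share a turn and are negatively correlated.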
♦ word_cor
- Single Words♦
library(reports)
x <- factor(with(rajSPLIT, paste(act, pad(TOT(tot)), sep = "|")))
word_cor(rajSPLIT$dialogue, x, "romeo", .45)
$romeo
that tybalt
0.4540979 0.4831937
word_cor(rajSPLIT$dialogue, x, "love", .5)
$love
likewise
0.5013104
♦ word_cor
- Negative Correlation♦
word_cor(rajSPLIT$dialogue, x, "you", -.1)
with(rajSPLIT, word_cor(dialogue, list(person, act), "hate"))
$hate
eyesight knight prison smooth vex'd
0.7318131 0.7318131 0.7318131 0.7318131 0.7318131
♦ word_cor
- Multiple Words♦
words <- c("hate", "i", "love", "ghost")
with(rajSPLIT, word_cor(dialogue, x, words, r = .5))
$hate
beasts beseeming bills bred
0.6251743 0.6251743 0.6251743 0.6251743
canker'd capulethold clubs coward
0.6251743 0.6251743 0.6251743 0.6251743
crutch disturb'd flourishes fountains
0.6251743 0.6251743 0.6251743 0.6251743
issuing mistemper'd neighbourstained partisans
0.6251743 0.6251743 0.6251743 0.6251743
pernicious profaners purple rebellious
0.6251743 0.6251743 0.6251743 0.6251743
streets subjects sword thrice
0.5027573 0.6251743 0.6164718 0.6251743
throw wield
0.6251743 0.6251743
$i
and have me my thee to
0.5150992 0.5573359 0.5329341 0.5134372 0.5101593 0.5533506
$love
likewise
0.5013104
$ghost
bone brains club dash drink keys kinsman's methinks
0.7056134 0.7056134 1.0000000 1.0000000 0.5749090 1.0000000 1.0000000 0.5749090
rage rapier's seeking spices spit
0.5749090 1.0000000 1.0000000 1.0000000 1.0000000
♦ word_cor
- Correlations Between Terms: Example 1♦
## Set r = NULL to get matrix between words
with(rajSPLIT, word_cor(dialogue, x, words, r = NULL))
hate i love ghost
hate 1.00000000 0.05142236 0.15871966 -0.01159382
i 0.05142236 1.00000000 0.36986172 0.01489943
love 0.15871966 0.36986172 1.00000000 -0.02847837
ghost -0.01159382 0.01489943 -0.02847837 1.00000000
♦ word_cor
- Correlations Between Terms: Example 2♦
dat <- pres_debates2012
dat$TOT <- factor(with(dat, paste(time, pad(TOT(tot)), sep = "|")))
dat <- dat[dat$person %in% qcv(OBAMA, ROMNEY), ]
dat$person <- factor(dat$person)
dat.split <- with(dat, split(dat, list(person, time)))
wrds <- qcv(america, debt, dollar, people, tax, health)
lapply(dat.split, function(x) {
word_cor(x[, "dialogue"], x[, "TOT"], wrds, r=NULL)
})
$`OBAMA.time 1`
america dollar people tax health
america 1.000000000 -0.005979775 0.6117618 -0.005979775 0.13803797
dollar -0.005979775 1.000000000 0.1650493 -0.004219409 -0.01092353
people 0.611761819 0.165049280 1.0000000 0.165049280 0.50398555
tax -0.005979775 -0.004219409 0.1650493 1.000000000 0.20572642
health 0.138037974 -0.010923527 0.5039855 0.205726420 1.00000000
$`ROMNEY.time 1`
america dollar people tax health
america 1.00000000 0.07493271 0.2336551 0.07033784 0.14986684
dollar 0.07493271 1.00000000 0.5859944 0.11109650 0.33821359
people 0.23365513 0.58599441 1.0000000 0.20584588 0.61333714
tax 0.07033784 0.11109650 0.2058459 1.00000000 -0.01723713
health 0.14986684 0.33821359 0.6133371 -0.01723713 1.00000000
$`OBAMA.time 2`
america dollar people tax health
america 1.00000000 -0.01526328 0.41353310 0.07609871 0.25733977
dollar -0.01526328 1.00000000 0.11671525 0.51222872 -0.01220067
people 0.41353310 0.11671525 1.00000000 0.03761852 0.11285926
tax 0.07609871 0.51222872 0.03761852 1.00000000 0.03431397
health 0.25733977 -0.01220067 0.11285926 0.03431397 1.00000000
$`ROMNEY.time 2`
america debt dollar people tax health
america 1.00000000 -0.018370290 0.07531545 0.59403781 0.291238391 0.063845090
debt -0.01837029 1.000000000 0.53340505 0.02329285 -0.009432552 -0.008308652
dollar 0.07531545 0.533405053 1.00000000 0.33346752 0.600125943 0.682990261
people 0.59403781 0.023292854 0.33346752 1.00000000 0.516577197 0.255365102
tax 0.29123839 -0.009432552 0.60012594 0.51657720 1.000000000 0.658231340
health 0.06384509 -0.008308652 0.68299026 0.25536510 0.658231340 1.000000000
$`OBAMA.time 3`
america debt dollar people tax
america 1.00000000 -0.01224452 -0.02326653 0.1182189 -0.02326653
debt -0.01224452 1.00000000 0.37361771 0.1765301 0.75525297
dollar -0.02326653 0.37361771 1.00000000 0.1909401 0.70993297
people 0.11821887 0.17653013 0.19094008 1.0000000 0.19094008
tax -0.02326653 0.75525297 0.70993297 0.1909401 1.00000000
$`ROMNEY.time 3`
america debt dollar people
america 1.0000000 0.2130341 0.2675978 0.3007027
debt 0.2130341 1.0000000 0.8191341 0.4275521
dollar 0.2675978 0.8191341 1.0000000 0.4666635
people 0.3007027 0.4275521 0.4666635 1.0000000
♦ word_cor
- Matrix from wfm_combine
♦
worlis <- list(
pronouns = c("you", "it", "it's", "we", "i'm", "i"),
negative = qcv(no, dumb, distrust, not, stinks),
literacy = qcv(computer, talking, telling)
)
y <- wfdf(DATA$state, id(DATA, prefix = TRUE))
z <- wfm_combine(y, worlis)
word_cor(t(z), word = c(names(worlis), "else.words"), r = NULL)
pronouns negative literacy else.words
pronouns 1.0000000 0.2488822 -0.4407045 -0.5914760
negative 0.2488822 1.0000000 -0.2105380 -0.7146856
literacy -0.4407045 -0.2105380 1.0000000 0.2318694
else.words -0.5914760 -0.7146856 0.2318694 1.0000000
qdap offers a number of plot methods for various outputs from functions (use plot(qdap_FUNCTION_OUTPUT)). In addition to the numerous plot methods qdap also has several functions dedicated solely to plotting purposes. Many of these functions rely on the ggplot2 package (Wickham, 2009) to produce plots.
The lexical dispersion plot is a useful tool (particularly in the early stage of analysis) for looking at the dispersion of a word throughout the dialogue. dispersion_plot
provides the means to look at and compare multiple word dispersions across repeated measures and/or grouping variables. This can be useful in conjunction with a correlation analysis.
The search mechanism used by dispersion_plot
is identical to termco
and term_match
. For example, “ love ” will not yield the same search as “love”. The search example below demonstrates how the search works. For more information see the termco search description above.
♦ dispersion_plot
- Understand the Search♦
term_match(raj$dialogue, c(" love ", "love", " night ", "night"))
$` love `
[1] "love"
$love
[1] "love" "love's" "lovers" "loved" "lovely"
[6] "lovest" "lover" "beloved" "loves" "newbeloved"
[11] "glove" "lovesong" "lovedevouring" "loveperforming" "dearloved"
$` night `
[1] "night"
$night
[1] "night" "fortnight" "nights" "tonight" "night's" "knight"
[7] "nightingale" "nightly" "yesternight"
♦ dispersion_plot
- Example 1♦
with(rajSPLIT , dispersion_plot(dialogue, c("love", "night"),
grouping.var = list(fam.aff, sex), rm.vars = act))
♦ dispersion_plot
- Example 2: Color Schemes♦
with(rajSPLIT, dispersion_plot(dialogue, c("love", "night"),
bg.color = "black", grouping.var = list(fam.aff, sex),
color = "yellow", total.color = "white", horiz.color="grey20"))
Using dispersion_plot
with freq_terms
's [["rfswl"]][["all"]] can be a useful means of viewing the dispersion of high frequency words after stopword removal.
♦ dispersion_plot
- Example 3: Using with freq_terms
♦
wrds <- freq_terms(pres_debates2012$dialogue, stopwords = Top200Words)
## Add leading/trailing spaces if desired
wrds2 <- spaste(wrds)
## Use `~~` to maintain spaces
wrds2 <- c(" governor~~romney ", wrds2[-c(3, 12)])
## Plot
with(pres_debates2012 , dispersion_plot(dialogue, wrds2, rm.vars = time,
color="black", bg.color="white"))
Word clouds can be a useful tool to help find words/phrases that are used frequently. They allow the entire dialogue to be represented in pictorial form. The word cloud becomes more useful in discovering themes when color can be used in a meaningful way (i.e., the information contained in the word size and word color are not redundant). qdap has two word cloud functions (both are wrappers for wordcloud from the wordcloud package). The trans_cloud
function produces word clouds with optional theme coloring by grouping variable. The gradient_cloud
function produces a gradient word cloud colored by a binary grouping variable.
trans_cloud
is passed a list of named vectors to target.words in much the same way as match.list in termco
.
Format for Named Vectors
list(
theme_1 = c("word1", "word2", "word3"),
theme_2 = c("word4", "word5"),
theme_3 = c("word6", "word7", "word8")
)
The cloud.colors argument takes a single color or a vector of colors 1 greater than the number of vectors of target.words. The order of cloud.colors corresponds to the order of target.words with the extra, final color being utilized for all words not matched to target.words.
♦ trans_cloud
Example 1♦
## Generate themes/terms to color by
terms <- list(
I=c("i", "i'm"),
mal=qcv(stinks, dumb, distrust),
articles=qcv(the, a, an),
pronoun=qcv(we, you)
)
with(DATA, trans_cloud(state, person, target.words=terms,
cloud.colors=qcv(red, green, blue, black, gray65),
expand.target=FALSE, proportional=TRUE, legend=c(names(terms),
"other")))
♦ trans_cloud
Example 2 - Polarity♦
## Rearrange the data
DATA2 <- qdap::DATA
DATA2[1, 4] <- "This is not good!"
DATA2[8, 4] <- "I don't distrust you."
DATA2$state <- space_fill(DATA2$state, paste0(negation.words, " "),
rm.extra = FALSE)
txt <- gsub("~~", " ", breaker(DATA2$state))
rev.neg <- sapply(negation.words, paste, negative.words)
rev.pos <- sapply(negation.words, paste, positive.words)
## Generate themes/terms to color by
tw <- list(
positive=c(positive.words, rev.neg[rev.neg %in% txt]),
negative=c(negative.words, rev.pos[rev.pos %in% txt])
)
with(DATA2, trans_cloud(state, person,
target.words=tw,
cloud.colors=qcv(darkgreen, red, gray65),
expand.target=FALSE, proportional=TRUE, legend=names(tw)))
♦ gradient_cloud
Examples♦
## Fuse two words
DATA2 <- DATA
DATA2$state <- space_fill(DATA$state, c("is fun", "too fun", "you liar"))
gradient_cloud(DATA2$state, DATA2$sex, title="Lying Fun", max.word.size = 5,
min.word.size = .025)
gradient_cloud(DATA2$state, DATA2$sex, title="Houghton Colors",
max.word.size = 8, min.word.size = .01, X ="purple" , Y = "yellow")
Many of the plot methods utilized by other functions' classes are wrappers for gantt_plot
or gantt_wrap
. gantt_plot
wraps the gantt
, gantt_rep
and gantt_wrap
functions to allow for direct input of text dialogue and grouping variables. The gantt_plot
function is a fast way to make Gantt charts that can be faceted and filled by grouping variables. A Gantt plot allows the user to find trends and patterns in dialogue across time. It essentially allows for a visual representation of an entire exchange of dialogue. The following examples show the flexibility of gantt_plot
; many of these techniques can also be utilized in plot methods for qdap classes that utilize gantt_plot
and gantt_wrap
. It is also prudent to be aware of gantt_wrap
, including its arguments and how to use them, as it is less convenient yet more flexible and powerful than gantt_plot
.
♦ gantt_plot
- Single Time/Single Grouping Variable♦
with(rajSPLIT, gantt_plot(text.var = dialogue,
grouping.var = person, size=4))
♦ gantt_plot
- Single Time/Multiple Grouping Variable♦
with(rajSPLIT, gantt_plot(text.var = dialogue,
grouping.var = list(fam.aff, sex), rm.var = act,
title = "Romeo and Juliet's dialogue"))
Sometimes the location of the facets may not be ideal to show the data (i.e., you may want to reverse the x and y axes). By setting transform = TRUE the user can make this switch.
♦ gantt_plot
- Transforming♦
with(rajSPLIT, gantt_plot(dialogue, list(fam.aff, sex), act,
transform=TRUE))
Often the default colors are less useful in displaying the trends in a way that is most meaningful. Because gantt_plot
is a wrapper for ggplot2 the color palettes can easily be extended to use with the output from gantt_plot
.
♦ gantt_plot
- Color Palette Examples♦
## Load needed packages
library(ggplot2); library(scales); library(RColorBrewer); library(grid)
## Duplicate a new data set and make alterations
rajSPLIT2 <- rajSPLIT
rajSPLIT2$newb <- as.factor(sample(LETTERS[1:2], nrow(rajSPLIT2),
replace=TRUE))
z <- with(rajSPLIT2, gantt_plot(dialogue, list(fam.aff, sex),
list(act, newb), size = 4))
z + theme(panel.margin = unit(1, "lines")) + scale_colour_grey()
z + scale_colour_brewer(palette="Dark2")
z + scale_colour_manual(values=rep("black", 7))
## vector of colors
cols <- c("black", "red", "blue", "yellow", "orange", "purple", "grey40")
z + scale_colour_manual(values=cols)
At times it may be useful to fill the bar colors by another grouping variable. The fill.var argument allows another coloring variable to be utilized.
♦ gantt_plot
- Fill Variable Example 1♦
## Generate an end mark variable set to fill by
dat <- rajSPLIT[rajSPLIT$act == 1, ]
dat$end_mark <- factor(end_mark(dat$dialogue))
with(dat, gantt_plot(text.var = dialogue, grouping.var = list(person, sex),
fill.var=end_mark))
♦ gantt_plot
- Fill Variable Example 2♦
## Generate an end mark variable data set to fill by
rajSPLIT2 <- rajSPLIT
rajSPLIT2$end_mark <- end_mark(rajSPLIT2$dialogue)
with(rajSPLIT2, gantt_plot(text.var = dialogue,
grouping.var = list(fam.aff), rm.var = list(act),
fill.var=end_mark, title = "Romeo and Juliet's dialogue"))
Be wary though of using coloring to show what faceting would show better. Here is an example of faceting versus the color fill used in the Fill Variable Example 1 above.
♦ gantt_plot
- Facet Instead of Fill Variable♦
## Repeated Measures Sentence Type Example
with(rajSPLIT2, gantt_plot(text.var = dialogue,
grouping.var = list(fam.aff, sex), rm.var = list(end_mark, act),
title = "Romeo and Juliet's dialogue"))
Heatmaps are a powerful way to visualize patterns in matrices. The gradient allows the user to quickly pick out high and low values. qheat
(quick heat map) is a heat map function that accepts matrices and dataframes and has some nice presets that work well with the way qdap data is structured. Two of these assumptions are worth noting. First, a dataframe is assumed to be numeric with the exception of a single grouping variable column, with the possibility of additional non-numeric columns passed to facet.vars. Second, qheat
assumes that matrices are all numeric with row names serving as the grouping variable. If a dataframe is passed to qheat
, the grouping variable column is assumed to be the first column.
The following examples demonstrate various uses of qheat
.
♦ qheat
- Basic Example♦
## word stats data set
ws.ob <- with(DATA.SPLIT, word_stats(state, list(sex, adult), tot=tot))
# same as `plot(ws.ob)`
qheat(ws.ob)
♦ qheat
- Color Group Labels Example♦
qheat(ws.ob, xaxis.col = c("red", "black", "green", "blue"))
♦ qheat
- Order By Numeric Variable Examples♦
## Order by sptot
qheat(ws.ob, order.by = "sptot")
## Reverse order by sptot
qheat(ws.ob, order.by = "-sptot")
♦ qheat
- Cell Labels Examples♦
qheat(ws.ob, values = TRUE)
qheat(ws.ob, values = TRUE, text.color = "red")
♦ qheat
- Custom Cell Labels Example♦
## Create a data set and matching labels
dat1 <- data.frame(G=LETTERS[1:5], matrix(rnorm(20), ncol = 4))
dat2 <- data.frame(matrix(LETTERS[1:25], ncol=5))
qheat(dat1, high = "orange", values=TRUE, text.color = "black")
qheat(dat1, high = "orange", values=TRUE, text.color = "black", mat2=dat2)
♦ qheat
- Grid Examples♦
qheat(ws.ob, "yellow", "red", grid = FALSE)
qheat(ws.ob, high = "red", grid = "black")
♦ qheat
- Facet Examples♦
qheat(mtcars, facet.vars = "cyl")
qheat(mtcars, facet.vars = c("gear", "cyl"))
♦ qheat
- Transposing Examples♦
qheat(t(mtcars), by.column=FALSE)
qheat(mtcars, plot = FALSE) + coord_flip()
When plotting a correlation/distance matrix, set diag.na = TRUE to keep the extreme values on the diagonal from affecting the scaling.
♦ qheat
- Correlation Matrix Examples♦
qheat(cor(mtcars), diag.na=TRUE, by.column = NULL)
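To see why the diagonal matters, here is a minimal base R sketch that does not use qdap at all. It blanks the diagonal of a correlation matrix by hand, which is presumably similar in spirit to what diag.na = TRUE arranges; the object names (m, m_na) are illustrative only.

```r
## Correlation matrix of a built-in data set
m <- cor(mtcars)

## Every variable correlates perfectly with itself, so the diagonal
## pins the top of the range at 1
range(m)

## Set the diagonal to NA so it no longer stretches the color scale
m_na <- m
diag(m_na) <- NA
range(m_na, na.rm = TRUE)
```

With the diagonal removed, the range reflects only the correlations between different variables, so more of the gradient is spent on the values of interest.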
Rank frequency plots are a way of visualizing word rank versus frequency as related to Zipf's law, which states that the frequency of a word is inversely related to its rank. The rank_freq_mplot and rank_freq_plot functions provide the means to plot the ranks and frequencies of words (with rank_freq_mplot plotting by grouping variable(s)).
rank_freq_mplot utilizes the ggplot2 package, whereas rank_freq_plot employs base graphics. rank_freq_mplot is more general, flexible, and takes text/grouping variables directly; in most cases rank_freq_mplot should be preferred (though rank_freq_plot will render quicker). The rank_freq_mplot family of functions also outputs a list of the rank/frequency dataframes used to plot the visuals, along with other related descriptive statistics.
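The statistics behind a rank frequency plot can be sketched in a few lines of base R. The toy text and object names below are made up for illustration, and this is only an approximation of the kind of table these functions return, not their actual implementation.

```r
## Toy text; any character vector of sentences would do
txt <- c("the cat sat on the mat", "the dog sat", "a cat ran")

## Lowercase, split on whitespace, and count each word
words <- unlist(strsplit(tolower(txt), "\\s+"))
freq  <- table(words)

## Words that share a frequency share a rank; the highest frequency is rank 1
counts <- table(as.vector(freq))            # number of words at each frequency
rank_stats <- data.frame(
    n.words = as.vector(rev(counts)),
    freq    = as.numeric(rev(names(counts)))
)
rank_stats$rank <- seq_len(nrow(rank_stats))
rank_stats

## Percent of distinct words used exactly once (hapax legomena)
## and exactly twice (dis legomena)
hapax <- 100 * mean(freq == 1)
dis   <- 100 * mean(freq == 2)
c(hapax_legomena = hapax, dis_legomena = dis)
```

Plotting freq against rank on log-log axes (for instance with plot(rank_stats$rank, rank_stats$freq, log = "xy")) gives the roughly straight line that Zipf's law predicts for large texts.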
## Plot log-log version
x2 <- rank_freq_mplot(mraja1spl$dialogue, mraja1spl$person, ncol = 5,
hap.col = "purple")
## View output
ltruncdf(x2, 10)
## $WORD_COUNTS
## group word freq
## 1 Abraham sir 4
## 2 Abraham you 3
## 3 Abraham at 2
## 4 Abraham bite 2
## 5 Abraham do 2
## 6 Abraham no 2
## 7 Abraham thumb 2
## 8 Abraham us 2
## 9 Abraham your 2
## 10 Abraham better 1
##
## $RANK_AND_FREQUENCY_STATS
## group n.words freq rank
## 1 Abraham 1 4 1
## 2 Abraham 1 3 2
## 3 Abraham 7 2 3
## 4 Abraham 3 1 4
## 5 Benvolio 1 23 1
## 6 Benvolio 1 18 2
## 7 Benvolio 1 14 3
## 8 Benvolio 1 13 4
## 9 Benvolio 1 12 5
## 10 Benvolio 2 10 6
##
## $LEGOMENA_STATS
## person hapax_lego dis_legome
## 1 Abraham 25 58.33
## 2 Benvolio 73.68 12.57
## 3 Capulet 67.84 13.45
## 4 First Citi 75 16.67
## 5 First Serv 71.74 17.39
## 6 Gregory 72.73 17.17
## 7 Juliet 71.21 16.67
## 8 Lady Capul 74.59 15.47
## 9 Lady Monta 92.31 7.69
## 10 Mercutio 79.56 8.81
## Plot standard rank-freq version
invisible(rank_freq_mplot(mraja1spl$dialogue, mraja1spl$person, ncol = 5,
log.freq = FALSE, log.rank = FALSE, jitter = .6, hap.col = "purple"))
♦ rank_freq_mplot
- Using alpha♦
invisible(rank_freq_mplot(raj$dialogue, jitter = .5, shape = 19, alpha = 1/15))
The rank_freq_plot
plots more quickly but does not handle multiple groups and does not take text/grouping variables directly.
♦ rank_freq_plot
Example ♦
## Generate data for `rank_freq_plot` via `word_list` function
mod <- with(mraja1spl , word_list(dialogue, person, cut.n = 10,
cap.list=unique(mraja1spl$person)))
## Plot it
x3 <- rank_freq_plot(mod$fwl$Romeo$WORD, mod$fwl$Romeo$FREQ,
title.ext = 'Romeo')
## View output
ltruncdf(x3, 10)
## $WORD_COUNTS
## words freq
## 1 I 30
## 2 a 27
## 3 the 27
## 4 and 25
## 5 is 24
## 6 of 24
## 7 my 22
## 8 that 22
## 9 to 22
## 10 in 21
##
## $RANK_AND_FREQUENCY_STATS
## n.words freq rank per.of.tot
## 1 1 30 1 2.591
## 2 2 27 2 2.332
## 3 1 25 3 2.159
## 4 2 24 4 2.073
## 5 3 22 5 1.9
## 6 1 21 6 1.813
## 7 1 17 7 1.468
## 8 1 15 8 1.295
## 9 1 14 9 1.209
## 10 1 13 10 1.123
##
## $LEGOMENA_STATS
## dataframe
## 1 69.143
## 2 16
It is often useful to view the lengths of turns of talk as a bar plot, particularly if the bars are colored by grouping variable. The tot_plot
function plots dialogue as a bar plot with the option to color by grouping variables and facet by repeated measure variables. This can enable the entire dialogue to be viewed in a succinct way, possibly leading the researcher to see patterns that may have otherwise escaped attention.
Within the tot_plot
function the turn of talk argument (tot) may be the “tot” column from sentSplit
output (tot = TRUE), the row numbers (tot = FALSE), the character name of a column (tot = “COLUMN NAME”), or a separate numeric/character vector equal in length to the text.var.
♦ tot_plot
- Examples ♦
dataframe <- sentSplit(DATA, "state")
tot_plot(dataframe, text.var = "state")
## Change space between bars
tot_plot(dataframe, text.var = "state", bar.space = .03)
## Color bars by grouping variable(s)
tot_plot(dataframe, text.var = "state", grouping.var = "sex")
## Use rownames as tot: color by family affiliation
tot_plot(mraja1, "dialogue", grouping.var = "fam.aff", tot = FALSE)
## Use rownames as tot: color by death
tot_plot(mraja1, "dialogue", grouping.var = "died", tot = FALSE)
♦ tot_plot
- Facet Variables ♦
rajSPLIT2 <- do.call(rbind, lapply(split(rajSPLIT, rajSPLIT$act), head, 25))
tot_plot(rajSPLIT2, "dialogue", grouping.var = "fam.aff", facet.var = "act")
Because tot_plot is based on the ggplot2 package (Wickham, 2009) and invisibly returns the ggplot2 object, the output (of class “ggplot”) can be altered in the same way as any other ggplot2 object. In the following examples the color palette is altered.
♦ tot_plot
- Alter Colors ♦
base <- tot_plot(mraja1, "dialogue", grouping.var = c("sex", "fam.aff"),
tot=FALSE, plot=FALSE)
base + scale_fill_hue(l=40)
base + scale_fill_brewer(palette="Spectral")
base + scale_fill_brewer(palette="Set1")
♦ tot_plot
- Add Mean +2/+3 sd ♦
base +
scale_fill_brewer(palette="Set1") +
geom_hline(aes(yintercept=mean(word.count))) +
geom_hline(aes(yintercept=mean(word.count) + (2 *sd(word.count)))) +
geom_hline(aes(yintercept=mean(word.count) + (3 *sd(word.count)))) +
geom_text(parse=TRUE, hjust=0, vjust=0, size = 3, aes(x = 2,
y = mean(word.count) + 2, label = "bar(x)")) +
geom_text(hjust=0, vjust=0, size = 3, aes(x = 1,
y = mean(word.count) + (2 *sd(word.count)) + 2, label = "+2 sd")) +
geom_text(hjust=0, vjust=0, size = 3, aes(x = 1,
y = mean(word.count) + (3 *sd(word.count)) + 2, label = "+3 sd"))
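The reference lines in the example above are simply the mean word count and the mean plus two and three standard deviations. A small base R sketch, using made-up word counts rather than the Romeo and Juliet data, shows the same arithmetic and which turns of talk would stand out.

```r
## Made-up word counts for 12 turns of talk; the last turn is unusually long
word.count <- c(3, 1, 4, 2, 3, 2, 5, 3, 2, 4, 3, 40)

## Reference lines: the mean, and 2 and 3 standard deviations above it
avg     <- mean(word.count)
cutoffs <- avg + c("+2 sd" = 2, "+3 sd" = 3) * sd(word.count)

## Base graphics stand-in for the ggplot2 layers: bars plus horizontal lines
barplot(word.count, names.arg = seq_along(word.count),
    xlab = "turn of talk", ylab = "word count")
abline(h = c(avg, cutoffs), lty = c(1, 2, 3))

## Turns of talk that exceed each cutoff
long_turns <- lapply(cutoffs, function(k) which(word.count > k))
long_turns
```

With few turns of talk no single value can sit far above the mean in standard deviation units, so these cutoffs are most informative on longer dialogues.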
The Venn diagram can be a useful way to visualize similarity between grouping variables with respect to word use when the number of groups is relatively small. The trans_venn function wraps the venneuler package to produce Venn diagrams. Producing this output is computationally slow, so the user must consider data size and number of groups when using trans_venn to avoid overplotting and lengthy plot production. The stopwords argument can also be useful to reduce the overlap of common words between grouping variables.
If a data set is large the user may want to consider representing the data as a Dissimilarity matrix or as an adjacency matrix that can be plotted with the igraph package, as seen in the presidential examples above.
In the following example the reader will notice that the positions of the circle centers (i.e., the person labels) are very similar to the positions of the nodes in the adjacency matrix plot of the same data above.
with(DATA , trans_venn(state, person, legend.location = "topright"))
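The overlap that a Venn diagram draws can also be summarized numerically. The sketch below is base R only and uses a tiny invented dialogue; it computes the Jaccard similarity (shared words divided by all words used by either speaker) between each pair of speakers. Jaccard similarity is my choice of measure here for illustration, not something trans_venn is documented to use.

```r
## Tiny invented dialogue
dialogue <- data.frame(
    person = c("sam", "greg", "sam", "sally"),
    state  = c("computer is fun", "it is dumb", "not too fun", "what is fun"),
    stringsAsFactors = FALSE
)

## Unique lowercase words used by each person
vocab <- lapply(split(dialogue$state, dialogue$person), function(x) {
    unique(unlist(strsplit(tolower(x), "\\s+")))
})

## Jaccard similarity: size of the intersection over size of the union
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

## Pairwise similarity matrix between speakers
people <- names(vocab)
sim <- outer(people, people, Vectorize(function(i, j) {
    jaccard(vocab[[i]], vocab[[j]])
}))
dimnames(sim) <- list(people, people)
round(sim, 2)
```

Removing very common words (such as "is" here) before computing the overlap plays the same role as the stopwords argument mentioned above.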
Viewing connections between words within grouping variables (particularly turns of talk) is a useful means of examining which words are connected together. For example, this may be useful to a researcher who is looking at particular vocabulary usage by a teacher. The researcher may wish to know what other terms are supporting/supported by/connected to the terms of interest. word_network_plot is a wrapper for the igraph package. This approach may be used in concert with correlations between words. Note that terms could also be combined via the wfm_combine function before running a correlation in order to represent the clustering of words that word_network_plot handles. The word correlations can be tested further by bootstrapping the attained correlation against a random correlation. It is worth noting that the word_associate function is a wrapper for word_network_plot (for more see this example above).
♦ word_network_plot
- Between Turns of Talk: All Words ♦
word_network_plot(text.var=DATA$state, stopwords=NULL, label.cex = .95)
♦ word_network_plot
- Between People ♦
word_network_plot(text.var=DATA$state, DATA$person)
word_network_plot(text.var=DATA$state, DATA$person, stopwords=NULL)
♦ word_network_plot
- Between sex and adult ♦
word_network_plot(text.var=DATA$state, grouping.var=list(DATA$sex,
DATA$adult))
♦ word_network_plot
- log.labels
♦
word_network_plot(text.var=DATA$state, grouping.var=DATA$person,
title.name = "TITLE", log.labels=TRUE, label.size = .9)
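The data structure underneath a word network is a word-by-word adjacency matrix. As a rough base R sketch (the toy turns and object names are invented, and this is not necessarily how word_network_plot builds its graph), two words are linked whenever they occur in the same turn of talk.

```r
## Three invented turns of talk
turns  <- c("computer is fun", "not too fun", "computer stinks")
tokens <- strsplit(tolower(turns), "\\s+")
terms  <- sort(unique(unlist(tokens)))

## Binary incidence matrix: one row per turn, one column per word
inc <- t(sapply(tokens, function(w) as.integer(terms %in% w)))
colnames(inc) <- terms

## Word-by-word co-occurrence counts; the diagonal (a word with itself)
## is zeroed because self-links carry no information
adj <- crossprod(inc)
diag(adj) <- 0
adj
```

A matrix like adj is the kind of object that graph packages such as igraph can turn into a plotted network, with cell values serving as edge weights.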
qdap provides a few general functions for categorizing sentence types. This section will outline these functions and some of their uses.
It is often helpful to determine if a sentence (a row) is incomplete, as this may affect some forms of analysis. The end_inc function provides this functionality after incomplete_replace has replaced various incomplete sentence notations with the standard qdap notation (|).
♦ end_inc
Examples ♦
## Alter the DATA.SPLIT data set to have incomplete sentences.
dat <- DATA.SPLIT[, c("person", "state")]
dat[c(1:2, 7), "state"] <- c("the...", "I.?", "threw..")
dat[, "state"] <- incomplete_replace(dat$state)
dat
## person state
## 1 sam the|
## 2 sam I|
## 3 greg No it's not, it's dumb.
## 4 teacher What should we do?
## 5 sam You liar, it stinks!
## 6 greg I am telling the truth!
## 7 sally threw|
## 8 greg There is no way.
## 9 sam I distrust you.
## 10 sally What are you talking about?
## 11 researcher Shall we move on?
## 12 researcher Good then.
## 13 greg I'm hungry.
## 14 greg Let's eat.
## 15 greg You already?
## Remove incomplete sentences and warn.
end_inc(dat, "state")
## person state
## 3 greg No it's not, it's dumb.
## 4 teacher What should we do?
## 5 sam You liar, it stinks!
## 6 greg I am telling the truth!
## 8 greg There is no way.
## 9 sam I distrust you.
## 10 sally What are you talking about?
## 11 researcher Shall we move on?
## 12 researcher Good then.
## 13 greg I'm hungry.
## 14 greg Let's eat.
## 15 greg You already?
## Remove incomplete sentences and no warning.
end_inc(dat, "state", warning.report = FALSE)
## person state
## 3 greg No it's not, it's dumb.
## 4 teacher What should we do?
## 5 sam You liar, it stinks!
## 6 greg I am telling the truth!
## 8 greg There is no way.
## 9 sam I distrust you.
## 10 sally What are you talking about?
## 11 researcher Shall we move on?
## 12 researcher Good then.
## 13 greg I'm hungry.
## 14 greg Let's eat.
## 15 greg You already?
## List of logical checks for which are not/are incomplete
end_inc(dat, "state", which.mode = TRUE)
## $NOT
## [1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE
##
## $INC
## [1] TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE
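The two steps above (normalize the incomplete-sentence notation, then filter on it) can be imitated in base R with regular expressions. The pattern below is a deliberately loose assumption of mine, a period followed by one or more further end marks; qdap's own rules for incomplete_replace may well differ.

```r
## Sentences with assorted incomplete-sentence notation
state <- c("the...", "I.?", "No it's not, it's dumb.",
    "What should we do?", "threw..")

## Step 1: replace a trailing period-plus-more-end-marks run with "|"
## (hypothetical pattern, see the note above)
marked <- sub("\\.[.?!]+\\s*$", "|", state)
marked

## Step 2: flag rows ending in "|" and drop them
inc <- grepl("\\|\\s*$", marked)
marked[!inc]

## Logical checks, in the spirit of which.mode = TRUE
list(NOT = !inc, INC = inc)
```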
It is often useful to determine what sentence type (end mark) a sentence is. The end_mark function extracts the end marks from a sentence. The output can also be used to logically grab sentence types.
♦ end_mark
Example ♦
end_mark(DATA.SPLIT$state)
## [1] "." "." "." "?" "!" "!" "?" "." "." "?" "?" "." "." "." "?"
♦ end_mark
- Grab Sentence Types♦
## Grab questions
ques <- mraja1spl[end_mark(mraja1spl$dialogue) == "?", ]
htruncdf(ques)
## person tot sex fam.aff died dialogue stem.text
## 1 Gregory 14.1 m cap FALSE The heads The head o
## 2 Gregory 20.2 m cap FALSE turn thy b Turn thi b
## 3 Abraham 26.1 m mont FALSE Do you bit Do you bit
## 4 Abraham 28.1 m mont FALSE Do you bit Do you bit
## 5 Sampson 29.1 m cap FALSE [Aside to Is the law
## 6 Gregory 32.1 m cap FALSE Do you qua Do you qua
## 7 Tybalt 42.1 m cap TRUE What, art What art t
## 8 Capulet 46.1 f cap FALSE What noise What nois
## 9 Lady Capul 47.2 f cap FALSE why call y Whi call y
## 10 Prince 51.1 m escal FALSE Rebellious Rebelli su
## Grab non questions
non.ques <- mraja1spl[end_mark(mraja1spl$dialogue) != "?", ]
htruncdf(non.ques, 12)
## person tot sex fam.aff died dialogue stem.text
## 1 Sampson 1.1 m cap FALSE Gregory, o Gregori o
## 2 Gregory 2.1 m cap FALSE No, for th No for the
## 3 Sampson 3.1 m cap FALSE I mean, an I mean an
## 4 Gregory 4.1 m cap FALSE Ay, while Ay while y
## 5 Sampson 5.1 m cap FALSE I strike q I strike q
## 6 Gregory 6.1 m cap FALSE But thou a But thou a
## 7 Sampson 7.1 m cap FALSE A dog of t A dog of t
## 8 Gregory 8.1 m cap FALSE To move is To move is
## 9 Gregory 8.2 m cap FALSE therefore, Therefor i
## 10 Sampson 9.1 m cap FALSE A dog of t A dog of t
## 11 Sampson 9.2 m cap FALSE I will tak I will tak
## 12 Gregory 10.1 m cap FALSE That shows That show
## Grab ? and . ending sentences
ques.per <- mraja1spl[end_mark(mraja1spl$dialogue) %in% c(".", "?"), ]
htruncdf(ques.per, 12)
## person tot sex fam.aff died dialogue stem.text
## 1 Sampson 1.1 m cap FALSE Gregory, o Gregori o
## 2 Gregory 2.1 m cap FALSE No, for th No for the
## 3 Sampson 3.1 m cap FALSE I mean, an I mean an
## 4 Gregory 4.1 m cap FALSE Ay, while Ay while y
## 5 Sampson 5.1 m cap FALSE I strike q I strike q
## 6 Gregory 6.1 m cap FALSE But thou a But thou a
## 7 Sampson 7.1 m cap FALSE A dog of t A dog of t
## 8 Gregory 8.1 m cap FALSE To move is To move is
## 9 Gregory 8.2 m cap FALSE therefore, Therefor i
## 10 Sampson 9.1 m cap FALSE A dog of t A dog of t
## 11 Sampson 9.2 m cap FALSE I will tak I will tak
## 12 Gregory 10.1 m cap FALSE That shows That show
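The subsetting idiom used above does not depend on anything special in qdap: any function that returns one end mark per row can drive it. A minimal base R stand-in, which simply takes the last non-space character of each sentence (a simplification, since end_mark itself may handle more cases), looks like this.

```r
## A few sentences with different end marks
state <- c("Computer is fun.", "What should we do?",
    "You liar, it stinks!", "go get it|", "Shall we move on? ")

## Last character after trimming trailing whitespace
trimmed <- trimws(state)
marks   <- substring(trimmed, nchar(trimmed))
marks

## Logical subsetting by sentence type
questions     <- state[marks == "?"]
non_questions <- state[marks != "?"]
per_or_ques   <- state[marks %in% c(".", "?")]
questions
```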
The id function is a shortcut approach to providing row or element IDs on the fly.
♦ id
- Examples♦
id(list(1, 4, 6))
## [1] "1" "2" "3"
id(matrix(1:10, ncol=1))
## [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10"
id(mtcars)
## [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32"
id(mtcars, FALSE)
## [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32"
question_type(DATA.SPLIT$state, id(DATA.SPLIT, TRUE))
## TRUE tot.quest what how shall implied_do/does/did
## 1 X.04 1 1(100%) 0 0 0
## 2 X.07 1 0 1(100%) 0 0
## 3 X.10 1 1(100%) 0 0 0
## 4 X.11 1 0 0 1(100%) 0
## 5 X.15 1 0 0 0 1(100%)
## 6 X.01 0 0 0 0 0
## 7 X.02 0 0 0 0 0
## 8 X.03 0 0 0 0 0
## 9 X.05 0 0 0 0 0
## 10 X.06 0 0 0 0 0
## 11 X.08 0 0 0 0 0
## 12 X.09 0 0 0 0 0
## 13 X.12 0 0 0 0 0
## 14 X.13 0 0 0 0 0
## 15 X.14 0 0 0 0 0
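The zero-padded IDs shown above can be approximated in base R. The helper below, make_id, is a hypothetical name of my own and only mimics the padded and "X."-prefixed forms seen in the output; it is not qdap's implementation.

```r
## Hypothetical helper: sequential IDs padded to a common width,
## optionally prefixed with "X." as in the output above
make_id <- function(n, prefix = FALSE) {
    ids <- formatC(seq_len(n), width = nchar(n), flag = "0")
    if (prefix) paste0("X.", ids) else ids
}

make_id(3)                       # one digit is enough, so no padding
make_id(10)                      # padded to two digits
make_id(nrow(mtcars))            # one ID per row of a dataframe
make_id(15, prefix = TRUE)
```

Padding matters because the IDs are character strings: without it "10" would sort before "2".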
qdap allows for the detection of imperative sentences via the imperative function. The function detects imperative sentences and optionally re-marks them with an asterisk (*). However, imperative is sensitive to choppy, comma-riddled sentences and to dialects such as African American Vernacular English. The algorithm is complex and thus slower.
♦ imperative
- Imperative Data♦
(dat <- data.frame(name=c("sue", rep(c("greg", "tyler", "phil",
"sue"), 2)), statement=c("go get it|", "I hate to read.",
"Stop running!", "I like it!", "You are terrible!", "Don't!",
"Greg, go to the red, brick office.", "Tyler go to the gym.",
"Alex don't run."), stringsAsFactors = FALSE))
## name statement
## 1 sue go get it|
## 2 greg I hate to read.
## 3 tyler Stop running!
## 4 phil I like it!
## 5 sue You are terrible!
## 6 greg Don't!
## 7 tyler Greg, go to the red, brick office.
## 8 phil Tyler go to the gym.
## 9 sue Alex don't run.
♦ imperative
- Re-mark End Marks♦
imperative(dat, "name", "statement", additional.names = c("Alex"))
name statement
1 sue go get it*|
2 greg I hate to read.
3 tyler Stop running*!
4 phil I like it!
5 sue You are terrible!
6 greg Don't*!
7 tyler Greg, go to the red, brick office*.
8 phil Tyler go to the gym*.
9 sue Alex don't run*.
♦ imperative
- Handle Incomplete Sentences♦
imperative(dat, "name", "statement", lock.incomplete = TRUE, "Alex")
name statement
1 sue go get it|
2 greg I hate to read.
3 tyler Stop running*!
4 phil I like it!
5 sue You are terrible!
6 greg Don't*!
7 tyler Greg, go to the red, brick office*.
8 phil Tyler go to the gym*.
9 sue Alex don't run*.
♦ imperative
- Warning Report♦
imperative(dat, "name", "statement", additional.names = "Alex", warning=TRUE)
name statement warnings
1 sue go get it*| -
2 greg I hate to read. read
3 tyler Stop running*! -
4 phil I like it! -
5 sue You are terrible! -
6 greg Don't*! -
7 tyler Greg, go to the red, brick office*. 2 commas
8 phil Tyler go to the gym*. -
9 sue Alex don't run*. AAVE
The tm package is a highly regarded and widely utilized R package for text mining. The primary data forms for the tm package are Corpus and TermDocumentMatrix/DocumentTermMatrix. Because tm is a dependency for many R text mining packages it is prudent to provide a set of tools to convert between tm and qdap data types. This section demos a few of the tools designed to achieve qdap-tm compatibility. For a more thorough vignette describing qdap-tm compatibility use browseVignettes(package = "qdap").
♦ as.tdm
& as.dtm
- From Raw Text Example 1♦
as.tdm(DATA$state, DATA$person)
## <<TermDocumentMatrix (terms: 41, documents: 5)>>
## Non-/sparse entries: 49/156
## Sparsity : 76%
## Maximal term length: 8
## Weighting : term frequency (tf)
as.dtm(DATA$state, DATA$person)
## <<DocumentTermMatrix (documents: 5, terms: 41)>>
## Non-/sparse entries: 49/156
## Sparsity : 76%
## Maximal term length: 8
## Weighting : term frequency (tf)
♦ as.tdm
& as.dtm
- From Raw Text Example 2♦
(pres <- as.tdm(pres_debates2012$dialogue, pres_debates2012$person))
## <<TermDocumentMatrix (terms: 3363, documents: 6)>>
## Non-/sparse entries: 5769/14409
## Sparsity : 71%
## Maximal term length: 16
## Weighting : term frequency (tf)
library(tm)
plot(pres, corThreshold = 0.8)
(pres2 <- removeSparseTerms(pres, .3))
## <<TermDocumentMatrix (terms: 131, documents: 6)>>
## Non-/sparse entries: 715/71
## Sparsity : 9%
## Maximal term length: 14
## Weighting : term frequency (tf)
plot(pres2, corThreshold = 0.95)
x <- wfm(DATA$state, DATA$person)
as.tdm(x)
## <<TermDocumentMatrix (terms: 41, documents: 5)>>
## Non-/sparse entries: 49/156
## Sparsity : 76%
## Maximal term length: 8
## Weighting : term frequency (tf)
as.dtm(x)
## <<DocumentTermMatrix (documents: 5, terms: 41)>>
## Non-/sparse entries: 49/156
## Sparsity : 76%
## Maximal term length: 8
## Weighting : term frequency (tf)
plot(as.tdm(x))
library(tm); data(crude)
(dtm_in <- DocumentTermMatrix(crude, control = list(stopwords = TRUE)))
## <<DocumentTermMatrix (documents: 20, terms: 1200)>>
## Non-/sparse entries: 1930/22070
## Sparsity : 92%
## Maximal term length: 17
## Weighting : term frequency (tf)
summary(as.wfm(dtm_in))
## <<A word-frequency matrix (1200 terms, 20 groups)>>
##
## Non-/sparse entries : 1930/22070
## Sparsity : 92%
## Maximal term length : 17
## Less than four characters : 6%
## Hapax legomenon : 768(64%)
## Dis legomenon : 210(18%)
## Shannon's diversity index : 6.57
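The summaries printed above (terms, documents, non-/sparse entries, sparsity) all describe one simple object: a matrix of word counts with terms on one margin and documents (here, grouping variables) on the other. The base R sketch below builds such a matrix from an invented dialogue to show what those numbers mean; it does not use tm or qdap, and the object names are illustrative.

```r
## Tiny invented dialogue
person <- c("sam", "greg", "sam", "sally")
state  <- c("computer is fun", "it is dumb", "not too fun", "what is fun")

## One row per word occurrence, tagged with its speaker
tokens <- strsplit(tolower(state), "\\s+")
long   <- data.frame(term = unlist(tokens),
    doc = rep(person, lengths(tokens)), stringsAsFactors = FALSE)

## Term-document layout: terms in rows, speakers in columns
tdm <- unclass(table(long$term, long$doc))

## Document-term layout is just the transpose
dtm <- t(tdm)

## Sparsity is the share of cells that are zero
non_sparse <- sum(tdm != 0)
sparsity   <- mean(tdm == 0)
c(terms = nrow(tdm), documents = ncol(tdm),
    non_sparse = non_sparse, sparsity = round(sparsity, 2))
```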
apply_as_tm allows the user to apply functions intended to be used on the tm package's TermDocumentMatrix to a wfm object. apply_as_tm attempts to simplify back to a wfm or wfm_weight format. In the examples below we first create a wfm and then apply functions designed for a TermDocumentMatrix.
library(tm); library(proxy)
## Create a wfm
a <- with(DATA, wfm(state, list(sex, adult)))
summary(a)
## <<A word-frequency matrix (41 terms, 4 groups)>>
##
## Non-/sparse entries : 45/119
## Sparsity : 73%
## Maximal term length : 8
## Less than four characters : 49%
## Hapax legomenon : 32(78%)
## Dis legomenon : 7(17%)
## Shannon's diversity index : 3.62
## Apply as tm
(out <- apply_as_tm(a, tm:::removeSparseTerms, sparse=0.6))
## f.0 f.1 m.0 m.1
## we 1 1 0 1
## what 1 0 0 1
## you 1 0 3 0
summary(out)
## <<A word-frequency matrix (3 terms, 4 groups)>>
##
## Non-/sparse entries : 7/5
## Sparsity : 42%
## Maximal term length : 4
## Less than four characters : 67%
## Hapax legomenon : 0(0%)
## Dis legomenon : 1(33%)
## Shannon's diversity index : 1.06
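The effect of removeSparseTerms with sparse = 0.6 can be checked by hand. As I understand the rule, a term is kept only if the share of documents in which it does not occur is below the threshold. The sketch below applies that rule in base R to five rows of the word-frequency matrix shown in this section; the object names are my own.

```r
## Five rows of the word-frequency matrix (counts per sex.adult group)
wfm_toy <- rbind(
    about    = c(1, 0, 0, 0),
    computer = c(0, 0, 1, 0),
    we       = c(1, 1, 0, 1),
    what     = c(1, 0, 0, 1),
    you      = c(1, 0, 3, 0)
)
colnames(wfm_toy) <- c("f.0", "f.1", "m.0", "m.1")

## Per-term sparsity: proportion of groups in which the term is absent
term_sparsity <- rowMeans(wfm_toy == 0)
term_sparsity

## Keep terms whose sparsity is below the threshold
kept <- wfm_toy[term_sparsity < 0.6, , drop = FALSE]
kept
```

With four groups, a term used by only one group has a sparsity of 0.75 and is dropped, which is why only "we", "what" and "you" survive in the output above.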
apply_as_tm(a, tm:::findAssocs, "computer", .8)
## computer
## already 1.00
## am 1.00
## distrust 1.00
## dumb 1.00
## eat 1.00
## fun 1.00
## hungry 1.00
## i 1.00
## i'm 1.00
## is 1.00
## it 1.00
## it's 1.00
## let's 1.00
## liar 1.00
## no 1.00
## not 1.00
## stinks 1.00
## telling 1.00
## the 1.00
## there 1.00
## too 1.00
## truth 1.00
## way 1.00
## you 0.94
apply_as_tm(a, tm:::findFreqTerms, 2, 3)
## [1] "fun" "i" "is" "it's" "no" "not" "we" "what"
apply_as_tm(a, tm:::Zipf_plot)
## (Intercept) x
## 1.2003 -0.3672
apply_as_tm(a, tm:::Heaps_plot)
## (Intercept) x
## 0.3492 0.8495
apply_as_tm(a, tm:::plot.TermDocumentMatrix, corThreshold = 0.4)
apply_as_tm(a, tm:::weightBin)
## f.0 f.1 m.0 m.1
## about 1 0 0 0
## already 0 0 1 0
## am 0 0 1 0
## are 1 0 0 0
## be 1 0 0 0
## can 1 0 0 0
## certain 1 0 0 0
## computer 0 0 1 0
## distrust 0 0 1 0
## do 0 0 0 1
## dumb 0 0 1 0
## eat 0 0 1 0
## fun 0 0 1 0
## good 0 1 0 0
## how 1 0 0 0
## hungry 0 0 1 0
## i 0 0 1 0
## i'm 0 0 1 0
## is 0 0 1 0
## it 0 0 1 0
## it's 0 0 1 0
## let's 0 0 1 0
## liar 0 0 1 0
## move 0 1 0 0
## no 0 0 1 0
## not 0 0 1 0
## on 0 1 0 0
## shall 0 1 0 0
## should 0 0 0 1
## stinks 0 0 1 0
## talking 1 0 0 0
## telling 0 0 1 0
## the 0 0 1 0
## then 0 1 0 0
## there 0 0 1 0
## too 0 0 1 0
## truth 0 0 1 0
## way 0 0 1 0
## we 1 1 0 1
## what 1 0 0 1
## you 1 0 1 0
## attr(,"class")
## [1] "weighted_wfm" "matrix"
apply_as_tm(a, tm:::weightBin, to.qdap = FALSE)
## <<TermDocumentMatrix (terms: 41, documents: 4)>>
## Non-/sparse entries: 45/119
## Sparsity : 73%
## Maximal term length: 8
## Weighting : binary (bin)
apply_as_tm(a, tm:::weightSMART)
## f.0 f.1 m.0 m.1
## about 1 0 0 0
## already 0 0 1 0
## am 0 0 1 0
## are 1 0 0 0
## be 1 0 0 0
## can 1 0 0 0
## certain 1 0 0 0
## computer 0 0 1 0
## distrust 0 0 1 0
## do 0 0 0 1
## dumb 0 0 1 0
## eat 0 0 1 0
## fun 0 0 2 0
## good 0 1 0 0
## how 1 0 0 0
## hungry 0 0 1 0
## i 0 0 2 0
## i'm 0 0 1 0
## is 0 0 2 0
## it 0 0 1 0
## it's 0 0 2 0
## let's 0 0 1 0
## liar 0 0 1 0
## move 0 1 0 0
## no 0 0 2 0
## not 0 0 2 0
## on 0 1 0 0
## shall 0 1 0 0
## should 0 0 0 1
## stinks 0 0 1 0
## talking 1 0 0 0
## telling 0 0 1 0
## the 0 0 1 0
## then 0 1 0 0
## there 0 0 1 0
## too 0 0 1 0
## truth 0 0 1 0
## way 0 0 1 0
## we 1 1 0 1
## what 1 0 0 1
## you 1 0 3 0
## attr(,"class")
## [1] "weighted_wfm" "matrix"
apply_as_tm(a, tm:::weightTfIdf)
## f.0 f.1 m.0 m.1
## about 0.2000 0.00000 0.00000 0.0000
## already 0.0000 0.00000 0.06061 0.0000
## am 0.0000 0.00000 0.06061 0.0000
## are 0.2000 0.00000 0.00000 0.0000
## be 0.2000 0.00000 0.00000 0.0000
## can 0.2000 0.00000 0.00000 0.0000
## certain 0.2000 0.00000 0.00000 0.0000
## computer 0.0000 0.00000 0.06061 0.0000
## distrust 0.0000 0.00000 0.06061 0.0000
## do 0.0000 0.00000 0.00000 0.5000
## dumb 0.0000 0.00000 0.06061 0.0000
## eat 0.0000 0.00000 0.06061 0.0000
## fun 0.0000 0.00000 0.12121 0.0000
## good 0.0000 0.33333 0.00000 0.0000
## how 0.2000 0.00000 0.00000 0.0000
## hungry 0.0000 0.00000 0.06061 0.0000
## i 0.0000 0.00000 0.12121 0.0000
## i'm 0.0000 0.00000 0.06061 0.0000
## is 0.0000 0.00000 0.12121 0.0000
## it 0.0000 0.00000 0.06061 0.0000
## it's 0.0000 0.00000 0.12121 0.0000
## let's 0.0000 0.00000 0.06061 0.0000
## liar 0.0000 0.00000 0.06061 0.0000
## move 0.0000 0.33333 0.00000 0.0000
## no 0.0000 0.00000 0.12121 0.0000
## not 0.0000 0.00000 0.12121 0.0000
## on 0.0000 0.33333 0.00000 0.0000
## shall 0.0000 0.33333 0.00000 0.0000
## should 0.0000 0.00000 0.00000 0.5000
## stinks 0.0000 0.00000 0.06061 0.0000
## talking 0.2000 0.00000 0.00000 0.0000
## telling 0.0000 0.00000 0.06061 0.0000
## the 0.0000 0.00000 0.06061 0.0000
## then 0.0000 0.33333 0.00000 0.0000
## there 0.0000 0.00000 0.06061 0.0000
## too 0.0000 0.00000 0.06061 0.0000
## truth 0.0000 0.00000 0.06061 0.0000
## way 0.0000 0.00000 0.06061 0.0000
## we 0.0415 0.06917 0.00000 0.1038
## what 0.1000 0.00000 0.00000 0.2500
## you 0.1000 0.00000 0.09091 0.0000
## attr(,"class")
## [1] "weighted_wfm" "matrix"
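The weighted values above are consistent with a length-normalized tf-idf: each count is divided by the total number of words in its group, then multiplied by log2(number of groups / number of groups containing the term). The following base R sketch implements that formula on a three-term toy matrix and then checks two entries from the full output; tf_idf is a hypothetical helper of my own, not a tm or qdap function.

```r
## Hypothetical helper: normalized term frequency times log2 inverse
## document frequency, for a terms-by-groups count matrix
tf_idf <- function(m) {
    tf  <- sweep(m, 2, colSums(m), "/")     # share of each group's words
    idf <- log2(ncol(m) / rowSums(m > 0))   # rarer across groups = larger
    tf * idf                                # idf[i] multiplies row i
}

## Toy matrix with three terms and four groups
m <- rbind(
    we   = c(1, 1, 0, 1),
    what = c(1, 0, 0, 1),
    you  = c(1, 0, 3, 0)
)
colnames(m) <- c("f.0", "f.1", "m.0", "m.1")
w <- tf_idf(m)
round(w, 4)

## Checking entries of the full output by hand:
## "we" in f.1 is 1 of that group's 6 words and occurs in 3 of 4 groups
we_f1 <- (1 / 6) * log2(4 / 3)
## "what" in m.1 is 1 of that group's 4 words and occurs in 2 of 4 groups
what_m1 <- (1 / 4) * log2(4 / 2)
round(c(we_f1, what_m1), 5)
```

The toy matrix gives different numbers from the full output because its group totals are smaller; the hand checks use the group sizes from the complete matrix.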
## Convert dataframe to a Corpus
(x <- with(DATA2, as.Corpus(state, list(person, class, day))))
## <<VCorpus (documents: 20, metadata (corpus/indexed): 0/3)>>
library(tm)
inspect(x)
## <<VCorpus (documents: 20, metadata (corpus/indexed): 0/3)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 2)>>
## No its not, its dumb. I am telling the truth! There is no way. Im hungry. Lets eat. You already?
##
## [[2]]
## <<PlainTextDocument (metadata: 2)>>
## Im hungry. Lets eat. You already? No its not, its dumb. There is no way. I am telling the truth! I am telling the truth! No its not, its dumb. There is no way. There is no way.
##
## [[3]]
## <<PlainTextDocument (metadata: 2)>>
## There is no way. I am telling the truth! No its not, its dumb. Im hungry. Lets eat. You already? I am telling the truth! There is no way. Im hungry. Lets eat. You already? I am telling the truth!
##
## [[4]]
## <<PlainTextDocument (metadata: 2)>>
## I am telling the truth! Im hungry. Lets eat. You already? I am telling the truth! I am telling the truth! Im hungry. Lets eat. You already? There is no way. There is no way. No its not, its dumb.
##
## [[5]]
## <<PlainTextDocument (metadata: 2)>>
## Shall we move on? Good then.
##
## [[6]]
## <<PlainTextDocument (metadata: 2)>>
## Shall we move on? Good then. Shall we move on? Good then.
##
## [[7]]
## <<PlainTextDocument (metadata: 2)>>
## Shall we move on? Good then.
##
## [[8]]
## <<PlainTextDocument (metadata: 2)>>
## Shall we move on? Good then. Shall we move on? Good then. Shall we move on? Good then.
##
## [[9]]
## <<PlainTextDocument (metadata: 2)>>
## How can we be certain? What are you talking about?
##
## [[10]]
## <<PlainTextDocument (metadata: 2)>>
## How can we be certain?
##
## [[11]]
## <<PlainTextDocument (metadata: 2)>>
## What are you talking about? How can we be certain?
##
## [[12]]
## <<PlainTextDocument (metadata: 2)>>
## What are you talking about? What are you talking about? What are you talking about? How can we be certain? What are you talking about? How can we be certain?
##
## [[13]]
## <<PlainTextDocument (metadata: 2)>>
## Computer is fun. Not too fun. You liar, it stinks! I distrust you.
##
## [[14]]
## <<PlainTextDocument (metadata: 2)>>
## Computer is fun. Not too fun. I distrust you. I distrust you. You liar, it stinks! Computer is fun. Not too fun. You liar, it stinks! You liar, it stinks! Computer is fun. Not too fun. I distrust you. Computer is fun. Not too fun. You liar, it stinks!
##
## [[15]]
## <<PlainTextDocument (metadata: 2)>>
## I distrust you. You liar, it stinks! You liar, it stinks!
##
## [[16]]
## <<PlainTextDocument (metadata: 2)>>
## I distrust you. You liar, it stinks! I distrust you. I distrust you.
##
## [[17]]
## <<PlainTextDocument (metadata: 2)>>
## What should we do?
##
## [[18]]
## <<PlainTextDocument (metadata: 2)>>
## What should we do? What should we do? What should we do?
##
## [[19]]
## <<PlainTextDocument (metadata: 2)>>
## What should we do?
##
## [[20]]
## <<PlainTextDocument (metadata: 2)>>
## What should we do? What should we do?
class(x)
## [1] "VCorpus" "Corpus"
## Convert Back
htruncdf(as.data.frame(x), 15, 30)
## docs labels author text
## 1 greg.ELA.day 1 greg.ELA.day 1 trinker No its not, its dumb. I am tel
## 2 greg.ELA.day 2 greg.ELA.day 2 trinker Im hungry. Lets eat. You alr
## 3 greg.math.day 1 greg.math.day 1 trinker There is no way. I am telling
## 4 greg.math.day 2 greg.math.day 2 trinker I am telling the truth! Im hun
## 5 researcher.ELA.day 1 researcher.ELA.day 1 trinker Shall we move on? Good then.
## 6 researcher.ELA.day 2 researcher.ELA.day 2 trinker Shall we move on? Good then.
## 7 researcher.math.day 1 researcher.math.day 1 trinker Shall we move on? Good then.
## 8 researcher.math.day 2 researcher.math.day 2 trinker Shall we move on? Good then.
## 9 sally.ELA.day 1 sally.ELA.day 1 trinker How can we be certain? What ar
## 10 sally.ELA.day 2 sally.ELA.day 2 trinker How can we be certain?
## 11 sally.math.day 1 sally.math.day 1 trinker What are you talking about? Ho
## 12 sally.math.day 2 sally.math.day 2 trinker What are you talking about? Wh
## 13 sam.ELA.day 1 sam.ELA.day 1 trinker Computer is fun. Not too fun.
## 14 sam.ELA.day 2 sam.ELA.day 2 trinker Computer is fun. Not too fun.
## 15 sam.math.day 1 sam.math.day 1 trinker I distrust you. You liar, it s
qdap utilizes the following dictionaries/wordlists from the qdapDictionaries package.
If there is a discrepancy between the R and Java architectures you will have to download the appropriate version of Java compatible with the version of R you’re using. For more see Tal Galili’s blog post regarding rJava issues.
For more on natural language processing see the related CRAN NLP task view.
The qdap package was my first R package and a learning process. Several people contributed immensely to my learning. I'd like to particularly thank Dason Kurkiewicz for his constant mentoring/assistance in learning the R language, GitHub and package development as well as collaboration on numerous qdap functions. Thank you to Bryan Goodrich for his teaching, feedback and collaboration on several qdap functions. Thank you to Dr. Hadley Wickham for roxygen2, ggplot2, devtools and GitHub repos which I referenced often. I'd also like to thank the many folks at talkstats.com and stackoverflow.com for their help in answering many R questions related to qdap.
If the reader spots an error in this Vignette or would like to suggest an improvement please contact me @ Tyler Rinker<tyler.rinker@gmail.com>. To submit bug reports and feature requests related to the qdap package please visit qdap's GitHub issues page.
*Vignette created with the reports package (Rinker, 2013b)