In this vignette we show how the quanteda package can be used to replicate the analysis from Matthew Jockers’ book Text Analysis with R for Students of Literature (Springer, 2014). Most of the Jockers book consists of loading, transforming, and analyzing quantities derived from text and data about text. Because quanteda builds in most of the code needed to perform these data transformations and analyses, it is possible to replicate the results from the book with far less code.
In what follows, each section corresponds to a chapter in the book.
The book’s first chapter covers installing and setting up R; our closest equivalent is simply:
install.packages("quanteda", dependencies = TRUE)
But if you are reading this vignette, then chances are that you have already completed this step.
Moby Dick: Descriptive analysis
The code below loads the text of Moby Dick from Project Gutenberg, replicating the scanning and splitting implemented in the book. The textfile() command loads almost any file, including files found on the Internet (when the path begins with a URL, such as “http” or “https”).
require(quanteda)
## Loading required package: quanteda
##
## Attaching package: 'quanteda'
##
## The following object is masked from 'package:stats':
##
## df
##
## The following object is masked from 'package:base':
##
## sample
# read the text as a single file
# mobydicktf <- textfile("http://www.gutenberg.org/cache/epub/2701/pg2701.txt")
mobydicktf <- textfile(unzip(system.file("extdata", "pg2701.txt.zip", package = "quanteda")))
The textfile() function loads the text and places it inside a structured, intermediate object known as a corpusSource object, which the assignment above stores in the global environment.
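Evaluating the object name at the console (rather than assigning it) prints a brief summary of the corpusSource object; the exact wording of that summary depends on the version of quanteda installed.
# print a short summary of the corpusSource object (output varies by quanteda version)
mobydicktf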
We can access the text from a corpusSource object (and also, as we will see, a corpus class object), using the texts() method. Here we will display just the first 75 characters, to prevent a massive dump of the text of the entire novel. We do this using the substring() function, which shows the 1st through the 75th characters of the texts of our new object mobydicktf. Because we have not assigned the return from this command to any object, it invokes a print method for character objects, and is displayed on the screen.
substring(texts(mobydicktf), 1, 75)
## [1] "The Project Gutenberg EBook of Moby Dick; or The Whale, by Herman Melville\n"
The Gutenberg edition of the text contains some metadata before and after the text of the novel. The code below uses the regexec and substring functions to separate this from the text.
# extract the header information
mobydickText <- texts(mobydicktf)
endMetadataIndex <- regexec("CHAPTER 1. Loomings.", mobydickText)[[1]]
metadata.v <- substring(texts(mobydicktf), 1, endMetadataIndex - 1)
To trim the extra text at the end of the Gutenberg version of the text, we can use the keyword-in-context (kwic) function to view the contexts around the word ‘orphan’, which we know should occur at the end of the book.
# verify that "orphan" is the end of the novel
kwic(mobydickText, "orphan")
## contextPre keyword
## [text1, 260460] children, only found another [ orphan
## contextPost
## [text1, 260460] ] . End of Project Gutenberg's
# extract the novel -- a better way
novel.v <- substring(mobydickText, endMetadataIndex,
regexec("End of Project Gutenberg's Moby Dick.", mobydickText)[[1]]-1)
We begin processing the text by converting it to lower case. quanteda’s toLower function works like the built-in tolower, with an extra option to preserve upper-case acronyms when detected.
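As a small illustration that is not part of the book’s replication – and assuming that this version of quanteda’s toLower exposes a keepAcronyms argument – acronym preservation can be requested explicitly:
# illustrative only: keepAcronyms is assumed to be available in this version of toLower
toLower("The NASA report mentioned Moby Dick.", keepAcronyms = TRUE)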
# lowercase
novel.lower.v <- toLower(novel.v)
quanteda’s tokenize function splits the text into words, with many options available for which characters should be preserved, and which should be used to define word boundaries. The default behaviour works similarly to splitting on non-word characters (the regular expression \W), except that apostrophes are not treated as word boundaries. This means that the ’s and ’t of possessive forms and contractions are not split off as separate tokens.
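To see what this means in practice, here is a small made-up sentence (not drawn from the novel) showing that possessives and contractions remain single tokens:
# illustrative example: apostrophes stay inside words, so "whale's" and "isn't" are not split
tokenize("the whale's spout isn't visible", removePunct = TRUE, simplify = TRUE)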
# tokenize
moby.word.v <- tokenize(novel.lower.v, removePunct = TRUE, simplify = TRUE)
length(moby.word.v)
## [1] 210000
total.length <- length(moby.word.v)
str(moby.word.v)
## atomic [1:210000] chapter 1 loomings call ...
## - attr(*, "what")= chr "word"
## - attr(*, "ngrams")= int 1
## - attr(*, "concatenator")= chr ""
moby.word.v[1:10]
## [1] "chapter" "1" "loomings" "call" "me" "ishmael"
## [7] "some" "years" "ago" "never"
moby.word.v[99986]
## [1] "in"
moby.word.v[c(4,5,6)]
## [1] "call" "me" "ishmael"
head(which(moby.word.v=="whale"))
## [1] 2030 2060 2203 2415 4048 4211
The code below uses the tokenized text to count the occurrences of the word whale. To include the possessive form whale’s, we can either sum the counts of both forms, or count the keyword-in-context matches using a regular expression or a glob pattern[^1]. quanteda’s tokenize function separates punctuation into tokens by default. To match the counts in the book, we can choose to remove the punctuation.
[^1]: A glob is a simple wildcard matching pattern common on Unix systems – asterisks match zero or more characters.
moby.word.v <- tokenize(novel.lower.v, simplify = TRUE)
# count of the word 'whale'
length(moby.word.v[which(moby.word.v == "whale")])
## [1] 1030
# total occurrences of 'whale' including possessive
length(moby.word.v[which(moby.word.v == "whale")]) + length(moby.word.v[which(moby.word.v == "whale's")])
## [1] 1150
# same thing using kwic()
nrow(kwic(novel.lower.v, "whale"))
## [1] 1030
nrow(kwic(novel.lower.v, "whale*")) # includes words like 'whalemen'
## [1] 1572
(total.whale.hits <- nrow(kwic(novel.lower.v, "^whale('s){0,1}$", valuetype = 'regex')))
## [1] 1150
What fraction of the total words in the novel are ‘whale’?
total.whale.hits / ntoken(novel.lower.v, removePunct=TRUE)
## [1] 0.00547619
Calculating the size of the vocabulary – the count includes possessive forms as distinct types.
# total unique words
length(unique(moby.word.v))
## [1] 17299
ntype(toLower(novel.v), removePunct = TRUE)
## [1] 18525
To quickly sort the word types by their frequency, we can use the dfm command to create a matrix of counts of each word type – a document-feature matrix. In this case there is only one document, the entire book.
# ten most frequent words
mobyDfm <- dfm(novel.lower.v)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 1 document
## ... indexing features: 18,345 feature types
## ... created a 1 x 18345 sparse dfm
## ... complete.
## Elapsed time: 0.214 seconds.
mobyDfm[, "whale"]
## Document-feature matrix of: 1 document, 1 feature.
## 1 x 1 sparse Matrix of class "dfmSparse"
## features
## docs whale
## text1 908
topfeatures(mobyDfm)
## the of and a to in that his it i
## 14173 6446 6311 4605 4511 4071 2950 2495 2394 1976
plot(topfeatures(mobyDfm, 100), log = "y", cex = .6, ylab = "Term frequency")
We can query the document-frequency matrix to retrieve word frequencies, as with a normal matrix:
# frequencies of 'he', 'she', 'him', and 'her' - these are matrices, not numerics
mobyDfm[, c("he", "she", "him", "her")]
## Document-feature matrix of: 1 document, 4 features.
## 1 x 4 sparse Matrix of class "dfmSparse"
## features
## docs he she him her
## text1 1758 112 1058 330
mobyDfm[, "her"]
## Document-feature matrix of: 1 document, 1 feature.
## 1 x 1 sparse Matrix of class "dfmSparse"
## features
## docs her
## text1 330
mobyDfm[, "him"]/mobyDfm[, "her"]
## 1 x 1 Matrix of class "dgeMatrix"
## features
## docs him
## text1 3.206061
mobyDfm[, "he"]/mobyDfm[, "she"]
## 1 x 1 Matrix of class "dgeMatrix"
## features
## docs he
## text1 15.69643
mobyDfmPct <- weight(mobyDfm, "relFreq") * 100
mobyDfmPct[, "the"]
## Document-feature matrix of: 1 document, 1 feature.
## 1 x 1 sparse Matrix of class "dfmSparse"
## features
## docs the
## text1 6.755385
plot(topfeatures(mobyDfmPct), type="b",
xlab="Top Ten Words", ylab="Percentage of Full Text", xaxt ="n")
axis(1, 1:10, labels = names(topfeatures(mobyDfmPct)))
A dispersion plot allows us to visualize the occurrences of particular terms throughout the text. The object returned by the kwic function can be plotted to display a dispersion plot.
# using words from tokenized corpus for dispersion
plot(kwic(novel.v, "whale"))
plot(kwic(novel.v, "Ahab"))
Jockers uses grep to find the chapter break locations; with quanteda, we can identify them using kwic with a regular expression pattern.
# identify the chapter break locations
(chap.positions.v <- kwic(novel.v, "CHAPTER \\d", valuetype = "regex")$position)
Splitting the text into chapters means that we will have a collection of documents, which makes this a good time to make a corpus object to hold the texts. We use the segment function to split the single text by the string which specifies the chapter breaks, and then construct a corpus from the resulting chapters.
head(kwic(novel.v, 'chapter'))
## contextPre keyword
## [text1, 1] [ CHAPTER
## [text1, 2621] hill in the air. [ CHAPTER
## [text1, 4355] Spouter" may be. [ CHAPTER
## [text1, 11416] better in my life. [ CHAPTER
## [text1, 13363] like a marshal's baton. [ CHAPTER
## [text1, 14249] out for a stroll. [ CHAPTER
## contextPost
## [text1, 1] ] 1. Loomings. Call
## [text1, 2621] ] 2. The Carpet-
## [text1, 4355] ] 3. The Spouter-
## [text1, 11416] ] 4. The Counterpane.
## [text1, 13363] ] 5. Breakfast. I
## [text1, 14249] ] 6. The Street.
chaptersVec <- unlist(segment(novel.v, what='other', delimiter="CHAPTER\\s\\d", perl=TRUE))
chaptersLowerVec <- toLower(chaptersVec)
chaptersCorp <- corpus(chaptersVec)
With the corpus split into chapters, we can use the dfm command to create a matrix of counts of each word in each chapter – a document-feature matrix.
chapDfm <- dfm(chaptersCorp)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 136 documents
## ... indexing features: 18,345 feature types
## ... created a 136 x 18345 sparse dfm
## ... complete.
## Elapsed time: 0.2 seconds.
barplot(as.numeric(chapDfm[, 'whale']))
barplot(as.numeric(chapDfm[, 'ahab']))
The above plots are raw frequency plots. For relative frequency plots (word count divided by the length of the chapter), we can weight the document-feature matrix. To obtain expected word frequency per 100 words, we multiply by 100. To get a feel for what the resulting weighted dfm (document-feature matrix) looks like, you can inspect it with the head function, which prints the first few rows and columns.
relDfm <- weight(chapDfm, type='relFreq') * 100
head(relDfm)
## Document-feature matrix of: 136 documents, 18,345 features.
## (showing first 6 documents and first 6 features)
## features
## docs loomings call me ishmael some years
## text1 0.04508566 0.04508566 1.1271416 0.09017133 0.49594229 0.04508566
## text2 0.00000000 0.00000000 0.4178273 0.27855153 0.06963788 0.06963788
## text3 0.00000000 0.01709694 0.7522653 0.00000000 0.29064797 0.06838776
## text4 0.00000000 0.00000000 1.1438892 0.00000000 0.06020470 0.00000000
## text5 0.00000000 0.00000000 0.1347709 0.00000000 0.26954178 0.00000000
## text6 0.00000000 0.00000000 0.1230012 0.00000000 0.12300123 0.00000000
barplot(as.numeric(relDfm[, 'whale']))
barplot(as.numeric(relDfm[, 'ahab']))
The dfm function constructs a matrix which contains zeroes (rather than NAs) for words that do not occur in a chapter, so there’s no need to manually convert NAs. We can compute the correlation between the two columns directly, or compute a correlation matrix from a two-column matrix built from them.
wf <- as.numeric(relDfm[,'whale'])
af <- as.numeric(relDfm[,'ahab'])
cor(wf, af)
## [1] -0.2355851
waDfm <- cbind(relDfm[,'whale'], relDfm[,'ahab'])
cor(as.matrix(waDfm))
## whale ahab
## whale 1.0000000 -0.2355851
## ahab -0.2355851 1.0000000
With the ahab frequency and whale frequency vectors extracted from the dfm, it is easy to calculate the significance of the correlation.
samples <- replicate(1000, cor(sample(af), sample(wf)))
h <- hist(samples, breaks=100, col="grey",
xlab="Correlation Coefficient",
main="Histogram of Random Correlation Coefficients\nwith Normal Curve",
plot=TRUE)
xfit <- seq(min(samples),max(samples),length=1000)
yfit <- dnorm(xfit,mean=mean(samples),sd=sd(samples))
yfit <- yfit*diff(h$mids[1:2])*length(samples)
lines(xfit, yfit, col="black", lwd=2)
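As a supplementary check that is not part of the book’s analysis, we can also read an empirical two-sided p-value directly off the permutation distribution; it should be broadly consistent with the cor.test result that follows.
# proportion of permuted correlations at least as extreme as the observed correlation
mean(abs(samples) >= abs(cor(wf, af)))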
cor.test(wf, af)
##
## Pearson's product-moment correlation
##
## data: wf and af
## t = -2.8061, df = 134, p-value = 0.005763
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.38851092 -0.07002936
## sample estimates:
## cor
## -0.2355851
The mean word frequency for a particular chapter can be calculated simply from the dfm. Each row is a document (chapter), so, for example, the mean word frequency of the first chapter is the sum of the first row of the matrix, divided by the number of word types in the first chapter. To get the number of word types in the first chapter only, we can either exclude words in that row which have a frequency of zero, or use the ntype function on the first document in the corpus to achieve a very similar result.
firstChap <- as.matrix(chapDfm[1,])
numWords <- length(firstChap[firstChap > 0])
sum(chapDfm[1,])/numWords
## [1] 2.612485
sum(chapDfm[1,])/ntype(chaptersCorp[1], removePunct=TRUE)
## text1
## 2.442731
The rowMeans matrix function, which operates on a dfm, allows us to retrieve the means for all of the chapters.
chapMeans <- Matrix::rowMeans(chapDfm)
plot(chapMeans, type="h")
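As a final illustrative step, not taken from the book, we can ask which chapter has the largest mean feature frequency:
# illustrative: identify the chapter (document) with the highest mean frequency
which.max(chapMeans)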