quanteda
quanteda¹ is an R package designed to simplify the process of quantitative analysis of text from start to finish, making it possible to turn texts into a structured corpus, convert this corpus into a quantitative matrix of features extracted from the texts, and perform a variety of quantitative analyses on this matrix. The objective is inference about the data contained in the texts, whether this means describing characteristics of the texts, inferring quantities of interest about the texts or their authors, or determining the tone or topics contained in the texts. The emphasis of quanteda is on simplicity: creating a corpus to manage texts and variables attached to these texts in a straightforward way, and providing powerful tools to extract features from this corpus that can be analyzed using quantitative techniques.
quanteda provides tools for getting texts into a corpus object, for working with a corpus, for extracting features from a corpus, and for analyzing the document-feature matrix created when features are extracted from a corpus; a minimal sketch of this end-to-end workflow is shown below.
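As a minimal sketch of that end-to-end workflow, using the corpus of US presidential inaugural addresses shipped with quanteda (each step is covered in detail later in this vignette):
# texts -> corpus -> document-feature matrix -> analysis
library(quanteda)
data(inaugTexts)                  # character vector of texts supplied with quanteda
myCorpus <- corpus(inaugTexts)    # build a corpus from the texts
myDfm <- dfm(myCorpus)            # extract word-count features into a matrix
topfeatures(myDfm, 10)            # a simple analysis: the ten most frequent features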
quanteda is hardly unique in providing facilities for working with text: the excellent tm package already provides many of the features we have described. quanteda is designed to complement such packages, as well as to simplify the implementation of the text-to-analysis workflow. quanteda corpus structures are simpler objects than in tm, as are the document-feature matrix objects from quanteda compared to the sparse matrix implementation found in tm. However, there is no need to choose only one package, since quanteda provides translator functions from one matrix or corpus object to the other.
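As a hedged sketch of such a conversion (this assumes the convert() function with a to = "tm" option is available in your installed version of quanteda; check ?convert if it is not):
# convert a quanteda dfm into a tm document-term matrix
# (assumes convert() and its "tm" target exist in your quanteda version)
data(inaugCorpus)
myDfm <- dfm(inaugCorpus)
tmDtm <- convert(myDfm, to = "tm")
class(tmDtm)    # a tm-style sparse document-term matrix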
This vignette is designed to introduce you to quanteda as well as to provide a tutorial overview of its features.
The code for the quanteda package currently resides on http://github.com/kbenoit/quanteda. From an Internet-connected computer, you can install the package directly using the devtools package:
library(devtools)
if (!require(quanteda)) install_github("kbenoit/quanteda")
This will download the package from github and install it on your computer. For other branches, for instance if you wish to install the development branch (containing work in progress) rather than the master, you should instead run:
# to install the latest dev branch version quanteda from Github use:
install_github("kbenoit/quanteda", dependencies=TRUE, quick=TRUE, ref="dev")
Typically, the dev branch of a software package is under active development, so while it contains the latest updates, it is more likely to have bugs. The master branch might be missing some of the newer features, but should be more reliable.
To try the functions provided for interacting with corpora, load the inaugCorpus object packaged with quanteda. This corpus contains US presidents’ inaugural addresses since 1789, with document-level variables for the year of each address (Year) and the last name of the president (President). The summary command gives a brief description of the corpus, and a summary of the first n documents:
# make sure quanteda is loaded and load the corpus of inaugural addresses
library(quanteda)
data(inaugCorpus)
summary(inaugCorpus, n=3)
#> Corpus consisting of 57 documents, showing 3 documents.
#>
#> Text Types Tokens Sentences Year President
#> 1789-Washington 594 1431 24 1789 Washington
#> 1793-Washington 90 135 4 1793 Washington
#> 1797-Adams 794 2321 37 1797 Adams
#>
#> Source: /home/paul/Dropbox/code/quanteda/* on x86_64 by paul.
#> Created: Fri Sep 12 12:41:17 2014.
#> Notes: .
We can save the output from the summary command as a data frame, and plot some basic descriptive statistics with this information:
tokenInfo <- summary(inaugCorpus)
if (require(ggplot2))
ggplot(data=tokenInfo, aes(x=Year, y=Tokens, group=1)) + geom_line() + geom_point() +
scale_x_discrete(labels=c(seq(1789,2012,12)), breaks=seq(1789,2012,12) )
#> Loading required package: ggplot2
tokenInfo[which.max(tokenInfo$Tokens),] # Longest inaugural address: William Henry Harrison
#> Text Types Tokens Sentences Year President
#> 1841-Harrison 1841-Harrison 1803 8463 215 1841 Harrison
A simple measure of the complexity of a text is lexical diversity, or the ratio of the number of unique word types (the vocabulary size) to the total number of word tokens (the length of the document in words). We can also compute this ratio from the corpus summary. The type-token ratio is a simplistic measure, and is usually higher for short texts.
ttr <- tokenInfo$Types/tokenInfo$Tokens
if (require(ggplot2))
ggplot(data=tokenInfo, aes(x=Year, y=ttr, group=1)) + geom_line() + geom_point() +
scale_x_discrete(labels=c(seq(1789,2012,12)), breaks=seq(1789,2012,12) )
tokenInfo[which.max(ttr),]
#> Text Types Tokens Sentences Year President
#> 1793-Washington 1793-Washington 90 135 4 1793 Washington
The kwic function (keyword-in-context) performs a search for a word and allows us to view the contexts in which it occurs:
options(width = 200)
kwic(inaugCorpus, "terror")
#> preword word postword
#> [1797-Adams, 1183] by fraud or violence, by terror, intrigue, or venality, the Government
#> [1933-Roosevelt, 100] itself -- nameless, unreasoning, unjustified terror which paralyzes needed efforts to
#> [1941-Roosevelt, 252] seemed frozen by a fatalistic terror, we proved that this is
#> [1961-Kennedy, 763] alter that uncertain balance of terror that stays the hand of
#> [1961-Kennedy, 872] of science instead of its terrors. Together let us explore the
#> [1981-Reagan, 691] freeing all Americans from the terror of runaway living costs. All
#> [1981-Reagan, 1891] understood by those who practice terrorism and prey upon their neighbors.\n\nI
#> [1997-Clinton, 929] They fuel the fanaticism of terror. And they torment the lives
#> [1997-Clinton, 1462] maintain a strong defense against terror and destruction. Our children will
#> [2009-Obama, 1433] advance their aims by inducing terror and slaughtering innocents, we say
kwic(inaugCorpus, "terror", wholeword=TRUE)
#> preword word postword
#> [1933-Roosevelt, 100] itself -- nameless, unreasoning, unjustified terror which paralyzes needed efforts to
#> [1961-Kennedy, 763] alter that uncertain balance of terror that stays the hand of
#> [1981-Reagan, 691] freeing all Americans from the terror of runaway living costs. All
#> [1997-Clinton, 1462] maintain a strong defense against terror and destruction. Our children will
#> [2009-Obama, 1433] advance their aims by inducing terror and slaughtering innocents, we say
kwic(inaugCorpus, "communist")
#> preword word postword
#> [1949-Truman, 728] the actions resulting from the Communist philosophy are a threat to
#> [1961-Kennedy, 453] required -- not because the Communists may be doing it, not
In the above summary, Year and President are variables associated with each document. We can access such variables with the docvars() function.
# check the document-level variable names
names(docvars(inaugCorpus))
#> [1] "Year" "President"
# list the first few values
head(docvars(inaugCorpus))
#> Year President
#> 1789-Washington 1789 Washington
#> 1793-Washington 1793 Washington
#> 1797-Adams 1797 Adams
#> 1801-Jefferson 1801 Jefferson
#> 1805-Jefferson 1805 Jefferson
#> 1809-Madison 1809 Madison
# check the corpus-level metadata
metacorpus(inaugCorpus)
#> $source
#> [1] "/home/paul/Dropbox/code/quanteda/* on x86_64 by paul"
#>
#> $created
#> [1] "Fri Sep 12 12:41:17 2014"
#>
#> $notes
#> NULL
#>
#> $citation
#> NULL
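Document variables can also be created or modified directly. A hedged sketch (this assumes the docvars() replacement method is available in your version of quanteda):
# add a document-level variable marking addresses from 1900 onwards
# (assumes the docvars()<- assignment method exists in your quanteda version)
docvars(inaugCorpus, "Modern") <- docvars(inaugCorpus, "Year") >= 1900
head(docvars(inaugCorpus))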
Many more corpora are available in the quantedaData package.
The simplest case is to create a corpus from a vector of texts already in memory in R. If we already have the texts in this form, we can call the corpus constructor function directly. inaugTexts is a character vector of the inaugural addresses included with quanteda.
data(inaugTexts)
myCorpus <- corpus(inaugTexts)
Often, texts aren’t available as pre-made R character vectors, and we need to load them from an external source. To do this, we first create a source for the documents, which defines how they are loaded into the corpus. The source may be a character vector, a directory of text files, a zip file, a Twitter search, or one of several external package formats such as tm’s VCorpus.
Once a source has been defined, we make a new corpus by calling the corpus constructor with the source as the first argument. The corpus constructor also accepts arguments which can set some corpus metadata and define how the document variables are set.
A very common source of files for creating a corpus is a set of text files found in a local (or remote) directory. To load texts in this way, we first define a source for the directory and pass this source as an argument to the corpus constructor. We create this source by calling the textfile function with a pattern matching the files, as in the example below.
# Basic file import from directory
d <- textfile('~/Dropbox/QUANTESS/corpora/inaugural/*.txt')
myCorpus <- corpus(d)
myCorpus
If the document variables are specified in the filenames of the texts, we can read them by setting the docvarsfrom argument (docvarsfrom = "filenames") and specifying how the filenames are formatted with the sep argument. For example, if the inaugural address texts were stored on disk in the format Year-President.txt (e.g. 1973-Nixon.txt), then we can load them and automatically populate the document variables. The docvarnames argument sets the names of the document variables; it must have the same length as the number of parts in the filenames.
# File import reading document variables from filenames
d <- textfile('~/Dropbox/QUANTESS/corpora/inaugural/*.txt')
# In this example the format of the filenames is `Year-President.txt`.
# Because there are two variables in the filename, docvarnames must contain two names
myCorpus <- corpus(d, docvarsfrom="filenames", sep="-", docvarnames=c("Year", "President") )
quanteda provides an interface to retrieve and store data from a Twitter search as a corpus object. The REST API query uses the twitteR package, and an API authorization from Twitter is required. The process of obtaining this authorization is described in detail here: https://openhatch.org/wiki/Community_Data_Science_Workshops/Twitter_authentication_setup, correct as of October 2014. The Twitter API is a commercial service, and rate limits and the data returned are determined by Twitter.
Four keys are required, to be passed to quanteda’s getTweets source function, in addition to the search query term and the number of results required. The maximum number of results that can be obtained is not precisely specified in the API documentation, but experimentation suggests an upper bound of around 1,500 results from a single query, with a frequency limit of one query per minute.
The code below performs authentication and runs a search for the string ‘quantitative’. Many other functions for working with the API are available from the twitteR package, and an R interface to the streaming API is also available.
# These keys are examples and may not work! Get your own key at dev.twitter.com
consumer_key="vRLy03ef6OFAZB7oCL4jA"
consumer_secret="wWF35Lr1raBrPerVHSDyRftv8qB1H7ltV0T3Srb3s"
access_token="1577780816-wVbOZEED8KZs70PwJ2q5ld2w9CcvcZ2kC6gPnAo"
token_secret="IeC6iYlgUK9csWiP524Jb4UNM8RtQmHyetLi9NZrkJA"
tw <- getTweets('quantitative', numResults=20, consumer_key, consumer_secret, access_token, token_secret)
The return value from the above query is a source object which can be passed to quanteda’s corpus constructor, and the document variables are set to correspond with tweet metadata returned by the API.
twCorpus <- corpus(tw)
names(docvars(twCorpus))
In order to perform statistical analysis such as document scaling, we must extract a matrix associating values for certain features with each document. In quanteda, we use the dfm function to produce such a matrix.²
By far the most common approach is to consider each word type to be a feature, and the number of occurrences of the word type in each document to be the value. This is easy to see with a concrete example, so let’s use the dfm command on the inaugural address corpus. To simplify the example output, we reduce the size of the inaugCorpus object using the corpus subset function, which can create a new corpus from a subset of documents, selecting by document variables.
data(inaugCorpus)
myCorpus <- subset(inaugCorpus, Year > 1990)
# make a dfm
myDfm <- dfm(myCorpus)
#> Creating a dfm from a corpus ...
#> ... lowercasing
#> ... tokenizing
#> ... indexing 6 documents
#> ... shaping tokens into data.table, found 11,915 total tokens
#> ... summing tokens by document
#> ... indexing 2,292 feature types
#> ... building sparse matrix
#> ... created a 6 x 2292 sparse dfm
#> ... complete. Elapsed time: 0.034 seconds.
myDfm[, 1:5]
#> Document-feature matrix of: 6 documents, 5 features.
#> 6 x 5 sparse Matrix of class "dfmSparse"
#> features
#> docs 18th 19th 20th 21st a
#> 1993-Clinton 0 0 0 1 17
#> 1997-Clinton 1 2 3 3 59
#> 2001-Bush 0 0 0 0 46
#> 2005-Bush 0 0 0 0 27
#> 2009-Obama 0 0 0 0 47
#> 2013-Obama 0 0 0 0 37
# make a dfm, removing stopwords and applying stemming
myStemMat <- dfm(myCorpus, stopwords=TRUE, stem=TRUE)
#> Creating a dfm from a corpus ...
#> ... lowercasing
#> ... tokenizing
#> ... indexing 6 documents
#> ... shaping tokens into data.table, found 11,915 total tokens
#> ... stemming the tokens (english)
#> ... summing tokens by document
#> ... indexing 1,776 feature types
#> ... building sparse matrix
#> ... created a 6 x 1776 sparse dfm
#> ... complete. Elapsed time: 0.025 seconds.
myStemMat[, 1:5]
#> Document-feature matrix of: 6 documents, 5 features.
#> 6 x 5 sparse Matrix of class "dfmSparse"
#> features
#> docs 18th 19th 20th 21st a
#> 1993-Clinton 0 0 0 1 17
#> 1997-Clinton 1 2 3 3 59
#> 2001-Bush 0 0 0 0 46
#> 2005-Bush 0 0 0 0 27
#> 2009-Obama 0 0 0 0 47
#> 2013-Obama 0 0 0 0 37
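Features need not be single word types: as noted in footnote 2, other properties of documents, such as n-grams, can also serve as features. A hedged sketch (this assumes the dfm function in your installed version accepts an ngrams argument):
# count bigrams rather than single words
# (assumes dfm() in your quanteda version supports the ngrams argument)
myBigramMat <- dfm(myCorpus, ngrams = 2)
topfeatures(myBigramMat, 10)   # ten most frequent bigrams (output not shown)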
The dfm can be inspected in the Environment pane in RStudio, or by calling R’s View function. Calling plot on a dfm will display a wordcloud using the wordcloud package:
plot(myStemMat)
Often, we are interested in analysing how texts differ according to substantive factors which may be encoded in the document variables, rather than simply by the boundaries of the document files. We can group documents which share the same value for a document variable when creating a dfm:
byPresMat <- dfm(myCorpus, groups=c('President'), stopwords=TRUE)
#> Creating a dfm from a corpus ...
#> ... grouping texts by variable: President
#> ... lowercasing
#> ... tokenizing
#> ... indexing 3 documents
#> ... shaping tokens into data.table, found 11,915 total tokens
#> ... summing tokens by document
#> ... indexing 2,292 feature types
#> ... building sparse matrix
#> ... created a 3 x 2292 sparse dfm
#> ... complete. Elapsed time: 0.018 seconds.
byPresMat[,1:5] # the counts here are sums of counts from speeches by the same President.
#> Document-feature matrix of: 3 documents, 5 features.
#> 3 x 5 sparse Matrix of class "dfmSparse"
#> features
#> docs 18th 19th 20th 21st a
#> Bush 0 0 0 0 73
#> Clinton 1 2 3 4 76
#> Obama 0 0 0 0 84
For some applications we have prior knowledge of sets of words that are indicative of traits we would like to measure from the text. For example, a general list of positive words might indicate positive sentiment in a movie review, or we might have a dictionary of political terms which are associated with a particular ideological stance. In these cases, it is sometimes useful to treat these groups of words as equivalent for the purposes of analysis, and sum their counts into classes.
For example, let’s look at how words associated with terrorism and words associated with the economy vary by President in the inaugural speeches corpus. From the original corpus, we select Presidents since Clinton:
data(inaugCorpus)
recentCorpus <- subset(inaugCorpus, Year > 1991)
Now we define a toy dictionary:
myDict <- list(terror = c("terrorism", "terrorists", "threat", "a"),
               economy = c("jobs", "business", "grow", "work"))
We can use the dictionary when making the dfm:
# sum counts of dictionary-matched words into the two keys
byPresMat <- dfm(recentCorpus, dictionary=myDict)
#> Creating a dfm from a corpus ...
#> ... lowercasing
#> ... tokenizing
#> ... indexing 6 documents
#> ... shaping tokens into data.table, found 11,915 total tokens
#> ... applying a dictionary consisting of 2 key entries
#> ... summing dictionary-matched features by document
#> ... indexing 2 feature types
#> ... building sparse matrix
#> ... created a 6 x 2 sparse dfm
#> ... complete. Elapsed time: 0.04 seconds.
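The following self-contained example pulls these pieces together: it builds a corpus from the immigration-related sections of the 2010 UK party manifestos included with the package (ukimmigTexts), examines keywords in context with kwic, and creates and plots a document-feature matrix.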
library(quanteda)
# create a corpus from the immigration texts from UK party platforms
uk2010immigCorpus <- corpus(ukimmigTexts,
docvars=data.frame(party=names(ukimmigTexts)),
notes="Immigration-related sections of 2010 UK party manifestos",
enc="UTF-8")
uk2010immigCorpus
#> Corpus consisting of 9 documents.
summary(uk2010immigCorpus, showmeta=TRUE)
#> Corpus consisting of 9 documents.
#>
#> Text Types Tokens Sentences party
#> text1 1024 2876 137 BNP
#> text2 135 235 12 Coalition
#> text3 235 454 21 Conservative
#> text4 306 614 30 Greens
#> text5 276 630 34 Labour
#> text6 246 443 26 LibDem
#> text7 74 103 5 PC
#> text8 82 125 4 SNP
#> text9 310 641 38 UKIP
#>
#> Source: /private/var/folders/3_/7s7qq3wx08b8htzt5l9sdm6m0000gr/T/RtmpcmNfe7/Rbuild5e7343e166d/quanteda/vignettes/* on x86_64 by kbenoit.
#> Created: Sat Jul 11 09:03:09 2015.
#> Notes: Immigration-related sections of 2010 UK party manifestos.
# key words in context for "deport", 3 words of context
kwic(uk2010immigCorpus, "deport", 3)
#> preword word postword
#> [text1, 71] further immigration, the deportation of all illegal
#> [text1, 139] The BNP will deport all foreigners convicted
#> [text1, 1628] long-term resettlement programme.\n\n2. Deport all illegal immigrants\n\nWe
#> [text1, 1633] illegal immigrants\n\nWe shall deport all illegal immigrants
#> [text1, 1653] current unacceptably lax deportation policies, thousands of
#> [text1, 1659] of people are deported from the UK
#> [text1, 2169] enforced by instant deportation, for anyone found
#> [text1, 2180] British immigration laws.\n\n8. Deportation of all Foreign
#> [text1, 2186] Foreign Criminals\n\nWe shall deport all criminal entrants,
#> [text1, 2198] This includes the deportation of all Muslim
#> [text4, 566] subject to summary deportation. They should receive
#> [text6, 194] illegal labour.\n\n- Prioritise deportation efforts on criminals,
#> [text6, 394] flight risks.\n\n- End deportations of refugees to
#> [text9, 317] laws or face deportation. Such citizens will
# create a dfm, removing stopwords
mydfm <- dfm(uk2010immigCorpus, stopwords=TRUE)
#> Creating a dfm from a corpus ...
#> ... lowercasing
#> ... tokenizing
#> ... indexing 9 documents
#> ... shaping tokens into data.table, found 6,021 total tokens
#> ... summing tokens by document
#> ... indexing 1,574 feature types
#> ... building sparse matrix
#> ... created a 9 x 1574 sparse dfm
#> ... complete. Elapsed time: 0.035 seconds.
dim(mydfm) # basic dimensions of the dfm
#> [1] 9 1574
topfeatures(mydfm, 15) # 15 top words
#> the of and to in we a will for that immigration be our is are
#> 339 228 218 218 117 97 89 86 77 76 68 54 53 50 48
# if (Sys.info()["sysname"] == "Darwin") quartz(width=8, height=8)
plot(mydfm) # word cloud
1. This research was supported by the European Research Council grant ERC-2011-StG 283794-QUANTESS. Code contributors to the project include Ben Lauderdale, Pablo Barberà, and Kohei Watanabe.
2. dfm stands for document-feature matrix; we say “feature” as opposed to “term”, since it is possible to use other properties of documents (e.g. n-grams or syntactic dependencies) for further analysis.