How a quanteda corpus works

A quanteda corpus is designed to store the original texts unchanged: the only processing applied to them is conversion of their encoding to a common format.

A quanteda corpus can store settings, metadata, and document variables, and it can be indexed. It can also be linked to dictionaries, collocation lists, and custom stop words.

Currently available corpus sources

quanteda has tools for getting texts into a corpus from a variety of sources:

From a character object already in memory

The simplest case is to create a corpus from a vector of texts already in memory in R. This gives the advanced R user complete flexibility in the choice of text inputs, as there are almost endless ways to get a vector of texts into R.
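As one illustration of getting a vector of texts into R, the following base R sketch reads a set of plain-text files into a named character vector, one element per file. The directory, file names, and the object name myTexts are hypothetical and exist only for this example; no quanteda functions are involved.

```r
# Hypothetical example: write two small text files to a temporary directory,
# then read each file into one element of a named character vector.
tmpdir <- file.path(tempdir(), "example_texts")
dir.create(tmpdir, showWarnings = FALSE)
writeLines(c("First line of document one.", "Second line."),
           file.path(tmpdir, "doc1.txt"))
writeLines("The only line of document two.",
           file.path(tmpdir, "doc2.txt"))

# List the files and collapse the lines of each one into a single string
files <- list.files(tmpdir, pattern = "\\.txt$", full.names = TRUE)
myTexts <- vapply(files,
                  function(f) paste(readLines(f), collapse = "\n"),
                  character(1))

# Name each text after its file, mirroring the named vector inaugTexts
names(myTexts) <- basename(files)
```

A vector built this way has the same shape as inaugTexts (a named character vector), so it could be passed to the corpus constructor in the same manner.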

If we already have the texts in this form, we can call the corpus constructor function directly. We can demonstrate this on inaugTexts, the built-in character vector of 57 US presidential inaugural speeches.

library(quanteda)  # load the package that provides inaugTexts and corpus()
str(inaugTexts)  # this gives us some information about the object
#>  Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
#>  - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
myCorpus <- corpus(inaugTexts)  # build the corpus
summary(myCorpus, n=5)
#> Corpus consisting of 57 documents, showing 5 documents.
#> 
#>   Text Types Tokens Sentences
#>  text1   594   1431        24
#>  text2    90    135         4
#>  text3   794   2321        37
#>  text4   680   1730        42
#>  text5   776   2166        45
#> 
#> Source:  /private/var/folders/3_/7s7qq3wx08b8htzt5l9sdm6m0000gr/T/RtmpcmNfe7/Rbuild5e7343e166d/quanteda/vignettes/* on x86_64 by kbenoit.
#> Created: Sat Jul 11 09:02:53 2015.
#> Notes:   .

If we wanted, we could add some document-level variables, which quanteda calls docvars, to this corpus. We can do this using R’s substring() function to extract characters from the names of the character vector inaugTexts. Fixed starting and ending positions work with substring() here because these names follow a very regular YYYY-PresidentName format.

docvars(myCorpus, "President") <- substring(names(inaugTexts), 6)
docvars(myCorpus, "Year") <- as.integer(substring(names(inaugTexts), 1, 4))
summary(myCorpus, n=5)
#> Corpus consisting of 57 documents, showing 5 documents.
#> 
#>   Text Types Tokens Sentences  President Year
#>  text1   594   1431        24 Washington 1789
#>  text2    90    135         4 Washington 1793
#>  text3   794   2321        37      Adams 1797
#>  text4   680   1730        42  Jefferson 1801
#>  text5   776   2166        45  Jefferson 1805
#> 
#> Source:  /private/var/folders/3_/7s7qq3wx08b8htzt5l9sdm6m0000gr/T/RtmpcmNfe7/Rbuild5e7343e166d/quanteda/vignettes/* on x86_64 by kbenoit.
#> Created: Sat Jul 11 09:02:53 2015.
#> Notes:   .
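To make the role of the fixed positions concrete, here is a small base R sketch on a hand-typed vector of names in the same YYYY-PresidentName format. It also shows an alternative that splits on the hyphen instead of relying on character positions; that alternative is an illustration added here, not something quanteda requires. The object names are hypothetical.

```r
# A few names in the same format as names(inaugTexts)
docnames <- c("1789-Washington", "1793-Washington", "1797-Adams", "1801-Jefferson")

# Fixed positions: characters 1-4 are the year, character 6 onward is the name
year_fixed      <- as.integer(substring(docnames, 1, 4))
president_fixed <- substring(docnames, 6)

# Alternative: split each name on the literal "-" and pick the parts
parts           <- strsplit(docnames, "-", fixed = TRUE)
year_split      <- as.integer(vapply(parts, `[`, character(1), 1))
president_split <- vapply(parts, `[`, character(1), 2)

# Both approaches agree for names in this regular format
stopifnot(identical(year_fixed, year_split),
          identical(president_fixed, president_split))
```

The fixed-position approach depends on every year having exactly four digits, while the splitting approach depends on each name containing exactly one hyphen.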

If we wanted to tag each document with additional metadata that is not a document variable of interest for analysis, but rather an attribute of the document that we need to record, we could add that to our corpus as well.

metadoc(myCorpus, "language") <- "english"
metadoc(myCorpus, "docsource")  <- paste("inaugTexts", 1:ndoc(myCorpus), sep="_")
summary(myCorpus, n=5, showmeta=TRUE)
#> Corpus consisting of 57 documents, showing 5 documents.
#> 
#>   Text Types Tokens Sentences  President Year _language   _docsource
#>  text1   594   1431        24 Washington 1789   english inaugTexts_1
#>  text2    90    135         4 Washington 1793   english inaugTexts_2
#>  text3   794   2321        37      Adams 1797   english inaugTexts_3
#>  text4   680   1730        42  Jefferson 1801   english inaugTexts_4
#>  text5   776   2166        45  Jefferson 1805   english inaugTexts_5
#> 
#> Source:  /private/var/folders/3_/7s7qq3wx08b8htzt5l9sdm6m0000gr/T/RtmpcmNfe7/Rbuild5e7343e166d/quanteda/vignettes/* on x86_64 by kbenoit.
#> Created: Sat Jul 11 09:02:53 2015.
#> Notes:   .

The metadoc command used above allows you to define your own document metadata fields. Note that in assigning just the single value "english", R has recycled the value until it matches the number of documents in the corpus. In creating a simple tag for our custom metadoc field docsource, we used the quanteda function ndoc() to retrieve the number of documents in our corpus. This function is deliberately designed to work in a way similar to functions you may already use in R, such as nrow() and ncol().
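The recycling behaviour and the nrow() analogy can be seen in plain R without a corpus at all. In this sketch a small data frame stands in for the per-document table, and nrow() plays the role that ndoc() plays for a corpus. The data frame and its contents are hypothetical.

```r
# A stand-in for three documents; not a quanteda object
docs <- data.frame(text = c("first text", "second text", "third text"),
                   stringsAsFactors = FALSE)

# Assigning a single value recycles it to one value per row
docs$language <- "english"

# Build a per-document tag from the row count, as ndoc() was used above
docs$docsource <- paste("inaugTexts", seq_len(nrow(docs)), sep = "_")
```

After these assignments every row carries the same language value, and each row has its own numbered docsource tag.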

Tools for handling corpus objects

Adding two corpus objects together

The + operator provides a simple method for concatenating two corpus objects. If they contain different sets of document-level variables, these will be stitched together in a fashion that guarantees that no information is lost. Corpus-level metadata is also concatenated.

library(quanteda)
mycorpus1 <- corpus(inaugTexts[1:5], note="First five inaug speeches")
mycorpus2 <- corpus(inaugTexts[6:10], note="Next five inaug speeches")
mycorpus3 <- mycorpus1 + mycorpus2
summary(mycorpus3)
#> Corpus consisting of 10 documents.
#> 
#>    Text Types Tokens Sentences
#>   text1   594   1431        24
#>   text2    90    135         4
#>   text3   794   2321        37
#>   text4   680   1730        42
#>   text5   776   2166        45
#>  text11   521   1177        21
#>  text21   518   1211        33
#>  text31   980   3378       121
#>  text41  1202   4476       131
#>  text51   963   2916        74
#> 
#> Source:  Combination of corpuses mycorpus1 and mycorpus2.
#> Created: Sat Jul 11 09:02:53 2015.
#> Notes:   First five inaug speeches Next five inaug speeches.

Extracting a subset of a corpus

subset

Indexing a corpus

Coming soon

Managing settings in a corpus

Coming soon

Redefining document units

segment

changeunits

Methods for analyzing a corpus directly

Getting simple information

print

summary

ndoc and nfeature

Extracting data

texts, docvars, metacorpus, and metadoc

Exploring a corpus

kwic

Dispersion plots – coming soon.

Operations on the corpus texts

Creating a corpus fr

Often, texts aren’t available as pre-made R character vectors, and we need to load them from an external source. To do this, we first create a source for the documents, which defines how they are loaded into the corpus. The source may be a character vector, a directory of text files, a zip file, a Twitter search, or one of several external package formats such as tm’s VCorpus.

Once a source has been defined, we make a new corpus by calling the corpus constructor with the source as the first argument. The corpus constructor also accepts arguments that set some of the corpus metadata and define how the document variables are populated.

From a directory of files

A very common source of files for creating a corpus will be a set of text files found in a local (or remote) directory. To load texts in this way, we first define a source for the directory and then pass this source as an argument to the corpus constructor. We create a directory source by calling the textfile function.

# Basic file import from directory
d <- textfile('~/Dropbox/QUANTESS/corpora/inaugural/*.txt')
myCorpus <- corpus(d)

If the document variables are specified in the filenames of the texts, we can read them by setting the docvarsfrom argument (docvarsfrom = "filenames") and specifying how the filenames are formatted with the sep argument. For example, if the inaugural address texts were stored on disk in the format Year-President.txt (e.g. 1973-Nixon.txt), then we can load them and automatically populate the document variables. The docvarnames argument sets the names of the document variables; it must contain one name for each part of the filename.

# File import reading document variables from filenames
d <- textfile('~/Dropbox/QUANTESS/corpora/inaugural/*.txt')

# In this example the format of the filenames is `Year-President.txt`. 
# Because there are two variables in the filename, docvarnames must contain two names
myCorpus <- corpus(d, docvarsfrom="filenames", sep="-", docvarnames=c("Year", "President") )
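The idea behind reading document variables from filenames can be sketched in base R alone. The code below is only an illustration of the parsing step under the Year-President.txt naming convention. It is not how quanteda implements docvarsfrom, and the file paths and object names are hypothetical.

```r
# Hypothetical file paths following the Year-President.txt convention
filenames <- c("inaugural/1969-Nixon.txt",
               "inaugural/1973-Nixon.txt",
               "inaugural/1977-Carter.txt")

# Drop the directory and the .txt extension, then split on the separator
stems <- tools::file_path_sans_ext(basename(filenames))
parts <- strsplit(stems, "-", fixed = TRUE)

# Every filename must yield exactly one part per variable name
varnames <- c("Year", "President")
stopifnot(all(lengths(parts) == length(varnames)))

# Bind the parts into a data frame with one row per file
docvars_df <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(docvars_df) <- varnames
docvars_df$Year <- as.integer(docvars_df$Year)
```

The check on lengths(parts) mirrors the requirement that the number of variable names match the number of parts in each filename.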