The gutenbergr package helps you download and process public domain works from the Project Gutenberg collection. It provides both tools for downloading books (and stripping header/footer information) and a complete dataset of Project Gutenberg metadata that can be used to find works of interest. It includes:
- gutenberg_download(), which downloads one or more works from Project Gutenberg by ID: e.g., gutenberg_download(84) downloads the text of Frankenstein.
- gutenberg_metadata, which contains information about each work, pairing Gutenberg ID with title, author, language, etc.
- gutenberg_authors, which contains information about each author, such as aliases and birth/death year.
- gutenberg_subjects, which contains pairings of works with Library of Congress subjects and topics.

This package contains metadata for all Project Gutenberg works as R datasets, so that you can search and filter for particular works before downloading.
The dataset gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc.:
library(gutenbergr)
gutenberg_metadata
## Source: local data frame [51,997 x 8]
##
## gutenberg_id
## (int)
## 1 0
## 2 1
## 3 2
## 4 3
## 5 4
## 6 5
## 7 6
## 8 7
## 9 8
## 10 9
## .. ...
## Variables not shown: title (chr), author (chr), gutenberg_author_id (int),
## language (chr), gutenberg_bookshelf (chr), rights (chr), has_text (lgl)
For example, you could find the Gutenberg ID of Wuthering Heights by doing:
library(dplyr)
gutenberg_metadata %>%
  filter(title == "Wuthering Heights")
## Source: local data frame [1 x 8]
##
## gutenberg_id title author gutenberg_author_id
## (int) (chr) (chr) (int)
## 1 768 Wuthering Heights Brontë, Emily 405
## Variables not shown: language (chr), gutenberg_bookshelf (chr), rights
## (chr), has_text (lgl)
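Exact matching on title misses editions whose titles carry a subtitle or different capitalization. As a sketch, assuming the stringr package is installed, a case-insensitive partial match casts a wider net. The object name wuthering_matches is only an illustrative choice:

```r
library(gutenbergr)
library(dplyr)
library(stringr)

# Case-insensitive partial match on the title column.
# Rows with a missing title evaluate to NA and are dropped by filter().
wuthering_matches <- gutenberg_metadata %>%
  filter(str_detect(title, regex("wuthering heights", ignore_case = TRUE))) %>%
  select(gutenberg_id, title, author, language, has_text)

wuthering_matches
```

The language and has_text columns are kept so that translations and entries without downloadable text are easy to spot before choosing an ID.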
In many analyses, you may want to filter just for English works, avoid duplicates, and include only books that have text that can be downloaded. The gutenberg_works() function does this pre-filtering:
gutenberg_works()
## Source: local data frame [40,737 x 8]
##
## gutenberg_id
## (int)
## 1 0
## 2 1
## 3 2
## 4 3
## 5 4
## 6 5
## 7 6
## 8 7
## 9 8
## 10 9
## .. ...
## Variables not shown: title (chr), author (chr), gutenberg_author_id (int),
## language (chr), gutenberg_bookshelf (chr), rights (chr), has_text (lgl)
It also allows you to perform filtering as an argument:
gutenberg_works(author == "Austen, Jane")
## Source: local data frame [10 x 8]
##
## gutenberg_id
## (int)
## 1 105
## 2 121
## 3 141
## 4 158
## 5 161
## 6 946
## 7 1212
## 8 1342
## 9 31100
## 10 42078
## Variables not shown: title (chr), author (chr), gutenberg_author_id (int),
## language (chr), gutenberg_bookshelf (chr), rights (chr), has_text (lgl)
# or with a regular expression
library(stringr)
gutenberg_works(str_detect(author, "Austen"))
## Source: local data frame [13 x 8]
##
## gutenberg_id
## (int)
## 1 105
## 2 121
## 3 141
## 4 158
## 5 161
## 6 946
## 7 1212
## 8 1342
## 9 17797
## 10 31100
## 11 33513
## 12 39897
## 13 42078
## Variables not shown: title (chr), author (chr), gutenberg_author_id (int),
## language (chr), gutenberg_bookshelf (chr), rights (chr), has_text (lgl)
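To make the pre-filtering done by gutenberg_works() more concrete, here is a rough dplyr approximation of the three conditions it is described as applying (English only, text available, no duplicates). This is a sketch of the idea, not the package's actual implementation. Treating two rows as duplicates when they share a title and a gutenberg_author_id is an assumption made for illustration, so the row count need not match gutenberg_works() exactly:

```r
library(gutenbergr)
library(dplyr)

# Approximate the pre-filtering by hand (illustrative only):
#   1. keep English-language works,
#   2. keep works whose text can be downloaded,
#   3. keep one row per title/author pair (assumed definition of "duplicate").
english_with_text <- gutenberg_metadata %>%
  filter(language == "en", has_text) %>%
  distinct(title, gutenberg_author_id, .keep_all = TRUE)

english_with_text
```

In everyday use, calling gutenberg_works() directly is simpler and stays in step with the package's own definition of these filters.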
The metadata currently in the package was last updated on 05 May 2016.
The function gutenberg_download() downloads one or more works from Project Gutenberg based on their ID. For example, we saw earlier that “Wuthering Heights” has ID 768 (the ID also appears in the book's Project Gutenberg URL), so gutenberg_download(768) downloads this text.
wuthering_heights <- gutenberg_download(768)
wuthering_heights
## Source: local data frame [12,085 x 2]
##
## gutenberg_id
## (int)
## 1 768
## 2 768
## 3 768
## 4 768
## 5 768
## 6 768
## 7 768
## 8 768
## 9 768
## 10 768
## .. ...
## Variables not shown: text (chr)
Notice that it is returned as a tbl_df (a type of data frame) with two variables: gutenberg_id (useful if multiple books are returned) and text, a character vector holding the text, one row per line. Notice also that the header and footer added by Project Gutenberg have been stripped away.
Provide a vector of IDs to download multiple books. For example, to download Jane Eyre (book 1260) along with Wuthering Heights, do:
books <- gutenberg_download(c(768, 1260), meta_fields = "title")
books
## Source: local data frame [32,744 x 3]
##
## gutenberg_id
## (int)
## 1 768
## 2 768
## 3 768
## 4 768
## 5 768
## 6 768
## 7 768
## 8 768
## 9 768
## 10 768
## .. ...
## Variables not shown: text (chr), title (chr)
Notice that the meta_fields argument allows us to add one or more additional fields from gutenberg_metadata, such as title or author, to the downloaded text.
books %>%
  count(title)
## Source: local data frame [2 x 2]
##
## title n
## (chr) (int)
## 1 Jane Eyre: An Autobiography 20659
## 2 Wuthering Heights 12085
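Conceptually, meta_fields behaves like a join between the downloaded lines and gutenberg_metadata on gutenberg_id. The sketch below illustrates that idea without touching the network. Here book_lines is a hypothetical stand-in for a downloaded result, filled with placeholder text rather than real book lines, and the join is an illustration rather than a description of how the package implements the argument:

```r
library(gutenbergr)
library(dplyr)

# Hypothetical stand-in for the output of gutenberg_download(c(768, 1260)):
# the same two columns, with placeholder text instead of real book lines.
book_lines <- tibble(
  gutenberg_id = c(768L, 768L, 1260L),
  text = c("placeholder line 1", "placeholder line 2", "placeholder line 1")
)

# Attach title and author to every line by joining on gutenberg_id.
book_lines_meta <- book_lines %>%
  left_join(
    gutenberg_metadata %>% select(gutenberg_id, title, author),
    by = "gutenberg_id"
  )

book_lines_meta
```

A manual join like this is mainly useful when the text has already been downloaded and another metadata column is needed afterwards.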
You may want to select books based on information other than their title or author, such as their genre or topic. gutenberg_subjects contains pairings of works with Library of Congress subjects and topics. “lcc” means Library of Congress Classification, while “lcsh” means Library of Congress Subject Headings:
gutenberg_subjects
## Source: local data frame [140,173 x 3]
##
## gutenberg_id subject_type
## (int) (chr)
## 1 1 lcc
## 2 1 lcsh
## 3 1 lcsh
## 4 1 lcc
## 5 2 lcc
## 6 2 lcsh
## 7 2 lcsh
## 8 2 lcc
## 9 3 lcsh
## 10 3 lcsh
## .. ... ...
## Variables not shown: subject (chr)
This is useful for extracting texts on a particular topic or in a particular genre, such as detective stories, or about a particular character, such as Sherlock Holmes. The gutenberg_id column can then be used to download these texts or to link with other metadata.
gutenberg_subjects %>%
  filter(subject == "Detective and mystery stories")
## Source: local data frame [521 x 3]
##
## gutenberg_id subject_type subject
## (int) (chr) (chr)
## 1 170 lcsh Detective and mystery stories
## 2 173 lcsh Detective and mystery stories
## 3 244 lcsh Detective and mystery stories
## 4 305 lcsh Detective and mystery stories
## 5 330 lcsh Detective and mystery stories
## 6 481 lcsh Detective and mystery stories
## 7 547 lcsh Detective and mystery stories
## 8 863 lcsh Detective and mystery stories
## 9 905 lcsh Detective and mystery stories
## 10 1155 lcsh Detective and mystery stories
## .. ... ... ...
gutenberg_subjects %>%
  filter(grepl("Holmes, Sherlock", subject))
## Source: local data frame [47 x 3]
##
## gutenberg_id subject_type
## (int) (chr)
## 1 108 lcsh
## 2 221 lcsh
## 3 244 lcsh
## 4 834 lcsh
## 5 1661 lcsh
## 6 2097 lcsh
## 7 2343 lcsh
## 8 2344 lcsh
## 9 2345 lcsh
## 10 2346 lcsh
## .. ... ...
## Variables not shown: subject (chr)
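As a sketch of linking subjects with other metadata, the matching IDs can be joined against gutenberg_works() to see titles and authors before downloading anything. The names holmes_works and holmes_texts are illustrative, and the download step is left commented out because it needs network access:

```r
library(gutenbergr)
library(dplyr)

# Find works tagged with a Sherlock Holmes subject, then attach their
# title and author from the pre-filtered works table.
holmes_works <- gutenberg_subjects %>%
  filter(grepl("Holmes, Sherlock", subject, fixed = TRUE)) %>%
  distinct(gutenberg_id) %>%   # a work may carry several matching subjects
  inner_join(gutenberg_works(), by = "gutenberg_id") %>%
  select(gutenberg_id, title, author)

holmes_works

# Downloading requires network access, so it is not run here:
# holmes_texts <- gutenberg_download(holmes_works$gutenberg_id,
#                                    meta_fields = "title")
```

Because the join uses gutenberg_works(), works that are not in English or have no downloadable text drop out of the result.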
gutenberg_authors contains information about each author, such as aliases and birth/death year:
gutenberg_authors
## Source: local data frame [16,236 x 7]
##
## gutenberg_author_id author
## (int) (chr)
## 1 1 United States
## 2 3 Lincoln, Abraham
## 3 4 Henry, Patrick
## 4 5 Adam, Paul
## 5 7 Carroll, Lewis
## 6 8 United States. Central Intelligence Agency
## 7 9 Melville, Herman
## 8 10 Barrie, J. M. (James Matthew)
## 9 12 Smith, Joseph, Jr.
## 10 14 Madison, James
## .. ... ...
## Variables not shown: alias (chr), birthdate (int), deathdate (int),
## wikipedia (chr), aliases (chr)
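The gutenberg_author_id column is what ties this table to the works metadata. As a minimal sketch, assuming you want works by authors born in the nineteenth century, the two tables can be joined on that column. The cutoff years and the object names are illustrative choices:

```r
library(gutenbergr)
library(dplyr)

# Authors with a recorded birth year in the 1800s.
# Authors with a missing birthdate are dropped by filter().
born_1800s <- gutenberg_authors %>%
  filter(birthdate >= 1800, birthdate < 1900) %>%
  select(gutenberg_author_id, birthdate, deathdate)

# Keep only works whose author appears in that set.
works_born_1800s <- gutenberg_works() %>%
  inner_join(born_1800s, by = "gutenberg_author_id") %>%
  select(gutenberg_id, title, author, birthdate, deathdate)

works_born_1800s
```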
What’s next after retrieving a book’s text? Having the book as a data frame with one row per line is especially useful for working with the tidytext package for text analysis.
library(tidytext)
words <- books %>%
  unnest_tokens(word, text)
words
## Source: local data frame [305,532 x 3]
##
## gutenberg_id title word
## (int) (chr) (chr)
## 1 768 Wuthering Heights wuthering
## 2 768 Wuthering Heights heights
## 3 768 Wuthering Heights chapter
## 4 768 Wuthering Heights i
## 5 768 Wuthering Heights 1801
## 6 768 Wuthering Heights i
## 7 768 Wuthering Heights have
## 8 768 Wuthering Heights just
## 9 768 Wuthering Heights returned
## 10 768 Wuthering Heights from
## .. ... ... ...
word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE)
word_counts
## Source: local data frame [21,201 x 3]
## Groups: title [2]
##
## title word n
## (chr) (chr) (int)
## 1 Jane Eyre: An Autobiography jane 342
## 2 Jane Eyre: An Autobiography rochester 317
## 3 Jane Eyre: An Autobiography sir 315
## 4 Jane Eyre: An Autobiography miss 310
## 5 Jane Eyre: An Autobiography time 244
## 6 Jane Eyre: An Autobiography day 232
## 7 Jane Eyre: An Autobiography looked 221
## 8 Jane Eyre: An Autobiography night 217
## 9 Jane Eyre: An Autobiography eyes 187
## 10 Jane Eyre: An Autobiography john 184
## .. ... ... ...
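Because the sorted counts are dominated by the longer book, a per-title view is often more useful. The sketch below applies the same tokenize, remove stop words, and count steps, then keeps the most frequent word for each title. To stay runnable without a download it uses toy_books, a hypothetical stand-in with made-up titles and text, and it assumes a dplyr version that provides slice_max():

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for downloaded books: made-up titles and text.
toy_books <- tibble(
  title = c("Book A", "Book A", "Book B"),
  text = c(
    "The storm broke over the moor",
    "The moor was dark",
    "The garden was bright and the garden was quiet"
  )
)

toy_top_words <- toy_books %>%
  unnest_tokens(word, text) %>%              # one row per lowercased word
  anti_join(stop_words, by = "word") %>%     # drop common stop words
  count(title, word, sort = TRUE) %>%        # word frequency within each title
  group_by(title) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>% # most frequent word per title
  ungroup()

toy_top_words
```

With real data, replacing toy_books with the books data frame and raising the slice_max() count gives the top words for each downloaded title.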
You may also find these resources useful:
- The wikipedia column in gutenberg_authors can be matched to Wikipedia content with the WikipediR package or to pageview statistics with the wikipediatrend package.
- For analyses based on author names, the humaniformat package has a format_reverse function for reversing “Last, First” names.