Why textmineR?

textmineR was created with three principles in mind:

Maximize interoperability within R’s ecosystem
Scalable in terms of object storage and computation time
Syntax that is idiomatic to R

R has many packages for text mining and natural language processing (NLP). The CRAN task view on natural language processing lists 53 unique packages. Some of these packages are interoperable. Some are not.

textmineR strives for maximum interoperability in three ways. First, it uses the dgCMatrix class from the popular Matrix package for document term matrices (DTMs) and term co-occurrence matrices (TCMs). The Matrix package is an R “recommended” package, with nearly 500 packages that depend on, import, or suggest it. Compare that to the slam package used by tm and its derivatives: slam has an order of magnitude fewer dependents and is simply not as well integrated. Matrix also has methods that make the syntax for manipulating its matrices nearly identical to that of base R matrices. This greatly reduces the cognitive burden on programmers.
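As a minimal sketch of that last point, the toy matrix below (made-up counts, not the movie review data) is built directly with Matrix::sparseMatrix and then handled with ordinary base-R-style calls. The object name dtm_toy is hypothetical and used only for illustration.

```r
library(Matrix)

# Toy DTM with made-up counts: 3 documents (rows) by 4 terms (columns)
dtm_toy <- sparseMatrix(
  i = c(1, 1, 2, 3, 3),   # row (document) index of each non-zero entry
  j = c(1, 2, 2, 3, 4),   # column (term) index of each non-zero entry
  x = c(2, 1, 3, 1, 1),   # the counts themselves
  dims = c(3, 4),
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("film", "good", "bad", "plot"))
)

class(dtm_toy)    # a dgCMatrix
colSums(dtm_toy)  # term frequencies, same call as for a base matrix
rowSums(dtm_toy)  # document lengths

# base-R style subsetting: keep terms that appear in more than one document
kept <- dtm_toy[ , colSums(dtm_toy > 0) > 1, drop = FALSE]
colnames(kept)
```

Nothing here is specific to textmineR; any package that accepts a dgCMatrix can consume an object like this.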

Second, textmineR relies on base R objects for corpus and metadata storage. Actually, it relies on the user to do so. textmineR’s core functions CreateDtm and CreateTcm take a simple character vector as input. Users may store their corpora as character vectors, lists, or data frames. There is no need to learn a new ‘Corpus’ class.

Third and last, textmineR represents the output of topic models in a consistent way, a list containing two matrices. This is described in more detail in the next section. Several topic models are supported and the simple representation means that textmineR’s utility functions are usable with outputs from other packages, so long as they are represented as matrices of probabilities. (Again, see the next section for more detail.)

textmineR achieves scalability through three means. First, sparse matrices (like the dgCMatrix) offer significant memory savings. Second, textmineR utilizes Rcpp throughout for speedup. Finally, textmineR uses parallel processing by default where possible. textmineR offers a function, TmParallelApply, which implements a framework for parallel processing that is syntactically agnostic between Windows and Unix-like operating systems. TmParallelApply is used liberally within textmineR and is exposed for users.
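The memory savings from sparse storage can be illustrated with a rough sketch. The dimensions and number of non-zero entries below are arbitrary assumptions chosen to mimic a sparse DTM, not measurements from a real corpus.

```r
library(Matrix)

set.seed(1)
n_docs  <- 1000    # assumed number of documents
n_terms <- 2000    # assumed vocabulary size
n_nonzero <- 10000 # assumed number of (document, term) pairs with a count

# random positions for the non-zero counts; repeated positions are summed
sparse_dtm <- sparseMatrix(
  i = sample(n_docs, n_nonzero, replace = TRUE),
  j = sample(n_terms, n_nonzero, replace = TRUE),
  x = rep(1, n_nonzero),
  dims = c(n_docs, n_terms)
)

# the same data stored as an ordinary dense matrix
dense_dtm <- as.matrix(sparse_dtm)

format(object.size(sparse_dtm), units = "Kb")
format(object.size(dense_dtm), units = "Kb")
```

The dense copy stores every zero explicitly, so its size grows with documents times terms, while the sparse copy grows with the number of non-zero entries.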

textmineR does trade some performance for syntactic simplicity. textmineR is designed to run on a single node in a cluster computing environment. It can (and will by default) use all available cores of that node. If performance is your number one concern, see text2vec. textmineR itself uses some text2vec functionality under the hood.

textmineR strives for syntax that is idiomatic to R. This is, admittedly, a nebulous concept. textmineR does not create new classes where existing R classes exist. It strives for a functional programming paradigm. And it attempts to group closely-related sequential steps into single functions. This means that users will not have to make several temporary objects along the way. As an example, compare making a document term matrix in textmineR (example below) with tm or text2vec.

As a side note: textmineR’s framework for NLP does not need to be exclusive to textmineR. Text mining packages in R can be interoperable with a few concepts. First, use dgCMatrix for DTMs and TCMs. Second, write most text mining models in a way that they can take a dgCMatrix as the input. Finally, keep non-base R classes to a minimum, especially for corpus and metadata management.

colnames(dtm)
aaa
aaa_ball
aaaaatch
aaaaatch_kah
aage
aage_haugland

rownames(dtm)
4273_1
7112_4
1891_3
6252_10
9929_2
8970_10

Basic corpus statistics

The code below computes some basic corpus statistics. textmineR has a built-in function, TermDocFreq, for getting term frequencies across the corpus. It returns term frequencies (equivalent to colSums(dtm)), the number of documents in which each term appears (equivalent to colSums(dtm > 0)), and an inverse document frequency (IDF) vector. The IDF vector can be used to create a TF-IDF matrix.


# get counts of tokens across the corpus
tf_mat <- TermDocFreq(dtm = dtm)

str(tf_mat) 
#> tibble [56,023 × 4] (S3: tbl_df/tbl/data.frame)
#>  $ term     : chr [1:56023] "aaa" "aaa_ball" "aaaaatch" "aaaaatch_kah" ...
#>  $ term_freq: num [1:56023] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ doc_freq : int [1:56023] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ idf      : num [1:56023] 6.21 6.21 6.21 6.21 6.21 ...

# look at the most frequent tokens
head(tf_mat[ order(tf_mat$term_freq, decreasing = TRUE) , ], 10)

Ten most frequent tokens
term	term_freq	doc_freq	idf
br	1990	289	0.5481814
br_br	999	289	0.5481814
film	909	294	0.5310283
movie	774	290	0.5447272
good	337	209	0.8722738
time	252	167	1.0966143
people	238	143	1.2517635
story	230	143	1.2517635
great	181	125	1.3862944
bad	176	121	1.4188176
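The three quantities reported by TermDocFreq can be reproduced by hand on a toy matrix, which also shows one way to build a TF-IDF matrix from the IDF vector. This is a sketch, not textmineR's implementation. The IDF formula log(N / doc_freq) is an assumption that happens to match the output above (for example, log(500 / 289) is about 0.548), and dtm_toy is a hypothetical object with made-up counts.

```r
library(Matrix)

# Toy DTM with made-up counts: 3 documents by 4 terms
dtm_toy <- sparseMatrix(
  i = c(1, 1, 2, 3, 3),
  j = c(1, 2, 2, 3, 4),
  x = c(2, 1, 3, 1, 1),
  dims = c(3, 4),
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("film", "good", "bad", "plot"))
)

term_freq <- colSums(dtm_toy)      # total count of each term
doc_freq  <- colSums(dtm_toy > 0)  # number of documents containing each term

# assumed IDF definition: natural log of (number of documents / doc_freq)
idf <- log(nrow(dtm_toy) / doc_freq)

# TF-IDF: scale each term's column by its IDF.
# t(dtm_toy) puts terms in rows, so idf recycles one value per term.
tfidf <- t(t(dtm_toy) * idf)

data.frame(term = colnames(dtm_toy), term_freq, doc_freq, idf)
```

The result stays sparse, so the same approach is usable on the full DTM.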

# look at the most frequent bigrams
tf_bigrams <- tf_mat[ stringr::str_detect(tf_mat$term, "_") , ]

head(tf_bigrams[ order(tf_bigrams$term_freq, decreasing = TRUE) , ], 10)

Ten most frequent bi-grams
term	term_freq	doc_freq	idf
br_br	999	289	0.5481814
br_film	41	35	2.6592600
br_movie	34	31	2.7806209
special_effects	28	24	3.0365543
film_br	20	17	3.3813948
good_movie	18	17	3.3813948
low_budget	18	16	3.4420194
waste_time	17	15	3.5065579
movie_br	15	15	3.5065579
movie_good	15	14	3.5755508

It looks like we have stray HTML tags (“<br>”) in the documents. These aren’t giving us any relevant information about content. (Except, perhaps, that these documents were originally part of web pages.)

The most intuitive approach, perhaps, is to strip these tags from our documents, re-construct the document term matrix, and re-calculate the objects above. However, a simpler approach is to remove the “br” token, and every bigram that includes it, from the DTM we already calculated. This is much more computationally efficient and gives us the same result anyway.

# remove offending tokens from the DTM
dtm <- dtm[ , ! stringr::str_detect(colnames(dtm),
                                    "(^br$)|(_br$)|(^br_)") ]

# re-construct tf_mat and tf_bigrams
tf_mat <- TermDocFreq(dtm)

tf_bigrams <- tf_mat[ stringr::str_detect(tf_mat$term, "_") , ]

head(tf_mat[ order(tf_mat$term_freq, decreasing = TRUE) , ], 10)
#> # A tibble: 10 x 4
#>    term      term_freq doc_freq   idf
#>    <chr>         <dbl>    <int> <dbl>
#>  1 film            909      294 0.531
#>  2 movie           774      290 0.545
#>  3 good            337      209 0.872
#>  4 time            252      167 1.10 
#>  5 people          238      143 1.25 
#>  6 story           230      143 1.25 
#>  7 great           181      125 1.39 
#>  8 bad             176      121 1.42 
#>  9 made            170      140 1.27 
#> 10 character       169      108 1.53

Ten most frequent terms, ‘<br>’ removed
term	term_freq	doc_freq	idf
film	909	294	0.5310283
movie	774	290	0.5447272
good	337	209	0.8722738
time	252	167	1.0966143
people	238	143	1.2517635
story	230	143	1.2517635
great	181	125	1.3862944
bad	176	121	1.4188176
made	170	140	1.2729657
character	169	108	1.5324769

head(tf_bigrams[ order(tf_bigrams$term_freq, decreasing = TRUE) , ], 10)

Ten most frequent bi-grams, ‘<br>’ removed
term	term_freq	doc_freq	idf
special_effects	28	24	3.036554
good_movie	18	17	3.381395
low_budget	18	16	3.442019
waste_time	17	15	3.506558
movie_good	15	14	3.575551
real_life	14	11	3.816713
watch_movie	14	13	3.649659
watching_movie	14	13	3.649659
high_school	13	7	4.268698
horror_films	13	8	4.135167
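To see what the pattern used above does and does not match, it can be tried on a handful of hand-picked tokens. The token list is made up for illustration, and base R's grepl stands in for stringr::str_detect so that the sketch has no dependencies.

```r
# hypothetical tokens: some are "<br>" residue, some merely contain the letters "br"
tokens <- c("br", "br_br", "br_film", "movie_br",
            "brave", "bride_groom", "december", "film")

# same pattern as in the DTM clean-up:
#   ^br$  the unigram "br"
#   _br$  a bigram whose second word is "br"
#   ^br_  a bigram whose first word is "br"
is_br <- grepl("(^br$)|(_br$)|(^br_)", tokens)

tokens[is_br]   # flagged for removal
tokens[!is_br]  # kept
```

Because the pattern is anchored at word boundaries within each token, ordinary words such as “brave” are left alone.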

We can also calculate how many tokens each document contains from the DTM. Note that this reflects the modifications we made in constructing the DTM (removing stop words, punctuation, numbers, etc.).

# summary of document lengths
doc_lengths <- rowSums(dtm)

summary(doc_lengths)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>      20      97     139     188     222     928

Often, it’s useful to prune your vocabulary and remove any tokens that appear in only a small number of documents. This will greatly reduce the vocabulary size (see Zipf’s law) and improve computation time.

# remove any tokens that were in 3 or fewer documents
dtm <- dtm[ , colSums(dtm > 0) > 3 ] # alternatively: dtm[ , tf_mat$doc_freq > 3 ]

tf_mat <- tf_mat[ tf_mat$term %in% colnames(dtm) , ]

tf_bigrams <- tf_bigrams[ tf_bigrams$term %in% colnames(dtm) , ]
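Pruning by document frequency is not the same as pruning by term frequency, since a term can occur many times inside a single document. The sketch below uses a hypothetical two-term DTM with made-up counts to show the difference.

```r
library(Matrix)

# Toy DTM: "rare" occurs 5 times but only in doc1;
# "common" occurs once in each of 4 documents
dtm_toy <- sparseMatrix(
  i = c(1, 1, 2, 3, 4),
  j = c(1, 2, 2, 2, 2),
  x = c(5, 1, 1, 1, 1),
  dims = c(4, 2),
  dimnames = list(paste0("doc", 1:4), c("rare", "common"))
)

term_freq <- colSums(dtm_toy)      # total occurrences per term
doc_freq  <- colSums(dtm_toy > 0)  # documents containing each term

# threshold on document frequency: only "common" survives
keep_by_doc_freq <- colnames(dtm_toy)[doc_freq > 3]

# threshold on term frequency: both terms survive
keep_by_term_freq <- colnames(dtm_toy)[term_freq > 3]

keep_by_doc_freq
keep_by_term_freq
```

Which threshold is appropriate depends on the goal; the comment in the pruning code above describes a document-frequency cut.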

The movie review data set contains more than just the text of reviews. It also contains a variable tagging each review as positive (movie_review$sentiment == 1) or negative (movie_review$sentiment == 0). We can examine terms associated with positive and negative reviews. If we wanted, we could use them to build a simple classifier.

However, as we will see immediately below, looking at only the most frequent terms in each category is not helpful. Because of Zipf’s law, the most frequent terms in just about any category will be the same.

# what words are most associated with sentiment?
tf_sentiment <- list(positive = TermDocFreq(dtm[ movie_review$sentiment == 1 , ]),
                     negative = TermDocFreq(dtm[ movie_review$sentiment == 0 , ]))

These are basically the same. Not helpful at all.

head(tf_sentiment$positive[ order(tf_sentiment$positive$term_freq, decreasing = TRUE) , ], 10)

Ten most-frequent positive tokens
term	term_freq	doc_freq	idf
film	483	151	0.4591837
movie	340	130	0.6089291
good	180	106	0.8130245
great	142	93	0.9438641
time	138	88	0.9991267
story	124	75	1.1589754
people	107	65	1.3020763
films	102	56	1.4511119
love	82	53	1.5061716
made	81	67	1.2717709

head(tf_sentiment$negative[ order(tf_sentiment$negative$term_freq, decreasing = TRUE) , ], 10)

Ten most-frequent negative tokens
term	term_freq	doc_freq	idf
movie	434	160	0.4893466
film	426	143	0.6016758
good	157	103	0.9297914
people	131	78	1.2078116
bad	130	82	1.1578012
time	114	79	1.1950726
story	106	68	1.3450127
character	95	60	1.4701758
made	89	73	1.2740610
make	86	63	1.4213857

That was unhelpful. Instead, we need to re-weight the terms in each class. We’ll use a probabilistic reweighting, described below.

The frequencies of the most frequent words in each class are proportional to $P(word|sentiment_j)$. As we saw above, sorting by that quantity puts the words in nearly the same order as $P(word)$ overall. However, we can use the difference in those probabilities to get a new order. That difference is

\[ P(word|sentiment_j) - P(word) \tag{1} \]

You can interpret the difference in (1) as follows: words with positive values are more probable in the sentiment class than in the corpus overall, and words with negative values are less probable. Values close to zero indicate words that are approximately statistically independent of sentiment. Since most of the top words are the same when we sort by $P(word|sentiment_j)$, these words are close to independent of sentiment, so their scores are pushed towards zero.

For those paying close attention, this difference should give a similar ordering to pointwise mutual information (PMI), defined as $PMI = \log \frac{P(word|sentiment_j)}{P(word)}$. However, I prefer the difference as it is bounded between $-1$ and $1$.
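A small worked example may make the two scores concrete. The counts below are invented for illustration: “film” is equally common in both classes, while “great” and “bad” each lean towards one class.

```r
# made-up word counts per sentiment class
counts <- rbind(positive = c(film = 50, great = 30, bad = 5),
                negative = c(film = 50, great = 5,  bad = 30))

# P(word): share of each word in the whole corpus
p_word <- colSums(counts) / sum(counts)

# P(word | sentiment_j): share of each word within its class
# (dividing by rowSums recycles one total per row)
p_word_given_class <- counts / rowSums(counts)

# the difference P(word | sentiment_j) - P(word)
prob_diff <- sweep(p_word_given_class, 2, p_word, "-")

# PMI: log of the ratio of the same two probabilities
pmi <- log(sweep(p_word_given_class, 2, p_word, "/"))

round(prob_diff, 3)
round(pmi, 3)
```

In this toy case “film” scores zero under both measures, and the two measures agree on the sign for every word. The difference stays within $-1$ and $1$, whereas PMI is unbounded below as a conditional probability approaches zero.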

The difference method is applied to both words overall and bi-grams in the code below.


# let's reweight by probability by class
p_words <- colSums(dtm) / sum(dtm) # alternatively: tf_mat$term_freq / sum(tf_mat$term_freq)

tf_sentiment$positive$conditional_prob <- 
  tf_sentiment$positive$term_freq / sum(tf_sentiment$positive$term_freq)

tf_sentiment$positive$prob_lift <- tf_sentiment$positive$conditional_prob - p_words

tf_sentiment$negative$conditional_prob <- 
  tf_sentiment$negative$term_freq / sum(tf_sentiment$negative$term_freq)

tf_sentiment$negative$prob_lift <- tf_sentiment$negative$conditional_prob - p_words

# let's look again with new weights
head(tf_sentiment$positive[ order(tf_sentiment$positive$prob_lift, decreasing = TRUE) , ], 10)

Reweighted: ten most relevant terms for positive sentiment
term	term_freq	doc_freq	idf	conditional_prob	prob_lift
great	142	93	0.9438641	0.0084009	0.0029998
film	483	151	0.4591837	0.0285748	0.0014502
love	82	53	1.5061716	0.0048512	0.0012406
films	102	56	1.4511119	0.0060344	0.0010810
performance	55	32	2.0107276	0.0032539	0.0008667
role	52	38	1.8388774	0.0030764	0.0008384
charlie	28	3	4.3778513	0.0016565	0.0007911
excellent	36	31	2.0424763	0.0021298	0.0006975
performances	34	21	2.4319411	0.0020115	0.0006687
time	138	88	0.9991267	0.0081642	0.0006445

head(tf_sentiment$negative[ order(tf_sentiment$negative$prob_lift, decreasing = TRUE) , ], 10)

Reweighted: ten most relevant terms for negative sentiment
term	term_freq	doc_freq	idf	conditional_prob	prob_lift
movie	434	160	0.4893466	0.0261304	0.0030342
bad	130	82	1.1578012	0.0078271	0.0025752
worst	47	39	1.9009588	0.0028298	0.0012483
pretty	54	41	1.8509483	0.0032512	0.0010729
poor	42	34	2.0381599	0.0025287	0.0009771
guy	56	28	2.2323159	0.0033717	0.0009248
part	57	43	1.8033203	0.0034319	0.0008955
black	41	20	2.5687881	0.0024685	0.0008870
minutes	44	32	2.0987845	0.0026492	0.0008588
stupid	31	24	2.3864666	0.0018665	0.0008519

# what about bi-grams?
tf_sentiment_bigram <- lapply(tf_sentiment, function(x){
  x <- x[ stringr::str_detect(x$term, "_") , ]
  x[ order(x$prob_lift, decreasing = TRUE) , ]
})

head(tf_sentiment_bigram$positive, 10)

Reweighted: ten most relevant bigrams for positive sentiment
term	term_freq	doc_freq	idf	conditional_prob	prob_lift
makes_movie	10	8	3.397022	0.0005916	0.0002932
good_movie	14	14	2.837406	0.0008283	0.0002911
high_school	11	5	3.867026	0.0006508	0.0002629
star_wars	9	5	3.867026	0.0005324	0.0002340
great_movie	8	8	3.397022	0.0004733	0.0002047
highly_recommended	6	6	3.684704	0.0003550	0.0001759
great_film	8	7	3.530553	0.0004733	0.0001749
movie_great	8	7	3.530553	0.0004733	0.0001749
academy_award	5	4	4.090169	0.0002958	0.0001466
acting_great	5	5	3.867026	0.0002958	0.0001466

head(tf_sentiment_bigram$negative, 10)

Reweighted: ten most relevant bigrams for negative sentiment
term	term_freq	doc_freq	idf	conditional_prob	prob_lift
waste_time	16	14	2.925463	0.0009633	0.0004561
special_effects	20	16	2.791932	0.0012042	0.0003686
movie_made	10	10	3.261935	0.0006021	0.0002440
make_sense	9	6	3.772761	0.0005419	0.0002136
scooby_doo	8	4	4.178226	0.0004817	0.0002131
worst_movie	7	6	3.772761	0.0004215	0.0002126
part_movie	8	6	3.772761	0.0004817	0.0001833
read_book	8	8	3.485079	0.0004817	0.0001833
good_idea	6	6	3.772761	0.0003612	0.0001822
main_character	8	6	3.772761	0.0004817	0.0001534

1. Start here

Thomas W. Jones

2021-06-27
