This package provides metrics for evaluating generated text. So far, only the BLEU (bilingual evaluation understudy) score, introduced by Papineni et al., 2002, is available. The library is implemented in ‘R’ and ‘C++’. The metrics operate on pre-tokenized input, so they evaluate lists of tokenized sequences. This package is inspired by the ‘NLTK’ and ‘sacrebleu’ implementations for ‘Python’.
The BLEU score is a metric used to evaluate the quality of machine-generated texts by comparing them to reference texts. It is calculated from the precision of n-grams, which are contiguous sequences of n items, typically words.
Mathematically, BLEU can be expressed as follows:
\[ BLEU = \text{{BP}} \times \exp\left(\sum_{n=1}^{N} \frac{1}{N} \log \text{{precision}}_n\right) \]
Where:
- \(\text{{BP}}\) is the brevity penalty, which penalizes a candidate text that is shorter than the reference texts. It is 1 if the output is longer than the reference, and \(\exp(1 - \frac{{\text{{reference length}}}}{{\text{{output length}}}})\) otherwise.
- \(N\) is the maximum n-gram order considered in the calculation.
- \(\text{{precision}}_n\) is the precision of n-grams, calculated as the ratio of the number of n-grams in the candidate text that appear in any of the reference texts to the total number of n-grams in the candidate text.
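The way the brevity penalty and the per-order precisions combine can be sketched in a few lines of base R. This is a minimal illustration, not the package's implementation: the helper names `brevity_penalty` and `bleu_from_precisions` are hypothetical, the precisions are made-up numbers, and the sketch assumes the usual convention that the penalty is capped at 1 when the output is longer than the reference.

```r
# Hypothetical helpers for illustration; not part of the package API.
brevity_penalty <- function(cand_len, ref_len) {
  if (cand_len == 0) return(0)        # empty candidate: treat the score as zero
  if (cand_len > ref_len) return(1)   # assumed cap: no penalty for long output
  exp(1 - ref_len / cand_len)
}

bleu_from_precisions <- function(precisions, cand_len, ref_len) {
  # A zero precision makes log() diverge, so the score collapses to zero
  if (any(precisions <= 0)) return(0)
  # mean(log(p)) equals sum over n of (1 / N) * log(p_n) with uniform weights
  brevity_penalty(cand_len, ref_len) * exp(mean(log(precisions)))
}

# Made-up precisions for orders 1..4, candidate of 9 tokens, reference of 10
bleu_from_precisions(c(0.8, 0.6, 0.4, 0.2), cand_len = 9, ref_len = 10)
```

Taking the mean of the log-precisions is the same as the weighted sum in the formula with uniform weights \(\frac{1}{N}\).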
\(\text{{precision}}_n\) is defined as follows: \[ \text{{precision}}_n = \frac{\sum_{c \in \text{Cand}} ngram_{\text{clip}}(c)}{\sum_{c \in \text{Cand}} ngram(c)} \]
Where \(ngram_{\text{clip}}(c)\) is the number of n-grams in candidate \(c\) that also appear in its reference texts, with the count of each n-gram clipped so that it does not exceed the maximum count of that n-gram in any single reference, while \(ngram(c)\) is the total number of n-grams in candidate \(c\). This procedure is repeated for every n-gram order from 1 to N.
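Clipping is easiest to see on a small example. The following base R sketch computes the clipped precision of a single candidate against several references. The helpers `ngrams` and `clipped_precision` are hypothetical names introduced here for illustration, and the sketch assumes the standard clipping rule from Papineni et al., 2002 (each candidate n-gram counts at most as often as it occurs in the reference where it is most frequent). It is not the package's ‘C++’ implementation.

```r
# Hypothetical helpers for illustration; not the package's implementation.

# All n-grams of a token vector, each encoded as a space-separated string
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  vapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

clipped_precision <- function(candidate, references, n) {
  cand_grams <- ngrams(candidate, n)
  if (length(cand_grams) == 0) return(0)
  ref_grams <- lapply(references, ngrams, n = n)
  clipped <- vapply(unique(cand_grams), function(g) {
    cand_count <- sum(cand_grams == g)
    # Highest count of this n-gram in any single reference
    max_ref_count <- max(vapply(ref_grams, function(r) sum(r == g), numeric(1)))
    min(cand_count, max_ref_count)
  }, numeric(1))
  sum(clipped) / length(cand_grams)
}

# Token 1 repeated seven times; it occurs at most twice in one reference
candidate  <- c(1, 1, 1, 1, 1, 1, 1)
references <- list(c(1, 2, 3, 4, 1, 5), c(6, 3, 7, 2, 4, 1, 5))
clipped_precision(candidate, references, n = 1)  # 2 / 7
```

Without clipping, every one of the seven candidate tokens would count as a match and the unigram precision would be 1, which is why the clipped count is used in the numerator.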
In summary, the BLEU score provides a single numerical value indicating the quality of a candidate text, with higher scores indicating better quality.
This package provides two smoothing techniques from Chen and Cherry, 2014. The methods available in this package are ‘floor’ and ‘add-k’.
floor
Each BLEU precision is calculated by dividing the number of matching n-grams by the total number of n-grams in the candidate. In some cases, however, the match count for a given n-gram order is zero, which drives the whole score to zero because of the logarithm. To address this issue, a small value (denoted as \(\epsilon\)) is added to the numerator of the precision calculation when the count is zero.
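The effect of the floor method can be sketched in base R as follows. The helper `floor_smooth`, the example counts, and the default of `epsilon = 0.1` are assumptions made for this illustration and are not taken from the package.

```r
# Hypothetical helper for illustration; not part of the package API.
# `matches` and `totals` hold, for each n-gram order, the clipped match
# count and the total number of candidate n-grams.
floor_smooth <- function(matches, totals, epsilon = 0.1) {
  # Replace a zero numerator by epsilon; non-zero counts stay unchanged
  numerators <- ifelse(matches == 0, epsilon, matches)
  numerators / totals
}

matches <- c(3, 2, 0, 0)   # made-up counts for orders 1..4
totals  <- c(4, 3, 2, 1)

# Unsmoothed: orders 3 and 4 have precision 0, so the geometric mean is 0
exp(mean(log(matches / totals)))

# Smoothed: the zero counts become epsilon / total and the mean is positive
smoothed <- floor_smooth(matches, totals)
exp(mean(log(smoothed)))
```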
add-k
Similar to the motivation behind the ‘floor’ method, the ‘add-k’ smoothing technique adds an integer value (\(k\)) to both the numerator and the denominator of the precision calculation for each n-gram order from 1 to N.
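A minimal base R sketch of this idea is shown below. The helper `add_k_smooth`, the example counts, and the default of `k = 1` are assumptions made for illustration, not the package's own code.

```r
# Hypothetical helper for illustration; not part of the package API.
# `matches` and `totals` hold, for each n-gram order, the clipped match
# count and the total number of candidate n-grams.
add_k_smooth <- function(matches, totals, k = 1) {
  # k is added to numerator and denominator for every n-gram order
  (matches + k) / (totals + k)
}

matches <- c(3, 2, 0, 0)   # made-up counts for orders 1..4
totals  <- c(4, 3, 2, 1)

smoothed <- add_k_smooth(matches, totals, k = 1)
smoothed                   # (3+1)/(4+1), (2+1)/(3+1), (0+1)/(2+1), (0+1)/(1+1)
exp(mean(log(smoothed)))   # geometric mean is now strictly positive
```

Unlike the floor method, this changes every precision, not only those with a zero match count.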
library(sacRebleu)
cand_corpus <- list(c(1,2,3), c(1,2))  # two tokenized candidates
ref_corpus <- list(list(c(1,2,3), c(2,3,4)), list(c(1,2,6), c(781, 21, 9), c(7, 3)))  # one list of references per candidate
bleu_corpus_ids_standard <- bleu_corpus_ids(ref_corpus, cand_corpus)
Here, the text is already tokenized and represented as integers in the ‘cand_corpus’ and ‘ref_corpus’ lists. For tokenization, the ‘tok’ package is recommended. It is also possible to pass raw text by using the ‘bleu_corpus’ or ‘bleu_sentence’ functions.