Classification performance metrics and indices

Luciana Nieto & Adrian Correndo

2023-04-13

Description

The metrica package compiles more than 80 functions to assess regression (continuous) and classification (categorical) prediction performance from multiple perspectives.

For classification (binomial and multinomial) tasks, it includes a function to visualize the confusion matrix using ggplot2, and 27 functions of prediction scores including: accuracy, error rate, precision, recall, specificity, balanced accuracy (balacc), F-score (fscore), adjusted F-score (agf), G-mean (gmean), Bookmaker Informedness (bmi, a.k.a. Youden’s J-index), Markedness (deltap), Matthews Correlation Coefficient (mcc), Cohen’s Kappa (khat), negative predictive value (npv), positive and negative likelihood ratios (posLr, negLr), diagnostic odds ratio (dor), prevalence (preval), prevalence threshold (preval_t), critical success index (csi, a.k.a. threat score), false positive rate (FPR), false negative rate (FNR), false detection rate (FDR), false omission rate (FOR), and area under the ROC curve (AUC_roc).

For supervised models, always keep in mind the concept of “cross-validation”: predicted values should ideally come from held-out samples (observations that were not used to train the model). Evaluating predictions on the same data used for training tends to overestimate the prediction performance.
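As an illustration of that idea, the following is a minimal sketch of k-fold cross-validation in base R. The simulated data, the logistic model, the number of folds, and the 0.5 cutoff are all assumptions made for the example; they are not part of metrica.

```r
# Minimal k-fold cross-validation sketch (simulated data, illustrative names)
set.seed(42)
n  <- 200
x  <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(1.5 * x))  # binary outcome driven by x
df <- data.frame(x = x, y = y)

k        <- 5
fold     <- sample(rep(seq_len(k), length.out = n))  # random fold assignment
oof_pred <- integer(n)                               # out-of-fold predictions

for (i in seq_len(k)) {
  train <- df[fold != i, ]
  test  <- df[fold == i, ]
  fit   <- glm(y ~ x, data = train, family = binomial())
  prob  <- predict(fit, newdata = test, type = "response")
  oof_pred[fold == i] <- as.integer(prob > 0.5)      # 0.5 cutoff is an arbitrary choice
}

# Each prediction comes from a model that never saw that observation
cv_accuracy <- mean(oof_pred == df$y)
cv_accuracy
```

The pair of vectors df$y and oof_pred is the kind of observed and predicted input that the classification metrics are meant to evaluate.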

Using the functions

There are two basic arguments common to all metrica functions: (i) obs (Oi; observed, a.k.a. actual, measured, truth, target, label), and (ii) pred (Pi; predicted, a.k.a. simulated, fitted, modeled, estimate) values.

Optional arguments include data, which lets the user pass an existing data frame containing both the observed and predicted vectors, and tidy, which controls whether the output is returned as a list (tidy = FALSE) or as a data.frame (tidy = TRUE).
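A short sketch of how these arguments fit together is shown below. The toy data frame and its column names (actual, predicted) are made up for illustration, and the call is guarded so the snippet also runs where metrica is not installed.

```r
# Toy data frame; the column names `actual` and `predicted` are illustrative
toy <- data.frame(
  actual    = c(1, 0, 1, 1, 0, 0, 1, 0),
  predicted = c(1, 0, 0, 1, 0, 1, 1, 0)
)

if (requireNamespace("metrica", quietly = TRUE)) {
  # tidy = FALSE: output as a list
  acc_list <- metrica::accuracy(data = toy, obs = actual, pred = predicted,
                                tidy = FALSE)
  # tidy = TRUE: output as a data.frame
  acc_df <- metrica::accuracy(data = toy, obs = actual, pred = predicted,
                              tidy = TRUE)
  print(acc_df)
}
```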

For binary classification (two classes), the user should also check the pos_level argument, which gives the position of the “positive level” in the alphanumeric order of the class labels. The most common binary denominations are c(0,1), c(“Negative”, “Positive”), and c(“FALSE”, “TRUE”), so the default is pos_level = 2 (1, “Positive”, “TRUE”). Other cases are also possible, such as c(“Crop”, “NoCrop”), for which the user needs to specify pos_level = 1.
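The alphanumeric order can be checked in base R before calling any metric. The sketch below assumes that the order is the same sorted order that factor() gives to the labels; pos_level_of() is a hypothetical helper written for this example, not a metrica function.

```r
# Sorted (alphanumeric) order of the class labels, as factor() orders them
levels(factor(c(0, 1)))                     # "0" "1"
levels(factor(c("Negative", "Positive")))   # "Negative" "Positive"
levels(factor(c("FALSE", "TRUE")))          # "FALSE" "TRUE"
levels(factor(c("Crop", "NoCrop")))         # "Crop" "NoCrop"

# Hypothetical helper: position of the positive label in the sorted order
pos_level_of <- function(labels, positive) {
  match(positive, sort(unique(as.character(labels))))
}

pos_level_of(c("Crop", "NoCrop"), "Crop")            # 1
pos_level_of(c("Negative", "Positive"), "Positive")  # 2
```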

For multiclass classification tasks, some functions have the atom argument (logical TRUE / FALSE), which controls whether the output is an overall average estimate across all classes (atom = FALSE) or a class-wise estimate (atom = TRUE). For example, a user might be interested in obtaining estimates of precision and recall for each possible class of the prediction.
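The difference between the two kinds of output can be sketched in base R with made-up three-class labels. The overall value here is an unweighted (macro) average of the class-wise values; whether metrica averages in exactly this way is an assumption of the example.

```r
# Made-up three-class labels
obs  <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C")
pred <- c("A", "A", "B", "A", "B", "B", "C", "C", "C", "A")
lv   <- sort(unique(c(obs, pred)))

# Confusion matrix: rows = predicted, columns = observed
cm <- table(pred = factor(pred, levels = lv), obs = factor(obs, levels = lv))

tp <- diag(cm)
precision_by_class <- tp / rowSums(cm)  # correct / predicted as that class
recall_by_class    <- tp / colSums(cm)  # correct / observed in that class

# Overall estimates as unweighted means of the class-wise values
precision_macro <- mean(precision_by_class)
recall_macro    <- mean(recall_by_class)

precision_by_class
recall_by_class
```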

List of classification metrics* (categorical variables)

Note: All classification functions automatically recognize the number of classes and adjust estimations for binary or multiclass cases. For binary classification tasks, however, the user needs to check the alphanumeric order of the level considered as positive. By default, pos_level = 2, based on the most common denominations being c(0,1), c(“Negative”, “Positive”), and c(“FALSE”, “TRUE”).

| # | Function | Name | Details | Formula |
|---|---|---|---|---|
| 1 | accuracy | Accuracy | The most commonly used metric to evaluate classification quality. It represents the number of correctly classified cases with respect to all cases. It does not cover every aspect of classification quality, and it may be unreliable when classes are uneven in number. | \(accuracy = \frac{TP+TN}{TP+FP+TN+FN}\) |
| 2 | error_rate | Error Rate | The complement of accuracy. It varies between 0 and 1, with 0 the best and 1 the worst. | \(error~rate = \frac{FP+FN}{TP+FP+TN+FN}\) |
| 3 | precision, ppv | Precision | Also known as positive predictive value (ppv). It represents the proportion of correctly classified cases with respect to the total number of cases predicted as a given class (multinomial) or as the positive class (binomial). | \(precision = \frac{TP}{TP + FP}\) |
| 4 | recall, sensitivity, TPR, hitrate | Recall | Also known as sensitivity, hit rate, or true positive rate (TPR) for binary cases. It represents the proportion of correctly predicted cases with respect to the total number of observed cases for a given class (multinomial) or the positive class (binomial). | \(recall = \frac{TP}{P} = 1 - FNR\) |
| 5 | specificity, selectivity, TNR | Specificity | Also known as selectivity or true negative rate (TNR). It represents the proportion of correctly classified negative cases with respect to the total number of actual negatives. | \(specificity = \frac{TN}{N} = 1 - FPR\) |
| 6 | balacc | Balanced Accuracy | The average of recall and specificity. It is especially useful when the number of observations across classes is imbalanced. | \(b.accuracy = \frac{recall + specificity}{2}\) |
| 7 | fscore | F-score | Also known as the F-measure, or the F1-score when B = 1. It is the weighted harmonic mean of precision and recall, where B sets the weight given to recall relative to precision. | \(fscore = \frac{(1 + B^2) * precision * recall}{(B^2 * precision) + recall}\) |
| 8 | agf | Adjusted F-score | The agf adjusts the fscore for datasets with imbalanced classes. | \(agf = \sqrt{F_2 * invF_{0.5}}\), where \(F_2 = 5 * \frac{precision~*~recall}{(4*precision)~+~recall}\), and \(invF_{0.5} = (\frac{5}{4}) * \frac{precision~*~recall}{(0.5^2 ~*~ precision)~+~recall}\) |
| 9 | gmean | G-mean | The Geometric Mean (gmean) considers a balance between the performance on both majority and minority classes. The higher the value, the lower the risk of over-fitting of negative and under-fitting of positive classes. | \(gmean = \sqrt{recall~*~specificity}\) |
| 10 | khat | K-hat or Cohen’s Kappa Coefficient | The khat is considered a more robust metric than the classic accuracy. It normalizes the accuracy by the possibility of agreement by chance. Its upper bound is 1 (perfect agreement), and values of 0 or below indicate agreement no better than chance. The closer to 1, the better the classification quality. | \(khat = \frac{2 * (TP * TN - FN * FP)}{(TP+FP) * (FP+TN) + (TP+FN) * (FN + TN)}\) |
| 11 | mcc, phi_coef | Matthews Correlation Coefficient | Also known as the phi-coefficient. It is particularly useful when the number of observations belonging to each class is uneven. It varies between -1 and 1, where 1 is a perfect prediction, 0 is no better than random, and -1 is total disagreement. Currently, the mcc estimation is only available for binary cases (two classes). | \(mcc = \frac{TP * TN - FP * FN}{\sqrt{(TP+FP) * (TP+FN) * (TN+FP) * (TN+FN)}}\) |
| 12 | fmi | Fowlkes-Mallows Index | The fmi measures the similarity between two clusterings (predicted and observed). It is equivalent to the square root of the product of precision (PPV) and recall (TPR). It varies between 0 and 1, with 0 the worst and 1 the best. | \(fmi = \sqrt{precision * recall} = \sqrt{PPV * TPR}\) |
| 13 | bmi, jindex | Informedness | Also known as the Bookmaker Informedness, or as Youden’s J-index. It is a suitable metric when the number of cases for each class is uneven. It varies between -1 and 1, with 1 the best. | \(bmi = recall + specificity - 1 = TPR + TNR - 1\) |
| 14 | posLr | Positive Likelihood Ratio | The posLr, also known as LR(+), is the probability of obtaining a positive prediction for actual positives relative to the probability of obtaining a positive prediction for actual negatives. | \(posLr = \frac{recall}{1-specificity} = \frac{TPR}{FPR}\) |
| 15 | negLr | Negative Likelihood Ratio | The negLr, also known as LR(-), is the probability of obtaining a negative prediction for actual positives (or non-negatives in multiclass) relative to the probability of obtaining a negative prediction for actual negatives. | \(negLr = \frac{1-recall}{specificity} = \frac{FNR}{TNR}\) |
| 16 | dor | Diagnostic Odds Ratio | The dor summarizes the effectiveness of classification. It represents the odds of a positive case obtaining a positive prediction with respect to the odds of an actual negative obtaining a positive prediction. | \(dor = \frac{posLr}{negLr}\) |
| 17 | npv | Negative Predictive Value | It represents the proportion of correctly predicted negative cases with respect to the total number of cases predicted as negative. It varies between 0 and 1, with 0 the worst and 1 the best. | \(npv = \frac{TN}{PN} = \frac{TN}{TN + FN}\) |
| 18 | FPR | False Positive Rate | The complement of specificity. It varies between 0 and 1. The lower the better. | \(FPR = 1 - specificity = 1 - TNR = \frac{FP}{N}\) |
| 19 | FNR | False Negative Rate | The complement of recall. It varies between 0 and 1. The lower the better. | \(FNR = 1 - recall = 1 - TPR = \frac{FN}{P}\) |
| 20 | FDR | False Detection Rate | The complement of precision (or positive predictive value, ppv). It varies between 0 and 1, with 0 the best and 1 the worst. | \(FDR = 1 - precision = \frac{FP}{PP} = \frac{FP}{TP + FP}\) |
| 21 | FOR | False Omission Rate | The complement of the npv. It varies between 0 and 1, with 0 the best and 1 the worst. | \(FOR = 1 - npv = \frac{FN}{PN} = \frac{FN}{TN + FN}\) |
| 22 | preval | Prevalence | The proportion of observed positive cases with respect to all cases. It describes the dataset rather than the quality of the classification. It varies between 0 and 1. | \(preval = \frac{P}{P+N} = \frac{TP+FN}{TP+FP+TN+FN}\) |
| 23 | preval_t | Prevalence Threshold | The prevalence level below which the positive predictive value of a classifier declines sharply. It depends only on the true and false positive rates. | \(preval\_t = \frac{\sqrt{TPR * FPR} - FPR}{TPR - FPR}\) |
| 24 | csi, jaccardindex | Critical Success Index | The csi is also known as the threat score (TS) or Jaccard’s Index. It varies between 0 and 1, with 0 the worst and 1 the best. | \(csi = \frac{TP}{TP+FP+FN}\) |
| 25 | deltap, mk | Markedness or deltap | The deltap (a.k.a. Markedness, mk) quantifies the probability that a condition is marked by the predictor with respect to random chance. | \(deltap = precision+npv-1 = PPV + NPV -1\) |
| 26 | AUC_roc | Area Under the Curve | The AUC_roc estimates the area under the receiver operating characteristic (ROC) curve following the trapezoid approach. It is bounded between 0 and 1. The closer to 1 the better. AUC_roc = 0.5 means the model’s predictions are no better than those of a random classifier. | \(AUC_{roc} = \sum_{i=1}^{n-1} (FPR_{i+1} - FPR_i) * \frac{TPR_i + TPR_{i+1}}{2}\), where i indexes the n points of the ROC curve ordered by increasing FPR |
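The trapezoid approach to the area under the ROC curve can be sketched in base R as follows. This example assumes continuous prediction scores and 0/1 observed labels; auc_trapezoid() is a hypothetical helper written for illustration and is not the metrica implementation.

```r
# Hypothetical helper: ROC AUC by the trapezoid rule from scores and 0/1 labels
auc_trapezoid <- function(obs, score) {
  # Thresholds from strictest (nothing predicted positive) to most lenient
  thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))
  # TPR and FPR when every case with score >= threshold is called positive
  tpr <- sapply(thresholds, function(t) sum(score >= t & obs == 1) / sum(obs == 1))
  fpr <- sapply(thresholds, function(t) sum(score >= t & obs == 0) / sum(obs == 0))
  # Sum of trapezoid areas between consecutive ROC points
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

# Made-up example: 3 positives, 3 negatives
obs   <- c(1, 1, 0, 1, 0, 0)
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)
auc_trapezoid(obs, score)  # 8/9, about 0.889
```

In this toy case, 8 of the 9 positive-negative pairs are ranked correctly, which matches the area of 8/9.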


List of additional abbreviations:

P = observed (actual) positives (TP + FN)

N = observed (actual) negatives (TN + FP)

PP = predicted positives (TP + FP)

PN = predicted negatives (TN + FN)

TP = true positive

TN = true negative

FP = false positive

FN = false negative

TPR = true positive rate

TNR = true negative rate

FPR = false positive rate

FNR = false negative rate

ppv = positive predictive value

B = coefficient B (a.k.a. beta) indicating the weight to be applied to the estimation of fscore (as \(B^2\)).
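To make the abbreviations concrete, the sketch below computes several of the metrics from a set of made-up confusion-matrix counts, using their usual definitions. The counts and object names are assumptions for illustration only.

```r
# Made-up confusion-matrix counts for a binary case
TP <- 40; FP <- 10; FN <- 20; TN <- 30

P  <- TP + FN   # observed positives
N  <- TN + FP   # observed negatives
PP <- TP + FP   # predicted positives
PN <- TN + FN   # predicted negatives

precision   <- TP / PP
recall      <- TP / P    # TPR
specificity <- TN / N    # TNR
npv         <- TN / PN

# F-score with weight B (B = 1 gives the F1-score)
fscore <- function(precision, recall, B = 1) {
  (1 + B^2) * precision * recall / (B^2 * precision + recall)
}

metrics <- c(
  accuracy    = (TP + TN) / (P + N),
  precision   = precision,
  recall      = recall,
  specificity = specificity,
  npv         = npv,
  balacc      = (recall + specificity) / 2,
  f1          = fscore(precision, recall, B = 1),
  f2          = fscore(precision, recall, B = 2),
  bmi         = recall + specificity - 1,
  deltap      = precision + npv - 1,
  posLr       = recall / (1 - specificity),
  negLr       = (1 - recall) / specificity,
  csi         = TP / (TP + FP + FN),
  khat        = 2 * (TP * TN - FN * FP) / (PP * N + P * PN),
  mcc         = (TP * TN - FP * FN) / sqrt(PP * P * N * PN)
)
metrics[["dor"]] <- metrics[["posLr"]] / metrics[["negLr"]]

round(metrics, 3)
```

For a binary confusion matrix, two identities make a quick sanity check on the arithmetic: dor equals (TP * TN) / (FP * FN), and the square of mcc equals bmi * deltap.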

References:

  1. Ting K.M. (2017). Confusion Matrix. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.

  2. Accuracy. (2017). In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.

  3. García, V., Mollineda, R.A., Sánchez, J.S. (2009). Index of Balanced Accuracy: A Performance Measure for Skewed Class Distributions. In: Araujo, H., Mendonça, A.M., Pinho, A.J., Torres, M.I. (eds) Pattern Recognition and Image Analysis. IbPRIA 2009. Lecture Notes in Computer Science, vol 5524. Springer-Verlag Berlin Heidelberg.

  4. Ting K.M. (2017). Precision and Recall. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.

  5. Sensitivity. (2017). In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.

  6. Ting K.M. (2017). Sensitivity and Specificity. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.

  7. Trevethan, R. (2017). Sensitivity, Specificity, and Predictive Values: Foundations, Pliabilities, and Pitfalls in Research and Practice. Front. Public Health 5: 307.

  8. Goutte, C., Gaussier, E. (2005). A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation. In: D.E. Losada and J.M. Fernandez-Luna (Eds.): ECIR 2005. Advances in Information Retrieval LNCS 3408, pp. 345–359. Springer-Verlag Berlin Heidelberg.

  9. Maratea, A., Petrosino, A., Manzo, M. (2014). Adjusted-F measure and kernel scaling for imbalanced data learning. Inf. Sci. 257: 331-341.

  10. De Diego, I.M., Redondo, A.R., Fernández, R.R., Navarro, J., Moguerza, J.M. (2022). General Performance Score for classification problems. Appl. Intell. (2022).

  11. Fowlkes, Edward B; Mallows, Colin L (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association. 78 (383): 553–569.

  12. Chicco, D., Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21, 6 (2020).

  13. Youden, W.J. (1950). Index for rating diagnostic tests. Cancer 3: 32-35.

  14. Powers, D.M.W. (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies 2(1): 37–63.

  15. Chicco, D., Tötsch, N., Jurman, G. (2021). The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Min 14(1): 13.

  16. Glas, A.S., Lijmer, J.G., Prins, M.H., Bonsel, G.J., Bossuyt, P.M.M. (2003). The diagnostic odds ratio: a single indicator of test performance. Journal of Clinical Epidemiology 56(11): 1129-1135.

  17. Wang H., Zheng H. (2013). Negative Predictive Value. In: Dubitzky W., Wolkenhauer O., Cho KH., Yokota H. (eds) Encyclopedia of Systems Biology. Springer, New York, NY.

  18. Freeman, E.A., Moisen, G.G. (2008). A comparison of the performance of threshold criteria for binary classification in terms of predicted prevalence and kappa. Ecol. Modell. 217(1-2): 45-58.

  19. Balayla, J. (2020). Prevalence threshold (φe) and the geometry of screening curves. Plos one, 15(10):e0240215.

  20. Schaefer, J.T. (1990). The critical success index as an indicator of warning skill. Weather and Forecasting 5(4): 570-575.

  21. Hanley, J.A., McNeil, B.J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1): 29-36.

  22. Hand, D.J., Till, R.J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning 45: 171-186.

  23. Mandrekar, J.N. (2010). Receiver operating characteristic curve in diagnostic test assessment. J. Thoracic Oncology 5(9): 1315-1316.