When we train a model, we usually want to know how good the model is.
Model performance is assessed using different metrics that quantify how
well a model discriminates cases, stratifies groups or predicts values.
familiar implements metrics that are typically used to assess the performance of categorical, regression and survival models.
Performance metrics for models with categorical outcomes, i.e. binomial and multinomial, are listed below.
| method | tag | averaging |
|---|---|---|
| accuracy | accuracy | |
| area under the receiver-operating curve | auc, auc_roc | |
| balanced accuracy | bac, balanced_accuracy | |
| balanced error rate | ber, balanced_error_rate | |
| Brier score | brier | |
| Cohen’s kappa | kappa, cohen_kappa | |
| f1 score | f1_score | × |
| false discovery rate | fdr, false_discovery_rate | × |
| informedness | informedness | × |
| markedness | markedness | × |
| Matthews’ correlation coefficient | mcc, matthews_correlation_coefficient | |
| negative predictive value | npv | × |
| positive predictive value | precision, ppv | × |
| recall | recall, sensitivity, tpr, true_positive_rate | × |
| specificity | specificity, tnr, true_negative_rate | × |
| Youden’s J | youden_j, youden_index | × |
The area under the receiver-operating curve is quite literally that: the area under the curve created by plotting the true positive rate (sensitivity) against the false positive rate (1 - specificity). TPR and FPR are derived from a contingency table, which is created by comparing predicted class probabilities against a threshold. The receiver-operating curve is created by iterating over threshold values. The AUC of a model that predicts perfectly is \(1.0\), while \(0.5\) indicates predictions that are no better than random.
The implementation in familiar does not use the ROC curve to compute the AUC. Instead, an algebraic equation by Hand and Till (2001) is used. For multinomial outcomes the AUC is computed for each pairwise comparison of outcome classes, and averaged (Hand and Till 2001).
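To illustrate how an AUC can be obtained without tracing the ROC curve, the sketch below uses the rank-sum form \((S - n_{pos}(n_{pos} + 1) / 2) / (n_{pos} n_{neg})\), where \(S\) is the sum of the ranks of the predicted probabilities of the positive samples. This follows the two-class equation of Hand and Till (2001) as we understand it; the function name `rank_auc`, its arguments and the handling of ties through average ranks are assumptions made for illustration, and are not taken from familiar. Both classes are assumed to be present.

```python
def rank_auc(observed, p_positive, positive):
    """AUC from the rank sum of the positive class; no ROC curve is traced.

    Hypothetical helper: observed holds class labels, p_positive the
    predicted probability of the positive class for each sample.
    """
    order = sorted(range(len(p_positive)), key=lambda i: p_positive[i])
    ranks = [0.0] * len(p_positive)
    i = 0
    while i < len(order):
        # Find the run of tied probabilities and assign the average rank.
        j = i
        while j + 1 < len(order) and p_positive[order[j + 1]] == p_positive[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n_pos = sum(1 for label in observed if label == positive)
    n_neg = len(observed) - n_pos
    rank_sum = sum(r for r, label in zip(ranks, observed) if label == positive)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


print(rank_auc(["a", "a", "b", "b"], [0.1, 0.4, 0.35, 0.8], positive="b"))
```

For multinomial outcomes, the same computation would be repeated for each pair of classes and the results averaged, as described above.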
The Brier score (Brier 1950) is a measure of deviation of predicted probabilities from the ideal situation where the probability for class a is \(1.0\) if the observed class is a and \(0.0\) if it is not a. Hence, it can be viewed as a measure of calibration as well. A value of \(0.0\) is optimal.
The implementation in familiar iterates over all outcome classes in a one-versus-all approach, as originally devised by Brier (1950). For binomial outcomes, the score is divided by 2, so that it falls in the \([0.0, 1.0]\) range.
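The description above can be sketched as follows. The function `brier_score` and its data layout (one dictionary of class probabilities per sample) are hypothetical and chosen for readability; they do not mirror the internals of familiar.

```python
def brier_score(observed, probabilities, classes):
    """Mean squared deviation between predicted probabilities and the
    ideal 0/1 outcome, summed over all classes (one-versus-all).

    observed: list of class labels.
    probabilities: list of dicts that map each class to its probability.
    """
    total = 0.0
    for label, probs in zip(observed, probabilities):
        for cls in classes:
            target = 1.0 if label == cls else 0.0
            total += (probs[cls] - target) ** 2
    score = total / len(observed)
    # For two classes, halve the score so that it lies in [0, 1].
    if len(classes) == 2:
        score /= 2.0
    return score


observed = ["a", "b"]
probabilities = [{"a": 0.8, "b": 0.2}, {"a": 0.3, "b": 0.7}]
print(brier_score(observed, probabilities, classes=["a", "b"]))
```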
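A contingency table or confusion matrix displays the observed and predicted classes. When dealing with two classes, e.g. a and b, one of the classes is usually termed ‘positive’ and the other ‘negative’. For example, let b be the ‘positive’ class. Then we can define the following four categories:
true positive (TP): the observed class is b and the predicted class is b.
true negative (TN): the observed class is a and the predicted class is a.
false positive (FP): the observed class is a and the predicted class is b.
false negative (FN): the observed class is b and the predicted class is a.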
A contingency table contains the occurrence of each of the four cases. If a model is good, most samples will be either true positive or true negative. Models that are not as good may have larger numbers of false positives and/or false negatives.
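As a minimal sketch, the four counts can be tallied from paired observed and predicted classes. The helper `contingency_counts` is hypothetical; it treats every class other than `positive` as ‘negative’.

```python
def contingency_counts(observed, predicted, positive):
    """Return (TP, TN, FP, FN) with `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for obs, pred in zip(observed, predicted):
        if obs == positive and pred == positive:
            tp += 1  # observed positive, predicted positive
        elif obs != positive and pred != positive:
            tn += 1  # observed negative, predicted negative
        elif obs != positive and pred == positive:
            fp += 1  # observed negative, predicted positive
        else:
            fn += 1  # observed positive, predicted negative
    return tp, tn, fp, fn


observed = ["a", "b", "b", "a", "b", "a"]
predicted = ["a", "b", "a", "b", "b", "a"]
print(contingency_counts(observed, predicted, positive="b"))
```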
Metrics based on the contingency table use two or more of the four categories to characterise model performance. The extension from two classes (binomial) to more (multinomial) is often not trivial. For many metrics, familiar uses a one-versus-all approach. Here, all classes are iteratively used as the ‘positive’ class, with the rest grouped as ‘negative’. Three options can be used to obtain performance values for multinomial problems, with an implementation similar to that of scikit-learn:
micro: The number of true positives, true negatives, false positives and false negatives are computed for each class iteration, and then summed over all classes. The score is calculated afterwards.
macro: A score is computed for each class iteration, and then averaged.
weighted: A score is computed for each class iteration, and then averaged with a weight corresponding to the number of samples with the observed ‘positive’ class, i.e. the prevalence.
By default, familiar uses macro, but the averaging procedure may be selected by appending _micro, _macro or _weighted to the name of the metric. For example, recall_micro will compute the recall metric using micro averaging.
Averaging only applies to multinomial outcomes. No averaging is performed for binomial problems with two classes. In this case familiar will always consider the second class level to correspond to the ‘positive’ class.
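The three averaging options can be sketched for recall, \(TP / (TP + FN)\), as an example. The functions `one_vs_all_counts` and `averaged_recall` are hypothetical and only illustrate the procedure described above; every class is assumed to occur at least once among the observed classes.

```python
def one_vs_all_counts(observed, predicted, classes):
    """Per-class (TP, TN, FP, FN), using each class in turn as 'positive'."""
    counts = {}
    for cls in classes:
        tp = sum(o == cls and p == cls for o, p in zip(observed, predicted))
        fn = sum(o == cls and p != cls for o, p in zip(observed, predicted))
        fp = sum(o != cls and p == cls for o, p in zip(observed, predicted))
        tn = len(observed) - tp - fn - fp
        counts[cls] = (tp, tn, fp, fn)
    return counts


def averaged_recall(observed, predicted, classes, averaging="macro"):
    counts = one_vs_all_counts(observed, predicted, classes)
    if averaging == "micro":
        # Sum the counts over all class iterations, then compute the score.
        tp = sum(c[0] for c in counts.values())
        fn = sum(c[3] for c in counts.values())
        return tp / (tp + fn)
    # Otherwise compute a score for each class iteration first.
    per_class = {cls: c[0] / (c[0] + c[3]) for cls, c in counts.items()}
    if averaging == "macro":
        return sum(per_class.values()) / len(per_class)
    if averaging == "weighted":
        # Weight by the number of samples observed in the 'positive' class.
        weights = {cls: c[0] + c[3] for cls, c in counts.items()}
        return sum(per_class[cls] * weights[cls] for cls in classes) / sum(weights.values())
    raise ValueError("averaging should be 'micro', 'macro' or 'weighted'")


observed = ["a", "a", "a", "a", "b", "b", "c", "c"]
predicted = ["a", "a", "a", "a", "b", "c", "c", "c"]
for averaging in ("micro", "macro", "weighted"):
    print(averaging, averaged_recall(observed, predicted, ["a", "b", "c"], averaging))
```

In this example 7 of 8 samples are predicted correctly. For recall, micro and weighted averaging both return 7/8, i.e. the accuracy, whereas macro averaging returns the unweighted mean of the per-class recalls (1.0, 0.5 and 1.0).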
Accuracy quantifies the fraction of correctly predicted classes: \(s_{acc}=(TP + TN) / (TP + TN + FP + FN)\). The extension to more than 2 classes is trivial. No averaging is performed for the accuracy metric.
Accuracy is known to be sensitive to imbalances in the class distribution. A balanced accuracy was therefore defined (Brodersen et al. 2010), which is the averaged within-class true positive rate (also known as recall or sensitivity): \(s_{bac}=0.5 (TP / (TP + FN) + TN / (TN + FP))\).
The extension to more than 2 classes involves summation of the in-class true positive rate \(TP / (TP + FN)\) for each positive class and subsequent division by the number of classes. No averaging is performed for balanced accuracy.
The balanced error rate is closely related to balanced accuracy, i.e. instead of the in-class true positive rate, the in-class false negative rate is used: \(s_{ber}=0.5 (FN / (TP + FN) + FP / (TN + FP))\).
The extension to more than 2 classes involves summation of the in-class false negative rate \(FN / (TP + FN)\) for each positive class and subsequent division by the number of classes. No averaging is performed for balanced error rate.
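The multi-class definitions of balanced accuracy and balanced error rate can be sketched as below. The function names are hypothetical, and every class is assumed to occur at least once among the observed classes.

```python
def in_class_true_positive_rates(observed, predicted, classes):
    """TP / (TP + FN) for each class, using that class as 'positive'."""
    rates = []
    for cls in classes:
        n_observed = sum(o == cls for o in observed)
        n_correct = sum(o == cls and p == cls for o, p in zip(observed, predicted))
        rates.append(n_correct / n_observed)
    return rates


def balanced_accuracy(observed, predicted, classes):
    rates = in_class_true_positive_rates(observed, predicted, classes)
    return sum(rates) / len(rates)


def balanced_error_rate(observed, predicted, classes):
    # The in-class false negative rate is 1 minus the true positive rate.
    rates = in_class_true_positive_rates(observed, predicted, classes)
    return sum(1.0 - rate for rate in rates) / len(rates)


# An imbalanced example: a model that always predicts the majority class.
observed = ["a"] * 8 + ["b"] * 2
predicted = ["a"] * 10
print(balanced_accuracy(observed, predicted, ["a", "b"]))
print(balanced_error_rate(observed, predicted, ["a", "b"]))
```

The accuracy in this example is 0.8, even though the model never predicts class b. The balanced accuracy of 0.5 reflects this.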
The F1 score is the harmonic mean of precision and sensitivity: \(s_{f1} = 2 \; TP / (2 \; TP + FP + FN)\).
The metric is not invariant to class permutation. Averaging is therefore performed for multinomial outcomes.
The false discovery rate quantifies the proportion of false positives among all predicted positives, i.e. the Type I error: \(s_{fdr} = FP / (TP + FP)\).
The metric is not invariant to class permutation. Averaging is therefore performed for multinomial outcomes.
Informedness is a generalisation of Youden’s J statistic (Powers 2011). Informedness can be extended to multiple classes, and no averaging is therefore required.
For binomial problems, informedness and the Youden J statistic are the same.
Cohen’s kappa coefficient is a measure of correspondence between the observed and predicted classes (Cohen 1960). Cohen’s kappa coefficient is invariant to class permutations and no averaging is performed for Cohen’s kappa.
Markedness is related to the precision or positive predictive value (Powers 2011).
Matthews’ correlation coefficient measures the correlation between observed and predicted classes (Matthews 1975).
An extension to multiple classes, i.e. multinomial outcomes, was devised by Gorodkin (2004).
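For two classes, Matthews’ correlation coefficient is commonly written as \((TP \cdot TN - FP \cdot FN) / \sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}\). The text above does not spell out this equation, so the sketch below should be read as the standard textbook form and not necessarily as the code path used by familiar. Returning \(0.0\) for a zero denominator is a convention assumed here. The multi-class extension by Gorodkin (2004) is not shown.

```python
import math


def matthews_correlation(tp, tn, fp, fn):
    """Two-class Matthews' correlation coefficient from contingency counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0.0:
        # Assumed convention: undefined cases are reported as no correlation.
        return 0.0
    return (tp * tn - fp * fn) / denominator


print(matthews_correlation(tp=2, tn=2, fp=1, fn=1))
```

Values range from \(-1.0\) (all predictions wrong) through \(0.0\) (no correlation) to \(1.0\) (all predictions correct).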
The negative predictive value (NPV) is the fraction of predicted negative classes that were also observed to be negative: \(s_{npv} = TN / (TN + FN)\).
The NPV is not invariant to class permutations. Averaging is performed for multinomial outcomes.
The positive predictive value (PPV) is the fraction of predicted positive classes that were also observed to be positive: \(s_{ppv} = TP / (TP + FP)\).
The PPV is also referred to as precision. The PPV is not invariant to class permutations. Averaging is performed for multinomial outcomes. micro averaging effectively computes the accuracy.
Recall, also known as sensitivity or true positive rate, is the fraction of observed positive classes that were also predicted to be positive: \(s_{recall} = TP / (TP + FN)\).
Recall is not invariant to class permutations and averaging is performed for multinomial outcomes. Both micro and weighted averaging effectively compute the accuracy.
Specificity, also known as the true negative rate, is the fraction of observed negative classes that were also predicted to be negative: \(s_{spec} = TN / (TN + FP)\).
Specificity is not invariant to class permutations and averaging is performed for multinomial outcomes.
Youden’s J statistic (Youden 1950) is the sum of recall and specificity minus 1: \(s_{youden} = TP / (TP + FN) + TN / (TN + FP) - 1\).
Youden’s J statistic is not invariant to class permutations and averaging is performed for multinomial outcomes.
For binomial problems, informedness and the Youden J statistic are the same.
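As a small worked example, Youden’s J statistic can be computed directly from the four contingency counts. The function `youden_j` is a hypothetical helper; both observed classes are assumed to be present.

```python
def youden_j(tp, tn, fp, fn):
    """Recall plus specificity minus 1, from contingency counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return recall + specificity - 1.0


# Recall and specificity are both 2/3, so J = 1/3.
print(youden_j(tp=2, tn=2, fp=1, fn=1))
```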
Performance metrics for models with regression outcomes, i.e. count and continuous, are listed below.
| method | tag |
|---|---|
| explained variance | explained_variance |
| mean absolute error | mae, mean_absolute_error |
| relative absolute error | rae, relative_absolutive_error |
| mean log absolute error | mlae, mean_log_absolute_error |
| mean squared error | mse, mean_squared_error |
| relative squared error | rse, relative_squared_error |
| mean squared log error | msle, mean_squared_log_error |
| median absolute error | medae, median_absolute_error |
| \(R^2\) score | r2_score, r_squared |
| root mean square error | rmse, root_mean_square_error |
| root relative squared error | rrse, root_relative_squared_error |
| root mean square log error | rmsle, root_mean_square_log_error |
Each of the above metrics can be made more robust against rare outliers by appending _winsor or _trim as a suffix to the metric name. This respectively performs winsorising (clipping) and trimming (truncating) based on the absolute prediction error, prior to computing the metric. Winsorising clips the predicted values for the 5% of instances with the most extreme absolute errors, whereas trimming removes these instances. For example, winsorised and trimmed versions of the mean squared error metric are specified as mse_winsor and mse_trim respectively.
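The difference between the two procedures can be sketched as follows. How familiar selects the 5% of instances and to which value it clips is not described here, so the details below are assumptions: the number of affected instances is rounded from \(0.05 N\), and winsorising moves a prediction towards the observed value until its absolute error equals the largest absolute error among the remaining instances. All function names are hypothetical.

```python
import math


def trim(observed, predicted, fraction=0.05):
    """Remove the instances with the most extreme absolute errors."""
    n_drop = round(len(observed) * fraction)
    order = sorted(range(len(observed)), key=lambda i: abs(observed[i] - predicted[i]))
    keep = sorted(order[:len(order) - n_drop])
    return [observed[i] for i in keep], [predicted[i] for i in keep]


def winsorise(observed, predicted, fraction=0.05):
    """Clip predictions so that no absolute error exceeds the limit."""
    n_clip = round(len(observed) * fraction)
    abs_errors = sorted(abs(o - p) for o, p in zip(observed, predicted))
    limit = abs_errors[len(abs_errors) - n_clip - 1]
    clipped = []
    for o, p in zip(observed, predicted):
        error = o - p
        if abs(error) > limit:
            # Keep the sign of the error, but shrink its size to the limit.
            p = o - math.copysign(limit, error)
        clipped.append(p)
    return list(observed), clipped


def mse(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Nineteen errors of size 1 and a single outlier with an error of size 10.
observed = [0.0] * 20
predicted = [1.0] * 19 + [10.0]
print(mse(observed, predicted))
print(mse(*trim(observed, predicted)))
print(mse(*winsorise(observed, predicted)))
```

Trimming changes the number of instances that enter the metric, whereas winsorising keeps all instances and only limits the size of their errors.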
Let \(y\) be the set of observed values, and \(\hat{y}\) the corresponding predicted values. The error is then \(\epsilon = y - \hat{y}\).
The explained variance is defined as \(1 - \text{Var}\left(\epsilon\right) / \text{Var}\left(y\right)\). This metric is not sensitive to differences in offset between observed and predicted values.
The mean absolute error is defined as \(1/N \sum_i^N \left|\epsilon_i\right|\), with \(N\) the number of samples.
The relative absolute error is defined as \(\sum_i^N \left|\epsilon_i\right| / \sum_i^N \left|y_i - \bar{y}\right|\).
The mean log absolute error is defined as \(1/N \sum_i^N \log(\left|\epsilon_i\right| + 1)\).
The mean squared error is defined as \(1/N \sum_i^N \left(\epsilon_i \right)^2\).
The relative squared error is defined as \(\sum_i^N \left(\epsilon_i\right)^2/ \sum_i^N \left(y_i - \bar{y}\right)^2\).
Mean squared log error is defined as \(1/N \sum_i^N \left(\log \left(y_i + 1\right) - \log\left(\hat{y}_i + 1\right)\right)^2\). Note that this score only applies to observed and predicted values in the positive domain. It is not defined for negative values.
The median absolute error is the median of absolute error \(\left|\epsilon\right|\).
The \(R^2\) score is defined as: \[R^2 = 1 - \frac{\sum_i^N(\epsilon_i)^2}{\sum_i^N(y_i - \bar{y})^2}\] Here \(\bar{y}\) denotes the mean value of \(y\).
The root mean square error is defined as \(\sqrt{1/N \sum_i^N \left(\epsilon_i \right)^2}\).
The root relative squared error is defined as \(\sqrt{\sum_i^N \left(\epsilon_i\right)^2/ \sum_i^N \left(y_i - \bar{y}\right)^2}\).
The root mean square log error is defined as \(\sqrt{1/N \sum_i^N \left(\log \left(y_i + 1\right) - \log\left(\hat{y}_i + 1\right)\right)^2}\). Note that this score only applies to observed and predicted values in the positive domain. It is not defined for negative values.
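The definitions above translate almost directly into code. The sketch below computes all of them for one set of observed and predicted values, using the metric tags as dictionary keys. The function `regression_metrics` is hypothetical. The population variance is used for the explained variance, which is an arbitrary choice because the normalisation cancels in the ratio.

```python
import math
from statistics import mean, median, pvariance


def regression_metrics(observed, predicted):
    """Regression metrics, with the error defined as observed - predicted."""
    n = len(observed)
    errors = [y - y_hat for y, y_hat in zip(observed, predicted)]
    y_bar = mean(observed)
    sum_abs_error = sum(abs(e) for e in errors)
    sum_sq_error = sum(e ** 2 for e in errors)
    sum_sq_total = sum((y - y_bar) ** 2 for y in observed)
    # Log errors require observed and predicted values that are not negative.
    msle = mean((math.log(y + 1) - math.log(y_hat + 1)) ** 2
                for y, y_hat in zip(observed, predicted))
    return {
        "explained_variance": 1 - pvariance(errors) / pvariance(observed),
        "mae": sum_abs_error / n,
        "rae": sum_abs_error / sum(abs(y - y_bar) for y in observed),
        "mlae": mean(math.log(abs(e) + 1) for e in errors),
        "mse": sum_sq_error / n,
        "rse": sum_sq_error / sum_sq_total,
        "msle": msle,
        "medae": median(abs(e) for e in errors),
        "r2_score": 1 - sum_sq_error / sum_sq_total,
        "rmse": math.sqrt(sum_sq_error / n),
        "rrse": math.sqrt(sum_sq_error / sum_sq_total),
        "rmsle": math.sqrt(msle),
    }


# Predictions that are offset from the observed values by a constant 0.5.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
for tag, value in metrics.items():
    print(tag, value)
```

Because the predictions differ from the observed values by a constant offset only, the explained variance is 1.0 in this example, while the \(R^2\) score is 0.8.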
Performance metrics for models with survival outcomes, i.e. survival, are listed below.
| method | tag |
|---|---|
| concordance index | concordance_index, c_index, concordance_index_harrell, c_index_harrell |
The concordance index assesses ordering between observed and predicted values. Let \(T\) be observed times, \(c\) the censoring status (\(0\): no observed event; \(1\): event observed) and \(\hat{T}\) predicted times. Concordance between all pairs of values is determined as follows (Pencina and D’Agostino 2004):
The concordance index is then computed as: \[ci = \frac{n_{concord} + 0.5 n_{tied}}{n_{concord} + n_{discord} + n_{tied}}\]
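The equation above can be sketched in code once the pair-wise rules are fixed. The rules used below are assumptions in the style of Harrell's concordance index, and may differ in detail from those used by familiar: a pair is comparable if the sample with the shorter observed time had an observed event; a comparable pair is concordant if the predicted times are ordered in the same way as the observed times, discordant if they are ordered the other way around, and tied if the predicted times are equal. Pairs with equal observed times are skipped. The function `concordance_index` is hypothetical.

```python
def concordance_index(times, events, predicted_times):
    """Concordance index from observed times, event status and predicted times."""
    n_concord = n_discord = n_tied = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Sample i should have an observed event and the shorter time.
            if events[i] != 1 or times[i] >= times[j]:
                continue
            if predicted_times[i] < predicted_times[j]:
                n_concord += 1
            elif predicted_times[i] > predicted_times[j]:
                n_discord += 1
            else:
                n_tied += 1
    return (n_concord + 0.5 * n_tied) / (n_concord + n_discord + n_tied)


times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
predicted_times = [1.5, 1.0, 3.0, 4.0]
print(concordance_index(times, events, predicted_times))
```

In this example five pairs are comparable, of which four are concordant and one is discordant.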