- 1 Assessing the performance of risk prediction models
- 2 Assessing the performance of other types of prediction models
- 3 FAQ
- 4 References

In this document, we give you a brief overview of the basic functionality of the `CalibrationCurves`

package. In addition, we present the theoretical framework behind calibration and provide some illustrative examples to give the reader a better insight into the calibration assessment of a predictive model. We advise you to also consult the help-pages of the functions to get an exhaustive overview of the functionality.

We tried to tailor the explanation of the concepts to professionals with different backgrounds. Please, do contact me if you feel that something is unclear so that I can adjust (and hopefully improve) it. In addition, don’t hesitate to send any suggestions you might have and bug reports to the package author.

In this package, we focus on risk prediction models that estimate the probability \(\pi_i\) of observing an event. We use \(y_i \in (0, 1)\) to denote the variable that captures this outcome which takes on the value 0 in case of a non-event and 1 in case of an event. Here, \(i\) serves as an index for the observations (mostly the patient within medical predictive analytics) with \(i = (1, \dots, n)\) and where \(n\) denotes the total number of observations. We assume that the response variable \(y_i\) follows a Bernoulli distribution \(y_i \sim \text{Bern}(\pi_i)\).

For example, we could be interested in estimating the probability \(\pi_i\) of observing a malignant tumour for patient \(i\). In this case, the event \(y_i = 1\) is the tumour being malignant and \(y_i = 0\) when the tumour is benign. With no available information on the patient characteristics, we might rely on the prevalence in the general population to estimate this probability.

Using risk prediction models, we model the outcome as a function of the observed risk/patient characteristics. The risk characteristics are contained in the covariate vector \(\boldsymbol{x}_i\). This vector contains all observed information for patient \(i\) (e.g. maximum diameter of the lesion, proportion of solid tissue, …). This allows us to obtain a more accurate prediction that is based on the relation between the patient characteristics and the outcome. To construct a clinical prediction model, we either rely on a statistical models such as logistic regression or machine learning methods. A general expression that encompasses both types of models is \[\begin{align*} E[y_i | \boldsymbol{x}_i] = f(\boldsymbol{x}_i). \end{align*}\] This expression states that we model the response \(y_i\) as a function of the observed risk characteristics \(\boldsymbol{x}_i\).

To construct a risk prediction model, we could rely on a logistic regression model \[\begin{align*} E[y_i | \boldsymbol{x}_i] = \pi_i(\boldsymbol{\beta}) = \frac{e^{\boldsymbol{x}_i^\top \boldsymbol{\beta}}}{1 + e^{\boldsymbol{x}_i^\top \boldsymbol{\beta}}} \end{align*}\] where \(\boldsymbol{\beta}\) denotes the parameter vector. \(\pi_i(\boldsymbol{\beta}) = P(y_i = 1| \boldsymbol{x}_i)\) denotes the probability of observing the event, given the covariate vector \(\boldsymbol{x}_i\). We can rewrite the equation to its more well-known form \[\begin{align*} \log\left( \frac{\pi_i(\boldsymbol{\beta})}{1 - \pi_i(\boldsymbol{\beta})} \right) &= \boldsymbol{x}_i^\top \boldsymbol{\beta}\\[0.5em] \text{logit}(\pi_i(\boldsymbol{\beta})) &= \eta_i \end{align*}\] where \(\eta_i\) denotes the linear predictor. Here, we have the well-known logit function at the left side of the equation.

With machine learning methods, \(f(\cdot)\) depends on the specific algorithm. With tree-based methods, for example, this correspond to the observed proportion in the leaf nodes. For neural networks, \(f(\cdot)\) is determined by the weights in the layers and the chosen activation functions.

To assess how well the model is able to predict (the probability of) the outcome, we assess two different aspects of the model (Van Calster et al. 2016, 2019; Alba et al. 2017):

*discrimination*;*calibration*.

With *discrimination*, we refer to the model’s ability to differentiate between observations that have the event and observations that have not. In this context, this translates to giving higher risk estimates for patients with the event than patients without the event. We commonly assess this using the area under the receiver operating characteristic curve. However, discrimination performance does not tell us how accurate the predictions are. The estimated risk may result in good discrimination and can be inaccurate at the same time. We refer to the accuracy of the predictions as the *calibration*. Hence, hereby we assess the agreement between the estimated and observed number of events (Van Calster et al. 2016). We say that a prediction model is calibrated if the predicted risks correspond to the observed proportions of the event.

One way to examine the calibration of risk predictions, is by using calibration curves (Van Calster et al. 2016, 2019; Steyerberg 2019; De Cock Campo 2023). A calibration curve maps the predicted probabilities \(f(\boldsymbol{x}_i)\) to the actual event probabilities \(P(y_i = 1| f(\boldsymbol{x}_i))\) and visualizes the correspondence between the model’s predicted risks and the true probabilities. For perfectly calibrated predictions, the calibration curve equals the diagonal, i.e. \(P(y_i = 1 | f(\boldsymbol{x}_i)) = f(\boldsymbol{x}_i) \ \forall \ i\) where \(\forall \ i\) denotes for all \(i\).

In practice, we typically assess the model’s calibration on a validation set. In this setting, a calibration curve visualizes the correspondence between the model’s predicted risks and the observed proportion. When we have a perfect agreement between the observed and predicted proportion the calibration curve coincides with the ideal curve (a diagonal line). This scenario is visualized in Figure 1.1.

By assessing the calibration performance on a data set other than the training set, we obtain an indication of how well our risk prediction is able to generalize to other data sets and how accurate its out-of-sample predictions are. In general, the prediction model will show some miscalibration and the calibration curve gives us a visual depiction of how badly the model is miscalibrated. The further from the diagonal line, the worse the calibration. Figure 1.2 depicts an example of a model that is miscalibrated and is a typical example of a model that is overfitted to the training data. This particular model has predictions that are too extreme: high risks are overestimated and low risks are underestimated.