```
library(ldt)
library(kableExtra)
```

It is recommended to read the following vignettes first:

In **ldt**, we *automatically explain* or
*automatically predict* one or more than one random variable. In
this vignette, we design a model set for predicting loan default. Of
course, we focus on comparing the performance of *logit* and
*probit* models. We use AUC as a performance measure:

```
<- GetMeasureOptions(
measureOptions typesIn = c("aucIn", "frequency"),
typesOut = c("aucOut", "frequency")
)
```

Note that, we calculate both in-sample and out-of-sample AUC. Also, we calculate the error ratio by using the following frequency cost matrix:

\[\begin{equation} \label{eq:error-ratio} \begin{bmatrix} 0.5 & 1 & 1\\ 1.0 & 0 & 0 \end{bmatrix} \end{equation}\]

which is:

`<- matrix(c(0.5, 1, 1, 0, 1, 0), 2, 3) frequencyCost `

Note that this is not a favorable cost matrix in an actual application. One might want to define more thresholds or increase the costs in the third column.

What is a frequency cost matrix in **ldt**? The general
form of a frequency cost matrix in binary case is: \[\begin{equation}
\begin{bmatrix}
t_1 & c_{11} & c_{12}\\
t_2 & c_{21} & c_{22}\\
\vdots & \vdots&\vdots\\
t_n & c_{n1} & c_{n2}\\
\end{bmatrix}
\end{equation}\] In this presentation, \(t_i\) for \(i=1,\ldots,n\) is the threshold for the
predicted probability. If the actual value is negative (i.e., \(y_i=0\)), the cost is determined by the
first column. Otherwise (i.e., \(y_i=1\)), the cost is determined by the
third column.

In this vignette, we use Berka and Sochorova
(1993) data-set and `ldt::Data_BerkaLoan()`

function
to get a sample for the dependent variable and the potential predictors
(or, features):

```
<- Data_BerkaLoan(positive = c("B", "D"), negative = c("A", "C"))
data #data <- Data_BerkaLoan(positive = c("B"), negative = c("A"))
```

The data set has a *loan table* with 682 observations, each
labeled as one of the following:

- A: finished (29.8%);
- B: finished with default
(
`vig_data$berka$B_finished`

%); - C: running (59.1%); and
- D: running with default (6.6%).

Numbers in the parenthesis show the percentage of data in each class.
Each loan observation has an *account identification* that can
provide other types of information from other tables, such as the
characteristics of the account of the loan and its transactions.
Furthermore, each account has a *district identification* that
can provide information about the demographic characteristics of the
location of its branch. The combined table has features (including the
label) and observations.

For this example, both *finished* and *running*
(without default) classes are considered to be *negative* and
both finished and running *with default* classes to be positive
observations. Note that the observations labeled *running* might
introduce measurement errors. Without them, the length of the table is
smaller. If you do not want to use this part of the data, uncomment the
code in the previous chunk.

The dependent and potential exogenous variables are:

```
<- data[, c("label"), drop = FALSE]
y <- data[, 4:ncol(data)] x
```

Note that the first 2 columns of `data`

are
`id`

and `status`

.

We are not able to load the data in this vignette, because it needs
an external data set and this is not available in this package. But, a
part of the data set is saved in the **ldt** package, and
we load:

```
= as.matrix(ldt::vig_data$berka$x)
x = as.matrix(ldt::vig_data$berka$y) y
```

If you have downloaded the data set files, do not run this code. Since only 22.28739% of the observations are positive, we define and use the following weight vector to balance the data:

`<- as.matrix((y == 1) * (nrow(y) / sum(y == 1)) + (y == 0)) weight `

There are 10 potential predictors and the size of the potential
predicting models is relatively large. We follow a step-wise search
approach by defining the following two arguments for the
`DcSearch_s()`

function:

```
<- list(as.integer(c(1, 2)), as.integer(c(3)))
xSizes <- c(NA, 4) xCounts
```

We also need a seed for the RNG and some other options to define the out-of-sample prediction:

```
$seed <- 340
measureOptions$simFixSize <- 10
measureOptions$trainRatio <- 0.75 measureOptions
```

Note that the out-of-sample simulation depends on random sampling. Finally, we start the search function:

```
<- list(
berka_res logit = DcSearch_s(
x = x, y = y, w = weight, costMatrices = list(frequencyCost),
xSizes = xSizes, counts = xCounts,
searchLogit = TRUE, searchProbit = FALSE,
searchItems = GetSearchItems(bestK = 20, inclusion = TRUE),
measureOptions = measureOptions,
searchOptions = GetSearchOptions(printMsg = FALSE),
savePre = NULL
),probit = DcSearch_s(
x = x, y = y, w = weight, costMatrices = list(frequencyCost),
xSizes = xSizes, counts = xCounts,
searchLogit = FALSE, searchProbit = TRUE,
searchItems = GetSearchItems(bestK = 20, inclusion = TRUE),
measureOptions = measureOptions,
searchOptions = GetSearchOptions(printMsg = FALSE),
savePre = NULL
) )
```

Since we want to compare the performance of logit and probit models,
we run two discrete choice searches. All options are similar, but one is
with `searchLogit = TRUE`

and the other is with
`searchLogit = FALSE`

and `searchProbit = TRUE`

.
The results are reported in the following plot:

Berka, Petr, and Marta Sochorova. 1993. “PKDD’99 Discovery
Challenge, a Collaborative Effort in Knowledge Discovery from
Databases.”