The `search.bin()` function is one of the three main functions in the
`ldt` package. This vignette demonstrates basic usage of this function
with the Berka and Sochorova (1993) dataset. Loan default refers to the
failure to repay a loan according to the terms agreed upon in the loan
contract. This topic has been extensively studied in the literature.
According to Manz (2019), determinants of default can include
macroeconomic events such as changes in interest and unemployment rates,
bank-specific factors such as risk management strategies, and
loan-specific factors such as its amount and purpose, as well as
borrower creditworthiness. In this section, we will use the `ldt`
package to conduct an experiment on this topic while making minimal
theoretical assumptions.

The Berka and Sochorova (1993) dataset has
a *loan table* with 682 observations, each labeled as
*finished*, *finished with default*, *running*, or
*running with default*. Each loan observation has an *account
identification* that can provide other types of information from
other tables, such as the characteristics of the account of the loan and
its transactions. Each account has a *district identification*
that can provide information about the demographic characteristics of
the location of its branch. The combined table has 58 features and 682
observations.
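The way such a combined table is built from identifiers can be sketched with base R's `merge()`. This is only an illustration on toy data: the table contents and the column names (`loan_id`, `account_id`, `district_id`, `avg_salary`) are assumptions made for the example, not the actual schema of the dataset.

```r
# Hypothetical toy tables; all names and values are illustrative assumptions.
loan <- data.frame(loan_id = 1:3,
                   account_id = c(10, 11, 10),
                   amount = c(5000, 12000, 800))
account <- data.frame(account_id = c(10, 11),
                      district_id = c(1, 2))
district <- data.frame(district_id = c(1, 2),
                       avg_salary = c(9000, 12500))

# Attach account information to each loan, then district information
# to each account; the number of loan rows is preserved.
combined <- merge(merge(loan, account, by = "account_id"),
                  district, by = "district_id")
combined <- combined[order(combined$loan_id), ]
print(combined)
```

Each loan row ends up carrying the features of its account and of the account's district, which is how a loan table can grow to many feature columns without gaining observations.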

For this example, we use just the label and the first 7 regressor
columns of the data. Here are the last few observations from this
subset:

```
tail(data)
#> label ln(amount) ln(payments) rate duration_12 duration_36 duration_60
#> [677,] 0 11.98866 8.405144 0 0 1 0
#> [678,] 0 12.77338 8.902183 0 0 0 0
#> [679,] 0 10.86880 8.383890 0 1 0 0
#> [680,] 0 11.84573 8.667680 0 0 0 0
#> [681,] 0 10.92651 7.748460 0 0 0 0
#> [682,] 0 12.39214 8.297793 0 0 0 1
#> duration_24
#> [677,] 0
#> [678,] 0
#> [679,] 0
#> [680,] 1
#> [681,] 1
#> [682,] 0
```

And here are some summary statistics for each variable:

```
sapply(as.data.frame(data), summary)
#> label ln(amount) ln(payments) rate duration_12 duration_36
#> Min. 0.000000 8.513185 5.717028 0 0.0000000 0.0000000
#> 1st Qu. 0.000000 11.108439 7.814803 0 0.0000000 0.0000000
#> Median 0.000000 11.669313 8.277412 0 0.0000000 0.0000000
#> Mean 0.111437 11.614518 8.157249 0 0.1920821 0.1906158
#> 3rd Qu. 0.000000 12.257972 8.667938 0 0.0000000 0.0000000
#> Max. 1.000000 13.289267 9.201300 0 1.0000000 1.0000000
#> duration_60 duration_24
#> Min. 0.00000 0.000000
#> 1st Qu. 0.00000 0.000000
#> Median 0.00000 0.000000
#> Mean 0.21261 0.202346
#> 3rd Qu. 0.00000 0.000000
#> Max. 1.00000 1.000000
```

The columns of the data represent the following variables:

- `label`: default = 1
- `ln(amount)`: amount of money (log)
- `ln(payments)`: monthly payments (log)
- `rate`: interest rate
- `duration_12`: dummy variable indicating a 12-month loan duration
- `duration_36`: dummy variable indicating a 36-month loan duration
- `duration_60`: dummy variable indicating a 60-month loan duration
- `duration_24`: dummy variable indicating a 24-month loan duration
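As a rough illustration of how columns of this kind can be derived, the following base R sketch builds a label, log-transformed amounts, and duration dummies from a toy raw table. The raw values, the raw column names, and the rule that maps a status to `label` (any status containing "default" becomes 1) are assumptions for the example; the vignette does not show how the packaged data were prepared.

```r
# Hypothetical raw loan records; values and column names are assumptions.
raw <- data.frame(
  status = c("finished", "finished with default",
             "running", "running with default"),
  amount = c(80952, 30276, 165960, 318480),
  payments = c(3373, 2523, 6915, 5308),
  duration = c(24, 12, 24, 60)
)

# Assumed rule: a loan counts as a default if its status mentions "default".
label <- as.numeric(grepl("default", raw$status))

# One 0/1 dummy column per duration of interest.
durations <- c(12, 36, 60, 24)
dur <- sapply(durations, function(d) as.numeric(raw$duration == d))
colnames(dur) <- paste0("duration_", durations)

features <- cbind(label = label,
                  "ln(amount)" = log(raw$amount),
                  "ln(payments)" = log(raw$payments),
                  dur)
print(features)
```

Because the result is a numeric matrix with the target in the first column, it has the same shape as the `data` object shown above.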

The target variable is the first variable. We use the out-of-sample AUC evaluation metric to find the best predicting model.

```
search_res <- search.bin(
  data = get.data(data, endogenous = 1, weights = data.berka$w),
  combinations = get.combinations(sizes = c(1, 2, 3), numTargets = 1),
  # pass the metrics as a named argument (`=`), not with the assignment operator `<-`
  metrics = get.search.metrics(typesIn = c(),
                               typesOut = c("auc"),
                               simFixSize = 20,
                               trainRatio = 0.8,
                               seed = 123),
  items = get.search.items(bestK = 0,
                           inclusion = TRUE,
                           type1 = TRUE,
                           mixture4 = TRUE))
print(search_res)
#> LDT search result:
#> Method in the search process: Binary
#> Expected number of models: 29, searched: 29 , failed: 7 (24.1%)
#> Elapsed time: 0.01686 minutes
#> Length of results: 2
#> --------
#> Failures:
#> 1. ldt::statistics->matrix singularity: 7 (24.1%)
#> --------
#> Target (label):
#> Evaluation (aucOut):
#> Inclusion weights average:
#> maximum value: 0.6738315
#> name: ln(payments)
#> count: 6
#> Mixture significant [mean-1.95*std, mean, mean+1.95*std]:
#> ln(amount): (3x1) 0.1843058, 0.6110107, 1.037716
#> ln(payments): (3x1) 0.7696544, 1.032932, 1.296209
#> --------
```
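To make the idea of an out-of-sample AUC concrete, here is a minimal base R sketch that does not use `ldt`. It assumes simulated data, a single logit model fitted with `glm()`, and 20 random 80/20 train/test splits; the helper names `auc()` and `one_split()` are hypothetical, and the package's internal simulation may differ in its details.

```r
set.seed(123)

# Simulated data (an assumption for illustration): one regressor, binary target.
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 1.5 * x))
d <- data.frame(y = y, x = x)

# AUC via the rank-sum (Mann-Whitney) formula.
auc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Fit on a random training part, evaluate on the held-out part.
one_split <- function(d, train_ratio = 0.8) {
  idx <- sample(nrow(d), floor(train_ratio * nrow(d)))
  fit <- glm(y ~ x, family = binomial(), data = d[idx, ])
  p <- predict(fit, newdata = d[-idx, ], type = "response")
  auc(p, d$y[-idx])
}

aucs <- replicate(20, one_split(d))
mean_auc <- mean(aucs)
print(mean_auc)
```

Averaging over several random splits reduces the dependence of the score on any single partition, which is the motivation for repeating the evaluation rather than splitting once.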

The output of the `search.bin()` function does not contain
any estimation results, but only the information required to replicate
them. The `summary()` function returns a similar structure
but with the estimation results included.

While choosing an out-of-sample metric indicates our interest in the
predictive power of the best model, for presentation purposes, the
`mixture4 = TRUE` part is also included in the `items` argument.
Therefore, we can use the results to study coefficient uncertainty. In
this regard, we select the four regressors with the highest inclusion
weights (note the `inclusion = TRUE` part) and plot their combined
weighted distributions. Note that the parameters of the generalized
lambda distribution are estimated from the first four moments of the
combined distribution using the `s.gld.from.moments()` function in
this package, with the distribution restricted to be unimodal with a
continuous tail.
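The notion of a combined weighted distribution can be illustrated with the first two moments of a finite mixture. The sketch below uses made-up weights, means, and variances for one coefficient estimated in three hypothetical models; it covers only the mean and variance, whereas the text above refers to four moments, and it does not reproduce how `ldt` computes its weights.

```r
# Hypothetical inputs: one coefficient as estimated by three models.
w <- c(0.5, 0.3, 0.2)     # model weights, assumed to sum to 1
m <- c(1.0, 1.2, 0.8)     # estimated coefficient in each model
v <- c(0.04, 0.09, 0.01)  # estimated variance in each model

# Mixture mean: weighted average of the component means.
mix_mean <- sum(w * m)

# Mixture variance: weighted second moment minus the squared mixture mean.
mix_var <- sum(w * (v + m^2)) - mix_mean^2
mix_std <- sqrt(mix_var)

# An interval of the same form as the one printed in the search output.
interval <- c(mix_mean - 1.95 * mix_std, mix_mean, mix_mean + 1.95 * mix_std)
print(interval)
```

The mixture variance exceeds the weighted average of the component variances whenever the models disagree on the coefficient, so the combined distribution reflects both within-model and between-model uncertainty.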

First, we prepare the data for plotting:

```
# Extract the inclusion weights from the search results
inclusion_mat <- search_res$results[sapply(search_res$results,
                                           function(a) a$typeName == "inclusion")][[1]]$value
# Drop the target and the intercept, then sort by inclusion weight
inclusion_mat <- inclusion_mat[!(rownames(inclusion_mat) %in% c("label", "(Intercept)")), ]
sorted_inclusion_mat <- inclusion_mat[order(inclusion_mat[, 1], decreasing = TRUE), ]
selected_vars <- rownames(sorted_inclusion_mat)[1:4]
# Extract the mixture moments for the selected regressors
mixture_mat <- search_res$results[sapply(search_res$results,
                                         function(a) a$typeName == "mixture")][[1]]$value
moments <- lapply(selected_vars, function(v) mixture_mat[rownames(mixture_mat) == v, ])
# Estimate the GLD parameters from the first four moments
gld_parms <- lapply(moments, function(m) s.gld.from.moments(
  m[[1]], m[[2]], m[[3]], m[[4]],
  start = c(0.25, 0.25),
  type = 4,
  nelderMeadOptions = get.options.neldermead(100000, 1e-6)))
```

Then we plot the estimated distributions:

```
# Plot the estimated GLD density for each selected regressor
probs <- seq(0.01, 0.99, 0.01)
i <- 0
for (gld in gld_parms) {
  i <- i + 1
  x <- s.gld.quantile(probs, gld[1], gld[2], gld[3], gld[4])
  y <- s.gld.density.quantile(probs, gld[1], gld[2], gld[3], gld[4])
  plot(x, y, type = "l", xaxt = "n", xlab = NA, ylab = NA, col = "blue", lwd = 2,
       main = selected_vars[i])
  # The 5% and 95% quantiles bound the central 90% interval
  lower <- x[abs(probs - 0.05) < 1e-10]
  upper <- x[abs(probs - 0.95) < 1e-10]
  axis(1, at = c(lower, 0, upper), labels = c(round(lower, 2), 0, round(upper, 2)))
  # Shade the two tails outside the 90% interval
  xleft <- x[x <= lower]
  xright <- x[x >= upper]
  yleft <- y[x <= lower]
  yright <- y[x >= upper]
  polygon(c(min(x), xleft, lower), c(0, yleft, 0), col = "gray", density = 30, angle = 45)
  polygon(c(max(x), upper, xright), c(0, 0, yright), col = "gray", density = 30, angle = -45)
  text(mean(x), mean(y), "90%", col = "gray")
}
```

The `ldt` package can be a useful tool for empirical studies that aim
to reduce assumptions and summarize the results of an uncertainty
analysis. This vignette is just a demonstration, and the `search.bin()`
function offers other options to explore. For instance, you can
experiment with different evaluation metrics or restrict the model set
based on your specific needs. Additionally, there’s an alternative
approach where you can combine modeling with Principal Component
Analysis (PCA) (see the `estim.bin()` function). I encourage you to
experiment with these options and see how they can enhance your data
analysis.

Berka, Petr, and Marta Sochorova. 1993. “PKDD’99 Discovery
Challenge, a Collaborative Effort in Knowledge Discovery from
Databases.”

Manz, Florian. 2019. “Determinants of Non-Performing Loans: What
Do We Know? A Systematic Review and Avenues for Future Research.”
*Management Review Quarterly* 69 (4): 351–89. https://doi.org/10.1007/s11301-019-00156-7.