`library(ldt)`

It is recommended to read the following vignettes first:

In this vignette, we discuss fraud detection. The data in this
example was published as part of a competition in which the winning
model achieved an AUC of 0.945884 (see Vesta Corporation
(2018)). We do not take part in that competition, nor do we use its
test sample to compare the performance of discrete choice modeling
with machine learning approaches; that is
beyond the scope of this vignette (see, e.g., Clarke, Fokoue, and Zhang (2009) for a
discussion). Instead, we use this large data set to discuss
the performance of **ldt** when the data is *big*.

We use Vesta Corporation (2018) and the `Data_VestaFraud()` function to get the required data:

`vestadata <- Data_VestaFraud(training = TRUE)`

In this data set, there are two samples: train and test. We use
the training sample in this vignette. The observations are labeled
*fraud* or *not fraud*. There are 393 features in the
files. Furthermore, each observation has an identifier that links part of the
observations to another data file with 40 identity-related features. The
combined data set has 476 features and 281 million data points, of which
46.1% are `NA`.

Dependent and potential exogenous data are:

```
y <- as.matrix(vestadata$data[, c("isFraud")])
x <- as.matrix(vestadata$data[, 3:length(vestadata$data)])
weight <- as.matrix((y == 1) * (nrow(y) / sum(y == 1)) + (y == 0))
```
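The weight expression assigns each fraud observation the weight `n / n1` (sample size over the number of fraud cases) and each non-fraud observation the weight 1. A small illustration with a made-up label vector (the values are hypothetical and not taken from the Vesta data) shows the effect:

```r
# Toy label vector (hypothetical values, not the Vesta data)
y_toy <- as.matrix(c(1, 0, 0, 0, 1, 0, 0, 0, 0, 0))

# Same expression as in the vignette: positives get n / n1, negatives get 1
w_toy <- as.matrix((y_toy == 1) * (nrow(y_toy) / sum(y_toy == 1)) + (y_toy == 0))

# Total weight carried by each class
weight_pos <- sum(w_toy[y_toy == 1])  # 2 observations, each weighted 10 / 2
weight_neg <- sum(w_toy[y_toy == 0])  # 8 observations, each weighted 1
```

With this scheme the rare class carries a total weight equal to the sample size, so it is not dominated by the much larger non-fraud class.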

Since the data is large and to increase the speed of the calculations, we change the default optimization options:

`optimOptions <- GetNewtonOptions(maxIterations = 10, functionTol = 1e-2)`
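To give a sense of what a low iteration cap and a loose function tolerance trade away, here is a generic Newton-Raphson routine for a logit model in base R. It is only a sketch: it is not the optimizer inside **ldt**, the function `newton_logit()` and its arguments `max_iterations` and `function_tol` are hypothetical stand-ins, and it assumes the tolerance applies to the change in the log-likelihood between iterations.

```r
# Generic Newton-Raphson for a logit model (illustrative; not ldt's internal code)
newton_logit <- function(X, y, max_iterations = 10, function_tol = 1e-2) {
  beta <- rep(0, ncol(X))
  loglik <- function(b) {
    eta <- drop(X %*% b)
    sum(y * eta - log1p(exp(eta)))
  }
  f_old <- loglik(beta)
  iterations <- 0
  for (i in seq_len(max_iterations)) {
    p <- plogis(drop(X %*% beta))
    gradient <- crossprod(X, y - p)              # score vector
    hessian <- -crossprod(X * (p * (1 - p)), X)  # second derivative matrix
    beta <- beta - drop(solve(hessian, gradient))
    f_new <- loglik(beta)
    iterations <- i
    # Stop once the log-likelihood barely changes
    if (abs(f_new - f_old) < function_tol) break
    f_old <- f_new
  }
  list(coef = beta, iterations = iterations, loglik = loglik(beta))
}

# Simulated data (hypothetical, for illustration only)
set.seed(1)
n <- 500
x1 <- rnorm(n)
X_sim <- cbind(1, x1)
y_sim <- rbinom(n, 1, plogis(-1 + 0.8 * x1))

loose <- newton_logit(X_sim, y_sim)  # cap of 10 iterations, tolerance 1e-2
tight <- newton_logit(X_sim, y_sim, max_iterations = 50, function_tol = 1e-10)
```

The loose settings stop no later than the tight ones, at the cost of coefficients that are only approximately converged. When many models are estimated, that trade-off can be acceptable.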

We also choose to search a small subset of the model set:

```
xSizes <- list(as.integer(c(1)), as.integer(c(2)), as.integer(c(3)), as.integer(c(4:10)))
xCounts <- c(NA, 20, 15, 10)
```
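The need to restrict the search becomes clear from the number of distinct regressor subsets of a given size. The sketch below assumes 474 candidate columns (476 features minus the two leading columns skipped by `3:length(...)`); the exact count is an assumption made for illustration.

```r
# Assumed number of candidate regressors (hypothetical: 476 columns minus the first two)
p <- 474

# Number of distinct regressor subsets of size k, for k = 1, ..., 4
sizes <- 1:4
n_models <- choose(p, sizes)

data.frame(size = sizes, models = n_models)
```

Under this assumption there are already more than two billion subsets of size four, so an exhaustive search over all sizes is not practical.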

And, a relatively small out-of-sample simulation:

`simFixSize <- 4`

And finally, we search for the best model:

```
vestaRes <- DcSearch_s(
  x = x, y = y, w = weight,
  xSizes = xSizes, counts = xCounts,
  optimOptions = optimOptions,
  searchItems = GetSearchItems(bestK = 20),
  modelCheckItems = GetModelCheckItems(
    maxConditionNumber = 1e15, minDof = 1e5, minOutSim = simFixSize / 2
  ),
  measureOptions = GetMeasureOptions(
    typesIn = c("aucIn"),
    typesOut = c("aucOut"),
    simFixSize = simFixSize,
    trainRatio = 0.9,
    seed = 340
  ),
  searchOptions = GetSearchOptions(printMsg = FALSE),
  printMsg = FALSE,
  savePre = "data/dc_vesta_"
)
```
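The out-of-sample options in this call can be read as repeated random splits: `simFixSize` repetitions, each estimating on a `trainRatio` share of the rows and evaluating on the rest. How **ldt** draws these splits internally is not shown in this vignette; the following base R sketch, with a hypothetical helper `make_splits()`, only illustrates the idea.

```r
# Hypothetical helper: draw `n_sim` random train/test splits of `n_obs` rows
make_splits <- function(n_obs, n_sim = 4, train_ratio = 0.9, seed = 340) {
  set.seed(seed)  # fixed seed so the splits are reproducible
  n_train <- floor(train_ratio * n_obs)
  lapply(seq_len(n_sim), function(i) {
    train <- sort(sample.int(n_obs, n_train))
    list(train = train, test = setdiff(seq_len(n_obs), train))
  })
}

splits <- make_splits(n_obs = 1000)

# Size of the evaluation part in each repetition
test_sizes <- lengths(lapply(splits, `[[`, "test"))
```

A model would then be estimated on each `train` index set and scored on the matching `test` set, with the out-of-sample AUC averaged over the repetitions.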

We can get the in-sample and out-of-sample AUC by the following code:

```
print(paste0("Best In-Sample AUC: ", vestaRes$aucIn$target1$model$bests$best1$weight))
print(paste0("Best Out-Of-Sample AUC: ", vestaRes$aucOut$target1$model$bests$best1$weight))
```
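For readers less familiar with the metric, AUC can be interpreted as the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case. The following is a minimal base R sketch using the rank (Mann-Whitney) formula. It is unweighted, so it may differ from the value **ldt** reports when observation weights are used, and `auc_rank()` and the toy data are hypothetical.

```r
# Hypothetical helper: unweighted AUC from the rank-sum formula
auc_rank <- function(score, label) {
  r <- rank(score)  # average ranks, so ties count as half
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 3 positives, 3 negatives (made-up scores)
label <- c(0, 0, 1, 0, 1, 1)
score <- c(0.1, 0.4, 0.35, 0.2, 0.8, 0.7)

auc_toy <- auc_rank(score, label)  # 8 of the 9 positive-negative pairs are ordered correctly
```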

Note that the code in this vignette is not evaluated here because the data set is large.

The computations are relatively time-consuming even when we choose a small subset of the model set. One might be able to improve the current performance by making better use of the categorical features (e.g., by studying them and grouping some items). Furthermore, on the development side, any improvement in the speed of the calculations would allow us to search a larger proportion of the data set. In the presence of categorical variables, a more efficient algorithm for handling dummy variables or sparse matrices might be helpful.
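As an illustration of grouping items in a categorical feature, the sketch below lumps rare levels into a single `other` level before creating dummy variables. The helper `lump_rare()`, the threshold, and the toy data are hypothetical, and nothing here is part of **ldt**.

```r
# Hypothetical helper: replace levels seen fewer than `min_count` times by "other"
lump_rare <- function(f, min_count = 2) {
  f <- as.character(f)
  counts <- table(f)
  rare <- names(counts)[counts < min_count]
  f[f %in% rare] <- "other"
  factor(f)
}

# Toy categorical feature (made-up values)
card_type <- c("visa", "visa", "mastercard", "amex", "visa", "mastercard", "discover")
grouped <- lump_rare(card_type, min_count = 2)

# One dummy column per level except the baseline; the intercept column is dropped
dummies <- model.matrix(~ grouped)[, -1, drop = FALSE]
```

Grouping reduces four levels to three here, which means fewer dummy columns to search over. For very wide dummy sets, a sparse representation (for example `Matrix::sparse.model.matrix()`) may reduce memory use, although this vignette does not establish whether **ldt** accepts sparse input.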

Clarke, Bertrand, Ernest Fokoue, and Hao Helen Zhang. 2009.
*Principles and Theory for Data Mining and Machine Learning*.
Springer New York, NY. https://doi.org/10.1007/978-0-387-98135-2.

Vesta Corporation. 2018. “IEEE-CIS Fraud Detection.”