```
library(ldt)
library(kableExtra)
library(MASS)
```

It is recommended to read the following vignette first:

In this vignette, we draw a sample from a known population and talk
about the automatic variable selection in **ldt** (In another vignette we use an actual data set
with a similar goal). In this vignette, we focus on a simulation problem
with two endogenous variables and the following DGP: \[\begin{align}
\label{eq:sur-sim-model}
\begin{pmatrix}y_1\\y_2\end{pmatrix}=&
\begin{bmatrix}
1 & 2 & 0\\ 0 & 3 & 4
\end{bmatrix}\begin{pmatrix}x_1\\x_2\\x_3\end{pmatrix} +
\begin{pmatrix}e_1\\e_2\end{pmatrix}\\
&\begin{pmatrix}e_1\\e_2\end{pmatrix}=s
\begin{bmatrix}
2 & 1\\ 0 & 2
\end{bmatrix}\begin{pmatrix}\epsilon_1\\\epsilon_2\end{pmatrix},\quad
\epsilon_i\sim N(0,1),\\
&x_i\sim N(\frac{i}{3},1),
\end{align}\]

where \(y_1\) and \(y_2\) are endogenous variables, \(x_i\) for \(i=1,\dots,3\) are exogenous variables, \(s\) is a constant parameter, \(\epsilon_1\) and \(\epsilon_2\) are independent, and \(e_1\) and \(e_2\) are disturbances. It is easy to see that \(\operatorname{var}(e_1)=5s^2\), \(\operatorname{var}(e_2)=4s^2\), and \(\operatorname{cov}(e_1,e_2)=2s^2\). Also, note that the model is restricted and each dependent variable is explained by 2 variables.

Furthermore, we assume that there is a sample of size \(n\) for \(y_1\) and \(y_2\) and 18 other variables denoted by \(z_i\), where, \[\begin{equation} z_i\sim N(\frac{i}{18},1),\quad \text{ for } i=1,\dots,18. \end{equation}\] We assume that \(\{z_i\}\) is the set of the potential regressors. Note that \(z_{6}\), \(z_{12}\) and \(z_{18}\) are respectively equal in distribution to \(x_1\), \(x_2\) and \(x_3\) (the means are \(1/3\), \(0.5\) and \(1\), respectively).

In this example, we focus on the ability to select the true random variables and to estimate the parameters properly, in some scenarios on the \(s\) and \(n\):

```
<- c(30, 50)
sampleSizes <- c(0.1, 2) sMultipliers
```

Note that a larger value for \(s\)
(element of `sMultipliers`

) means more dispersed and more
correlated disturbances and a larger value for \(n\) (element of `sampleSizes`

)
means more degrees of freedom.

We assume that the number of endogenous variables is known.
Furthermore, to decrease the size of the model set, we assume that we
know that the number of regressors is not larger than 3. In other words,
we will use the following data (If you are not familiar with the
arguments of the `SurSearch_s()`

function, please read this vignette first):

```
<- list(as.integer(c(1, 2)), as.integer(c(3)))
xSizes <- c(NA, 5)
xCounts <- list(as.integer(c(1, 2))) yGroups
```

The following function can draw a sample from the described population:

```
<- function(n, s) {
GetSample <- NULL
z for (i in (1:18)) {
<- cbind(z, rnorm(n, i / 18, 1))
z
}<- as.matrix(z)
z <- z[, c(6, 12, 18)]
x <- as.matrix(mvrnorm(n, c(0, 0), s * s * matrix(c(5, 2, 2, 4), 2, 2)))
e <- x %*% t(matrix(c(1, 2, 0, 0, 3, 4), 2, 3, byrow = TRUE)) + e
y colnames(y) <- c("y1", "y2")
colnames(z) <- paste0("z", seq(1, 18, 1))
return(list(y = y, z = z))
}
```

Note that with `x = z[,c(6,12,18)]`

we are assuming
equality of random variables (not just equality in distribution).

In each simulation, a sample is drawn randomly. Therefore, we need a seed for the random number generator:

```
<- 340
seed set.seed(seed)
```

Given the sample, we use `SurSearch_s()`

function to find
the best model. This function will select one best mode for each of the
evaluation methods. We use the results for AIC, SIC, RMSE, and CRPS
measures:

```
<- GetMeasureOptions(
measureOptions typesIn = c("aic", "sic"),
typesOut = c("rmse", "crps")
)<- c("AIC", "SIC", "RMSE", "CRPS") measureNames
```

If you are not familiar with the arguments of the
`SurSearch_s()`

function, please read this vignette first. The
`measureNames`

vector is just a display name that we will use
later.

In order to compare the performance of different measures, we use the following scoring schema:

```
<- function(indices1, indices2) {
GetScore if (length(indices1) == 2 && length(indices2) == 2 &&
all(indices1 == c(6, 12)) && all(indices2 == c(12, 18))) {
return(1)
}<- 0
score if (any(indices1 == 6)) score <- score + 0.1
if (any(indices1 == 12)) score <- score + 0.1
if (any(indices2 == 12)) score <- score + 0.1
if (any(indices2 == 18)) score <- score + 0.1
return(score)
}
```

This schema uses the indices of the regressors in the (restricted)
best model and calculates the score. If the indices belong to the true
model (i.e., `indices1=c(6,12)`

and
`indices2=c(12,18)`

), the score is 1. otherwise each correct
index (i.e., 6 and 12 for the first equation and 12 and 18 for the
second equation) adds 0.1 points to the score. For example, this means
the maximum score for an incorrect model is 0.4. Such a model has all
the true explanatory variables and some irrelevant ones. The minimum
score is zero. This is the score of a model with no correct explanatory
variable.

The `SurSearch_s()`

function also reports the inclusion
weights. The previous function is used to get the score of such a list.
In this case, `indices1`

and `indices2`

are the
index of the first 2 explanatory variables in the inclusion list.

Before we continue, we need a helper function to get the required indices from the estimations:

```
<- function(measureRes, measureSumRes) {
helper_GetIndices <- measureSumRes$target1$model$bests$best1$coefs$isRestricted
isRest1 <- measureSumRes$target2$model$bests$best1$coefs$isRestricted
isRest2
if (is.null(isRest1)) {
<- sort(measureRes$target1$model$bests$best1$exoIndices)
mindices1 else {
} <- sort(measureRes$target1$model$bests$best1$exoIndices[isRest1[, 1] != 1])
mindices1
}if (is.null(isRest2)) {
<- sort(measureRes$target2$model$bests$best1$exoIndices)
mindices2 else {
} <- sort(measureRes$target2$model$bests$best1$exoIndices[isRest2[, 2] != 1])
mindices2
}
<- nrow(measureRes$target1$model$inclusion)
end <- sort(measureRes$target1$model$inclusion[3:end, 1, drop = F],
iindices1 index.return = TRUE, decreasing = TRUE
)<- sort(measureRes$target2$model$inclusion[3:end, 1, drop = F],
iindices2 index.return = TRUE, decreasing = TRUE
)
return(list(
model1 = sort(mindices1), model2 = sort(mindices2),
inclusion1 = sort(iindices1$ix[1:2]), inclusion2 = sort(iindices2$ix[1:2])
)) }
```

The central piece of this experiment is the following function in which we generate a sample, search for the best models for each performance measure, and calculate the score of the best models:

```
<- function(i, res, n, s, xSizes, xCounts, yGroups,
SampleAndEstimate
measureOptions, searchItems, checkItems) {<- GetSample(n, s)
sample <- SurSearch_s(
search_res y = sample$y, x = sample$z, numTargets = 2,
xSizes = xSizes, counts = xCounts,
yGroups = yGroups,
searchSigMaxIter = 1,
searchSigMaxProb = 0.1,
searchItems = searchItems,
measureOptions = measureOptions,
modelCheckItems = checkItems,
searchOptions = GetSearchOptions(printMsg = FALSE),
savePre = NULL
)
<- summary(search_res, y = sample$y, x = sample$z, test = FALSE, printMsg = FALSE)
result_sum if (i < 0) {
return(list(search_res = search_res, result_sum = result_sum))
}
<- 0
j for (mea in c("aic", "sic", "rmse", "crps")) {
<- j + 1
j <- helper_GetIndices(search_res[[mea]], result_sum[[mea]])
h $model[[i, j]] <- res$model[[i, j]] + GetScore(h$model1, h$model2)
res$inclusion[[i, j]] <- res$inclusion[[i, j]] + GetScore(h$inclusion1, h$inclusion2)
res
}return(res)
}
```

Finally, we run the simulation:

```
<- 2
simCount <- list()
sur_sim_res for (n in sampleSizes) {
<- list(model = matrix(0, length(sMultipliers), length(measureNames),
res dimnames = list(paste0("sigma=", sMultipliers), measureNames)
))$inclusion <- res$model
res
<- 0
i for (s in sMultipliers) {
<- i + 1
i for (iter in seq(1, simCount)) {
<- seed + iter
seedi $seed <- seedi
measureOptions$simFixSize <- 6
measureOptions$trainRatio <- 0.75
measureOptions<- GetSearchItems(model = TRUE, type1 = FALSE, bestK = 10, inclusion = TRUE)
searchItems <- SampleAndEstimate(
res
i, res, n, s, xSizes, xCounts, yGroups, measureOptions,NULL
searchItems,
)
}
}$model <- res$model / simCount
res$inclusion <- res$inclusion / simCount
reslength(sur_sim_res) + 1]] <- list(n = n, sMultipliers = sMultipliers, res = res)
sur_sim_res[[ }
```

You can increase the number of simulations (We choose a small number
to decrease the computation time). Note that for each
`n`

\(\in\)`sampleSizes`

and each
`s`

\(\in\)`sMultipliers`

, we sample
and estimate 2 times. This is relatively time-consuming.

Comparing the performance of different measures in selecting the true explanatory variables can be interesting. Therefore, in the following plot we draw the scores of the best models:

There is a similar result for the inclusion weights:

Apart from finding the correct variables, another important task is
estimating the unknown parameters. **ldt** provides 4 types
of information about a parameter: - estimated coefficient and its
estimated variance; - the value of the cumulative distribution function
at some specific points; - extreme bounds; and - first four moments of
the combined distribution. We specify these options by the following
code:

```
<- GetSearchItems(
searchItems model = FALSE, type1 = TRUE,
bestK = 1, cdfs = seq(0.8, 1.8, 0.03),
extremeMultiplier = 2, mixture4 = TRUE
)
```

In `SurSearch`

, `type1`

information is the
estimated parameters. Note that in order to set a proper value for
`cdfs`

, we need to run the code (at least) twice. This is
because we need to get some general idea about the range of values of
estimated coefficients (e.g., by studying the extreme bounds analysis
result).

Next, we select the values of \(n\) and \(s\):

```
<- 1
s <- 50 n
```

Next, we want to estimate two types of coefficients: an unrestricted search and a restricted one in which we discard a part of the model set with low. We assess the performance by using AIC:

`<- GetModelCheckItems(maxAic = 10) checkItems `

Note that we adjusted the value of `maxAic`

by running the
code multiple times. This is because, at the first stage, we have no
information about the magnitude of the AICs.

```
<- list(
sur_sim_one unr = SampleAndEstimate(
-1, NULL, n, s, as.integer(c(3)), c(NA), list(as.integer(c(1, 2))), measureOptions,
searchItems, checkItems
),res = SampleAndEstimate(
-1, NULL, n, s, as.integer(c(3)), c(NA), list(as.integer(c(1, 2))), measureOptions,
searchItems, checkItems
) )
```

The number of estimated regressions for the unrestricted and restricted model sets are 136 and 18, respectively.

In this example, we present the results for one of the estimated
parameters: \(\frac{\rho y_1}{\rho
x_1}\) in the following figure. Note that the actual value of
this parameter is 1.0 (see the equation in the first section). Also,
note that we use `ldt::PlotCoefs()`

and
`ldt::GetGldFromMoments()`

functions.