`stratallo` Package

The goal of the *stratallo* package is to provide implementations
of efficient algorithms that solve a classical problem in survey
methodology: optimum sample allocation in stratified sampling. In
this context, the classical problem of optimum sample allocation is
understood in the Tschuprov-Neyman sense (Neyman 1934; Tschuprov
1923). It is formulated as the determination of a vector of strata
sample sizes that minimizes the variance of the *stratified \(\pi\)-estimator* of the population
total of a given study variable, under a constraint on the total sample size.
This problem can be further complemented by imposing lower or upper
bounds on sample sizes in strata.

A minor modification of the classical optimum sample allocation problem leads to the minimum cost allocation. This problem consists in determining a vector of strata sample sizes that minimizes the total cost of the survey, under an assumed fixed level of the stratified \(\pi\)-estimator’s variance. As in the case of the classical optimum allocation, the problem of minimum cost allocation can be complemented by imposing upper bounds on sample sizes in strata.

*stratallo* provides two **user functions**, `opt()` and `optcost()`, that solve the sample allocation problems briefly characterized above. In this context, it is assumed that the variance of the stratified estimator is of the following generic form: \[ V_{st}(\mathbf n) = \sum_{h=1}^{H} \frac{A_h^2}{n_h} - A_0, \] where \(H\) denotes the total number of strata, \(\mathbf n= (n_h)_{h \in \{1,\ldots,H\}}\) is the allocation vector of strata sample sizes, and the population parameters \(A_0\) and \(A_h > 0,\, h = 1,\ldots,H\), do not depend on the \(n_h,\, h = 1,\ldots,H\).

Among the stratified estimators and stratified sampling designs that
jointly give rise to a variance of the above form is the so-called
stratified \(\pi\)-estimator of the
population total with the *stratified simple random sampling without
replacement* design, which is one of the most basic and commonly
used stratified sampling designs. This case yields \(A_0 = \sum_{h = 1}^H N_h S_h^2\), \(A_h = N_h S_h,\, h = 1,\ldots,H\), where
\(S_h\) denotes the stratum standard
deviation of the study variable and \(N_h\)
is the stratum size (see e.g. Sarndal et al.
(1993), Result 3.7.2, p. 103).
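The agreement between the generic variance form and the usual textbook expression for stratified simple random sampling without replacement can be checked numerically. The following is a minimal base R sketch that does not use the package; the strata sizes, standard deviations and the allocation `x` are illustrative assumptions only.

```r
# Illustrative values (assumptions, not taken from the package).
N <- c(3000, 4000, 5000, 2000) # Strata sizes.
S <- c(48, 79, 76, 16)         # Strata standard deviations.
x <- c(30, 70, 80, 10)         # Some allocation with x_h <= N_h.

# Generic form: sum(A_h^2 / n_h) - A_0.
A <- N * S
A0 <- sum(N * S^2)
v_generic <- sum(A^2 / x) - A0

# Textbook form for stratified SRSWOR:
# sum(N_h^2 * (1 / n_h - 1 / N_h) * S_h^2).
v_textbook <- sum(N^2 * (1 / x - 1 / N) * S^2)

# The two forms agree up to floating-point error.
all.equal(v_generic, v_textbook)
```

Expanding \(N_h^2 (1/n_h - 1/N_h) S_h^2\) gives \(N_h^2 S_h^2 / n_h - N_h S_h^2\), which is where \(A_h = N_h S_h\) and \(A_0 = \sum_h N_h S_h^2\) come from.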

Apart from `opt()` and `optcost()`, *stratallo* provides the following **helper
functions**: `var_st()`, `var_st_tsi()`, `asummary()`, `ran_round()`, `round_oric()`.

Functions `var_st()` and `var_st_tsi()` compute
a value of the variance \(V_{st}\). The `var_st_tsi()` function is a simple wrapper
around `var_st()`, dedicated to the case when \(A_0 =
\sum_{h = 1}^H N_h S_h^2\) and \(A_h =
N_h S_h,\, h = 1,\ldots,H\). `asummary()` creates a `data.frame` object with
a summary of the allocation. Functions `ran_round()` and `round_oric()` are rounding
functions that can be used to round non-integer allocations (see
section Rounding, below). The package comes with three predefined,
artificial populations with 10, 507 and 969 strata. These are stored
under the `pop10_mM`, `pop507` and `pop969` objects,
respectively.

`opt()` function

The `opt()` function solves the following three problems
of optimum sample allocation, formulated in the language of
mathematical optimization. The user of `opt()` can choose whether
the solution is computed for **Problem 1**,
**Problem 2** or **Problem 3**. This is
achieved with the proper use of the `m` and `M`
arguments of the function. Also, if required, the inequality constraints
can be removed from the optimization problem. For more details, see the
help page for the `opt()` function.

**Problem 1.** Given numbers \(n > 0,\, A_h > 0,\, M_h > 0\), such that \(M_h \leq N_h,\, h = 1,\ldots,H\), and \(n \leq \sum_{h=1}^H M_h\), \[\begin{align*} \underset{\mathbf x\in {\mathbb R}_+^H}{\mathrm{minimize ~\,}} & \quad f(\mathbf x) = \sum_{h=1}^H \tfrac{A_h^2}{x_h} \\ \mathrm{subject ~ to} & \quad \sum_{h=1}^H x_h = n \\ & \quad x_h \leq M_h, \quad{h = 1,\ldots,H,} \end{align*}\] where \(\mathbf x= (x_h)_{h \in \{1,\ldots,H\}}\).

Four different algorithms are available for
**Problem 1**: *RNA* (default), *SGA*,
*SGAPLUS* and *COMA*. All of these algorithms, except
*SGAPLUS*, are described in detail in Wesołowski et al. (2021). *SGAPLUS*
is defined in Wójciak (2019) as the
*Sequential Allocation (version 1)* algorithm.

`library(stratallo)`

Define an example population.

```
N <- c(3000, 4000, 5000, 2000) # Strata sizes.
S <- c(48, 79, 76, 16) # Standard deviations of a study variable in strata.
a <- N * S
n <- 190 # Total sample size.
```

Tschuprov-Neyman allocation (no inequality constraints).

```
opt <- opt(n = n, a = a)
opt
#> [1] 31.376147 68.853211 82.798165 6.972477
sum(opt) == n
#> [1] TRUE
# Variance of the stratified estimator that corresponds to optimum allocation.
var_st_tsi(opt, N, S)
#> [1] 3940753053
```
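Without inequality constraints the optimum has a closed form, \(x_h = n A_h / \sum_{i=1}^H A_i\), the same expression that a code comment later in this document quotes as `(n / sum(a)) * a`. The base R sketch below does not call the package; it recomputes the allocation and its variance directly for the same population.

```r
N <- c(3000, 4000, 5000, 2000) # Strata sizes.
S <- c(48, 79, 76, 16)         # Strata standard deviations.
a <- N * S
n <- 190

# Closed-form Tschuprov-Neyman allocation: proportional to a_h.
x <- n * a / sum(a)
round(x, 6)

# At this allocation sum(a^2 / x) collapses to sum(a)^2 / n,
# so the variance is sum(a)^2 / n - A_0 with A_0 = sum(N * S^2).
v <- sum(a)^2 / n - sum(N * S^2)
round(v)
```

The rounded values can be compared with the output of `opt()` and `var_st_tsi()` shown above.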

One-sided upper-bounds constraints.

```
M <- c(100, 90, 70, 80) # Upper bounds imposed on the sample sizes in strata.
all(M <= N)
#> [1] TRUE
n <= sum(M)
#> [1] TRUE
# Solution to Problem 1.
opt <- opt(n = n, a = a, M = M)
opt
#> [1] 35.121951 77.073171 70.000000 7.804878
sum(opt) == n
#> [1] TRUE
all(opt <= M) # Does not violate upper-bounds constraints.
#> [1] TRUE
# Variance of the stratified estimator that corresponds to optimum allocation.
var_st_tsi(opt, N, S)
#> [1] 4018789143
```
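To give some intuition for how a recursive Neyman-type procedure can handle upper bounds, here is a simplified base R sketch. It is an illustration written for this document, not the package's implementation of *RNA*, and the function name `rna_sketch()` is hypothetical. The idea: compute the proportional allocation on the strata still active, fix every stratum that exceeds its bound at that bound, reduce the remaining sample size, and repeat.

```r
# Hypothetical helper, not part of stratallo.
rna_sketch <- function(n, a, M) {
  x <- numeric(length(a))
  active <- rep(TRUE, length(a))
  repeat {
    # Proportional (Neyman-type) allocation among active strata.
    x[active] <- n * a[active] / sum(a[active])
    over <- active & (x > M)
    if (!any(over)) break
    # Fix violating strata at their upper bounds and drop them.
    x[over] <- M[over]
    n <- n - sum(M[over])
    active <- active & !over
  }
  x
}

N <- c(3000, 4000, 5000, 2000)
S <- c(48, 79, 76, 16)
a <- N * S
M <- c(100, 90, 70, 80)

x <- rna_sketch(n = 190, a = a, M = M)
round(x, 6)
```

For this population only the third stratum is capped (at 70), and the remaining 120 units are spread proportionally over the other three strata, which matches the allocation printed above.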

**Problem 2.** Given numbers \(n,\, A_h > 0,\, m_h > 0\), such that \(m_h \leq N_h,\, h = 1,\ldots,H\), and \(n \geq \sum_{h=1}^H m_h\), \[\begin{align*} \underset{\mathbf x\in {\mathbb R}_+^H}{\mathrm{minimize ~\,}} & \quad f(\mathbf x) = \sum_{h=1}^H \tfrac{A_h^2}{x_h} \\ \mathrm{subject ~ to} & \quad \sum_{h=1}^H x_h = n \\ & \quad x_h \geq m_h, \quad{h = 1,\ldots,H,} \end{align*}\] where \(\mathbf x= (x_h)_{h \in \{1,\ldots,H\}}\).

The optimization **Problem 2** is solved by the
*LRNA* algorithm, which is in principle based on the *RNA* and is
introduced in Wójciak (2023).

```
m <- c(50, 120, 1, 2) # Lower bounds imposed on the sample sizes in strata.
n >= sum(m)
#> [1] TRUE
# Solution to Problem 2.
opt <- opt(n = n, a = a, m = m)
opt
#> [1] 50 120 18 2
sum(opt) == n
#> [1] TRUE
all(opt >= m) # Does not violate lower-bounds constraints.
#> [1] TRUE
# Variance of the stratified estimator that corresponds to optimum allocation.
var_st_tsi(opt, N, S)
#> [1] 9719807556
```

**Problem 3.** Given numbers \(n,\, A_h > 0,\, m_h > 0,\, M_h > 0\), such that \(m_h < M_h \leq N_h,\, h = 1,\ldots,H\), and \(\sum_{h=1}^H m_h \leq n \leq \sum_{h=1}^H M_h\), \[\begin{align*} \underset{\mathbf x\in {\mathbb R}_+^H}{\mathrm{minimize ~\,}} & \quad f(\mathbf x) = \sum_{h=1}^H \tfrac{A_h^2}{x_h} \\ \mathrm{subject ~ to} & \quad \sum_{h=1}^H x_h = n \\ & \quad m_h \leq x_h \leq M_h, \quad{h = 1,\ldots,H,} \end{align*}\] where \(\mathbf x= (x_h)_{h \in \{1,\ldots,H\}}\).

The optimization **Problem 3** is solved by
*RNABOX*, a new algorithm proposed by the authors of this
package that is to be published soon.

```
m <- c(100, 90, 500, 50) # Lower bounds imposed on sample sizes in strata.
M <- c(300, 400, 800, 90) # Upper bounds imposed on sample sizes in strata.
n <- 1284
n >= sum(m) && n <= sum(M)
#> [1] TRUE
# Optimum allocation under box constraints.
opt <- opt(n = n, a = a, m = m, M = M)
opt
#> [1] 228.9496 400.0000 604.1727 50.8777
sum(opt) == n
#> [1] TRUE
all(opt >= m & opt <= M) # Does not violate any lower or upper bounds constraints.
#> [1] TRUE
# Variance of the stratified estimator that corresponds to optimum allocation.
var_st_tsi(opt, N, S)
#> [1] 538073357
```

`optcost()` function

The `optcost()` function solves the following minimum
total cost allocation problem, formulated in the language of
mathematical optimization.

**Problem 4.** Given numbers \(A_h > 0,\, c_h > 0,\, M_h > 0\), such that \(M_h \leq N_h,\, h = 1,\ldots,H\), and \(V \geq \sum_{h=1}^H \tfrac{A_h^2}{M_h} - A_0\), \[\begin{align*} \underset{\mathbf x\in {\mathbb R}_+^H}{\mathrm{minimize ~\,}} & \quad c(\mathbf x) = \sum_{h=1}^H c_h x_h \\ \mathrm{subject ~ to} & \quad \sum_{h=1}^H \tfrac{A_h^2}{x_h} - A_0 = V \\ & \quad x_h \leq M_h, \quad{h = 1,\ldots,H,} \end{align*}\] where \(\mathbf x= (x_h)_{h \in \{1,\ldots,H\}}\).

The algorithm that solves **Problem 4** is based on the
*LRNA* and it is described in Wójciak
(2023).

```
a <- c(3000, 4000, 5000, 2000)
a0 <- 70000
unit_costs <- c(0.5, 0.6, 0.6, 0.3) # c_h, h = 1,...,4.
M <- c(100, 90, 70, 80)
V <- 1e6 # Variance constraint.
V >= sum(a^2 / M) - a0
#> [1] TRUE
opt <- optcost(V = V, a = a, a0 = a0, M = M, unit_costs = unit_costs)
opt
#> [1] 40.39682 49.16944 61.46181 34.76805
sum(a^2 / opt) - a0 == V
#> [1] TRUE
all(opt <= M)
#> [1] TRUE
```
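When none of the upper bounds is active, as in this example, the classical Lagrange-multiplier solution of the minimum cost problem is \(x_h = \frac{A_h}{\sqrt{c_h}} \cdot \frac{\sum_{i=1}^H A_i \sqrt{c_i}}{V + A_0}\). The base R sketch below evaluates this formula directly, without the package, as a plausibility check; it does not cover the case where some bound \(M_h\) is reached.

```r
a <- c(3000, 4000, 5000, 2000)
a0 <- 70000
unit_costs <- c(0.5, 0.6, 0.6, 0.3)
M <- c(100, 90, 70, 80)
V <- 1e6

# Closed form, valid only if the result respects all upper bounds.
x <- (a / sqrt(unit_costs)) * sum(a * sqrt(unit_costs)) / (V + a0)
round(x, 5)

all(x <= M)                     # Upper bounds are not active here.
all.equal(sum(a^2 / x) - a0, V) # Variance constraint holds.
sum(unit_costs * x)             # Resulting total cost.
```

Note the use of `all.equal()` rather than `==` for the variance check, since the left-hand side is computed in floating-point arithmetic.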

Rounding

*stratallo* comes with two functions, `ran_round()` and `round_oric()`,
that can be used to round non-integer
allocations.

```
m <- c(100, 90, 500, 50)
M <- c(300, 400, 800, 90)
n <- 1284

# Optimum, non-integer allocation under box constraints.
opt <- opt(n = n, a = a, m = m, M = M)
opt
#> [1] 297.4286 396.5714 500.0000 90.0000
opt_int <- round_oric(opt)
opt_int
#> [1] 297 397 500 90
```

You can install the released version of the *stratallo* package
from CRAN with:

`install.packages("stratallo")`

Consider the following example:

```
N <- c(3000, 4000, 5000, 2000)
S <- c(48, 79, 76, 17)
a <- N * S
n <- 190
opt <- opt(n = n, a = a) # which after simplification is (n / sum(a)) * a
opt
#> [1] 31.304348 68.695652 82.608696 7.391304
```

and note that

```
sum(opt) == n
#> [1] FALSE
```

which results from the fact that

```
options(digits = 22)
sum(opt)
#> [1] 190.0000000000000284217
sum((n / sum(a)) * a) == n # mathematically, it should be TRUE!
#> [1] FALSE
```
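A common way to make such checks robust is to compare with a numerical tolerance instead of `==`. The base R sketch below recomputes the allocation from the simplified formula, so it does not need the package; the absolute tolerance `1e-8` in the last line is an arbitrary illustrative choice.

```r
N <- c(3000, 4000, 5000, 2000)
S <- c(48, 79, 76, 17)
a <- N * S
n <- 190
x <- (n / sum(a)) * a

# all.equal() compares up to a small relative tolerance.
isTRUE(all.equal(sum(x), n))

# Equivalent idea with an explicit absolute tolerance.
abs(sum(x) - n) < 1e-8
```

Both comparisons treat the tiny discrepancy in the last digits as rounding error rather than as a violation of the constraint \(\sum_{h} x_h = n\).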

Neyman, J. (1934), “On the Two Different Aspects of the
Representative Method: The Method of Stratified Sampling and the Method
of Purposive Selection,” *Journal of the Royal Statistical
Society*, 97, 558–606.

Sarndal, C.-E., Swensson, B., and Wretman, J. (1993), *Model Assisted
Survey Sampling*, Springer.

Tschuprov, A. A. (1923), “On the Mathematical Expectation of the
Moments of Frequency Distributions in the Case of Correlated
Observations,” *Metron*, 2, 461–493, 646–683.

Wesołowski, J., Wieczorkowski, R., and Wójciak, W. (2021),
“Optimality of the Recursive Neyman Allocation,”
*Journal of Survey Statistics and Methodology*. https://doi.org/10.1093/jssam/smab018.
https://arxiv.org/abs/2105.14486.

Wójciak, W. (2019), “Optimum Allocation in Stratified Sampling
Schemes,” *MSc Thesis*, Warsaw University of Technology.
http://home.elka.pw.edu.pl/~wwojciak/msc_optimal_allocation.pdf.

Wójciak, W. (2023), “Another Solution of Some Optimum Allocation
Problem.”