Iterative Proportional Fitting

This vignette explains the usage of the ipf() function, which has been used for calibrating the labour force survey of Austria for several years. It is based on the Iterative Proportional Fitting algorithm and gives some flexibility about the details of the implementation. See (Meraner, Gumprecht, and Kowarik 2016) or vignette("methodology") for more details.

Setup

We will assume the output of demo.eusilc() is our population. From this population, a sample without replacement is drawn. The sample covers 10 percent of the population. We assign a weight of one for all observations of the population and a weight of ten for all observations of the sample.

library(surveysd)
population <- demo.eusilc(1, prettyNames = TRUE)
population[, pWeight := 1]
pop_sample <- population[sample(1:.N, floor(.N*0.10)), ]
pop_sample[, pWeight := 10]

One constraint, one variable

We will start with an example where we want to adapt the weights of pop_sample such that the weighted number of males and females matches the ones of population. We can see that this is currently not the case.

(gender_distribution <- xtabs(pWeight ~ gender, population))
#> gender
#>   male female 
#>   7267   7560
xtabs(pWeight ~ gender, pop_sample)
#> gender
#>   male female 
#>   7280   7540

Due to random sampling (rather than stratified sampling), there are differences between the gender distributions. We can pass gender_distribution as a parameter to ipf() to obtain modified weights.

pop_sample_c <- ipf(pop_sample, conP = list(gender_distribution), w = "pWeight")

The resulting dataset, pop_sample_c is similar to pop_sample but has an additional column with the adjusted weights.

dim(pop_sample)
#> [1] 1482   30
dim(pop_sample_c)
#> [1] 1482   31
setdiff(names(pop_sample_c), names(pop_sample))
#> [1] "calibWeight"

We can now calculate the weighted number of males and females according to this new weight. This will result in a match for the constraints.

xtabs(calibWeight ~ gender, pop_sample_c)
#> gender
#>   male female 
#>   7267   7560
xtabs(pWeight ~ gender, population)
#> gender
#>   male female 
#>   7267   7560

In this simple case, ipf just performs a post stratification step. This means, that all males and all females have the same weight.

xtabs(~ calibWeight + gender, pop_sample_c)
#>                   gender
#> calibWeight        male female
#>   9.98214285714286  728      0
#>   10.026525198939     0    754

All males have been weighted down (calibWeight < 10) to compensate for the overrepresentation in the sample.

One constraint, two variables

Let’s now assume that we want to put constraints on the number of males and females for each age group. The numbers from the original population can be obtained with xtabs().

(con_ga <- xtabs(pWeight ~ gender + age, population))
#>         age
#> gender   (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#>   male        1528     855    2165    1822       897
#>   female      1375     848    2255    1845      1237
xtabs(pWeight ~ gender + age, pop_sample)
#>         age
#> gender   (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#>   male        1300     850    2280    1840      1010
#>   female      1210     970    2350    1830      1180

Again, we can see that those constraints are not met. Supplying the contingency table con_ga to ipf() will again resolve this.

pop_sample_c2 <- ipf(pop_sample, conP = list(con_ga), w = "pWeight")
xtabs(pWeight ~ gender + age, population)
#>         age
#> gender   (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#>   male        1528     855    2165    1822       897
#>   female      1375     848    2255    1845      1237
xtabs(calibWeight ~ gender + age, pop_sample_c2)
#>         age
#> gender   (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#>   male        1528     855    2165    1822       897
#>   female      1375     848    2255    1845      1237

Two constraints

Now we assume that we know the number of persons living in each nuts2 region from registry data.

registry_table <- xtabs(pWeight ~ region, population)

However, those registry data does not provide any information about age or gender. Therefore, the two contingency tables (con_ga and registry_table) have to be specified independently. This can be done by supplying a list to conP

pop_sample_c2 <- ipf(pop_sample, conP = list(con_ga, registry_table), w = "pWeight")
xtabs(pWeight ~ gender + age, population)
#>         age
#> gender   (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#>   male        1528     855    2165    1822       897
#>   female      1375     848    2255    1845      1237
xtabs(calibWeight ~ gender + age, pop_sample_c2)
#>         age
#> gender   (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#>   male        1528     855    2165    1822       897
#>   female      1375     848    2255    1845      1237
xtabs(pWeight ~ region, population)
#> region
#>    Burgenland     Carinthia Lower Austria      Salzburg        Styria 
#>           549          1078          2804           924          2295 
#>         Tyrol Upper Austria        Vienna    Vorarlberg 
#>          1317          2805          2322           733
xtabs(calibWeight ~ region, pop_sample_c2)
#> region
#>    Burgenland     Carinthia Lower Austria      Salzburg        Styria 
#>      549.0000     1077.9999     2804.0000      924.0002     2294.9999 
#>         Tyrol Upper Austria        Vienna    Vorarlberg 
#>     1316.9996     2805.0003     2322.0001      733.0001

this time, the constraints are not matched perfectly. That is, because we provided more than one constraint. therefore, the ipf() algorithm had to work iteratively.

Household Constraints

If the dataset has a household structure, household constraints can be passed via the parameter conH. If this parameter is used, it is also necessary to supply hid, which defines the column names that contains household ids.

(conH1 <- xtabs(pWeight ~ hsize + region, data = population[!duplicated(hid)]))
#>          region
#> hsize     Burgenland Carinthia Lower Austria Salzburg Styria Tyrol
#>   (0,1]           58       117           325      103    264   118
#>   (1,2]           82       126           345      102    260   149
#>   (2,3]           37        80           189       55    187    79
#>   (3,4]           33        63           169       71    122   102
#>   (4,5]           14        22            82       18     49    37
#>   (5,Inf]          2        17            21       12     34    11
#>          region
#> hsize     Upper Austria Vienna Vorarlberg
#>   (0,1]             262    431         67
#>   (1,2]             321    355         72
#>   (2,3]             203    175         44
#>   (3,4]             168     96         53
#>   (4,5]              79     35         27
#>   (5,Inf]            35     15          7
pop_sample_hh <- ipf(pop_sample, hid = "hid", conH = list(conH1), w = "pWeight",
                     bound = 10)
xtabs(calibWeight ~ hsize + region, data = pop_sample_hh[!duplicated(hid)])
#>          region
#> hsize     Burgenland Carinthia Lower Austria Salzburg Styria Tyrol
#>   (0,1]            0         0             0        0      0     0
#>   (1,2]            0         0             0        0      0     0
#>   (2,3]            0         0             0        0      0     0
#>   (3,4]            0         0             0        0      0     0
#>   (4,5]            0         0             0        0      0     0
#>   (5,Inf]          0         0             0        0      0     0
#>          region
#> hsize     Upper Austria Vienna Vorarlberg
#>   (0,1]               0      0          0
#>   (1,2]               0      0          0
#>   (2,3]               0      0          0
#>   (3,4]               0      0          0
#>   (4,5]               0      0          0
#>   (5,Inf]             0      0          0

Tolerances

If conP or conH contain several contingency tables or if conP and conH are used at the same time, the ipf algorithm will operate iteratively. This means that the calibrated dataset will satisfy the constraints only approximately. The default tolerances of the approximation can be overwritten using the parameters conP and conH.

Lowering the tolerances will improve the match between the constraints and the contingency tables according to the calibrated weights. However, lower tolerances will also make it so more iterations are necessary until a convergence is met. If the constraints are too small, ipf will return with a warning that indicates that a convergence could not be reached.

ipf(pop_sample, conP = list(con_ga, registry_table), w = "pWeight",
    verbose = TRUE, epsP = 0.01)
#> Iteration stopped after 2 steps
#> Convergence reached
ipf(pop_sample, conP = list(con_ga, registry_table), w = "pWeight",
    verbose = TRUE, epsP = 0.0001)
#> Iteration stopped after 4 steps
#> Convergence reached

We see that changing the tolerances from 0.01 (one percent) to 0.0001 increases the number of required iterations.