Basic unit-level models

The basic unit-level model (Battese, Harter, and Fuller 1988; Rao and Molina 2015) is given by \[ y_j = \beta' x_j + v_{i[j]} + \epsilon_j\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \qquad \epsilon_j \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma^2) \] where \(j\) runs from 1 to \(n\), the number of unit-level observations, \(\beta\) is a vector of regression coefficients for given covariates \(x_j\), and \(v_i\) are random area intercepts.

We use the api dataset included in packages survey.

library(survey)
data(api)
apipop$cname <- as.factor(apipop$cname)
apisrs$cname <- factor(apisrs$cname, levels=levels(apipop$cname))

The apipop data frame contains the complete population whereas apisrs is a simple random sample from it. The variable cname is the county name, and we will be interested in estimation at the county level. Not all counties in the population are sampled. In order to be able to make predictions for out-of-sample areas we make sure that the levels of the sample’s cname variable match those of its population counterpart.

The basic unit-level model with county random area effects is fit as follows

library(mcmcsae)
mod <- api00 ~ 
  reg(~ ell + meals + stype + hsg + col.grad + grad.sch, name="beta") +
  gen(factor = ~ iid(cname), name="v")
sampler <- create_sampler(mod, data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
(summary(sim))
## llh_ :
##       Mean   SD t-value  MCSE q0.05  q0.5 q0.95 n_eff R_hat
## llh_ -1090 4.43    -246 0.129 -1097 -1089 -1083  1181     1
## 
## sigma_ :
##        Mean   SD t-value   MCSE q0.05 q0.5 q0.95 n_eff R_hat
## sigma_ 56.3 3.01    18.7 0.0681  51.5 56.2  61.4  1949     1
## 
## beta :
##                 Mean     SD t-value    MCSE    q0.05     q0.5    q0.95 n_eff R_hat
## (Intercept)  801.389 25.166   31.84 0.56242  760.223  801.031 843.5891  2002 0.999
## ell           -1.968  0.298   -6.60 0.00684   -2.439   -1.970  -1.4612  1902 1.000
## meals         -1.955  0.278   -7.03 0.00585   -2.416   -1.962  -1.4926  2266 1.000
## stypeH      -106.631 13.464   -7.92 0.26382 -129.126 -106.548 -84.8618  2604 0.999
## stypeM       -60.285 11.553   -5.22 0.22015  -78.696  -60.479 -40.9994  2754 1.000
## hsg           -0.603  0.418   -1.44 0.00861   -1.285   -0.610   0.0789  2352 1.000
## col.grad       0.589  0.501    1.18 0.01199   -0.233    0.591   1.4229  1744 1.001
## grad.sch       2.134  0.468    4.56 0.00912    1.361    2.139   2.8832  2636 1.000
## 
## v_sigma :
##         Mean   SD t-value  MCSE q0.05 q0.5 q0.95 n_eff R_hat
## v_sigma 22.8 6.53    3.49 0.222  13.1 22.2  34.1   866     1
## 
## v :
##                  Mean   SD  t-value  MCSE q0.05     q0.5  q0.95 n_eff R_hat
## Alameda      -24.9218 15.5 -1.60881 0.396 -51.5 -24.2639 -0.963  1533 1.005
## Amador         0.1177 23.2  0.00508 0.423 -38.9   0.2556 36.789  3000 1.000
## Butte          0.0418 23.3  0.00179 0.438 -37.8   0.1290 38.149  2833 0.999
## Calaveras      8.0988 22.0  0.36763 0.418 -26.4   7.2349 45.825  2774 1.000
## Colusa        -0.5346 23.4 -0.02286 0.427 -39.6   0.0135 36.942  3000 1.000
## Contra Costa  -9.1719 15.1 -0.60699 0.341 -33.5  -8.7300 14.944  1968 1.002
## Del Norte     -0.1754 24.2 -0.00724 0.464 -40.5  -0.3683 39.124  2725 1.001
## El Dorado      0.6086 23.6  0.02574 0.432 -37.6   0.7744 39.464  3000 0.999
## Fresno         0.4942 15.1  0.03283 0.314 -24.2   1.0186 24.837  2298 1.001
## Glenn         -0.1746 23.7 -0.00738 0.448 -37.9  -0.0410 38.479  2790 1.000
## ... 47 elements suppressed ...

We want to estimate the area population means \[ \theta_i = \frac{1}{N_i}\sum_{j \in U_i} y_j\,, \] where \(U_i\) is the set of units in area \(i\) of size \(N_i\). The MCMC output in variable sim can be used to obtain draws from the posterior distribution for \(\theta_i\). The \(r\)th draw can be expressed as \[ \theta_{i;r} = \frac{1}{N_i} \left(n_i \bar{y}_i + \beta_r'(t_{x;i} - n_i \bar{x}_i) + (N_i - n_i)v_{i;r} + \sum_{j \in U_i\setminus s_i} \epsilon_{j;r} \right)\,, \] where \(\bar{y}_i\) is the sample mean of \(y\) in area \(i\) and \(t_{x;i}\) is a vector of population totals for area \(i\).

N <- table(apipop$cname)
samplesums <- tapply(apisrs$api00, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0  # substitute 0 for out-of-sample areas
m <- match(apisrs$cds, apipop$cds)  # population units in the sample
res <- predict(sim, newdata=apipop, labels=names(N),
         fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
         show.progress=FALSE)
(summ <- summary(res))
##                 Mean    SD t-value  MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda          679 14.72    46.1 0.317   654  679   703  2154 1.002
## Amador           738 29.84    24.7 0.565   690  739   788  2790 0.999
## Butte            679 25.54    26.6 0.509   639  679   721  2516 1.000
## Calaveras        743 28.25    26.3 0.610   698  743   791  2146 1.000
## Colusa           551 31.25    17.6 0.599   500  550   601  2725 1.001
## Contra Costa     727 14.62    49.7 0.270   703  727   751  2944 1.002
## Del Norte        678 32.90    20.6 0.674   626  679   732  2386 1.000
## El Dorado        754 26.65    28.3 0.506   710  754   798  2774 0.999
## Fresno           595 14.82    40.1 0.271   570  595   619  3000 1.000
## Glenn            625 30.98    20.2 0.596   574  625   676  2706 1.000
## Humboldt         701 25.99    27.0 0.542   660  701   744  2296 1.003
## Imperial         557 24.67    22.6 0.459   517  557   597  2886 1.000
## Inyo             707 32.62    21.7 0.621   651  708   759  2755 0.999
## Kern             596 15.04    39.6 0.312   571  596   621  2321 1.000
## Kings            595 22.74    26.2 0.450   555  596   632  2554 1.001
## Lake             662 24.80    26.7 0.558   621  662   704  1976 1.001
## Lassen           706 25.01    28.2 0.457   666  706   746  3000 1.000
## Los Angeles      636  8.15    78.0 0.149   622  636   650  3000 1.000
## Madera           615 19.75    31.1 0.364   582  615   647  2946 1.000
## Marin            816 20.52    39.8 0.400   783  816   850  2630 1.001
## Mariposa         723 34.80    20.8 0.727   668  722   781  2289 1.000
## Mendocino        664 27.34    24.3 0.543   620  664   708  2538 1.001
## Merced           578 23.02    25.1 0.454   541  577   616  2576 1.000
## Modoc            671 30.45    22.0 0.657   624  670   722  2148 1.000
## Mono             704 41.38    17.0 0.772   634  703   773  2873 1.000
## Monterey         628 17.76    35.4 0.326   598  629   656  2959 0.999
## Napa             702 21.35    32.9 0.390   667  702   737  3000 1.000
## Nevada           799 28.83    27.7 0.562   751  799   845  2634 1.000
## Orange           692 15.56    44.5 0.336   667  692   718  2141 1.001
## Placer           772 23.85    32.4 0.458   736  771   812  2712 1.000
## Plumas           691 30.97    22.3 0.579   641  691   743  2857 1.000
## Riverside        635 14.55    43.7 0.266   611  636   659  3000 1.001
## Sacramento       668 14.98    44.6 0.274   643  668   692  3000 0.999
## San Benito       708 30.03    23.6 0.577   660  708   757  2708 0.999
## San Bernardino   647 12.83    50.4 0.242   625  647   668  2803 0.999
## San Diego        704 14.84    47.4 0.356   681  704   730  1737 1.002
## San Francisco    638 19.71    32.4 0.360   607  637   671  2996 1.000
## San Joaquin      630 17.78    35.4 0.426   599  631   658  1744 1.001
## San Luis Obispo  752 23.59    31.9 0.444   714  752   791  2823 1.001
## San Mateo        734 21.55    34.1 0.404   699  733   770  2847 1.001
## Santa Barbara    678 19.51    34.8 0.363   647  678   711  2887 1.001
## Santa Clara      733 15.63    46.9 0.288   707  733   758  2944 1.000
## Santa Cruz       680 19.87    34.2 0.374   647  681   712  2829 0.999
## Shasta           696 21.94    31.7 0.452   660  696   733  2352 1.000
## Sierra           709 40.46    17.5 0.806   642  709   778  2521 0.999
## Siskiyou         697 25.41    27.4 0.485   656  696   739  2747 1.001
## Solano           711 20.37    34.9 0.394   677  712   744  2669 0.999
## Sonoma           725 23.09    31.4 0.430   688  725   763  2888 1.000
## Stanislaus       661 19.17    34.5 0.365   629  661   692  2762 1.000
## Sutter           653 24.59    26.5 0.449   612  653   693  3000 1.000
## Tehama           659 27.80    23.7 0.536   614  659   705  2692 1.000
## Trinity          649 39.19    16.5 0.777   583  648   715  2544 1.000
## Tulare           583 21.03    27.7 0.439   546  583   616  2296 1.001
## Tuolumne         718 29.79    24.1 0.584   670  718   769  2601 1.000
## Ventura          709 17.65    40.2 0.349   681  709   738  2552 1.000
## Yolo             687 24.11    28.5 0.452   647  687   725  2845 1.000
## Yuba             626 27.95    22.4 0.547   582  626   673  2611 1.000
theta <- c(tapply(apipop$api00, apipop$cname, mean))  # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)

Binomial Unit-Level Model

A model with binomial likelihood can also be fit. We now model the target variable sch.wide, a binary variable indicating whether a school-wide growth target has been met. We use the same mean model structure as above for the linear model, but now using a logistic link function, \[ y_j \stackrel{\mathrm{iid}}{\sim} {\cal Be}(p_j)\,,\\ \mathrm{logit}(p_j) = \beta' x_j + v_{i[j]}\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \]

apisrs$target.met <- as.numeric(apisrs$sch.wide == "Yes")
sampler <- create_sampler(update(mod, target.met ~ .), family="binomial", data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
summary(sim)
## llh_ :
##       Mean   SD t-value   MCSE q0.05  q0.5 q0.95 n_eff R_hat
## llh_ -83.4 2.53     -33 0.0612 -87.8 -83.2 -79.5  1700     1
## 
## beta :
##                 Mean     SD t-value     MCSE   q0.05      q0.5    q0.95 n_eff R_hat
## (Intercept)  4.58751 1.5380   2.983 0.054687  2.1962  4.509295  7.25306   791  1.00
## ell         -0.03683 0.0150  -2.462 0.000425 -0.0619 -0.036567 -0.01303  1236  1.00
## meals        0.00131 0.0142   0.092 0.000419 -0.0214  0.000831  0.02568  1156  1.00
## stypeH      -2.83101 0.6288  -4.502 0.018573 -3.8795 -2.806771 -1.81862  1146  1.00
## stypeM      -1.67328 0.5564  -3.008 0.015826 -2.6033 -1.663576 -0.76778  1236  1.01
## hsg         -0.02634 0.0231  -1.143 0.000692 -0.0660 -0.026164  0.01103  1110  1.00
## col.grad    -0.03770 0.0272  -1.384 0.000923 -0.0820 -0.037437  0.00675   870  1.00
## grad.sch     0.01410 0.0318   0.444 0.001098 -0.0359  0.012812  0.06995   837  1.00
## 
## v_sigma :
##          Mean    SD t-value   MCSE  q0.05  q0.5 q0.95 n_eff R_hat
## v_sigma 0.367 0.283     1.3 0.0115 0.0292 0.308 0.905   601     1
## 
## v :
##                  Mean    SD  t-value    MCSE  q0.05      q0.5 q0.95 n_eff R_hat
## Alameda      -0.18521 0.427 -0.43412 0.01184 -1.004 -0.069558 0.319  1298 1.001
## Amador        0.00790 0.464  0.01704 0.00846 -0.699  0.000461 0.778  3000 0.999
## Butte        -0.00181 0.481 -0.00377 0.00878 -0.758 -0.001981 0.740  3000 1.000
## Calaveras     0.02455 0.465  0.05285 0.00881 -0.666  0.001470 0.777  2781 1.000
## Colusa       -0.01875 0.455 -0.04119 0.00852 -0.769 -0.001616 0.637  2856 1.000
## Contra Costa  0.01387 0.405  0.03430 0.00911 -0.660  0.000769 0.700  1973 1.000
## Del Norte     0.00899 0.469  0.01917 0.00856 -0.716 -0.000903 0.760  3000 1.001
## El Dorado    -0.00649 0.453 -0.01432 0.00876 -0.722  0.000190 0.727  2678 1.001
## Fresno       -0.03499 0.395 -0.08850 0.00803 -0.726 -0.008858 0.593  2422 1.001
## Glenn        -0.00568 0.464 -0.01224 0.00847 -0.775  0.003167 0.699  3000 1.000
## ... 47 elements suppressed ...

To predict the population fractions of schools that meet the growth target by county,

samplesums <- tapply(apisrs$target.met, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0  # substitute 0 for out-of-sample areas
res <- predict(sim, newdata=apipop, labels=names(N),
         fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
         show.progress=FALSE)
(summ <- summary(res))
##                  Mean     SD t-value     MCSE q0.05  q0.5 q0.95 n_eff R_hat
## Alameda         0.784 0.0614   12.78 0.001674 0.667 0.792 0.864  1343 1.000
## Amador          0.841 0.1192    7.06 0.002191 0.600 0.900 1.000  2957 0.999
## Butte           0.836 0.0763   10.96 0.001655 0.708 0.833 0.938  2124 1.000
## Calaveras       0.873 0.0937    9.33 0.001756 0.700 0.900 1.000  2844 0.999
## Colusa          0.640 0.1580    4.05 0.002989 0.333 0.667 0.889  2793 1.000
## Contra Costa    0.817 0.0583   14.02 0.001410 0.709 0.821 0.905  1710 1.000
## Del Norte       0.884 0.1061    8.33 0.001937 0.750 0.875 1.000  3000 1.001
## El Dorado       0.849 0.0749   11.34 0.001537 0.725 0.850 0.950  2375 1.001
## Fresno          0.782 0.0632   12.38 0.001517 0.667 0.790 0.871  1734 1.001
## Glenn           0.794 0.1382    5.75 0.002529 0.556 0.778 1.000  2985 1.000
## Humboldt        0.852 0.0759   11.23 0.001579 0.725 0.850 0.950  2312 1.000
## Imperial        0.696 0.1053    6.61 0.002187 0.525 0.700 0.850  2319 1.001
## Inyo            0.782 0.1517    5.15 0.002953 0.571 0.857 1.000  2637 1.000
## Kern            0.805 0.0522   15.42 0.001142 0.711 0.811 0.878  2088 1.000
## Kings           0.851 0.0789   10.79 0.001668 0.720 0.840 0.960  2240 1.002
## Lake            0.827 0.0943    8.77 0.001826 0.636 0.818 0.955  2666 1.000
## Lassen          0.834 0.1102    7.57 0.002079 0.636 0.818 1.000  2813 1.000
## Los Angeles     0.799 0.0429   18.64 0.001034 0.726 0.800 0.868  1719 1.000
## Madera          0.818 0.0757   10.81 0.001527 0.677 0.839 0.935  2458 1.000
## Marin           0.838 0.0704   11.91 0.001687 0.720 0.840 0.940  1742 1.000
## Mariposa        0.862 0.1472    5.86 0.002687 0.600 0.800 1.000  3000 1.000
## Mendocino       0.813 0.0907    8.97 0.001870 0.640 0.840 0.960  2352 1.000
## Merced          0.790 0.0824    9.58 0.001789 0.651 0.794 0.921  2125 1.000
## Modoc           0.832 0.1590    5.23 0.003150 0.600 0.800 1.000  2548 1.000
## Mono            0.750 0.2423    3.09 0.004426 0.333 0.667 1.000  2997 0.999
## Monterey        0.801 0.0683   11.72 0.001587 0.687 0.807 0.916  1855 1.001
## Napa            0.848 0.0769   11.03 0.001671 0.704 0.852 0.963  2118 1.000
## Nevada          0.871 0.0945    9.21 0.001891 0.714 0.857 1.000  2498 1.000
## Orange          0.772 0.0620   12.45 0.001386 0.667 0.773 0.871  2003 1.001
## Placer          0.815 0.0786   10.37 0.001851 0.682 0.818 0.924  1803 1.000
## Plumas          0.739 0.1542    4.79 0.003198 0.444 0.778 1.000  2325 1.000
## Riverside       0.826 0.0529   15.62 0.001262 0.726 0.835 0.898  1759 1.000
## Sacramento      0.846 0.0483   17.52 0.001210 0.767 0.847 0.924  1592 1.000
## San Benito      0.821 0.1256    6.53 0.002416 0.636 0.818 1.000  2704 1.000
## San Bernardino  0.861 0.0427   20.13 0.001025 0.790 0.862 0.931  1738 1.001
## San Diego       0.849 0.0432   19.66 0.000947 0.773 0.852 0.913  2081 1.000
## San Francisco   0.761 0.0740   10.29 0.001467 0.630 0.770 0.870  2540 0.999
## San Joaquin     0.829 0.0565   14.68 0.001318 0.733 0.831 0.911  1836 0.999
## San Luis Obispo 0.855 0.0713   11.98 0.001450 0.725 0.850 0.950  2419 1.000
## San Mateo       0.797 0.0789   10.10 0.002156 0.650 0.811 0.895  1339 1.000
## Santa Barbara   0.828 0.0651   12.72 0.001455 0.716 0.840 0.926  2003 1.001
## Santa Clara     0.783 0.0723   10.83 0.002060 0.642 0.796 0.875  1232 1.001
## Santa Cruz      0.786 0.0754   10.43 0.001524 0.653 0.796 0.898  2444 0.999
## Shasta          0.815 0.0783   10.40 0.001673 0.675 0.825 0.925  2192 1.000
## Sierra          0.729 0.2258    3.23 0.004242 0.333 0.667 1.000  2835 1.000
## Siskiyou        0.845 0.0972    8.69 0.001880 0.667 0.867 1.000  2672 1.000
## Solano          0.864 0.0637   13.56 0.001405 0.750 0.867 0.967  2056 1.002
## Sonoma          0.845 0.0642   13.15 0.001406 0.731 0.852 0.935  2087 1.000
## Stanislaus      0.812 0.0733   11.09 0.002040 0.681 0.824 0.912  1291 0.999
## Sutter          0.831 0.0874    9.51 0.001677 0.700 0.850 0.950  2717 1.000
## Tehama          0.842 0.0967    8.71 0.001891 0.647 0.824 1.000  2614 1.001
## Trinity         0.754 0.2020    3.73 0.003688 0.500 0.750 1.000  3000 1.001
## Tulare          0.831 0.0662   12.56 0.001518 0.718 0.836 0.927  1899 1.000
## Tuolumne        0.878 0.0953    9.21 0.001892 0.750 0.917 1.000  2539 1.000
## Ventura         0.839 0.0505   16.61 0.001161 0.752 0.845 0.913  1893 1.000
## Yolo            0.850 0.0733   11.61 0.001490 0.714 0.857 0.971  2417 1.000
## Yuba            0.795 0.1043    7.62 0.001950 0.632 0.789 0.947  2861 1.000
theta <- c(tapply(apipop$sch.wide == "Yes", apipop$cname, mean))  # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)

References

Battese, G. E., R. M. Harter, and W. A. Fuller. 1988. “An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data.” Journal of the American Statistical Association 83 (401): 28–36.
Rao, J. N. K., and I. Molina. 2015. Small Area Estimation. John Wiley & Sons.