The basic unit-level model (Battese, Harter, and Fuller 1988; Rao and Molina 2015) is given by \[ y_j = \beta' x_j + v_{i[j]} + \epsilon_j\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \qquad \epsilon_j \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma^2) \] where \(j\) runs from 1 to \(n\), the number of unit-level observations, \(\beta\) is a vector of regression coefficients for given covariates \(x_j\), and \(v_i\) are random area intercepts.
We use the api
dataset included in packages survey.
library(survey)
data(api)
apipop$cname <- as.factor(apipop$cname)
apisrs$cname <- factor(apisrs$cname, levels=levels(apipop$cname))
The apipop
data frame contains the complete population
whereas apisrs
is a simple random sample from it. The
variable cname
is the county name, and we will be
interested in estimation at the county level. Not all counties in the
population are sampled. In order to be able to make predictions for
out-of-sample areas we make sure that the levels of the sample’s
cname
variable match those of its population
counterpart.
The basic unit-level model with county random area effects is fit as follows
library(mcmcsae)
mod <- api00 ~
reg(~ ell + meals + stype + hsg + col.grad + grad.sch, name="beta") +
gen(factor = ~ iid(cname), name="v")
sampler <- create_sampler(mod, data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
(summary(sim))
## llh_ :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## llh_ -1090 4.43 -246 0.129 -1097 -1089 -1083 1181 1
##
## sigma_ :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## sigma_ 56.3 3.01 18.7 0.0681 51.5 56.2 61.4 1949 1
##
## beta :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## (Intercept) 801.389 25.166 31.84 0.56242 760.223 801.031 843.5891 2002 0.999
## ell -1.968 0.298 -6.60 0.00684 -2.439 -1.970 -1.4612 1902 1.000
## meals -1.955 0.278 -7.03 0.00585 -2.416 -1.962 -1.4926 2266 1.000
## stypeH -106.631 13.464 -7.92 0.26382 -129.126 -106.548 -84.8618 2604 0.999
## stypeM -60.285 11.553 -5.22 0.22015 -78.696 -60.479 -40.9994 2754 1.000
## hsg -0.603 0.418 -1.44 0.00861 -1.285 -0.610 0.0789 2352 1.000
## col.grad 0.589 0.501 1.18 0.01199 -0.233 0.591 1.4229 1744 1.001
## grad.sch 2.134 0.468 4.56 0.00912 1.361 2.139 2.8832 2636 1.000
##
## v_sigma :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## v_sigma 22.8 6.53 3.49 0.222 13.1 22.2 34.1 866 1
##
## v :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda -24.9218 15.5 -1.60881 0.396 -51.5 -24.2639 -0.963 1533 1.005
## Amador 0.1177 23.2 0.00508 0.423 -38.9 0.2556 36.789 3000 1.000
## Butte 0.0418 23.3 0.00179 0.438 -37.8 0.1290 38.149 2833 0.999
## Calaveras 8.0988 22.0 0.36763 0.418 -26.4 7.2349 45.825 2774 1.000
## Colusa -0.5346 23.4 -0.02286 0.427 -39.6 0.0135 36.942 3000 1.000
## Contra Costa -9.1719 15.1 -0.60699 0.341 -33.5 -8.7300 14.944 1968 1.002
## Del Norte -0.1754 24.2 -0.00724 0.464 -40.5 -0.3683 39.124 2725 1.001
## El Dorado 0.6086 23.6 0.02574 0.432 -37.6 0.7744 39.464 3000 0.999
## Fresno 0.4942 15.1 0.03283 0.314 -24.2 1.0186 24.837 2298 1.001
## Glenn -0.1746 23.7 -0.00738 0.448 -37.9 -0.0410 38.479 2790 1.000
## ... 47 elements suppressed ...
We want to estimate the area population means \[
\theta_i = \frac{1}{N_i}\sum_{j \in U_i} y_j\,,
\] where \(U_i\) is the set of
units in area \(i\) of size \(N_i\). The MCMC output in variable
sim
can be used to obtain draws from the posterior
distribution for \(\theta_i\). The
\(r\)th draw can be expressed as \[
\theta_{i;r} = \frac{1}{N_i} \left(n_i \bar{y}_i + \beta_r'(t_{x;i}
- n_i \bar{x}_i) + (N_i - n_i)v_{i;r} + \sum_{j \in U_i\setminus s_i}
\epsilon_{j;r} \right)\,,
\] where \(\bar{y}_i\) is the
sample mean of \(y\) in area \(i\) and \(t_{x;i}\) is a vector of population totals
for area \(i\).
N <- table(apipop$cname)
samplesums <- tapply(apisrs$api00, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0 # substitute 0 for out-of-sample areas
m <- match(apisrs$cds, apipop$cds) # population units in the sample
res <- predict(sim, newdata=apipop, labels=names(N),
fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
show.progress=FALSE)
(summ <- summary(res))
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda 679 14.72 46.1 0.317 654 679 703 2154 1.002
## Amador 738 29.84 24.7 0.565 690 739 788 2790 0.999
## Butte 679 25.54 26.6 0.509 639 679 721 2516 1.000
## Calaveras 743 28.25 26.3 0.610 698 743 791 2146 1.000
## Colusa 551 31.25 17.6 0.599 500 550 601 2725 1.001
## Contra Costa 727 14.62 49.7 0.270 703 727 751 2944 1.002
## Del Norte 678 32.90 20.6 0.674 626 679 732 2386 1.000
## El Dorado 754 26.65 28.3 0.506 710 754 798 2774 0.999
## Fresno 595 14.82 40.1 0.271 570 595 619 3000 1.000
## Glenn 625 30.98 20.2 0.596 574 625 676 2706 1.000
## Humboldt 701 25.99 27.0 0.542 660 701 744 2296 1.003
## Imperial 557 24.67 22.6 0.459 517 557 597 2886 1.000
## Inyo 707 32.62 21.7 0.621 651 708 759 2755 0.999
## Kern 596 15.04 39.6 0.312 571 596 621 2321 1.000
## Kings 595 22.74 26.2 0.450 555 596 632 2554 1.001
## Lake 662 24.80 26.7 0.558 621 662 704 1976 1.001
## Lassen 706 25.01 28.2 0.457 666 706 746 3000 1.000
## Los Angeles 636 8.15 78.0 0.149 622 636 650 3000 1.000
## Madera 615 19.75 31.1 0.364 582 615 647 2946 1.000
## Marin 816 20.52 39.8 0.400 783 816 850 2630 1.001
## Mariposa 723 34.80 20.8 0.727 668 722 781 2289 1.000
## Mendocino 664 27.34 24.3 0.543 620 664 708 2538 1.001
## Merced 578 23.02 25.1 0.454 541 577 616 2576 1.000
## Modoc 671 30.45 22.0 0.657 624 670 722 2148 1.000
## Mono 704 41.38 17.0 0.772 634 703 773 2873 1.000
## Monterey 628 17.76 35.4 0.326 598 629 656 2959 0.999
## Napa 702 21.35 32.9 0.390 667 702 737 3000 1.000
## Nevada 799 28.83 27.7 0.562 751 799 845 2634 1.000
## Orange 692 15.56 44.5 0.336 667 692 718 2141 1.001
## Placer 772 23.85 32.4 0.458 736 771 812 2712 1.000
## Plumas 691 30.97 22.3 0.579 641 691 743 2857 1.000
## Riverside 635 14.55 43.7 0.266 611 636 659 3000 1.001
## Sacramento 668 14.98 44.6 0.274 643 668 692 3000 0.999
## San Benito 708 30.03 23.6 0.577 660 708 757 2708 0.999
## San Bernardino 647 12.83 50.4 0.242 625 647 668 2803 0.999
## San Diego 704 14.84 47.4 0.356 681 704 730 1737 1.002
## San Francisco 638 19.71 32.4 0.360 607 637 671 2996 1.000
## San Joaquin 630 17.78 35.4 0.426 599 631 658 1744 1.001
## San Luis Obispo 752 23.59 31.9 0.444 714 752 791 2823 1.001
## San Mateo 734 21.55 34.1 0.404 699 733 770 2847 1.001
## Santa Barbara 678 19.51 34.8 0.363 647 678 711 2887 1.001
## Santa Clara 733 15.63 46.9 0.288 707 733 758 2944 1.000
## Santa Cruz 680 19.87 34.2 0.374 647 681 712 2829 0.999
## Shasta 696 21.94 31.7 0.452 660 696 733 2352 1.000
## Sierra 709 40.46 17.5 0.806 642 709 778 2521 0.999
## Siskiyou 697 25.41 27.4 0.485 656 696 739 2747 1.001
## Solano 711 20.37 34.9 0.394 677 712 744 2669 0.999
## Sonoma 725 23.09 31.4 0.430 688 725 763 2888 1.000
## Stanislaus 661 19.17 34.5 0.365 629 661 692 2762 1.000
## Sutter 653 24.59 26.5 0.449 612 653 693 3000 1.000
## Tehama 659 27.80 23.7 0.536 614 659 705 2692 1.000
## Trinity 649 39.19 16.5 0.777 583 648 715 2544 1.000
## Tulare 583 21.03 27.7 0.439 546 583 616 2296 1.001
## Tuolumne 718 29.79 24.1 0.584 670 718 769 2601 1.000
## Ventura 709 17.65 40.2 0.349 681 709 738 2552 1.000
## Yolo 687 24.11 28.5 0.452 647 687 725 2845 1.000
## Yuba 626 27.95 22.4 0.547 582 626 673 2611 1.000
theta <- c(tapply(apipop$api00, apipop$cname, mean)) # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)
A model with binomial likelihood can also be fit. We now model the
target variable sch.wide
, a binary variable indicating
whether a school-wide growth target has been met. We use the same mean
model structure as above for the linear model, but now using a logistic
link function, \[
y_j \stackrel{\mathrm{iid}}{\sim} {\cal Be}(p_j)\,,\\
\mathrm{logit}(p_j) = \beta' x_j + v_{i[j]}\,,\\
v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2)
\]
apisrs$target.met <- as.numeric(apisrs$sch.wide == "Yes")
sampler <- create_sampler(update(mod, target.met ~ .), family="binomial", data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
summary(sim)
## llh_ :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## llh_ -83.4 2.53 -33 0.0612 -87.8 -83.2 -79.5 1700 1
##
## beta :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## (Intercept) 4.58751 1.5380 2.983 0.054687 2.1962 4.509295 7.25306 791 1.00
## ell -0.03683 0.0150 -2.462 0.000425 -0.0619 -0.036567 -0.01303 1236 1.00
## meals 0.00131 0.0142 0.092 0.000419 -0.0214 0.000831 0.02568 1156 1.00
## stypeH -2.83101 0.6288 -4.502 0.018573 -3.8795 -2.806771 -1.81862 1146 1.00
## stypeM -1.67328 0.5564 -3.008 0.015826 -2.6033 -1.663576 -0.76778 1236 1.01
## hsg -0.02634 0.0231 -1.143 0.000692 -0.0660 -0.026164 0.01103 1110 1.00
## col.grad -0.03770 0.0272 -1.384 0.000923 -0.0820 -0.037437 0.00675 870 1.00
## grad.sch 0.01410 0.0318 0.444 0.001098 -0.0359 0.012812 0.06995 837 1.00
##
## v_sigma :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## v_sigma 0.367 0.283 1.3 0.0115 0.0292 0.308 0.905 601 1
##
## v :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda -0.18521 0.427 -0.43412 0.01184 -1.004 -0.069558 0.319 1298 1.001
## Amador 0.00790 0.464 0.01704 0.00846 -0.699 0.000461 0.778 3000 0.999
## Butte -0.00181 0.481 -0.00377 0.00878 -0.758 -0.001981 0.740 3000 1.000
## Calaveras 0.02455 0.465 0.05285 0.00881 -0.666 0.001470 0.777 2781 1.000
## Colusa -0.01875 0.455 -0.04119 0.00852 -0.769 -0.001616 0.637 2856 1.000
## Contra Costa 0.01387 0.405 0.03430 0.00911 -0.660 0.000769 0.700 1973 1.000
## Del Norte 0.00899 0.469 0.01917 0.00856 -0.716 -0.000903 0.760 3000 1.001
## El Dorado -0.00649 0.453 -0.01432 0.00876 -0.722 0.000190 0.727 2678 1.001
## Fresno -0.03499 0.395 -0.08850 0.00803 -0.726 -0.008858 0.593 2422 1.001
## Glenn -0.00568 0.464 -0.01224 0.00847 -0.775 0.003167 0.699 3000 1.000
## ... 47 elements suppressed ...
To predict the population fractions of schools that meet the growth target by county,
samplesums <- tapply(apisrs$target.met, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0 # substitute 0 for out-of-sample areas
res <- predict(sim, newdata=apipop, labels=names(N),
fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
show.progress=FALSE)
(summ <- summary(res))
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda 0.784 0.0614 12.78 0.001674 0.667 0.792 0.864 1343 1.000
## Amador 0.841 0.1192 7.06 0.002191 0.600 0.900 1.000 2957 0.999
## Butte 0.836 0.0763 10.96 0.001655 0.708 0.833 0.938 2124 1.000
## Calaveras 0.873 0.0937 9.33 0.001756 0.700 0.900 1.000 2844 0.999
## Colusa 0.640 0.1580 4.05 0.002989 0.333 0.667 0.889 2793 1.000
## Contra Costa 0.817 0.0583 14.02 0.001410 0.709 0.821 0.905 1710 1.000
## Del Norte 0.884 0.1061 8.33 0.001937 0.750 0.875 1.000 3000 1.001
## El Dorado 0.849 0.0749 11.34 0.001537 0.725 0.850 0.950 2375 1.001
## Fresno 0.782 0.0632 12.38 0.001517 0.667 0.790 0.871 1734 1.001
## Glenn 0.794 0.1382 5.75 0.002529 0.556 0.778 1.000 2985 1.000
## Humboldt 0.852 0.0759 11.23 0.001579 0.725 0.850 0.950 2312 1.000
## Imperial 0.696 0.1053 6.61 0.002187 0.525 0.700 0.850 2319 1.001
## Inyo 0.782 0.1517 5.15 0.002953 0.571 0.857 1.000 2637 1.000
## Kern 0.805 0.0522 15.42 0.001142 0.711 0.811 0.878 2088 1.000
## Kings 0.851 0.0789 10.79 0.001668 0.720 0.840 0.960 2240 1.002
## Lake 0.827 0.0943 8.77 0.001826 0.636 0.818 0.955 2666 1.000
## Lassen 0.834 0.1102 7.57 0.002079 0.636 0.818 1.000 2813 1.000
## Los Angeles 0.799 0.0429 18.64 0.001034 0.726 0.800 0.868 1719 1.000
## Madera 0.818 0.0757 10.81 0.001527 0.677 0.839 0.935 2458 1.000
## Marin 0.838 0.0704 11.91 0.001687 0.720 0.840 0.940 1742 1.000
## Mariposa 0.862 0.1472 5.86 0.002687 0.600 0.800 1.000 3000 1.000
## Mendocino 0.813 0.0907 8.97 0.001870 0.640 0.840 0.960 2352 1.000
## Merced 0.790 0.0824 9.58 0.001789 0.651 0.794 0.921 2125 1.000
## Modoc 0.832 0.1590 5.23 0.003150 0.600 0.800 1.000 2548 1.000
## Mono 0.750 0.2423 3.09 0.004426 0.333 0.667 1.000 2997 0.999
## Monterey 0.801 0.0683 11.72 0.001587 0.687 0.807 0.916 1855 1.001
## Napa 0.848 0.0769 11.03 0.001671 0.704 0.852 0.963 2118 1.000
## Nevada 0.871 0.0945 9.21 0.001891 0.714 0.857 1.000 2498 1.000
## Orange 0.772 0.0620 12.45 0.001386 0.667 0.773 0.871 2003 1.001
## Placer 0.815 0.0786 10.37 0.001851 0.682 0.818 0.924 1803 1.000
## Plumas 0.739 0.1542 4.79 0.003198 0.444 0.778 1.000 2325 1.000
## Riverside 0.826 0.0529 15.62 0.001262 0.726 0.835 0.898 1759 1.000
## Sacramento 0.846 0.0483 17.52 0.001210 0.767 0.847 0.924 1592 1.000
## San Benito 0.821 0.1256 6.53 0.002416 0.636 0.818 1.000 2704 1.000
## San Bernardino 0.861 0.0427 20.13 0.001025 0.790 0.862 0.931 1738 1.001
## San Diego 0.849 0.0432 19.66 0.000947 0.773 0.852 0.913 2081 1.000
## San Francisco 0.761 0.0740 10.29 0.001467 0.630 0.770 0.870 2540 0.999
## San Joaquin 0.829 0.0565 14.68 0.001318 0.733 0.831 0.911 1836 0.999
## San Luis Obispo 0.855 0.0713 11.98 0.001450 0.725 0.850 0.950 2419 1.000
## San Mateo 0.797 0.0789 10.10 0.002156 0.650 0.811 0.895 1339 1.000
## Santa Barbara 0.828 0.0651 12.72 0.001455 0.716 0.840 0.926 2003 1.001
## Santa Clara 0.783 0.0723 10.83 0.002060 0.642 0.796 0.875 1232 1.001
## Santa Cruz 0.786 0.0754 10.43 0.001524 0.653 0.796 0.898 2444 0.999
## Shasta 0.815 0.0783 10.40 0.001673 0.675 0.825 0.925 2192 1.000
## Sierra 0.729 0.2258 3.23 0.004242 0.333 0.667 1.000 2835 1.000
## Siskiyou 0.845 0.0972 8.69 0.001880 0.667 0.867 1.000 2672 1.000
## Solano 0.864 0.0637 13.56 0.001405 0.750 0.867 0.967 2056 1.002
## Sonoma 0.845 0.0642 13.15 0.001406 0.731 0.852 0.935 2087 1.000
## Stanislaus 0.812 0.0733 11.09 0.002040 0.681 0.824 0.912 1291 0.999
## Sutter 0.831 0.0874 9.51 0.001677 0.700 0.850 0.950 2717 1.000
## Tehama 0.842 0.0967 8.71 0.001891 0.647 0.824 1.000 2614 1.001
## Trinity 0.754 0.2020 3.73 0.003688 0.500 0.750 1.000 3000 1.001
## Tulare 0.831 0.0662 12.56 0.001518 0.718 0.836 0.927 1899 1.000
## Tuolumne 0.878 0.0953 9.21 0.001892 0.750 0.917 1.000 2539 1.000
## Ventura 0.839 0.0505 16.61 0.001161 0.752 0.845 0.913 1893 1.000
## Yolo 0.850 0.0733 11.61 0.001490 0.714 0.857 0.971 2417 1.000
## Yuba 0.795 0.1043 7.62 0.001950 0.632 0.789 0.947 2861 1.000
theta <- c(tapply(apipop$sch.wide == "Yes", apipop$cname, mean)) # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)