Bias reduction in generalized linear models

Ioannis Kosmidis

21 March 2017

The brglm2 package

brglm2 provides tools for estimation and inference in generalized linear models using various methods for bias reduction (Kosmidis 2014). Reduction of estimation bias is achieved either through the adjusted score equations approach in Firth (1993) and Kosmidis and Firth (2009), or through the direct subtraction of an estimate of the bias of the maximum likelihood estimator from the maximum likelihood estimates, as prescribed in Cordeiro and McCullagh (1991).

In the special case of generalized linear models for binomial and multinomial responses, the adjusted score equations approach returns estimates with improved frequentist properties, which are also always finite, even when the maximum likelihood estimates are infinite, as happens under complete and quasi-complete separation as defined in Albert and Anderson (1984).

The workhorse function is brglmFit, which can be passed directly to the method argument of the glm function. brglmFit implements a quasi Fisher scoring procedure, whose special cases result in various explicit and implicit bias reduction methods for generalized linear models (the classification of bias reduction methods into explicit and implicit is given in Kosmidis 2014).
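As a minimal sketch of the usage described above, the following fits a logistic regression to a small, made-up data set with complete separation, first by maximum likelihood and then with brglmFit passed to the method argument of glm. The data and object names are hypothetical, the bias-reduced fit relies on whatever defaults brglmFit uses, and the second fit only runs if brglm2 is installed.

```r
# Hypothetical data with complete separation: y = 0 whenever x < 0, y = 1 whenever x > 0
dat <- data.frame(x = c(-2, -1, -0.5, 0.5, 1, 2),
                  y = c(0, 0, 0, 1, 1, 1))

# Maximum likelihood: the slope estimate diverges, so glm() stops at a
# large finite value and issues warnings (suppressed here)
fit_ml <- suppressWarnings(glm(y ~ x, family = binomial("logit"), data = dat))
coef(fit_ml)

# Bias reduction: pass brglmFit to the method argument of glm()
if (requireNamespace("brglm2", quietly = TRUE)) {
  library(brglm2)
  fit_br <- glm(y ~ x, family = binomial("logit"), data = dat,
                method = "brglmFit")
  print(coef(fit_br))
}
```

The maximum likelihood slope should be very large in absolute value, whereas the bias-reduced slope is expected to be finite and of moderate size.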

This vignette

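This vignette presents the bias-reducing adjustments to the score functions for generalized linear models and describes the fitting algorithm implemented in brglmFit.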

Other resources

The bias-reducing quasi Fisher scoring iteration is also described in detail in the bias vignette of the enrichwith R package. Kosmidis and Firth (2010) describe a parallel quasi Newton-Raphson procedure.

Most of the material in this vignette comes from a presentation by Ioannis Kosmidis at the useR! 2016 international R User conference at Stanford University on 16 June 2016. The presentation was titled “Reduced-bias inference in generalized linear models” and can be watched online at this link.

Generalized linear models

Model

Suppose that \(y_1, \ldots, y_n\) are observations on independent random variables \(Y_1, \ldots, Y_n\), each with probability density/mass function of the form \[ f_{Y_i}(y) = \exp\left\{\frac{y \theta_i - b(\theta_i) - c_1(y)}{\phi/m_i} - \frac{1}{2}a\left(-\frac{m_i}{\phi}\right) + c_2(y) \right\} \] for some sufficiently smooth functions \(b(.)\), \(c_1(.)\), \(a(.)\) and \(c_2(.)\), and fixed observation weights \(m_1, \ldots, m_n\). The expected value and the variance of \(Y_i\) are then \begin{align*} E(Y_i) & = \mu_i = b'(\theta_i) \\ Var(Y_i) & = \frac{\phi}{m_i}b''(\theta_i) = \frac{\phi}{m_i}V(\mu_i) \end{align*}

Hence, in this parameterization, \(\phi\) is a dispersion parameter.
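For illustration, consider the normal distribution with mean \(\mu_i\) and variance \(\phi/m_i\). Matching its density to the form above gives \[ \theta_i = \mu_i\,, \quad b(\theta) = \frac{\theta^2}{2}\,, \quad c_1(y) = \frac{y^2}{2}\,, \quad a(\zeta) = -\log(-\zeta)\,, \quad c_2(y) = -\frac{1}{2}\log(2\pi)\,, \] because then \[ f_{Y_i}(y) = \exp\left\{-\frac{m_i(y - \mu_i)^2}{2\phi} - \frac{1}{2}\log\frac{\phi}{m_i} - \frac{1}{2}\log(2\pi)\right\}\,. \] In this case \(b'(\theta_i) = \mu_i\) and \(V(\mu_i) = b''(\theta_i) = 1\), so the general expressions return \(E(Y_i) = \mu_i\) and \(Var(Y_i) = \phi/m_i\), as expected.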

A generalized linear model links the mean \(\mu_i\) to a linear predictor \(\eta_i\) as \[ g(\mu_i) = \eta_i = \sum_{t=1}^p \beta_t x_{it} \] where \(g(.)\) is a monotone, sufficiently smooth link function, taking values on \(\Re\), \(x_{it}\) is the \((i,t)\)th component of a model matrix \(X\), and \(\beta = (\beta_1, \ldots, \beta_p)^\top\).
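The ingredients of this definition are available in base R through family objects. The sketch below uses the binomial family with logit link to evaluate \(g(.)\), its inverse, the variance function \(V(.)\) and the derivative \(d\mu/d\eta\). The model matrix and coefficient values are made up for illustration.

```r
fam <- binomial(link = "logit")

mu <- c(0.1, 0.5, 0.9)
eta <- fam$linkfun(mu)           # g(mu) = log(mu / (1 - mu))
mu_back <- fam$linkinv(eta)      # inverse link, recovers mu
v <- fam$variance(mu)            # V(mu) = mu * (1 - mu)
d <- fam$mu.eta(eta)             # d mu / d eta

# Means implied by a hypothetical model matrix X and coefficient vector beta
X <- cbind(1, c(-1, 0, 2))
beta <- c(0.5, -1)
mu_model <- fam$linkinv(drop(X %*% beta))
mu_model
```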

Score functions and information matrix

The derivatives of the log-likelihood with respect to \(\beta\) and \(\phi\) (score functions) are \begin{align*} s_\beta(\beta, \phi) & = \frac{1}{\phi}X^\top WD^{-1}(y - \mu) \\ s_\phi(\beta, \phi) & = \frac{1}{2\phi^2}\sum_{i = 1}^n (q_i - \rho_i) \end{align*}

with \(y = (y_1, \ldots, y_n)^\top\), \(\mu = (\mu_1, \ldots, \mu_n)^\top\), \(W = {\rm diag}\left\{w_1, \ldots, w_n\right\}\) and \(D = {\rm diag}\left\{d_1, \ldots, d_n\right\}\), where \(w_i = m_i d_i^2/V(\mu_i)\) is the \(i\)th working weight, and \(d_i = d\mu_i/d\eta_i\). Furthermore, \(q_i = -2 m_i \{y_i\theta_i - b(\theta_i) - c_1(y_i)\}\) and \(\rho_i = m_i a'(-m_i/\phi)\) are the \(i\)th deviance residual (e.g. as implemented in the dev.resids component of a family object) and its expectation, respectively.
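The expression for \(s_\beta\) can be checked numerically. The sketch below codes it with the components of a family object and compares it with a central-difference gradient of a Poisson log-likelihood. The function score_beta and the simulated data are hypothetical and only serve the illustration.

```r
# s_beta = (1 / phi) * X' W D^{-1} (y - mu), built from a family object
score_beta <- function(beta, X, y, family, m = rep(1, nrow(X)), phi = 1) {
  eta <- drop(X %*% beta)
  mu <- family$linkinv(eta)
  d <- family$mu.eta(eta)                # d_i = d mu_i / d eta_i
  w <- m * d^2 / family$variance(mu)     # working weights w_i
  drop(crossprod(X, w * (y - mu) / d)) / phi
}

set.seed(123)
n <- 30
X <- cbind(1, runif(n))
y <- rpois(n, exp(drop(X %*% c(0.2, 0.8))))
beta <- c(0.1, 0.5)

# Poisson log-likelihood with log link, and its numerical gradient
loglik <- function(b) sum(dpois(y, exp(drop(X %*% b)), log = TRUE))
eps <- 1e-6
numerical <- sapply(seq_along(beta), function(j) {
  e <- replace(numeric(length(beta)), j, eps)
  (loglik(beta + e) - loglik(beta - e)) / (2 * eps)
})

analytic <- score_beta(beta, X, y, poisson("log"))
rbind(analytic, numerical)
```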

The expected information matrix about \(\beta\) and \(\phi\) is \[ i(\beta, \phi) = \left[ \begin{array}{cc} i_{\beta\beta}(\beta, \phi) & 0_p \\ 0_p^\top & i_{\phi\phi}(\beta, \phi) \end{array} \right] = \left[ \begin{array}{cc} \frac{1}{\phi} X^\top W X & 0_p \\ 0_p^\top & \frac{1}{2\phi^4}\sum_{i = 1}^n m_i^2 a''(-m_i/\phi) \end{array} \right]\,, \] where \(0_p\) is a \(p\)-vector of zeros.
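As a rough check of the \(\beta\) block, the sketch below computes \(X^\top W X/\phi\) for a Poisson log-linear model (where \(\phi = 1\) and \(m_i = 1\)) and compares its inverse with the variance-covariance matrix reported by vcov for a glm fit. The simulated data are hypothetical, and the convergence tolerance of glm is tightened so that the two agree closely.

```r
set.seed(42)
n <- 50
x <- rnorm(n)
y <- rpois(n, exp(0.3 + 0.5 * x))
fit <- glm(y ~ x, family = poisson("log"),
           control = glm.control(epsilon = 1e-12))

X <- model.matrix(fit)
fam <- family(fit)
eta <- drop(X %*% coef(fit))
mu <- fam$linkinv(eta)
d <- fam$mu.eta(eta)
w <- d^2 / fam$variance(mu)            # working weights with m_i = 1
phi <- 1                               # known dispersion for the Poisson

info_bb <- crossprod(X, w * X) / phi   # (1 / phi) X' W X
agreement <- all.equal(solve(info_bb), vcov(fit),
                       check.attributes = FALSE, tolerance = 1e-4)
agreement
```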

Maximum likelihood estimation

The maximum likelihood estimator of \(\beta\) and \(\phi\) satisfies \(s_\beta(\hat\beta,\hat\phi) = 0_p\) and \(s_\phi(\hat\beta, \hat\phi) = 0\).
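Since \(\phi\) only scales \(s_\beta\), the first condition can be checked without knowing \(\hat\phi\). The sketch below evaluates \(\phi s_\beta\) at the coefficients returned by glm for a Gamma model with log link, using simulated, hypothetical data. The result should be zero up to numerical error.

```r
set.seed(7)
n <- 40
x <- runif(n)
y <- rgamma(n, shape = 3, rate = 3 / exp(1 + x))
fit <- glm(y ~ x, family = Gamma("log"),
           control = glm.control(epsilon = 1e-12))

X <- model.matrix(fit)
fam <- family(fit)
eta <- drop(X %*% coef(fit))
mu <- fam$linkinv(eta)
d <- fam$mu.eta(eta)
w <- d^2 / fam$variance(mu)            # working weights with m_i = 1

# phi * s_beta at the ML estimate of beta; its root does not depend on phi
scaled_score <- drop(crossprod(X, w * (y - mu) / d))
max(abs(scaled_score))
```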

Bias-reducing adjusted score functions

Let \(A_\beta(\beta, \phi) = -i_{\beta\beta}(\beta, \phi) b_\beta(\beta, \phi)\) and \(A_\phi(\beta, \phi) = -i_{\phi\phi}(\beta, \phi) b_\phi(\beta, \phi)\), where \(b_\beta(\beta, \phi)\) and \(b_\phi(\beta, \phi)\) are the first terms in the expansion of the bias of the maximum likelihood estimator of the regression parameters \(\beta\) and dispersion \(\phi\), respectively. The results in Firth (1993) can be used to show that the solution of the adjusted score equations \begin{align*} s_\beta(\beta,\phi) + A_\beta(\beta, \phi) & = 0_p \\ s_\phi(\beta, \phi) + A_\phi(\beta, \phi) & = 0 \end{align*}

results in estimators \(\tilde\beta\) and \(\tilde\phi\) with bias of smaller asymptotic order than the maximum likelihood estimator.

The results in either Kosmidis and Firth (2009) or Cordeiro and McCullagh (1991) can then be used to re-express the adjustments in forms that are convenient for implementation. In particular, after some algebra, the bias-reducing adjustments for generalized linear models are \begin{align*} A_\beta(\beta, \phi) & = X^\top W \xi \,, \\ A_\phi(\beta, \phi) & = \frac{(p - 2)\phi\sum_{i = 1}^n m_i^2 a''(-m_i/\phi) + \sum_{i = 1}^n m_i^3 a'''(-m_i/\phi)}{2\phi^2\sum_{i = 1}^n m_i^2 a''(-m_i/\phi)} \end{align*}

where \(\xi = (\xi_1, \ldots, \xi_n)^\top\) with \(\xi_i = h_id_i'/(2d_iw_i)\), \(d_i' = d^2\mu_i/d\eta_i^2\) and \(h_i\) is the “hat” value for the \(i\)th observation (see, e.g. ?hatvalues).
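Two special cases help to see what these adjustments do, and both can be checked numerically. For logistic regression with \(m_i = 1\), \(d_i = w_i = \mu_i(1 - \mu_i)\) and \(d_i' = \mu_i(1 - \mu_i)(1 - 2\mu_i)\), so \(A_\beta\) should reduce to \(X^\top\{h_i(1/2 - \mu_i)\}\). For the normal linear model, assuming \(a(\zeta) = -\log(-\zeta)\) (which is what matching the normal density to the general form gives), the adjusted score for \(\phi\) should vanish at the residual sum of squares divided by \(n - p\). The sketch below uses simulated, hypothetical data, and all object names are illustrative.

```r
set.seed(1)
n <- 20
X <- cbind(1, rnorm(n))
p <- ncol(X)
m <- rep(1, n)                        # observation weights

## Part 1: A_beta for logistic regression at an arbitrary beta
mu <- plogis(drop(X %*% c(-0.5, 1)))
d1 <- mu * (1 - mu)                   # d_i = d mu_i / d eta_i
d2 <- d1 * (1 - 2 * mu)               # d_i' = d^2 mu_i / d eta_i^2
w <- m * d1^2 / (mu * (1 - mu))       # w_i = m_i d_i^2 / V(mu_i)
XW <- sqrt(w) * X
h <- rowSums((XW %*% solve(crossprod(XW))) * XW)   # hat values
xi <- h * d2 / (2 * d1 * w)
A_beta <- drop(crossprod(X, w * xi))               # X' W xi
A_logit <- drop(crossprod(X, h * (0.5 - mu)))      # simplified form

## Part 2: adjusted score for phi in a normal linear model
y <- drop(X %*% c(1, 2)) + rnorm(n)
res <- residuals(lm(y ~ X - 1))
a2 <- function(z) 1 / z^2             # a''(zeta) when a(zeta) = -log(-zeta)
a3 <- function(z) -2 / z^3            # a'''(zeta)
adjusted_score_phi <- function(phi) {
  q <- m * res^2                      # deviance residuals
  rho <- rep(phi, n)                  # m_i a'(-m_i / phi) = phi
  s_phi <- sum(q - rho) / (2 * phi^2)
  S2 <- sum(m^2 * a2(-m / phi))
  S3 <- sum(m^3 * a3(-m / phi))
  A_phi <- ((p - 2) * phi * S2 + S3) / (2 * phi^2 * S2)
  s_phi + A_phi
}
phi_tilde <- sum(res^2) / (n - p)

all.equal(A_beta, A_logit)
adjusted_score_phi(phi_tilde)
```

In the normal case the adjustment simplifies to \(A_\phi = p/(2\phi)\), which is why the solution is the familiar unbiased variance estimator rather than the maximum likelihood estimator with divisor \(n\).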

Fitting algorithm in brglmFit

brglmFit implements a quasi Fisher scoring procedure for solving the adjusted score equations \(s_\beta(\beta,\phi) + A_\beta(\beta, \phi) = 0_p\) and \(s_\phi(\beta, \phi) + A_\phi(\beta, \phi) = 0\). The iteration consists of an outer loop and an inner loop that implements step-halving. The algorithm is as follows:

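Input

Starting values \(\beta^{(0)}\) and \(\phi^{(0)}\), a tolerance \(\epsilon > 0\) for the \(L_1\) norm of the direction, the maximum number \(M\) of halving steps in the inner loop, and the maximum number \(K\) of iterations of the outer loop.

Output

The estimates \(\tilde\beta\) and \(\tilde\phi\).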

Iteration

Initialize outer loop

  1. \(k \leftarrow 0\)

  2. \(\upsilon_\beta^{(0)} \leftarrow \left\{i_{\beta\beta}\left(\beta^{(0)}, \phi^{(0)}\right)\right\}^{-1} \left\{s_\beta\left(\beta^{(0)}, \phi^{(0)}\right) + A_\beta\left(\beta^{(0)}, \phi^{(0)}\right)\right\}\)

  3. \(\upsilon_\phi^{(0)} \leftarrow \left\{i_{\phi\phi}\left(\beta^{(0)}, \phi^{(0)}\right)\right\}^{-1} \left\{s_\phi\left(\beta^{(0)}, \phi^{(0)}\right) + A_\phi\left(\beta^{(0)}, \phi^{(0)}\right)\right\}\)

Initialize inner loop

  4. \(m \leftarrow 0\)

  5. \(b^{(m)} \leftarrow \beta^{(k)}\)

  6. \(f^{(m)} \leftarrow \phi^{(k)}\)

  7. \(v_\beta^{(m)} \leftarrow \upsilon_\beta^{(k)}\)

  8. \(v_\phi^{(m)} \leftarrow \upsilon_\phi^{(k)}\)

  9. \(d \leftarrow \left|v_\beta^{(m)}\right|_1 + \left|v_\phi^{(m)}\right|\)

Update parameters

  10. \(b^{(m + 1)} \leftarrow b^{(m)} + 2^{-m} v_\beta^{(m)}\)

  11. \(f^{(m + 1)} \leftarrow f^{(m)} + 2^{-m} v_\phi^{(m)}\)

Update direction

  12. \(v_\beta^{(m + 1)} \leftarrow \left\{i_{\beta\beta}\left(b^{(m + 1)}, f^{(m + 1)}\right)\right\}^{-1} \left\{s_\beta\left(b^{(m + 1)}, f^{(m + 1)}\right) + A_\beta\left(b^{(m + 1)}, f^{(m + 1)}\right)\right\}\)

  13. \(v_\phi^{(m + 1)} \leftarrow \left\{i_{\phi\phi}\left(b^{(m + 1)}, f^{(m + 1)}\right)\right\}^{-1} \left\{s_\phi\left(b^{(m + 1)}, f^{(m + 1)}\right) + A_\phi\left(b^{(m + 1)}, f^{(m + 1)}\right)\right\}\)

Continue or break halving within inner loop

  14. if \(m + 1 < M\) and \(\left|v_\beta^{(m + 1)}\right|_1 + \left|v_\phi^{(m + 1)}\right| > d\)

    14.1. \(m \leftarrow m + 1\)

    14.2. GO TO 10

  15. else

    15.1. \(\beta^{(k + 1)} \leftarrow b^{(m + 1)}\)

    15.2. \(\phi^{(k + 1)} \leftarrow f^{(m + 1)}\)

    15.3. \(\upsilon_\beta^{(k + 1)} \leftarrow v_\beta^{(m + 1)}\)

    15.4. \(\upsilon_\phi^{(k + 1)} \leftarrow v_\phi^{(m + 1)}\)

Continue or break outer loop

  16. if \(k + 1 < K\) and \(\left|\upsilon_\beta^{(k + 1)}\right|_1 + \left|\upsilon_\phi^{(k + 1)}\right| > \epsilon\)

    16.1. \(k \leftarrow k + 1\)

    16.2. GO TO 4

  17. else

    17.1. \(\tilde\beta \leftarrow \beta^{(k + 1)}\)

    17.2. \(\tilde\phi \leftarrow \phi^{(k + 1)}\)

    17.3. STOP
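The iteration above can be sketched in a few lines of R for the special case of logistic regression with binary responses. There \(\phi = 1\) is known, so only the \(\beta\) updates are needed, and for the logit link with \(m_i = 1\) the adjustment simplifies to \(A_\beta = X^\top\{h_i(1/2 - \mu_i)\}\). This is a minimal sketch under those assumptions and not the code in brglmFit. The function br_logistic, its arguments and its defaults are hypothetical.

```r
# Quasi Fisher scoring with step-halving for logistic regression
# (binary responses, m_i = 1, phi = 1); hypothetical helper, not from brglm2
br_logistic <- function(X, y, beta0 = rep(0, ncol(X)),
                        epsilon = 1e-8, K = 100, M = 12) {
  # Direction {i_bb}^{-1} {s_b + A_b} evaluated at beta
  direction <- function(beta) {
    mu <- plogis(drop(X %*% beta))
    w <- mu * (1 - mu)                        # working weights
    info <- crossprod(X, w * X)               # X' W X
    XW <- sqrt(w) * X
    h <- rowSums((XW %*% solve(info)) * XW)   # hat values
    adjusted_score <- crossprod(X, y - mu + h * (0.5 - mu))
    drop(solve(info, adjusted_score))
  }
  beta <- beta0
  v_outer <- direction(beta)
  for (k in seq_len(K)) {                     # outer loop
    b <- beta
    v <- v_outer
    d <- sum(abs(v))                          # L1 norm of starting direction
    for (m in seq_len(M) - 1) {               # inner loop, m = 0, ..., M - 1
      b <- b + 2^(-m) * v                     # update parameters
      v <- direction(b)                       # update direction
      if (sum(abs(v)) <= d) break             # stop halving
    }
    beta <- b
    v_outer <- v
    if (sum(abs(v_outer)) <= epsilon) break   # converged
  }
  list(coefficients = beta, outer_iterations = k,
       converged = sum(abs(v_outer)) <= epsilon)
}

# Completely separated data: the estimates remain finite
x <- c(-2, -1, -0.5, 0.5, 1, 2)
y <- c(0, 0, 0, 1, 1, 1)
fit_sep <- br_logistic(cbind(1, x), y)
fit_sep$coefficients

# Intercept-only model: compare with the closed form (sum(y) + 1/2) / (n + 1)
y0 <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
fit_int <- br_logistic(matrix(1, length(y0), 1), y0)
c(plogis(fit_int$coefficients), (sum(y0) + 0.5) / (length(y0) + 1))
```

In the intercept-only case all hat values equal \(1/n\), so the adjusted score equation can be solved by hand, which gives a convenient check on the iteration.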

Notes

References

Albert, A., and J. A. Anderson. 1984. “On the Existence of Maximum Likelihood Estimates in Logistic Regression Models.” Biometrika 71 (1): 1–10.

Cordeiro, G. M., and P. McCullagh. 1991. “Bias Correction in Generalized Linear Models.” Journal of the Royal Statistical Society, Series B: Methodological 53 (3): 629–43.

Firth, D. 1993. “Bias Reduction of Maximum Likelihood Estimates.” Biometrika 80 (1): 27–38.

Kosmidis, I. 2014. “Bias in Parametric Estimation: Reduction and Useful Side-Effects.” Wiley Interdisciplinary Reviews: Computational Statistics 6 (3). John Wiley & Sons, Inc.: 185–96. doi:10.1002/wics.1296.

Kosmidis, I., and D. Firth. 2009. “Bias Reduction in Exponential Family Nonlinear Models.” Biometrika 96 (4): 793–804. doi:10.1093/biomet/asp055.

———. 2010. “A Generic Algorithm for Reducing Bias in Parametric Estimation.” Electronic Journal of Statistics 4: 1097–1112. doi:10.1214/10-EJS579.