VARshrink 0.3: Shrinkage Estimation Methods for Vector Autoregressive Models (A Brief Version)

Namgil Lee, Heon-Young Yang, and Sung-Ho Kim*

2019-10-07

Abstract

We introduce an R software package, VARshrink, which provides shrinkage estimation methods for vector autoregressive (VAR) models. In contrast to the standard ordinary least squares method, shrinkage estimation methods can be applied to high-dimensional VAR models whose dimensionality exceeds the number of observations. While existing R packages for VAR shrinkage are mostly based on parametric Bayesian estimation methods using informative priors, VARshrink aims to be an integrative R package delivering nonparametric, parametric, and semiparametric methods in a unified and consistent manner. The current version of VARshrink contains four shrinkage estimation methods: multivariate ridge regression, a nonparametric shrinkage method, a full Bayesian approach using noninformative priors, and a semiparametric Bayesian approach using informative priors. VAR estimation problems are formulated in a multivariate regression form, so that all the shrinkage estimation methods can be run through one interface function, VARshrink(). We present explicit mathematical expressions for the shrinkage estimators of each method. The effective number of parameters can be calculated from the value of a shrinkage intensity parameter, and it plays an important role in further statistical analyses. We provide sample R code demonstrating the package and model comparison.

keywords: Bayes estimation, vector autoregression (VAR), high-dimensionality, shrinkage, multivariate time series

1 Introduction

Let \(\mathbf{y}_t = (y_{t1},y_{t2},\ldots,y_{tK})^\top\) denote a \(K\times 1\) vector of endogenous variables. A vector autoregressive (VAR) model of order \(p\) can be expressed as \[ \mathbf{y}_t = \mathbf{A}_1 \mathbf{y}_{t-1} + \cdots + \mathbf{A}_p \mathbf{y}_{t-p} + \mathbf{C} \mathbf{d}_t + \boldsymbol\epsilon_t, \qquad\qquad\qquad (1) \] where \(\mathbf{d}_t\) is an \(L\times 1\) vector of deterministic regressors, \(\boldsymbol\epsilon_t\) is a \(K\times 1\) noise vector, and \(\mathbf{A}_1,\ldots,\mathbf{A}_p\) and \(\mathbf{C}\) are coefficient matrices (Hamilton 1994; Tsay 2005).
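To make Eq. (1) concrete, a VAR(p) process can be simulated in a few lines of base R. The helper below, sim_var(), is a hypothetical name for this illustration only; it assumes Gaussian noise and a constant deterministic term \(\mathbf{d}_t = 1\), whereas the package itself provides simVARmodel() for this purpose.

```r
# Simulate Tlen observations from the VAR(p) of Eq. (1) with d_t = 1 and
# Gaussian noise; a minimal base-R sketch, not the package implementation.
sim_var <- function(Tlen, A, c0, Sigma, burnin = 10) {
  p <- length(A)                # A is a list of K x K coefficient matrices
  K <- nrow(A[[1]])
  Lc <- chol(Sigma)             # Sigma = Lc' Lc
  y <- matrix(0, Tlen + burnin + p, K)
  for (t in (p + 1):nrow(y)) {
    eps <- drop(rnorm(K) %*% Lc)            # epsilon_t ~ N(0, Sigma)
    y[t, ] <- c0 + eps
    for (j in seq_len(p))
      y[t, ] <- y[t, ] + drop(A[[j]] %*% y[t - j, ])  # add A_j y_{t-j}
  }
  y[(burnin + p + 1):nrow(y), , drop = FALSE]         # discard burn-in
}

Y <- sim_var(100, A = list(0.5 * diag(2)), c0 = c(0.2, 0.7),
             Sigma = diag(0.1^2, 2))
```

The parameter values above match the simulation setting used later in Section 3.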

1.1 Background

Recently, several R packages have been developed for parameter estimation and forecasting using stochastic time series models. The forecast package provides various methods and tools for univariate time series models such as the ARIMA and ETS models (Hyndman et al. 2018). The MTS package has been developed for a wide variety of multivariate linear time series models and multivariate volatility models such as the VARMA, multivariate EWMA, and low-dimensional BEKK models (Tsay and Wood 2018). The vars package provides methods for multivariate time series analysis using the VAR, SVAR, and SVEC models (Pfaff and Stigler 2018).

In this study, we focus on shrinkage estimation of VAR model parameters. Shrinkage estimation methods have been playing crucial roles in high-dimensional statistical modeling; see, e.g., Beltrachini, von Ellenrieder, and Muravchik (2013), Böhm and von Sachs (2009), Fiecas and Ombao (2011) and Ledoit and Wolf (2004). For VAR models, several shrinkage estimation methods have been suggested such as a Stein-type nonparametric shrinkage estimation method (Opgen-Rhein and Strimmer 2007b), Bayesian VARs using informative priors (Bańbura, Giannone, and Reichlin 2010; Doan, Litterman, and Sims 1984; Koop and Korobilis 2010; Litterman 1986), Bayesian VARs using noninformative priors (Ni and Sun 2005; Sun and Ni 2004), and a semiparametric Bayesian approach adopting a modified \(K\)-fold cross validation (Lee, Choi, and Kim 2016).

Due to its popularity in macroeconomic time series analysis, several Bayesian VAR methods have been implemented in R packages. For example, the function BVAR() in MTS implements a Bayesian VAR method using an informative prior (Tsay and Wood 2018); the package bvarsv implements Bayesian VAR models with stochastic volatility and time-varying parameters (Krueger 2015; Koop and Korobilis 2010; Primiceri 2005).

On the other hand, we note that shrinkage methods implemented for purposes other than multivariate time series analysis can also be applied to shrinkage estimation of VAR parameters. For instance, the function cov.shrink() in the package corpcor was implemented to compute shrinkage estimates of covariance matrices, but it has been applied to estimate VAR coefficients (Opgen-Rhein and Strimmer 2007b; Schäfer et al. 2017). Moreover, VAR models can be reformulated as multivariate regression problems, so that penalized least squares methods can be used for shrinkage estimation of VAR parameters; examples include the functions lm.gls() for generalized least squares and lm.ridge() for ridge regression in the package MASS (Ripley et al. 2018), the function glmnet() for Lasso and Elastic-Net regularized generalized linear models in the package glmnet (Friedman et al. 2018), and the function linearRidge() for ridge regression in the package ridge (Moritz and Cule 2018).

1.2 Main Purpose

While Bayesian approaches have been widely used in the literature, nonparametric and semiparametric approaches have an advantage for high-dimensional VARs with more than several hundred time series variables because of their relatively low computational costs (Opgen-Rhein and Strimmer 2007b). Despite their relatively high computational costs, Bayesian approaches can flexibly impose proper assumptions on the multivariate time series data, such as VAR roots near unity and correlations between noise processes (Lee, Choi, and Kim 2016). In this sense, a semiparametric approach can be a trade-off between nonparametric and parametric approaches (Lee, Choi, and Kim 2016).

In this study, we developed an integrative R package, VARshrink, implementing nonparametric, parametric, and semiparametric approaches for shrinkage estimation of VAR model parameters. By providing a simple interface function, VARshrink(), for running various types of shrinkage estimation methods, the performance of the methods can be easily compared. We note that the package vars implements an ordinary least squares method for VAR models via the function VAR(). The function VARshrink() was built as an extension of VAR() to shrinkage estimation methods, so that the VARshrink package takes advantage of the tools and methods available in the vars package. For example, we demonstrate the use of model selection criteria such as AIC and BIC in this paper, where the methods AIC(), BIC(), and logLik() can handle multivariate t-distributions and the effective number of parameters in VARshrink.

This paper is a brief version of the original manuscript. It is organized as follows. In Section 2, we explain the formulation of VAR models as a multivariate regression problem, which simplifies the implementation of the package. In Section 3, we describe the common interface function and the four shrinkage estimation methods included in the package. We present closed-form expressions for the shrinkage estimators produced by these methods, so that we can indicate the role of the shrinkage intensity parameter in each method. In addition, we explain how the effective number of parameters can be computed for shrinkage estimators. In Section 4, we briefly present numerical experiments on benchmark data and simulated data to compare the performance of the shrinkage estimation methods. Discussion and conclusions are provided in Section 5.


2 Models

In general, we can rewrite the model equation Eq. (1) in the form of a multivariate regression as \[ \mathbf{y}_t = \mathbf{\Psi}^\top \mathbf{x}_t + \boldsymbol{\epsilon}_t, \qquad\qquad\qquad (5) \] where \(\mathbf{\Psi} = (\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_p, \mathbf{C})^\top\) is a \((Kp + L) \times K\) matrix of coefficients, \(\mathbf{x}_t = (\mathbf{y}_{t-1}^\top, \ldots, \mathbf{y}_{t-p}^\top, \mathbf{d}_t^\top)^\top\) is a \((Kp + L)\times 1\) vector of regressors. For estimation of VAR parameters from the observed time series data \(\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_T\), we define data matrices as \[\begin{equation} \begin{split} \mathbf{Y} &= ( \mathbf{y}_{p+1}, \mathbf{y}_{p+2}, \ldots, \mathbf{y}_{T} )^\top \in \mathbb{R}^{N \times K}, \\ \mathbf{X} &= ( \mathbf{x}_{p+1}, \mathbf{x}_{p+2}, \ldots, \mathbf{x}_{T} )^\top \in \mathbb{R}^{N \times (Kp + L)}, \end{split} \end{equation}\] with \(N = T-p\). Then, we can rewrite Eq. (5) in a matrix form as \[ \mathbf{Y} = \mathbf{X} \mathbf{\Psi} + \mathbf{E} \in \mathbb{R}^{N \times K}, \qquad\qquad\qquad (6) \] with \(\mathbf{E} = (\boldsymbol\epsilon_{p+1}, \ldots, \boldsymbol\epsilon_T)^\top\) and \(N = T-p\).
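The data matrices \(\mathbf{Y}\) and \(\mathbf{X}\) of Eq. (6) can be assembled directly with base R's embed(). The helper below, build_var_matrices(), is a hypothetical name used only for illustration; it assumes the single deterministic regressor \(\mathbf{d}_t = 1\) (i.e., \(L = 1\)), and is not the package's internal code.

```r
# Build Y (N x K) and X (N x (Kp + 1)) of Eq. (6) from a T x K series,
# assuming the only deterministic regressor is a constant (L = 1).
build_var_matrices <- function(y, p) {
  y <- as.matrix(y)
  K <- ncol(y)
  Z <- embed(y, p + 1)   # row for time t holds (y_t, y_{t-1}, ..., y_{t-p})
  Y <- Z[, 1:K, drop = FALSE]                       # responses y_{p+1}, ..., y_T
  X <- cbind(Z[, -(1:K), drop = FALSE], const = 1)  # lagged values plus intercept
  list(Y = Y, X = X, N = nrow(Y))
}
```

For \(T = 100\), \(K = 2\), and \(p = 1\) this yields \(N = 99\) rows, matching the sample size reported by summary() in Section 3.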


3 Shrinkage Estimation Methods

In this section, we describe the shrinkage estimation methods implemented in this package. The methods estimate either the VAR coefficient matrix \(\mathbf{\Psi}\) alone or both \(\mathbf{\Psi}\) and the scale matrix \(\mathbf{\Sigma}\) of the noise in Eq. (6).

We provide a common R function interface VARshrink() for running the estimation methods, which is defined by

VARshrink(y, p = 1, type = c("const", "trend", "both", "none"),
  season = NULL, exogen = NULL, method = c("ridge", "ns", "fbayes",
  "sbayes", "kcv"), lambda = NULL, lambda_var = NULL, dof = Inf, ...)

The input arguments are as follows: y is the multivariate time series data; p is the AR order; type specifies which deterministic regressors to include (constant, trend, both, or none); season and exogen supply seasonal dummy and exogenous variables as in VAR(); method selects one of the shrinkage estimation methods; lambda and lambda_var set shrinkage parameter values, with NULL triggering automatic selection; and dof fixes the degrees of freedom of the multivariate t-distribution for noise, with Inf corresponding to the multivariate normal distribution and NULL triggering estimation of \(\nu\) in the Bayesian methods.

The output value is an object of class “varshrinkest”, which inherits from the class “varest” of the package vars; an object of class “varest” is what the function VAR() in vars returns. The input arguments and the output value indicate that VARshrink() is an extension of VAR() in the vars package. As a result, almost all the methods and functions included in the package vars are available for the package VARshrink, such as fevd(), Phi(), and plot(). The methods and functions available in the VARshrink package are summarized in Table 1. The names of the functions for class “varshrinkest” have the suffix "_sh" in order to distinguish them from the functions for class “varest”. We remark that the classes “varshrinkest”, “varshirf”, and “varshsum” inherit from the classes “varest”, “varirf”, and “varsum” of the vars package, respectively.

Table 1. Structure of the package VARshrink.
| Function or method | Class | Methods for class | Functions for class |
|---|---|---|---|
| VAR | varshrinkest, varest | coef, fevd, fitted, irf, logLik, Phi, plot, predict, print, Psi, resid, summary | Acoef_sh, arch.test_sh, Bcoef_sh, BQ_sh, causality_sh, normality.test_sh, restrict_sh, roots_sh, serial.test_sh, stability_sh |
| fevd | varfevd | plot, print | |
| irf | varshirf, varirf | plot, print | |
| predict | varprd | plot, print | fanchart |
| summary | varshsum, varsum | print | |
| arch.test_sh | varcheck | plot, print | |
| normality.test_sh | varcheck | plot, print | |
| serial.test_sh | varcheck | plot, print | |
| stability_sh | varstabil | plot, print | |

3.1 Multivariate Ridge Regression

The ridge regression method is a penalized least squares (PLS) method, which produces a biased estimate of the VAR coefficients (Hoerl and Kennard 1970). Formally, the ridge regression estimator of \(\mathbf{\Psi}\) is obtained by minimizing the penalized sum of squared prediction errors (SSPE): \[\begin{equation} \widehat{ \mathbf{\Psi} }^{\text{R}} (\lambda) = \arg\min_{\mathbf{\Psi}} \ \frac{1}{N} \left\|\mathbf{Y} - \mathbf{X} \mathbf{\Psi} \right\|_F^2 + \lambda \left\| \mathbf{\Psi} \right\|_F^2, \end{equation}\] where \(\|\mathbf{A}\|_F = \sqrt{ \sum_{i} \sum_{j} a_{ij}^2 }\) is the Frobenius norm of a matrix \(\mathbf{A}\), \(N = T-p\), and \(\lambda \geq 0\) is called the regularization parameter or the shrinkage parameter. The ridge regression estimator \(\widehat{ \mathbf{\Psi} }^\text{R} (\lambda)\) has the closed form \[\begin{equation} \widehat{ \mathbf{\Psi} }^\text{R} (\lambda) = \left( \mathbf{X}^\top \mathbf{X} + N \lambda \mathbf{I} \right)^{-1} \mathbf{X}^\top \mathbf{Y}, \qquad \lambda \geq 0. \end{equation}\]
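The closed-form ridge estimator is easy to verify numerically. The sketch below (ridge_var() is a hypothetical name for illustration) computes it directly in base R; the package evaluates the same expression internally when method = "ridge".

```r
# Ridge estimator (X'X + N lambda I)^{-1} X'Y of the VAR coefficients;
# base-R sketch, not the package's internal code.
ridge_var <- function(X, Y, lambda) {
  N <- nrow(X)
  solve(crossprod(X) + N * lambda * diag(ncol(X)), crossprod(X, Y))
}
```

At lambda = 0 this reduces to the ordinary least squares estimator; increasing lambda shrinks all coefficients toward zero.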

The shrinkage parameter \(\lambda\) for the ridge regression can be determined automatically by generalized cross-validation (GCV) (Golub, Heath, and Wahba 1979). The GCV selects the value of \(\lambda\) that minimizes the GCV score \[\begin{equation} GCV(\lambda) = \frac{1}{N} \left\| \left(\mathbf{I} - \mathbf{H}(\lambda)\right) \mathbf{Y} \right\|^2_\text{F} \left/ \left[ \frac{1}{N} \text{Trace}\left(\mathbf{I} - \mathbf{H}(\lambda) \right) \right]^2 \right. , \end{equation}\] where \(\mathbf{H}(\lambda) = \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + N \lambda \mathbf{I} \right)^{-1} \mathbf{X}^\top\) is the hat matrix.
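The GCV score can likewise be computed in a few lines. The sketch below (gcv_ridge() and select_lambda() are hypothetical names for this illustration) forms the hat matrix \(\mathbf{H}(\lambda) = \mathbf{X}(\mathbf{X}^\top\mathbf{X} + N\lambda\mathbf{I})^{-1}\mathbf{X}^\top\) explicitly, which is fine for small problems; the package's internal computation may differ.

```r
# GCV score for ridge VAR estimation with hat matrix
# H(lambda) = X (X'X + N lambda I)^{-1} X'; base-R sketch.
gcv_ridge <- function(X, Y, lambda) {
  N <- nrow(X)
  H <- X %*% solve(crossprod(X) + N * lambda * diag(ncol(X)), t(X))
  resid <- Y - H %*% Y                        # (I - H(lambda)) Y
  (sum(resid^2) / N) / ((N - sum(diag(H))) / N)^2
}

# pick the lambda with the smallest GCV score from a grid of candidates
select_lambda <- function(X, Y, grid)
  grid[which.min(vapply(grid, function(l) gcv_ridge(X, Y, l), numeric(1)))]
```

Note that \(\text{Trace}(\mathbf{I} - \mathbf{H}(\lambda)) = N - \text{Trace}(\mathbf{H}(\lambda))\), which the code uses to avoid forming the identity matrix.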

In this package, the multivariate ridge regression is run by the function call VARshrink(method = "ridge", ...). If the input argument lambda is set to a value or a vector of values, then the GCV score is computed for each \(\lambda\) value, and the VAR coefficients with the smallest GCV score are selected. If lambda = NULL, then the default grid of values c(0.0001, 0.0005, 0.001, 0.005, ..., 10, 50) is used.

For example, simulated time series data of length \(T=100\) were generated based on a multivariate normal distribution for noise and a VAR model with \(p=1\), \(K=2\), \(\mathbf{A}_1 = 0.5\mathbf{I}_2\), \(\mathbf{C}=(0.2, 0.7)^\top\), and \(\mathbf{\Sigma} = 0.1^2\mathbf{I}_2\) as follows:

set.seed(1000)
myCoef <- list(A = list(matrix(c(0.5, 0, 0, 0.5), 2, 2)), c = c(0.2, 0.7))
myModel <- list(Coef = myCoef, Sigma = diag(0.1^2, 2), dof = Inf)
Y <- simVARmodel(numT = 100, model = myModel, burnin = 10)
resu_estim <- list()

Then, the multivariate ridge regression can be carried out for VAR models as follows. The result printed on the screen shows all the lambda values considered and the corresponding GCV values. The VAR parameters are estimated using the lambda value with the minimum GCV value.

resu_estim$`Ridge regression` <-
  VARshrink(Y, p = 1, type = "const", method = "ridge", lambda = NULL)
resu_estim$`Ridge regression`
#> 
#> VAR Shrinkage Estimation Results:
#> ================================= 
#> 
#> Estimated coefficients for equation y1: 
#> ======================================= 
#> Call:
#> y1 = y1.l1 + y2.l1 + const 
#> 
#>      y1.l1      y2.l1      const 
#> 0.61424699 0.07318362 0.05264740 
#> 
#> 
#> Estimated coefficients for equation y2: 
#> ======================================= 
#> Call:
#> y2 = y1.l1 + y2.l1 + const 
#> 
#>     y1.l1     y2.l1     const 
#> 0.1195550 0.3254618 0.9055450 
#> 
#> 
#> lambda: 1e-04 5e-04 0.001 0.005 0.01 0.05 0.1 0.5 1 5 10 50 (estimated: TRUE) 
#> GCV:  0.0187 0.0186 0.0186 0.0192 0.02 0.0227 0.025 0.0647 0.1506 0.8406 1.2754 1.9329

The method summary() is available for objects of class “varshrinkest” as follows:

summary(resu_estim$`Ridge regression`)
#> 
#> VAR Shrinkage Estimation Results:
#> =================================== 
#> Endogenous variables: y1, y2 
#> Deterministic variables: const 
#> Sample size: 99 
#> Log Likelihood: 188.793 
#> Roots of the characteristic polynomial:
#> 0.6419 0.2978
#> Call:
#> VARshrink(y = Y, p = 1, type = "const", method = "ridge", lambda = NULL)
#> 
#> 
#> Estimation results for equation y1: 
#> =================================== 
#> y1 = y1.l1 + y2.l1 + const 
#> 
#>       Estimate Std. Error t value Pr(>|t|)    
#> y1.l1  0.61425    0.07636   8.045 2.27e-12 ***
#> y2.l1  0.07318    0.07549   0.969    0.335    
#> const  0.05265    0.10888   0.484    0.630    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> 
#> Residual standard error: 0.08913 on 96.16244 degrees of freedom
#> Multiple R-Squared: 0.3864,  Adjusted R-squared: 0.3747 
#> F-statistic: 32.95 on 1.837559 and 96.16244 DF,  p-value: 4.725e-11 
#> 
#> 
#> Estimation results for equation y2: 
#> =================================== 
#> y2 = y1.l1 + y2.l1 + const 
#> 
#>       Estimate Std. Error t value Pr(>|t|)    
#> y1.l1  0.11955    0.08632   1.385 0.169249    
#> y2.l1  0.32546    0.08534   3.814 0.000242 ***
#> const  0.90555    0.12308   7.357  6.3e-11 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> 
#> Residual standard error: 0.1008 on 96.16244 degrees of freedom
#> Multiple R-Squared: 0.1219,  Adjusted R-squared: 0.1051 
#> F-statistic: 7.263 on 1.837559 and 96.16244 DF,  p-value: 0.00157 
#> 
#> 
#> 
#> Scale matrix, Sigma, of multivariate t distribution for noise:
#>            y1         y2
#> y1  0.0077160 -0.0006829
#> y2 -0.0006829  0.0098611
#> 
#> Degrees of freedom of multivariate t distribution for noise:
#> [1] Inf
#> 
#> Correlation matrix of Sigma:
#>          y1       y2
#> y1  1.00000 -0.07829
#> y2 -0.07829  1.00000

3.2 Nonparametric Shrinkage (NS)

The nonparametric shrinkage (NS) estimation method for VAR models, proposed by Opgen-Rhein and Strimmer (2007b), produces an estimate of \(\mathbf{\Psi}\) based on a James-Stein type shrinkage of sample covariance matrices (Opgen-Rhein and Strimmer 2007a; Schäfer and Strimmer 2005).

We will briefly describe the NS method. In the NS method, the data matrices \(\mathbf{X}\) and \(\mathbf{Y}\) are mean-corrected so that each column has the mean of zero. Let \(\mathbf{Z} = [\mathbf{X}, \mathbf{Y}] \in \mathbb{R}^{N \times (Kp + K + L)}\) be a combined data matrix. The sample covariance matrix of \(\mathbf{Z}\) is partitioned as \[\begin{equation} \mathbf{S}_\text{ZZ} = \frac{1}{N - 1} \mathbf{Z}^\top \mathbf{Z} = \begin{bmatrix} \mathbf{S}_\text{XX} & \mathbf{S}_\text{XY} \\ \mathbf{S}_\text{XY}^\top & \mathbf{S}_\text{YY} \end{bmatrix} \in \mathbb{R}^{(Kp + K + L) \times (Kp + K + L)}, \end{equation}\] where \(\mathbf{S}_\text{XX} = (N - 1)^{-1} \mathbf{X}^\top \mathbf{X}\), \(\mathbf{S}_\text{XY} = (N - 1)^{-1} \mathbf{X}^\top \mathbf{Y}\), and \(\mathbf{S}_\text{YY} = (N - 1)^{-1} \mathbf{Y}^\top \mathbf{Y}\). The matrix \(\mathbf{S}_\text{ZZ}\) can be decomposed as \[\begin{equation} \mathbf{S}_\text{ZZ} = \mathbf{D}_\text{Z}^{1/2} \mathbf{R}_\text{ZZ} \mathbf{D}_\text{Z}^{1/2}, \end{equation}\] where \(\mathbf{R}_\text{ZZ}\) is the sample correlation matrix and \(\mathbf{D}_\text{Z} = \text{diag}(s_{11}, s_{22}, \ldots, s_{Kp + K + L, Kp + K + L})\) is a diagonal matrix with diagonal elements of sample variances. Shrinkage estimates of the correlation matrix and the variances can be written as \[\begin{equation} \widehat{\mathbf{R}}_\text{ZZ} = (1-\lambda) \mathbf{R}_\text{ZZ} + \lambda \mathbf{I} \quad \text{and} \quad \widehat{\mathbf{D}}_\text{Z} = \text{diag}(\hat{s}_{11}, \hat{s}_{22}, \ldots, \hat{s}_{Kp+K+L,Kp+K+L}) \end{equation}\] with \[\begin{equation} \hat{s}_{ii} = (1-\lambda_v) s_{ii} + \lambda_v s_\text{med}, \quad i = 1, 2, \ldots, Kp + K + L, \end{equation}\] where \(s_\text{med}\) is a median of all the sample variances \(s_{ii}\), and \(0\leq \lambda, \lambda_v \leq 1\) are shrinkage parameters. 
The estimated covariance matrix can be computed as \[\begin{equation} \widehat{\mathbf{S}}_\text{ZZ}(\lambda, \lambda_v) = \widehat{\mathbf{D}}_\text{Z}^{1/2} \widehat{\mathbf{R}}_\text{ZZ} \widehat{\mathbf{D}}_\text{Z}^{1/2}. \end{equation}\] The values of the shrinkage parameters \(\lambda\) and \(\lambda_v\) are determined by the James-Stein type shrinkage approach described in Opgen-Rhein and Strimmer (2007a) and Schäfer and Strimmer (2005).

The ordinary least squares estimate \(\widehat{\mathbf{\Psi}}^{\text{OLS}}\) of \(\mathbf{\Psi}\) is given by \(\widehat{\mathbf{\Psi}}^{\text{OLS}} = \mathbf{S}_\text{XX}^{-1} \mathbf{S}_\text{XY}\). We define the NS estimate of \(\mathbf{\Psi}\) as \[\begin{equation} \widehat{ \mathbf{\Psi} }^\text{N} (\lambda, \lambda_v) = \widehat{\mathbf{S}}_\text{XX}^{-1} \widehat{\mathbf{S}}_\text{XY}, \qquad 0\leq \lambda, \lambda_v \leq 1, \end{equation}\] where \(\widehat{\mathbf{S}}_\text{XX}\) and \(\widehat{\mathbf{S}}_\text{XY}\) are blocks of the estimated covariance matrix, \[\begin{equation} \widehat{\mathbf{S}}_\text{ZZ} (\lambda, \lambda_v) = \begin{bmatrix} \widehat{\mathbf{S}}_\text{XX} & \widehat{\mathbf{S}}_\text{XY} \\ \widehat{\mathbf{S}}_\text{XY}^\top & \widehat{\mathbf{S}}_\text{YY} \end{bmatrix}. \end{equation}\]
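For fixed shrinkage intensities, the NS estimator can be reproduced in base R. The function below (ns_var() is a hypothetical name for illustration) implements the algebra above for given \(\lambda\) and \(\lambda_v\); in the package, the intensities themselves are estimated from the data via corpcor::cov.shrink().

```r
# NS estimate of Psi for given shrinkage intensities lambda and lambda_v;
# base-R illustration of the formulas above. X must not contain a constant
# column (the NS method excludes deterministic terms).
ns_var <- function(X, Y, lambda, lambda_v) {
  J <- ncol(X)
  Z <- scale(cbind(X, Y), center = TRUE, scale = FALSE)  # mean-correct columns
  S <- crossprod(Z) / (nrow(Z) - 1)                      # sample covariance S_ZZ
  v <- diag(S)                                           # sample variances
  R_hat <- (1 - lambda) * cov2cor(S) + lambda * diag(length(v))
  v_hat <- (1 - lambda_v) * v + lambda_v * median(v)
  S_hat <- outer(sqrt(v_hat), sqrt(v_hat)) * R_hat       # D^{1/2} R D^{1/2}
  solve(S_hat[1:J, 1:J], S_hat[1:J, -(1:J), drop = FALSE])  # S_XX^{-1} S_XY
}
```

With lambda = 0 and lambda_v = 0 the result coincides with the OLS estimate on the mean-corrected data, which is a convenient sanity check.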

In the package VARshrink, the function VARshrink(method = "ns", ...) provides an interface to the NS method. Specifically, the package corpcor (Schäfer et al. 2017) includes the R function cov.shrink(), which can determine \(\lambda\) and \(\lambda_v\) and estimate the covariance matrix \(\widehat{\mathbf{S}}_\text{ZZ}(\lambda, \lambda_v)\). The function VARshrink() in the VARshrink package infers the NS estimates of VAR coefficients, \(\widehat{\mathbf{\Psi}}^\text{N} (\lambda, \lambda_v)\), from the covariance matrix \(\widehat{\mathbf{S}}_\text{ZZ}(\lambda, \lambda_v)\). If the input arguments lambda and lambda_var are set as lambda = NULL and lambda_var = NULL, then \(\lambda\) and \(\lambda_v\) are determined automatically, and the estimated values are reported in the printed output. For example,

resu_estim$`Nonparametric shrinkage` <-
  VARshrink(Y, p = 1, type = "const", method = "ns",
                    lambda = NULL, lambda_var = NULL)
#> Warning in VARshrink(Y, p = 1, type = "const", method = "ns", lambda =
#> NULL, : 'ns' method does not allow type='const'.. changed to 'none'.
resu_estim$`Nonparametric shrinkage`
#> 
#> VAR Shrinkage Estimation Results:
#> ================================= 
#> 
#> Estimated coefficients for equation y1: 
#> ======================================= 
#> Call:
#> y1 = y1.l1 + y2.l1 
#> 
#>      y1.l1      y2.l1 
#> 0.56430137 0.06139375 
#> 
#> 
#> Estimated coefficients for equation y2: 
#> ======================================= 
#> Call:
#> y2 = y1.l1 + y2.l1 
#> 
#>     y1.l1     y2.l1 
#> 0.1080570 0.2537706 
#> 
#> 
#> lambda: 0.104881997093098 (estimated: TRUE) 
#> lambda_var: 1 (estimated: TRUE)

The method summary() shows statistical inference on the estimated VAR coefficients, together with the estimated scale matrix \(\mathbf{\Sigma}\) of the multivariate t-distribution for noise.

summary(resu_estim$`Nonparametric shrinkage`)
#> 
#> VAR Shrinkage Estimation Results:
#> =================================== 
#> Endogenous variables: y1, y2 
#> Deterministic variables: none 
#> Sample size: 99 
#> Log Likelihood: 188.443 
#> Roots of the characteristic polynomial:
#> 0.5844 0.2337
#> Call:
#> VARshrink(y = Y, p = 1, type = "const", method = "ns", lambda = NULL, 
#>     lambda_var = NULL)
#> 
#> 
#> Estimation results for equation y1: 
#> =================================== 
#> y1 = y1.l1 + y2.l1 
#> 
#>       Estimate Std. Error t value Pr(>|t|)    
#> y1.l1 0.564301   0.007754  72.775  < 2e-16 ***
#> y2.l1 0.061394   0.007315   8.393 3.47e-13 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> 
#> Residual standard error: 0.08828 on 98.81547 degrees of freedom
#> Multiple R-Squared: 0.3446
#> Warning in pf(result$fstatistic[1L], result$fstatistic[2L],
#> result$fstatistic[3L], : NaNs produced
#> ,    Adjusted R-squared:  0.35 
#> F-statistic: -63.71 on -0.8154669 and 98.81547 DF,  p-value: NA 
#> 
#> 
#> Estimation results for equation y2: 
#> =================================== 
#> y2 = y1.l1 + y2.l1 
#> 
#>       Estimate Std. Error t value Pr(>|t|)    
#> y1.l1 0.108057   0.008725   12.38   <2e-16 ***
#> y2.l1 0.253771   0.008231   30.83   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> 
#> Residual standard error: 0.09933 on 98.81547 degrees of freedom
#> Multiple R-Squared: 0.08139
#> Warning in pf(result$fstatistic[1L], result$fstatistic[2L],
#> result$fstatistic[3L], : NaNs produced
#> ,    Adjusted R-squared: 0.08897 
#> F-statistic: -10.74 on -0.8154669 and 98.81547 DF,  p-value: NA 
#> 
#> 
#> 
#> Scale matrix, Sigma, of multivariate t distribution for noise:
#>            y1         y2
#> y1  0.0077791 -0.0006733
#> y2 -0.0006733  0.0098486
#> 
#> Degrees of freedom of multivariate t distribution for noise:
#> [1] Inf
#> 
#> Correlation matrix of Sigma:
#>          y1       y2
#> y1  1.00000 -0.07693
#> y2 -0.07693  1.00000

3.3 Full Bayesian Shrinkage

Ni and Sun (2005) and Sun and Ni (2004) studied Bayesian estimation methods using noninformative priors for VAR models, employing Markov chain Monte Carlo (MCMC) methods to estimate coefficient matrices, noise covariance matrices, and other hyperparameters of Bayesian VAR models. In Ni and Sun (2005), various Bayesian estimators were compared using several types of loss functions, noninformative priors, and multivariate normal and t-distributions. Among them, Bayesian estimators using a certain type of noninformative priors showed higher accuracy than the other Bayesian estimators in simulation experiments. The noninformative prior for the coefficient matrices was called the shrinkage prior, and the prior for the noise covariance matrix was called the reference prior.

In the package VARshrink, we can obtain Bayesian estimators using the shrinkage prior and the reference prior, which are the noninformative priors that yielded the highest accuracies in the simulation results (Ni and Sun 2005). As a Bayesian estimator of VAR parameters, the minimizer of the posterior expectation of quadratic loss is computed, which is the mean of the posterior distribution. Additionally, the minimizer of the posterior expectation of LINEX loss is also computed (Ni and Sun 2005; Zellner 1986). In this section, we will explain the model assumptions in more detail.

First, the noise vectors are independent and identically distributed according to a multivariate t-distribution with \(\nu\) degrees of freedom, i.e., \(\boldsymbol\epsilon_t \sim t_\nu (\mathbf{0}, \mathbf{\Sigma})\). This can be expressed hierarchically as \[\begin{equation} \begin{split} ( \boldsymbol\epsilon_t | q_t ) & \sim \text{N}_K (\mathbf{0}, q_t^{-1} \mathbf{\Sigma}), \qquad\qquad\qquad (7) \\ q_t & \sim \text{Gamma}(\nu/2, \nu/2). \end{split} \end{equation}\]
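The hierarchical form of Eq. (7) gives a direct recipe for drawing multivariate-t noise: sample a Gamma mixing weight \(q_t\), then a normal vector scaled by \(q_t^{-1/2}\). The sketch below (r_mvt() is a hypothetical name for illustration) implements this in base R.

```r
# Draw n multivariate t_nu(0, Sigma) noise vectors via the scale mixture
# of Eq. (7); base-R sketch.
r_mvt <- function(n, Sigma, nu) {
  K <- ncol(Sigma)
  q <- rgamma(n, shape = nu / 2, rate = nu / 2)    # q_t ~ Gamma(nu/2, nu/2)
  Z <- matrix(rnorm(n * K), n, K) %*% chol(Sigma)  # rows ~ N_K(0, Sigma)
  Z / sqrt(q)                                      # rows ~ t_nu(0, Sigma)
}
```

As \(\nu \to \infty\), each \(q_t\) concentrates at 1 and the noise becomes Gaussian, which corresponds to dof = Inf in VARshrink().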

Second, we denote the vectorized version of \(\mathbf{\Psi} \in\mathbb{R}^{(J/K)\times K}\) by \(\boldsymbol\psi = \text{vec}(\mathbf{\Psi}) \in\mathbb{R}^{J}\). Here \(J = K (Kp + L)\). The shrinkage prior \(\pi_\text{S}(\boldsymbol\psi)\) for \(\boldsymbol\psi \in\mathbb{R}^{J}\) is taken as \(\pi_\text{S}(\boldsymbol\psi) \propto \left\| \boldsymbol\psi \right\|^{ -(J-2) }.\) By introducing a latent variable \(\lambda>0\), the shrinkage prior can also be expressed as a scale mixture of multivariate normals as \[\begin{equation} \begin{split} (\boldsymbol\psi | \lambda) & \sim \text{N}_{J} (\mathbf{0}, \lambda^{-1} \mathbf{I}_J), \qquad\qquad\qquad (8) \\ \pi (\lambda) & \propto 1. \end{split} \end{equation}\]

Third, the reference prior \(\pi_\text{R}(\mathbf{\Sigma})\) for \(\mathbf{\Sigma}\) is taken as \[\begin{equation} \pi_\text{R}(\mathbf{\Sigma}) \propto \left| \mathbf{\Sigma} \right|^{-1} \prod_{1\leq i < j\leq K} (\lambda_i - \lambda_j)^{-1}, \end{equation}\] where \(\lambda_1 > \lambda_2 > \cdots > \lambda_K\) are the eigenvalues of \(\mathbf{\Sigma}\).

Note that no hyperparameters are involved in the shrinkage prior and the reference prior, since they are noninformative priors. The Gibbs MCMC method makes use of the hierarchical expression of \(\boldsymbol\epsilon_t\): it samples the parameters \((\boldsymbol\psi, \lambda, \mathbf{\Sigma}, \mathbf{Q}, \nu)\) with \(\mathbf{Q} = \text{diag}(q_{p+1}, \ldots, q_T)\) from their conditional posterior distributions (Ni and Sun 2005). We remark that the mean of the conditional posterior distribution \(\pi(\boldsymbol\psi | \lambda, \mathbf{\Sigma}, \mathbf{Q}, \nu; \mathbf{Y})\) of \(\boldsymbol\psi\) is given by \[\begin{equation} \widehat{\boldsymbol\psi}^{F} (\lambda) = \left[\left(\mathbf{\Sigma}^{-1} \otimes \left( \mathbf{X}^\top \mathbf{Q} \mathbf{X} \right) \right) + \lambda \mathbf{I}_J \right]^{-1} \text{vec}\left( \mathbf{X}^\top \mathbf{Q} \mathbf{Y} \mathbf{\Sigma}^{-1} \right), \qquad \lambda > 0. \qquad\qquad\qquad (9) \end{equation}\] Note that if \(\mathbf{\Sigma} = \mathbf{I}_K\) and \(q_{p+1}=\cdots=q_T = 1\), then this estimator reduces to the ridge regression estimator. That is, the Bayesian shrinkage estimator admits a more flexible expression, at the cost of the extra computation required to estimate the additional parameters.
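Eq. (9) is a generalized ridge solve, and for given \(\mathbf{\Sigma}\), \(\mathbf{Q}\), and \(\lambda\) it can be evaluated directly, as in the base-R sketch below (fbayes_psi() is a hypothetical name; the package embeds this step inside its Gibbs sampler rather than exposing it as a function).

```r
# Conditional posterior mean of Eq. (9) for given Sigma, q = diag(Q),
# and lambda; base-R sketch of one ingredient of the Gibbs sampler.
fbayes_psi <- function(X, Y, Sigma, q, lambda) {
  K <- ncol(Y)
  J <- K * ncol(X)
  Sinv <- solve(Sigma)
  XQX <- crossprod(X, q * X)                    # X' Q X
  XQY <- crossprod(X, q * Y)                    # X' Q Y
  A <- kronecker(Sinv, XQX) + lambda * diag(J)  # (Sigma^{-1} (x) X'QX) + lambda I
  b <- as.vector(XQY %*% Sinv)                  # vec(X' Q Y Sigma^{-1})
  matrix(solve(A, b), ncol = K)                 # reshape psi-hat to (Kp+L) x K
}
```

With \(\mathbf{\Sigma} = \mathbf{I}_K\) and \(q_t \equiv 1\), the result coincides column by column with the ridge estimator of Section 3.1 (up to the scaling of \(\lambda\)), illustrating the remark above.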

In the package VARshrink, the function VARshrink(method = "fbayes", ...) provides an interface to the full Bayesian shrinkage method with the shrinkage prior and the reference prior. The shrinkage parameter \(\lambda\) is not set at a fixed value, i.e., the arguments lambda and lambda_var are ignored here. Instead, the mean of the posterior distribution, \(\hat{\delta} = \mathbb{E}\left[ \lambda^{-1} | \mathbf{Y} \right]\), is estimated during the MCMC process (Ni and Sun 2005), and we define \(\hat{\lambda} = \hat{\delta}^{-1}\). Several additional arguments can be specified, such as burnincycle and mcmccycle for the numbers of burn-in and sampling iterations of the MCMC process.

For example, we run the full Bayesian shrinkage method with a fixed \(\nu = 6\) as follows.

resu_estim$`Full Bayes (fixed dof)` <-
  VARshrink(Y, p = 1, type = "const", method = "fbayes", dof = 6,
                     burnincycle = 1000, mcmccycle = 2000)
resu_estim$`Full Bayes (fixed dof)`
#> 
#> VAR Shrinkage Estimation Results:
#> ================================= 
#> 
#> Estimated coefficients for equation y1: 
#> ======================================= 
#> Call:
#> y1 = y1.l1 + y2.l1 + const 
#> 
#>      y1.l1      y2.l1      const 
#> 0.63856333 0.05398115 0.07042395 
#> 
#> 
#> Estimated coefficients for equation y2: 
#> ======================================= 
#> Call:
#> y2 = y1.l1 + y2.l1 + const 
#> 
#>     y1.l1     y2.l1     const 
#> 0.1028193 0.3133995 0.9326989 
#> 
#> 
#> Sigma for noise:
#>               [,1]          [,2]
#> [1,]  0.0073772943 -0.0002864258
#> [2,] -0.0002864258  0.0079703260
#> dof for noise: 6 (estimated: FALSE) 
#> lambda: 1.46381861993198 (estimated: TRUE)

By using summary(), we can find the estimated scale matrix \(\mathbf{\Sigma}\) of multivariate t-distribution for noise.

summary(resu_estim$`Full Bayes (fixed dof)`)
#> 
#> VAR Shrinkage Estimation Results:
#> =================================== 
#> Endogenous variables: y1, y2 
#> Deterministic variables: const 
#> Sample size: 99 
#> Log Likelihood: 184.085 
#> Roots of the characteristic polynomial:
#> 0.6548 0.2971
#> Call:
#> VARshrink(y = Y, p = 1, type = "const", method = "fbayes", dof = 6, 
#>     burnincycle = 1000, mcmccycle = 2000)
#> 
#> 
#> Estimation results for equation y1: 
#> =================================== 
#> y1 = y1.l1 + y2.l1 + const 
#> 
#>       Estimate Std. Error t value Pr(>|t|)    
#> y1.l1  0.63856    0.07898   8.085 1.87e-12 ***
#> y2.l1  0.05398    0.08314   0.649    0.518    
#> const  0.07042    0.12042   0.585    0.560    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> 
#> Residual standard error: 0.08917 on 96.04011 degrees of freedom
#> Multiple R-Squared: 0.4031,  Adjusted R-squared: 0.391 
#> F-statistic:  33.1 on 1.959891 and 96.04011 DF,  p-value: 1.608e-11 
#> 
#> 
#> Estimation results for equation y2: 
#> =================================== 
#> y2 = y1.l1 + y2.l1 + const 
#> 
#>       Estimate Std. Error t value Pr(>|t|)    
#> y1.l1   0.1028     0.0893   1.151  0.25242    
#> y2.l1   0.3134     0.0940   3.334  0.00122 ** 
#> const   0.9327     0.1362   6.850 7.02e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> 
#> Residual standard error: 0.1008 on 96.04011 degrees of freedom
#> Multiple R-Squared: 0.1111,  Adjusted R-squared: 0.09297 
#> F-statistic: 6.125 on 1.959891 and 96.04011 DF,  p-value: 0.003331 
#> 
#> 
#> 
#> Scale matrix, Sigma, of multivariate t distribution for noise:
#>            [,1]       [,2]
#> [1,]  0.0073773 -0.0002864
#> [2,] -0.0002864  0.0079703
#> 
#> Degrees of freedom of multivariate t distribution for noise:
#> [1] 6
#> 
#> Correlation matrix of Sigma:
#>          [,1]     [,2]
#> [1,]  1.00000 -0.03735
#> [2,] -0.03735  1.00000

Instead, we can estimate the degrees of freedom, \(\nu\), of the multivariate t-distribution by setting the argument dof = NULL as follows:

resu_estim$`Full Bayes (estim dof)` <-
  VARshrink(Y, p = 1, type = "const", method = "fbayes", dof = NULL,
            burnincycle = 1000, mcmccycle = 2000)
resu_estim$`Full Bayes (estim dof)`
#> 
#> VAR Shrinkage Estimation Results:
#> ================================= 
#> 
#> Estimated coefficients for equation y1: 
#> ======================================= 
#> Call:
#> y1 = y1.l1 + y2.l1 + const 
#> 
#>      y1.l1      y2.l1      const 
#> 0.63942247 0.05771602 0.06449797 
#> 
#> 
#> Estimated coefficients for equation y2: 
#> ======================================= 
#> Call:
#> y2 = y1.l1 + y2.l1 + const 
#> 
#>     y1.l1     y2.l1     const 
#> 0.1024030 0.3152221 0.9299175 
#> 
#> 
#> Sigma for noise:
#>               [,1]          [,2]
#> [1,]  0.0069592834 -0.0001524324
#> [2,] -0.0001524324  0.0079022459
#> dof for noise: 8.21302751071667 (estimated: TRUE) 
#> lambda: 1.41018436005034 (estimated: TRUE)

3.4 Semiparametric Bayesian Shrinkage

Whereas full Bayesian shrinkage methods estimate all the hyperparameters including latent variables via MCMC methods, semiparametric Bayesian methods estimate some of the hyperparameters by a certain nonparametric method and estimate the rest in a Bayesian framework. The semiparametric approach is advantageous especially when the dimensionality of the model is so high that standard MCMC methods are not computationally tractable.

Lee, Choi, and Kim (2016) assumed scale mixtures of multivariate normal distributions for the noise vectors as in Eq. (7). The prior distribution for the model coefficients, \(\boldsymbol\psi = \text{vec}(\mathbf{\Psi})\), was set as a multivariate normal distribution, similarly to Eq. (8). Here, we can choose either the conjugate (CJ) prior or the non-conjugate (NCJ) prior as follows:

  1. The conjugate prior for \(\boldsymbol\psi \in\mathbb{R}^J\) is expressed by \[\begin{equation} (\boldsymbol\psi | \lambda, \mathbf{\Sigma}) \sim \text{N}_{J} \left(\mathbf{0}, \frac{1-\lambda}{(N - 1) \lambda} \mathbf{\Sigma} \otimes \mathbf{I} \right), \qquad 0<\lambda < 1. \end{equation}\]

  2. The non-conjugate prior for \(\boldsymbol\psi \in\mathbb{R}^J\) is expressed by \[\begin{equation} (\boldsymbol\psi | \lambda) \sim \text{N}_{J} \left(\mathbf{0}, \frac{1-\lambda}{(N - 1) \lambda} \mathbf{I}_J \right), \qquad 0<\lambda < 1. \end{equation}\]

The scale matrix \(\mathbf{\Sigma}\) for the noise appears in the conjugate prior distribution but not in the non-conjugate one. The non-conjugate prior is quite similar to the full Bayesian formulation in Eq. (8). The main difference is that, in the semiparametric Bayesian approach, the shrinkage parameter \(\lambda\) is estimated explicitly by a nonparametric method, whereas in the full Bayesian approach it is a latent variable that is sampled and estimated implicitly by an MCMC method.

The prior distribution for \(\mathbf{\Sigma}\) was set as an inverse Wishart distribution, \[\begin{equation} (\mathbf{\Sigma} | \mathbf{L}_0, m_0) \sim \text{InvWishart}( \mathbf{L}_0, m_0), \qquad \mathbf{L}_0 \succ \mathbf{0}, \quad m_0 > K - 1, \end{equation}\] where \(\mathbf{L}_0 \succ \mathbf{0}\) means that \(\mathbf{L}_0\) is positive definite.

Once the shrinkage parameter \(\lambda\) is set at a fixed value, the other parameters, \(\boldsymbol\psi\), \(\mathbf{\Sigma}\), and \(\mathbf{Q}\), can be estimated efficiently in an iterative Bayesian framework. Briefly speaking, we estimate the parameters \(\boldsymbol\psi\) and \(\mathbf{\Sigma}\) at the maximum point (i.e., the mode) of the marginal posterior density function \(\pi (\boldsymbol\psi, \mathbf{\Sigma} | \lambda; \mathbf{Y})\). In the case of the non-conjugate prior, the mode, \((\widehat{\boldsymbol\psi}^\text{S} (\lambda), \widehat{\mathbf{\Sigma}}^\text{S} (\lambda))\), is expressed as \[\begin{equation} \widehat{\boldsymbol\psi}^\text{S} (\lambda) = \left[\left( (\widehat{\mathbf{\Sigma}}^\text{S})^{-1} \otimes \left( \mathbf{X}^\top \widehat{\mathbf{Q}}^\text{S} \mathbf{X} \right) \right) + \frac{ (N - 1) \lambda}{1-\lambda} \mathbf{I}_J \right]^{-1} \text{vec}\left( \mathbf{X}^\top \widehat{\mathbf{Q}}^\text{S} \mathbf{Y} ( \widehat{\mathbf{\Sigma}}^\text{S})^{-1} \right), \qquad\qquad\qquad (10) \end{equation}\] and
\[\begin{equation} \widehat{\mathbf{\Sigma}}^\text{S} (\lambda) = \frac{1}{m_0 + T + K + 1} \left( \mathbf{L}_0 + \mathbf{Y}^\top \widehat{\mathbf{Q}}^\text{S} \mathbf{Y} - \mathbf{Y}^\top \widehat{\mathbf{Q}}^\text{S} \mathbf{X} \widehat{\mathbf{\Psi}}^\text{S} (\lambda) \right), \end{equation}\] for \(0< \lambda < 1\), where \(\widehat{\mathbf{Q}}^\text{S} = \text{diag}(\hat{q}_{p+1}, \ldots, \hat{q}_T)\) is obtained in an iterative manner. The diagonal entries of \(\widehat{\mathbf{Q}}^\text{S}\) are expressed as \[\begin{equation} \hat{q}_t = h \left( \left\| (\widehat{\mathbf{\Sigma}}^\text{S})^{-1/2} \hat{\boldsymbol\epsilon}_t \right\|^2 \right), \qquad t = p+1, \ldots, T, \end{equation}\] where \(h(x)\) is defined depending on the noise distribution and \(\hat{\boldsymbol\epsilon}_t\) is the residual given by \(\hat{\boldsymbol\epsilon}_t = \mathbf{y}_t - \widehat{\mathbf{\Psi}}^{\text{S}\top} \mathbf{x}_t\) (Lee, Choi, and Kim 2016).
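As a concrete illustration of Eq. (10), once \(\widehat{\mathbf{\Sigma}}^\text{S}\) and \(\widehat{\mathbf{Q}}^\text{S}\) are fixed, the posterior-mode coefficients are obtained from a single linear solve. The following sketch, written in Python with NumPy on synthetic data, is for illustration only and is not the package's internal implementation:

```python
import numpy as np

# One update step of Eq. (10) for the non-conjugate prior, with Sigma-hat
# and Q-hat held fixed. All quantities below are synthetic placeholders.
rng = np.random.default_rng(0)
N, K, M = 50, 2, 3               # observations, endogenous vars, regressors per equation
X = rng.normal(size=(N, M))      # design matrix (lagged y's and deterministic terms)
Y = rng.normal(size=(N, K))
Sigma = np.eye(K)                # current estimate of the noise scale matrix
Q = np.eye(N)                    # current weight matrix diag(q_t)
lam = 0.5                        # shrinkage parameter, 0 < lam < 1

Sinv = np.linalg.inv(Sigma)
J = K * M                        # total number of coefficients
# Left-hand matrix: (Sigma^{-1} kron X'QX) + (N-1) lam/(1-lam) I_J
A = np.kron(Sinv, X.T @ Q @ X) + (N - 1) * lam / (1 - lam) * np.eye(J)
# Right-hand side: vec(X'QY Sigma^{-1}); vec() stacks columns (Fortran order)
b = (X.T @ Q @ Y @ Sinv).flatten(order="F")
psi = np.linalg.solve(A, b)      # vec of the shrunken coefficient matrix Psi
Psi = psi.reshape((M, K), order="F")
```

The identity \((\mathbf{B}^\top \otimes \mathbf{A})\,\text{vec}(\mathbf{X}) = \text{vec}(\mathbf{A}\mathbf{X}\mathbf{B})\) with column-stacking vec is what makes the Kronecker formulation above match the matrix form of the estimator.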

The shrinkage parameter \(\lambda\) is determined by an internal nonparametric method called parameterized cross validation (PCV). This algorithm can be considered a modified \(K\)-fold cross validation designed for estimating the shrinkage parameter of VAR models; see Lee, Choi, and Kim (2016) for a detailed explanation.

In addition, the semiparametric Bayesian shrinkage method adopts the idea of shrinking both the correlations and the variances from the NS method. As a result, there are two shrinkage parameters, \(\lambda\) and \(\lambda_v\): \(0 \leq \lambda \leq 1\) is used for the shrinkage estimation of the VAR coefficient matrix \(\mathbf{\Psi}\) and the noise covariance matrix \(\mathbf{\Sigma}\), while \(0 \leq \lambda_v \leq 1\) is used for the shrinkage estimation of the variances of the variables \(y_{tj}, j = 1, \ldots, K\).

In the package VARshrink, the function VARshrink(method = "sbayes", ...) runs the semiparametric Bayesian shrinkage method. Its main input arguments, illustrated in the example below, include the shrinkage parameters lambda and lambda_var, the prior type prior_type ("CJ" or "NCJ"), the number of folds num_folds, and the hyperparameter m0.

For example, we can run the semiparametric shrinkage method as follows.

resu_estim$`Semi Bayes (fixed dof)` <-
  VARshrink(Y, p = 1, type = "const", method = "sbayes", dof = 6,
            lambda = NULL, lambda_var = NULL, prior_type = "NCJ",
            num_folds = 5, m0 = ncol(Y))
resu_estim$`Semi Bayes (fixed dof)`
#> 
#> VAR Shrinkage Estimation Results:
#> ================================= 
#> 
#> Estimated coefficients for equation y1: 
#> ======================================= 
#> Call:
#> y1 = y1.l1 + y2.l1 + const 
#> 
#>      y1.l1      y2.l1      const 
#> 0.52361471 0.05712005 0.07157308 
#> 
#> 
#> Estimated coefficients for equation y2: 
#> ======================================= 
#> Call:
#> y2 = y1.l1 + y2.l1 + const 
#> 
#>      y1.l1      y2.l1      const 
#> 0.09327854 0.36697792 0.97005425 
#> 
#> 
#> Sigma for noise:
#>               y1            y2
#> y1  0.0065088952 -0.0003156327
#> y2 -0.0003156327  0.0106188219
#> dof for noise: 6 (estimated: FALSE) 
#> lambda: 0.00318699518988467 (estimated: TRUE) 
#> lambda_var: 0.940206098485031 (estimated: TRUE)
summary(resu_estim$`Semi Bayes (fixed dof)`)
#> 
#> VAR Shrinkage Estimation Results:
#> =================================== 
#> Endogenous variables: y1, y2 
#> Deterministic variables: const 
#> Sample size: 99 
#> Log Likelihood: 127.887 
#> Roots of the characteristic polynomial:
#> 0.5524 0.3382
#> Call:
#> VARshrink(y = Y, p = 1, type = "const", method = "sbayes", lambda = NULL, 
#>     lambda_var = NULL, dof = 6, prior_type = "NCJ", num_folds = 5, 
#>     m0 = ncol(Y))
#> 
#> 
#> Estimation results for equation y1: 
#> =================================== 
#> y1 = y1.l1 + y2.l1 + const 
#> 
#>       Estimate Std. Error t value Pr(>|t|)    
#> y1.l1  0.52361    0.08491   6.167 1.65e-08 ***
#> y2.l1  0.05712    0.09226   0.619    0.537    
#> const  0.07157    0.13375   0.535    0.594    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> 
#> Residual standard error: 0.09877 on 96.0093 degrees of freedom
#> Multiple R-Squared: 0.2713,  Adjusted R-squared: 0.2561 
#> F-statistic: 17.95 on 1.990699 and 96.0093 DF,  p-value: 2.492e-07 
#> 
#> 
#> Estimation results for equation y2: 
#> =================================== 
#> y2 = y1.l1 + y2.l1 + const 
#> 
#>       Estimate Std. Error t value Pr(>|t|)    
#> y1.l1  0.09328    0.13092   0.713   0.4779    
#> y2.l1  0.36698    0.14224   2.580   0.0114 *  
#> const  0.97005    0.20622   4.704 8.55e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> 
#> Residual standard error: 0.1523 on 96.0093 degrees of freedom
#> Multiple R-Squared: 0.06678, Adjusted R-squared: 0.04743 
#> F-statistic: 3.451 on 1.990699 and 96.0093 DF,  p-value: 0.03589 
#> 
#> 
#> 
#> Scale matrix, Sigma, of multivariate t distribution for noise:
#>            y1         y2
#> y1  0.0065089 -0.0003156
#> y2 -0.0003156  0.0106188
#> 
#> Degrees of freedom of multivariate t distribution for noise:
#> [1] 6
#> 
#> Correlation matrix of Sigma:
#>          y1       y2
#> y1  1.00000 -0.03797
#> y2 -0.03797  1.00000

We can also let the package choose the degrees-of-freedom parameter \(\nu\) automatically by setting dof = NULL. In this case, the package simply uses a \(K\)-fold cross validation to find an optimal value of \(\nu\).

resu_estim$`Semi Bayes (estim dof)` <-
  VARshrink(Y, p = 1, type = "const", method = "sbayes", dof = NULL,
            lambda = NULL, lambda_var = NULL, prior_type = "NCJ",
            num_folds = 5, m0 = ncol(Y))
resu_estim$`Semi Bayes (estim dof)`
#> 
#> VAR Shrinkage Estimation Results:
#> ================================= 
#> 
#> Estimated coefficients for equation y1: 
#> ======================================= 
#> Call:
#> y1 = y1.l1 + y2.l1 + const 
#> 
#>       y1.l1       y2.l1       const 
#>  0.50699465 -0.05884566  0.20997630 
#> 
#> 
#> Estimated coefficients for equation y2: 
#> ======================================= 
#> Call:
#> y2 = y1.l1 + y2.l1 + const 
#> 
#>      y1.l1      y2.l1      const 
#> 0.07138759 0.37448706 0.97355673 
#> 
#> 
#> Sigma for noise:
#>               y1            y2
#> y1  5.001873e-03 -9.978788e-06
#> y2 -9.978788e-06  6.991886e-03
#> dof for noise: 0.5 (estimated: TRUE) 
#> lambda: 0.00145763877257461 (estimated: TRUE) 
#> lambda_var: 0.940206098485031 (estimated: TRUE)

3.5 K-fold Cross Validation for Semiparametric Shrinkage

The package VARshrink includes an implementation of the K-fold cross validation (CV) method for selecting shrinkage parameters. In the current version of the package, the K-fold CV method can select the \(\lambda\) and \(\lambda_v\) values for the semiparametric Bayesian shrinkage estimator described in Section 3.4. Note that the semiparametric shrinkage method in the previous section selects a \(\lambda\) value by using the PCV method and selects a \(\lambda_v\) value by a Stein-type nonparametric shrinkage method.

The K-fold CV method can be run as follows. The arguments to VARshrink() are the same as those for the semiparametric Bayesian shrinkage method, except for the argument method = "kcv".
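The idea behind selecting a shrinkage parameter by K-fold CV can be sketched generically as follows. This Python/NumPy example uses a plain ridge regression for illustration only; it is not the package's implementation, which applies the same principle to the semiparametric Bayesian estimator.

```python
import numpy as np

def kfold_cv_lambda(X, y, lambdas, K=5):
    """Pick the penalty with the smallest K-fold cross-validated squared error."""
    n = X.shape[0]
    folds = np.array_split(np.arange(n), K)
    cv_err = []
    for l in lambdas:
        err = 0.0
        for k in range(K):
            test = folds[k]
            train = np.setdiff1d(np.arange(n), test)
            Xtr, ytr = X[train], y[train]
            # Ridge estimate on the training folds
            beta = np.linalg.solve(Xtr.T @ Xtr + l * np.eye(X.shape[1]),
                                   Xtr.T @ ytr)
            # Accumulate prediction error on the held-out fold
            err += np.sum((y[test] - X[test] @ beta) ** 2)
        cv_err.append(err)
    return lambdas[int(np.argmin(cv_err))]

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=100)
best = kfold_cv_lambda(X, y, [0.01, 0.1, 1.0, 10.0])
```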

resu_estim$`K-fold CV (fixed dof)` <-
  VARshrink(Y, p = 1, type = "const", method = "kcv", dof = 6,
            lambda = NULL, lambda_var = NULL, prior_type = "NCJ",
            num_folds = 5, m0 = ncol(Y))
resu_estim$`K-fold CV (fixed dof)`
#> 
#> VAR Shrinkage Estimation Results:
#> ================================= 
#> 
#> Estimated coefficients for equation y1: 
#> ======================================= 
#> Call:
#> y1 = y1.l1 + y2.l1 + const 
#> 
#>      y1.l1      y2.l1      const 
#> 0.62604182 0.05632303 0.07184976 
#> 
#> 
#> Estimated coefficients for equation y2: 
#> ======================================= 
#> Call:
#> y2 = y1.l1 + y2.l1 + const 
#> 
#>      y1.l1      y2.l1      const 
#> 0.09004368 0.29111218 0.96959392 
#> 
#> 
#> Sigma for noise:
#>               y1            y2
#> y1  0.0075080166 -0.0003445399
#> y2 -0.0003445399  0.0081370205
#> dof for noise: 6 (estimated: FALSE) 
#> lambda: 0.001 (estimated: TRUE) 
#> lambda_var: 0.001 (estimated: TRUE)

The degrees of freedom of the multivariate t-distribution for the noise can be selected automatically by setting the argument dof = NULL as follows.

resu_estim$`K-fold CV (estim dof)` <-
  VARshrink(Y, p = 1, type = "const", method = "kcv", dof = NULL,
            lambda = NULL, lambda_var = NULL, prior_type = "NCJ",
            num_folds = 5, m0 = ncol(Y))

Finally, the function calcSSE_Acoef() computes the sum of squared errors (SSE) between two sets of VAR coefficient matrices, \(\{\mathbf{A}_k^{(1)}\}\) and \(\{\mathbf{A}_k^{(2)}\}\), as \(SSE = \sum_{k=1}^p \sum_{i,j} (\mathbf{A}^{(1)}_{kij} - \mathbf{A}^{(2)}_{kij})^2\). For example, Table 2 shows the SSEs of the estimated VAR coefficients.
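The SSE formula amounts to summing entrywise squared differences over all lags. A minimal sketch, in Python with NumPy for illustration only (the package's calcSSE_Acoef() operates on its own coefficient objects):

```python
import numpy as np

def sse_coef(A1, A2):
    """Sum of squared errors between two lists of coefficient matrices,
    one matrix per lag k = 1, ..., p."""
    return float(sum(np.sum((a1 - a2) ** 2) for a1, a2 in zip(A1, A2)))

# Toy example with p = 1 and K = 2 (synthetic values)
A_true = [np.array([[0.5, 0.1], [0.0, 0.4]])]
A_hat  = [np.array([[0.6, 0.0], [0.1, 0.3]])]
sse = sse_coef(A_true, A_hat)   # four entries, each off by 0.1 -> SSE = 0.04
```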

resu_sse <- data.frame(SSE = sapply(resu_estim,
  function(x) calcSSE_Acoef(Acoef_sh(x), myCoef$A)))
Table 2. Sum of squared errors of VAR coefficients estimated by the shrinkage methods.

                            SSE
Ridge regression          0.063
Nonparametric shrinkage   0.080
Full Bayes (fixed dof)    0.068
Full Bayes (estim dof)    0.067
Semi Bayes (fixed dof)    0.030
Semi Bayes (estim dof)    0.024
K-fold CV (fixed dof)     0.071
K-fold CV (estim dof)     0.062

4 Numerical Experiments

In this section, we apply the shrinkage estimation methods in the package VARshrink to a benchmark data set and demonstrate the use of information criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) for comparing VAR models. In this case, the effective number of parameters of a shrinkage estimate has to be computed based on the shrinkage intensity parameter value.
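One common way to measure the effective number of parameters of a shrinkage estimator is the trace of the hat matrix, which for a ridge-type penalty decreases from the full parameter count toward zero as the penalty grows. The sketch below, in Python with NumPy, illustrates this general idea only; it is not necessarily the exact formula used inside VARshrink.

```python
import numpy as np

def effective_df(X, l):
    """Effective degrees of freedom of ridge regression:
    trace of H(l) = X (X'X + l*I)^{-1} X'."""
    XtX = X.T @ X
    H = X @ np.linalg.solve(XtX + l * np.eye(X.shape[1]), X.T)
    return float(np.trace(H))

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
df0 = effective_df(X, 1e-8)    # nearly unpenalized: close to 5 parameters
df1 = effective_df(X, 100.0)   # heavily penalized: fewer effective parameters
```

A smaller effective number of parameters then enters the penalty terms of the AIC and BIC in place of the raw parameter count.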

4.1 Benchmark Data

The Canada data set is a benchmark macroeconomic data set included in the package vars. It contains four time series variables representing employment (e), labor productivity (prod), real wage (rw), and the unemployment rate (U), with \(84\) observations. We differenced the data to remove trends, yielding \(T = 83\). Figure 1 shows the differenced series.

data(Canada, package = "vars")
Y = diff(Canada)
plot(Y, cex.lab = 1.3)
Figure 1. The benchmark data set obtained by differencing the Canada data.


The shrinkage methods are run with the option type = "const" since the mean of the data has not been removed. We set the lag order \(p\) to p = 1, 2, 3 to compare several VAR models. For the semiparametric Bayesian method (method = "sbayes"), the degrees of freedom \(\nu\) were selected automatically. In addition, we ran the K-fold CV method (method = "kcv") for choosing the \(\lambda\) and \(\lambda_v\) values of the semiparametric Bayesian shrinkage estimator.

set.seed(1000)
resu_model <- array(NA, dim = c(5, 2, 3),
  dimnames = list(c("Ridge regression", "Nonparametric shrinkage",
                    "Full Bayes", "Semi Bayes", "K-fold CV"),
                  c("AIC", "BIC"), c("p=1", "p=2", "p=3")))
for (p in 1:3) {
  EstimRidge <- VARshrink(Y, p = p, type = "const", method = "ridge")
  resu_model["Ridge regression", , p] <- c(AIC(EstimRidge), BIC(EstimRidge))

  EstimNS <- VARshrink(Y, p = p, type = "const", method = "ns")
  resu_model["Nonparametric shrinkage", , p] <-
    c(AIC(EstimNS), BIC(EstimNS))

  EstimFB <- VARshrink(Y, p = p, type = "const", method = "fbayes", dof = NULL)
  resu_model["Full Bayes", , p] <- c(AIC(EstimFB), BIC(EstimFB))

  EstimSB <- VARshrink(Y, p = p, type = "const", method = "sbayes",
                       dof = NULL, prior_type = "NCJ")
  resu_model["Semi Bayes", , p] <- c(AIC(EstimSB), BIC(EstimSB))

  EstimKCV <- VARshrink(Y, p = p, type = "const", method = "kcv",
                          dof = NULL, prior_type = "NCJ")
  resu_model["K-fold CV", , p] <- c(AIC(EstimKCV), BIC(EstimKCV))
}

We can compare the models by computing their AIC and BIC values. The results in Table 3 indicate that the NS method attained the smallest information criterion values, with the AIC minimized at \(p=3\) and the BIC minimized at \(p=2\).

Table 3. Information criteria (AIC, BIC) for model comparison.

                         AIC.p=1  BIC.p=1  AIC.p=2  BIC.p=2  AIC.p=3  BIC.p=3
Ridge regression           465.8    504.6    442.9    509.3    445.3    525.9
Nonparametric shrinkage    462.5    496.6    434.6    489.0    430.5    502.9
Full Bayes                 458.3    500.4    447.1    514.7    450.6    532.2
Semi Bayes                 502.6    549.6    483.8    567.3    496.2    607.8
K-fold CV                  526.9    574.3    477.2    526.2    469.6    540.0

The parameters estimated by the NS method with \(p=2\) can be analyzed further using the methods and functions in Table 1. For example, we can perform time series forecasting as in Figure 2.

plot(predict(VARshrink(Y, p = 2, type = "const", method = "ns")), names = "U")
#> Warning in VARshrink(Y, p = 2, type = "const", method = "ns"): 'ns' method
#> does not allow type='const'.. changed to 'none'.

Figure 2. A 10-step-ahead forecast of the time series by the VAR model estimated by the nonparametric shrinkage method. The differenced Canada data were modeled by a VAR(2) model selected at the minimum BIC.


5 Conclusions

We wrote an R software package, VARshrink, for shrinkage estimation of VAR model parameters. The shrinkage methods included in this package are the multivariate ridge regression (Hoerl and Kennard 1970; Golub, Heath, and Wahba 1979), a nonparametric shrinkage method (Opgen-Rhein and Strimmer 2007b), a full Bayesian shrinkage method (Ni and Sun 2005), and a semiparametric Bayesian shrinkage method (Lee, Choi, and Kim 2016). An advantage of this package is the integration of nonparametric, parametric, and semiparametric methods in one framework via the common interface function VARshrink(), which makes it easy and convenient to use various types of shrinkage estimation methods. Moreover, we note that the shrinkage estimation methods implemented in the current version have not been widely available in R packages in the context of VAR models.

We demonstrated the use of model selection criteria such as the AIC and BIC on benchmark time series data. We note that the computation of the log-likelihood of an estimated model must reflect the actual distributional assumption of each method. We explained that the multivariate normal distribution is not the only choice for the noise distribution; other distributions such as the multivariate t-distribution can be chosen. Moreover, an effective number of parameters must be calculated for computing the AIC and BIC values: a large shrinkage parameter value leads to a small effective number of parameters. In the package VARshrink, the effective number of parameters is computed automatically based on the selected shrinkage parameter value.

Shrinkage methods are crucial especially for high-dimensional VAR models. Bayesian approaches for VAR models have been developed widely in the literature for various purposes. However, the computational cost of MCMC procedures may be intractably high for high-dimensional VAR models. For this reason, it is important to also provide nonparametric and semiparametric shrinkage estimation methods that yield computationally feasible estimates of model parameters. The R package VARshrink is a first step toward an integrative and general software package for VAR models. Moreover, the package can be extended to include other shrinkage methods, such as Bayesian shrinkage methods using different types of prior distributions.

References

Bańbura, M., D. Giannone, and L. Reichlin. 2010. “Large Bayesian Vector Auto Regressions.” Journal of Applied Econometrics 25 (1): 71–92. https://doi.org/10.1002/jae.1137.

Beltrachini, L., N. von Ellenrieder, and C. H. Muravchik. 2013. “Shrinkage Approach for Spatiotemporal EEG Covariance Matrix Estimation.” IEEE Transactions on Signal Processing 61 (7): 1797–1808. https://doi.org/10.1109/TSP.2013.2238532.

Böhm, H., and R. von Sachs. 2009. “Shrinkage Estimation in the Frequency Domain of Multivariate Time Series.” Journal of Multivariate Analysis 100 (5): 913–35. https://doi.org/10.1016/j.jmva.2008.09.009.

Doan, T., R. Litterman, and C. Sims. 1984. “Forecasting and Conditional Projection Using Realistic Prior Distributions.” Econometric Reviews 3 (1): 1–100. https://doi.org/10.1080/07474938408800053.

Fiecas, M., and H. Ombao. 2011. “The Generalized Shrinkage Estimator for the Analysis of Functional Connectivity of Brain Signals.” The Annals of Applied Statistics 5 (2A): 1102–25. https://doi.org/10.1214/10-AOAS396.

Friedman, J., T. Hastie, R. Tibshirani, N. Simon, B. Narasimhan, and J. Qian. 2018. “Glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models.” R Package Version 2.0-16.

Golub, G. H., M. Heath, and G. Wahba. 1979. “Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter.” Technometrics 21 (2): 215–23. https://doi.org/10.1080/00401706.1979.10489751.

Hamilton, J. D. 1994. Time Series Analysis. Princeton: Princeton University Press.

Hoerl, A. E., and R. W. Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12 (1): 55–67. https://doi.org/10.1080/00401706.1970.10488634.

Hyndman, R., G. Athanasopoulos, C. Bergmeir, G. Caceres, L. Chhay, M. O’Hara-Wild, F. Petropoulos, et al. 2018. “forecast: Forecasting Functions for Time Series and Linear Models.” R Package Version 8.4.

Koop, G., and D. Korobilis. 2010. “Bayesian Multivariate Time Series Methods for Empirical Macroeconomics.” Foundations and Trends in Econometrics 3 (4): 267–358. https://doi.org/10.1561/0800000013.

Krueger, F. 2015. “Bvarsv: Bayesian Analysis of a Vector Autoregressive Model with Stochastic Volatility and Time-Varying Parameters.” R Package Version 1.1.

Ledoit, O., and M. Wolf. 2004. “A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices.” Journal of Multivariate Analysis 88 (2): 365–411. https://doi.org/10.1016/S0047-259X(03)00096-4.

Lee, N., H. Choi, and S.-H. Kim. 2016. “Bayes Shrinkage Estimation for High-Dimensional VAR Models with Scale Mixture of Normal Distributions for Noise.” Computational Statistics & Data Analysis 101: 250–76. https://doi.org/10.1016/j.csda.2016.03.007.

Litterman, R. B. 1986. “Forecasting with Bayesian Vector Autoregressions: Five Years of Experience.” Journal of Business & Economic Statistics 4 (1): 25–38. https://doi.org/10.2307/1391384.

Moritz, S., and E. Cule. 2018. “Ridge: Ridge Regression with Automatic Selection of the Penalty Parameter.” R Package Version 2.3.

Ni, S., and D. Sun. 2005. “Bayesian Estimates for Vector Autoregressive Models.” Journal of Business & Economic Statistics 23 (1): 105–17. https://doi.org/10.1198/073500104000000622.

Opgen-Rhein, R., and K. Strimmer. 2007a. “Accurate Ranking of Differentially Expressed Genes by a Distribution-Free Shrinkage Approach.” Statistical Applications in Genetics and Molecular Biology 6 (1): 9. https://doi.org/10.2202/1544-6115.1252.

———. 2007b. “Learning Causal Networks from Systems Biology Time Course Data: An Effective Model Selection Procedure for the Vector Autoregressive Process.” BMC Bioinformatics 8 (2): S3. https://doi.org/10.1186/1471-2105-8-S2-S3.

Pfaff, B., and M. Stigler. 2018. “vars: VAR Modelling.” R Package Version 1.5-3.

Primiceri, G. E. 2005. “Time Varying Structural Vector Autoregressions and Monetary Policy.” The Review of Economic Studies 72 (3): 821–52. https://doi.org/10.1111/j.1467-937X.2005.00353.x.

Ripley, B., B. Venables, D. M. Bates, K. Hornik, A. Gebhardt, and D. Firth. 2018. “MASS: Support Functions and Datasets for Venables and Ripley’s Mass.” R Package Version 7.3-51.1.

Schäfer, J., R. Opgen-Rhein, V. Zuber, M. Ahdesmäki, A. P. D. Silva, and K. Strimmer. 2017. “Corpcor: Efficient Estimation of Covariance and (Partial) Correlation.” R Package Version 1.6.9.

Schäfer, J., and K. Strimmer. 2005. “A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics.” Statistical Applications in Genetics and Molecular Biology 4 (1): 32. https://doi.org/10.2202/1544-6115.1175.

Sun, D., and S. Ni. 2004. “Bayesian Analysis of Vector-Autoregressive Models with Noninformative Priors.” Journal of Statistical Planning and Inference 121 (2): 291–309. https://doi.org/10.1016/S0378-3758(03)00116-2.

Tsay, R. S. 2005. Analysis of Financial Time Series. 2nd ed. Wiley Series in Probability and Statistics. Hoboken, NJ: John Wiley & Sons.

Tsay, R. S., and D. Wood. 2018. “MTS: All-Purpose Toolkit for Analyzing Multivariate Time Series (Mts) and Estimating Multivariate Volatility Models.” R Package Version 1.0.

Zellner, A. 1986. “Bayesian Estimation and Prediction Using Asymmetric Loss Functions.” Journal of the American Statistical Association 81 (394): 446–51. https://doi.org/10.1080/01621459.1986.10478289.