demo-truh

This vignette provides a quick demo of the truh package. The example considered here is taken from Figure 3 of the paper: Trambak Banerjee, Bhaswar B. Bhattacharya, and Gourab Mukherjee, Ann. Appl. Stat. 14(4): 1777-1805 (December 2020) <DOI: 10.1214/20-AOAS1362>.

We consider a nonparametric two-sample testing problem in which the \(d\)-dimensional baseline (or uninfected) observations \(\boldsymbol{U}=(U_1,\ldots,U_n)\) are i.i.d. with cdf \(F_0\) and the \(d\)-dimensional treated (or infected) observations \(\boldsymbol{V}=(V_1,\ldots,V_m)\) are i.i.d. with cdf \(G\). We assume that the heterogeneity in the baseline population is reflected by \(K\) different subgroups, each having a unimodal distribution, with distinct modes, cdfs \(F_1,\ldots,F_K\), and mixing proportions \(w_1,\ldots,w_K\) such that \[F_0=\sum_{a=1}^{K}w_aF_a~\text{where}~w_a\in(0,1)~\text{and}~\sum_{a=1}^{K}w_a=1. \]

The goal is to test the following composite hypothesis: \[H_0:G\in\mathcal{F}(F_0)~\text{versus}~H_1:G\notin\mathcal{F}(F_0), \] where \(\mathcal{F}(F_0)\) is the convex hull of \(F_1,\ldots,F_K\). We take \(d=2,n=2000,m=500\) and sample \(U_1,\ldots,U_n\) from \(F_0\) where \[F_0=0.3N(\boldsymbol{0},\boldsymbol{I}_2)+0.3N(\boldsymbol{\mu}_1,\boldsymbol{I}_2)+0.4N(\boldsymbol{\mu}_2,\boldsymbol{I}_2), \] with \(\boldsymbol{\mu}_1=(0,-4)\) and \(\boldsymbol{\mu}_2=(4,-2)\).
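
For reference, the convex hull appearing in \(H_0\) can be written out explicitly. This is the standard definition restated in the notation above, with \(\lambda_1,\ldots,\lambda_K\) introduced here only as illustrative weights: \[\mathcal{F}(F_0)=\Big\{\sum_{a=1}^{K}\lambda_aF_a~:~\lambda_a\ge 0~\text{and}~\sum_{a=1}^{K}\lambda_a=1\Big\}. \] In particular, \(F_0\) itself (take \(\lambda_a=w_a\)) and each component \(F_a\) belong to \(\mathcal{F}(F_0)\), so the null hypothesis allows the treated sample to reweight the baseline subgroups but not to introduce new ones.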

n = 2000
d = 2

#Sampling the baseline (uninfected)
set.seed(1)
p<-runif(n,0,1)
set.seed(10)
U<- (p<=0.3)*matrix(rnorm(d*n),n,d)+
  (p>0.3 & p<=0.6)*cbind(matrix(rnorm(n),n,1),
                matrix(rnorm(n,-4),n,1))+
  (p>0.6)*cbind(matrix(rnorm(n,4),n,1),
          matrix(rnorm(n,-2),n,1))
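
The masking construction above draws all three components for every row and keeps one of them. An equivalent way to sample from the same mixture, sketched below, draws a component label first and then shifts standard normal noise by that component's mean. The helper `rmix_gauss` is hypothetical (it is not part of truh), and because it consumes random numbers differently it will not reproduce the exact `U` above, even with the same seed.

```r
# Hypothetical helper (not part of truh): draw n points from a mixture of
# identity-covariance Gaussians with weights w and component means in the rows of mu.
rmix_gauss <- function(n, w, mu) {
  stopifnot(length(w) == nrow(mu), abs(sum(w) - 1) < 1e-8)
  # Component label for each observation
  z <- sample(seq_along(w), size = n, replace = TRUE, prob = w)
  # Standard normal noise shifted by the mean of the chosen component
  x <- mu[z, , drop = FALSE] + matrix(rnorm(n * ncol(mu)), n, ncol(mu))
  list(x = x, z = z)
}

# Component means of F_0: 0, mu_1 and mu_2
mu <- rbind(c(0, 0), c(0, -4), c(4, -2))
set.seed(1)  # arbitrary seed for this sketch
base <- rmix_gauss(2000, w = c(0.3, 0.3, 0.4), mu = mu)

# Sanity checks: empirical mixing proportions and per-component sample means
prop <- as.vector(table(base$z)) / 2000
cmeans <- rowsum(base$x, base$z) / as.vector(table(base$z))
round(prop, 2)
round(cmeans, 1)
```

The proportions should be close to (0.3, 0.3, 0.4) and the rows of `cmeans` close to the three component means, up to sampling noise.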

To sample \(V_1,\ldots,V_m\) we consider three settings for \(G\).

Setting 1: \(G=N(\boldsymbol{\mu}_2,\boldsymbol{I}_2)\), which is the third component cdf of \(F_0\). In this setting \(G\in\mathcal{F}(F_0)\), so the null hypothesis \(H_0\) is true.

# Sampling the treated (infected)
m = 500
set.seed(50)
V1<-cbind(matrix(rnorm(m,4),m,1),
          matrix(rnorm(m,-2),m,1))

#Scatter plot of the data
grp = c(rep('Baseline',n),
                    rep('Treated',m))
plot(c(U[,1],V1[,1]), c(U[,2],V1[,2]),
     pch = 19,
     col = factor(grp),
     xlab = 'X_1',
     ylab = 'X_2')

# Legend
legend("topright",
       legend = levels(factor(grp)),
       pch = 19,
       col = factor(levels(factor(grp))))

Setting 2: \(G=0.5N(\boldsymbol{\mu}_3,\boldsymbol{I}_2)+0.5N(\boldsymbol{\mu}_4,\boldsymbol{I}_2)\) where \(\boldsymbol{\mu}_3=0.25\boldsymbol{\mu}_1+0.5\boldsymbol{\mu}_2=(2,-2)\) and \(\boldsymbol{\mu}_4=-(9/8)\boldsymbol{\mu}_1+(3/4)\boldsymbol{\mu}_2=(3,3)\). In this case \(G\notin\mathcal{F}(F_0)\).

# Sampling the treated (infected)
m = 500
set.seed(20)
q<-runif(m,0,1)
set.seed(50)
V2<-(q<=0.5)*cbind(matrix(rnorm(m,2),m,1),
          matrix(rnorm(m,-2),m,1))+
  (q>0.5)*cbind(matrix(rnorm(m,3),m,1),
          matrix(rnorm(m,3),m,1))

#Scatter plot of the data
plot(c(U[,1],V2[,1]), c(U[,2],V2[,2]),
     pch = 19,
     col = factor(grp),
     xlab = 'X_1',
     ylab = 'X_2')

# Legend
legend("topright",
       legend = levels(factor(grp)),
       pch = 19,
       col = factor(levels(factor(grp))))

Setting 3: \(G=0.8N(\boldsymbol{0},\boldsymbol{I}_2)+0.1N(\boldsymbol{\mu}_1,\boldsymbol{I}_2)+0.1N(\boldsymbol{\mu}_2,\boldsymbol{I}_2)\). This is the most interesting setting as here \(G\in\mathcal{F}(F_0)\) but \(G\neq F_0\) because the mixing weights differ.

# Sampling the treated (infected)
m = 500
set.seed(20)
q<-runif(m,0,1)
set.seed(50)
V3<-(q<=0.8)*matrix(rnorm(d*m),m,d)+
  (q>0.8 & q<=0.9)*cbind(matrix(rnorm(m),m,1),
                matrix(rnorm(m,-4),m,1))+
  (q>0.9)*cbind(matrix(rnorm(m,4),m,1),
          matrix(rnorm(m,-2),m,1))

#Scatter plot of the data
plot(c(U[,1],V3[,1]), c(U[,2],V3[,2]),
     pch = 19,
     col = factor(grp),
     xlab = 'X_1',
     ylab = 'X_2')

# Legend
legend("topright",
       legend = levels(factor(grp)),
       pch = 19,
       col = factor(levels(factor(grp))))
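
To see why Setting 3 is the interesting case, note that a test of plain equality \(G=F_0\) would treat it as an alternative. The sketch below illustrates this with a Kolmogorov-Smirnov test from base R applied to the first coordinate only. This is a swapped-in univariate test used purely for illustration, not part of truh or of the paper's method, and it regenerates data with an arbitrary seed rather than reusing `U` and `V3`.

```r
# Illustration only: first coordinates of a baseline-like and a Setting 3-like
# sample, compared with stats::ks.test (a test of equality of distributions).
set.seed(123)  # arbitrary seed, not the vignette's
n <- 2000
m <- 500

# The first coordinate has mean 0 under components 1 and 2 and mean 4 under
# component 3, so only the weight on component 3 matters here.
u1 <- rnorm(n, mean = ifelse(runif(n) <= 0.6, 0, 4))  # baseline: weight 0.4 on mean 4
v1 <- rnorm(m, mean = ifelse(runif(m) <= 0.9, 0, 4))  # Setting 3: weight 0.1 on mean 4

ks <- ks.test(u1, v1)
ks$p.value < 0.05  # does the equality test reject at an assumed 5% level?
```

An equality test flags the change in mixing weights, whereas the composite null \(H_0:G\in\mathcal{F}(F_0)\) is designed to tolerate it.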

Let us now run the truh testing procedure for these scenarios. Recall that the goal is to test the following composite hypothesis: \[H_0:G\in\mathcal{F}(F_0)~\text{versus}~H_1:G\notin\mathcal{F}(F_0). \]

Setting 1: Here \(G=N(\boldsymbol{\mu}_2,\boldsymbol{I}_2)\) is the third component of \(F_0\), so \(G\in\mathcal{F}(F_0)\) and \(H_0\) is true.

library(truh)
truh.1 = truh(V1,U,B=200)
truh.1$pval

## [1] 0.375

With a p-value of 0.375, truh fails to reject the null hypothesis at any conventional significance level.

Setting 2: Here we know that \(G\notin\mathcal{F}(F_0)\) and so \(H_0\) is false.

library(truh)
truh.2 = truh(V2,U,B=200)
truh.2$pval

## [1] 0

We see that truh rejects the null hypothesis.

Setting 3: Here \(G\in\mathcal{F}(F_0)\) but \(G\neq F_0\). The null hypothesis \(H_0\) is true in this setting.

library(truh)
truh.3 = truh(V3,U,B=200)
truh.3$pval

## [1] 0.205

In this case, truh makes the correct decision and fails to reject \(H_0\).
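
The three decisions can be collected in one place. The sketch below hard-codes the p-values reported in this vignette rather than rerunning `truh`, and it assumes a significance level of 0.05, which the vignette itself does not state.

```r
# p-values copied from the outputs shown above (truh.1$pval, truh.2$pval, truh.3$pval)
pvals <- c(setting1 = 0.375, setting2 = 0, setting3 = 0.205)

alpha <- 0.05  # assumed significance level, not specified in the vignette

# Reject H_0 when the p-value falls below alpha
res <- data.frame(pval = pvals, reject_H0 = pvals < alpha)
res
```

Under this assumed level, only Setting 2 leads to rejection, which matches the ground truth in all three settings.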