demo-truh

This vignette provides a quick demo of the truh package. The example that we consider here is taken from Figure 3 of the paper: Trambak Banerjee, Bhaswar B. Bhattacharya, Gourab Mukherjee Ann. Appl. Stat. 14(4): 1777-1805 (December 2020) <DOI: 10.1214/20-AOAS1362>.

We will consider a nonparametric two sample testing problem where the \(d\) dimensional baseline (or uninfected) sample \(\boldsymbol{U}=(U_1,\ldots,U_n)\) are i.i.d with cdf \(F_0\) and the \(d\) dimensional treated (infected) sample \(\boldsymbol{V}=V_1,\ldots,V_m\) are i.i.d with cdf \(G\). Here, we assume that the heterogeneity in the baseline population is reflected by \(K\) different subgroups, each having unimodal distributions with distinct modes and cdfs \(F_1,\ldots,F_K\), and mixing proportions \(w_1,\ldots,w_K\) such that \[F_0=\sum_{a=1}^{K}w_aF_a~\text{where}~w_a\in(0,1)~\text{and}~\sum_{a=1}^{K}w_a=1. \]

The goal is to test the following composite hypothesis: \[H_0:G\in\mathcal{F}(F_0)~\text{versus}~H_1:G\notin\mathcal{F}(F_0), \] where \(\mathcal{F}(F_0)\) is the convex hull of \(F_1,\ldots,F_K\). We take \(d=2,n=2000,m=500\) and sample \(U_1,\ldots,U_n\) from \(F_0\) where \[F_0=0.3N(\boldsymbol{0},\boldsymbol{I}_2)+0.3N(\boldsymbol{\mu}_1,\boldsymbol{I}_2)+0.4N(\boldsymbol{\mu}_2,\boldsymbol{I}_2), \] with \(\boldsymbol{\mu}_1=(0,-4)\) and \(\boldsymbol{\mu}_2=(4,-2)\).

n = 2000
d = 2

#Sampling the baseline (uninfected)
set.seed(1)
p<-runif(n,0,1)
set.seed(10)
U<- (p<=0.3)*matrix(rnorm(d*n),n,d)+
  (p>0.3 & p<=0.6)*cbind(matrix(rnorm(n),n,1),
                matrix(rnorm(n,-4),n,1))+
  (p>0.6)*cbind(matrix(rnorm(n,4),n,1),
          matrix(rnorm(n,-2),n,1))

To sample \(V_1,\ldots,V_m\) we consider three settings for \(G\).

# Sampling the treated (infected)
m = 500
set.seed(50)
V1<-cbind(matrix(rnorm(m,4),m,1),
          matrix(rnorm(m,-2),m,1))

#Scatter plot of the data
grp = c(rep('Baseline',n),
                    rep('Treated',m))
plot(c(U[,1],V1[,1]), c(U[,2],V1[,2]),
     pch = 19,
     col = factor(grp),
     xlab = 'X_1',
     ylab = 'X_2')

# Legend
legend("topright",
       legend = levels(factor(grp)),
       pch = 19,
       col = factor(levels(factor(grp))))

# Sampling the treated (infected)
m = 500
set.seed(20)
q<-runif(m,0,1)
set.seed(50)
V2<-(q<=0.5)*cbind(matrix(rnorm(m,2),m,1),
          matrix(rnorm(m,-2),m,1))+
  (q>0.5)*cbind(matrix(rnorm(m,3),m,1),
          matrix(rnorm(m,3),m,1))

#Scatter plot of the data
plot(c(U[,1],V2[,1]), c(U[,2],V2[,2]),
     pch = 19,
     col = factor(grp),
     xlab = 'X_1',
     ylab = 'X_2')

# Legend
legend("topright",
       legend = levels(factor(grp)),
       pch = 19,
       col = factor(levels(factor(grp))))

# Sampling the treated (infected)
m = 500
set.seed(20)
q<-runif(m,0,1)
set.seed(50)
V3<-(q<=0.8)*matrix(rnorm(d*m),m,d)+
  (q>0.8 & q<=0.9)*cbind(matrix(rnorm(m),m,1),
                matrix(rnorm(m,-4),m,1))+
  (q>0.9)*cbind(matrix(rnorm(m,4),m,1),
          matrix(rnorm(m,-2),m,1))

#Scatter plot of the data
plot(c(U[,1],V3[,1]), c(U[,2],V3[,2]),
     pch = 19,
     col = factor(grp),
     xlab = 'X_1',
     ylab = 'X_2')

# Legend
legend("topright",
       legend = levels(factor(grp)),
       pch = 19,
       col = factor(levels(factor(grp))))

Let us now execute the truh testing procedure for these scenarios. Recall that the goal is to test the following composite hypothesis: \[H_0:G\in\mathcal{F}(F_0)~\text{versus}~H_1:G\notin\mathcal{F}(F_0). \] - Setting 1: Here we know that \(G=F_0\) and so \(H_0\) is true.

library(truh)
truh.1 = truh(V1,U,B=200)
truh.1$pval
## [1] 0.375

So, truh fails to reject the null hypothesis.

library(truh)
truh.2 = truh(V2,U,B=200)
truh.2$pval
## [1] 0

We see that truh rejects the null hypothesis.

library(truh)
truh.3 = truh(V3,U,B=200)
truh.3$pval
## [1] 0.205

In this case, truh makes the correct decision and fails to reject \(H_0\).