`library(R2sample)`

This package brings together a number of routines for the two sample problem for univariate data. There are two data sets x and y and we want to test whether they were generated by the same probability distribution.

The highlights of this package are:

- it runs 12 different tests simultaneously.

- it works for continuous and for discrete data.

- it uses a permutation test to find p values (except for chi square
tests).

- it uses Rcpp and parallel programming to be very fast.

We generate two data sets of size 100 and 120 respectively from standard normal distributions and run the test:

`set.seed(123)`

**Note** all examples run with arguments *B=500,
maxProcessor = 2* in order to pass *devtools::check()*

```
=rnorm(10)
x=rnorm(12)
ytwosample_test(x, y, B=500, maxProcessor = 2, doMethod="all")
#> $statistics
#> chi large chi small t test KS Kuiper CvM
#> 0.892222222 0.892222222 0.007923895 0.200000000 0.266666667 0.053030303
#> AD LR ZA ZK ZC Wassp1
#> 0.318400187 0.122382155 -3.177381534 0.307906981 -2.788089138 0.279746638
#>
#> $p.values
#> chi large chi small t test KS Kuiper CvM AD LR
#> 0.9257 0.9257 0.9800 0.9200 0.6540 0.8960 0.9540 0.8560
#> ZA ZK ZC Wassp1
#> 0.9780 0.9820 0.9780 0.9560
```

In this case the the null hypothesis is true, both data sets were generated by a standard normal distribution. And indeed, all 12 tests have p values much larger than (say) 0.05 and therefore correctly fail to reject the null hypothesis.

Here we generate two data sets of size 1000 and 1200 respectively from a binomial distribution with 5 trials and success probabilities of 0.5 and 0.55, respectively.

```
=table(rbinom(1000, 5, 0.5))
x1=table(rbinom(1200, 5, 0.55))
y1rbind(x1, y1)
#> 0 1 2 3 4 5
#> x1 30 156 314 311 163 26
#> y1 23 131 334 405 236 71
twosample_test(x1, y1, vals=0:5, B=500,
maxProcessor = 2, doMethod="all")$p.values
#> chi large chi small t test KS Kuiper CvM AD LR
#> 0.000 0.000 0.000 0.002 0.020 0.002 0.000 0.000
#> ZA ZC Wassp1
#> 0.000 0.000 0.000
```

In this case the the null hypothesis is false, all 12 tests have p values much smaller than (say) 0.05 and therefore correctly reject the null hypothesis.

Notice, it is the presence of the *vals* argument that tells
the *twosample_test* command that the data is discrete. The
vectors x and y are the counts. Note that the lengths of the three
vectors have to be the same.

In the continuous case we have a sample x of size of n and a sample y of size m. In the discussion below both are assumed to be ordered. We denote by z the combined sample. In the discrete case we have a random variable with values v, and x and y are the counts. We denote by k the number of observed values.

We denote the edf’s of the two data sets by \(\widehat{F}\) and \(\widehat{G}\), respectively. Moreover we denote the empirical distribution function of the combined data set by \(\widehat{H}\).

**chi large**

In the case of continuous data, it is binned into *nbins[1]*
equal probability bins, with the default *nbins[1]=100*. In the
case of discrete data the number of classes comes from *vals*,
the values observed.

Then a standard chi square test is run and the p value is found using the usual chi square approximation.

**chi small**

Many previous studies have shown the a chi square test with a large
number of classes often has very poor power. So this test in the case of
continuous data uses *nbins[2]* equal probability bins, with the
default *nbins[2]=10*. In the case of discrete data if the number
of classes exceeds *nbins[2]=10* the classes are combined until
there are *nbins[2]=10* classes.

In either case the p values are found using the usual chi-square approximation. The degrees of freedom are the number of bins. For a literature review of chi square tests see (W. Rolke 2021).

For all other tests the p values are found using a permutation test. They are

**t test**

Classic two-sample test with test statistic

\[\frac{|\bar{x}-\bar{y}|}{s_p\sqrt{1/n+1/m}}\]

In the continuous case we have \(\bar{x}=\frac1n\sum_{i=1}^n x_i\) and in the discrete case it is \(\bar{x}=\frac1n\sum_{i=1}^k x_iv_i\). So in the continuous case the calculation of the test statistic requires loops of length n and m, but in the discrete case the loops have length only k and do not grow with the sample sizes. In the test statistics that follow in the continuous case one often needs loops of length n+m, but again only k in the discrete case. Therefore for large sample sizes the tests for discrete data are much faster, and may indeed be the only ones possible. It is then necessary to bin the data.

Here \(s_p\) is the pooled standard deviation. This test is discussed in every statistics textbook, see for example (Bickel and Doksum 2015).

**KS**Kolmogorov-Smirnov test

This test is based on the quantity \(\max\{|\widehat{F}(x)-\widehat{G}(x)|:x\in \mathbf{R}\}\). In the continuous case (without ties) the function \(\widehat{F}(x)-\widehat{G}(x)\) either takes a jump of size \(1/n\) up at a point in the x data set, a jump of size \(1/m\) down if x is a point in the y data set, or is flat. Therefore the test statistic is

\[\max\{\sum_{i=1}^{j} \vert \frac1n I\{z_i \in x\}-\frac1m I\{z_i \in y\} \vert:j=1,..,n+m\}\]

In the discrete case the jumps have sizes \(x_i/n\) and \(y_j/m\), respectively and the test statistic is

\[\max\{\sum_{i=1}^{j} \vert \frac{x_i}n -\frac{y_j}m \vert:j=1,..,k\}\]

This test was first proposed in (Kolmogorov 1933), (Smirnov 1939) and is one of the most widely used tests today.

There is a close connection between the edf’s and the ranks of the x and y in the combined data set. Using this one can also calculate the KS statistic, and many statistics to follow, based on these ranks. This is used in many software routines, for example the ks.test routine in R. However, when applied to discrete data these formulas can fail badly because of the many ties, and so we will not use them in our routine.

**Kuiper**Kuiper’s test

This test is closely related to Kolmogorov-Smirnov, but it uses the sum of the largest positive and negative differences between the edf’s as a test statistic:

\[T_x=\sum_{i=1}^{j} \vert \frac1n I\{z_i \in x\}-\frac1m I\{z_i \in x\} \vert:j=1,..,n\] \[T_y=\sum_{i=1}^{j} \vert \frac1n I\{z_i \in y\}-\frac1m I\{z_i \in y\} \vert:j=1,..,m\] and the the test statistic is \(T_x-T_y\). The discrete case follows directly. This test was first proposed in (Kuiper 1960).

**CvM**Cramer-vonMises test

This test is based on the integrated squared difference between the edf’s:

\[\frac{nm}{(n+m)^2}\sum_{i=1}^{n+m} \left( \hat{F}(z_i)-\hat{G}(z_i) \right)^2 \] the extension to the two-sample problem of the Cramer-vonMises criterion is discussed in (T. W. Anderson 1962).

**AD**Anderson-Darling test

This test is based on

\[nm\sum_{i=1}^{n+m-1} \frac{\left( \hat{F}(z_i)-\hat{G}(z_i) \right)^2}{i(n-i)} \]

It was first proposed in (Theodore W. Anderson, Darling, et al. 1952).

**LR**Lehmann-Rosenblatt test

Let \(r_i\) and \(s_i\) be the ranks of x and y in the combined sample, then the test statistic is given by

\[\frac1{nm(n+m)}\left[n\sum_{i=1}^n(r_i-1)^2+m\sum_{i=1}^m(s_i-1)^2\right]\] (Lehmann 1951) and (Rosenblatt 1952)

**ZA, ZK and ZC**Zhang’s tests

These tests are based on the paper (Zhang
2006). Note that the calculation of *ZC* is the one
exception from the rule: even in the discrete data case the calculation
of the test statistic does require loops of lengths n and m. The
calculation time will therefore increase with the sample sizes, and for
very large data sets this test might have to be excluded.

**Wassp1**Wasserstein p=1 test

A test based on the Wasserstein p1 metric. It compares the quantiles of the x and y in the combined data set. If n=m the test statistic in the continuous case is very simple:

\[ \frac1n\sum_{i=1}^n |x_i-y_i|\] If \(n\ne m\) it is first necessary to find the respective quantiles of the x’s and y’s in the combined data set. In the discrete case the test statistic is

\[\sum_{i=1}^{k-1} \vert \sum_{j=1}^i\frac{x_j}{n}-\sum_{j=1}^i\frac{y_j}{m} \vert(v_{i+1}-v_i)\]

For a discussion of the Wasserstein distance see (Vaserstein 1969).

The routine has the following arguments:

**x**and**y**are numeric vectors. In the continuous data case they are simply the data, in the discrete case the counts.**vals**a numeric vector of values of the discrete random variable. If missing continuous data is assumed. In the discrete case x, y and vals have to be vectorsd of equal length.**B=5000**number of simulation runs for permutation test. If B=0 only test statistics are found.**nbins=c(100, 10)**number of bins to use for chi-square test. In the continuous data case bins are found equi-probable via the quantiles of the combined data set. In the discrete case nbins[1] is changed to the number of values, so the chi-square test is done on the original data set. Then neighboring values are combined until there are nbins[2] classes.**doParallel=TRUE**Should parallel computing be used?**discretize=FALSE**Should continuous data be binned? This is useful in the case of very large data sets to lower the computation time.**doMethod**a character vector giving the names of the methods to include.

Say for example we want to use only the KS and AD tests:

`set.seed(111)`

**Note** all examples run with arguments *B=500,
maxProcessor = 2* in order to pass *devtools::check()*

```
set.seed(111)
=rnorm(10)
x=rnorm(12)
ytwosample_test(x, y, B=500, maxProcessor = 2, doMethod=c("KS","AD"))
#> $statistics
#> KS AD
#> 0.500000 2.116173
#>
#> $p.values
#> KS AD
#> 0.11 0.09
```

A list with vectors statistics (the test statistics) and p.values.

We have carried out extensive simulation studies to assess the power of the respective methods. Based on these studies the default for the doMethod argument are

- Continuous Data

`= c("chi small", "ZK", "ZC", "Wassp1") doMethod `

- Discrete Data

`= c("chi small", "Kuiper", "ZC", "Wassp1") doMethod `

In all the case studies we carried at least one of these methods was among the best, aka had highest power.

If one wishes to run all the methods one can also use

`="all" domethod`

The tests can also be run via a shiny app. To do so run

`run_shiny()`

The routine *twosample_power* allows the estimation of the
power of the various tests, and *plot_power* draws the
corresponding power graph with the methods sorted by their mean
power.

Note that due to the use of permutation tests the power calculations are fairly slow. Because of the time constraints imposed by CRAN the following routines are not run. A full power study with (say) 25 alternatives and fairly large data sets might take several hours to run.

Say we wish to find the powers of the tests when one data set comes from a standard normal distribution with sample size 100 and the other from a t distribution with 1-10 degrees of freedom and sample size 200.

```
plot_power(twosample_power(
f=function(mu) list(x=rnorm(100), y=rt(200, mu)),
mu=1:10
))
```

The arguments of **twosample_power** are

**f**A function that generates data sets x and y, either continuous or the counts of discrete variables. In the discrete case care needs to be taken that the resulting vectors have the same length as*vals*. The output has to be a list with elements x and y. In the case of discrete data the list also has to include a vector**vals**, the values of the discrete variable.**…**arguments passed to f. The most common case would be a single vector, but two vectors or none is also possible.**alpha=0.05**type I error probability to use in power calculation**B=1000**number of simulation runs**nbins=c(100,10)**Number of bins to use in chi square tests**doParallel=TRUE**Should parallel processing be used?**doMethod**a character vector giving the names of the method to include.

The arguments of **plot_power** are a matrix of powers
created by **twosample_power** and (optional) Smooth=FALSE
if no smoothing is desired and a name of the variable on the x axis.

We wish to find the powers of the tests when the data set x are 100 observations comes from a binomial n=10, p=0.5 and y 120 observations from from a binomial n=10 and a number of different success probabilities p. Note that in either case not all the possible values 0-10 will actually appear in every data set, so we need to take some care in assuring that x, y and vals match every time:

```
plot_power(twosample_power(
function(p) {
=0:10
vals=table(c(vals, rbinom(100, 10, 0.5))) - 1 #Make sure each vector has length 11
x=table(c(vals, rbinom(120, 10, p))) - 1 #and names 1-10
y=vals[x+y>0] # only vals with at least one count
valslist(x=x[as.character(vals)],y=y[as.character(vals)], vals=vals)
},p=seq(0.5, 0.6, length=5)
"p") ),
```

We want to assess the effect of the sample size when one data set comes from a standard normal and the other from a normal with mean 1 and standard deviation 1:

```
plot_power(twosample_power(
f=function(n) list(x=rnorm(n), y=rnorm(n, 1)),
n=10*1:10), "n")
```

Anderson, T. W. 1962. “On the Distribution of the Two_sample
Cramer-von Mises Criterion.” *The Annals of Mathematical
Statistics* 33(3): 1148–59.

Anderson, Theodore W, Donald A Darling, et al. 1952. “Asymptotic
Theory of Certain Goodness of Fit Criteria Based on Stochastic
Processes.” *The Annals of Mathematical Statistics* 23
(2): 193–212.

Bickel, Peter J, and Kjell A Doksum. 2015. *Mathematical Statistics
Vol 1 and 2*. CRC Press.

Kolmogorov, A. N. 1933. “Sulla Determinazione Empirica Di Una
Legge Di Distribuzione.” *Giornale Dell’Instituto Italiano
Degli Attuari* 4: 83–91.

Kuiper, N. H. 1960. “Tests Concerning Random Points on a
Circle.” *Proceedings of the Koninklijke Nederlandse Akademie
van Wetenschappen* 63: 38–47.

Lehmann, E. L. 1951. “Consistency and Unbiasedness of Certain
Nonparametric Tests.” *Ann. MAth. Statist.* 22(1): 165–79.

Rosenblatt, M. 1952. “Limit Theorems Associated with Variants of
the von Mises Statistic.” *Ann. Math. Statist.* 23:
617–23.

Smirnov, N. V. 1939. “Estimate of Deviation Between Empirical
Distribution Functions in Two Independent Samples.” *Bull.
Moscow Univ.* 2: 3–16.

Vaserstein, L. N. 1969. “Markov Processes over Denumerable
Products of Spaces, Describing Large Systems of Automata.”
*Problemy Peredachi Informatsii* 5(3): 64–72.

W. Rolke, C. Gutierrez Gongora. 2021. “A Chi-Square
Goodness-of-Fit Test for Continuous Distributions Against a Known
Alternative.” *Computational Statistics*.

Zhang, J. 2006. “Powerful Two Sample Tests Based on the Likelihood
Ratio.” *Techometrics* 48: 95–103.