Abstract

The univariate Kolmogorov–Smirnov (KS) test is a non–parametric statistical test designed to assess whether a set of data is consistent with a given probability distribution (or, in the two-sample case, whether the two samples come from the same underlying distribution). The versatility of the KS test has made it a cornerstone of statistical analysis, and it is commonly used across the scientific disciplines. However, the test proposed by Kolmogorov and Smirnov does not naturally extend to multidimensional distributions. Here, we present the fasano.franceschini.test package, an R implementation of the 2-D KS two–sample test as defined by Fasano and Franceschini (1). The fasano.franceschini.test package provides three improvements over the current 2-D KS test on the Comprehensive R Archive Network (CRAN): (i) the Fasano and Franceschini test has been shown to run in \(O(n^2)\) time, versus \(O(n^3)\) for the Peacock implementation; (ii) the package implements a procedure for handling ties in the data; and (iii) the package implements a parallelized bootstrapping procedure for improved significance testing. Ultimately, the fasano.franceschini.test package presents a robust statistical test for analyzing random samples defined in two dimensions.

Introduction

The Kolmogorov–Smirnov (KS) test is a non–parametric, univariate statistical test designed to assess whether a set of data is consistent with a given probability distribution (or, in the two-sample case, whether the two samples come from the same underlying distribution). First derived by Kolmogorov and Smirnov in a series of papers (2–8), the one-sample KS test defines the distribution of the quantity \(D_{KS}\), the maximal absolute difference between the empirical cumulative distribution function (CDF) of a set of values and a reference probability distribution. Kolmogorov and Smirnov’s key insight was proving that the distribution of \(D_{KS}\) is independent of the CDFs being tested. Thus, the test can effectively be used to compare any univariate empirical data distribution to any continuous univariate reference distribution. The two-sample KS test can further be used to compare any two univariate empirical data distributions against each other to determine whether they are drawn from the same underlying univariate distribution.

The nonparametric versatility of the univariate KS test has made it a cornerstone of statistical analysis, and it is commonly used across the scientific disciplines (9–14). However, the KS test as proposed by Kolmogorov and Smirnov does not naturally extend to distributions in more than one dimension. Fortunately, a solution to the dimensionality issue was articulated by Peacock (15) and later extended by Fasano and Franceschini (1).

Currently, only the Peacock implementation of the 2-D two-sample KS test is available in R (16), via the peacock2() function in the Peacock.test package, but this has been shown to be markedly slower than the Fasano and Franceschini algorithm (17). A C implementation of the Fasano–Franceschini test is available in (18); however, concerns have been raised about the validity of the test, as the statistic is not strictly distribution-free (19). Furthermore, in the C implementation, statistical testing is based on a fit to Monte Carlo simulation that is only valid for significance levels \(\alpha \lessapprox 0.20\).

Here we present the fasano.franceschini.test package, an R implementation of the 2-D two-sample KS test described by Fasano and Franceschini (1). The fasano.franceschini.test package provides two improvements over the current 2-D KS test available on the Comprehensive R Archive Network (CRAN): (i) the Fasano and Franceschini test has been shown to run in \(O(n^2)\) time, versus \(O(n^3)\) for the Peacock implementation; and (ii) the package implements a bootstrapping procedure for improved significance testing, which mitigates the limitations of the test noted by (19).

Models and software

1-D Kolmogorov–Smirnov Test

The Kolmogorov–Smirnov (KS) test is a non–parametric method for determining whether a sample is consistent with a given probability distribution (20). In one dimension, the Kolmogorov–Smirnov statistic (\(D_{KS}\)) is defined as the maximum absolute difference between the cumulative distribution functions of the data and model (one–sample), or between those of the two data sets (two–sample), as illustrated in Figure 1.

Figure 1 | LEFT: Probability density functions (PDF) of two normal distributions: orange sample 1, \(\mathcal{N}(\mu = 0,\,\sigma^{2} = 1)\); blue sample 2, \(\mathcal{N}(\mu = 5,\,\sigma^{2} = 1)\). RIGHT: Cumulative distribution functions (CDF) of the two PDFs; the black dotted line represents the maximal absolute difference between the CDFs (\(D_{KS}\)).

In the large–sample limit (\(n \geq 80\)), it can be shown (21) that \(D_{KS}\) converges in distribution to

\[\begin{equation} D_{KS} \overset{d}{\rightarrow} \Phi(\lambda) = 2 \sum_{k=1}^{\infty} (-1)^{k-1}e^{-2k^2\lambda^2} \,. \tag{1} \end{equation}\]

In the one-sample case with a sample of size \(n\), the \(p\) value is given by

\[\begin{equation} \mathbb{P}(D > \text{observed}) = \Phi ( D\sqrt{n})\,; \tag{2} \end{equation}\]

in the two-sample case, the \(p\) value is given by

\[\begin{equation} \mathbb{P}(D > \text{observed}) = \Phi \left( D\sqrt{\frac{n_1n_2}{n_1+n_2}} \right)\,, \tag{3} \end{equation}\]

where \(n_1\) and \(n_2\) are the number of observations in the first and second samples, respectively.
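For illustration, the asymptotic \(p\) value of the two-sample test (Eqs 1 and 3) can be computed directly in base R; the helper ksPhi() below is our own illustrative function, not part of any package, and the result should agree closely with the asymptotic \(p\) value reported by stats::ks.test():

#asymptotic tail probability of the KS statistic (Eq 1), truncated at kMax terms
ksPhi <- function(lambda, kMax = 100) {
  k <- seq_len(kMax)
  2 * sum((-1)^(k - 1) * exp(-2 * k^2 * lambda^2))
}

set.seed(123)
x <- rnorm(100)
y <- rnorm(100, mean = 0.5)

#two-sample KS statistic: maximal absolute difference between the eCDFs,
#evaluated over the pooled observations
pooled <- sort(c(x, y))
D <- max(abs(ecdf(x)(pooled) - ecdf(y)(pooled)))

#asymptotic p value from Eq 3, compared with stats::ks.test()
ksPhi(D * sqrt(length(x) * length(y) / (length(x) + length(y))))
ks.test(x, y, exact = FALSE)$p.value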

Higher dimensional variations: Peacock Test (1983) and Fasano–Franceschini Test (1987)

Extending the above to two or higher dimensions is complicated by the fact that a CDF is not uniquely defined in more than one dimension. In 2-D, there are 4 ways (3 independent) of defining the cumulative distribution, since the direction in which we order the \(x\) and \(y\) points is arbitrary (Figure 2); more generally, in \(k\)-dimensional space there are \(2^{k}-1\) independent ways of defining the cumulative distribution function (15).

Figure 2 | Four ways (3 independent) of defining the cumulative distribution for a given point in 2-D. Here, the orange point \((X,Y)\) is chosen as the origin; the density of observations may be integrated as \(\mathbb{P}(x < X, y > Y)\) (A); \(\mathbb{P}(x > X, y < Y)\) (B); \(\mathbb{P}(x < X, y < Y)\) (C); \(\mathbb{P}(x > X, y > Y)\) (D).

(15) solved the higher dimensionality issue by defining the 2-D test statistic as the largest difference between the empirical and theoretical cumulative distributions, after taking all possible ordering combinations into account. Peacock’s test thus computes the total probability (i.e., the fraction of data) in each of the four quadrants around all possible tuples in the data. For example, for \(n\) points in a two-dimensional space, the empirical cumulative distribution function is calculated in the \(4n^2\) quadrants of the plane defined by all pairs \((X_i, Y_j): i,j\in[1,n]\), where \(X_i\) and \(Y_j\) are any observed values of \(x\) and \(y\) (whether or not they are observed as a pair). There are \(n^2\) such pairs, each of which defines four quadrants in the 2-D plane; by ranging over all possible pairs of data points and quadrants, the 2-dimensional \(D\) statistic is defined by the maximal difference of the integrated probabilities between samples.
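To make this construction concrete, the snippet below (illustrative only; all variable names are ours) builds the full grid of origins that Peacock’s test ranges over for two small samples:

#illustrative only: the set of origins considered by Peacock's test, i.e.
#every pairing of an observed x value with an observed y value, whether or
#not the pair was observed together
set.seed(1)
sample1 <- data.frame(x = rnorm(5), y = rnorm(5))
sample2 <- data.frame(x = rnorm(5), y = rnorm(5))
xObs <- c(sample1$x, sample2$x)
yObs <- c(sample1$y, sample2$y)
peacockOrigins <- expand.grid(X = xObs, Y = yObs)
nrow(peacockOrigins)   #n^2 = (5 + 5)^2 = 100 origins, each defining 4 quadrants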

The variation introduced by (1) was to consider only quadrants centered on each observed \((x, y)\) pair when computing the cumulative distribution functions. That is, rather than looking over all \(n^2\) points \(\{(X_i, Y_j): i,j \in [1,n]\}\), Fasano and Franceschini only use the observed \(n\) points \(\{(X_i, Y_i): i \in [1,n]\}\). Thus for any given \(n\) points in a two-dimensional space, those \(n\) points define \(4n\) (rather than \(4n^2\)) quadrants. The procedure is illustrated in Figure 3. The algorithm loops through each point in one sample in turn to define the origin of 4 quadrants (grey dotted lines in Figure 3). The fraction of points from each sample falling in each quadrant is computed, and the quadrant with the maximal difference is designated as the current maximum for the specified origin. By iterating over all data points and quadrants, the test statistic \(D_{FF,1}\) is defined by the maximal difference of the integrated probabilities between samples in any quadrant for any origin from the first sample. In Figure 3, using the orange point as the origin, the maximal difference is \(D_{FF,1} = 0.52\).

Figure 3 | Illustration of the Fasano–Franceschini algorithmic search for the maximal difference (\(D_{FF,1}\)) between sample 2-D eCDFs. Looping through each point in the sampled data to define a unique origin (grey dotted line), the fraction of orange and blue points in each quadrant is computed (plot corners). For each origin, the quadrant which maximizes the absolute difference in the integrated probabilities is indicated. The origin which maximizes the overall absolute difference in the integrated probabilities between samples is highlighted by the orange box.

This process is repeated using the points from the other sample as the origins to compute \(D_{FF,2}\), the maximal difference with origins from the second sample. \(D_{FF,1}\) and \(D_{FF,2}\) are then averaged to compute the overall statistic for hypothesis testing, \(D_{FF}=(D_{FF,1}+D_{FF,2})/2\).

Some points may be tied with the \(X\) and/or \(Y\) coordinate of the origin, creating an ambiguity when computing the fraction of points in each quadrant. Since the test attempts to define the maximal difference of the cumulative probabilities, a natural solution is to treat a point that is tied with the current \(X\) and/or \(Y\) coordinate of the origin as equally likely to have been drawn from any of the tied quadrants. Hence, any data point sharing the same \(X\) or \(Y\) coordinate as the origin is evenly distributed across the two tied quadrants, with each receiving half a count. Any data point sharing both the same \(X\) and \(Y\) coordinates as the current origin (including the origin itself) is evenly distributed across all four quadrants, with each receiving a quarter count.
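The quadrant counting and tie handling described above can be sketched in a few lines of R. The code below is our own illustrative implementation of the procedure, not the package's internal code; quadrantFractions() splits tied points across quadrants exactly as described, and ffStatistic() assembles \(D_{FF}\) from the two one-sided maxima.

#fraction of the points in `pts` falling in each quadrant around `origin`,
#with ties in x and/or y split evenly across the tied quadrants
quadrantFractions <- function(origin, pts) {
  counts <- c(0, 0, 0, 0)   #quadrants: (+,+), (-,+), (-,-), (+,-)
  for (i in seq_len(nrow(pts))) {
    dx <- pts$x[i] - origin[1]
    dy <- pts$y[i] - origin[2]
    #weight of the point on the x > X vs. x < X side (0.5/0.5 if tied in x)
    wx <- if (dx > 0) c(1, 0) else if (dx < 0) c(0, 1) else c(0.5, 0.5)
    #weight of the point on the y > Y vs. y < Y side (0.5/0.5 if tied in y)
    wy <- if (dy > 0) c(1, 0) else if (dy < 0) c(0, 1) else c(0.5, 0.5)
    #a point tied in both coordinates contributes 0.25 to every quadrant
    counts <- counts + c(wx[1] * wy[1], wx[2] * wy[1], wx[2] * wy[2], wx[1] * wy[2])
  }
  counts / nrow(pts)
}

#maximal absolute difference in quadrant fractions over a set of origins
maxQuadrantDiff <- function(origins, S1, S2) {
  d <- 0
  for (i in seq_len(nrow(origins))) {
    o <- c(origins$x[i], origins$y[i])
    d <- max(d, abs(quadrantFractions(o, S1) - quadrantFractions(o, S2)))
  }
  d
}

#D_FF as the average of the two one-sided statistics
ffStatistic <- function(S1, S2) {
  dff1 <- maxQuadrantDiff(S1, S1, S2)   #origins taken from the first sample
  dff2 <- maxQuadrantDiff(S2, S1, S2)   #origins taken from the second sample
  (dff1 + dff2) / 2
}

For \(n\) total points, this sketch performs \(O(n)\) work for each of \(O(n)\) origins, consistent with the \(O(n^2)\) scaling discussed below.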

Null distribution of \(D_{FF}\)

Using Monte Carlo simulation, Fasano and Franceschini created a look-up table of critical values of \(D_{FF}\) as a function of the sample size and the coefficient of correlation \(r\). (18) later defined an approximate fit to the look-up table as follows. For a single sample of size \(n\),

\[\begin{equation} \mathbb{P}(d_{FF} > D_{FF}) = \Phi \left( \frac{D_{FF}\sqrt{n}}{1+\sqrt{1-r^2}(0.25-0.75/\sqrt{n})} \right) \, , \tag{4} \end{equation}\]

where \(\Phi(\cdot)\) is as defined in Eq 1. The two-sample case uses the same formula as above, but with the effective sample size

\[\begin{equation} n = \frac{n_1n_2}{n_1+n_2}\, . \tag{5} \end{equation}\]

In both cases, \(r\) is defined in the usual way as

\[\begin{equation} r = \frac{\sum_{i}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i}(X_i-\bar{X})^2}\sqrt{\sum_{i}(Y_i-\bar{Y})^2}}\, . \tag{6} \end{equation}\]
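As a sketch (ours, not the package's implementation), Eqs 1 and 4–5 translate directly into R; ffPhi() and ffPvalueApprox() below are illustrative helper names:

#Phi(lambda) from Eq 1, truncated at kMax terms
ffPhi <- function(lambda, kMax = 100) {
  k <- seq_len(kMax)
  2 * sum((-1)^(k - 1) * exp(-2 * k^2 * lambda^2))
}

#approximate two-sample p value from Eqs 4-5, given the observed D_FF,
#the two sample sizes, and a correlation coefficient r computed as in Eq 6
ffPvalueApprox <- function(D, n1, n2, r) {
  n <- n1 * n2 / (n1 + n2)                                               #Eq 5
  lambda <- D * sqrt(n) / (1 + sqrt(1 - r^2) * (0.25 - 0.75 / sqrt(n))) #Eq 4
  min(1, max(0, ffPhi(lambda)))
}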

Illustrations

Fasano–Franceschini test usage

In their paper, Fasano and Franceschini use Monte Carlo simulation to approximate the distribution of \(D_{FF}\) as a function of the sample size \(n\) and the coefficient of correlation \(r\). Notably, unlike the 1-D KS test, the distribution of \(D_{FF}\) is not completely independent of the shape of the 2-D distribution of the underlying data, but depends on the correlation between the variables. In the case where the variables \(X\) and \(Y\) are perfectly correlated (\(r = 1\)), the 2-D distribution lies along a single line and thus the 1-D KS test could be used; at the other extreme, where \(X\) and \(Y\) are perfectly uncorrelated (\(r = 0\)), the 2-D distribution is independent in the \(X\) and \(Y\) directions and one could apply the 1-D KS test to the marginal distributions. Results from Monte Carlo simulation support these expectations, showing that the distribution of \(D\) is nearly identical for varying distributions with the same correlation coefficient (1). The approximation by (18) (Eqs 4–5) can be used to test the significance levels for the 2-D KS test using the following code:

library(fasano.franceschini.test)

#set seed for reproducible example
set.seed(123)

#create 2-D samples with the same underlying distributions
sample1Data <- data.frame(
  x = rnorm(n = 10, mean = 0, sd = 1),
  y = rnorm(n = 10, mean = 0, sd = 1)
)
sample2Data <- data.frame(
  x = rnorm(n = 10, mean = 0, sd = 1),
  y = rnorm(n = 10, mean = 0, sd = 1)
)

fasano.franceschini.test(S1 = sample1Data, S2 = sample2Data)
## 
##  Fasano-Francheschini Test
## 
## data:  sample1Data and sample2Data
## D-stat = 0.4, p-value = 0.3091
## sample estimates:
## dff,1 dff,2 
## 0.375 0.425

Bootstrap version of the Fasano–Franceschini test

It has been noted that the approximation from (18) is only accurate when \(n \gtrsim 20\) and the \(p\)-value is less than (more significant than) \(\sim 0.2\) (19). While this inaccuracy still allows a simple rejection decision to be made at any \(\alpha\leq0.2\), it is sometimes useful to quantify large \(p\) values more exactly (for example, in a cross-study concordance analysis comparing \(p\) values between studies, as in (22)), or to apply the test to smaller datasets. To address these limitations, one can bootstrap the significance levels for the multidimensional statistic directly from the particular data set under study. When Fasano and Franceschini’s paper was published in 1987, this approach was computationally infeasible at scale. Today, modern computers can rapidly compute a bootstrapped null distribution of \(D_{FF}\) from the data to test significance.

The fasano.franceschini.test R package implements a parallelized bootstrapping procedure. The marginal distributions of the 2-dimensional data set are resampled with replacement to generate randomized 2-dimensional data sets nBootstrap times. The frequency count by quadrant is performed for each bootstrapped resampling as described above to compute \(D_{FF}\). The observed \(D_{FF}\) is then compared to the distribution of bootstrapped \(D_{FF}\) values to compute a \(p\) value. The bootstrapped version of the Fasano–Franceschini test can be run as follows (see fasano.franceschini.test() for further source code details and implementation).

#set seed for reproducible example
set.seed(123)

#create 2-D samples with the same underlying distributions
sample1Data <- data.frame(
  x = rnorm(n = 10, mean = 0, sd = 1),
  y = rnorm(n = 10, mean = 0, sd = 1)
)
sample2Data <- data.frame(
  x = rnorm(n = 10, mean = 0, sd = 1),
  y = rnorm(n = 10, mean = 0, sd = 1)
)

fasano.franceschini.test(S1 = sample1Data, S2 = sample2Data, nBootstrap = 10, cores = 1)
## 
##  Fasano-Francheschini Test
## 
## data:  sample1Data and sample2Data
## D-stat = 0.4, p-value = 0.2727
## sample estimates:
## dff,1 dff,2 
## 0.375 0.425

To improve run time, one may adjust the cores parameter; see the R parallel package and the mclapply() function for further details.
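Conceptually, the bootstrap procedure amounts to the following sketch (illustrative code, not the package internals; in particular, pooling the marginals across the two samples before resampling is an assumption made here, and statistic can be any function returning \(D_{FF}\) for two samples):

#bootstrap null distribution of the test statistic by resampling the
#marginal distributions with replacement
bootstrapPvalue <- function(S1, S2, statistic, nBootstrap = 1000) {
  observed <- statistic(S1, S2)
  xPool <- c(S1$x, S2$x)   #assumption: marginals pooled across both samples
  yPool <- c(S1$y, S2$y)
  nullDist <- replicate(nBootstrap, {
    r1 <- data.frame(x = sample(xPool, nrow(S1), replace = TRUE),
                     y = sample(yPool, nrow(S1), replace = TRUE))
    r2 <- data.frame(x = sample(xPool, nrow(S2), replace = TRUE),
                     y = sample(yPool, nrow(S2), replace = TRUE))
    statistic(r1, r2)
  })
  #p value: fraction of bootstrapped statistics at least as large as observed
  mean(nullDist >= observed)
}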

Computational efficiency

Figure 4 | Computational efficiency benchmarks. A: Runtime of the Fasano–Franceschini test relative to the Peacock test at four different sample sizes (\(n=10, 100, 1000, 5000\)). Points represent the average of 10 benchmark runs. B: Runtime of the Fasano–Franceschini bootstrapping procedure for various sample sizes (\(n= 10, 100, 1000, 5000\)) as a function of the number of cores used. Within each panel, lines are colored by the number of bootstrap iterations (no bootstrap, 10, 100, 1000). Points represent the average of 10 benchmark runs. Note the logarithmic \(y\)-axis in (B).

To assess computational efficiency, we benchmarked the package as follows. Using the rbenchmark package to evaluate runtime, the Fasano–Franceschini test and Peacock test were run at four different sample sizes (\(n=10, 100, 1000, 5000\)), with 10 replicates for each run. The Fasano–Franceschini bootstrap procedure was further evaluated with four different numbers of bootstrap iterations (no bootstrap, 10, 100, 1000), again using 10 replicates for each run. Reported results represent the average run time of the 10 replicate benchmarks. All benchmark tests were run on a 2018 MacBook Pro (macOS Catalina) with a 2.7 GHz Quad-Core Intel Core i7 processor and 16 GB of 2133 MHz LPDDR3 memory.
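Such timings can be reproduced in spirit with rbenchmark; the snippet below shows the general setup for a single sample size (it is not the exact benchmarking script used for Figure 4):

library(rbenchmark)
library(fasano.franceschini.test)

#generate two samples of size n from the same distribution
set.seed(123)
n <- 100
s1 <- data.frame(x = rnorm(n), y = rnorm(n))
s2 <- data.frame(x = rnorm(n), y = rnorm(n))

#time the approximate test against the bootstrapped test, 10 replicates each
benchmark(
  approximate  = fasano.franceschini.test(S1 = s1, S2 = s2),
  bootstrap100 = fasano.franceschini.test(S1 = s1, S2 = s2, nBootstrap = 100, cores = 1),
  replications = 10,
  columns = c("test", "replications", "elapsed", "relative")
)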

The main distinction between the Peacock and Fasano–Franceschini tests is computational efficiency, with the Fasano–Franceschini test scaling as \(O(n^2)\) relative to Peacock’s \(O(n^3)\) complexity (17). Our benchmarks confirm this advantage (Figure 4A). While the bootstrapping procedure increases runtime relative to the approximate fit from (18), parallelizing the Fasano–Franceschini test across 8 cores yields a four-fold reduction in run time (Figure 4B).

Summary and discussion

The fasano.franceschini.test package is an R implementation of the 2-D two-sample KS test as defined by Fasano and Franceschini (1). It improves upon existing packages by implementing a fast algorithm and a parallelized bootstrapping procedure for improved statistical testing. Complete package documentation and source code are available via the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/ and the package website at https://nesscoder.github.io/fasano.franceschini.test/.

Computational details

The results in this paper were obtained using R 4.0.3 with the fasano.franceschini.test 1.0.0 package. R itself and all package dependencies (methods 4.0.3; parallel 4.0.3) are available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/.

Acknowledgments

Research reported in this publication was supported by the NSF-Simons Center for Quantitative Biology at Northwestern University, an NSF-Simons MathBioSys Research Center. This work was supported by a grant from the Simons Foundation/SFARI (597491-RWC) and the National Science Foundation (1764421). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Science Foundation and Simons Foundation.

E.N.C developed the fasano.franceschini.test package and produced the tutorials/documentation; E.N.C. and R.B. wrote the paper.

References

1.
G. Fasano, A. Franceschini, A multidimensional version of the Kolmogorov-Smirnov test. Monthly Notices of the Royal Astronomical Society 225, 155–170 (1987).
2.
A. N. Kolmogorov, Sulla Determinazione Empirica di Una Legge di Distribuzione. Giornale dell’Istituto Italiano degli Attuari, 83–91 (1933).
3.
A. N. Kolmogorov, Über die Grenzwertsätze der Wahrscheinlichkeitsrechnung. Bull. Acad. Sci. URSS 1933, 363–372 (1933).
4.
N. V. Smirnov, Sur la distribution de \(\omega^2\) (criterium de M.R. v. Mises). Com. Rend. Acad. Sci. (Paris) 202, 449–452 (1936).
5.
N. V. Smirnov, On the distribution of the mises \(\omega^2\) criterion [in Russian]. Rec. Math. N.S. [Mat. Sbornik] 2, 973–993 (1937).
6.
N. V. Smirnov, On the deviations of the empirical distribution curve [in Russian]. Rec. Math. N.S. [Mat. Sbornik] 6, 3–26 (1939).
7.
N. V. Smirnov, Approximate laws of distribution of random variables from empirical data. Uspehi Matem. Nauk 10, 179–206 (1944).
8.
N. V. Smirnov, Table for estimating the goodness of fit of empirical distributions. The Annals of Mathematical Statistics (1948) https://doi.org/10.1214/aoms/1177730256.
9.
S. Atasoy, et al., Connectome-harmonic decomposition of human brain activity reveals dynamical repertoire re-organization under LSD. Scientific Reports 7, 1–18 (2017).
10.
F. Chiang, O. Mazdiyasni, A. AghaKouchak, Amplified warming of droughts in southern United States in observations and model simulations. Science Advances 4, eaat2380 (2018).
11.
J. M. Hahne, M. A. Schweisfurth, M. Koppe, D. Farina, Simultaneous control of multiple functions of bionic hand prostheses: Performance and robustness in end users. Science Robotics 3 (2018).
12.
S. M. Hargreaves, W. M. C. Araújo, E. Y. Nakano, R. P. Zandonadi, Brazilian vegetarians diet quality markers and comparison with the general population: A nationwide cross-sectional study. PloS One 15, e0232954 (2020).
13.
F. Wong, J. J. Collins, Evidence that coronavirus superspreading is fat-tailed. Proceedings of the National Academy of Sciences 117, 29416–29418 (2020).
14.
S. Kaczanowska, et al., Genetically engineered myeloid cells rebalance the core immune suppression program in metastasis. Cell 184, 2033–2052 (2021).
15.
J. A. Peacock, Two-dimensional goodness-of-fit testing in astronomy. Monthly Notices of the Royal Astronomical Society 202, 615–627 (1983).
16.
R Core Team, R: A language and environment for statistical computing (R Foundation for Statistical Computing, 2020).
17.
R. H. C. Lopes, I. Reid, P. R. Hobson, The two-dimensional Kolmogorov-Smirnov test in XI International Workshop on Advanced Computing and Analysis Techniques in Physics Research, (2007).
18.
W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical recipes 3rd edition: The art of scientific computing, 3rd Ed. (Cambridge University Press, 2007).
19.
G. Babu, E. Feigelson, Astrostatistics: Goodness-of-fit and all that! in Astronomical Data Analysis Software and Systems XV, (2006), p. 127.
20.
M. A. Stephens, Introduction to Kolmogorov (1933) On the Empirical Determination of a Distribution in (Springer, New York, NY, 1992), pp. 93–105.
21.
M. G. Kendall, A. Stuart, The Advanced Theory of Statistics (Griffin, 1946).
22.
E. Ness-Cohn, M. Iwanaszko, W. L. Kath, R. Allada, R. Braun, TimeTrial: An interactive application for optimizing the design and analysis of transcriptomic time-series data in circadian biology research. Journal of Biological Rhythms 35, 439–451 (2020).