This package provides a fast algorithm to calculate the sample size at which a Pearson correlation stabilizes in the sequential framework of Schönbrodt and Perugini (2013, 2018). I assume that you have read the original paper or at least have an idea of how the method works in principle. Essentially, you want to find the sample size at which you can be sure that a proportion of \(1-\alpha\) of many identically run studies would fall into a specified corridor of stability around an assumed population correlation and stay inside that corridor as more participants are added. For instance: how many participants per study are required so that, out of 100k studies, 90% fall into the region between .4 and .6 (a Pearson correlation) and never leave this region again as participants are added (under the assumption that the population correlation is .5)? This sample size is referred to as the critical point of stability.
If you have found this page, I assume you want either (1) to calculate the critical point of stability for your own study or (2) to explore the method in general. In either case, read on and hopefully you will find what you are looking for. Let us first load the package and set a seed for reproducibility:
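library(fastpos)
set.seed(1234)  # an arbitrary seed, chosen here for illustration; any fixed value gives reproducible results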
In most cases you will just need the function find_critical_pos, which will give you the critical point of stability for your specific parameters.
Let us reproduce one example from Schönbrodt and Perugini’s work (this should take only a couple of seconds on a modern CPU):
find_critical_pos(rho = .7, sample_size_min = 20, sample_size_max = 1000,
                  n_studies = 10000)
#> rho_pop 80% 90% 95% sample_size_min sample_size_max lower_limit upper_limit
#> 1 0.6997798 63 95 130 20 1000 0.6 0.8
#> n_studies n_not_breached precision precision_rel
#> 1 10000 0 0.1 FALSE
If you compare this with their original table or with the data on GitHub (https://github.com/nicebread/corEvol), the results should be fairly close.
Note that find_critical_pos will throw a message if at least one study did not reach the corridor of stability within the maximum sample size. This happened in Schönbrodt and Perugini’s work as well, but only rarely. Still, it should be avoided for a proper estimate of the point of stability.
find_critical_pos(rho = .7, sample_size_min = 20, sample_size_max = 400,
                  n_studies = 10000)
#> Warning in find_critical_pos(rho = 0.7, sample_size_min = 20, sample_size_max = 400, : 3 simulation[s] did not reach the corridor of
#> stability.
#> Increase sample_size_max and rerun the simulation.
#> rho_pop 80% 90% 95% sample_size_min sample_size_max lower_limit upper_limit
#> 1 0.6998704 65 97 131 20 400 0.6 0.8
#> n_studies n_not_breached precision precision_rel
#> 1 10000 3 0.1 FALSE
In this case, do what the message suggests and increase the maximum sample size. Note that larger maximum sample sizes are more resource intensive because the correlations are calculated in reverse (from the maximum sample size downwards): every study starts at the maximum sample size, so the cost per study scales with it. Thus, you usually would not want to increase the maximum sample size unless some studies did not reach the corridor of stability.
If you need different confidence levels, just state them:
find_critical_pos(rho = .7, sample_size_min = 20, sample_size_max = 1000,
                  n_studies = 10000, confidence_levels = c(.6, .85))
#> rho_pop 60% 85% sample_size_min sample_size_max lower_limit upper_limit
#> 1 0.7006911 38 78 20 1000 0.6 0.8
#> n_studies n_not_breached precision precision_rel
#> 1 10000 0 0.1 FALSE
This has no effect on resource consumption because the time-consuming part is simulating the distribution of points of stability, not calculating its quantiles.
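To see why, note that once the simulated points of stability are available, every confidence level is just one quantile of the same vector. A minimal sketch, using a made-up vector in place of real simulation output:

pos_demo <- sample(20:1000, size = 10000, replace = TRUE)  # stand-in for simulated points of stability
quantile(pos_demo, probs = c(.6, .85))  # each confidence level is one cheap quantile call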
If you need a different precision level or even relative precision, specify it:
find_critical_pos(rho = c(.5, .7), sample_size_min = 20, sample_size_max = 2500,
                  n_studies = 10000, precision = .10, precision_rel = TRUE)
#> Warning in find_critical_pos(rho = c(0.5, 0.7), sample_size_min = 20, sample_size_max = 2500, : 9 simulation[s] did not reach the corridor of
#> stability.
#> Increase sample_size_max and rerun the simulation.
#> rho_pop 80% 90% 95% sample_size_min sample_size_max lower_limit
#> 1 0.4990567 596 856 1119.05 20 2500 0.45
#> 2 0.6997279 137 199 261.00 20 2500 0.63
#> upper_limit n_studies n_not_breached precision precision_rel
#> 1 0.55 10000 9 0.1 TRUE
#> 2 0.77 10000 0 0.1 TRUE
As you can see in the output, the limits were set relative to the population correlation (±10% of the population correlation, since precision = .10).
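The bounds in the table can be reproduced by hand with the implied formula (rho multiplied by 1 ± precision):

rho <- c(.5, .7)
precision <- .10
# relative corridor: rho * (1 - precision) to rho * (1 + precision)
cbind(lower = rho * (1 - precision), upper = rho * (1 + precision))
# gives .45/.55 and .63/.77, matching lower_limit and upper_limit above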
If you want to dig deeper, you can have a look at the functions that fastpos builds upon. simulate_pos is the workhorse of the package. It calls a C++ function to calculate correlations sequentially, and it does this pretty fast (but you know that already, right?). A more bare-bones approach would be to create a population with create_pop and pass it to simulate_pos:
pop <- create_pop(0.5, 1000000)
pos <- simulate_pos(x_pop = pop[, 1],
                    y_pop = pop[, 2],
                    number_of_studies = 1000,
                    sample_size_min = 20,
                    sample_size_max = 1000,
                    replace = TRUE,
                    lower_limit = 0.4,
                    upper_limit = 0.6)
hist(pos, xlim = c(0, 1000), xlab = "Point of stability",
     main = "Histogram of points of stability for rho = .5+-.1")
Note that no warning message appears if the corridor is not reached; simulate_pos will simply return the maximum sample size. So pay careful attention when you work with this function directly and adjust the maximum sample size as needed.
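Based on this behavior, a quick manual check is to count how many returned points of stability equal the maximum sample size:

# studies whose point of stability equals sample_size_max may never
# have reached the corridor (per the behavior described above)
sum(pos == 1000)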
create_pop creates the population matrix by using mvrnorm. This is a much simpler way compared to Schönbrodt and Perugini’s approach, but the results do not seem to differ. If you are interested in how population parameters (e.g. skewness) affect the point of stability, you should refer instead to the population generating functions in Schönbrodt and Perugini’s work.
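For illustration, here is a rough equivalent of what create_pop does, assuming a bivariate standard normal with the desired correlation (the exact parameterization is my assumption):

library(MASS)
# a bivariate normal population with correlation .5 (both variances 1)
pop2 <- mvrnorm(n = 1e6, mu = c(0, 0),
                Sigma = matrix(c(1, .5, .5, 1), nrow = 2))
cor(pop2)[1, 2]  # should be close to .5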
As you can see, there is not really much to the sequential definition of stability, except for calculating billions of correlations. This is done quite fast with the help of Rcpp.
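To make the definition concrete, here is a naive pure-R sketch of what happens for a single study, using the pop matrix from above (the package’s C++ code computes the correlations incrementally and in reverse, so this version is far slower):

one_study_pos <- function(pop, n_min = 20, n_max = 1000,
                          lower = .4, upper = .6) {
  # draw one study of maximum size, then compute the correlation
  # at every sample size from n_min to n_max
  idx <- sample(nrow(pop), n_max, replace = TRUE)
  corrs <- vapply(n_min:n_max,
                  function(n) cor(pop[idx[1:n], 1], pop[idx[1:n], 2]),
                  numeric(1))
  # the point of stability is the sample size right after the last breach
  breached <- which(corrs < lower | corrs > upper)
  if (length(breached) == 0) return(n_min)
  n_min + max(breached)
}
one_study_pos(pop)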
Let us finally reproduce Schönbrodt and Perugini’s famous and often-cited table of the critical points of stability for a precision of 0.1. We set the maximum sample size somewhat higher to reduce the number of studies in which the corridor is never reached. Furthermore, we reduce the number of studies to 10k so that the simulation runs fairly quickly.
find_critical_pos(rho = seq(.1, .7, .1), sample_size_max = 1000,
                  n_studies = 10000)
#> Warning in find_critical_pos(rho = seq(0.1, 0.7, 0.1), sample_size_max = 1000, : 30 simulation[s] did not reach the corridor of
#> stability.
#> Increase sample_size_max and rerun the simulation.
#> rho_pop 80% 90% 95% sample_size_min sample_size_max lower_limit
#> 1 0.1005987 253.0 365 479.00 20 1000 0.0
#> 2 0.1998043 234.0 335 433.00 20 1000 0.1
#> 3 0.3004318 214.2 304 393.00 20 1000 0.2
#> 4 0.4004198 184.0 267 354.05 20 1000 0.3
#> 5 0.4986327 145.0 208 276.00 20 1000 0.4
#> 6 0.5991916 104.0 150 200.00 20 1000 0.5
#> 7 0.6999457 65.0 96 128.05 20 1000 0.6
#> upper_limit n_studies n_not_breached precision precision_rel
#> 1 0.2 10000 11 0.1 FALSE
#> 2 0.3 10000 10 0.1 FALSE
#> 3 0.4 10000 6 0.1 FALSE
#> 4 0.5 10000 2 0.1 FALSE
#> 5 0.6 10000 1 0.1 FALSE
#> 6 0.7 10000 0 0.1 FALSE
#> 7 0.8 10000 0 0.1 FALSE
You can obviously parallelize the process, which is especially useful if you want to run many simulations. For instance, if you increase the number of studies to 100k (as in the original article), it takes less than a minute on a modern CPU with several cores. On my i7-2640 with 4 cores, it takes about 30 s. Overall, this is a speedup of more than 1000 compared to Schönbrodt and Perugini’s code.
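A minimal sketch of one way to do this, parallelizing over the population correlations and assuming a Unix-alike system (parallel::mclapply relies on forking):

library(parallel)
library(fastpos)
# run one simulation per population correlation, distributed over four cores
results <- mclapply(seq(.1, .7, .1), function(r) {
  find_critical_pos(rho = r, sample_size_max = 1000, n_studies = 10000)
}, mc.cores = 4)
do.call(rbind, results)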
If you are interested in this package, there is still some work to do, and I would be happy if you would like to contribute. Specifically, I would like to use RcppParallel to speed up the simulation directly in C++. This is mostly of academic interest, as the functions are already fast enough to find the point of stability for an individual study within a few seconds for most use cases. Indeed, I hope the package will be used this way, quite similar to a power analysis for significance testing.
Schönbrodt, F. D., & Perugini, M. (2013). At what sample size do correlations stabilize? Journal of Research in Personality, 47, 609-612. https://doi.org/10.1016/j.jrp.2013.05.009
Schönbrodt, F. D., & Perugini, M. (2018). Corrigendum to “At what sample size do correlations stabilize?” [J. Res. Pers. 47 (2013) 609-612]. Journal of Research in Personality, 74, 194. https://doi.org/10.1016/j.jrp.2018.02.010