The anscombiser package is named after the famous
Anscombe’s quartet datasets (Anscombe, 1973).
Frank Anscombe created these datasets to emphasize the
importance of graphical techniques in statistical analyses. The datasets
each consist of 11 pairs of \((x, y)\)
points and they have almost exactly the same values of an apparently
remarkable number of sample summary statistics. However, scatter plots
of these data reveal that the behaviours exhibited in these datasets are
very different. That inspecting summary statistics can give a very
misleading impression of a dataset is an important point for students of
Statistics to appreciate.
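The near-equality of these statistics can be checked directly, because the quartet ships with base R as `datasets::anscombe` (columns `x1` to `x4` and `y1` to `y4`). The following is a minimal sketch; the names `quartet` and `stats` are chosen here for illustration only.

```r
# Anscombe's quartet is available in base R
quartet <- datasets::anscombe

# For each of the four datasets, compute Anscombe's statistics:
# sample means, sample variances and the sample correlation
stats <- sapply(1:4, function(i) {
  x <- quartet[[paste0("x", i)]]
  y <- quartet[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    var_x = var(x), var_y = var(y), cor_xy = cor(x, y))
})
colnames(stats) <- paste0("set", 1:4)

# One column per dataset; the rows agree closely across columns
round(stats, 2)
```

Plotting each pair, for example with `plot(quartet$x1, quartet$y1)`, shows how different the four datasets are despite these matching values.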
The datasauRus package (Locke and
D’Agostino McGowan, 2018) provides further examples of datasets
that have markedly different scatter plots but nevertheless share many
sample summary statistics. These datasets were produced by using a
simulated annealing algorithm that seeks to morph incrementally an
initial dataset towards a target shape while maintaining the same sample
summary statistics (Matejka and Fitzmaurice, 2017).
In principle, any set of summary statistics can be used.
Indeed, Locke and D’Agostino McGowan
(2018) provide not only datasets that have the same values of
Anscombe’s statistics (essentially sample means, variances and
correlation) but also datasets that are constrained to share the same
sample median, interquartile range and Spearman’s rank correlation.
The anscombiser package takes a simpler and quicker
approach to the same problem, using Anscombe’s statistics. It uses
shifting, scaling and rotating to transform the observations in an input
dataset to achieve a target set of Anscombe’s statistics. These
statistics can be set directly or by calculating them from a target
dataset, perhaps one of Anscombe’s quartet. If the input dataset has
statistics that are similar to the target statistics then the output
dataset will look rather similar to the input dataset. Otherwise, the
output dataset will be a squashed and/or rotated version of the input
dataset, but the general shape of the input dataset will still be
visible. It will be like viewing the input dataset from a different viewpoint.
Thus, we can easily create many datasets that have different general natures but share the same values of Anscombe’s statistics. In addition, this method works in more than two dimensions.
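One way to see why shifting plus a linear map is enough is the base R sketch below. It is not taken from the package: `match_stats()` is a hypothetical helper, and using Cholesky factors is one convenient choice of linear map among many, so anscombiser itself may transform the data differently. The sketch centres the input, removes its sample covariance, imposes the target covariance and then shifts to the target means. Nothing in it is specific to two columns.

```r
# Hypothetical helper (not part of anscombiser): transform `x` so that its
# sample means and sample covariance matrix equal the supplied targets.
match_stats <- function(x, target_mean, target_cov) {
  x <- as.matrix(x)
  # Shift: centre each column at zero
  centred <- sweep(x, 2, colMeans(x))
  # chol() returns an upper-triangular R with t(R) %*% R equal to its argument
  r_in <- chol(cov(centred))
  r_out <- chol(target_cov)
  # Linear map: remove the input covariance, then impose the target covariance
  y <- centred %*% solve(r_in) %*% r_out
  # Shift again: move the column means to the target means
  sweep(y, 2, target_mean, "+")
}

# Give the Old Faithful data the statistics of Anscombe's first dataset
target <- datasets::anscombe[, c("x1", "y1")]
new_data <- match_stats(datasets::faithful, colMeans(target), cov(target))

colMeans(new_data)  # compare with colMeans(target)
cov(new_data)       # compare with cov(target)
```

Matching the means and the covariance matrix also matches the variances and the correlation, which are Anscombe’s statistics as described above.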
The anscombise() function takes a two-dimensional input
dataset and outputs a dataset that shares Anscombe’s statistics with his
quartet of datasets. The which argument chooses which of
Anscombe’s datasets to use. The default is which = 1. Of
course, this affects the output dataset only minimally but it matters if
we want to plot the input dataset, as we do in an example below.
The anscombiser package provides 8 input datasets,
input1 to input8, that can be used to create
Anscombe-like datasets, with the same sample size of 11 as the original Anscombe
quartet. The following example uses input data arranged on the edge of a
a2 <- anscombise(input2, which = 4)
plot(a2)
Now we transform the Old Faithful Geyser data so that it shares the sample summary statistics of the Anscombe quartet.
new_faithful <- anscombise(datasets::faithful, which = 4)
plot(new_faithful)
plot(new_faithful, input = TRUE)
If we view a plot of the outline of the coast of Italy from a strange angle then the resulting dataset has the same sample summary statistics as those above.
italy <- mapdata("Italy")
new_italy <- anscombise(italy, which = 4)
plot(new_italy)
The mimic() function of the anscombiser
package transforms an input dataset, as outlined above, to mimic another
dataset, in the sense of replicating its values of Anscombe’s
statistics. A particularly effective feature of the
datasauRus package is a dataset that draws a picture of a
dinosaur. Here, we show that a plot of the outline of the coast of the
UK needs little adjustment to replicate the sample summary statistics of
the dinosaur dataset.
library(datasauRus)
library(maps)
dino <- datasaurus_dozen_wide[, c("dino_x", "dino_y")]
UK <- mapdata("UK")
new_UK <- mimic(UK, dino)
plot(new_UK, legend_args = list(x = "right"))
plot(new_UK, input = TRUE, legend_args = list(x = "topright"))
We finish this section with another example involving the dinosaur.
new_dino <- mimic(dino, trump)
plot(new_dino, legend_args = list(x = "topright"))
plot(new_dino, input = TRUE, legend_args = list(x = "bottomright"), pch = 20)
The final image was created by Accentaur from the Noun Project.
We conclude with a brief 3D example, using the
randu and trees datasets in the
datasets package.
new_randu <- mimic(datasets::randu, datasets::trees)
# new_randu and trees share the same sample summary statistics
new_randu_stats <- get_stats(new_randu)
trees_stats <- get_stats(datasets::trees)
# For example
trees_stats$correlation
#>            Girth    Height    Volume
#> Girth  1.0000000 0.5192801 0.9671194
#> Height 0.5192801 1.0000000 0.5982497
#> Volume 0.9671194 0.5982497 1.0000000
new_randu_stats$correlation
#>           new1      new2      new3
#> new1 1.0000000 0.5192801 0.9671194
#> new2 0.5192801 1.0000000 0.5982497
#> new3 0.9671194 0.5982497 1.0000000
pairs(trees)
It is well known that in three-dimensional displays of the
randu data non-random structure is evident, but this isn’t
evident in these pairwise displays.
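The three-dimensional structure can also be checked numerically rather than visually. The sketch below assumes that each row of `datasets::randu` holds three successive values from the RANDU generator, as its help page describes, and uses the known RANDU recurrence, under which each value equals 6 times the previous value minus 9 times the one before that, modulo 1 on the unit interval. If so, the combination 9x - 6y + z should be a whole number for every row, up to the rounding of the stored values. The name `combo` is illustrative.

```r
randu <- datasets::randu

# Under the RANDU recurrence, 9x - 6y + z is an integer for each triple,
# apart from rounding error in the stored values
combo <- 9 * randu$x - 6 * randu$y + randu$z

# Largest distance from the nearest integer across all rows
max(abs(combo - round(combo)))

# The integer values taken: the points lie on a small number of parallel planes
table(round(combo))
```

No single pair of columns reveals this constraint, which is consistent with the pairwise displays looking unstructured.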