1 Exploratory Data Analysis

“Exploratory data analysis is detective work” [Tukey, 1977, p.2]. This package enables the user to use graphical tools to find ‘quantitative indications’ enabling a better understanding of the data at hand. “As all all detective stories remind us, many of the circumstances surrounding acrime are accidental or misleading. Equally, many of the indications to be discerned in bodies of data are accidental or misleading [Tukey, 1977, p.3].” The solution is to compare many different graphical tools with the goal to find an agreement or to generate an hypothesis and then to confirm it with statistical methods. This package serves as a starting point.

1.1 Synoptic Overview

library(DataVisualizations)

## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
##      (status 2 uses the sf package in place of rgdal)

data("Lsun3D")
Pixelmatrix(Lsun3D$Data)

1.2 Distribution Analysis

“A scientifically sound procedure for the identification and analysis of empirical distributions is a comparison to a known theoretic distribution. The quantile/quantile plot (QQ-plot) allows comparing an empirical distribution to a known distribution [Michael, 1983]. Here, in 100 quantiles the model of a Gaussian distribution is compared to the data, and a straight line confirms a good data fit of the model. The Gaussian distribution is the canonical starting point for such a comparison[…]

[t]he precise form, i.e., the type, nature and parameters of the formal model of the probability density function (pdf) is the […] goal of [Distribution] analysis. Usually, this is performed using kernel density estimators. The simplest of such a density estimation is the histogram. However, histograms are often misleading and require critical parameters such as the width of the bin [Keating and Scott, 1999]. A specially designed density estimation, which has been successfully proved in many practical applications is the “Pareto Density Estimation” (PDE). PDE consists of a kernel density estimator representing the relative likelihood of a given continuous random data [Ultsch, 2005]. PDE has been shown to be particularly suitable for the discovery of structures in continuous data hinting at the presence of distinct groups of data and particularly suitable for the discovery of mixtures of Gaussians [Ultsch, 2005]. The parameters of the kernels are auto-adopted to the date using an information theoretic optimum on skewed distributions [Ultsch, Thrun, Hansen-Goos, and Lötsch, 2015].” [Thrun/Ultsch 2018].

library(DataVisualizations)
data(MTY)
InspectVariable(MTY,'MTY')

1.3 Mirrored Density Plots (MD-plots)

A clear model behind density estimation can outperform conventional visualization approaches. MD Plot combines the syntax of ggplot2 with Pareto density estimation and additional functionality usefull from the Data Scientist’s point of view. The approach is published in [Thrun et al., 2020]. A detailed description of the usage and functionality can be found in https://md-plot.readthedocs.io/en/latest/index.html .

The MD plot is also available in Python https://pypi.org/project/md-plot/

All dependencies have to be installed so that the MDplot can be used:

install.packages("DataVisualizations",dependencies = TRUE)

Here, one feature is bi-modal the other one has a large range of values.

library(DataVisualizations)
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.1

data(ITS)
data(MTY)
library(vioplot)
Data=cbind(ITS,MTY)
#MDplot(Data)+ylim(0,6000)+ggtitle('Two Features With Adjusted Range')

#MDplot(Data,Scaling = "Robust")+ggtitle('"Shape-Invariant" Normalization')

#Data is now capped
#Data[Data[,2]>6000,2]=6000
MDplot(Data)+ylim(0,6000)+ggtitle('Two Features with MTY Capped')

## Warning: Removed 1614 rows containing non-finite values (`stat_pd_edensity()`).

boxplot(Data,main='Two Features with MTY Capped')

vioplot(Data[,1],Data[,2])
title('Two Features with MTY Capped')

1.4 Correlation Analysis

Often it is better to visualize the density of scatter plots before calculating correlation coefficients.

library(DataVisualizations)
data(ITS)
data(MTY)
Ind2=which(ITS<900&MTY<8000)
if(requireNamespace("ScatterDensity"))
V=DensityScatter(ITS[Ind2],MTY[Ind2],
                 xlab = 'ITS in EUR',
                 ylab ='MTY in EUR',
                 main='Scatter density plot using PDE',
                 Plotter="native",
                 DensityEstimation = "PDE")

A Shortcut to visualize correlation coefficients,if many features have to be compared against each other:

data("Lsun3D")
n=nrow(Lsun3D$Data)
Data=cbind(Lsun3D$Data,runif(n),rnorm(n),rt(n,2),rlnorm(n),rchisq(100,2))

## Warning in cbind(Lsun3D$Data, runif(n), rnorm(n), rt(n, 2), rlnorm(n),
## rchisq(100, : number of rows of result is not a multiple of vector length (arg
## 6)

Header=c('x','y','z','uniform','gauss','t','log-normal','chi')
cc=cor(Data,method='spearman')
diag(cc)=0
Pixelmatrix(cc,YNames = Header,XNames = Header,main = 'Spearman Coeffs')

1.5 Distribution of Distances

The distance distribution in the input space can be bimodal, indicating a distinction between the inter- versus intracluster distances. This can serve as an indication of distance-based cluster structures (see [Thrun, 2018A, 2018B]).

library(DataVisualizations)
data("Lsun3D")
InspectDistances(Lsun3D$Data,method="euclidean")

## [1] "Distances are not in a symmetric matrix, Datamatrix is assumed and dist() ist called"

1.6 Spatial Visualization - Choropleth Map

A thematic map with areas colored in proportion to the measurement of the statistical variable being displayed on the map. A political map generated by this approach if a classification is available. Many postal codes are required to see a structure. Exemplary two postal codes in the upper left corner of the map. Bins are only presented in the map if the have values within.

If you download the package from CRAN, please do the following steps:

Step: Downlaod the shape file from https://github.com/Mthrun/DataVisualizations/tree/master/data
Step: load it from the local path of the downloaded file with

load(file='GermanPostalCodesShapes.rda')

If you download the package from GitHub, you can omit the two steps above. Then, do not use the ‘PostalCodesShapes’ input parameter. Many postal codes are required to see a structure. The code below shows exemplary two postal codes in the upper left corner of the map.

out=Choroplethmap(c(4,8,5,4),c('49838', '26817', '49838', '26817'),NumberOfBins=2,PlotIt=FALSE)
out$chorR6obj$render()

1.7 Spatial Visualization - World Map with classification

The data is taken from [Thrun, 2018b].

library(DataVisualizations)
Cls=c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 
1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 
2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 2L, 2L, 2L, 1L, 
2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 
1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 
2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 
2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 
2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L
)
Codes=c("AFG", "AGO", "ALB", "ARG", "ATG", "AUS", "AUT", "BDI", "BEL", 
"BEN", "BFA", "BGD", "BGR", "BHR", "BHS", "BLZ", "BMU", "BOL", 
"BRA", "BRB", "BRN", "BTN", "BWA", "CAF", "CAN", "CH2", "CHE", 
"CHL", "CHN", "CIV", "CMR", "COG", "COL", "COM", "CPV", "CRI", 
"CUB", "CYP", "DJI", "DMA", "DNK", "DOM", "DZA", "ECU", "EGY", 
"ESP", "ETH", "FIN", "FJI", "FRA", "FSM", "GAB", "GBR", "GER", 
"GHA", "GIN", "GMB", "GNB", "GNQ", "GRC", "GRD", "GTM", "GUY", 
"HKG", "HND", "HTI", "HUN", "IDN", "IND", "IRL", "IRN", "IRQ", 
"ISL", "ISR", "ITA", "JAM", "JOR", "JPN", "KEN", "KHM", "KIR", 
"KNA", "KOR", "LAO", "LBN", "LBR", "LCA", "LKA", "LSO", "LUX", 
"MAC", "MAR", "MDG", "MDV", "MEX", "MHL", "MLI", "MLT", "MNG", 
"MOZ", "MRT", "MUS", "MWI", "MYS", "NAM", "NER", "NGA", "NIC", 
"NLD", "NOR", "NPL", "NZL", "OMN", "PAK", "PAN", "PER", "PHL", 
"PLW", "PNG", "POL", "PRI", "PRT", "PRY", "ROM", "RWA", "SDN", 
"SEN", "SGP", "SLB", "SLE", "SLV", "SOM", "STP", "SUR", "SWE", 
"SWZ", "SYC", "SYR", "TCD", "TGO", "THA", "TON", "TTO", "TUN", 
"TUR", "TWN", "TZA", "UGA", "URY", "USA", "VCT", "VEN", "VNM", 
"VUT", "WSM", "ZAF", "ZAR", "ZMB", "ZWE")
Worldmap(Codes,Cls)

1.8 Visualization of categorical features

A fanplot is better alternative to the pie chart. It represents amount of values given in data. A normal pie plot is dificult to interpret for a human observer, because humans are not trained well to observe angles [Gohil, 2015, p. 102].

library(DataVisualizations)
data(categoricalVariable)
Fanplot(categoricalVariable)

Piechart(categoricalVariable)

2 Evaluation of Clustering

““[The quality of clustering is measured using a] “procedure for validating a cluster structure […]. This can be based on an internal index, an external index or resampling. An internal index scores the degree of correspondence between the data and the cluster structure. An external index compares the cluster structure with a structure given externally. A resampling is used to see whether the cluster structure is stable with respect to data change” [Mirkin, 2005, p. 205] (see also [Jain/Dubes, 1988, p. 161ff]). Internal and external indices are also often called intrinsic or extrinsic indices, respectively; here, they are referred to as supervised or unsupervised indices, respectively” [Thrun, 2018A, p. 28].

2.1 Visualizations of Unsupervised Indices

Examples and interpretations of Heatmaps and Silhouette plots are presented in [Thrun 2018A, 2018B].

library(DataVisualizations)
data("Lsun3D")

Heatmap(Lsun3D$Data,Lsun3D$Cls,method = 'euclidean')

FALSE Distances are not in a symmetric matrix, Datamatrix is assumed and parallelDist::parDist is called

Silhouetteplot(Lsun3D$Data,Lsun3D$Cls,PlotIt = T)

FALSE Distances are not in a symmetric matrix, Datamatrix is assumed and dist() ist called

2.2 Visualizations of Supervised Indices

Stochastic algorithms like k-means have and clustering output depending on the trial. In [Thrun et al.,2018] we proposed and approach to compare different clustering algorithms with each other.We would normally choose adjusted Rand index or something similar which can be calculated using other CRAN packages like FCPS (e.g. ClusterAccuracy). In this special case, the mean silh does not change much and the distribution of changes is uniform. However, large variances and various distributions are very common (https://doi.org/10.1038/s41598-021-98126-1) if not a hierarchical clustering algorithm is used.

library(DataVisualizations)
data("Lsun3D")
Accuracy=matrix(NaN,100,2)
Algorithms=c("MacQueen","Lloyd")
colnames(Accuracy)=Algorithms
for(i in 1:100){
  Cls=kmeans(Lsun3D$Data,4,algorithm=Algorithms[1])$cluster
  Cls2=kmeans(Lsun3D$Data,4,algorithm=Algorithms[2])$cluster
  #this is an artifical example, because the problem of arbitrary class labels is not accounted for
  #please choose an appropiate internal index or an external index
  Accuracy[i,1]=sum(Cls==Lsun3D$Cls)/length(Lsun3D$Cls)
  Accuracy[i,2]=sum(Cls2==Lsun3D$Cls)/length(Lsun3D$Cls)
}

MDplot(Accuracy) + xlab('Output of Evaluation of two Algorithms') +
  ylab('Range of Values of the Evaluation of an Algorithm') +
  ggtitle("Simple Benchmarking")

## Using statistics for ordering instead of default

3 References

[Thrun, 2018A] Thrun, M. C.: Projection Based Clustering through Self-Organization and Swarm Intelligence, doctoral dissertation 2017, Springer, Heidelberg, ISBN: 978-3-658-20539-3, https://doi.org/10.1007/978-3-658-20540-9, 2018.

[Thrun, 2018B] Thrun, M. C.: Cluster Analysis of Per Capita Gross Domestic Products, Entrepreneurial Business and Economics Review (EBER), Vol. 7(1), pp. 217-231, https://doi.org/10.15678/EBER.2019.070113, 2019.

[Thrun/Ultsch, 2018] Thrun, M. C., & Ultsch, A.: Effects of the payout system of income taxes to municipalities in Germany, in Papiez, M. & Smiech,, S. (eds.), Proc. 12th Professor Aleksander Zelias International Conference on Modelling and Forecasting of Socio-Economic Phenomena, pp. 533-542, Cracow: Foundation of the Cracow University of Economics, Cracow, Poland, 2018.

[Thrun et al., 2020] Thrun, M. C., Gehlert, T. & Ultsch, A.: Analyzing the Fine Structure of Distributions, PLoS ONE, Vol. 15(10), pp. 1-66, DOI 10.1371/journal.pone.0238835, 2020.

[Tukey, 1977] Tukey, J. W.: Exploratory data analysis, United States Addison-Wesley Publishing Company, ISBN: 0-201-07616-0, 1977.

A Quick Tour in Data Visualizations

Michael C. Thrun

10 Oct 2023