The preproviz package takes a data set and constructs features from it that express the quality of each data point, such as the number of missing values in the point. The constructed features are visualized to help the analyst understand how data quality issues can be interdependent.
library(preproviz)
# Copy iris and set rows 10 to 20 of the first column to missing
demoiris <- iris
demoiris[10:20,1] <- NA
a <- preproviz(demoiris)
## [1] "Data in process: controlobject"
plotBAR(a)
In the barplot above, 11 of the 150 rows have missing values, and one data point has an LOF score that clearly differs from the rest.
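The missing-value count read from the barplot can be checked directly in base R, independently of the package. This is a minimal sketch that rebuilds demoiris so that it runs on its own; the name rows_with_na is introduced here for illustration only.

```r
# Rebuild the demo data: rows 10 to 20 of the first column are set to NA
demoiris <- iris
demoiris[10:20, 1] <- NA

# Indices of rows that contain at least one missing value
rows_with_na <- which(!complete.cases(demoiris))

length(rows_with_na)  # 11
nrow(demoiris)        # 150
```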
plotHEATMAP(a)
plotVARCLUST(a)
We can see that Scattercounter and ClassificationCertainty are relatively close to each other, as are DistanceToNearest and the LOF score.
plotCMDS(a)
There is a cluster of points at the bottom of the plot that are similar to each other as far as data quality scores are concerned.
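To illustrate the idea behind this kind of plot, the sketch below runs classical multidimensional scaling with base R's cmdscale on two simple per-row scores. It assumes that plotCMDS refers to classical MDS; the scores missing_share and row_sd are hypothetical stand-ins, not the constructed features of the package, and the package's own computation may differ.

```r
# Rebuild the demo data used above
demoiris <- iris
demoiris[10:20, 1] <- NA
num <- demoiris[sapply(demoiris, is.numeric)]

# Two simple per-row quality scores (illustrative stand-ins only)
scores <- data.frame(
  missing_share = rowSums(is.na(num)) / ncol(num),
  row_sd        = apply(num, 1, sd, na.rm = TRUE)
)

# Classical MDS on Euclidean distances between standardized score vectors
coords <- cmdscale(dist(scale(scores)), k = 2)

# Rows with similar quality scores end up close to each other
plot(coords, xlab = "Dimension 1", ylab = "Dimension 2")
```

Points that lie close together in the resulting plot have similar score vectors, which is the sense in which a cluster of points is "similar as far as data quality scores are concerned".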
plotVARIMP(a)
The package supports comparison of multiple data sets or of different versions of the same data set.
User-defined constructed features can be added.
MissingValueShare, count the number of missing values on a row and divide it by the total number of features
LenghtOfIQR, min-max normalize the data and compute the length of the IQR for each row
DistanceToNearest, min-max normalize the data and compute the Euclidean distance to the nearest data point
ClassificationCertainty, compute the highest random forest class probability
ClassificationScatter, compute the scatter in a neighborhood of 1/10 of the data set size
NearestPointPurity, check whether the nearest point is of the same class or not
NeighborhoodDiversity, count the number of dominant class points in a neighborhood of 1/10 of the data set size and divide it by the total number of classes in the data set
LOFScore, compute the LOF score
MahalanobisDistance, compute the Mahalanobis distance to the class center
ClusteringTendency, compute the Hopkins statistic without a row and divide it by the Hopkins statistic for all rows
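The first three definitions in the list can be sketched in base R as follows. This is an illustration of the descriptions only, not the package's implementation: it assumes that only the numeric columns count as features, that min-max normalization is done per column, and that missing values are ignored where possible. The names minmax, missing_value_share, length_of_iqr and distance_to_nearest are hypothetical.

```r
# Demo data with 11 missing values in the first column
demoiris <- iris
demoiris[10:20, 1] <- NA

# Assumption: only numeric columns are treated as features
num <- demoiris[sapply(demoiris, is.numeric)]

# Min-max normalize one column to [0, 1], ignoring missing values
# (a constant column would divide by zero; iris has none)
minmax <- function(v) {
  rng <- range(v, na.rm = TRUE)
  (v - rng[1]) / (rng[2] - rng[1])
}
norm <- as.data.frame(lapply(num, minmax))

# MissingValueShare: missing values per row divided by number of features
missing_value_share <- rowSums(is.na(num)) / ncol(num)

# LenghtOfIQR: interquartile range of each normalized row
length_of_iqr <- apply(norm, 1, IQR, na.rm = TRUE)

# DistanceToNearest: Euclidean distance to the closest other row
d <- as.matrix(dist(norm))
diag(d) <- NA  # exclude the distance of a row to itself
distance_to_nearest <- apply(d, 1, min, na.rm = TRUE)

head(data.frame(missing_value_share, length_of_iqr, distance_to_nearest))
```

Each result is a vector with one value per row of the data, which is the shape a constructed feature needs in order to be compared and visualized alongside the others.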
Constructed features build on the work of:
Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. (2000). LOF: Identifying Density-Based Local Outliers, Proc. ACM SIGMOD 2000 Int. Conf. on Management of Data, pp. 93-104
Breiman, L. (2001). Random Forests, Machine Learning, vol. 45, pp. 5-32
Juhola, M. and Siermala, M. (2012). A scatter method for data and variable importance evaluation, Integrated Computer-Aided Engineering, vol. 19, no. 2, pp. 137-149
Lawson, R. G. and Jurs, P. C. (1990). New index for clustering tendency and its application to chemical problems, Journal of Chemical Information and Computer Sciences, vol. 30, no. 1, pp. 36-41
An article with 6 cases from the business performance measurement system domain is a work in progress.