MRReg: MDL Multiresolution Linear Regression Framework


This package provides a framework for analyzing multiresolution partitions (e.g. country, province, subdistrict), where each individual data point belongs to exactly one partition in each layer (e.g. individual i belongs to subdistrict A, province P, and country Q).

We assume that a partition in a higher layer subsumes lower-layer partitions (e.g. a nation at the 1st layer subsumes all provinces at the 2nd layer).

We are given N individuals, each with a pair of real values (x,y) generated from an independent variable X and a dependent variable Y. Each individual i belongs to one partition per layer.

Our goal is to find the partitions, at the highest possible layer, in which all individuals share the same linear model Y=f(X), where f is a linear function.
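To make the setting concrete, here is a minimal base-R sketch that does not use MRReg. The toy data, the label names, and the slopes are assumptions made purely for illustration: two provinces, each containing two subdistricts, where the linear model is shared within a province but differs between provinces.

```r
set.seed(1)

# Hypothetical toy data: 2 provinces, each with 2 subdistricts, 50 individuals each.
n_per <- 50
labels <- data.frame(
  country     = "Q",
  province    = rep(c("P1", "P2"), each = 2 * n_per),
  subdistrict = rep(c("A", "B", "C", "D"), each = n_per)
)
x <- rnorm(nrow(labels))

# Assumed ground truth: each province has its own slope, shared by its subdistricts.
slope <- ifelse(labels$province == "P1", 2, -1)
y <- slope * x + rnorm(nrow(labels), sd = 0.1)
toy <- cbind(labels, x = x, y = y)

# Fit one linear model per subdistrict and collect the estimated slopes.
slopes <- sapply(
  split(toy, toy$subdistrict),
  function(d) coef(lm(y ~ x, data = d))[["x"]]
)
print(round(slopes, 2))
```

In this toy setup, subdistricts A and B recover one slope and C and D another, so the provinces (not the country and not the subdistricts) are the highest layer at which individuals share a model.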

The framework uses the Minimum Description Length (MDL) principle to infer solutions.
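The intuition behind MDL can be illustrated with a rough sketch in base R. The description length below is a crude BIC-style two-part approximation chosen for illustration; it is an assumption and not the criterion implemented inside MRReg. The function names desc_length and total_dl are hypothetical. The idea is that one pooled model is preferred when it describes the data in fewer bits than separate per-partition models.

```r
# Crude two-part description length (in bits) for a least-squares fit:
# data cost (n/2) * log2(RSS / n) plus parameter cost (k/2) * log2(n).
# This is a BIC-style approximation, NOT the criterion used inside MRReg.
desc_length <- function(fit) {
  n <- length(residuals(fit))
  k <- length(coef(fit)) + 1 # coefficients plus the noise variance
  rss <- sum(residuals(fit)^2)
  (n / 2) * log2(rss / n) + (k / 2) * log2(n)
}

set.seed(2)
g <- rep(c("A", "B"), each = 1000) # two hypothetical partitions
x <- rnorm(2000)

# Case 1: both partitions share one model.
y_same <- 2 * x + rnorm(2000, sd = 0.5)
# Case 2: each partition has its own slope.
y_diff <- ifelse(g == "A", 2, -2) * x + rnorm(2000, sd = 0.5)

# Total description length of one pooled model or of one model per partition.
total_dl <- function(y, pooled) {
  if (pooled) return(desc_length(lm(y ~ x)))
  sum(sapply(c("A", "B"), function(p) desc_length(lm(y[g == p] ~ x[g == p]))))
}

prefer_pooled_same <- total_dl(y_same, TRUE) < total_dl(y_same, FALSE)
prefer_pooled_diff <- total_dl(y_diff, TRUE) < total_dl(y_diff, FALSE)
print(c(same = prefer_pooled_same, diff = prefer_pooled_diff))
```

Under these assumptions, the extra parameters of the separate models are only worth their cost when the partitions really follow different models.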

Installation

To install the newest version from GitHub, run the following command in an R console.

remotes::install_github("DarkEyes/MRReg")

The “remotes” package must be installed before installing MRReg (e.g. with install.packages("remotes")).

Example: Inferred optimal homogeneous partitions

In the first step, we generate a simulation dataset.

All simulation types have three layers, except type 4, which has four layers.

In the type-1 simulation, all individuals belong to the same homogeneous partition in the first layer.

The type-2 simulation has four homogeneous partitions in the second layer. Each partition has its own model.

The type-3 simulation has eight homogeneous partitions in the third layer. Each partition has its own model.

The type-4 simulation has one homogeneous partition in the second layer, four homogeneous partitions in the third layer, and eight homogeneous partitions in the fourth layer. Each partition has its own model.

In this example, we use type-4 simulation.

library(MRReg)

# Generate type-4 simulation data with 100 individuals per homogeneous partition.
DataT <- SimpleSimulation(100, type = 4)

gamma <- 0.05 # Gamma parameter

# Infer the optimal homogeneous partitions.
out <- FindMaxHomoOptimalPartitions(DataT, gamma)

Then we plot the optimal homogeneous tree.

plotOptimalClustersTree(out)

The red nodes are homogeneous partitions. All children of a homogeneous partition node share the same linear model.

Lastly, we can print the result in text form.

PrintOptimalClustersResult(out, selFeature = TRUE)

The result is below.

[1] "========== List of Optimal Clusters =========="
[1] "Layer2,ClS-C1:clustInfoRecRatio=0.08,modelInfoRecRatio=0.72, eta(C)cv=1.00"
[1] "Selected features"
[1] 2
[1] "Layer3,ClS-C11:clustInfoRecRatio=0.10,modelInfoRecRatio=0.63, eta(C)cv=1.00"
[1] "Selected features"
[1] 2
[1] "Layer3,ClS-C12:clustInfoRecRatio=0.10,modelInfoRecRatio=0.70, eta(C)cv=1.00"
[1] "Selected features"
[1] 3
[1] "Layer3,ClS-C13:clustInfoRecRatio=0.10,modelInfoRecRatio=0.68, eta(C)cv=1.00"
[1] "Selected features"
[1] 4
[1] "Layer3,ClS-C14:clustInfoRecRatio=0.09,modelInfoRecRatio=0.61, eta(C)cv=1.00"
[1] "Selected features"
[1] 5
[1] "Layer4,ClS-C21:clustInfoRecRatio=NA,modelInfoRecRatio=0.61, eta(C)cv=1.00"
[1] "Selected features"
[1] 2
[1] "Layer4,ClS-C22:clustInfoRecRatio=NA,modelInfoRecRatio=0.58, eta(C)cv=1.00"
[1] "Selected features"
[1] 3
[1] "Layer4,ClS-C23:clustInfoRecRatio=NA,modelInfoRecRatio=0.61, eta(C)cv=1.00"
[1] "Selected features"
[1] 4
[1] "Layer4,ClS-C24:clustInfoRecRatio=NA,modelInfoRecRatio=0.46, eta(C)cv=1.00"
[1] "Selected features"
[1] 5
[1] "Layer4,ClS-C25:clustInfoRecRatio=NA,modelInfoRecRatio=0.55, eta(C)cv=1.00"
[1] "Selected features"
[1] 6
[1] "Layer4,ClS-C26:clustInfoRecRatio=NA,modelInfoRecRatio=0.60, eta(C)cv=1.00"
[1] "Selected features"
[1] 7
[1] "Layer4,ClS-C27:clustInfoRecRatio=NA,modelInfoRecRatio=0.63, eta(C)cv=1.00"
[1] "Selected features"
[1] 8
[1] "Layer4,ClS-C28:clustInfoRecRatio=NA,modelInfoRecRatio=0.61, eta(C)cv=1.00"
[1] "Selected features"
[1] 9
[1] "min eta(C)cv:1.000000"

Note on selected features: index 1 is reserved for the intercept, and an index d > 1 is reported when X[i,d-1] is a selected predictor of Y[i] in the linear model, i.e. d refers to column d-1 of X. The clustInfoRecRatio values are always NA for last-layer partitions.
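The index convention above can be sketched with a small helper in base R. The matrix X, its column names, and the function selected_to_names are hypothetical and are not part of MRReg; the helper only translates reported indices back to column names.

```r
# Hypothetical design matrix with three named features.
X <- cbind(age = c(31, 45, 52), size = c(2, 4, 3), edu = c(12, 16, 9))

# Map selected-feature indices to column names of X.
# Index 1 is the intercept; index d > 1 refers to column d - 1 of X.
selected_to_names <- function(selected, X) {
  ifelse(selected == 1, "(Intercept)", colnames(X)[pmax(selected - 1, 1)])
}

print(selected_to_names(c(1, 3), X))
```

With this mapping, a reported index of 2 corresponds to the first column of X, 3 to the second column, and so on.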

Explanation: FindMaxHomoOptimalPartitions(DataT,gamma)

Citation

Chainarong Amornbunchornvej, Navaporn Surasvadi, Anon Plangprasopchok, and Suttipong Thajchayapong (2021). Identifying Linear Models in Multi-Resolution Population Data using Minimum Description Length Principle to Predict Household Income. ACM Transactions on Knowledge Discovery from Data (TKDD), 15(2), 1-30. https://doi.org/10.1145/3424670

Contact