application-on-real-data-with-na

The problem

Two possible measurements scales of the social class of the participants built from profession can have been collected at each updates of data collection.

The first one corresponds to the Goldthorp Social Class 90 scale (GO90), a scale with 12 categories, while the second one corresponds to the RG Social Class 91 scale (RG91), a scale with only 7 categories respectively described in the documentation of the ncds_14 and ncds_5 databases.

If during the first survey waves, the information related to the social level of each participant is collected using the GO90 scale, it is the RG91 scale that was used exclusively from wave 5. This variation in data collection could pose a problem related to the exploitation of this variable, when including new participants in cohort studies, especially because there is actually no overlaps bethween the two databases.

Question: how to work with the greatest number of participants encountered in the study without losing the information related to the individual social level?

Answer: We will see in this vignette how the use of the OTrecod functions can provide a concrete answer to this question.

The solution

Functions from the OTrecod package can help users to solve the problem introduced in the previous section by recoding the missing scale GO90 or/and RG91 in the corresponding databases as described in the following paragraphs dedicated to the imputation of GO90 in ncds_5. However, it would be obvioulsy possible to impute RG91 in ncds_14 or both in a similar process.

Package installation

If the package OTrecod is not installed in their current R versions, users can install it by following the standard instruction:

install.packages("OTrecod")

Each time an R session is opened, the OTrecod library must be loaded with:

library(OTrecod)

Moreover, the development version of OTrecod can be installed actually from GitHub with:

# Install development version from GitHub
devtools::install_github("otrecoding/OTrecod")

databases installation

Participants from waves 1 to 4 were stored in a database called ncds_14 provided in the OTrecod package. In the same way, informations related to new participants included in the study from the wave 5 were stored in the ncds_5 database. These two base are available in OTrecod and can be simply loaded as follows:

# load("C:\\Users\\secr\\Desktop\\newpack\\datas_vignet2.RData")
data(ncds_14); data(ncds_5)

Here is the description of all the variables contained in each of the databases:

summary(ncds_14); summary(ncds_5)
#>     ncdsid               GO90      health         employ      gender    
#>  Length:5476        20     :876   1   :1596   1      :1436   1   :2262  
#>  Class :character   71     :840   2   :2429   7      :1250   2   :2408  
#>  Mode  :character   10     :766   3   : 516   4      : 777   NA's: 806  
#>                     31     :727   4   :  72   3      : 431              
#>                     60     :509   NA's: 863   6      :  89              
#>                     (Other):952               (Other):  68              
#>                     NA's   :806               NA's   :1425              
#>   study     
#>  1   :3006  
#>  2   :1664  
#>  NA's: 806  
#>             
#>             
#>             
#> 
#>     ncdsid          gender  RG91     health        employ    study  
#>  Length:365         1:178   0 : 8   1   :121   1      :102   1:231  
#>  Class :character   2:187   10:18   2   :195   7      : 87   2:134  
#>  Mode  :character           20:95   3   : 41   4      : 76          
#>                             31:88   4   :  6   3      : 32          
#>                             32:74   NA's:  2   6      :  7          
#>                             40:68              (Other):  3          
#>                             50:14              NA's   : 58

From these descriptive statistics, we can make the following remarks:

Each of the files contains only one scale measuring the social class of the participants: the ncds_14 database measures the social class from the G090 scale only, which currently makes it impossible to study this variable on all the waves of data collection carried out.
The other variables seem at first sight labeled and coded in a similar way from one dataset to another.
Some variables have incomplete informations (including social class scales in ncds_14) and the missingness structures seems independant from the databases.
There are only a few variables contained in the two databases.
Same labels predictors are not sorted in the same order from one database to another.

Handling missing information

Firsly, JOINT algorithms from the OT_joint function of the package requires complete predictors to run, which is not the case for OUTCOME algorithms implemented in the OT_outcome function. Keeping this information in mind, we know that for a comparison of these two families of algorithm, it will be necessary to work either with complete cases predictors or to carry out a preliminary imputation on them before the data fusion.

The merge_dbs function can solve problems 2 and 3 introduced in the previous paragraph:

by removing from the upcoming data fusion, participants whose social class informations are missing from their databases (corresponding to 806 participants in ncds_14).
by detecting and excluding predictors, labeled and/or coded differently from one database to another.
possibly imputing incomplete predictors using MICE or FAMD procedures (see the documentation of imput_cov for mored details about the methods).
finally, by superimposing the two datasets by only keeping the shared predictors.

Here is the R code related to a MICE imputation (with only 2 replications for running time reasons of the vignette). Using the MICE procedure directly integrated in merge_dbs supposes that the user accepts the hypothesis that all predictors will participate to the imputation.

In situation where this hypothesis does not hold but imputation still appear as necessary, the imputation of missing values will have to be done through another function (not provided by the package).

merged_tab <- merge_dbs(ncds_14, ncds_5,
                        row_ID1 = 1, row_ID2 = 1,
                        NAME_Y = "GO90", NAME_Z = "RG91",
                        ordinal_DB1 = 3, ordinal_DB2 = 4,
                        impute = "MICE", R_MICE = 2,
                        seed_choice = 3023)
#> DBS MERGING in progress. Please wait ...
#> DBS MERGING OK
#> -----------------------
#> 
#> SUMMARY OF DBS MERGING:
#> Nb of removed subjects due to NA on targets: 806(14%)
#> Nb of removed covariates due to differences between the 2 bases: 0
#> Nb of remained covariates: 4
#> Imputation on incomplete covariates: MICE

Process details indicate here that 806 participants have been deleted from the data fusion because of missing information on GO90 in the ncds_14 database.

By definition, GO90 in ncds_14 and RG91 in ncds_5 does not seem to correspond clearly to ordered scales and can be considered as simple factors. This is the case here, because the only variable defined as an ordered factor in the study is the health variable. This variable correspond to the third column of ncds_14 and to the fourth column of ncds_5. A seed value is arbitrarily fixed to 3023 here to guarantee the reproductibility of the results in the vignette.

In the output list, The DB_READY object provides the overlayed database with the shared predictors imputed as described by:

head(merged_tab$DB_READY); dim(merged_tab$DB_READY)
#>      DB  Y    Z employ gender health study
#> 5708  1 20 <NA>      3      1      1     1
#> 5709  1 71 <NA>      1      1      2     1
#> 5710  1 32 <NA>      1      2      1     1
#> 5711  1 60 <NA>      7      1      1     1
#> 5712  1 31 <NA>      7      2      2     1
#> 5714  1 31 <NA>      1      1      3     1
#> [1] 5035    7

All predictors same labels were finally kept: their coding being visibly in adequacy from one dataset to another.

Matching predictors evaluation

Here, the limited number of predictors at the beginning of the study makes it unnecessary to use a selection procedure like random forest, nevertheless the select_pred function can provide users interesting information about the association of predictors, their ability to fit the target scales in each dataset, and their risks of collinearities, especially if it is necessary to further reduce the set of sharing predictors before recoding scales.

Keeping too many predictors for the fusion does not improve its quality and greatly increases the resolution time of the algorithms, that is why, as an advice, keeping only 3 predictors for each fusion seems to be a good compromise between computation time and efficiency.

Let’s use the results of the select_pred function to only keep 3 predictors for data fusion (instead of the four currently present).

Here is the related R code taking the database obtained after executing merge_dbs as argument, and the variables GO90 and RG91 as respective outcomes in the two datasets:

# ncds_14
sel_pred_GO90 <- select_pred(merged_tab$DB_READY,
                             Y = "Y", Z = "Z",
                             ID = 1 , OUT = "Y",
                             nominal = c(1:5,7), ordinal = 6,
                             RF = FALSE)
#> The select_pred function is running for outcome= Y. Please wait ...
#> The process is now successfully completed
#> ---------
#> For comparison with another outcome from two overlayed tables  : 
#> just adapt the OUT option keeping all the others unchanged in the function
#> ---
#> For comparison with another outcome from two unoverlayed tables:
#> just adapt the arguments from Y to convert_class
#> ---------

# ncds_5
sel_pred_RG91 <- select_pred(merged_tab$DB_READY, Y = "Y", Z = "Z",
                             ID = 1, OUT = "Z",
                             nominal = c(1:5,7), ordinal = 6,
                             RF = FALSE)
#> The select_pred function is running for outcome= Z. Please wait ...
#> The process is now successfully completed
#> ---------
#> For comparison with another outcome from two overlayed tables  : 
#> just adapt the OUT option keeping all the others unchanged in the function
#> ---
#> For comparison with another outcome from two unoverlayed tables:
#> just adapt the arguments from Y to convert_class
#> ---------

Automatic exits inform the user about the state of the process and assist him in the implementation of the function.

We first study the existence of potential collinearity between the predictors. At the acceptability thresholds set by default as input to the merge_dbs function, and in particular, the one associated with the V-Cramer criterion, we detect collinearity in the two bases, between the variables employ and gender, as described below:

## ncds_14
sel_pred_GO90$collinear_PB$VCRAM
#>     name1  name2 V_Cramer CorrV_Cramer    N
#> 21 employ gender   0.4831       0.4819 4670

## ncds_5
sel_pred_RG91$collinear_PB$VCRAM
#>     name1  name2 V_Cramer CorrV_Cramer   N
#> 21 employ gender   0.6169       0.6042 365

This result suggests that we should keep only one of these two predictors. But which of the two should we keep? employ or gender ?

To help us decide, we can simply observe the ability of the two variables to predict the target variables in the two databases:

## ncds_14
sel_pred_GO90$vcrm_OUTC_cat
#>   name1  name2 V_Cramer CorrV_Cramer    N
#> 3     Y gender   0.4911       0.4888 4670
#> 5     Y  study   0.4491       0.4465 4670
#> 2     Y employ   0.1765       0.1698 4670
#> 4     Y health   0.1141       0.1033 4670

## ncds_5
sel_pred_RG91$vcrm_OUTC_cat
#>   name1  name2 V_Cramer CorrV_Cramer   N
#> 5     Z  study   0.4507       0.4326 365
#> 3     Z gender   0.4367       0.4180 365
#> 2     Z employ   0.2070       0.1637 365
#> 4     Z health   0.1579       0.0923 365

These results present two variables whose ability to predict the target variable is equivalent in the two databases. How do these variables interact with the other predictors?

## ncds_14
sel_pred_GO90$vcrm_X_cat
#>     name1  name2 V_Cramer CorrV_Cramer    N
#> 21 employ gender   0.4831       0.4819 4670
#> 41 employ  study   0.2318       0.2290 4670
#> 23 health  study   0.1623       0.1603 4670
#> 31 employ health   0.1167       0.1111 4670
#> 32 gender  study   0.0720       0.0705 4670
#> 22 gender health   0.0433       0.0351 4670

## ncds_5
sel_pred_RG91$vcrm_X_cat
#>     name1  name2 V_Cramer CorrV_Cramer   N
#> 21 employ gender   0.6169       0.6042 365
#> 41 employ  study   0.2979       0.2692 365
#> 31 employ health   0.2676       0.2358 365
#> 23 health  study   0.1873       0.1640 365
#> 22 gender health   0.0716       0.0000 365
#> 32 gender  study   0.0017       0.0000 365

The stronger association in both datasets of the variable employ with study, compared to that observed with the variable gender, would rather encourage us to keep gender for the data fusion. Even if, it is also true that this selection criterion remains very tenuous here.

We therefore decide to keep the study, gender and health predictors to solve our recoding problem.

merged_fin = merged_tab$DB_READY[, -4]
head(merged_fin, 3)
#>      DB  Y    Z gender health study
#> 5708  1 20 <NA>      1      1     1
#> 5709  1 71 <NA>      1      2     1
#> 5710  1 32 <NA>      2      1     1

Imputation of GO90 using optimal transportation theory

The imputation of GO90 in the ncds_5 can be the result of two family of algorithms in OTrecod: OUTCOME or JOINT. Both of them used optimal transportation theory and can be eventually improved by relaxing the distributional assumptions.

OUTCOME algorithms are implemented in the OT_outcome function while JOINT algorithms are implemented in OT_JOINT.

It is suggested to fit and compare several models using the output criteria of the verif_OT function, to determine which one provides the better set of GO90 predictions in ncds_5.

Here is the R commands to fit an OUTCOME model and a JOINT model with no relaxation and regularization parameter.

outc1 = OT_outcome(merged_fin, nominal = c(1:4), ordinal = 5:6,
                   dist.choice = "E", indiv.method = "sequential",
                   which.DB = "B")
#> ---------------------------------------
#> OT PROCEDURE in progress ...
#> ---------------------------------------
#> Type                     = OUTCOME
#> Distance                 = Euclidean
#> Percent closest knn      = 100%
#> Relaxation parameter     = NO
#> Relaxation value         = 0
#> Individual pred process  = Sequential
#> DB imputed               = B
#> ---------------------------------------

outj1 = OT_joint(merged_fin, nominal = 1:4, ordinal = 5:6,
                 dist.choice = "E", which.DB = "B")
#> ---------------------------------------
#> OT JOINT PROCEDURE in progress ...
#> ---------------------------------------
#> Type                  = JOINT
#> Distance              = Euclidean
#> Percent closest       = 100%
#> Relaxation term       = 0
#> Regularization term   = 0
#> Aggregation tol cov   = 0.3
#> DB imputed            = B
#> ---------------------------------------

Now let’s see the validity criteria of the two models using the verif_OT function. we simply want to retrieve the stability criteria and Cramer’s V which compares the association of the two different (unordered) scales and so complete the arguments according to this choice:


### For the model outc1
verif_outc1   = verif_OT(outc1, ordinal = FALSE, 
                         stab.prob = TRUE, min.neigb = 5)

### For the model outj1
verif_outj1   = verif_OT(outj1, ordinal = FALSE, 
                         stab.prob = TRUE, min.neigb = 5)

Firstly, the estimates of the Hellinger distance for the two models (both upper 0.05) confirm that the two scales could respectively share the same distribution in the two databases. Consequently, there is no need here to add relaxation parameters in the models to have acceptable predictions of GO90 in ncds_5.


### For the model outc1: Hellinger distance
verif_outc1$hell
#>                 YA_YB ZA_ZB
#> Hellinger dist. 0.085    NA

### For the model outj1: Hellinger distance
verif_outj1$hell
#>                 YA_YB ZA_ZB
#> Hellinger dist. 0.208    NA

The V Cramer criterion informs about the strength of the association between the two scales. Even if they have specific encodings, GO90 and RG91 summarize a same information: the social class of the participant. Therefore, we can reasonably think that the stronger will be the association, the more reliable will be the predictions.


### For the model outc1: Hellinger distance
verif_outc1$res.prox
#>        N   V_cram rank_cor 
#>  365.000    0.640    0.736

### For the model outj1: Hellinger distance
verif_outj1$res.prox
#>        N   V_cram rank_cor 
#>  365.000    0.740    0.899

The large number of modalities of GO90 explains the relative stability observed for the predictions:


### For the model outc1: Hellinger distance
verif_outc1$res.stab
#>          N min.N mean    sd
#> 2nd DB 270     5 0.77 0.288

### For the model outj1: Hellinger distance
verif_outj1$res.stab
#>          N min.N  mean    sd
#> 2nd DB 270     5 0.644 0.152