medicalrisk: Calculating risk and comorbidities from ICD-9-CM codes

Patrick McCormick
patrick.mccormick@alum.mit.edu

2015-01-05

Introduction

Administrative healthcare data is frequently the only available source for determining individual risk of mortality when looking at thousands or millions of patient records. Medical chart abstraction just isn't feasible for projects of this scale.

In the United States, the record for every inpatient and outpatient encounter is reviewed by a qualified medical coder, who assigns a set of diagnosis and procedure codes based on phrases within the medical record. The coding system currently in use is ICD-9-CM, an adaptation of the venerable ICD-9 standard, which was developed in 1978. The U.S. National Center for Health Statistics (NCHS) developed ICD-9-CM, which has been required for Medicare and Medicaid claims since 1979 and is updated annually.

At some point, perhaps as soon as October 2015, ICD-10-CM codes will need to be used instead. It is likely that “dual coding” of claims in both sets will continue for some time.

In the meantime, there is a wealth of administrative data available within the ICD-9-CM diagnostic and procedural codes stored within US healthcare systems. The routines in this package are designed to help determine comorbidity and medical risk status of a given patient using several popular models published in the peer-reviewed literature.

Working with ICD-9-CM Data

To demonstrate its use, this package includes data on 100 patients from the 2011 Vermont Uniform Hospital Discharge Data Set (inpatient).

library(medicalrisk)
library(plyr)
data(vt_inp_sample)
x <- count(vt_inp_sample, c('id'))
cat("average count of ICD codes per patient is: ", mean(x$freq))
## average count of ICD codes per patient is:  11.52
y <- count(vt_inp_sample, c('icd9cm'))
# top 5 most popular ICD-9-CM codes in this dataset
print(head(y[order(-y$freq),], n=5), row.names=F)
##  icd9cm freq
##   D4019   34
##  D53081   22
##   D2724   19
##   D3051   18
##  D25000   17

Within this package, each ICD-9-CM code is represented as a string whose first letter is “P” for a procedure code or “D” for a diagnosis code. The rest of the string is the code itself, with periods omitted. In the list above, “D4019” is diagnosis code 401.9, which corresponds to unspecified essential hypertension.
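
The string format described above can be illustrated with a small base R sketch. The helpers to_prefixed and from_prefixed are hypothetical names introduced here for illustration; they are not part of medicalrisk. The period placement assumes the usual ICD-9-CM layout (after the third character for diagnosis codes, after the fourth for E codes, and after the second digit for procedure codes).

```r
# Hypothetical helpers (not part of medicalrisk) for the prefixed code format.

# "401.9" -> "D4019": drop the period and prepend the type letter.
to_prefixed <- function(code, type = c("D", "P")) {
  type <- match.arg(type)
  paste0(type, gsub(".", "", code, fixed = TRUE))
}

# "D4019" -> "401.9": strip the type letter and restore the period.
# Assumption: period after the 3rd character for diagnosis codes,
# after the 4th for E codes, and after the 2nd for procedure codes.
from_prefixed <- function(x) {
  type <- substr(x, 1, 1)
  body <- substring(x, 2)
  split_at <- ifelse(type == "P", 2,
                     ifelse(substr(body, 1, 1) == "E", 4, 3))
  ifelse(nchar(body) > split_at,
         paste0(substr(body, 1, split_at), ".", substring(body, split_at + 1)),
         body)
}

to_prefixed(c("401.9", "250.00"))
from_prefixed(c("D4019", "D25000", "P3961"))
```

Under these assumptions the first call returns "D4019" and "D25000", and the second returns "401.9", "250.00", and "39.61".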

Comorbidity Maps

The package includes a set of mapping functions that transform a list of ICD-9-CM codes into a comorbidity matrix:

“Charlson” refers to the Charlson Comorbidity Index. The names “Deyo”, “Romano”, and “Quan” refer to the primary authors of different methods of determining Charlson comorbidities from ICD-9-CM codes.

“Elixhauser” refers to the Elixhauser comorbidities, which form a more detailed list than Charlson. “AHRQ37” is an adaptation of the AHRQ version 37 software. “Quan” refers to the same paper by Quan mentioned above.

“RCRI” is the Revised Cardiac Risk Index set of categories using a method published by Boersma.

For example, the fifth most common ICD-9-CM code above is D25000, or “250.00”, which is for “Diabetes Mellitus Unspecified Type”. Here is what happens when that code is passed to a few of the mapping functions:

icd9cm_charlson_quan(c('D25000'))
##           mi   chf perivasc   cvd dementia chrnlung rheum ulcer liver   dm
## D25000 FALSE FALSE    FALSE FALSE    FALSE    FALSE FALSE FALSE FALSE TRUE
##         dmcx  para renal tumor modliver  mets  aids
## D25000 FALSE FALSE FALSE FALSE    FALSE FALSE FALSE
icd9cm_elixhauser_ahrq37(c('D25000'))
##          chf arrhythmia valve pulmcirc perivasc   htn htncx  para neuro
## D25000 FALSE      FALSE FALSE    FALSE    FALSE FALSE FALSE FALSE FALSE
##        chrnlung   dm  dmcx hypothy renlfail liver ulcer  aids lymph  mets
## D25000    FALSE TRUE FALSE   FALSE    FALSE FALSE FALSE FALSE FALSE FALSE
##        tumor rheum  coag obese wghtloss lytes bldloss anemdef alcohol
## D25000 FALSE FALSE FALSE FALSE    FALSE FALSE   FALSE   FALSE   FALSE
##         drug psych depress
## D25000 FALSE FALSE   FALSE
icd9cm_rcri(c('D25000'))
##          chf   cvd   dm ischemia renlfail
## D25000 FALSE FALSE TRUE    FALSE    FALSE

In each of these maps, the “dm” column is TRUE.

The most efficient way to use these maps for a set of patients is to generate a single map for all ICD-9-CM codes in the set and then apply that map to each patient. Here's an example that generates a comorbidity matrix for the first five patients in the Vermont dataset:

cases <- vt_inp_sample[vt_inp_sample$id %in% 1:5, c('id','icd9cm')]
cases_with_cm <- merge(cases, icd9cm_charlson_quan(levels(cases$icd9cm)), 
   by.x="icd9cm", by.y="row.names", all.x=TRUE)

# generate crude comorbidity summary for each patient
print(
  ddply(cases_with_cm, .(id), 
   function(x) { data.frame(lapply(x[,3:ncol(x)], any)) }),
  row.names=F)
##  id    mi   chf perivasc   cvd dementia chrnlung rheum ulcer liver    dm
##   1 FALSE  TRUE    FALSE FALSE    FALSE    FALSE FALSE FALSE FALSE FALSE
##   2 FALSE FALSE    FALSE FALSE    FALSE    FALSE FALSE FALSE FALSE FALSE
##   3 FALSE FALSE    FALSE FALSE    FALSE    FALSE FALSE FALSE FALSE FALSE
##   4 FALSE FALSE    FALSE FALSE    FALSE    FALSE FALSE FALSE FALSE  TRUE
##   5 FALSE FALSE    FALSE FALSE    FALSE     TRUE FALSE FALSE FALSE FALSE
##   dmcx  para renal tumor modliver  mets  aids
##  FALSE FALSE FALSE FALSE    FALSE FALSE FALSE
##  FALSE FALSE FALSE FALSE    FALSE FALSE FALSE
##  FALSE FALSE FALSE FALSE    FALSE FALSE FALSE
##  FALSE FALSE  TRUE FALSE    FALSE FALSE FALSE
##  FALSE FALSE FALSE FALSE    FALSE FALSE FALSE

The above process is encapsulated in a single function, generate_comorbidity_df. This function also includes an optimization from Van Walraven that reduces dmcx to dm if the specific diabetic complication is separately coded.

generate_comorbidity_df(cases, icd9mapfn=icd9cm_charlson_quan)
##   id    mi   chf perivasc   cvd dementia chrnlung rheum ulcer liver    dm
## 1  1 FALSE  TRUE    FALSE FALSE    FALSE    FALSE FALSE FALSE FALSE FALSE
## 2  2 FALSE FALSE    FALSE FALSE    FALSE    FALSE FALSE FALSE FALSE FALSE
## 3  3 FALSE FALSE    FALSE FALSE    FALSE    FALSE FALSE FALSE FALSE FALSE
## 4  4 FALSE FALSE    FALSE FALSE    FALSE    FALSE FALSE FALSE FALSE  TRUE
## 5  5 FALSE FALSE    FALSE FALSE    FALSE     TRUE FALSE FALSE FALSE FALSE
##    dmcx  para renal tumor modliver  mets  aids
## 1 FALSE FALSE FALSE FALSE    FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE    FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE    FALSE FALSE FALSE
## 4 FALSE FALSE  TRUE FALSE    FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE    FALSE FALSE FALSE

This function only considers each ICD-9-CM code once and then merges the resulting comorbidity flags together for each patient. This makes the function quite fast for large data sets.
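
The map-once-then-merge idea can be sketched in base R with toy data. This is an illustration of the strategy only, not the package's implementation: toy_map is a hypothetical stand-in for a mapping function such as icd9cm_charlson_quan, and its two prefix rules are simplified placeholders rather than the published definitions.

```r
# Toy stand-in for a mapping function (hypothetical, simplified rules):
# returns a logical matrix with one row per code and one column per flag.
toy_map <- function(codes) {
  m <- cbind(chf = substr(codes, 1, 4) == "D428",
             dm  = substr(codes, 1, 5) == "D2500")
  rownames(m) <- codes
  m
}

# Made-up records: several patients share the same codes.
records <- data.frame(
  id     = c(1, 1, 2, 2, 3),
  icd9cm = c("D4280", "D4019", "D25000", "D4019", "D25000"),
  stringsAsFactors = FALSE
)

# Step 1: map each distinct code exactly once (3 codes, not 5 records).
code_map <- toy_map(unique(records$icd9cm))

# Step 2: look up the flags for every record by code.
flags <- code_map[records$icd9cm, , drop = FALSE]
rownames(flags) <- NULL

# Step 3: collapse to one row per patient; a flag is TRUE if any record sets it.
per_patient <- aggregate(as.data.frame(flags),
                         by = list(id = records$id), FUN = any)
print(per_patient)
```

The expensive mapping step runs once per distinct code, so its cost grows with the size of the code vocabulary rather than with the number of records.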

Given appropriate arguments, the generate_comorbidity_df function will use the parallel backend provided by foreach to improve performance.

Comorbidity Index

It is common in the medical literature to see a set of comorbidities reduced to an index. When the Charlson Comorbidity Index was first published it had the following weights for each comorbidity:

data(charlson_weights_orig)
print(t(charlson_weights_orig), row.names=F)
##      mi chf perivasc cvd dementia chrnlung rheum ulcer liver dm dmcx para
## [1,] 1  1   1        1   1        1        1     1     1     1  2    2   
##      renal tumor modliver mets aids
## [1,] 2     2     3        6    6

However, these weights have not stood the test of time. For example, the prognosis for HIV/AIDS has dramatically improved. The medicalrisk package offers the revised Charlson weights developed by Schneeweiss:

data(charlson_weights)
print(t(charlson_weights), row.names=F)
##      mi chf perivasc cvd dementia chrnlung rheum ulcer liver dm dmcx para
## [1,] 1  2   1        1   3        2        0     0     2     1  2    1   
##      renal tumor modliver mets aids
## [1,] 3     2     4        6    4

The generate_charlson_index_df function will sum the weights for each patient to generate a final index:

print(generate_charlson_index_df(generate_comorbidity_df(cases)), row.names=F)
##  id index
##   1     2
##   2     0
##   3     0
##   4     4
##   5     2
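
The index arithmetic can be checked by hand in base R. The sketch below copies the revised weights and the TRUE flags for patients 1, 4, and 5 from the output shown earlier, then computes the weighted sum directly. It illustrates the calculation only and is not the package's implementation.

```r
# Revised weights, copied from the charlson_weights output above.
w <- c(mi = 1, chf = 2, perivasc = 1, cvd = 1, dementia = 3, chrnlung = 2,
       rheum = 0, ulcer = 0, liver = 2, dm = 1, dmcx = 2, para = 1,
       renal = 3, tumor = 2, modliver = 4, mets = 6, aids = 4)

# Comorbidity flags for patients 1, 4, and 5, copied from the
# generate_comorbidity_df output above (all other flags are FALSE).
flags <- matrix(FALSE, nrow = 3, ncol = length(w),
                dimnames = list(c("1", "4", "5"), names(w)))
flags["1", "chf"] <- TRUE
flags["4", c("dm", "renal")] <- TRUE
flags["5", "chrnlung"] <- TRUE

# Weighted sum per patient: convert TRUE/FALSE to 1/0, then multiply.
index <- drop((flags * 1) %*% w)
print(index)
```

Patient 1 scores 2 (chf), patient 4 scores 1 + 3 = 4 (dm plus renal), and patient 5 scores 2 (chrnlung), which agrees with the index values printed above.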

Risk Stratification Index

The Risk Stratification Index uses ICD-9-CM codes to determine four risk estimates:

The author of the paper (Sessler) published SPSS code to perform the calculation. The medicalrisk package implements the RSI calculation using a method based on that SPSS code.

ddply(cases, .(id), function(x) { icd9cm_sessler_rsi(x$icd9cm) } )
##   id rsi_1yrpod rsi_30dlos rsi_30dpod rsi_inhosp
## 1  1 -2.0186043  0.1560323  -1.699242 -1.8483037
## 2  2 -4.1423990  0.8927947  -3.802495 -3.5425015
## 3  3 -2.6265277  0.8311247  -2.910939 -2.8607594
## 4  4 -0.7984382  0.3357922  -1.551285 -0.2669842
## 5  5  2.5803930 -1.7904270   2.455086  1.7615180

Conclusion

The medicalrisk package can be used to generate risk data from ICD-9-CM codes in large datasets. The above discussion describes basic use of the package. Additional helper functions not described above are covered in the per-function documentation.

The aim of this package is to include future medical risk estimation procedures as they are published in the literature.