DIHI Mortality Model Information/Evaluation


About


The mortality models described in this document predict the likelihood of a patient dying during an inpatient stay, within 30 days of admission, and within 6 months of admission. To do this, the models ingest hundreds of variables, including a patient’s prior encounter information, diagnosis codes, problem lists, procedures, medications, orders, laboratory values, vital signs, and more. This document serves as a reference for how the models were trained, the data used to inform them, an evaluation on prospective data, and other general information.

Model Overview



The mortality models use the information described below to predict a patient’s risk of dying within the hospital stay (inpatient mortality), within 30 days of admission, or within 6 months of admission.

Training Set Summary


The training set consists of all patient encounters at the three major Duke hospitals (Duke University Hospital, Duke Raleigh Hospital, and Duke Regional Hospital) between 01-01-2015 and 12-31-2018. Only patients who were 18 years of age or older at the time of the inpatient admission were considered for this model.

Basic stats

                           Total Population   Died Inpatient*   Died within 30 Days*   Died within 6 months*
                           (n=288,116)        (n=6,211)         (n=6,164)              (n=16,279)
Sex (%)
  Male                     44.2               45.8              51.9                   51.8
  Female                   55.8               54.2              48.1                   48.2
Age (Years)
  Median                   60.3               68.8              71.2                   68.9
  Mean                     57.5               67.4              70.4                   67.7
  Std                      18.9               15.5              14.8                   15.2
Hospital (%)
  Duke University          59.9               68.9              59.3                   65.6
  Duke Regional            25.1               18.0              21.7                   18.4
  Duke Raleigh             14.9               12.8              18.7                   15.6
Admission Source (%)
  Home                     78.7               52.8              65.3                   71.2
  Transfer from Hospital   11.3               30.7              16.8                   15.1
  SNF                       1.5                8.3               7.3                    3.5
  Clinic                    5.6                3.9               7.8                    6.8
Admission Type (%)
  Emergency                50.2               81.5              74.4                   66.6
  Urgent                   21.5               13.9              21.1                   23.1
  Routine                  28.2                4.6               4.5                   10.3
Length of Stay (Days)
  Median                    3.5                5.9               5.0                    5.6
  Mean                      5.8               11.2               6.3                    9.8
  Std                       8.9               21.4               4.8                   13.1

* Columns are mutually exclusive

Mortality Stats @ Duke


In our data, the prevalence for the outcomes was as follows:

  • Inpatient Mortality: 2.52%
  • 30-day Mortality: 4.85%
  • 6-month Mortality: 11.47%

Note that the 30-day risk fully encompasses inpatient death (as long as the encounter lasted less than 30 days). The same logic applies to 6-month Mortality.

Using a dataset from 2019, on average 184 patients per day are admitted to a Duke hospital. Of those, 4.5 will go on to die during the inpatient encounter, 8.6 will die within 30 days, and 20.8 will die within 6 months.

Feature Set (Predictors)


Overview


See the Detailed section below for more information on how our predictors were created.

The predictors we used in our model included:

  • Patient History
    • Diagnoses
    • Problem List
    • Procedures
    • Encounters
  • Pre-Admission Features
    • Chief Complaint
    • Means of Arrival
    • Encounter Type
    • Vital Signs
    • Medication Administrations
    • Laboratory Tests
    • Orders

Detailed


The model predictors include elements from the patient history as well as information gathered during the encounter but prior to the inpatient admission time. This section describes how each of these was encoded.

Diagnoses

Diagnoses were defined as any diagnosis code captured in the EHR within the year prior to the admission. Diagnoses were mapped from ICD-10 codes to their Single-level CCS categories, and rarely occurring categories were dropped. In addition, comorbidities were split into instances found between 12 and 3 months prior to the encounter and instances found in the most recent 3 months, which allows the model to learn whether recency has any effect. The final representation consists of roughly 400 indicator columns that capture this information.
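As a rough illustration of this encoding (not the production pipeline), the sketch below maps ICD-10 codes to CCS categories and builds recency-split indicator columns. The column names, example codes, and the ccs_map lookup are hypothetical placeholders.

  import pandas as pd

  # Hypothetical inputs: one row per historical diagnosis, with the number of
  # days between the diagnosis date and the index admission.
  dx = pd.DataFrame({
      "encounter_id": [1, 1, 2],
      "icd10": ["I50.9", "E11.9", "N18.3"],
      "days_before_admission": [45, 200, 10],
  })
  ccs_map = {"I50.9": "CCS_108", "E11.9": "CCS_49", "N18.3": "CCS_158"}  # stand-in mapping

  dx["ccs"] = dx["icd10"].map(ccs_map)
  # Split into the two recency windows described above (0-3 months, 3-12 months).
  dx["window"] = pd.cut(dx["days_before_admission"], bins=[0, 90, 365],
                        labels=["0_3mo", "3_12mo"])

  # One indicator column per (CCS category, recency window) pair.
  indicators = (pd.crosstab(dx["encounter_id"], [dx["ccs"], dx["window"]])
                  .clip(upper=1))
  indicators.columns = [f"{ccs}_{win}" for ccs, win in indicators.columns]
  print(indicators)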

Problem List

Although problem lists can be misleading because they are infrequently maintained, the hypothesis here is that if something was ever entered into the problem list, it must have been indicative of a patient’s health at some point. These codes were aggregated and mapped to 258 Single-Level CCS code categories.

Procedures

All procedures that a patient has had over the past year were mapped to binary indicators. This was done by mapping a patient’s CPT codes to the corresponding Single-level CCS code.

Encounter Information

There were several encounter-based features that were used to inform the model. These included the time since the last hospitalization, the number of hospitalizations in the past year, the admission source, and which hospital the patient was admitted to.

Chief Complaint + Mode of Arrival

The chief complaint and means of arrival columns, as stored within the EHR, are used in their raw form.

Encounter Type

Whether an encounter was an emergency, urgent, or routine stay was factored into the model’s predictions.

Vital Signs

A patient’s vital signs recorded prior to admission (for example, in the emergency department), such as blood pressure, oxygen flow rate, pulse, pulse oximetry, respiratory rate, and temperature, are included in the model’s predictions. The minimum, maximum, mean, standard deviation, and number of times each vital was collected in the ED are used as predictors.
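A minimal sketch of this aggregation, assuming a long-format table of pre-admission vital sign measurements (the column names and vital labels are illustrative):

  import pandas as pd

  # Hypothetical long-format vitals recorded before the inpatient admission.
  vitals = pd.DataFrame({
      "encounter_id": [1, 1, 1, 2, 2],
      "vital": ["pulse", "pulse", "resp_rate", "pulse", "temp"],
      "value": [88, 104, 18, 72, 37.2],
  })

  # min / max / mean / std / count for each vital sign, one column per statistic.
  agg = (vitals.groupby(["encounter_id", "vital"])["value"]
               .agg(["min", "max", "mean", "std", "count"])
               .unstack("vital"))
  agg.columns = [f"{vital}_{stat}" for stat, vital in agg.columns]
  print(agg)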

Medication Administrations

A patient’s medication administrations are used as well. Working with clinical investigators, DIHI has mapped medication names to appropriate clinical categories and these are used in the model predictions.

Laboratory Tests

As with medications, laboratory tests have been mapped and are used in the model. In particular, predictors are created for whether an analyte was collected as well as whether it was resulted, because an analyte’s result may not be available at the time of admission even though it was collected. Collection alone can point to clinical suspicion of certain conditions and is often useful in its own right. If the analyte is both collected and resulted, then an indicator for the collection as well as the raw result value is included in the prediction.
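A sketch of this collected/resulted encoding under assumed column names; the analytes and values are illustrative only, and missing results are deliberately left as NaN for the model to handle:

  import numpy as np
  import pandas as pd

  # Hypothetical labs at prediction time: one row per collected analyte, with the
  # result still NaN if it has been collected but not yet resulted.
  labs = pd.DataFrame({
      "encounter_id": [1, 1, 2],
      "analyte": ["lactate", "creatinine", "lactate"],
      "result": [2.4, np.nan, np.nan],
  })

  # Indicator for "was this analyte collected at all?"
  collected = (pd.crosstab(labs["encounter_id"], labs["analyte"])
                 .clip(upper=1)
                 .add_suffix("_collected"))
  # Raw result value, NaN when collected but not yet resulted.
  values = (labs.pivot(index="encounter_id", columns="analyte", values="result")
                .add_suffix("_value"))
  features = collected.join(values)
  print(features)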

Orders

Currently, orders such as echocardiograms, telemetry, ECGs, and others are included in the prediction as indicators.

Model Design


High-level Description


The model used is an implementation of gradient-boosted machines known as LightGBM.

Variants of these types of models have been shown to be highly effective at modeling problems with large amounts of structured data. Essentially, the models build many decision trees. A decision tree in this case might be:

  if
    patient age > 65
  and
    patient has Heart Failure Comorbidity 
  and
    patient has had 3 prior admissions in the past year
  and
    patient did not have any medications administered prior to hospitalization
  then
    the patient has a 12% probability of mortality within the inpatient encounter

Clearly, this is an overly simplistic example. LightGBM and other gradient-boosting techniques build hundreds of these trees, informed by how well they perform on the training dataset. In particular, boosting methods try to ensure that each successive tree learns about patients that previous trees have had difficulty classifying. In the case of our models, we often construct as many as 2,000 trees, each with six or more splits similar to the one depicted above. We discuss the evaluation of such methods in a later section.

Once trained, the models can take in this information at run-time and generate a score as to how at risk a patient is.

Details


This model was trained using an ensemble of gradient-boosted trees for binary classification. To account for the rarity of positive outcomes, we iterated over a hyperparameter space that included several factors of upweighting for positive examples. The models were trained according to the log-loss, or binary cross-entropy, objective. Two cross-validation schemes were used in training the model. In the first scheme, we performed 5-fold cross-validation for every setting in a grid of hyperparameters that were hand-selected for the problem. In each of the folds, we evaluated the performance of the model according to the area under the precision-recall curve (AUPRC). This was because the end goal is to identify patients deemed positive, rather than to optimize two-class performance. In addition, it has been shown that AUROC tends not to be a good measure for highly imbalanced problems.

The second tuning/validation method was to use a held-out set from a time period that was not in the training set. Each setting in the hyperparameter grid was used to evaluate AUPRC on the validation fold. We set an early stopping criterion of 10 boosting iterations without improvement in AUPRC.

In practice, the tuning was done in an iterative fashion, with certain trends, such as clear overfitting, emerging early in the process. We discuss some of the parameters used to counteract this in the Even More Details section, though early stopping was the most important.
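A minimal sketch of the held-out validation scheme described above, using stand-in data and illustrative hyperparameter values (the real features, grid, and settings differ):

  import lightgbm as lgb
  import numpy as np
  from sklearn.metrics import average_precision_score

  rng = np.random.default_rng(0)
  # Stand-in features and rare positive labels in place of the real data.
  X_train, y_train = rng.normal(size=(5000, 20)), rng.binomial(1, 0.03, 5000)
  X_valid, y_valid = rng.normal(size=(1000, 20)), rng.binomial(1, 0.03, 1000)

  def auprc(preds, data):
      # Custom eval so early stopping tracks area under the precision-recall curve.
      return "auprc", average_precision_score(data.get_label(), preds), True

  params = {
      "objective": "binary",       # log-loss (binary cross-entropy) objective
      "metric": "None",            # rely on the custom AUPRC metric only
      "learning_rate": 0.02,       # illustrative values, not the tuned grid
      "colsample_bytree": 0.3,
      "scale_pos_weight": 5.0,     # one of several positive-class upweighting factors
      "verbose": -1,
  }
  dtrain = lgb.Dataset(X_train, label=y_train)
  dvalid = lgb.Dataset(X_valid, label=y_valid, reference=dtrain)

  booster = lgb.train(
      params, dtrain,
      num_boost_round=2000,                                 # up to ~2,000 trees
      valid_sets=[dvalid],
      feval=auprc,
      callbacks=[lgb.early_stopping(stopping_rounds=10)],   # stop after 10 rounds without AUPRC gains
  )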

Even More Details…


We note that it is very easy to overfit the training set in this setting. To counteract this, we use a low colsample_bytree parameter, which creates trees that have higher variance. In addition, lowering the learning rate and implementing early stopping as described in the Details section seemed to fix many of the issues seen with the other cross-validation scheme.

In particular, a model trained simply on a training set from 2015 to 2018 did not seem to generalize to data after 2018. This was another reason that the early stopping scheme worked well: the validation set was meant to represent information that did not match temporally with the data in the training set. We also observed that performance on a random sample of training data held aside as a test set was always better than performance on the temporally separate validation sets.

When dealing with missing data, we use LightGBM’s internal mechanism. In particular, when we split trees according to a function of the first- and second-order loss derivatives, the splitting mechanism first bins the data into a histogram whose granularity we control via a hyperparameter. The optimal split is then determined, and missing data is allocated to whichever side of the binary split yields the better gain. We allow this behavior rather than explicitly modeling the missingness, which is an effective method for machine learning models that do not handle missing data elegantly.
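As a small illustration of this default behavior (stand-in data, not the production pipeline), LightGBM accepts NaN values directly; max_bin controls the histogram granularity mentioned above:

  import lightgbm as lgb
  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(500, 5))
  X[rng.random(X.shape) < 0.2] = np.nan   # inject missingness; no imputation step
  y = rng.binomial(1, 0.1, 500)

  params = {
      "objective": "binary",
      "use_missing": True,   # default: NaNs go to the split side with the better gain
      "max_bin": 255,        # histogram granularity hyperparameter
      "verbose": -1,
  }
  booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=20)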

In addition, categorical data is handled by first reducing the cardinality of the categorical feature space, dropping very underrepresented categories. The remaining data is then fed into LightGBM’s categorical data handler. This is done after we manually encode each categorical feature of cardinality k using integers from 0 to k − 1. LightGBM uses a sorted-histogram approach to find the optimal split in roughly O(k log k) time, rather than enumerating the O(2^(k−1) − 1) possible partitions of the categories (or resorting to a one-hot encoding).
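A sketch of this integer encoding and hand-off to LightGBM’s categorical handler, with a hypothetical admission_source feature:

  import lightgbm as lgb
  import numpy as np
  import pandas as pd

  rng = np.random.default_rng(0)
  # Hypothetical categorical feature after very rare categories have been dropped.
  source = pd.Series(rng.choice(["Home", "Transfer", "SNF", "Clinic"], size=500))

  # Manual integer encoding: 0 .. k-1 for a feature of cardinality k.
  X = pd.DataFrame({"admission_source": source.astype("category").cat.codes})
  y = rng.binomial(1, 0.1, 500)

  dtrain = lgb.Dataset(X, label=y, categorical_feature=["admission_source"])
  booster = lgb.train({"objective": "binary", "verbose": -1}, dtrain,
                      num_boost_round=20)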

Evaluation


The models were evaluated on a held-out set of data from 2019. All encounters where the patient was 18 or over during their stay are included in this analysis. The only exception is that, for the 6-month mortality model, we exclude patients who have not yet had 6 months of follow-up for their outcome to be populated. This still leaves roughly 8 months of data for evaluation. The key point here is that the models never saw this data during the training process.

At a glance

Model                  AUC    AUPRC
Inpatient Mortality    0.87   0.22
Thirty-Day Mortality   0.88   0.34
Six-Month Mortality    0.87   0.49

Below, we include interactive visualizations that can be used to assess how the models should be thresholded and the trade-offs between false positives, true positives, sensitivity, and other metrics. Hover over each of the plots to learn more!
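For readers of this static document, the same trade-offs can be derived directly from model scores; the sketch below uses placeholder labels and scores in place of the held-out 2019 data:

  import numpy as np
  from sklearn.metrics import precision_recall_curve

  # Placeholder labels and scores; in practice these come from the held-out set.
  rng = np.random.default_rng(0)
  y_true = rng.binomial(1, 0.05, 10_000)
  y_score = np.clip(y_true * 0.3 + rng.random(10_000) * 0.7, 0, 1)

  precision, recall, thresholds = precision_recall_curve(y_true, y_score)

  # At a chosen sensitivity (recall), the false-alarm-to-true-alarm ratio is
  # FP / TP = (1 - precision) / precision.
  for target_recall in (0.2, 0.5):
      idx = int(np.argmin(np.abs(recall[:-1] - target_recall)))
      ratio = (1 - precision[idx]) / precision[idx]
      print(f"recall ~{target_recall:.0%}: threshold {thresholds[idx]:.3f}, "
            f"false:true alarm ratio {ratio:.1f} to 1")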


Inpatient Mortality


In the following section, we will drill down a bit deeper into the performance of the inpatient mortality model. In particular, we can set thresholds and examine the resulting confusion matrix to see how this affects the number of false alarms and true alarms fired for patients who are at risk of dying within their inpatient stay. The output of this model is NOT calibrated to a probability, which makes it even more important to set a correct threshold.





Key Points:

  • When the model successfully captures 20% of patients who go on to die within their inpatient stay, the false alarm to true alarm ratio is 2.6 to 1.
  • When the model successfully captures 50% of patients, the false alarm to true alarm ratio is 4.7 to 1.

Model Thresholds:

The inpatient mortality model has been thresholded in the following way:

A Critical score corresponds to a score of 0.501. 1.97% of patients who die within their inpatient stay fall into this category historically. Alarms fired in this category are accurate 72.7% of the time.

A High score corresponds to a score of 0.331. 10.8% of patients who die within their inpatient stay fall into this category historically. Alarms fired in this category are accurate 37.8% of the time.

A Medium score corresponds to a score of 0.232. 22.8% of patients who die within their inpatient stay fall into this category historically. Alarms fired in this category are accurate 17% of the time.


30-day Mortality


In the following section, we will drill down a bit deeper into the performance of the 30-day mortality model. In particular, we can set thresholds and examine the resulting confusion matrix to see how this affects the number of false alarms and true alarms fired for patients who are at risk of dying within 30 days of admission. The output of this model is NOT calibrated to a probability, which makes it even more important to set a correct threshold.





Key Points:

  • When the model successfully captures 20% of patients who go on to die within 30 days of admission, the false alarm to true alarm ratio is 1 to 1.
  • When the model successfully captures 50% of patients, the false alarm to true alarm ratio is 2.5 to 1.

Model Thresholds:

The 30-day mortality model has been thresholded in the following way:

A Critical score corresponds to a score of 0.94. 1.69% of patients who die within 30 days of inpatient admission fall into this category historically. Alarms fired in this category are accurate 84.1% of the time.

A High score corresponds to a score of 0.548. 18.7% of patients who die within 30 days of inpatient admission fall into this category historically. Alarms fired in this category are accurate 48.2% of the time.

A Medium score corresponds to a score of 0.252. 29.6% of patients who die within 30 days of inpatient admission fall into this category historically. Alarms fired in this category are accurate 22.3% of the time.


6-month Mortality


In the following section, we will drill down a bit deeper into the performance of the 6-month mortality model. In particular, we can set thresholds and examine the resulting confusion matrix to see how this affects the number of false alarms and true alarms fired for patients who are at risk of dying within 6 months of admission. The output of this model is NOT calibrated to a probability, which makes it even more important to set a correct threshold.





Key Points:

  • When the model successfully captures 20% of patients who go on to die within 6 months of admission, the false alarm to true alarm ratio is 0.53 to 1.
  • When the model successfully captures 50% of patients, the false alarm to true alarm ratio is 1.09 to 1.

Model Thresholds:

The 6-month mortality model has been thresholded in the following way:

A Critical score corresponds to a score of 0.811. 1% of patients who die within 6 months of inpatient admission fall into this category historically. Alarms fired in this category are accurate 73.4% of the time.

A High score corresponds to a score of 0.561. 21.8% of patients who die within 6 months of inpatient admission fall into this category historically. Alarms fired in this category are accurate 53.2% of the time.

A Medium score corresponds to a score of 0.386. 19.5% of patients who die within 6 months of inpatient admission fall into this category historically. Alarms fired in this category are accurate 36.8% of the time.


Prospective Evaluation:

The model is currently running in silent mode. Every hour, the necessary data is fetched from Epic and the model is run for all patients who have not yet been discharged. From the period of 2020-04-15 to 2020-08-11, the metrics for the 30-day model, computed in a staging setting (using real-time data looking forward), are:

  • AUROC: 0.86
  • AUPRC: 0.31

which is very close to the retrospective performance. This indicates that the model has held up well through the transition from development to staging and the switch in data sources. As further data is collected, the 6-month performance will also be added to this document.
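For reference, these two metrics can be computed from silent-mode scores with scikit-learn; y_true and y_score below are placeholders for the prospectively collected labels and model outputs:

  import numpy as np
  from sklearn.metrics import average_precision_score, roc_auc_score

  rng = np.random.default_rng(0)
  y_true = rng.binomial(1, 0.05, 5000)       # placeholder prospective outcomes
  y_score = np.clip(y_true * 0.4 + rng.random(5000) * 0.6, 0, 1)

  print("AUROC:", round(roc_auc_score(y_true, y_score), 2))
  print("AUPRC:", round(average_precision_score(y_true, y_score), 2))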

Frequently Asked Questions:


The Inpatient Risk is High/Critical, but the 30-day or 6-month models are not! Why?

The model is trained on each outcome separately using close to 1,000 features about a patient. In addition, the model is trained to maximize a particular metric for each outcome. This may mean that predictions for certain types of patients are better in one model than in another. It also means that the features the model has learned to highlight may be different for an inpatient death versus a 6-month death. It is possible that a patient has features that are correlated with an inpatient death but does not meet the criteria for a 6-month death. As an example, perhaps the patient is new to Duke and has no history of comorbidities, but has some laboratory values that are associated with inpatient death. Even though 6-month death also includes inpatient death, the important features might be weighted more towards patients who die outside of the inpatient setting, because more patients die after their inpatient stay than during it. It can be difficult to determine on an individual basis why this phenomenon occurs.

In addition, because the thresholds are different for each model, “High” or “Medium” may not mean the same thing for each of the models. These decisions were made by looking at particular points on the precision-recall curves. For more information, see the visualizations and Key Points sections above.

Why is the risk score so high for a particular patient?

It can be difficult to determine why any particular patient has a risk score that is high or low. The reason is that the calculations on the input data involve tens of thousands of computations, and it is difficult to reason about the result for one particular patient. If the model predicts a score that does not match clinical intuition, disregard the model results. A tool to help disentangle the model predictions is underway.

Can I use this model to inform admission decisions?

No. This model should only be used as information about a patient’s risk of mortality within their inpatient stay, within 30 days, and/or within 6 months of admission given the standard of care. If the model is used to change a patient’s care in a dramatic way, the model’s predictions become invalid.


Questions or Concerns?

 




Copyright 2019, Duke Institute for Health Innovation (DIHI), Duke University School of Medicine, Durham NC. All Rights Reserved.