According to the Centers for Disease Control and Prevention (CDC), heart disease is one of the leading causes of death in the United States, accounting for nearly 20% of all deaths in 2020. In 2017 and 2018, the estimated cost of health care services, medication to treat heart disease, and missed days of work amounted to a staggering $229 billion per year.
For our customer, a U.S. healthcare insurance company, a sample analysis of heart-related chronic conditions showed that a fraction of the population contributed to high costs because of heart-related illnesses. If an intervention were targeted towards high-risk patients on the path to chronic illness, a significant proportion of that cost could be recovered. The analysis showed that in California alone, the potential savings based on the sample of approximately 5,000 patients with heart-failure-related illness were $515 million.
Hence, our customer wanted to undertake a proof-of-concept on their data to identify high-risk patients 6 months in advance to allow doctors enough time to develop intervention plans and improve their patients’ health. A team of Teradata data scientists at Global Delivery Center (GDC) Pakistan and industry experts working alongside the customer aimed to develop a solution that would predict the onset of heart failure 6 months in advance with high accuracy to avert the high claims costs that arise after the onset. In addition to identifying high-risk patients, the customer also wanted to understand the driving factors behind patients’ predicted outcome.
Data Prep and Feature Engineering Using Teradata
We sampled two groups of patients based on the claims data: those who went through heart failure (cases) and those with similar demographics but no heart failure (controls). The age range was between 40 and 85 years.
Patient data was anonymized (NOPHI - No Personal Health Information) and complied with HIPAA standards.
We extracted patients’ visit records consisting of diagnoses, medications, procedures, and demographics. We also added a temporal aspect to the medical features by differentiating between events occurring 1-3 months, 3-6 months, and 6-12 months before the onset of heart failure.
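The temporal bucketing described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline (which ran in-database): the window edges in days, the half-open boundary handling, the `code__window` feature naming, and the sample patient are all assumptions made for the example.

```python
from collections import Counter
from datetime import date

# Assumed window edges in days before onset (1 month taken as 30 days).
# Each window is half-open: lo <= days_before < hi.
WINDOWS = [("1-3m", 30, 90), ("3-6m", 90, 180), ("6-12m", 180, 365)]

def window_label(event_date, onset_date):
    """Return the window name for an event, or None if it is outside all windows."""
    days_before = (onset_date - event_date).days
    for name, lo, hi in WINDOWS:
        if lo <= days_before < hi:
            return name
    return None

def temporal_features(events, onset_date):
    """Count events per (code, window) pair, e.g. 'I10__3-6m'."""
    counts = Counter()
    for code, event_date in events:
        label = window_label(event_date, onset_date)
        if label is not None:
            counts[f"{code}__{label}"] += 1
    return dict(counts)

# Hypothetical patient: codes and dates are made up for illustration.
onset = date(2020, 12, 1)
events = [
    ("I10", date(2020, 10, 15)),  # falls in the 1-3 month window
    ("I10", date(2020, 7, 1)),    # falls in the 3-6 month window
    ("E11", date(2020, 2, 1)),    # falls in the 6-12 month window
    ("E11", date(2020, 11, 25)),  # less than 1 month before onset: ignored
]
features = temporal_features(events, onset)
print(features)
```

Keeping the same code as a separate feature per window lets a model weigh a recent diagnosis differently from an older one.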
To reduce the number of features for model building, we used official medical groupers to aggregate the codes into broader categories. In addition, we applied a regression-based feature reduction technique to eliminate uninformative features. Together, these two steps reduced the total number of features manifold, which lowered the complexity of our model while improving its predictive power.
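As a rough illustration of the grouping step, the sketch below collapses detailed diagnosis codes into broader categories. It is a stand-in only: it truncates ICD-10-CM codes to their 3-character category, whereas the project used official medical groupers whose mappings are not reproduced here. The function names and sample codes are hypothetical, and the regression-based reduction step is not shown.

```python
from collections import Counter

def group_code(code):
    """Map a full ICD-10-CM code to its 3-character category (stand-in grouper)."""
    return code.replace(".", "")[:3].upper()

def grouped_counts(codes):
    """Count how many codes fall into each broader category."""
    return dict(Counter(group_code(c) for c in codes))

# Made-up sample of diagnosis codes for one patient.
codes = ["E11.9", "E11.65", "I10", "I11.0", "e11.21"]
grouped = grouped_counts(codes)
print(grouped)
```

Here five distinct codes collapse into three categories, which is the kind of reduction in feature count the grouping is meant to achieve.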
Model Development and Results
Using the remaining time-based variables, we compared several models with different hyperparameter settings and selected the one that most accurately identified heart failure patients. Using Teradata Vantage in-database functions for data preparation of the customer claims data, we were able to build a model that correctly predicts ~70% of heart failure occurrences with ~90% accuracy, 6 months in advance, for tens of thousands of patients at scale.
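The two figures above measure different things. The article does not define them precisely, so the sketch below only shows how the usual candidates (recall, precision, and overall accuracy) are computed from a confusion matrix. The counts are made up for illustration and are not the project's results.

```python
def recall(tp, fn):
    """Share of actual heart failure cases that the model flagged."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Share of flagged patients who actually developed heart failure."""
    return tp / (tp + fp)

def accuracy(tp, tn, fp, fn):
    """Share of all patients, cases and controls, classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion-matrix counts (not taken from the project).
tp, fn, fp, tn = 30, 20, 10, 140

print(f"recall={recall(tp, fn):.2f}")
print(f"precision={precision(tp, fp):.2f}")
print(f"accuracy={accuracy(tp, tn, fp, fn):.2f}")
```

When cases are much rarer than controls, accuracy can stay high even if many cases are missed, which is why the share of occurrences caught is reported separately.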
Moreover, we estimated how much each variable increases the odds of heart failure on average, and these temporal variables were displayed as a sequence of events leading up to heart failure.
Translating Model Accuracy to Financial Savings
According to our analysis, the cost per patient nearly triples after the onset of heart failure, serving as substantial impetus to avert the outcome as much as possible.
If health practitioners can intervene at the right time based on the prediction of our model and prevent even a fraction of patients from having heart failure, we can not only improve and extend patients’ lives but also potentially save millions of dollars in cost of care.
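The reasoning above can be turned into back-of-the-envelope arithmetic. The sketch below is only an illustration: the function name, the patient counts, the averted fraction, and the baseline cost are hypothetical placeholders, and the multiplier of 3 stands in for the "nearly triples" finding. Intervention costs are not modeled.

```python
def estimated_savings(true_positives, averted_fraction,
                      cost_before_onset, post_onset_multiplier=3.0):
    """Savings from averted cases = averted patients x (post-onset cost - pre-onset cost)."""
    averted_patients = true_positives * averted_fraction
    cost_after_onset = cost_before_onset * post_onset_multiplier
    return averted_patients * (cost_after_onset - cost_before_onset)

# Hypothetical inputs: 1,000 correctly flagged patients, 10% of onsets averted,
# and a placeholder baseline cost of 10,000 per patient.
savings = estimated_savings(true_positives=1000, averted_fraction=0.10,
                            cost_before_onset=10_000.0)
print(f"Estimated savings: {savings:,.0f}")
```

Because the per-patient gap is twice the baseline cost under this assumption, even a small averted fraction scales quickly across a large member population.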
Interpreting Model Decisions
For clinicians and healthcare analysts, it is important that the predictive model is both accurate and interpretable. They need to know the reasoning behind the predictions for the following reasons:
1. To ascertain that the predicted outcome is reasonable and logically follows from the driving factors (explanation) given by the model.
2. To inform clinical decisions and design appropriate intervention plans based on the explanation.
Our model is a tree-based learner that is nonparametric, meaning that it does not output inherently meaningful feature contributions or parameters. To achieve interpretability, we estimated feature contributions using a technique known as Shapley values (SHAP). This technique gauges the contribution of each feature by isolating its effect on the outcome, which allowed us to infer how much each feature affects the odds of having heart failure. As a result, we could combine the superior predictive accuracy of our selected tree-based model with the explainability of a highly interpretable model, without trading one for the other.
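To make the idea concrete, the sketch below computes exact Shapley values for a toy scoring function by averaging each feature's marginal contribution over all coalitions of the other features. It is a simplified illustration, not the project's implementation: the toy log-odds model, its feature names, and its coefficients are hypothetical, and "absent" features are simply switched off rather than averaged over a background dataset as SHAP tooling for tree models does.

```python
from itertools import combinations
from math import exp, factorial

def shapley_values(value_fn, features):
    """Exact Shapley values: weighted average marginal contribution per feature."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # k = size of the coalition that f joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi

def log_odds(present):
    """Toy model (hypothetical): log-odds of heart failure given present conditions."""
    score = -2.0
    if "hypertension" in present:
        score += 0.8
    if "diabetes" in present:
        score += 0.5
    if "kidney_disease" in present:
        score += 0.3
    if "hypertension" in present and "diabetes" in present:
        score += 0.4  # interaction term, split evenly between the two features
    return score

features = ["hypertension", "diabetes", "kidney_disease"]
phi = shapley_values(log_odds, features)

for name, value in phi.items():
    # With a log-odds output, exp(contribution) acts as a multiplier on the odds.
    print(f"{name}: log-odds contribution {value:+.2f}, odds x{exp(value):.2f}")
```

The contributions sum to the difference between the score with all features present and the score with none, which is what makes them usable as a per-patient explanation. Exact enumeration grows exponentially with the number of features, so it is practical only for small examples like this one.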
Summing up, our solution combines Teradata Vantage’s scalable technology for advanced analytics with a seamless integration of open-source tools that offer a business-friendly user interface. It reduces the effort of sifting through millions of data points to a few clicks in an easy-to-use tool, driving fast, actionable insights for clinical decision support at scale. As a next step, we have extended our initial use case so that a doctor or other user can choose any condition of interest and predict it over a period of their choice.
*Dr. Bilal Khaliq also contributed to this project and article
About Bilal Khaliq:
Bilal is Principal Data Scientist at Teradata focused on the Healthcare industry. He has been working with U.S. accounts teams to provide advanced analytics solutions for payors & providers. He leads a team of Data Scientists focusing on engagements to analyze and manage cost of care as well as modelling & prediction of medical outcomes and patient behavior.