ancient Roman aqueduct bridge
Pont du Gard, Gard, France. Photo by Xuan Nguyen on Unsplash

The Problem

At the Duke Institute for Health Innovation, we are excited by the prospect of putting machine learning into practice in medicine. When we first began developing machine learning solutions, machine learning in healthcare was in its infancy. However, we felt then, as we do now, that all of the pieces were in place to allow data-driven approaches to transform medicine. It seems like a foregone conclusion that by leveraging the vast amount of data present in the medical record, machine learning can aid physicians in detecting disease earlier, informing treatment protocols, making diagnoses, and identifying patients who need specialized care.

However, our initial efforts were anything but streamlined. We quickly realized that any machine learning solution was gated by access to data. During early projects, it would take upwards of 6 months to obtain access to the required data. Given our short pilots, it was immediately clear that until we had a reliable and timely way of accessing clean medical record data, our projects would continue to be rate limited.

Even once we had the data, it was evident that there was much more work to be done before we could deliver robust machine-learning-based solutions. Many of our early efforts involved resolving different names for the same clinical concept, harmonizing units for lab tests, and similarly engaging work. Our clinical partners, whom we asked to help with this effort, seemed delighted by the iterative and manual nature of the work.

Our Solution

As we began curating this information for use in future projects, we also began receiving more and more project proposals having to do with machine learning. We soon realized that in order to keep up with the increasing volume of data needs, we needed a scalable way to work with data, which led us to build the DIHI Data Pipeline.

At its core, the DIHI pipeline allows users to work with clean and reproducible data. By clean, we mean that rather than grouping raw data elements over and over again for different projects, we house all of this knowledge in the same place. Lab test result units are converted to ensure consistency across the same analyte. Existing references for grouping ICD codes, medication therapeutic classes, procedures, and other data fields are kept easily accessible. This reduces both the time it takes to go from raw data to a dataset ready for analysis and the amount of redundancy across projects and groups at Duke Health. By reproducible, we mean that queries to the system are repeatable where possible, so that an analysis performed today should give consistent results two years from now. In building the pipeline, we leveraged technologies and best practices that are used and developed at major tech companies such as Google, Uber, and Airbnb, and we are continuing to add features as we serve new use cases.
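To make the idea of housing this knowledge in one place more concrete, here is a minimal Python sketch. It is not the DIHI pipeline's actual code: the lookup tables, the choice of canonical units, the group labels, and the names LabResult, harmonize, and group_icd10 are all assumptions made for illustration. It only shows how unit conversion and ICD code grouping might be defined once and reused across projects.

```python
from dataclasses import dataclass

# Hypothetical shared reference tables (assumptions, not the pipeline's real
# mappings). Each analyte has one canonical unit, and every known reported
# unit has a multiplier that converts it to that canonical unit.
CANONICAL_UNIT = {"hemoglobin": "g/dL", "glucose": "mg/dL"}
TO_CANONICAL = {
    ("hemoglobin", "g/dL"): 1.0,
    ("hemoglobin", "g/L"): 0.1,
    ("glucose", "mg/dL"): 1.0,
    ("glucose", "mmol/L"): 18.016,  # glucose molar mass is about 180.16 g/mol
}

# Illustrative grouping keyed on the three-character ICD-10 category.
ICD10_GROUPS = {"E11": "type 2 diabetes", "I10": "hypertension"}


@dataclass(frozen=True)
class LabResult:
    analyte: str
    value: float
    unit: str


def harmonize(result: LabResult) -> LabResult:
    """Return the result expressed in the analyte's canonical unit."""
    key = (result.analyte, result.unit)
    if key not in TO_CANONICAL:
        # Fail loudly rather than silently mixing units in one column.
        raise ValueError(f"no conversion for {result.analyte} in {result.unit}")
    return LabResult(
        analyte=result.analyte,
        value=result.value * TO_CANONICAL[key],
        unit=CANONICAL_UNIT[result.analyte],
    )


def group_icd10(code: str) -> str:
    """Map an ICD-10 code to a group label via its three-character category."""
    category = code.strip().upper()[:3]
    return ICD10_GROUPS.get(category, "ungrouped")
```

With tables like these kept in one shared location, a call such as harmonize(LabResult("hemoglobin", 135.0, "g/L")) or group_icd10("E11.9") would behave the same way in every project, instead of each project re-deriving its own conversions and groupings.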

Impact

This pipeline has allowed us to complete projects at a much faster rate and to focus on the truly difficult parts of innovation in healthcare: implementation and workflow design. We believe that much of our success is due to this pipeline and the team that helped to design and build it. As we continue to tackle more difficult challenges, we expect that the pipeline will continue to accelerate and enable innovation in healthcare at Duke and beyond.

The Duke Health Data Pipeline, powered by DIHI, is a foundational, fully automated data curation tool that enables data liquidity, which in turn accelerates quality improvement, learning health, research, and innovation projects. By integrating and standardizing the EHR, clinical outcomes, claims, and other data sources, the pipeline provides comprehensive, timely, accurate, and linkable information to support system-wide innovation and transformation. Successful implementation of the data pipeline across the health system will allow us to have a significant impact on care across clinical areas and could help us to more accurately predict, and hence prevent, adverse outcomes.
