
TLDR

“Duke Health Policy Chat” is an AI-powered assistant that helps healthcare staff at Duke access and understand policies and procedures. This AI tool uses natural language processing to answer questions about policies, find relevant policy documents, and present information in an easy-to-understand way.

Policy Chat was designed to provide clear, concise answers to specific policy-related questions, making it easy for users to get the information they need without wading through complex documents. It performs best when responding to focused, actionable questions—those that aim to clarify a single, well-defined point of policy. However, it is not intended to process broad, exploratory requests that require summarizing or presenting entire policies. By aligning with its intended use, Policy Chat helps streamline workflows.

For Duke Users

Tired of wasting time searching for Duke Health policies? Duke Health Policy Chat is an AI-powered solution that provides instant access to over 4,000 policies. Ask questions in natural language and get accurate, timely answers. Duke Health Policy Chat offers:

  • An AI assistant that understands your questions and cites relevant policies, saving you time and reducing staff disagreements.
  • Real-time policy updates so you always have the latest information.
  • A user-friendly interface for easy navigation and search.

Duke Health Policy Chat empowers staff with knowledge, improves policy adherence, and enhances patient care. Join us in transforming policy management at Duke Health.

Problem

Duke Health faces a significant challenge with inconsistent accessibility and understanding of its 4,000+ policies among healthcare staff. This impacts efficiency, policy adherence, and, ultimately, patient care. To address this, the Duke Institute for Health Innovation (DIHI) is developing Duke Health Policy Chat, an AI-powered knowledge management system.

  • Problem Identification: Healthcare staff, particularly new recruits, struggle to access and understand Duke’s vast array of policies and procedures. This is due to manual dissemination methods and the cumbersome nature of navigating through the existing policy center.
    • “Inconsistent accessibility and understanding of Duke’s policies and processes among healthcare staff, especially new recruits, impact efficiency, and policy adherence.” – Application for the Innovation Award
  • Affected Demographics: New and existing healthcare providers and administrators across the Duke Health Enterprise (DHE).
  • Current Handling: Manual dissemination and individual exploration of policies, leading to inefficiencies and inconsistencies in application.

Solution

Duke Health Policy Chat is an AI-driven knowledge management system that leverages Retrieval Augmented Generation (RAG) technology. It provides instantaneous access to Duke’s policies and procedures, improving comprehension and adherence. Key features include:

  • AI-Powered Assistant: Utilizes natural language processing (NLP) to interpret user inquiries and retrieve relevant policy documents. It presents information in a digestible format and can handle multi-level inquiries, providing detailed insights.
  • Real-Time Policy Updating: Ensures policies are current and accurate by incorporating changes automatically and verifying them through collaborative algorithms.
  • Faceted Search: Allows manual exploration of policies based on criteria like department, relevance, and date, providing flexibility.
  • User Interaction Analytics: Tracks user interactions to identify trends and potential areas for policy clarification and tailor system enhancements.
  • User-Friendly Interface: Designed for intuitive navigation and ease of use.

Examples of Functionality

Testing and development have demonstrated the system’s capabilities. It accurately identifies relevant policies and provides concise answers to a wide range of user questions.

Example 1:

  • Question: Do we need to count instruments for hernia repair?
  • Answer: “Yes. An initial instrument count (pre-incision) is required for procedures that enter into the pelvic cavity. Additionally, a first closing instrument count (initiated during initial wound or body cavity closure) is required in all cases with an incision of a body cavity equal or greater than 5 cm.”

Example 2:

  • Question: What are the protocols for patient identification before surgery?
  • Answer: Provides a detailed summary of the DUHS Patient Identification Policy (9165), referencing the specific policy section (Policy VII) for verification.

Impact

Duke Health Policy Chat aspires to enhance policy adherence and staff proficiency at Duke. It strengthens staff training, improves policy accessibility, and may ultimately lift care quality.

  • Improved Policy Adherence & Comprehension: Provides staff with easy access to accurate and up-to-date policies, resulting in better adherence and understanding.
  • Increased Efficiency: Reduces time spent searching for and understanding policies, freeing up staff time for patient care and other tasks.
  • Potential Uplift in Care Quality: By ensuring consistent policy adherence, the system can improve patient safety and outcomes.
  • Scalable and Adaptable: The system is designed to be scalable and adaptable, with potential applications in other areas such as workplace conflict resolution and augmented safety reporting.

Technology & Innovation

  • Retrieval Augmented Generation (RAG): Employs advanced AI techniques that utilize mathematical text embeddings to compare user questions with policy content, delivering accurate and contextually relevant responses.
    • “While tools are rapidly becoming easier to use (more “point and click”), end-to-end design project frameworks especially in large enterprises such as Duke are still emerging. Both the data content and structure of this project are ideal for such a RAG-based strategy.” – PITCH Slides, Dr. Michael Kent
  • Semantic Search & Context Awareness: Understands and utilizes the most relevant content from policy documents, including maintaining conversational context for follow-up questions.
  • Prompt Engineering: Carefully crafted prompts guide the AI to generate accurate, coherent, and contextually appropriate responses.
  • Continuous Expert Feedback: A web application captures user feedback, including ratings of AI responses and access to source documents for verification. This ensures ongoing improvement and user-centricity.

“This continuous feedback loop ensures that our solution is not just effective but also user-centric and constantly improving, making our stakeholders involved and integral to the project.” – Matt Gardner

Conclusion

The ongoing development of the “Duke Health Policy Assistant” through the DIHI 2024 RFA initiative underscores the transformative potential of AI-powered knowledge management systems. Using the sophisticated Retrieval-Augmented Generation (RAG) technique, we aim to address the pressing challenge of providing timely and accurate access to Duke Health’s extensive policy documents. This project underscores DIHI’s commitment to addressing real-world healthcare challenges and improving clinical decision-making through efficient knowledge management and access to critical information.

We are planning a limited system deployment among a cohort of clinical domain experts. This phase will generate valuable data on usage patterns and user feedback, further informing our iterative improvements. Through continuous refinement and optimization, we aim to develop a robust AI assistant that significantly enhances access to policy information, ultimately improving decision-making and patient outcomes at Duke Health.

 

For Data Scientists and Engineers

The Iterative Development Approach 

While the technical aspects of configuring and deploying a RAG solution have become relatively straightforward, ensuring the accuracy of the generated responses remains a significant challenge. We will next describe our iterative approach to ensure high accuracy, which is crucial for user trust and enterprise adoption. 

To enhance the accuracy of our RAG solution, we are concentrating on the following key components:  

  • Document Retrieval (Semantic Search),
  • Context Awareness,
  • Prompt Engineering, and
  • Continuous Expert Feedback

RAG Component: Document Retrieval (Semantic Search)

We have experimented with several text embedding models, guided by the HuggingFace Massive Text Embedding Benchmark (MTEB) Leaderboard, a comprehensive evaluation platform for assessing and comparing the performance of text embedding models across a wide array of tasks. The MTEB gives us insight into the top-performing embedding models optimized for semantic search. To test search accuracy, we embedded the entire corpus of DHE policies. Policy content experts have created a bank of common questions along with the relevant policy documents they expect to be retrieved, and we are translating these questions into automated test cases that can run continuously. This approach ensures that iterative enhancements to document splitting strategies and the embedding model do not compromise search effectiveness, and it helps identify areas where search parameters need further optimization.
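As a sketch of what one such automated test case might look like, the toy example below checks that the expected policy document ranks first for a given question. The bag-of-words embedder, document IDs, and corpus text are stand-ins for illustration; the production system uses a real MTEB-ranked embedding model over the actual policy corpus.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def toy_embed(text, vocab):
    """Stand-in embedder: bag-of-words counts over a fixed vocabulary.
    A real test would call the production embedding model instead."""
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def retrieval_test_case(question, expected_doc_id, corpus, vocab, top_k=1):
    """True if the expected policy appears in the top-k results for the
    question -- one automated search-accuracy test case."""
    q_vec = toy_embed(question, vocab)
    ranked = sorted(corpus,
                    key=lambda d: cosine(q_vec, toy_embed(corpus[d], vocab)),
                    reverse=True)
    return expected_doc_id in ranked[:top_k]

# Illustrative corpus; IDs and text are made up.
vocab = ["instrument", "count", "surgery", "identification", "patient"]
corpus = {
    "policy_counts": "instrument count required for surgery",
    "policy_id": "patient identification before surgery",
}
passed = retrieval_test_case("do we count instrument for surgery",
                             "policy_counts", corpus, vocab)
```

Running the expert question bank through such test cases on every change to the splitting strategy or embedding model gives a regression signal on search effectiveness.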

RAG Component: Context Awareness 

Context Awareness refers to the system’s ability to comprehend, retain, and utilize the most pertinent content from Duke Health Policy documents in response to specific questions, such as: “How do I drape a patient for surgery?” We need to determine the optimal context scope to provide to the Large Language Model (LLM)—whether to use the most relevant chunks (e.g., sentences, paragraphs) from relevant documents or the entire documents. 

This optimization challenge intensifies when RAG is enabled to support conversational interactions. To handle follow-up questions effectively, tracking prior questions and generated responses as part of the context is essential. For example, a follow-up question like “What about a craniotomy?” should be reformulated using the prior conversational history to something more context-aware, such as “How do I drape a patient for a craniotomy?” 

The DIHI RAG framework keeps track of all prior context, including the document context retrieved via semantic search and the conversational history. This comprehensive tracking capability is instrumental in identifying areas where we need to optimize context awareness to improve conversational experiences. 
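A minimal sketch of the reformulation step, assuming the rewrite is delegated to an LLM; the instruction wording and function name below are illustrative, not the production prompt:

```python
def build_rewrite_prompt(history, follow_up):
    """Assemble the instruction an LLM would receive to turn a follow-up
    question into a standalone, context-aware question."""
    turns = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)
    return (
        "Rewrite the final user question as a standalone question, "
        "using the conversation so far for context.\n\n"
        f"{turns}\n\n"
        f"Final question: {follow_up}\n"
        "Standalone question:"
    )

history = [("How do I drape a patient for surgery?",
            "Per the surgical draping policy, the steps are as follows.")]
prompt = build_rewrite_prompt(history, "What about a craniotomy?")
# The LLM is expected to return something like:
# "How do I drape a patient for a craniotomy?"
```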

RAG Component: Prompt Engineering 

The challenge here is to craft prompts that guide the Large Language Model (LLM) to generate accurate, coherent, and contextually appropriate responses. There are two levels of prompting that we are currently focused on optimizing: 

  • Prompt for a new conversation topic: (system/LLM instructions + policy document context + user question) 
  • Prompt in the context of an existing conversation topic: (system/LLM instructions + policy document context + prior conversation history + user question) 
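A sketch of how these two prompt shapes might be assembled; the section labels and example strings are illustrative, not the production wording:

```python
def build_prompt(system_instructions, policy_context, question, history=None):
    """Assemble a prompt for a new topic (no history) or an existing
    conversation (history included), per the two shapes above."""
    parts = [system_instructions, "Policy context:\n" + policy_context]
    if history:  # existing conversation topic
        parts.append("Conversation so far:\n" + "\n".join(history))
    parts.append("Question: " + question)
    return "\n\n".join(parts)

fresh = build_prompt("You answer from Duke Health policy only.",
                     "Policy text goes here.",
                     "Do we need to count instruments for hernia repair?")
follow = build_prompt("You answer from Duke Health policy only.",
                      "Policy text goes here.",
                      "What about a craniotomy?",
                      history=["User: Do we need to count instruments for hernia repair?",
                               "Assistant: Yes, an initial count is required."])
```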

We discovered that several intricate considerations can significantly impact the quality, relevance, and coherence of the responses generated by the system. For example:

  • Crafting clear and concise prompts helps avoid ambiguity and ensures the LLM can focus on the most pertinent details. For our use case, instructing the LLM to list the names of the policy documents referenced – and the Duke entity they apply to – was critical.
  • Structuring the prompt in a logical sequence that mirrors the conversation flow aids in maintaining coherence and relevance. 
  • Developing truncation strategies that retain the most significant information when the context and conversational history exceed token limits.
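One possible truncation order is sketched below. The specific order (oldest conversation turns dropped first, then the lowest-ranked context chunks) and the character-count token proxy are assumptions for illustration, not the production strategy:

```python
def truncate_for_budget(context_chunks, history, budget, count_tokens=len):
    """Trim context and history until they fit a token budget.
    `context_chunks` is assumed sorted most-relevant first; `count_tokens`
    defaults to character length as a crude proxy for a real tokenizer."""
    history = list(history)
    chunks = list(context_chunks)

    def total():
        return (sum(count_tokens(t) for t in history)
                + sum(count_tokens(c) for c in chunks))

    while total() > budget and history:
        history.pop(0)   # oldest turn is least significant, drop it first
    while total() > budget and len(chunks) > 1:
        chunks.pop()     # then shed the lowest-ranked context chunks
    return chunks, history

kept_chunks, kept_history = truncate_for_budget(
    ["aaaa", "bbbb"], ["cccc", "dddd"], budget=8)
```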

The most important lesson we learned is that tracking prompt changes and iterations is essential for understanding which modifications lead to improvements and for replicating success. The DIHI RAG Framework records all prompts used so we can perform retrospective analysis on prompting strategies.

RAG Component: Continuous Expert Feedback 

The integration of continuous expert feedback is a vital part of our ongoing enhancement of the RAG solution. We developed a web application that allows end users to experiment with the “DHE Policy Assistant.” Conversational history with the AI assistant is captured in a database repository, including the questions asked, the document context referenced, the responses generated, and the technical parameters used, such as the model used to embed documents and queries and the LLM used to generate responses. In addition, users can rate AI-generated responses and provide feedback, keeping stakeholders involved and integral to the project.
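The captured record might look like the following sketch; the field names mirror the items listed above, but the actual database schema is not published, and the example values are made up.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class InteractionRecord:
    """One captured exchange with the assistant; fields are illustrative."""
    question: str
    document_context: List[str]        # policy chunks retrieved for this turn
    response: str                      # LLM-generated answer
    embedding_model: str               # model used to embed documents/queries
    llm: str                           # model used to generate the response
    user_rating: Optional[int] = None  # filled in when the user rates
    user_feedback: str = ""

rec = InteractionRecord(
    question="Do we need to count instruments for hernia repair?",
    document_context=["Surgical Counts policy excerpt"],
    response="Yes. An initial instrument count is required.",
    embedding_model="example-embedding-model",
    llm="example-llm",
)
```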

We also recognized that access to the source policy documents was necessary to facilitate user evaluation and feedback. This allows users to dig deeper to verify the accuracy of an LLM-generated response or to gain extended insight into discrepancies between generated responses and source documents. The Web application provides quick links to all referenced source documents in the context of a conversation.   

Evaluation Description


The Question-Answer system was developed using Retrieval Augmented Generation (RAG): policies are embedded using an embedding model; at runtime, the user query is embedded, relevant documents are retrieved and added to the context, and an answer is generated from the user query and the retrieved context. A gold-standard evaluation set was developed by asking members of the team to answer questions based on the text of a relevant policy. Two components of the system must be evaluated separately:

  1. How often is the correct policy, the one containing the answer to the user’s question, retrieved?
  2. Does the generated answer align with the retrieved context?

These will be answered using the following metrics:

  1. Contextual Recall
  2. Faithfulness

These are defined as follows:

Contextual Recall

Contextual recall is defined as the proportion of relevant documents that are retrieved. For the gold standard cases, the queries have one document which answers the question and is appropriate. Therefore, the contextual recall is simply the proportion of queries for which the correct document is retrieved.

  • Contextual Recall = (# queries with the correct document retrieved) / (# total queries)
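Given retrieval results for the gold-standard queries, the metric reduces to a few lines. This is a sketch; the data shapes and document IDs are assumptions:

```python
def contextual_recall(results):
    """results: list of (retrieved_doc_ids, expected_doc_id) pairs, one per
    gold-standard query. Returns the fraction of queries for which the
    single expected document was retrieved."""
    hits = sum(1 for retrieved, expected in results if expected in retrieved)
    return hits / len(results)

recall = contextual_recall([
    (["policy_a", "policy_b"], "policy_a"),  # hit
    (["policy_c"], "policy_d"),              # miss
])
```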

 

Faithfulness

Here, we define faithfulness simply as whether the answer provided by the system is grounded in the context that was provided. This is currently evaluated via an LLM and will be updated to use the GEval Framework. The rubric is defined in the following prompt:

You will be given a user query, the context that is provided to answer the query, and the response. You should determine whether the response aligns with the context.

<query>
{query}
</query>
<context>
{context}
</context>
<response>
{response}
</response>

You should use the following rubric to determine the alignment score:

  • If the response can be attributed to the context exactly, score 3
  • If the response can be attributed to the context but with some minor differences, score 2
  • If the response can be attributed to the context but with significant differences, score 1
  • If the response cannot be attributed to the context at all, score 0

You should return the score as well as the specific spans of text from the context that support the response:

  • score: The alignment score
  • relevant_spans: A list of spans of text from the context that supports the response
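A sketch of how the rubric prompt could be filled and the 0–3 score aggregated. The template text paraphrases the prompt above, and the threshold used to binarize scores into a percentage is an assumption, not the published cutoff:

```python
RUBRIC_TEMPLATE = """You will be given a user query, the context that is
provided to answer the query, and the response. Determine whether the
response aligns with the context, scoring it on the 0-3 rubric.

<query>
{query}
</query>
<context>
{context}
</context>
<response>
{response}
</response>"""

def fill_rubric_prompt(query, context, response):
    """Substitute the three fields; the filled prompt is then sent to the
    judging LLM (the call itself is not shown)."""
    return RUBRIC_TEMPLATE.format(query=query, context=context, response=response)

def is_faithful(score, threshold=2):
    """Binarize a 0-3 rubric score for aggregation into a percentage."""
    return score >= threshold

filled = fill_rubric_prompt("sample query", "sample context", "sample response")
```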

The final score on November 14, 2024, was 100%.