Guidelines and Evaluation for Clinical Explainable AI

How should we design and evaluate explainable AI in real-world, high-stakes domains?

  • The Clinical Explainable AI Guidelines provide design and evaluation criteria that support XAI for clinical use.
  • The explanation form is chosen based on G1 (Understandability) and G2 (Clinical relevance).
  • The explanation method is chosen based on G3 (Truthfulness) and G4 (Informative plausibility).
  • Evaluations on two medical datasets showed that existing heatmap methods met G1, partially met G2, but failed G3 and G4.
  • We propose the novel problem of multi-modal medical image explanation, along with metrics to evaluate it.

Related Publication

  1. MedIA
    Guidelines and evaluation of clinical explainable AI in medical image analysis
    Jin, Weina, Li, Xiaoxiao, Fatehi, Mostafa, and Hamarneh, Ghassan
    Medical Image Analysis 2023
  2. MethodsX
    Generating post-hoc explanation from deep neural networks for multi-modal medical image analysis tasks
    Jin, Weina, Li, Xiaoxiao, Fatehi, Mostafa, and Hamarneh, Ghassan
    MethodsX 2023

A precursor of this work was published at the AAAI-22 Social Impact Track:

  1. AAAI
    Evaluating Explainable AI on a Multi-Modal Medical Imaging Task: Can Existing Algorithms Fulfill Clinical Requirements?
    Jin, Weina, Li, Xiaoxiao, and Hamarneh, Ghassan
    Proceedings of the AAAI Conference on Artificial Intelligence Jun 2022

    Acceptance rate: 15%

The overarching problem is how to design and evaluate explainable AI in real-world, high-stakes domains. We propose a novel problem in the medical domain, multi-modal medical image explanation, and use it as a running example to demonstrate an evaluation process that incorporates both technical and clinical requirements.
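
To make the multi-modal setting concrete: each case consists of several images of the same anatomy (e.g., several MRI sequences), so an explanation should indicate not only where the model looked but also which modality mattered. Below is a minimal Python sketch of one way such a metric can be computed; it is an illustration under our own simplifying assumptions (a saliency map and an annotated region per modality, plus a clinical importance weight per modality), not the exact metric defined in the papers, and all names are illustrative.

```python
import numpy as np

def modality_aware_score(heatmaps, region_masks, modality_weights):
    """Illustrative metric for multi-modal heatmap explanations.

    heatmaps         : dict, modality -> non-negative saliency array
    region_masks     : dict, modality -> binary mask of the clinically relevant region
    modality_weights : dict, modality -> importance of that modality for the
                       diagnosis, in [0, 1] (e.g., from clinician annotations)

    Returns a score in [0, 1]: higher means the saliency concentrates inside
    the relevant regions of the clinically important modalities.
    """
    weighted_sum, weight_total = 0.0, 0.0
    for modality, heat in heatmaps.items():
        heat = np.clip(heat, 0.0, None)
        total = heat.sum()
        # Fraction of this modality's saliency that falls inside the annotated region.
        inside = (heat * region_masks[modality]).sum() / total if total > 0 else 0.0
        weighted_sum += modality_weights[modality] * inside
        weight_total += modality_weights[modality]
    return weighted_sum / weight_total if weight_total > 0 else 0.0
```

A score near 1 means the heatmaps highlight the decisive modalities and regions; a score near 0 means the saliency is spent elsewhere.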

Our evaluation focuses on heatmap methods, which are commonly used for their end-user understandability. We cover both gradient-based and perturbation-based methods.
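
For readers unfamiliar with the two families, the sketch below shows a minimal version of each: a vanilla-gradient heatmap (the sensitivity of the class score to each input value) and an occlusion heatmap (the drop in the class score when a patch is blanked out). It assumes a PyTorch classifier over a 2D image whose channels are the stacked modalities; it is a simplified illustration, not the specific algorithm implementations evaluated in the papers.

```python
import torch

def gradient_heatmap(model, image, target_class):
    """Gradient-based saliency: |d(class score) / d(input)|, per channel and pixel."""
    image = image.clone().requires_grad_(True)
    score = model(image.unsqueeze(0))[0, target_class]
    score.backward()
    return image.grad.abs()  # same shape as the input

def occlusion_heatmap(model, image, target_class, patch=16, baseline=0.0):
    """Perturbation-based saliency: score drop when each patch is replaced by a baseline."""
    heat = torch.zeros_like(image)
    with torch.no_grad():
        base_score = model(image.unsqueeze(0))[0, target_class].item()
        for y in range(0, image.shape[-2], patch):
            for x in range(0, image.shape[-1], patch):
                occluded = image.clone()
                occluded[..., y:y + patch, x:x + patch] = baseline
                drop = base_score - model(occluded.unsqueeze(0))[0, target_class].item()
                heat[..., y:y + patch, x:x + patch] = drop
    return heat
```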

Based on the explanation goals in real-world critical tasks, we set two primary evaluation objectives: faithfulness and plausibility. Three faithfulness evaluations show that none of the examined algorithms faithfully represented the AI model's decision process at the feature level. The plausibility evaluation shows that users' assessment of how plausible an explanation appears is not indicative of the quality of the model's decision.
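
One standard way to probe feature-level faithfulness is a deletion-style test: remove the features the heatmap ranks as most important and track how quickly the model's predicted class score falls. The sketch below, which assumes a PyTorch classifier and a heatmap with the same shape as the input, illustrates this idea; it is not the exact protocol used in the papers.

```python
import torch

def deletion_curve(model, image, heatmap, target_class, steps=10, baseline=0.0):
    """Occlude the highest-ranked features first and record the class score.

    A faithful heatmap should produce a steep early drop (small area under the
    curve); a near-flat curve suggests the heatmap does not reflect the features
    the model actually relies on.
    """
    order = torch.argsort(heatmap.flatten(), descending=True)  # most important first
    scores = []
    with torch.no_grad():
        for k in range(steps + 1):
            n_removed = int(len(order) * k / steps)
            perturbed = image.flatten().clone()
            perturbed[order[:n_removed]] = baseline
            out = model(perturbed.view_as(image).unsqueeze(0))
            scores.append(out[0, target_class].item())
    return scores
```

Plausibility, in contrast, is judged by human raters against clinical knowledge, so it cannot be reduced to a purely computational check; our results indicate that a plausible-looking heatmap is not evidence that the model decided correctly.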

Our systematic evaluation provides a roadmap and objectives for the design and evaluation of explainable AI in critical tasks.


Link to the earlier work-in-progress paper: One Map Does Not Fit All.

  1. ICML-w
    One Map Does Not Fit All: Evaluating Saliency Map Explanation on Multi-Modal Medical Images
    Jin, Weina, Li, Xiaoxiao, and Hamarneh, Ghassan
    ICML 2021 Workshop on Interpretable Machine Learning in Healthcare 2021

    Spotlight paper (top 10%), oral presentation