MINT: A wrapper to make multi-modal and multi-image AI models interactive

Jan Freyberg, Abhijit Guha Roy, Terry Spitz, Beverly Freeman, Mike Schaekermann, Patricia Strachan, Eva Schnider, Renee Wong, Dale R Webster, Alan Karthikesalingam, Yun Liu, Krishnamurthy Dvijotham, Umesh Telang

From arXiv, 15 pages, 7 figures

During the diagnostic process, doctors incorporate multimodal information, including imaging and the medical history, and medical AI development has similarly become increasingly multimodal. In this paper we tackle a more subtle challenge: doctors take a targeted medical history to obtain only the most pertinent pieces of information; how do we enable AI to do the same? We develop a wrapper method named MINT (Make your model INTeractive) that automatically determines which pieces of information are most valuable at each step and asks for only the most useful information. We demonstrate the efficacy of MINT by wrapping a skin disease prediction model, in which a multimodal deep network uses multiple images and a set of optional answers to $25$ standard metadata questions (i.e., structured medical history) to provide a differential diagnosis. We show that MINT can identify whether metadata inputs are needed and, if so, which question to ask next. We also demonstrate that, when collecting multiple images, MINT can identify whether an additional image would be beneficial and, if so, which type of image to capture. We show that MINT reduces the number of metadata and image inputs needed by 82% and 36.2%, respectively, while maintaining predictive performance. Using real-world AI dermatology system data, we show that needing fewer inputs can retain users who might otherwise fail to complete the system submission and drop off without a diagnosis. Qualitative examples show that MINT can closely mimic the step-by-step decision-making process of a clinical workflow, and how this process differs between straightforward cases and more difficult, ambiguous ones. Finally, we demonstrate that MINT is robust to different underlying multimodal classifiers and can be easily adapted to user requirements without significant model retraining.
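The abstract does not spell out how MINT scores candidate inputs or decides when to stop, so the following is only a minimal sketch of the general idea of wrapping a classifier so that it asks for one input at a time. It assumes a greedy rule (ask the question with the lowest expected posterior entropy) and a confidence threshold as the stopping criterion; both are illustrative choices, not the paper's method. The toy naive Bayes model, the condition names, the question names, and all probabilities are invented for the example.

```python
import math

# Toy stand-in for the wrapped classifier: a naive Bayes model over three
# made-up conditions and three yes/no questions. All numbers are invented.
PRIOR = {"eczema": 0.5, "psoriasis": 0.3, "melanoma": 0.2}
# P(answer is "yes" | condition) for each hypothetical question.
P_YES = {
    "itchy":        {"eczema": 0.9, "psoriasis": 0.6, "melanoma": 0.1},
    "family_hist":  {"eczema": 0.5, "psoriasis": 0.5, "melanoma": 0.5},
    "changed_size": {"eczema": 0.1, "psoriasis": 0.1, "melanoma": 0.9},
}


def predict(answers):
    """Posterior over conditions given any subset of answered questions."""
    scores = {}
    for cond, prior in PRIOR.items():
        p = prior
        for question, ans in answers.items():
            p *= P_YES[question][cond] if ans else 1.0 - P_YES[question][cond]
        scores[cond] = p
    total = sum(scores.values())
    return {cond: s / total for cond, s in scores.items()}


def entropy(dist):
    """Shannon entropy (in nats) of a distribution given as a dict."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def expected_entropy(answers, question):
    """Average posterior entropy after asking `question`, weighted by how
    likely the model currently thinks each answer is."""
    post = predict(answers)
    p_yes = sum(post[c] * P_YES[question][c] for c in post)
    h_yes = entropy(predict({**answers, question: True}))
    h_no = entropy(predict({**answers, question: False}))
    return p_yes * h_yes + (1.0 - p_yes) * h_no


def interactive_predict(ask, confidence=0.9):
    """Ask questions one at a time until the top prediction is confident
    enough (assumed stopping rule) or no questions remain."""
    answers, asked = {}, []
    while True:
        post = predict(answers)
        remaining = [q for q in P_YES if q not in answers]
        if max(post.values()) >= confidence or not remaining:
            return post, asked
        # Greedy choice: the question expected to reduce uncertainty most.
        best = min(remaining, key=lambda q: expected_entropy(answers, q))
        answers[best] = ask(best)
        asked.append(best)


# Usage: simulate a user whose answers are looked up from a fixed record.
patient = {"itchy": False, "family_hist": True, "changed_size": True}
posterior, asked = interactive_predict(lambda q: patient[q])
print(asked, posterior)
```

In this toy run the wrapper stops after two questions and never asks the family-history question, because that question is constructed to be uninformative. A wrapper around a real multimodal network would need some other way to estimate how likely each unseen answer is, which this sketch gets for free from the naive Bayes assumption.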
