Interpretability from a new lens: Integrating Stratification and Domain knowledge for Biomedical Applications

Machine learning (ML) techniques have become increasingly important in the biomedical field, particularly given the large amounts of data generated in the aftermath of the COVID-19 pandemic. However, because biomedical datasets are complex and many ML models are black boxes, domain experts may be reluctant to trust and adopt them. Interpretable ML (IML) approaches have been developed in response, but the curse of dimensionality in biomedical datasets can lead to model instability. This paper proposes a novel computational strategy that stratifies biomedical datasets into k-fold cross-validation (CV) splits and integrates domain knowledge interpretation techniques into current state-of-the-art IML frameworks. This approach can improve model stability, establish trust, and provide explanations for the outcomes generated by trained IML models. Specifically, a model outcome such as aggregated feature weight importance can be linked to further domain knowledge interpretation through techniques like pathway functional enrichment and through drug targeting and repurposing databases. Additionally, involving end-users and clinicians in focus group discussions before and after the choice of IML framework can help guide testable hypotheses, improve performance metrics, and build trustworthy and usable IML solutions in the biomedical field. Overall, this study highlights the potential of combining advanced computational techniques with domain knowledge interpretation to enhance the effectiveness of IML solutions for complex biomedical datasets.
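The idea of stratifying a dataset into k folds and aggregating feature weights across them can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the synthetic data, and the toy weight (absolute difference between class means, standing in for whatever feature weights a real IML framework would report) are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal each class round-robin so every fold gets a similar share.
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds


def toy_feature_weights(X, y, train_idx):
    """Hypothetical stand-in for a model's feature weights:
    absolute difference between class means on the training rows."""
    weights = []
    for j in range(len(X[0])):
        pos = [X[i][j] for i in train_idx if y[i] == 1]
        neg = [X[i][j] for i in train_idx if y[i] == 0]
        weights.append(abs(mean(pos) - mean(neg)))
    return weights


def aggregate_importance(X, y, k=5, seed=0):
    """Return per-feature (mean, std) of the weights across k training splits."""
    folds = stratified_folds(y, k, seed)
    per_fold = []
    for held_out in range(k):
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        per_fold.append(toy_feature_weights(X, y, train_idx))
    return [(mean(col), pstdev(col)) for col in zip(*per_fold)]


# Synthetic data (assumption): feature 0 tracks the label, feature 1 is noise.
rng = random.Random(42)
y = [0] * 30 + [1] * 20
X = [[3.0 * label + rng.gauss(0, 0.3), rng.gauss(0, 1.0)] for label in y]

summary = aggregate_importance(X, y, k=5)
# Rank features by mean weight; a low std across folds suggests a stable ranking.
ranked = sorted(range(len(summary)), key=lambda j: summary[j][0], reverse=True)
```

In a real workflow, the top-ranked features (for example, genes) would be the input to downstream domain knowledge steps such as pathway enrichment, and the per-fold standard deviation gives a rough indication of how stable each feature's weight is across splits.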
