In clinical scenarios, consulting multiple specialists can significantly benefit diagnosis, especially for intricate cases. This inspires us to explore a "multi-expert joint diagnosis" mechanism that upgrades the "single-expert" framework common in the current literature. To this end, we propose METransformer, a method that realizes this idea with a transformer-based backbone. The key design of our method is the introduction of multiple learnable "expert" tokens into both the transformer encoder and decoder. In the encoder, each expert token interacts with both the vision tokens and the other expert tokens, learning to attend to different image regions for image representation. An orthogonal loss that minimizes the overlap among these expert tokens encourages them to capture complementary information. In the decoder, each attended expert token guides the cross-attention between input words and visual tokens, thus influencing the generated report. A metrics-based expert voting strategy is further developed to generate the final report. With this multi-expert design, our model enjoys the merits of an ensemble-based approach while being computationally more efficient and supporting more sophisticated interactions among experts. Experimental results demonstrate the promising performance of our proposed model on two widely used benchmarks. Last but not least, this framework-level innovation allows our work to readily incorporate advances in existing "single-expert" models to further improve its performance.
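The two mechanisms that the abstract names but does not spell out, the orthogonal loss and the metrics-based voting, can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the abstract does not give the exact form of the loss or the metric used for voting, so the sketch penalizes squared cosine similarity between expert embeddings and uses a simple token-overlap F1 as a hypothetical stand-in metric. The function names `orthogonal_loss`, `overlap_f1`, and `vote` are likewise hypothetical.

```python
import math
from collections import Counter


def orthogonal_loss(expert_tokens):
    """Mean squared cosine similarity over all pairs of distinct experts.

    Assumed form: the abstract only states that the loss minimizes overlap.
    """
    # L2-normalise each embedding so dot products become cosine similarities.
    unit = []
    for vec in expert_tokens:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        unit.append([x / norm for x in vec])

    total, pairs = 0.0, 0
    for i in range(len(unit)):
        for j in range(len(unit)):
            if i != j:
                sim = sum(a * b for a, b in zip(unit[i], unit[j]))
                total += sim * sim
                pairs += 1
    return total / pairs if pairs else 0.0


def overlap_f1(candidate, reference):
    """Token-level F1; a hypothetical stand-in for the paper's metric."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    common = sum((cand & ref).values())
    if common == 0:
        return 0.0
    precision = common / sum(cand.values())
    recall = common / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def vote(reports, metric=overlap_f1):
    """Pick the report that agrees most, on average, with the other experts."""
    if len(reports) == 1:
        return reports[0]

    def mean_agreement(i):
        others = [r for j, r in enumerate(reports) if j != i]
        return sum(metric(reports[i], r) for r in others) / len(others)

    best = max(range(len(reports)), key=mean_agreement)
    return reports[best]
```

In this sketch the loss is zero when the expert embeddings are mutually orthogonal and grows as they align, which is one way to encourage complementary experts. The voting step treats each expert's report as a candidate and the remaining reports as references, so the selected report is the one closest to the consensus under the chosen metric.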