Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias

Data scarcity is a critical obstacle to effective medical vision-language pre-training (VLP). One potential solution is to combine datasets from different language communities. The main challenge, however, lies in integrating diverse syntax and semantics, language-specific medical terminology, and culture-specific implicit knowledge. A crucial factor to consider is therefore the community bias introduced by different languages. This paper presents a novel framework, Unifying Cross-Lingual Medical Vision-Language Pre-Training (Med-UniC), designed to integrate multimodal medical data from the two most prevalent languages, English and Spanish. Specifically, we propose Cross-lingual Text Alignment Regularization (CTR) to explicitly unify the cross-lingual semantic representations of medical reports originating from diverse language communities. CTR is optimized through latent language disentanglement, so that our optimization objective does not depend on negative samples; this significantly mitigates the bias that arises from determining positive-negative sample pairs among analogous medical reports. Furthermore, it ensures that the cross-lingual representation is not biased toward any specific language community. Med-UniC achieves superior performance across 5 medical image tasks and 10 datasets encompassing over 30 diseases, offering a versatile framework for unifying multimodal medical data across diverse linguistic communities. The experimental results highlight the presence of community bias in cross-lingual VLP. Reducing this bias improves performance not only in vision-language tasks but also in uni-modal visual tasks.
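
To illustrate what an objective that does not depend on negative samples can look like, the following is a minimal sketch of a feature decorrelation penalty over a batch of report embeddings. It is not the CTR formulation itself, which the abstract does not specify; the function name `feature_decorrelation_loss`, the use of NumPy, and the choice of penalizing off-diagonal feature correlations are all assumptions made for illustration.

```python
import numpy as np


def feature_decorrelation_loss(z, eps=1e-8):
    """Hypothetical negative-free regularizer (an assumption, not the paper's CTR).

    z: array of shape (batch, dim), dim >= 2, holding text embeddings of
       reports from any mix of languages.
    Returns the mean squared off-diagonal entry of the feature correlation
    matrix; it is 0 when all latent dimensions are mutually uncorrelated.
    """
    z = np.asarray(z, dtype=np.float64)
    n, d = z.shape
    # Standardize each latent dimension over the batch.
    z_std = (z - z.mean(axis=0)) / (z.std(axis=0) + eps)
    # (dim, dim) correlation matrix between latent dimensions.
    corr = z_std.T @ z_std / n
    # Keep only the off-diagonal entries; no positive/negative pairs are needed.
    off_diag = corr - np.diag(np.diag(corr))
    return float((off_diag ** 2).sum() / (d * (d - 1)))


# Two uncorrelated latent dimensions give zero penalty.
uncorrelated = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
print(feature_decorrelation_loss(uncorrelated))

# Two perfectly correlated dimensions give a penalty close to 1.
x = np.arange(8.0)
redundant = np.stack([x, 2.0 * x], axis=1)
print(feature_decorrelation_loss(redundant))
```

Because the penalty is computed from batch statistics alone, no report has to be labeled as a negative for another, which is the property the abstract attributes to CTR.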
