Advancing representation learning in specialized fields like medicine remains challenging due to the scarcity of expert annotations for text and images. To tackle this issue, we present a novel two-stage framework designed to extract high-quality factual statements from free-text radiology reports in order to improve the representations of text encoders and, consequently, their performance on various downstream tasks. In the first stage, we propose a \textit{Fact Extractor} that leverages large language models (LLMs) to identify factual statements from well-curated domain-specific datasets. In the second stage, we introduce a \textit{Fact Encoder} (CXRFE) based on a BERT model fine-tuned with objective functions designed to improve its representations using the extracted factual data. Our framework also includes a new embedding-based metric (CXRFEScore) for evaluating chest X-ray text generation systems, leveraging both stages of our approach. Extensive evaluations show that our fact extractor and encoder outperform current state-of-the-art methods in tasks such as sentence ranking, natural language inference, and label extraction from radiology reports. Additionally, our metric proves to be more robust and effective than existing metrics commonly used in the radiology report generation literature. The code of this project is available at \url{https://github.com/PabloMessina/CXR-Fact-Encoder}.
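The abstract does not define how CXRFEScore is computed, so the following is only a minimal sketch of the general idea of an embedding-based metric that uses both stages: split each report into facts, embed each fact, and compare the two sets of embeddings. Every name here is hypothetical. `split_into_facts` is a naive sentence splitter standing in for the LLM-based Fact Extractor, `embed` is a bag-of-words vector standing in for the fine-tuned BERT Fact Encoder, and the greedy-matching F1 aggregation is an assumption for illustration, not the authors' formula.

```python
import math
import re
from collections import Counter


def split_into_facts(report: str) -> list[str]:
    """Hypothetical stand-in for the Fact Extractor: treat each sentence as one fact."""
    parts = re.split(r"[.;\n]+", report)
    return [p.strip().lower() for p in parts if p.strip()]


def embed(fact: str) -> Counter:
    """Hypothetical stand-in for the Fact Encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", fact.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def fact_similarity_score(generated: str, reference: str) -> float:
    """Assumed aggregation: F1 over greedy best-match cosine similarities."""
    gen = [embed(f) for f in split_into_facts(generated)]
    ref = [embed(f) for f in split_into_facts(reference)]
    if not gen or not ref:
        return 0.0
    # Soft precision: how well each generated fact is supported by the reference.
    precision = sum(max(cosine(g, r) for r in ref) for g in gen) / len(gen)
    # Soft recall: how well each reference fact is covered by the generation.
    recall = sum(max(cosine(r, g) for g in gen) for r in ref) / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In this sketch, a generated report that states only some of the reference facts keeps a high soft precision but loses soft recall, so omissions lower the score. Swapping the two placeholder functions for a real extractor and encoder would leave the aggregation step unchanged.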