Medical dialogue summarization is challenging due to the unstructured nature of medical conversations, the use of medical terminology in gold summaries, and the need to identify key information across multiple symptom sets. We present a novel system for the Dialogue2Note Medical Summarization tasks in the MEDIQA 2023 Shared Task. Our approach for section-wise summarization (Task A) is a two-stage process: we first select the dialogues that are semantically most similar to the input, then use the top-k similar dialogues as in-context examples for GPT-4. For full-note summarization (Task B), we use a similar solution with k=1. We achieved 3rd place in Task A (2nd among all teams), 4th place in Task B Division-Wise Summarization (2nd among all teams), 15th place in Task A Section Header Classification (9th among all teams), and 8th place among all teams in Task B. Our results highlight the effectiveness of few-shot prompting for this task, though we also identify several weaknesses of prompting-based approaches. Comparing GPT-4 with several fine-tuned baselines, we find that GPT-4 summaries are more abstractive and shorter. We make our code publicly available.
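The two-stage process described above (retrieve the top-k most similar training dialogues, then place them in the prompt as in-context examples) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the bag-of-words cosine similarity is a stand-in for whatever semantic encoder the system actually uses, the prompt wording is invented, the function names and toy dialogues are hypothetical, and the call to GPT-4 is omitted so that the example stays self-contained.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Stand-in 'embedding': lowercase bag-of-words counts.
    Assumption for illustration only, not the authors' encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_top_k(query_dialogue, train_pairs, k):
    """Stage 1: rank (dialogue, summary) pairs by similarity to the query."""
    q = bow_vector(query_dialogue)
    ranked = sorted(
        train_pairs,
        key=lambda pair: cosine(q, bow_vector(pair[0])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query_dialogue, examples):
    """Stage 2: format the selected pairs as in-context examples,
    followed by the query dialogue with an empty summary slot."""
    parts = ["Summarize the following doctor-patient dialogue."]  # hypothetical wording
    for dialogue, summary in examples:
        parts.append(f"Dialogue:\n{dialogue}\nSummary:\n{summary}")
    parts.append(f"Dialogue:\n{query_dialogue}\nSummary:")
    return "\n\n".join(parts)


# Hypothetical toy training pairs and query, for illustration only.
TRAIN_PAIRS = [
    ("Doctor: Any chest pain? Patient: Yes, chest pain when I climb stairs.",
     "Exertional chest pain."),
    ("Doctor: How is your sleep? Patient: I wake up often at night.",
     "Frequent night awakenings."),
    ("Doctor: Any allergies? Patient: I am allergic to penicillin.",
     "Penicillin allergy."),
]
QUERY = "Doctor: Do you have chest pain? Patient: Some chest pain after walking."

if __name__ == "__main__":
    examples = select_top_k(QUERY, TRAIN_PAIRS, k=1)
    print(build_prompt(QUERY, examples))
```

With k=1, as in the Task B setting, only the single most similar pair is placed in the prompt; a larger k would prepend more examples in descending order of similarity. The resulting string would then be sent to the language model, a step this sketch leaves out.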