Med-EASi: Finely Annotated Dataset and Models for Controllable Simplification of Medical Texts

Automatic medical text simplification can assist providers with patient-friendly communication and make medical texts more accessible, thereby improving health literacy. However, curating a high-quality corpus for this task requires the supervision of medical experts. In this work, we present $\textbf{Med-EASi}$ ($\underline{\textbf{Med}}$ical dataset for $\underline{\textbf{E}}$laborative and $\underline{\textbf{A}}$bstractive $\underline{\textbf{Si}}$mplification), a uniquely crowdsourced and finely annotated dataset for supervised simplification of short medical texts. Its $\textit{expert-layman-AI collaborative}$ annotations facilitate $\textit{controllability}$ over text simplification by marking four kinds of textual transformations: elaboration, replacement, deletion, and insertion. To learn medical text simplification, we fine-tune T5-large with four different styles of input-output combinations, leading to two control-free and two controllable versions of the model. Using a multi-angle training approach, we add two types of $\textit{controllability}$ to text simplification: $\textit{position-aware}$, which uses in-place annotated inputs and outputs, and $\textit{position-agnostic}$, where the model knows only the contents to be edited, but not their positions. Our results show that our fine-grained annotations improve learning compared to the unannotated baseline. Furthermore, $\textit{position-aware}$ control generates better simplifications than $\textit{position-agnostic}$ control. The data and code are available at https://github.com/Chandrayee/CTRL-SIMP.
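To make the distinction between the two controllable input styles concrete, the following is a minimal sketch under stated assumptions, not the released implementation. The tag names (`elab`, `rep`, `del`, `ins`), the `simplify:` and `edit:` prefixes, and the helpers `Edit`, `span`, `position_aware_input`, and `position_agnostic_input` are all hypothetical; the actual serialization used in the repository may differ.

```python
from dataclasses import dataclass

# Hypothetical short tags for the four transformation types named in the abstract.
TAGS = {
    "elaboration": "elab",
    "replacement": "rep",
    "deletion": "del",
    "insertion": "ins",
}


@dataclass
class Edit:
    kind: str   # one of the keys in TAGS
    start: int  # character offset into the expert text (inclusive)
    end: int    # character offset into the expert text (exclusive)


def span(text: str, phrase: str, kind: str) -> Edit:
    """Build an Edit covering the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Edit(kind=kind, start=start, end=start + len(phrase))


def position_aware_input(text: str, edits: list[Edit]) -> str:
    """Mark each annotated span in place, so the model sees where to edit."""
    pieces, cursor = [], 0
    for e in sorted(edits, key=lambda e: e.start):
        tag = TAGS[e.kind]
        pieces.append(text[cursor:e.start])
        pieces.append(f"<{tag}>{text[e.start:e.end]}</{tag}>")
        cursor = e.end
    pieces.append(text[cursor:])
    return "".join(pieces)


def position_agnostic_input(text: str, edits: list[Edit]) -> str:
    """Leave the text untouched and list only the contents to be edited."""
    controls = "; ".join(
        f"{TAGS[e.kind]}: {text[e.start:e.end]}"
        for e in sorted(edits, key=lambda e: e.start)
    )
    return f"simplify: {text} | edit: {controls}"


expert = "Patients with hypertension should avoid NSAIDs."
edits = [
    span(expert, "hypertension", "replacement"),
    span(expert, "NSAIDs", "elaboration"),
]
aware = position_aware_input(expert, edits)
agnostic = position_agnostic_input(expert, edits)
```

In this sketch, `aware` carries the span boundaries inside the sentence itself, while `agnostic` keeps the sentence unmarked and passes the spans as a separate control string. Either string could then be tokenized and paired with a target simplification for sequence-to-sequence fine-tuning.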
