Text simplification aims to make technical texts more accessible to laypeople but often results in deletion of information and vagueness. This work proposes InfoLossQA, a framework to characterize and recover simplification-induced information loss in the form of question-and-answer (QA) pairs. Building on the theory of Question Under Discussion, the QA pairs are designed to help readers deepen their knowledge of a text. We conduct a range of experiments with this framework. First, we collect a dataset of 1,000 linguist-curated QA pairs derived from 104 LLM simplifications of scientific abstracts of medical studies. Our analyses of this data reveal that information loss occurs frequently, and that the QA pairs give a high-level overview of what information was lost. Second, we devise two methods for this task: end-to-end prompting of open-source and commercial language models, and a natural language inference pipeline. Using a novel evaluation framework that considers the correctness of QA pairs and their linguistic suitability, our expert evaluation reveals that models struggle to reliably identify information loss and to apply standards similar to those of humans for what constitutes information loss.