Text simplification aims to make technical texts more accessible to laypeople, but it often results in the deletion of information and in vagueness. This work proposes InfoLossQA, a framework to characterize and recover simplification-induced information loss in the form of question-and-answer (QA) pairs. Building on the theory of Question Under Discussion, the QA pairs are designed to help readers deepen their knowledge of a text. We conduct a range of experiments with this framework. First, we collect a dataset of 1,000 linguist-curated QA pairs derived from 104 LLM simplifications of scientific abstracts of medical studies. Our analyses of this data reveal that information loss occurs frequently, and that the QA pairs give a high-level overview of what information was lost. Second, we devise two methods for this task: end-to-end prompting of open-source and commercial language models, and a natural language inference pipeline. Using a novel evaluation framework that considers both the correctness of QA pairs and their linguistic suitability, our expert evaluation reveals that models struggle to reliably identify information loss and to apply standards similar to those of humans for what constitutes information loss.