When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation

Large language models (LLMs) are widely used, yet they remain prone to factual errors that erode user trust and limit adoption in high-risk settings. One approach to mitigating this risk is to equip models with uncertainty estimation mechanisms that abstain when confidence is low. However, this binary "all-or-nothing" approach is excessively restrictive in long-form settings, often discarding valuable information. We introduce Selective Abstraction (SA), a framework that enables LLMs to trade specificity for reliability by selectively reducing the detail of uncertain content. We first formalize SA through the lenses of selective risk and coverage. We then propose Atom-wise Selective Abstraction, a claim-level instantiation that decomposes responses into atomic claims (short, self-contained statements each expressing a single fact) and replaces uncertain atoms with higher-confidence, less specific abstractions. To evaluate this framework, we develop a novel end-to-end pipeline for open-ended generation that instantiates risk as factual correctness and quantifies coverage with an information-theoretic measure of retained information. Across six open-source models on the FactScore and LongFact-Objects benchmarks, atom-wise SA consistently outperforms existing baselines, improving the area under the risk-coverage curve (AURC) by up to 27.73% over claim removal. This demonstrates that reducing specificity can boost accuracy and reliability while preserving most of the original meaning of the responses.
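
As a rough illustration of the atom-wise idea described above, the following Python sketch keeps atomic claims whose confidence clears a threshold and swaps the rest for less specific abstractions, then reports a risk and a coverage value at that threshold. It is a minimal sketch under stated assumptions, not the paper's pipeline: the `Atom` fields, the threshold rule, the `retained_info` fractions, and the example values are all hypothetical, and the coverage here is a simple average standing in for the information-theoretic measure the abstract mentions.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    """One atomic claim plus a less specific fallback (all fields hypothetical)."""
    claim: str                 # original, specific claim
    abstraction: str           # vaguer rewrite of the same claim
    confidence: float          # assumed confidence score in [0, 1] for `claim`
    claim_correct: bool        # assumed factuality label for `claim`
    abstraction_correct: bool  # assumed factuality label for `abstraction`
    retained_info: float       # assumed fraction of information kept by `abstraction`


def abstract_atoms(atoms, threshold):
    """Keep confident claims; replace the others with their abstractions."""
    return [a.claim if a.confidence >= threshold else a.abstraction for a in atoms]


def risk_and_coverage(atoms, threshold):
    """Return (risk, coverage) at one threshold.

    risk: share of emitted statements that are factually wrong.
    coverage: mean fraction of the original information that is retained
    (a stand-in for an information-theoretic measure).
    """
    if not atoms:
        raise ValueError("need at least one atom")
    errors = 0
    info = 0.0
    for a in atoms:
        kept = a.confidence >= threshold
        correct = a.claim_correct if kept else a.abstraction_correct
        errors += 0 if correct else 1
        info += 1.0 if kept else a.retained_info
    return errors / len(atoms), info / len(atoms)


# Illustrative atoms; confidence and retained_info values are made up.
atoms = [
    Atom("Marie Curie was born in Warsaw.",
         "Marie Curie was born in Poland.", 0.9, True, True, 0.6),
    Atom("She won her first Nobel Prize in 1905.",
         "She won a Nobel Prize in the early 1900s.", 0.3, False, True, 0.5),
]

print(abstract_atoms(atoms, 0.5))
print(risk_and_coverage(atoms, 0.5))  # (0.0, 0.75)
print(risk_and_coverage(atoms, 0.0))  # (0.5, 1.0)
```

In this toy setting, raising the threshold from 0.0 to 0.5 removes the one factual error at the cost of a quarter of the retained information, whereas dropping the uncertain claim outright would give up all of that claim's information. Sweeping the threshold and plotting risk against coverage would trace the kind of risk-coverage curve that AURC summarizes.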
