Meeting summarization with large language models (LLMs) remains error-prone, often producing outputs with hallucinations, omissions, and irrelevancies. We present FRAME, a modular pipeline that reframes summarization as a semantic enrichment task. FRAME extracts and scores salient facts, organizes them thematically, and uses them to enrich an outline into an abstractive summary. To personalize summaries, we introduce SCOPE, a reason-out-loud protocol that has the model build a reasoning trace by answering nine questions before content selection. For evaluation, we propose P-MESA, a multi-dimensional, reference-free evaluation framework that assesses whether a summary fits a target reader. P-MESA reliably identifies error instances, achieving >= 89% balanced accuracy against human annotations and aligning strongly with human severity ratings (r >= 0.70). On QMSum and FAME, FRAME reduces hallucination and omission by 2 points on a 5-point scale (measured with MESA), while SCOPE improves knowledge fit and goal alignment over prompt-only baselines. Our findings advocate rethinking summarization to improve control, faithfulness, and personalization.
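To make the pipeline's shape concrete, below is a minimal Python sketch of the three FRAME stages as the abstract describes them: fact extraction and scoring, thematic organization, and outline enrichment. All prompts, function names, and the `LLM` callable are illustrative assumptions for exposition, not the authors' implementation.

```python
"""Illustrative sketch of the FRAME stages described above.

Every name and prompt here is an assumption; `LLM` stands in for any
text-in/text-out chat-model client the reader supplies.
"""
from typing import Callable, List

LLM = Callable[[str], str]  # any function mapping a prompt string to a completion

def extract_and_score_facts(llm: LLM, transcript: str) -> List[str]:
    # Stage 1: extract salient facts and attach a salience score to each.
    out = llm(
        "Extract the salient facts from this meeting transcript, one per "
        "line, formatted as 'fact | salience 1-5':\n" + transcript
    )
    return [line for line in out.splitlines() if line.strip()]

def organize_by_theme(llm: LLM, facts: List[str]) -> str:
    # Stage 2: group the scored facts under thematic headings to form an outline.
    return llm("Group these facts under thematic headings:\n" + "\n".join(facts))

def enrich_outline(llm: LLM, outline: str) -> str:
    # Stage 3: expand the thematic outline into an abstractive summary,
    # grounding generation in the extracted facts only.
    return llm(
        "Write an abstractive meeting summary from this outline, using only "
        "the facts it contains:\n" + outline
    )

def frame_summarize(llm: LLM, transcript: str) -> str:
    facts = extract_and_score_facts(llm, transcript)
    outline = organize_by_theme(llm, facts)
    return enrich_outline(llm, outline)
```

One design point this sketch preserves: because each stage is a separate call, the intermediate fact list and outline can be inspected (or filtered by a protocol like SCOPE) before the abstractive summary is generated.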