Large Language Models (LLMs) often struggle with creative generation, and multi-agent frameworks that improve reasoning through interaction can paradoxically hinder creativity by inducing content homogenization. We introduce LLM Review, a peer-review-inspired framework implementing Blind Peer Review: agents exchange targeted feedback while revising independently, preserving divergent creative trajectories. To enable rigorous evaluation, we propose SciFi-100, a science fiction writing dataset with a unified framework combining LLM-as-a-judge scoring, human annotation, and rule-based novelty metrics. Experiments demonstrate that LLM Review consistently outperforms multi-agent baselines, and smaller models with our framework can surpass larger single-agent models, suggesting interaction structure may substitute for model scale.