Evaluating and optimising authorial style in long-form story generation remains challenging because style is often assessed with ad hoc prompting and is frequently conflated with overall writing quality. We propose a two-stage pipeline. First, we train a dedicated style-similarity judge by fine-tuning a sentence-transformer with authorship-verification (AV) supervision, and calibrate its similarity outputs into a bounded $[0,1]$ reward. Second, we use this judge as the primary reward in Group Relative Policy Optimization (GRPO) to fine-tune an 8B story generator for style-conditioned writing, avoiding the accept/reject supervision required by Direct Preference Optimization (DPO). Across four target authors (Mark Twain, Jane Austen, Charles Dickens, Thomas Hardy), the GRPO-trained 8B model achieves higher style scores than open-weight baselines, averaging 0.893. These results suggest that AV-calibrated reward modelling provides a practical mechanism for controllable style transfer in long-form generation under a moderate model size and training budget.
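A minimal sketch of the bounded style-similarity reward described above, assuming a sentence-transformer judge and a simple linear rescaling of cosine similarity as the calibration step; the checkpoint name, the rescaling, and the reference-excerpt input are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: AV-judge style reward mapped into [0, 1].
# Assumptions (not from the paper): the judge checkpoint name and the
# linear rescaling used here as a stand-in for the calibration step.
from sentence_transformers import SentenceTransformer, util

judge = SentenceTransformer("style-av-judge")  # hypothetical fine-tuned AV judge


def style_reward(generated: str, author_reference: str) -> float:
    """Return a bounded [0, 1] style-similarity reward for a generated story."""
    emb = judge.encode([generated, author_reference], convert_to_tensor=True)
    cos = util.cos_sim(emb[0], emb[1]).item()      # cosine similarity in [-1, 1]
    return max(0.0, min(1.0, (cos + 1.0) / 2.0))   # naive rescaling to [0, 1]
```

In GRPO, a scalar reward of this form would be computed for each sampled completion in a group and used to form relative advantages, which is what removes the need for DPO-style accept/reject pairs.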