The use of copyrighted books to train AI models has led to numerous lawsuits from authors concerned about AI's ability to generate derivative content. Yet it remains unclear whether these models can generate high-quality literary text while emulating authors' styles. To answer this question, we conducted a preregistered study comparing MFA-trained expert writers with three frontier AI models (ChatGPT, Claude, and Gemini) on the task of writing excerpts of up to 450 words emulating the diverse styles of 50 award-winning authors. In blind pairwise evaluations by 159 representative expert and lay readers, AI-generated text produced via in-context prompting was strongly disfavored by experts for both stylistic fidelity (OR = 0.16, p < 10^-8) and writing quality (OR = 0.13, p < 10^-7), but showed mixed results with lay readers. However, fine-tuning ChatGPT on individual authors' complete works completely reversed these findings: experts now favored AI-generated text for both stylistic fidelity (OR = 8.16, p < 10^-13) and writing quality (OR = 1.87, p = 0.010), with lay readers showing similar shifts. These effects generalize across authors and styles. The fine-tuned outputs were rarely flagged as AI-generated by the best AI detectors (a 3% flag rate vs. 97% for in-context prompting). Mediation analysis shows that this reversal occurs because fine-tuning eliminates detectable AI stylistic quirks (e.g., cliché density) that penalize in-context outputs. Although we do not account for the additional human effort required to transform raw AI output into cohesive, publishable prose, the median fine-tuning and inference cost of $81 per author represents a 99.7% reduction relative to typical professional writer compensation. Author-specific fine-tuning thus enables non-verbatim AI writing that readers prefer to expert human writing, providing empirical evidence directly relevant to copyright's fourth fair-use factor, the "effect upon the potential market or value" of the source works.