This paper presents a systematic investigation into the constrained generation capabilities of large language models (LLMs) in producing Songci, a classical Chinese poetry form characterized by strict structural, tonal, and rhyme constraints defined by Cipai templates. We first develop a comprehensive, multi-faceted evaluation framework that includes: (i) a formal conformity score, (ii) automated quality assessment using LLMs, (iii) human evaluation, and (iv) classification-based probing tasks. Using this framework, we evaluate the generative performance of 18 LLMs, including 3 proprietary models and 15 open-source models across 4 families, under five prompting strategies: zero-shot, one-shot, completion-based, instruction-based, and chain-of-thought. Finally, we propose a Generate-Critic architecture in which the evaluation framework functions as an automated critic. Leveraging the critic's feedback as a scoring function for best-of-N selection, we fine-tune 3 lightweight open-source LLMs via supervised fine-tuning (SFT), resulting in improvements of up to 5.88% in formal conformity. Our findings offer new insights into the generative strengths and limitations of LLMs in producing culturally significant and formally constrained literary texts.
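The critic-scored best-of-N selection described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `toy_generate`, `toy_critic`, and the `TEMPLATE` line lengths are hypothetical stand-ins for an actual LLM generator and the formal-conformity critic.

```python
import random

def best_of_n(generate, critic_score, prompt, n=8):
    """Sample n candidates and keep the one the critic scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=critic_score)

# Hypothetical stand-ins: a "generator" emitting random line-length patterns
# and a "critic" rewarding agreement with an assumed Cipai template.
TEMPLATE = [4, 5, 7, 7]  # assumed per-line character counts, for illustration

def toy_generate(prompt):
    return [random.randint(3, 8) for _ in TEMPLATE]

def toy_critic(candidate):
    # Fraction of lines matching the template (a crude formal-conformity score).
    return sum(a == b for a, b in zip(candidate, TEMPLATE)) / len(TEMPLATE)

random.seed(0)
best = toy_generate("")  # baseline single sample
selected = best_of_n(toy_generate, toy_critic, prompt="", n=64)
```

In the paper's pipeline, the selected high-scoring outputs would then serve as SFT targets for the lightweight models.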