Despite the remarkable generative capabilities of language models in producing naturalistic language, their effectiveness on explicit manipulation and generation of linguistic structures remains understudied. In this paper, we investigate the task of generating new sentences that preserve a given semantic structure, following the FrameNet formalism. We propose a framework that produces novel frame-semantically annotated sentences using an overgenerate-and-filter approach. Our results show that conditioning on rich, explicit semantic information tends to produce generations with high human acceptance, under both prompting and finetuning. Our generated frame-semantic structured annotations are effective as training data augmentation for frame-semantic role labeling in low-resource settings; however, we do not see benefits in higher-resource settings. Our study concludes that while generating high-quality, semantically rich data might be within reach, the downstream utility of such generations remains to be seen, highlighting the outstanding challenges of automating linguistic annotation tasks.
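The overgenerate-and-filter idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the generator and the scoring function below are toy stand-ins, and the names `overgenerate_and_filter`, `toy_generate`, `toy_score`, and the `threshold` parameter are assumptions introduced only for illustration. In practice the generator would be a prompted or finetuned language model and the filter would check whether the intended frame-semantic structure is preserved.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    text: str     # generated sentence
    frame: str    # frame label the generator was conditioned on
    score: float  # filter score (hypothetical scale: 0.0 to 1.0)


def overgenerate_and_filter(
    generate: Callable[[str, int], Iterable[str]],
    score: Callable[[str, str], float],
    frame: str,
    n_candidates: int,
    threshold: float,
) -> List[Candidate]:
    """Sample many candidates for `frame` and keep those scoring >= `threshold`."""
    kept: List[Candidate] = []
    for text in generate(frame, n_candidates):
        s = score(text, frame)
        if s >= threshold:
            kept.append(Candidate(text, frame, s))
    return kept


# --- Toy stand-ins (assumptions, not part of the paper) ---------------------

def toy_generate(frame: str, n: int) -> List[str]:
    """Pretend generator: returns up to `n` fixed sentences, ignoring `frame`."""
    pool = [
        "She bought a car from the dealer.",
        "The dog slept.",
        "He purchased two tickets online.",
        "They sold the house.",
    ]
    return pool[:n]


def toy_score(text: str, frame: str) -> float:
    """Pretend filter: 1.0 if the sentence contains a trigger word for `frame`."""
    triggers = {"Commerce_buy": ("bought", "purchased")}
    return 1.0 if any(t in text for t in triggers.get(frame, ())) else 0.0


kept = overgenerate_and_filter(toy_generate, toy_score, "Commerce_buy", 4, 0.5)
for candidate in kept:
    print(candidate.text)
```

Separating the generator from the filter keeps the generation step cheap to oversample, while the acceptance criterion can be tightened or swapped independently.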