The thematic fit estimation task measures how compatible a semantic argument is with a specific semantic role of a specific predicate. We investigate whether LLMs have consistent, expressible knowledge of event arguments' thematic fit by experimenting with various prompt designs, manipulating input context, reasoning, and output forms. We set a new state of the art on thematic fit benchmarks, but show that closed and open-weight LLMs respond differently to our prompting strategies: closed models achieve better scores overall and benefit from multi-step reasoning, but they perform worse at filtering out generated sentences incompatible with the specified predicate, role, and argument.